A Focused Web Crawler with Self-Growing Topic Knowledge

李东晖; 廖晓兰; 范辅桥; 黄九鸣; 陈雪刚

Abstract

Focused web crawling is a necessary processing step for many Internet text-mining and information-retrieval applications. Existing focused crawlers face two challenges: topic knowledge is hard to describe, and errors are easily amplified during crawling. We identify several properties of the topic knowledge contained in web pages and propose KAG-Crawler, a focused web crawler whose topic knowledge grows automatically. During crawling, it continually extends its topic knowledge using an unsupervised learning technique, so that, starting from only a simple initial topic description, it can crawl a large number of web pages with high accuracy. To support this extension of topic knowledge, we also propose a new topic representation model and, based on it, new methods for computing webpage-topic and URL-topic relevance. Experiments in a real environment show that KAG-Crawler performs significantly better than a traditional focused crawler based on text similarity.
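The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of a crawler that extends its topic description while crawling. It is not the KAG-Crawler method. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB` graph, the weighted keyword dictionary used as the topic, the token-average `relevance` score, and the `grow_topic` rule that adds frequent unseen terms from accepted pages at a discounted weight.

```python
import heapq
from collections import Counter

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("rice pest control pest", ["b", "c"]),
    "b": ("pest insecticide rice insecticide", ["d"]),
    "c": ("football match score", ["e"]),
    "d": ("insecticide rice dosage", []),
    "e": ("football league table", []),
}


def relevance(text, topic):
    """Average topic weight per token (assumed scoring rule, not the paper's)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(topic.get(t, 0.0) for t in tokens) / len(tokens)


def grow_topic(topic, text, page_score, top_k=2, rate=0.5):
    """Add the most frequent unseen terms of an accepted page to the topic.

    New terms get a weight discounted by the page's own score, so terms
    learned from weakly relevant pages have less influence later.
    """
    unseen = Counter(t for t in text.split() if t not in topic)
    for term, _ in unseen.most_common(top_k):
        topic[term] = rate * page_score


def crawl(seed, initial_topic, web, threshold=0.1, max_pages=10):
    """Best-first crawl that extends the topic after every accepted page."""
    topic = dict(initial_topic)   # copy so the caller's dict is untouched
    frontier = [(-1.0, seed)]     # max-heap via negated priority
    seen = {seed}
    accepted = []
    while frontier and len(accepted) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, topic)
        if score < threshold:
            continue              # off-topic page: do not follow its links
        accepted.append(url)
        grow_topic(topic, text, score)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return accepted, topic


if __name__ == "__main__":
    pages, learned = crawl("a", {"pest": 1.0}, TOY_WEB)
    print(pages)
    print(sorted(learned))
```

In this toy graph, page `d` contains none of the initial topic terms, so it can only be accepted through terms learned from earlier pages. The same mechanism also picks up the loosely related term `control`, which hints at the error-amplification risk the abstract mentions.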

Bibliographic record

Source
《计算机应用与软件》 (Computer Applications and Software) | 2014, Issue 5 | pp. 29-33, 88 | 6 pages
Authors
李东晖; 廖晓兰; 范辅桥; 黄九鸣; 陈雪刚
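Author affiliations

College of Information Science and Technology, Hunan Agricultural University, Changsha, Hunan 410128;

College of Plant Protection, Hunan Agricultural University, Changsha, Hunan 410128;

Xiamen Tongrong Software Technology Co., Ltd., Xiamen, Fujian 361008;

Postdoctoral Workstation, PLA Unit 73111, Xiamen, Fujian 361025;

Department of Computer Science, Xiangnan University, Chenzhou, Hunan 423000
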
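Original format: PDF
Language of full text: Chinese
Chinese Library Classification: Operating systems
Keywords
focused web crawler; unsupervised learning; knowledge expansion; topic relevance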

