Source: 《计算机应用与软件》 (Computer Applications and Software)

A Focused Web Crawler with Automatically Growing Topic Knowledge

Abstract

Focused web crawling is a necessary processing step for many Internet text-mining and information-retrieval applications. Existing focused crawlers face two challenges: topic knowledge is difficult to describe, and errors are easily amplified as the crawl proceeds. We identify several properties of the topic knowledge contained in web pages and propose KAG-Crawler, a focused web crawler with automatically growing topic knowledge. During crawling, it continually extends its topic knowledge using an unsupervised learning technique, so that, starting from only a simple initial topic description, it can crawl a large number of web pages with high accuracy. To facilitate this extension of topic knowledge, we also propose a new topic representation model and, based on that model, construct new methods for computing webpage-topic relevance and URL-topic relevance. Experiments in a real environment show that KAG-Crawler performs significantly better than a traditional focused crawler based on text similarity.
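The abstract does not specify the topic representation model or the learning rule, so the following is only a minimal sketch of the general idea of a crawler whose topic knowledge grows while it crawls. Everything in it is an assumption made for illustration: the term-set topic model, the `relevance` scoring rule, the `threshold` and `grow_per_page` parameters, the anchor-text stand-in for URL-topic relevance, and the in-memory `web` dictionary that replaces real HTTP fetching. It is not the KAG-Crawler algorithm.

```python
import heapq
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for real text preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(tokens, topic):
    """Fraction of tokens that are topic terms (hypothetical scoring rule)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic) / len(tokens)


def crawl(web, seeds, topic, threshold=0.2, grow_per_page=2, max_pages=10):
    """Toy focused crawl over `web`: url -> (text, [(anchor_text, link), ...]).

    `topic` is a set of terms and is extended in place from relevant pages.
    Returns the URLs judged relevant, in the order they were visited.
    """
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated score
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        fetched += 1
        text, links = web[url]
        tokens = tokenize(text)
        if relevance(tokens, topic) < threshold:
            continue  # off-topic page: neither learn from it nor follow it
        accepted.append(url)
        # Assumed unsupervised growth step: add the most frequent longer
        # non-topic terms of a relevant page to the topic knowledge.
        counts = Counter(t for t in tokens if t not in topic and len(t) > 3)
        for term, _ in counts.most_common(grow_per_page):
            topic.add(term)
        # Assumed URL-topic relevance: score each link by its anchor text.
        for anchor, link in links:
            if link not in seen:
                seen.add(link)
                score = relevance(tokenize(anchor), topic)
                heapq.heappush(frontier, (-score, link))
    return accepted


# Hypothetical three-page "web" used only to exercise the sketch.
web = {
    "a": ("crawler crawler frontier frontier scheduling",
          [("frontier scheduling", "b"), ("cooking recipes", "c")]),
    "b": ("frontier scheduling policies for a crawler", []),
    "c": ("pasta recipes and sauces", []),
}
topic = {"crawler"}
accepted = crawl(web, ["a"], topic)
print(accepted, sorted(topic))
```

In this toy run, page `b` scores only 1/6 against the initial one-term topic, which is below the threshold. It is accepted because the terms learned from page `a` raise its score to 3/6. That is the kind of effect a simple initial topic description combined with growing topic knowledge is meant to achieve. Learning only from pages that pass the threshold is one simple way to limit error amplification, though a wrongly accepted page would still pollute the topic set.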
