Journal of Beijing University of Technology (《北京工业大学学报》)

A Web Crawler Collection Strategy Model for Mongolian-Language Topics

Abstract

Predicting which URLs to collect and discovering tunnels are the two core problems facing a focused crawler for Mongolian-language topics. To address them, a collection model is proposed based on topic-group site clustering, ordering, and tunnel discovery. First, through topic identification of sites, the URLs waiting to be crawled are divided into site links and non-site links. Second, a priority ordering algorithm for predicted URLs is built using text similarity and hyperlink graph analysis, and a site-adaptive tunnel discovery algorithm is designed at site granularity. Finally, a focused crawler system for Mongolian-language topics is constructed. The experimental results show that collection precision, the total amount of information collected, and the collection rate all improve, and that the approach clearly outperforms the baseline algorithm.
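The abstract does not give the scoring formula or the tunnel rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a weighted sum of a bag-of-words cosine similarity and a simple in-link count as the URL priority, and a fixed cap on consecutive off-topic pages as the tunnel limit (the paper describes a site-adaptive limit instead). The names `url_priority`, `Frontier`, `alpha`, and `max_tunnel_depth` are hypothetical, and whitespace tokenization is a simplification.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts as whitespace-tokenized bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def url_priority(anchor_text: str, topic_text: str,
                 inlinks_from_relevant: int, alpha: float = 0.7) -> float:
    """Hypothetical priority: weighted sum of a text score and a link score.

    The link score stands in for hyperlink graph analysis; it grows toward 1
    as more on-topic pages link to the URL. alpha is an assumed weight.
    """
    text_score = cosine_similarity(anchor_text, topic_text)
    link_score = 1.0 - 1.0 / (1.0 + inlinks_from_relevant)
    return alpha * text_score + (1.0 - alpha) * link_score


class Frontier:
    """Priority queue of URLs that tolerates short off-topic 'tunnels'."""

    def __init__(self, max_tunnel_depth: int = 2):
        # Assumed fixed cap on consecutive off-topic ancestors of a URL.
        self.max_tunnel_depth = max_tunnel_depth
        self._heap = []
        self._seen = set()

    def push(self, url: str, priority: float, parent_relevant: bool,
             parent_tunnel_depth: int = 0) -> bool:
        """Queue a URL; return False if it is a duplicate or the tunnel is too long."""
        # An on-topic parent resets the tunnel; an off-topic parent extends it.
        depth = 0 if parent_relevant else parent_tunnel_depth + 1
        if url in self._seen or depth > self.max_tunnel_depth:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to pop the best first.
        heapq.heappush(self._heap, (-priority, url, depth))
        return True

    def pop(self):
        """Return (url, priority, tunnel_depth) for the highest-priority URL."""
        neg_priority, url, depth = heapq.heappop(self._heap)
        return url, -neg_priority, depth
```

In this sketch a crawler loop would pop a URL, fetch and classify the page, then push each outgoing link with the page's relevance and the popped tunnel depth, so that links behind a short run of off-topic pages are still explored while longer runs are cut off.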
