Journal: Chinese Journal of Electronics

Mining and Harvesting High Quality Topical Resources from the Web

Abstract

Focused crawlers aim to prioritize uncrawled URLs effectively so as to harvest relevant pages while avoiding irrelevant ones. In practice, harvesting high-quality topical Web resources is more important, owing to the explosion of Web information. Our study shows that the popular focused crawling strategy cannot achieve this goal. In this paper we develop a new focused crawler, On-line Topical Quality Estimation (OTQE), which intelligently evaluates the topical quality of uncrawled pages from the observed link and content evidence and prioritizes their URLs accordingly. The new crawler is scalable and requires fewer additional resources for link-based analysis. Experimental results from crawling 3.6 million Web pages demonstrate the advantages of the proposed method over traditional focused crawlers.
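
The general idea of prioritizing uncrawled URLs can be illustrated with a small priority-queue frontier. The sketch below is not the OTQE algorithm, whose details the abstract does not give. The class `Frontier`, the function `estimate_score`, and the weights `w_link` and `w_content` are hypothetical names introduced here for illustration, and the scoring rule (a weighted blend of the parent page's relevance and anchor-text term overlap) is an assumed stand-in for whatever link and content evidence a real focused crawler would use.

```python
import heapq
from itertools import count


class Frontier:
    """Priority queue of uncrawled URLs; the highest estimated score pops first."""

    def __init__(self):
        self._heap = []        # entries are (-score, tie_breaker, url)
        self._best = {}        # url -> best score seen so far
        self._done = set()     # URLs already handed out for crawling
        self._tie = count()    # keeps ordering stable for equal scores

    def push(self, url, score):
        # Ignore URLs already crawled or re-pushed with a score that is not better.
        if url in self._done or score <= self._best.get(url, float("-inf")):
            return
        self._best[url] = score
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        # Skip stale heap entries left behind by earlier, lower-scored pushes.
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if url in self._done or -neg_score != self._best.get(url):
                continue
            self._done.add(url)
            return url, -neg_score
        return None


def estimate_score(parent_relevance, anchor_text, topic_terms,
                   w_link=0.5, w_content=0.5):
    """Hypothetical scoring rule (an assumption, not OTQE): blend link evidence
    (relevance of the linking page) with content evidence (anchor-text overlap)."""
    words = set(anchor_text.lower().split())
    overlap = len(words & topic_terms) / len(topic_terms) if topic_terms else 0.0
    return w_link * parent_relevance + w_content * overlap
```

In use, a crawler would call `push` for every link extracted from a fetched page and `pop` to choose the next page to fetch, so that links with stronger observed evidence are visited earlier.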
