An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation



Abstract

A focused crawler is topic-specific and aims to selectively collect web pages from the Internet that are relevant to a given topic. However, the performance of current focused crawlers is easily degraded by the surrounding environment of web pages and by multi-topic web pages. During crawling, a highly relevant region may be ignored because the page that contains it has low overall relevance, and anchor text or link context may mislead the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms the one using TFIDF, and that our focused crawler is superior, in terms of harvest rate and target recall, to other focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies. In conclusion, our methods are significant and effective for focused crawling.
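
The abstract names TFIDF as the baseline weighting and harvest rate and target recall as the evaluation metrics, but it does not give their formulas. The sketch below is therefore only an illustration under stated assumptions: it uses plain TF-IDF (not the paper's ITFIDF, whose definition is not in this abstract), cosine similarity as a stand-in relevance score, and the definitions of harvest rate (relevant pages crawled divided by all pages crawled) and target recall (target pages crawled divided by all target pages) that are commonly used in the focused-crawling literature. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF).

    Assumption: this is the textbook baseline, not the paper's ITFIDF.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages judged relevant (assumed definition)."""
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if is_relevant(page)) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known target set that the crawl reached (assumed definition)."""
    if not targets:
        return 0.0
    return len(set(crawled) & set(targets)) / len(targets)


if __name__ == "__main__":
    docs = [
        ["crawler", "web", "topic"],   # stands in for a topic description
        ["web", "page"],               # shares one term with the topic
        ["cooking", "recipe"],         # shares no terms with the topic
    ]
    topic, page_a, page_b = tfidf_vectors(docs)
    print("similarity to page_a:", cosine(topic, page_a))
    print("similarity to page_b:", cosine(topic, page_b))
    print("harvest rate:", harvest_rate(["a", "b", "c", "d"], lambda p: p in {"a", "c"}))
    print("target recall:", target_recall(["a", "b"], {"a", "x", "y", "z"}))
```

In a crawler, a relevance score of this kind would typically order the URL frontier, and the two metrics would be tracked as the crawl proceeds; how the paper's LPE combines content blocks and joint features into a link priority is not specified in the abstract and is not modeled here.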

Bibliographic details

  • Source
    Mathematical Problems in Engineering | 2016, Issue 5 | 6406901.1-6406901.10 | 10 pages
  • Author affiliations

    PLA Univ Sci & Technol, Coll Field Engn, Nanjing 210007, Jiangsu, Peoples R China;

    PLA Univ Sci & Technol, Coll Field Engn, Nanjing 210007, Jiangsu, Peoples R China;

    Baicheng Ordnance Test Ctr China, Baicheng 137000, Peoples R China;

    PLA Univ Sci & Technol, Coll Command Informat Syst, Nanjing 210007, Jiangsu, Peoples R China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords
