Journal: Information Retrieval

A General Evaluation Framework for Topical Crawlers

Abstract

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions, including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally, we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating, and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
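
To make the notions of precision and recall over a crawl concrete, the following is a minimal sketch under simplifying assumptions. It uses plain set-based definitions against a fixed set of target pages judged relevant; it is not the generalized measures the paper defines, which the abstract does not spell out. The function name `crawl_precision_recall`, the `step` parameter, and the example URLs are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def crawl_precision_recall(
    crawled: Iterable[str], targets: Set[str], step: int = 2
) -> List[Tuple[int, float, float]]:
    """Return (pages_crawled, precision, recall) after every `step` unique pages.

    Assumption: relevance is binary and given by membership in `targets`.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    seen: Set[str] = set()
    hits = 0
    curve: List[Tuple[int, float, float]] = []
    for url in crawled:
        if url in seen:
            continue  # count each page only once
        seen.add(url)
        if url in targets:
            hits += 1
        n = len(seen)
        if n % step == 0:
            precision = hits / n  # fraction of crawled pages that are targets
            recall = hits / len(targets) if targets else 0.0  # fraction of targets found
            curve.append((n, precision, recall))
    return curve


if __name__ == "__main__":
    # Hypothetical crawl order and target set, for illustration only.
    targets = {"http://example.org/a", "http://example.org/b"}
    crawl_order = [
        "http://example.org/a",
        "http://example.org/x",
        "http://example.org/b",
        "http://example.org/y",
    ]
    for n, p, r in crawl_precision_recall(crawl_order, targets, step=2):
        print(f"pages={n} precision={p:.2f} recall={r:.2f}")
```

Tracking both values at checkpoints, rather than once at the end, shows how quickly a crawler finds relevant pages, which is one way an efficiency dimension could be read off the same data.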
