
Soft-404 Pages, A Crawling Problem


Abstract

While traversing the Web, crawler systems have to deal with multiple challenges. Some of these relate to detecting garbage content so that resources are not wasted processing it. Soft-404 pages are a type of garbage content generated when web servers do not use the appropriate HTTP response code for dead links, so that those links are incorrectly identified. Our analysis of the Web revealed that 7.35% of web servers respond to a request for an unknown document with an HTTP 200 code instead of a 404 code, which would indicate that the document was not found. This paper presents a system called Soft404Detector, which uses web content analysis to identify Soft-404 pages. The system combines a set of content-based heuristics with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that the system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
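The server-level behaviour described in the abstract (a 200 response to a request for an unknown document) can be illustrated with a small probe. The sketch below is not Soft404Detector, which according to the abstract relies on content-based heuristics and a C4.5 classifier; it only shows one plausible way to check whether a server answers 200 for a document that should not exist. The function names `random_probe_url`, `classify_probe`, and `probe_server`, the random `.html` path, and the three result labels are assumptions made for illustration.

```python
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen


def random_probe_url(base_url: str) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    # Hypothetical choice: a random 32-character hex name at the site root.
    return urljoin(base_url, "/" + uuid.uuid4().hex + ".html")


def classify_probe(status_code: int) -> str:
    """Interpret the status code returned for a URL known not to exist."""
    if status_code == 200:
        return "soft-404"   # server reports success for a missing document
    if status_code in (404, 410):
        return "hard-404"   # server correctly reports the document is missing
    return "other"          # e.g. 403 or 5xx: inconclusive for this check


def probe_server(base_url: str, timeout: float = 10.0) -> str:
    """Request a random non-existent document and classify the response."""
    try:
        # urlopen follows redirects, so a redirect to a 200 page counts as 200.
        with urlopen(random_probe_url(base_url), timeout=timeout) as resp:
            return classify_probe(resp.status)
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return classify_probe(err.code)
    except URLError:
        return "unreachable"
```

Calling `probe_server("http://example.com/")` would issue one request and return one of the labels above. A probe like this only characterises a server; deciding whether an individual page is a Soft-404 page still requires content analysis of the kind the paper describes.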

Bibliographic record

  • Source
    Journal of digital information management | 2014, Issue 2 | pp. 73-92 | 20 pages
  • Author affiliations

    Communications and Information Technologies Department Facultade de Informatica Universidade da Coruna (University of A Coruna) Campus de A Coruna, 15071 (A Coruna), Spain;

    Communications and Information Technologies Department Facultade de Informatica Universidade da Coruna (University of A Coruna) Campus de A Coruna, 15071 (A Coruna), Spain;

    Communications and Information Technologies Department Facultade de Informatica Universidade da Coruna (University of A Coruna) Campus de A Coruna, 15071 (A Coruna), Spain;

  • Original format: PDF
  • Language: English
  • Keywords

    Soft-404 Error; Web Spam; Web Decay; Link Analysis; Data Mining; Statistical Properties of the Web;

