Finding pages on the unarchived Web

Abstract

Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
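The abstract describes reconstructing representations of unarchived pages from the anchor text of links found in crawled pages, and then retrieving them in a known-item search setting. Below is a minimal sketch of that idea under simplifying assumptions: the crawl is modeled as (source URL, target URL, anchor text) triples, and ranking uses a plain TF-IDF-style score. The names (Link, build_representations, known_item_search) and the scoring function are illustrative, not taken from the paper.

```python
# Sketch: represent unarchived pages by the anchor text of inlinks from
# crawled pages, then search those representations (known-item style).
# All names and the toy ranking function are illustrative assumptions.

import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Link:
    source_url: str   # crawled page containing the link
    target_url: str   # linked page (possibly unarchived)
    anchor_text: str  # text of the <a> element


def build_representations(links, archived_urls):
    """Aggregate anchor terms per unarchived target URL."""
    reps = defaultdict(Counter)
    for link in links:
        if link.target_url in archived_urls:
            continue  # only reconstruct pages missing from the archive
        reps[link.target_url].update(link.anchor_text.lower().split())
    return reps


def known_item_search(query, reps, k=10):
    """Rank unarchived URLs by a simple TF-IDF-style score over anchor terms."""
    n_docs = len(reps)
    df = Counter()                       # document frequency of each term
    for terms in reps.values():
        df.update(terms.keys())
    scores = {}
    for url, terms in reps.items():
        score = 0.0
        for t in query.lower().split():
            if terms[t]:
                score += terms[t] * math.log(1 + n_docs / df[t])
        if score > 0:
            scores[url] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    links = [
        Link("http://a.example/crawled", "http://b.example/missing", "city archive homepage"),
        Link("http://c.example/crawled", "http://b.example/missing", "municipal archive"),
        Link("http://c.example/crawled", "http://d.example/missing", "holiday photos"),
    ]
    reps = build_representations(
        links,
        archived_urls={"http://a.example/crawled", "http://c.example/crawled"},
    )
    print(known_item_search("municipal archive", reps))
```

The skewed distribution reported in the abstract shows up directly in such representations: frequently linked pages (e.g. home pages) accumulate many anchor terms, while most pages receive only a handful, which is why the known-item retrieval step has to work from very succinct descriptions.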

