Finding pages on the unarchived Web

Abstract

Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
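The abstract describes reconstructing representations of unarchived pages from the anchor text of links found in crawled pages, and then retrieving them in a known-item search setting. Below is a minimal sketch of that idea under simplifying assumptions: the crawl is modeled as (source URL, target URL, anchor text) triples, and ranking uses a plain TF-IDF-style score. The names (Link, build_representations, known_item_search) and the scoring function are illustrative, not taken from the paper.

```python
# Sketch: represent unarchived pages by the anchor text of inlinks from
# crawled pages, then search those representations (known-item style).
# All names and the toy ranking function are illustrative assumptions.

import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Link:
    source_url: str   # crawled page containing the link
    target_url: str   # linked page (possibly unarchived)
    anchor_text: str  # text of the <a> element


def build_representations(links, archived_urls):
    """Aggregate anchor terms per unarchived target URL."""
    reps = defaultdict(Counter)
    for link in links:
        if link.target_url in archived_urls:
            continue  # only reconstruct pages missing from the archive
        reps[link.target_url].update(link.anchor_text.lower().split())
    return reps


def known_item_search(query, reps, k=10):
    """Rank unarchived URLs by a simple TF-IDF-style score over anchor terms."""
    n_docs = len(reps)
    df = Counter()                       # document frequency of each term
    for terms in reps.values():
        df.update(terms.keys())
    scores = {}
    for url, terms in reps.items():
        score = 0.0
        for t in query.lower().split():
            if terms[t]:
                score += terms[t] * math.log(1 + n_docs / df[t])
        if score > 0:
            scores[url] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    links = [
        Link("http://a.example/crawled", "http://b.example/missing", "city archive homepage"),
        Link("http://c.example/crawled", "http://b.example/missing", "municipal archive"),
        Link("http://c.example/crawled", "http://d.example/missing", "holiday photos"),
    ]
    reps = build_representations(
        links,
        archived_urls={"http://a.example/crawled", "http://c.example/crawled"},
    )
    print(known_item_search("municipal archive", reps))
```

The skewed distribution reported in the abstract shows up directly in such representations: frequently linked pages (e.g. home pages) accumulate many anchor terms, while most pages receive only a handful, which is why the known-item retrieval step has to work from very succinct descriptions.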

