NEAR-Miner: Mining Evolution Associations of Web Site Directories for Efficient Maintenance of Web Archives

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading only those Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can easily be retrieved from the archive.

In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that, during the next crawl, optimally skips the subdirectories that are negatively correlated with an ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process while sacrificing only slightly in the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.
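The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's NEAR-Miner or WARM algorithm: the abstract does not define the correlation measure or thresholds, so the phi coefficient, the `threshold` value, and the names `phi`, `mine_negative_rules`, `dirs_to_skip`, `history`, and `parent_of` are all assumptions made for illustration. Each directory is given a hypothetical 0/1 history recording whether it changed significantly between consecutive archived versions.

```python
from math import sqrt


def phi(a, b):
    """Phi coefficient between two equal-length 0/1 change histories.

    Assumption: phi stands in for the paper's (unspecified here) measure.
    """
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one history is constant, so no correlation is measurable
    return (n11 * n00 - n10 * n01) / denom


def mine_negative_rules(history, parent_of, threshold=-0.5):
    """Collect (ancestor, descendant) pairs with phi <= threshold."""
    rules = set()
    for child in history:
        ancestor = parent_of.get(child)
        while ancestor is not None:  # walk up to the site root
            if phi(history[ancestor], history[child]) <= threshold:
                rules.add((ancestor, child))
            ancestor = parent_of.get(ancestor)
    return rules


def dirs_to_skip(changed_now, rules):
    """Descendants to skip, given ancestors seen to change in this crawl."""
    return {child for ancestor, child in rules if ancestor in changed_now}


# Hypothetical change histories over six consecutive archive versions.
history = {
    "/": [1, 1, 0, 1, 0, 1],
    "/news": [1, 1, 0, 1, 0, 1],
    "/news/archive": [0, 0, 1, 0, 1, 0],
}
parent_of = {"/": None, "/news": "/", "/news/archive": "/news"}

rules = mine_negative_rules(history, parent_of)
skipped = dirs_to_skip({"/news"}, rules)
print(sorted(rules))
print(sorted(skipped))
```

In this toy data, `/news/archive` changes only when its ancestors do not, so a crawl that observes a significant change in `/news` would skip `/news/archive` and rely on the archived copy. The rules are mined once, off-line, and reused across crawls, which mirrors the abstract's remark that the rules need not be rediscovered frequently.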

Bibliographic details

Source
International Conference on Very Large Data Bases (VLDB 2009), 2009, pp. 1128-1139 (12 pages)
Authors
Ling Chen; Sourav S Bhowmick; Wolfgang Nejdl
Original format: PDF
Chinese Library Classification: TP311.13

