International Conference on Very Large Data Bases (VLDB 2009)

NEAR-Miner: Mining Evolution Associations of Web Site Directories for Efficient Maintenance of Web Archives

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages that are located in hierarchical directory structures which are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive.

In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the Web directory structures of the original Web site, as stored in the archive, and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that, during the next crawl, optimally skips the subdirectories that are negatively correlated with their ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process while sacrificing only slightly in the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.
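The idea of mining negative rules between ancestor and descendant directories, then using them to skip subdirectories, can be illustrated with a minimal Python sketch. This is not the NEAR-Miner or WARM algorithm from the paper: the abstract does not give the correlation measure or thresholds, so the confidence-style score, the `min_support` and `min_confidence` parameters, the function names, and the example paths are all assumptions made for illustration.

```python
def is_ancestor(a: str, d: str) -> bool:
    """Return True if directory path `a` is a proper ancestor of `d`."""
    prefix = a.rstrip("/") + "/"
    return d.startswith(prefix) and d.rstrip("/") + "/" != prefix


def mine_negative_rules(history, min_support=2, min_confidence=0.8):
    """Mine rules of the form 'ancestor changed => descendant did not'.

    `history` is a list of sets; each set holds the directories that
    changed significantly between two consecutive archived versions.
    The score used here (share of the ancestor's changes in which the
    descendant stayed unchanged) is an assumed stand-in measure.
    Only directories that appear somewhere in `history` are considered.
    """
    dirs = set().union(*history) if history else set()
    rules = {}
    for a in dirs:
        a_changes = [changed for changed in history if a in changed]
        if len(a_changes) < min_support:
            continue  # too little evidence about this ancestor
        for d in dirs:
            if not is_ancestor(a, d):
                continue
            together = sum(1 for changed in a_changes if d in changed)
            confidence = 1 - together / len(a_changes)
            if confidence >= min_confidence:
                rules.setdefault(a, set()).add(d)
    return rules


def dirs_to_skip(rules, changed_now):
    """Collect descendants that the rules suggest skipping in this crawl."""
    skip = set()
    for a in changed_now:
        skip |= rules.get(a, set())
    return skip


# Hypothetical change history over five version transitions.
history = [
    {"/news", "/news/sports"},
    {"/news", "/news/sports"},
    {"/news/archive"},
    {"/news", "/news/sports"},
    {"/news"},
]
rules = mine_negative_rules(history)
print(rules)
print(dirs_to_skip(rules, {"/news"}))
```

In this toy history, `/news/archive` never changes in the same transition as `/news`, so a crawler that observes a significant change in `/news` would skip `/news/archive`, while `/news/sports` would still be crawled because it usually changes together with its parent.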
