xCrawl: a high-recall crawling method for Web mining

Kostyantyn Shchekotykhin; Dietmar Jannach; Gerhard Friedrich

Abstract

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
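
As a small illustration of the recall and precision measures mentioned in the abstract, the following Python sketch computes both for a single crawl result. It is not taken from the paper: the function name `recall_precision` and the page identifiers are hypothetical, and recall is computed against the set of relevant documents, which is a common convention assumed here.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a collection of retrieved documents.

    Both arguments are collections of document identifiers
    (hypothetical identifiers, used for illustration only).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents the crawler actually found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Illustrative data only: four relevant pages exist, the crawler returned three.
relevant_pages = {"a.html", "b.html", "c.html", "d.html"}
crawled_pages = {"a.html", "b.html", "x.html"}
recall, precision = recall_precision(crawled_pages, relevant_pages)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the crawler finds two of the four relevant pages, so recall is 0.5, while one of its three results is irrelevant, so precision is about 0.67. A crawler tuned for recall tries to raise the first number without lowering the second.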

Bibliographic details

Source: Knowledge and Information Systems, 2010, Issue 2, pp. 303-326 (24 pages)
Authors: Kostyantyn Shchekotykhin; Dietmar Jannach; Gerhard Friedrich
Original format: PDF
Language: English
Keywords: Web mining; Information retrieval; Web crawling; Information extraction

Similar literature

1. xCrawl: a high-recall crawling method for Web mining [J]. Kostyantyn Shchekotykhin, Dietmar Jannach, Gerhard Friedrich. Knowledge and Information Systems, 2010, Issue 2.
2. Medical informatics labor market analysis using web crawling, web scraping, and text mining [J]. Schedlbauer Jurgen, Raptis Georgios, Ludwig Bernd. International Journal of Medical Informatics, 2021, Issue Juna.
3. Mining the web with hierarchical crawlers - a resource sharing based crawling approach [J]. Anirban Kundu, Ruma Dutta, Rana Dattagupta. International Journal of Intelligent Information and Database Systems, 2009, Issue 1.
4. xCrawl: A High-Recall Crawling Method for Web Mining [C]. Shchekotykhin Kostyantyn, Jannach Dietmar, Friedrich Gerhard. International Conference on Data Mining, 2008.
5. A Study on Composite Data Mining Methods for Linking Real-World Information with Web Resources [D]. Liao Chenyi (廖宸一), 2019.
6. Clustering as a Data Mining Method in a Web-based System for Thoracic Surgery [O]. Örjan Dahlström, Ankica Babic, Johan Antonsson, 2001.
7. Board Forum Crawling: A Web Crawling Method for Web Forum [O]. Yan Guo, Kui Li, Kai Zhang, 2006.
