Conference: International Conference on Web Information Systems Engineering

NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model


Abstract

As the most popular information publishing platform, the Web contains a great deal of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied over the last decade, few easy-to-use web data mining tools are available for users to extract useful data from the Web. Web information extraction is a complete process involving web page navigation, data extraction, and data integration. Unfortunately, most existing studies and systems give insufficient consideration to this three-stage process. Most of them also lack rules powerful enough to express the flexible extraction logic needed to extract data records with complicated structure. In this paper, we propose a novel web data extraction language, NEXIR, designed around a three-stage web data extraction model. First, the language can define rules that let the system automate the navigation of web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complex rules to extract data records from web pages and integrate the extracted data into a pre-defined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.
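The three stages named in the abstract (navigation, extraction, integration) can be illustrated with a minimal Python sketch. This is not NEXIR syntax and not the authors' engine; the abstract does not show either. Every name here (`SITE`, `Product`, `navigate`, `extract`, `integrate`, `run`) is hypothetical, and an in-memory dictionary stands in for real page fetches.

```python
import re
from dataclasses import dataclass

# Hypothetical in-memory "site" (URL -> HTML); a stand-in for real HTTP fetches.
SITE = {
    "/list?page=1": '<li class="item"><b>Widget</b><i>9.50</i></li>'
                    '<li class="item"><b>Gadget</b><i>12.00</i></li>'
                    '<a rel="next" href="/list?page=2">next</a>',
    "/list?page=2": '<li class="item"><b>Gizmo</b><i>3.25</i></li>',
}


@dataclass
class Product:
    """Assumed pre-defined target structure for integrated records."""
    name: str
    price: float


# Illustrative rules written as regular expressions (an assumption, not NEXIR).
NEXT_LINK = re.compile(r'<a rel="next" href="([^"]+)"')
RECORD = re.compile(r'<li class="item"><b>(.*?)</b><i>(.*?)</i></li>')


def navigate(start_url, site):
    """Stage 1: yield each page's HTML, following rel="next" links."""
    url, seen = start_url, set()
    while url in site and url not in seen:
        seen.add(url)
        html = site[url]
        yield html
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None


def extract(html):
    """Stage 2: pull raw (name, price) string pairs out of one page."""
    return RECORD.findall(html)


def integrate(raw_records):
    """Stage 3: map raw strings into the pre-defined structure."""
    return [Product(name=n.strip(), price=float(p)) for n, p in raw_records]


def run(start_url, site=SITE):
    """Chain the three stages over every reachable page."""
    raw = [rec for html in navigate(start_url, site) for rec in extract(html)]
    return integrate(raw)
```

Calling `run("/list?page=1")` walks both pages and returns a list of `Product` objects. Keeping the stages as separate functions mirrors the separation the abstract argues for: navigation rules, extraction rules, and the target structure can each change without touching the others.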
