NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model

Abstract

As the most popular information publishing platform, the Web contains a great deal of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied over the last decade, there are still few easy-to-use web data mining tools that let users extract useful data from the Web. Web information extraction is an end-to-end process involving web page navigation, data extraction, and data integration. Unfortunately, most existing studies and systems give insufficient consideration to this three-stage process. Most of them also lack rules powerful enough to express the flexible extraction logic needed for data records with complicated structure. In this paper, we propose a novel web data extraction language, NEXIR, designed around a three-stage web data extraction model. First, the language can define rules that let the system automate the navigation of web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complex rules to extract data records from web pages and integrate the extracted data into a pre-defined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.
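The three stages named in the abstract (navigation, extraction, integration) can be illustrated with a minimal Python sketch. This is not NEXIR syntax and not the authors' engine, neither of which is shown on this page. The in-memory `PAGES` table, the regular-expression "rules", and the `Product` structure are all hypothetical stand-ins chosen only to make the stages concrete.

```python
import re
from dataclasses import dataclass

# Hypothetical in-memory "site" (URL -> HTML); stands in for real page fetches.
PAGES = {
    "/list?page=1": '<div class="item"><b>Widget</b><i>9.50</i></div>'
                    '<div class="item"><b>Gadget</b><i>12.00</i></div>'
                    '<a class="next" href="/list?page=2">next</a>',
    "/list?page=2": '<div class="item"><b>Gizmo</b><i>3.25</i></div>',
}


@dataclass
class Product:
    """Hypothetical pre-defined target structure used in the integration stage."""
    name: str
    price: float


# Hypothetical "rules": one for navigation, one for record extraction.
NEXT_LINK = re.compile(r'<a class="next" href="([^"]+)">')
RECORD = re.compile(r'<div class="item"><b>(.*?)</b><i>(.*?)</i></div>')


def navigate(start):
    """Stage 1: follow the 'next' link from page to page, yielding each page's HTML."""
    url, seen = start, set()
    while url is not None and url not in seen:
        seen.add(url)
        html = PAGES[url]
        yield html
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None


def extract(html):
    """Stage 2: apply the record rule and return raw (name, price) string pairs."""
    return RECORD.findall(html)


def integrate(raw_records):
    """Stage 3: map raw string pairs into the pre-defined Product structure."""
    return [Product(name=name, price=float(price)) for name, price in raw_records]


def run(start):
    """Chain the three stages over every page reachable from the start URL."""
    products = []
    for html in navigate(start):
        products.extend(integrate(extract(html)))
    return products


if __name__ == "__main__":
    for product in run("/list?page=1"):
        print(product)
```

Keeping each stage in its own function mirrors the separation the abstract argues for: the navigation rule can change without touching the extraction rule or the target structure. A real system would also have to handle form submission and user interaction on deep web pages, which this sketch omits.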

Bibliographic record

Source: International Conference on Web Information Systems Engineering | 2013 | 14 pages
Authors: Shengsheng Shi; Wu Wei; Yulong Liu; Haitao Wang; Lei Luo; Chunfeng Yuan; Yihua Huang
Original format: PDF
Chinese Library Classification: Computer networks
Keywords: Web data extraction; Extraction rule language; Data record; Web page navigation; Web data integration

Similar literature

1. Extraction of Frequent Sequential Patterns From Web Usage Data and Their Applications In Pre-Fetching Rules Generation For Effective Web Latency Reduction [J]. Badong Chen, Yueqin Zhu. Advances in applied computational mechanics. 2018, No. 1
2. Extraction of Frequent Sequential Patterns From Web Usage Data and Their Applications In Pre-Fetching Rules Generation For Effective Web Latency Reduction [J]. Nooredin Ghadiri Massoom. Advances in applied computational mechanics. 2017, No. 1
3. Monadic Datalog and the Expressive Power of Languages for Web Information Extraction [J]. Georg Gottlob, Christoph Koch. Journal of the Association for Computing Machinery. 2004, No. 1
4. NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model [C]. Shengsheng Shi, Wu Wei, Yulong Liu. International conference on web information systems engineering. 2013
5. Heuristic rules for extraction of ontology from Web pages in WebOntEx. [D]. Jain, Bhanu Chaturvedi. 2000
6. BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data [O]. Swetlana Nikolajewa, Rainer Pudimat, Michael Hiller. 2007
7. Logic, languages, and rules for web data extraction and reasoning over data [O]. Gottlob, G, Koch, C, Pieris, A. 2017
