Post-Supervised Template Induction for Information Extraction from Lists and Tables in Dynamic Web Sources

Z. SHI; E. MILIOS; N. ZINCIR-HEYWOOD




Abstract

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples, while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance while requiring minimal user input compared to fully supervised techniques.
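
The abstract does not spell out the column-extraction algorithm, so the following is only a loose illustration of how dynamic programming can separate shared template tokens from variable fields. It is not the authors' method. It assumes two rows have already been tokenized, aligns them with a longest-common-subsequence table, and treats the tokens between shared template tokens as column values. The function names `lcs_align` and `split_fields` are hypothetical.

```python
def lcs_align(a, b):
    """Return the longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the shared tokens in order.
    i = j = 0
    common = []
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


def split_fields(row, template):
    """Split a row into fields: runs of tokens that fall between template tokens."""
    fields, current, t = [], [], 0
    for tok in row:
        if t < len(template) and tok == template[t]:
            # A template token closes the field collected so far, if any.
            if current:
                fields.append(" ".join(current))
                current = []
            t += 1
        else:
            current.append(tok)
    if current:
        fields.append(" ".join(current))
    return fields


# Assumed example rows, tokenized into tags and words.
row1 = ["<tr>", "<td>", "Alice", "</td>", "<td>", "Halifax", "</td>", "</tr>"]
row2 = ["<tr>", "<td>", "Bob", "Smith", "</td>", "<td>", "Toronto", "</td>", "</tr>"]
template = lcs_align(row1, row2)
print(split_fields(row1, template))  # ['Alice', 'Halifax']
print(split_fields(row2, template))  # ['Bob Smith', 'Toronto']
```

In this sketch the tokens common to both rows stand in for the template, and whatever falls between them is read as a column value. A practical system would need to handle data tokens that happen to coincide with template tokens, which this illustration ignores.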


Bibliographic details

Source: Journal of Intelligent Information Systems, 2005, Issue 1, pp. 69-93 (25 pages)
Authors: Z. SHI; E. MILIOS; N. ZINCIR-HEYWOOD
Author affiliation: Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada B3H 1W5
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: information extraction; grammar induction; template induction; unsupervised learning


Similar literature

1. Optimized Template Detection and Extraction Algorithm for Web Scraping of Dynamic Web Pages [J]. Xin Luo. Journal of wavelet theory and applications, 2017, Issue 2.
2. First-Order Logic Rule Induction for Information Extraction in Web Resources [J]. Jose Ignacio Fernandez-Villamor, Carlos Angel Iglesias, Mercedes Garijo. International Journal of Artificial Intelligence Tools: Architectures, Languages, Algorithms, 2012, Issue 6.
3. Implementation of a weblog extraction system with an improved template extraction technique [J]. E Chang. 中国文献情报 (English-language journal), 2013, Issue 001.
4. Post-supervised Template Induction for Dynamic Web Sources [C]. Zhongmin Shi, Evangelos Milios, Nur Zincir-Heywood. Advances in Artificial Intelligence, 2003.
5. Post-supervised template induction for information extraction from lists and tables in Web sources [D]. Shi, Zhongmin. 2002.
6. Web Thermo Tables – an On-Line Version of the TRC Thermodynamic Tables [O]. Andrei Kazakov, Chris D Muzny, Robert D Chirico. 2008.
7. Post-supervised Template Induction for Information Extraction from Lists and Tables in Dynamic Web Sources [O]. Z. Shi, E. Milios, N. Zincir-Heywood. 2008.
