Journal: Journal of Intelligent Information Systems

Post-Supervised Template Induction for Information Extraction from Lists and Tables in Dynamic Web Sources

Abstract

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples while involving the user only minimally to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance with minimal user input compared to fully supervised techniques.
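The abstract does not spell out the column-extraction algorithm, so the following is only a rough illustration of how dynamic programming can line up fields across rows. It is a generic global sequence alignment (Needleman-Wunsch style), not the authors' method. The function name `align_rows`, the field labels, and the scoring values (+1 match, -1 mismatch, -1 gap) are all assumptions made for this sketch.

```python
from typing import List, Optional, Tuple

GAP = None  # marks a field that is missing from one of the two rows


def align_rows(a: List[str], b: List[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Globally align two rows' field-label sequences with dynamic programming.

    Scoring is an illustrative assumption: +1 match, -1 mismatch, -1 gap.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i
    for j in range(1, m + 1):
        score[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else -1)
            score[i][j] = max(diag, score[i - 1][j] - 1, score[i][j - 1] - 1)

    # Trace back from the bottom-right corner to recover the field pairing.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 1 if a[i - 1] == b[j - 1] else -1
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - 1:
            pairs.append((a[i - 1], GAP))  # field present only in row a
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))  # field present only in row b
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Hypothetical rows: the second row lacks a "price" field.
    print(align_rows(["name", "price", "date"], ["name", "date"]))
```

Aligned positions can be read as candidate columns, with a gap indicating an optional field that one row lacks. A real extractor would work on HTML tokens rather than hand-written labels.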
