Graph-based weakly-supervised methods for information extraction & integration.

Abstract

The variety and complexity of potentially related data resources available for querying (webpages, databases, data warehouses) has been growing rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most current IE and II methods that could be applied to the problem of integration across sources require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration.

Within IE, we focus on the problem of assigning semantic classes to entities. First, we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers.

The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. We also show how class-instance acquisition performance in the graph-based semi-supervised learning (SSL) setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases.

Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. Finally, we demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.
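The abstract's idea of spreading a few seed labels over a graph can be illustrated with a minimal sketch. This is a generic, simplified label propagation loop with clamped seeds, not the thesis's Adsorption or MAD algorithm; the function name `propagate_labels`, the toy instance-pattern graph, and the class names are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=20):
    """Spread class scores from seed nodes over a weighted undirected graph.

    edges: iterable of (node_a, node_b, weight) tuples.
    seeds: dict mapping node -> class label (the limited human input).
    Returns a dict mapping node -> {label: score}; unreached nodes map to {}.
    """
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = w
        neighbors[b][a] = w

    # Seed nodes start with all mass on their known label.
    scores = {node: {} for node in neighbors}
    for node, label in seeds.items():
        scores[node] = {label: 1.0}

    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:
                # Clamp seeds so the supervision is never overwritten.
                updated[node] = {seeds[node]: 1.0}
                continue
            # Weighted sum of the neighbors' current label scores.
            totals = defaultdict(float)
            for other, weight in neighbors[node].items():
                for label, score in scores[other].items():
                    totals[label] += weight * score
            norm = sum(totals.values())
            updated[node] = (
                {label: s / norm for label, s in totals.items()} if norm else {}
            )
        scores = updated
    return scores


# Hypothetical toy graph: entity instances linked to the context patterns
# they occur in. Only "Paris" and "Aspirin" carry seed labels.
edges = [
    ("Paris", "capital of X", 1.0),
    ("Berlin", "capital of X", 1.0),
    ("Aspirin", "take X daily", 1.0),
    ("Ibuprofen", "take X daily", 1.0),
]
seeds = {"Paris": "city", "Aspirin": "drug"}

result = propagate_labels(edges, seeds)
best = {node: max(s, key=s.get) for node, s in result.items() if s}
```

In this toy graph the unlabeled instances pick up the label of the seed that shares a context pattern with them, which is the recall-expanding effect the abstract describes in general terms.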

Bibliographic record

  • Author

    Talukdar, Partha Pratim

  • Author affiliation

    University of Pennsylvania

  • Degree-granting institution: University of Pennsylvania
  • Subject: Information Technology; Information Science; Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 170 p.
  • Total pages: 170
  • Original format: PDF
  • Language of text: eng
