Journal: Information Processing & Management

Sampling strategies for information extraction over the deep web

Abstract

Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than is possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections to process, and in which order, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary in how they account for query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary in how they distribute the extraction effort over documents. We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
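
To make the idea of a query execution schedule concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a toy in-memory collection that can only be reached through keyword search, and it compares a fixed round-robin schedule with a schedule that favors queries whose results have so far been useful. All names (`ToyCollection`, `sample_documents`, `looks_useful`) and the example documents are hypothetical; `looks_useful` stands in for running a real extraction system on a document. The document retrieval and processing schedules mentioned in the abstract are not modeled here.

```python
from collections import defaultdict


class ToyCollection:
    """Hypothetical stand-in for a deep-web collection: no crawling, only keyword search."""

    def __init__(self, docs, page_size=2):
        self.docs = docs            # doc_id -> text
        self.page_size = page_size  # number of results returned per query call
        self.index = defaultdict(list)
        for doc_id in sorted(docs):
            for word in set(docs[doc_id].lower().split()):
                self.index[word].append(doc_id)

    def search(self, query, offset=0):
        """Return one page of matching doc ids, starting at `offset`."""
        hits = self.index.get(query.lower(), [])
        return hits[offset:offset + self.page_size]


def sample_documents(collection, queries, is_useful, budget, adaptive):
    """Issue at most `budget` query calls; return sampled doc ids in retrieval order."""
    offsets = {q: 0 for q in queries}
    stats = {q: [0, 0] for q in queries}  # [new useful docs, results returned]
    exhausted, seen, sample = set(), set(), []

    def score(q):
        useful, returned = stats[q]
        # Queries that have never been issued are tried first.
        return useful / returned if returned else float("inf")

    for step in range(budget):
        live = [q for q in queries if q not in exhausted]
        if not live:
            break
        # Adaptive: pick the query with the best observed yield.
        # Otherwise: cycle through the queries in a fixed order.
        query = max(live, key=score) if adaptive else live[step % len(live)]
        results = collection.search(query, offsets[query])
        if not results:
            exhausted.add(query)  # this call still counts against the budget
            continue
        offsets[query] += len(results)
        stats[query][1] += len(results)
        for doc_id in results:
            if doc_id in seen:
                continue
            seen.add(doc_id)
            sample.append(doc_id)
            if is_useful(collection.docs[doc_id]):
                stats[query][0] += 1
    return sample


def looks_useful(text):
    """Assumed proxy for 'the extraction system finds something in this document'."""
    return "earthquake" in text


docs = {
    1: "city council votes", 2: "city park opens", 3: "city budget grows",
    4: "city mayor speaks", 5: "city traffic worsens", 6: "city earthquake damage",
    7: "earthquake hits coast", 8: "earthquake shakes region",
    9: "earthquake relief arrives", 10: "earthquake toll rises",
    11: "earthquake drill planned",
}
collection = ToyCollection(docs)
queries = ["city", "earthquake"]

round_robin = sample_documents(collection, queries, looks_useful, budget=4, adaptive=False)
adaptive = sample_documents(collection, queries, looks_useful, budget=4, adaptive=True)

print(sum(looks_useful(docs[d]) for d in round_robin))  # useful docs, round-robin
print(sum(looks_useful(docs[d]) for d in adaptive))     # useful docs, adaptive
```

With the same budget of four query calls, the adaptive schedule in this toy setup spends its remaining calls on the query that has returned useful documents, while the round-robin schedule keeps alternating. A real deep-web interface would add latency, result caps, and ranking effects that this sketch ignores.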
