Sampling strategies for information extraction over the deep web

Pablo Barrio; Luis Gravano

Abstract

Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections, and in which order, to process, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary on how they account for the query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary on how they distribute the extraction effort over documents. We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
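To make the two design dimensions named in the abstract more concrete, the following is a minimal Python sketch of query-based document sampling over a collection that can only be reached through a search interface. It is an illustration under stated assumptions, not one of the strategies evaluated in the paper: the `ToyCollection` class, the `is_useful` stand-in for an extraction system, the round-robin schedule, and the `batch` and `min_precision` parameters are all hypothetical choices made for this example.

```python
class ToyCollection:
    """Hypothetical stand-in for a deep-web collection: no crawling, only search."""

    def __init__(self, docs):
        self._docs = docs  # doc_id -> text

    def search(self, query):
        # Return ids of documents whose text contains the query word.
        return [doc_id for doc_id, text in sorted(self._docs.items())
                if query in text.lower().split()]

    def fetch(self, doc_id):
        return self._docs[doc_id]


def is_useful(text):
    # Hypothetical placeholder for running an extraction system on a document.
    return "acquired" in text.lower().split()


def sample_documents(collection, queries, sample_size, batch=1, min_precision=0.5):
    """Round-robin over queries, processing `batch` new documents per turn.

    A query is dropped once its results are exhausted or once the fraction of
    useful documents it has produced falls below `min_precision`.
    """
    results = {q: collection.search(q) for q in queries}
    cursor = {q: 0 for q in queries}
    processed = {q: 0 for q in queries}
    useful = {q: 0 for q in queries}
    active = list(queries)
    sample = {}  # doc_id -> whether the document was useful for extraction

    while active and len(sample) < sample_size:
        for q in list(active):
            taken = 0
            while (taken < batch and cursor[q] < len(results[q])
                   and len(sample) < sample_size):
                doc_id = results[q][cursor[q]]
                cursor[q] += 1
                if doc_id in sample:
                    continue  # already sampled through another query
                sample[doc_id] = is_useful(collection.fetch(doc_id))
                processed[q] += 1
                useful[q] += sample[doc_id]
                taken += 1
            exhausted = cursor[q] >= len(results[q])
            low_yield = processed[q] > 0 and useful[q] / processed[q] < min_precision
            if exhausted or low_yield:
                active.remove(q)
    return sample


docs = {
    "d1": "Acme acquired Bolt in 2010",
    "d2": "Weather report for the company picnic",
    "d3": "Cato acquired Delta company",
    "d4": "Recipe for soup",
    "d5": "The merger talks failed",
}
sample = sample_documents(ToyCollection(docs), ["company", "acquired", "soup"], sample_size=4)
print(sample)
```

In this sketch, dropping low-yield queries is one simple way a schedule could account for query effectiveness, and the per-turn `batch` cap is one simple way to spread the extraction effort across queries; other choices are equally plausible.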

Bibliographic details

Source: Information Processing & Management, 2017, Issue 2, pp. 309-331 (23 pages)
Authors: Pablo Barrio; Luis Gravano
Author affiliation (both authors): Columbia University, Computer Science Department, 500 West 120th Street, Room 405, MC0401, New York, NY 10027, USA
Original format: PDF
Language: English
Keywords: Information extraction; Sampling; Deep web; Text mining; Scalability
