Conference: Data integration in the life sciences

Site-Wide Wrapper Induction for Life Science Deep Web Databases



Abstract

We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e., pages that are all similar in structure and content. Traditional wrapper induction therefore targets Web pages generated from a database with the same generation template as the one observed in the example set. Life Science Web sites, however, typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such sites provide not only data but also schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: (1) classification of similar Web pages into classes, (2) discovery of these classes, and (3) wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report preliminary results, which are very promising.
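
To make the idea of grouping structurally similar pages into classes more concrete, here is a minimal sketch in Python. It is not the authors' algorithm, which the abstract does not specify. It assumes that a page's structure can be summarized by the set of root-to-element tag paths, and that two pages belong to the same class when the Jaccard similarity of those sets reaches a threshold. The names `TagPathCollector`, `tag_paths`, `jaccard`, `group_pages`, and the threshold value 0.6 are all hypothetical choices made for illustration. The wrapper induction step itself is not covered.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    # Void elements never get a closing tag, so they must not stay on the stack.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural signature of a page (assumption: the set of tag paths)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_pages(pages, threshold=0.6):
    """Greedy single-pass grouping: a page joins the first class whose
    representative signature is similar enough, else it starts a new class."""
    classes = []  # list of (representative_signature, [page names])
    for name, html in pages.items():
        signature = tag_paths(html)
        for representative, members in classes:
            if jaccard(signature, representative) >= threshold:
                members.append(name)
                break
        else:
            classes.append((signature, [name]))
    return [members for _, members in classes]
```

Called with a dictionary that maps page names to HTML strings, `group_pages` returns a list of name lists, one per discovered class. A greedy single pass keeps the sketch short; its result depends on page order and on the threshold, so a real system would likely need a more robust clustering method.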
