PRICAI 2010: Trends in Artificial Intelligence

A Unified Approach for Extracting Multiple News Attributes from News Pages

Abstract

Most previous work on web news article extraction focuses only on the content and title. To meet the growing demand from web data integration applications, more useful news attributes, such as publication date and author, need to be extracted and stored in structured form for further processing. In this paper, we study the problem of automatically extracting multiple news attributes from news pages. Unlike traditional approaches (e.g., extracting news attributes separately or generating template-dependent wrappers), we propose an automatic, unified approach that extracts them based on the visual features of news attributes, which include independent visual features and dependent visual features. The basic idea of our approach is that, first, the candidates for each news attribute are extracted from the news page based on their independent visual features, and then the true value of each attribute is identified from the candidates based on dependent visual features (the layout relations among news attributes). Extensive experiments on a large number of news pages show that the proposed approach is highly effective and efficient.
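
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` type, the font-size threshold, the date and author patterns, and the `layout_cost` function are all hypothetical stand-ins, since the abstract does not specify the actual visual features. Stage 1 selects candidates for each attribute using only features of the block itself; stage 2 picks the combination of candidates whose relative layout looks most plausible.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """A rendered text block with assumed (hypothetical) visual features."""
    text: str
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    font_size: float


# Assumed independent features: an ISO-style date and a "By ..." byline.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AUTHOR_RE = re.compile(r"^By\s+\S+", re.IGNORECASE)


def candidates(blocks):
    """Stage 1: collect candidates per attribute from each block's own features."""
    max_font = max(b.font_size for b in blocks)
    return {
        "title": [b for b in blocks if b.font_size >= 0.8 * max_font],
        "date": [b for b in blocks if DATE_RE.search(b.text)],
        "author": [b for b in blocks if AUTHOR_RE.match(b.text)],
    }


def layout_cost(title, date, author):
    """Stage 2 score under an assumed layout relation:
    date and author should sit shortly below the title."""
    cost = 0.0
    for block in (date, author):
        gap = block.y - title.y
        cost += gap if gap >= 0 else 1000.0  # heavy penalty if above the title
    return cost


def extract(blocks):
    """Pick the candidate combination with the lowest layout cost."""
    cands = candidates(blocks)
    if not all(cands.values()):
        return None  # some attribute has no candidate at all
    best = min(
        product(cands["title"], cands["date"], cands["author"]),
        key=lambda combo: layout_cost(*combo),
    )
    return dict(zip(("title", "date", "author"), (b.text for b in best)))


# A made-up page: a large site banner and footer lines act as distractors.
page = [
    Block("Daily News", 0, 0, 30),
    Block("Flood hits coastal town", 50, 100, 28),
    Block("By A. Reporter", 50, 140, 11),
    Block("2010-08-30", 200, 140, 11),
    Block("Rescue teams arrived overnight.", 50, 180, 12),
    Block("Updated 2010-01-01", 0, 900, 9),
    Block("By using this site you agree to the terms", 0, 920, 9),
]
result = extract(page)
```

In this toy page the banner and the footer lines each pass a stage-1 filter, and only the joint layout check in stage 2 rules them out. That is the motivation for extracting the attributes together rather than one at a time.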