A Very Efficient Approach to News Title and Content Extraction on the Web

Abstract

We consider the problem of efficient, template-independent news extraction on the Web. Popular news extraction methods rely on visual information; they achieve good accuracy, but their computational efficiency is poor because rendering a web page to obtain visual information is very time-consuming. In this paper we propose an efficient and effective news extraction approach based on novel features. Our approach requires neither training nor visual information, so it is simple and very efficient, and it can extract news information from various news sites without using templates. In our experiments, the proposed approach achieves 99% accuracy over 5,671 news pages from 20 different news sites, and it runs much faster than the baseline machine learning method that uses visual information.

Bibliographic details

Source
Proceedings of the 2011 ACM/IEEE on joint conference on digital libraries, 2011, pp. 389-390 (2 pages)
Conference location: Ottawa (CA)
Authors
Hualiang Yan; Jianwu Yang
Author affiliations

Institute of Computer Science Technology, Peking University, China, 100871
Original format: PDF
Language: English
Chinese Library Classification: Electronic libraries, digital libraries
Keywords
data extraction; web mining; web news

Similar literature

1. Content extraction from news web pages using tag tree [J]. Chandrakala Arya, Sanjay K. Dwivedi. International Journal of Autonomic Computing. 2018, No. 1

2. An Efficient Technique for Tag Extraction and Content Retrieval from Web Pages [J]. S. Sathya, Dr. B. Srinivasan. International Journal of Computer Trends and Technology. 2013, No. 9

3. A novel text mining approach for scholar information extraction from web content in Chinese [J]. Xia Xie, Yu Fu, Hai Jin. Future Generation Computer Systems. 2020, No. Octa

4. A Very Efficient Approach to News Title and Content Extraction on the Web [C]. Hualiang Yan, Jianwu Yang. ACM/IEEE on joint conference on digital libraries. 2011

5. The state of women's sports on the web: Content analyses of international sports news websites and athletes' Twitter profiles [D]. Coche, Roxane. 2013

6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015

7. Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites [O]. Thanadechteemapat W., Fung C.C. 2012