Content Extraction from Topic-Oriented Web Pages Based on Block Layout

Nie Hui (聂卉); Zhang Jinhua (张津华)

Abstract

This paper proposes a layout-based content extraction method for topic-oriented Web pages, aimed at removing page noise and consolidating page content. The algorithm first analyzes the DOM structure of the original page to build a tag tree. It then partitions the tag tree bottom-up into blocks according to tag categories and the information carried by the corresponding nodes, and classifies the blocks by their text density, link density, and image density. Next, it derives a text feature vector for the page's topic and, using the Vector Space Model (VSM), computes the text similarity between each block's content and the page topic to obtain the block's topic relevance. Topic relevance serves as the quantitative criterion for deciding which blocks to discard as noise and which to keep: blocks with high relevance are retained as the page's main content and used to reconstruct the page description. The method has been applied in an information collection project for talent information. Experiments indicate that it is effective for noise removal and content extraction on topic-oriented pages and performs well in practical use.
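The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the partition rules, the thresholds, or how the topic vector is built, so everything named here (`BLOCK_TAGS`, `MAX_LINK_DENSITY`, `MIN_RELEVANCE`, `BlockCollector`, `classify`, `extract_content`) is a hypothetical choice. The sketch assumes well-formed markup, attributes text to the innermost open block tag instead of performing a true bottom-up merge, takes the page `<title>` as the topic, and uses raw term frequencies with cosine similarity. The regex tokenizer suits space-delimited languages only; Chinese text would need a word segmenter.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical choices for this sketch, not values from the paper.
BLOCK_TAGS = {"div", "td", "p", "ul", "ol", "table"}
MAX_LINK_DENSITY = 0.5
MIN_RELEVANCE = 0.1


class BlockCollector(HTMLParser):
    """Attribute text, link text and images to the innermost open block tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks, in the order they were closed
        self.stack = []      # blocks that are currently open
        self.link_depth = 0  # greater than 0 while inside an <a> element
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"text": "", "link_chars": 0, "images": 0})
        elif tag == "a":
            self.link_depth += 1
        elif tag == "img" and self.stack:
            self.stack[-1]["images"] += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.stack:
            block = self.stack[-1]
            block["text"] += " " + text
            if self.link_depth:
                block["link_chars"] += len(text)


def classify(block):
    """Label a block as 'text', 'link', 'image' or 'empty' from its densities."""
    text_len = len(block["text"].strip())
    if text_len == 0:
        return "image" if block["images"] else "empty"
    if block["link_chars"] / text_len > MAX_LINK_DENSITY:
        return "link"
    return "text"


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def extract_content(html):
    """Return (relevance, text) for text blocks related to the page title."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    topic = Counter(tokenize(parser.title))
    kept = []
    for block in parser.blocks:
        if classify(block) != "text":
            continue  # link-heavy, image-only and empty blocks count as noise
        text = block["text"].strip()
        relevance = cosine(Counter(tokenize(text)), topic)
        if relevance >= MIN_RELEVANCE:
            kept.append((relevance, text))
    return kept


SAMPLE = """
<html><head><title>Python developer job in Guangzhou</title></head><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div><p>We are hiring a Python developer in Guangzhou to build web crawlers.</p></div>
<div><p>Copyright 2012. All rights reserved.</p></div>
</body></html>
"""

if __name__ == "__main__":
    for relevance, text in extract_content(SAMPLE):
        print(f"{relevance:.2f}  {text}")
```

On the sample page, the navigation bar is rejected by link density, the copyright line shares no terms with the title and is rejected by relevance, and only the job description paragraph is kept. A fuller version would weight terms (for example with TF-IDF) and merge sibling blocks upward, which this sketch leaves out.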

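Bibliographic details

Source
Journal of the China Society for Scientific and Technical Information (《情报学报》) | 2012, No. 1 | pp. 31-39 | 9 pages
Authors
Nie Hui (聂卉); Zhang Jinhua (张津华)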
Author affiliation
School of Information Management, Sun Yat-sen University, Guangzhou 510275 (both authors)
Original format: PDF
Language of text: Chinese
Keywords
Web page content extraction; Web page segmentation; Web page noise removal

Similar documents

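1. A Web Page Main-Text Extraction Algorithm for Campus Network Environments (一种校园网环境下的网页正文内容抽取算法) [J]. Lin Qiang. Journal of Hubei Adult Education Institute (湖北成人教育学院学报). 2012, No. 4
2. A Block-Based Information Extraction Algorithm for News Web Pages (基于分块的新闻网页信息抽取算法) [J]. Ji Xin, Zhong Cheng. Computer Applications and Software (计算机应用与软件). 2015, No. 4
3. Block-Based Extraction of Topic Text from Web Pages (基于分块的网页主题文本抽取) [J]. Ren Yu, Fan Yong, Zheng Jiaheng. Journal of Guangxi Normal University (Natural Science Edition) (广西师范大学学报（自然科学版）). 2009, No. 1
4. Information Extraction from Topic-Oriented Web Pages Based on Visual Features (基于视觉特征的主题型网页信息抽取) [J]. Hu Rui, Guo Xing, Huang Yongcong. Journal of Chifeng University (Natural Science Edition) (赤峰学院学报（自然科学版）). 2016, No. 6
5. Beautifying Our Web Pages: Instructional Design for "Filling in Web Page Content after Table Layout" (美化我们的网页--"表格布局之后的网页内容填充"教学设计) [J]. Bi Wenhui. China Information Technology Education (中国信息技术教育). 2004, No. 2
6. Discovering the Main Content Blocks of Web Pages Based on Layout and Linguistic Features (基于布局特征与语言特征的网页主要内容块发现) [C]. Han Xianpei, Liu Kang, Zhao Jun. 3rd National Conference on Information Retrieval and Content Security (第三届全国信息检索与内容安全学术会议). 2007
7. Research on Information Extraction Techniques for Topic-Oriented Web Pages (主题型网页的信息抽取技术研究) [A]. Ou Jie. 2011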
