Journal: 《情报学报》

Content Extraction from Topic-Oriented Web Pages under a Block-Based Layout

         

Abstract

This paper proposes a layout-based Web page extraction method for topic-oriented pages, aimed at page cleaning (noise removal) and content extraction. First, a tag tree is built by analyzing the DOM structure of the original page. The tree is then partitioned bottom-up into a set of blocks according to tag categories and the information attached to the corresponding nodes, and the blocks are classified by their text, link, and image densities. Next, a text feature vector for the page's topic is extracted and, using the VSM (Vector Space Model), the text similarity between each block's content and the page's topic is computed to give the block's topic relevance. This relevance serves as the quantitative criterion for deciding which blocks to discard as noise and which to keep: blocks with high relevance are retained and used to reconstruct the description of the page. The method has been applied in an information collection project for talent information, and test results indicate that it is effective for cleaning topic-oriented pages and extracting their content.
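The relevance-filtering step can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the plain term-frequency weighting, the `\w+` tokenizer, and the `threshold` value are all assumptions made for illustration, and Chinese text would first need a word segmenter in place of the regular expression. The sketch builds a bag-of-words vector for the page topic and for each block, then keeps the blocks whose cosine similarity to the topic reaches the threshold.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector from lowercased word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def keep_relevant_blocks(topic_text, blocks, threshold=0.2):
    """Return the blocks whose similarity to the topic is at least `threshold`.

    `threshold` is a hypothetical cut-off chosen for this example only.
    """
    topic = term_vector(topic_text)
    return [b for b in blocks if cosine_similarity(topic, term_vector(b)) >= threshold]


if __name__ == "__main__":
    topic = "talent recruitment news"
    blocks = [
        "Senior engineer hiring: talent recruitment news for engineers",
        "Home | Login | Register | Contact us",
    ]
    for block in keep_relevant_blocks(topic, blocks):
        print(block)
```

In this example the navigation-style block shares no terms with the topic and is dropped, while the block that repeats the topic terms is kept. A fuller pipeline would apply this filter only after the tag tree has been partitioned into blocks and the blocks classified by density.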
