Published in: 《情报学报》

Research on Web Page Structural Similarity Based on HTML Trees

         

Abstract

HTML web pages are a form of semi-structured data, and different pages tend to share structural features. This paper studies the similarity between blocks of web information from the standpoint of information structure. It proposes a structural similarity model based on optimal free matching of subtrees, together with a method for extracting web information by using structural similarity. All algorithms in the paper are implemented in Python. Experiments that compute and analyze the similarity between different web pages indicate that the tree-based structural similarity model is systematic and broadly applicable. Compared with traditional methods that rely on comparing plain text alone, the structural similarity method uses the relationships among elements within a page and between two pages, which makes web information extraction faster and more accurate.
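The abstract does not give the model's formulas, so the following is only a minimal sketch of one way a structural similarity based on optimal free matching of subtrees could look. It is not the authors' implementation. Everything in it is an assumption made for illustration: the names `Node`, `TreeBuilder`, `parse_html`, `best_matching`, and `similarity`, the scoring rule (nodes with different tags score 0, and matched child scores are normalized by the larger child count), and the reading of "free matching" as matching children in any order. The matching step uses a bitmask search that is exact but only practical for nodes with few children.

```python
from functools import lru_cache
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A tag-only tree node; text and attributes are ignored (assumption)."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of tags under a synthetic '#root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def best_matching(scores):
    """Maximum total score of a one-to-one matching between rows and columns."""
    if not scores or not scores[0]:
        return 0.0
    if len(scores) < len(scores[0]):
        # Transpose so the bitmask runs over the shorter side.
        scores = [list(col) for col in zip(*scores)]
    cols = len(scores[0])

    @lru_cache(maxsize=None)
    def solve(row, used):
        if row == len(scores):
            return 0.0
        best = solve(row + 1, used)  # leave this row unmatched
        for c in range(cols):
            if not used & (1 << c):
                best = max(best, scores[row][c] + solve(row + 1, used | (1 << c)))
        return best

    return solve(0, 0)


def similarity(a, b):
    """Structural similarity in [0, 1] under the assumed scoring rule."""
    if a.tag != b.tag:
        return 0.0
    if not a.children and not b.children:
        return 1.0
    scores = [[similarity(x, y) for y in b.children] for x in a.children]
    matched = best_matching(scores)
    return (1.0 + matched) / (1.0 + max(len(a.children), len(b.children)))


if __name__ == "__main__":
    page_a = parse_html("<div><p></p><span></span></div>")
    page_b = parse_html("<div><span></span><p></p></div>")
    print(similarity(page_a, page_b))
```

Because children are matched in any order, two blocks that contain the same subtrees in a different sequence score the same as identical blocks under this sketch. For pages with many children per node, the exponential matching step would need to be replaced by a polynomial assignment algorithm.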
