HTML web content is a form of semi-structured data, and different web pages share a certain degree of similarity in their structure. Taking the structure of the information as its starting point, this paper studies the similarity between blocks of information on different web pages. It proposes a model for measuring structural similarity based on an optimal free-matching rule over subtrees, together with a method for extracting web information by means of that structural similarity. All algorithms in the paper are implemented in Python. Experiments that compute and analyze the similarity between different web pages indicate that the subtree-matching model is systematic and broadly applicable. Compared with traditional methods, which rely on comparing plain text alone, the structural-similarity method exploits the relationships between elements within a page and between two pages, making web information extraction faster and more accurate.
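The abstract does not give the similarity formula, so the following is only a minimal Python sketch of one way to score two HTML tag trees by matching their subtrees without regard to order. Everything in it is an assumption made for illustration rather than the paper's model: the names `Node`, `TreeBuilder`, `parse`, `best_matching`, and `similarity` are hypothetical, and the scoring rule (one point for a matching tag plus the best one-to-one matching of child subtrees, normalized by the larger child count) is a stand-in for the optimal free-matching rule.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A tag-only tree node; text and attributes are ignored (assumption)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def best_matching(a_children, b_children):
    """Maximum total similarity over one-to-one, order-free child matchings.

    Bitmask dynamic programming over the shorter list; exponential in its
    length, so suitable only for small fan-out.
    """
    if len(a_children) < len(b_children):
        a_children, b_children = b_children, a_children
    best = {0: 0.0}  # bitmask of used b-children -> best score so far
    for a in a_children:
        scores = [similarity(a, b) for b in b_children]
        nxt = dict(best)  # option: leave this a-child unmatched
        for mask, total in best.items():
            for j, score in enumerate(scores):
                if not mask & (1 << j):
                    new_mask = mask | (1 << j)
                    if total + score > nxt.get(new_mask, -1.0):
                        nxt[new_mask] = total + score
        best = nxt
    return max(best.values())


def similarity(a, b):
    """Structural similarity in [0, 1]; 1.0 for identical tag structure."""
    if a.tag != b.tag:
        return 0.0
    matched = best_matching(a.children, b.children)
    return (1.0 + matched) / (1.0 + max(len(a.children), len(b.children)))
```

Calling `similarity(parse(page_a), parse(page_b))` compares two whole pages. Because both trees hang from the synthetic `#root` node, that root always counts as one matched node. For pages with many children per node, the exponential matching step could be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment`.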