Focused crawlers are an important component of vertical search engines. The page-content relevance algorithms used by traditional focused crawlers consider only term frequency and ignore where key terms appear in the page. Building on an analysis of focused crawlers based on Web content relevance, this paper proposes an improved relevance calculation that exploits the features of HTML tags. Experimental results show that the improved algorithm achieves an average crawling accuracy of 64.99%, an improvement of 15.37% over the original method (the English abstract in the source gives this figure as 13.37%).
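The idea of letting HTML tag position influence relevance can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract gives neither the formula nor the tag weights, so the values in `TAG_WEIGHTS`, the cosine measure, and the names `WeightedTermCounter` and `page_relevance` are all assumptions chosen for illustration. Each term occurrence is counted with the weight of the most important enclosing tag, and the resulting weighted vector is compared with a topic keyword vector.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: the abstract does not state the paper's values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}                       # content never counted
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag


class WeightedTermCounter(HTMLParser):
    """Accumulates term frequencies weighted by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # A term takes the largest weight among all tags that enclose it.
        weight = max(
            [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags],
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"\w+", data.lower()):
            self.weights[term] += weight


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def page_relevance(html, topic_terms):
    """Relevance of a page to a topic, using tag-weighted term frequencies."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    topic = {term.lower(): 1.0 for term in topic_terms}
    return cosine(parser.weights, topic)


if __name__ == "__main__":
    in_title = "<html><head><title>crawler</title></head><body><p>news today</p></body></html>"
    in_body = "<html><head><title>news</title></head><body><p>crawler today</p></body></html>"
    print(page_relevance(in_title, ["crawler"]), page_relevance(in_body, ["crawler"]))
```

Both sample pages contain the same three words once each, so a pure term-frequency measure would score them identically; under the assumed weights, the page with the topic term in its `<title>` scores higher.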