Published in: International Journal of Performability Engineering

An Improved Focused Web Crawler based on Hybrid Similarity


Abstract

A web crawler is an efficient means of automatically downloading data from the Internet. A focused web crawler is a special kind of web crawler that retrieves specific information from webpages and makes it available to users. The central problem for a focused web crawler is determining the similarity between the target webpages and the topic. This paper therefore proposes an improved focused web crawler algorithm whose similarity calculation draws on three aspects of a webpage: anchor text, content, and structure. The improved algorithm is called hybrid similarity. If the anchor text similarity exceeds the threshold, the target webpage is downloaded directly; otherwise, its similarity is analyzed using the TF-Gini feature weighting algorithm and an improved cosine similarity algorithm. The experimental results show that the hybrid similarity algorithm is more effective than the traditional algorithm, with precision increasing by nearly 10%.
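
The two-stage decision described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: plain term-frequency vectors and standard cosine similarity stand in for the TF-Gini weighting and the improved cosine similarity, the structure-based similarity is left out, and the function names, both threshold values, and the use of a second threshold in the content stage are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Standard cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_download(topic, anchor_text, page_text,
                    anchor_threshold=0.5, content_threshold=0.3):
    """Return (decision, stage) for one candidate link.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    topic_vec = Counter(tokenize(topic))

    # Stage 1: a sufficiently similar anchor text triggers a direct download.
    anchor_sim = cosine_similarity(topic_vec, Counter(tokenize(anchor_text)))
    if anchor_sim > anchor_threshold:
        return True, "anchor"

    # Stage 2 (assumed): fall back to content similarity against a second
    # threshold. Plain TF replaces TF-Gini weighting here.
    content_sim = cosine_similarity(topic_vec, Counter(tokenize(page_text)))
    return content_sim > content_threshold, "content"
```

Calling `should_download("web crawler", "focused web crawler", "")` is decided in the anchor stage, while a link with an uninformative anchor such as "click here" falls through to the content comparison.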
