An Improved Focused Web Crawler based on Hybrid Similarity

Songtao Shang; Huaiguang Wu; Jiangtao Ma

Abstract

A web crawler is an efficient tool for automatically downloading data from the Internet. A focused web crawler is a special kind of web crawler that retrieves specific information from webpages and makes it available to users. The central problem for a focused web crawler is measuring the similarity between the target webpages and the topics. This paper therefore proposes an improved focused web crawler algorithm whose similarity calculation draws on three aspects of a webpage: anchor text, content, and structure. The improved algorithm is called hybrid similarity. If the anchor text similarity exceeds the threshold, the target webpage is downloaded directly; otherwise, its similarity is analyzed using the TF-Gini feature weighting algorithm and the improved cosine similarity algorithm. The experimental results show that the hybrid similarity algorithm is more effective than the traditional algorithm, with precision increasing by nearly 10%.
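The two-stage decision described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so plain term frequency stands in for TF-Gini weighting, standard cosine similarity stands in for the improved cosine similarity, and the structure-based component is omitted. The function names and both threshold values are hypothetical.

```python
import math
import re
from collections import Counter


def term_freq(text):
    """Bag-of-words term frequencies (plain TF, a stand-in for TF-Gini weights)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Standard cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_relevant(topic, anchor_text, page_text,
                anchor_threshold=0.5, content_threshold=0.2):
    """Two-stage relevance check; both thresholds are hypothetical values."""
    topic_vec = term_freq(topic)
    # Stage 1: a sufficiently similar anchor text accepts the page outright.
    if cosine(topic_vec, term_freq(anchor_text)) > anchor_threshold:
        return True
    # Stage 2: otherwise compare the page content with the topic.
    return cosine(topic_vec, term_freq(page_text)) > content_threshold
```

For example, `is_relevant("focused web crawler", "focused web crawler tutorial", "")` is accepted at the anchor-text stage, while a link with the anchor text "click here" is judged on the page content alone.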


Bibliographic details

Source: International Journal of Performability Engineering, 2019, Issue 10, 12 pages
Authors: Songtao Shang; Huaiguang Wu; Jiangtao Ma
Affiliation: School of Computer and Communication Engineering, Zhengzhou University of Light Industry (listed for all three authors)
Original format: PDF
Language: English
Chinese Library Classification: Engineering design and surveying
Keywords: Focused web crawler; TF-Gini; Similarity; Hybrid similarity

Similar literature

1. An Improved Focused Web Crawler based on Hybrid Similarity [J]. Songtao Shang, Huaiguang Wu, Jiangtao Ma. International Journal of Performability Engineering. 2019, Issue 10.
2. An improved focused crawler based on Semantic Similarity Vector Space Model [J]. Du Yajun, Liu Wenjun, Lv Xianjing. Applied Soft Computing. 2015.
3. Research on model of network information extraction based on improved topic-focused Web crawler key technology [J]. Chen Mo, Yang Xiao-Ping. Technical Gazette. 2016, Issue 4.
4. Similarity Computation of Web Pages of Focused Crawler [C]. Yu Huo Ling, Bingwu Liu, Fang Yan. 2010 International Forum on Information Technology and Applications. 2010.
5. Web based content and hybrid teaching: Student perceptions of the effectiveness of using web based content and hyper-linked teaching units in teaching hybrid business and marketing post secondary classes [D]. Richardson, W. Tim G. 2007.
6. Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method [O]. Xiaomei Wu, Erli Pang, Kui Lin.
7. An Improved Focused Web Crawler based on Hybrid Similarity [O]. Shang Songtao, Wu Huaiguang, Ma Jiangtao. 2019.
