Research on a Topic-Specific Focused Crawler Based on Heritrix

朱敏; 罗省贤


Abstract

This work analyzes the component architecture of the open-source Heritrix crawler and addresses shortcomings of the Heritrix project. It designs topic-specific crawl logic, together with classes that directly crawl pages containing particular content, and it introduces the BKDRHash algorithm for URL hashing. Together these implement web page information search for a specific topic, improve the efficiency of data retrieval, and support multi-threaded page crawling. Finally, pages on a particular topic are analyzed and their content is crawled, and the HTMLParser tool converts the crawled page data into a specific format. The results can serve as a data source for topic-oriented information search systems and for data mining, and they lay the groundwork for further research.
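
The abstract names BKDRHash as the URL hashing algorithm but gives no code. The following is a minimal Python sketch of the commonly published BKDRHash (multiplier 131, 32-bit arithmetic, sign bit cleared), not the authors' implementation, which would presumably be written in Java inside Heritrix. The function `queue_index` and its parameter `num_queues` are hypothetical names that illustrate one possible use: spreading URLs across a fixed number of work queues so that several crawler threads can share the load.

```python
def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDRHash of a string, emulating 32-bit unsigned overflow."""
    h = 0
    for byte in text.encode("utf-8"):
        # Multiply-and-add step; the mask keeps the value within 32 bits.
        h = (h * seed + byte) & 0xFFFFFFFF
    # Clear the top bit so the result is a non-negative 31-bit integer.
    return h & 0x7FFFFFFF


def queue_index(url: str, num_queues: int) -> int:
    """Hypothetical helper: map a URL to one of num_queues work queues."""
    return bkdr_hash(url) % num_queues


if __name__ == "__main__":
    # Example URLs are placeholders, not taken from the paper.
    for url in ("http://example.com/a", "http://example.com/b"):
        print(url, bkdr_hash(url), queue_index(url, 8))
```

Because the hash is deterministic, the same URL always maps to the same queue, which also makes the hash value usable as a compact key for duplicate-URL checks (with the usual caveat that distinct URLs can collide).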

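Bibliographic details

Source
《计算机技术与发展》 (Computer Technology and Development) | 2012, No. 2 | pp. 65-68 | 4 pages
Authors
朱敏; 罗省贤
Affiliation
College of Information Science and Technology, Chengdu University of Technology, Chengdu, Sichuan 610059
Original format: PDF
Text language: Chinese
Chinese Library Classification: Computer software
Keywords
focused crawler; Heritrix; BKDRHash algorithm; HTMLParser; search engine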

