Fully Automatic Web Page Information Collection System

徐春凤; 王艳春; 翟宏宇

Abstract

With the rapid development of the internet, users expect more from search engines, web page content, and large-scale data processing, and selecting the most relevant information from the mass of data on the internet has become a new research hotspot. This paper extends Heritrix, an open-source, extensible web crawler written in Java, to capture the web pages that users need, and studies information collection technology in depth. By analyzing Heritrix's workflow, module division, and source code design, the authors extend Heritrix to crawl web pages that carry product information and use HtmlParser to parse the page content. Key product information is extracted effectively and stored in a database for retrieval.
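The parse-and-store stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses Heritrix and HtmlParser in Java, whereas this sketch substitutes Python's standard `html.parser` for HtmlParser and an in-memory SQLite table for the database, and it skips crawling entirely. The sample page, the CSS class names `product-name` and `product-price`, the table layout, and all function names are assumptions made for illustration.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical product page; the class names are assumptions, not from the paper.
SAMPLE_PAGE = """
<html><body>
  <h1 class="product-name">Example Phone</h1>
  <span class="product-price">1999.00</span>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of elements whose class attribute marks a product field."""

    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.current = None  # field currently being read, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # Map the element's class (if any) to a product field name.
        self.current = self.FIELDS.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


def extract_product(html):
    """Parse one page and return the extracted product fields as a dict."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.record


def store_product(conn, url, record):
    """Insert or update one product row keyed by the page URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product "
        "(url TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO product VALUES (?, ?, ?)",
        (url, record.get("name"), float(record["price"])),
    )
    conn.commit()


def find_by_name(conn, keyword):
    """Retrieve stored products whose name contains the keyword."""
    return conn.execute(
        "SELECT url, name, price FROM product WHERE name LIKE ?",
        (f"%{keyword}%",),
    ).fetchall()


conn = sqlite3.connect(":memory:")
record = extract_product(SAMPLE_PAGE)
store_product(conn, "http://example.com/item/1", record)
rows = find_by_name(conn, "Phone")
```

In a real deployment the HTML would come from the crawler's fetched pages rather than a constant, and the extraction rules would depend on the markup of the target site.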

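Bibliographic details

Source
Journal of Changchun University of Science and Technology (Natural Science Edition) | 2015, No. 2 | pp. 151-154 | 4 pages
Authors
徐春凤; 王艳春; 翟宏宇
Affiliation (all authors)
School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022
Original format PDF
Text language Chinese
Chinese Library Classification TP393.02
Keywords
Heritrix; HtmlParser; web crawler; information extraction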
