Fully Automatic Web Page Information Collection System

         

Abstract

With the rapid development of the internet, users place growing demands on search engines, web page content, and large-scale data processing. Selecting the information that best matches a user's needs from the mass of data on the internet has become a new research hotspot. This paper extends Heritrix, an open-source, extensible web crawler written in Java, to capture the web pages that users need, and studies information collection technology in depth. After analyzing Heritrix's workflow, module division, and source code design, the crawler is extended to fetch product-oriented web pages, and HtmlParser is used to parse the page content. Key product information is extracted effectively and stored in a database for retrieval.
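The extract-then-store step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it does not use Heritrix or HtmlParser, it substitutes plain `java.util.regex` patterns for the HTML parsing, and it uses an in-memory map in place of the database. The class name `ProductExtractorSketch`, the `product-title` and `price` markup, and the example URL are all hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: regex stands in for HtmlParser, a Map stands in for the database.
class ProductExtractorSketch {
    // Assumed (hypothetical) page markup: <h1 class="product-title">...</h1>
    private static final Pattern TITLE =
        Pattern.compile("<h1[^>]*class=\"product-title\"[^>]*>(.*?)</h1>", Pattern.DOTALL);
    // Assumed (hypothetical) page markup: <span class="price">19.90</span>
    private static final Pattern PRICE =
        Pattern.compile("<span[^>]*class=\"price\"[^>]*>\\s*([0-9]+(?:\\.[0-9]+)?)\\s*</span>");

    // In-memory stand-in for a product table keyed by page URL.
    private final Map<String, String[]> store = new LinkedHashMap<>();

    // Returns {name, price} if both fields are found in the page, otherwise empty.
    static Optional<String[]> extract(String html) {
        Matcher title = TITLE.matcher(html);
        Matcher price = PRICE.matcher(html);
        if (title.find() && price.find()) {
            return Optional.of(new String[] { title.group(1).trim(), price.group(1) });
        }
        return Optional.empty();
    }

    // Extracts the product fields from a fetched page and stores them; returns true on success.
    boolean process(String url, String html) {
        Optional<String[]> fields = extract(html);
        fields.ifPresent(f -> store.put(url, f));
        return fields.isPresent();
    }

    // Retrieves the stored {name, price} pair for a URL, or null if the page was not stored.
    String[] lookup(String url) {
        return store.get(url);
    }

    public static void main(String[] args) {
        ProductExtractorSketch sketch = new ProductExtractorSketch();
        String page = "<html><body><h1 class=\"product-title\"> Demo Kettle </h1>"
                    + "<span class=\"price\">19.90</span></body></html>";
        if (sketch.process("http://example.com/item/1", page)) {
            String[] row = sketch.lookup("http://example.com/item/1");
            System.out.println(row[0] + " @ " + row[1]);
        }
    }
}
```

In a real crawler, a proper HTML parser is more robust than regular expressions against markup variation, and pages without product fields (the empty case here) would simply be skipped.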
