With the rapid development of the internet, users are placing more demands on search engines, webpage content, and large-scale data processing. Selecting the information that best matches a user's needs from the mass of data on the internet has become a new research hotspot. This paper extends Heritrix, an open-source, extensible web crawler project written in Java, to capture the webpages that users need, and studies information collection technology in depth. The extensibility of Heritrix is used to implement user-directed crawling. Based on an analysis of Heritrix's workflow, module division, and source code design, Heritrix is extended to fetch product-oriented webpages, and HtmlParser is used to parse the page content. Key product information is extracted effectively and stored in a database for retrieval.
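The pipeline summarized above (take a fetched product page, parse it, extract the key fields, store them for retrieval) can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: it does not call Heritrix or HtmlParser, it uses regular expressions from the Java standard library as a stand-in for the parsing step, and an in-memory map as a stand-in for the database. The class name `ProductExtractionSketch`, the `Product` fields, and the page markup (`product-name`, `price`) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProductExtractionSketch {

    /** Key fields extracted from one product page (hypothetical schema). */
    public static final class Product {
        public final String url;
        public final String name;
        public final String price;

        Product(String url, String name, String price) {
            this.url = url;
            this.name = name;
            this.price = price;
        }
    }

    // Hypothetical markup; a real site needs its own patterns, and a real
    // HTML parser is more robust than regular expressions.
    private static final Pattern NAME =
        Pattern.compile("<h1 class=\"product-name\">(.*?)</h1>", Pattern.DOTALL);
    private static final Pattern PRICE =
        Pattern.compile("<span class=\"price\">(.*?)</span>", Pattern.DOTALL);

    // Stand-in for a database table, keyed by page URL.
    private final Map<String, Product> store = new LinkedHashMap<>();

    /** Returns the product fields if both name and price are found on the page. */
    public static Optional<Product> extract(String url, String html) {
        Matcher name = NAME.matcher(html);
        Matcher price = PRICE.matcher(html);
        if (name.find() && price.find()) {
            return Optional.of(
                new Product(url, name.group(1).trim(), price.group(1).trim()));
        }
        return Optional.empty();
    }

    /** Stores the product, replacing any earlier record for the same URL. */
    public void save(Product product) {
        store.put(product.url, product);
    }

    /** Case-insensitive substring search over stored product names. */
    public List<Product> searchByName(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Product> hits = new ArrayList<>();
        for (Product p : store.values()) {
            if (p.name.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"product-name\"> Kettle </h1>"
            + "<span class=\"price\">19.90</span></body></html>";
        ProductExtractionSketch sketch = new ProductExtractionSketch();
        extract("http://example.com/item/1", html).ifPresent(sketch::save);
        for (Product p : sketch.searchByName("kettle")) {
            System.out.println(p.url + " " + p.name + " " + p.price);
        }
    }
}
```

In a crawler extension of the kind the abstract describes, the `extract` step would presumably run inside the crawler's processing of each fetched page, and `save` would write to a real database rather than a map.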