Computer Applications and Software (《计算机应用与软件》)

Research on a Hadoop-Based System for Crawling and Storing Agricultural Product Price Data

         

Abstract

Many large agricultural markets and agricultural information e-commerce platforms publish the daily prices of agricultural products from different regions in real time. Because these data are updated quickly, large in volume, and varied in form, crawling, storing, and subsequently analyzing them is difficult. We therefore propose a Hadoop-based system for crawling and storing agricultural product price data. The system crawls pages with multiple threads, using the HttpClient framework combined with a thread pool. When crawling finishes, it runs an integrity check, filters out the web pages whose information is incomplete, and crawls those pages again until the information is complete. The crawled pages are parsed and cleaned with regular expressions, and the useful extracted data are stored as text files in HDFS (Hadoop Distributed File System). Data crawled afterwards are appended to the existing HDFS files. Experiments show that the write performance of HDFS keeps up with the continuously growing volume of crawled data: the fewer the replicas and the larger the data block, the better the write performance.
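The pipeline described in the abstract (thread-pool crawling, integrity check, re-crawling of incomplete pages, regular-expression extraction, append-style storage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it is written in Python rather than with HttpClient, the row pattern `ROW_RE` and the completeness signal `END_MARKER` are hypothetical because the abstract does not show the real page markup, the `fetch` callable is supplied by the caller, re-crawling is capped at `max_rounds` rather than repeated indefinitely, and a local text file opened in append mode stands in for an HDFS file.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row layout: product, market, price in one table row.
ROW_RE = re.compile(
    r"<tr><td>(?P<product>[^<]+)</td>"
    r"<td>(?P<market>[^<]+)</td>"
    r"<td>(?P<price>\d+(?:\.\d+)?)</td></tr>"
)
END_MARKER = "</html>"  # assumed signal that the page was fully downloaded


def is_complete(html):
    """Integrity check: page arrived whole and holds at least one price row."""
    return (
        html is not None
        and END_MARKER in html
        and ROW_RE.search(html) is not None
    )


def crawl_all(urls, fetch, workers=4, max_rounds=3):
    """Fetch urls with a thread pool and re-crawl incomplete pages.

    fetch(url) is a caller-supplied function returning the page text.
    Returns (pages, failed): complete pages by URL, and the URLs that
    were still incomplete after max_rounds attempts.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # treated as incomplete and retried next round

    pages, pending = {}, list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            results = list(pool.map(safe_fetch, pending))
            retry = []
            for url, html in zip(pending, results):
                if is_complete(html):
                    pages[url] = html
                else:
                    retry.append(url)
            pending = retry
    return pages, pending


def extract_rows(html):
    """Parse and clean one page into tab-separated text lines."""
    return [
        f"{m['product'].strip()}\t{m['market'].strip()}\t{m['price']}"
        for m in ROW_RE.finditer(html)
    ]


def append_rows(path, rows):
    """Append rows to a text file (local stand-in for an HDFS append)."""
    with open(path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(row + "\n")
```

A caller would pass its own HTTP function as `fetch`, run `extract_rows` on each returned page, and call `append_rows` once per crawl so that later crawls extend the same file. In a real deployment the last step would target an HDFS client instead of the local file system.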
