Research on a Hadoop-Based System for Crawling and Storing Agricultural Product Price Data

杨晓东; 郜鲁涛; 杨林楠; 刘建阳


Abstract

Many large farm-produce markets and agricultural information commerce platforms now publish daily prices for agricultural products from different regions in real time. Because the data are updated quickly, large in volume, and varied in form, crawling, storing, and subsequently analyzing them is difficult. We therefore propose a Hadoop-based system for crawling and storing agricultural product prices. The crawler uses the HttpClient framework with a thread pool to fetch pages in multiple threads. When a crawl finishes, an integrity check filters out the pages whose information is incomplete, and those pages are crawled again until the information is complete. The crawled pages are parsed and cleaned with regular expressions, and the useful data extracted from them are saved to HDFS (Hadoop Distributed File System) as text files; data crawled later are appended to the HDFS files. Experiments show that HDFS write performance can keep up with the steadily growing volume of crawled data, and that write performance improves as the number of replicas decreases and the block size increases.
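
The crawl, integrity check, re-crawl, and regular-expression cleaning steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses the Java HttpClient framework, whereas this sketch uses Python's standard thread pool and takes the page-fetching function as a parameter. The row layout matched by `ROW_RE`, the `END_MARKER` completeness rule, the `max_rounds` limit, and all function names are assumptions made for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row layout: <tr><td>product</td><td>market</td><td>price</td></tr>
ROW_RE = re.compile(
    r"<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>(\d+(?:\.\d+)?)</td>\s*</tr>"
)
END_MARKER = "</html>"  # assumed sign that a page was received in full


def is_complete(html):
    """Integrity check (assumed rule): the page is closed and has at least one row."""
    return (
        html is not None
        and END_MARKER in html
        and ROW_RE.search(html) is not None
    )


def crawl(urls, fetch, workers=4, max_rounds=3):
    """Fetch urls in a thread pool and re-crawl incomplete pages.

    fetch(url) must return the page text, or None on failure.
    Returns (complete pages by url, urls still incomplete after max_rounds).
    """
    pages, pending = {}, list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            results = list(pool.map(fetch, pending))
            retry = []
            for url, html in zip(pending, results):
                if is_complete(html):
                    pages[url] = html
                else:
                    retry.append(url)  # crawl again in the next round
            pending = retry
    return pages, pending


def extract_records(html):
    """Parse and clean price rows into tab-separated text lines."""
    return [
        "\t".join((name.strip(), market.strip(), price))
        for name, market, price in ROW_RE.findall(html)
    ]
```

In the setting the abstract describes, `fetch` would wrap an HTTP request and the lines returned by `extract_records` would be appended to a text file in HDFS; neither step is shown here. The bounded `max_rounds` is a safeguard added in this sketch, since the abstract only states that pages are re-crawled until the information is complete.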

Bibliographic details

Source
Computer Applications and Software (《计算机应用与软件》) | 2017, No. 3 | pp. 76-80 | 5 pages
Authors
杨晓东; 郜鲁涛; 杨林楠; 刘建阳;
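Author affiliations

College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming, Yunnan 650201;

College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming, Yunnan 650201;

College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming, Yunnan 650201;

Yunnan Information Technology Development Center, Kunming, Yunnan 650228;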

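Original format: PDF
Language of text: Chinese
Chinese Library Classification: Computer networks;
Keywords
Distributed system; Crawler; Hadoop; HDFS; Regular expression;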

