Conference: IEEE International Congress on Big Data

Enabling scientific data storage and processing on big-data systems

Abstract

Big-data systems are increasingly important for solving the data-driven problems in many science domains including geosciences. However, existing big-data systems cannot support the self-describing data formats such as NetCDF which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to directly store and process scientific data. Specifically, it enables Hadoop to efficiently store NetCDF data on HDFS and process them in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also presents an evaluation of the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves substantial speedup (up to 20 times) and space saving (83% reduction), compared to the traditional approach which has to convert NetCDF data to CSV format for Hadoop and Hive to use them.
