Enabling scientific data storage and processing on big-data systems


Abstract

Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems by science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to directly store and process scientific data. Specifically, it enables Hadoop to efficiently store NetCDF data on HDFS and process it in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also presents an evaluation of the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves substantial speedup (up to 20 times) and space saving (83% reduction) compared to the traditional approach, which must convert NetCDF data to CSV format before Hadoop and Hive can use it.
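The space cost of the CSV conversion mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and does not read or write real NetCDF files: the synthetic grid, the function names, and the packed binary layout (a header holding the three dimension sizes followed by 32-bit floats) are all assumptions made for illustration. It only shows the general effect that a flattened CSV repeats every coordinate on every row, while an array-oriented binary layout stores the dimensions once. The sizes it prints say nothing about the 83% figure reported in the paper.

```python
import csv
import io
import struct


def make_grid(n_time, n_lat, n_lon):
    """Build a small synthetic 3-D variable indexed as grid[t][i][j]."""
    return [[[280.0 + 0.01 * (t + i + j) for j in range(n_lon)]
             for i in range(n_lat)]
            for t in range(n_time)]


def to_csv_bytes(grid):
    """Flatten to one CSV row per cell; every row repeats its coordinates."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["time", "lat", "lon", "value"])
    for t, plane in enumerate(grid):
        for i, row in enumerate(plane):
            for j, value in enumerate(row):
                writer.writerow([t, i, j, f"{value:.2f}"])
    return buf.getvalue().encode("utf-8")


def to_packed_bytes(grid):
    """Hypothetical array layout: dimensions once, then contiguous float32 values."""
    n_time, n_lat, n_lon = len(grid), len(grid[0]), len(grid[0][0])
    header = struct.pack("<3I", n_time, n_lat, n_lon)
    flat = [v for plane in grid for row in plane for v in row]
    return header + struct.pack(f"<{len(flat)}f", *flat)


grid = make_grid(4, 10, 20)
csv_size = len(to_csv_bytes(grid))
packed_size = len(to_packed_bytes(grid))
print(f"CSV: {csv_size} bytes, packed binary: {packed_size} bytes")
```

In this toy layout the packed form needs 4 bytes per value plus a 12-byte header, whereas each CSV row carries three index fields, separators, and a line terminator in addition to the value text.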


Bibliographic record

Source
IEEE International Congress on Big Data | 2015 | pp. 1978-1984 | 7 pages
Authors
Biookaghazadeh Saman; Xu Yiqi; Zhou Shujia; Zhao Ming
Original format: PDF
Keywords
Hadoop; NetCDF; Scientific data; big data
