aHDFS: An Erasure-Coded Data Archival System for Hadoop Clusters

Yuanqi Chen; Yi Zhou; Shubbhi Taneja; Xiao Qin; Jianzhong Huang

Abstract

In this paper, we propose an erasure-coded data archival system called aHDFS for Hadoop clusters, in which erasure codes are used to archive data replicas in the Hadoop Distributed File System (HDFS). We develop two archival strategies in aHDFS, aHDFS-Grouping and aHDFS-Pipeline, to speed up the data archival process. aHDFS-Grouping, a MapReduce-based data archiving scheme, keeps each mapper’s intermediate output key-value pairs in a local key-value store. With the local store in place, aHDFS-Grouping merges all the intermediate key-value pairs that share a key into a single key-value pair, then shuffles that single pair to the reducers, which generate the final parity blocks. aHDFS-Pipeline forms a data archival pipeline from multiple data nodes in a Hadoop cluster. Each node delivers its merged single key-value pair to the next node’s local key-value store, and the last node in the pipeline is responsible for outputting the parity blocks. We implement aHDFS in a real-world Hadoop cluster. The experimental results show that aHDFS-Grouping and aHDFS-Pipeline speed up Baseline’s shuffle and reduce phases by factors of 10 and 5, respectively. When the block size is larger than 32 MB, aHDFS improves on the performance of HDFS-RAID and HDFS-EC by approximately 31.8 and 15.7 percent, respectively.
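
The two strategies described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the erasure code or the store, so the sketch assumes a plain XOR parity as a stand-in for the code, an in-memory dictionary as the "local key-value store", and a hypothetical stripe identifier as the key. The names `LocalStore`, `grouping`, and `pipeline` are invented for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks (assumed stand-in for the erasure code)."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


class LocalStore:
    """Hypothetical per-node key-value store that merges values on write."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes) -> None:
        # Merge into the existing entry so that each key always maps to
        # exactly one partial-parity value.
        if key in self._data:
            self._data[key] = xor_bytes(self._data[key], value)
        else:
            self._data[key] = value

    def items(self):
        return list(self._data.items())


def grouping(node_blocks):
    """Grouping-style flow: merge locally on each node, then shuffle one
    pair per key to a reducer that produces the parity blocks.

    Returns (parity_by_key, number_of_pairs_shuffled).
    """
    shuffled = []
    for blocks in node_blocks:
        store = LocalStore()
        for key, block in blocks:
            store.put(key, block)
        shuffled.extend(store.items())  # one pair per key per node

    reducer = LocalStore()
    for key, partial in shuffled:
        reducer.put(key, partial)
    return dict(reducer.items()), len(shuffled)


def pipeline(node_blocks):
    """Pipeline-style flow: each node merges what it received with its own
    blocks and forwards the result to the next node; the last node's
    store holds the parity blocks.
    """
    forwarded = []
    for blocks in node_blocks:
        store = LocalStore()
        for key, partial in forwarded:
            store.put(key, partial)
        for key, block in blocks:
            store.put(key, block)
        forwarded = store.items()
    return dict(forwarded)


if __name__ == "__main__":
    # Three nodes holding five 2-byte blocks across two stripes.
    nodes = [
        [("s0", b"\x01\x02"), ("s0", b"\x04\x08")],
        [("s0", b"\x10\x20")],
        [("s0", b"\x40\x80"), ("s1", b"\xff\x00")],
    ]
    parity, pairs_shuffled = grouping(nodes)
    print(parity, pairs_shuffled)
    print(pipeline(nodes))
```

In this toy run the five input blocks collapse to four shuffled pairs under the grouping flow, because the first node merges its two `s0` blocks before shuffling. The pipeline flow involves no shuffle to a separate reducer at all, since each node only forwards its merged pairs to its successor. Both flows yield the same parity, because XOR is associative and commutative.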

Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems, 2017, No. 11, pp. 3060-3073 (14 pages)
Authors
Yuanqi Chen; Yi Zhou; Shubbhi Taneja; Xiao Qin; Jianzhong Huang
Author affiliations

Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

Wuhan National Lab. for Optoelectronics, Huazhong University of Science and Technology (HUST), Wuhan, Hubei, China;

Original format: PDF
Language: English
Keywords
Mathematical model; Distributed databases; Redundancy; Encoding; Programming; Pipelines; Data models
