Journal: IEEE Transactions on Parallel and Distributed Systems

aHDFS: An Erasure-Coded Data Archival System for Hadoop Clusters

Abstract

In this paper, we propose aHDFS, an erasure-coded data archival system for Hadoop clusters, in which erasure codes are used to archive data replicas in the Hadoop Distributed File System (HDFS). We develop two archival strategies in aHDFS, aHDFS-Grouping and aHDFS-Pipeline, to speed up the data archival process. aHDFS-Grouping, a MapReduce-based data archiving scheme, keeps each mapper's intermediate output key-value pairs in a local key-value store. With the local store in place, aHDFS-Grouping merges all intermediate key-value pairs that share a key into a single key-value pair, then shuffles that single pair to the reducers, which generate the final parity blocks. aHDFS-Pipeline forms a data archival pipeline from multiple data nodes in a Hadoop cluster. Each node delivers its merged key-value pair to the local key-value store of the next node, and the last node in the pipeline outputs the parity blocks. We implement aHDFS in a real-world Hadoop cluster. The experimental results show that aHDFS-Grouping and aHDFS-Pipeline speed up Baseline's shuffle and reduce phases by factors of 10 and 5, respectively. When the block size is larger than 32 MB, aHDFS improves on the performance of HDFS-RAID and HDFS-EC by approximately 31.8 and 15.7 percent, respectively.
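
The two strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which erasure code aHDFS uses, so byte-wise XOR parity stands in for the encoding step, and every function name, the stripe-ID keys, and the in-memory dictionaries acting as "local key-value stores" are hypothetical choices made for illustration.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks (assumed stand-in for the real encoding)."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def merge_into(store: dict, key, block: bytes) -> None:
    """Fold one (key, block) pair into a store, one entry per key."""
    store[key] = xor_blocks(store[key], block) if key in store else block


def grouping_local_merge(pairs) -> dict:
    """Grouping idea: merge a mapper's intermediate pairs in a local store.

    Pairs that share a key collapse into a single pair, so only one pair
    per key has to be shuffled to the reducers.
    """
    local_store = {}
    for key, block in pairs:
        merge_into(local_store, key, block)
    return local_store


def reduce_parity(mapper_stores) -> dict:
    """Reducer: combine the merged pairs of all mappers into parity blocks."""
    parity = {}
    for store in mapper_stores:
        for key, block in store.items():
            merge_into(parity, key, block)
    return parity


def pipeline_parity(nodes) -> dict:
    """Pipeline idea: each node merges its own pairs with the store handed
    over by the previous node and forwards the result; the store held by
    the last node contains the parity blocks."""
    carried = {}
    for node_pairs in nodes:
        for key, block in grouping_local_merge(node_pairs).items():
            merge_into(carried, key, block)
    return carried
```

With XOR as the merge operation, both paths produce the same parity blocks for the same input, because XOR is associative and commutative; what differs is where the merging happens and how much data moves between nodes.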

Bibliographic details

  • Source
  • Author affiliations

    Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

    Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

    Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

    Department of Computer Science and Software Engineering, Shelby Center for Engineering Technology, Samuel Ginn College of Engineering, Auburn University, AL;

    Wuhan National Lab. for Optoelectronics, Huazhong University of Science and Technology (HUST), Wuhan, Hubei, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Mathematical model; Distributed databases; Redundancy; Encoding; Programming; Pipelines; Data models;
