Journal: IEEE Transactions on Parallel and Distributed Systems

The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems

Abstract

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, Gear-based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and, last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same or even higher deduplication ratio compared with the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open-source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.
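Three of the techniques named in the abstract (the Gear-based rolling hash, skipping sub-minimum cut-points, and normalizing the chunk-size distribution with a stricter and a looser mask) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Gear table seed, the size parameters, the choice of the top hash bits as masks, and the function names `make_gear_table`, `next_cut`, and `chunk` are all assumptions made for illustration, and the two-bytes-per-step optimization is omitted.

```python
import random

MASK64 = (1 << 64) - 1  # keep the rolling hash within 64 bits


def make_gear_table(seed=1):
    """256 pseudo-random 64-bit values, one per byte value (seed is an assumption)."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


GEAR = make_gear_table()


def top_bits_mask(n):
    """Mask selecting the n highest bits of a 64-bit hash (illustrative choice)."""
    return ((1 << n) - 1) << (64 - n)


def next_cut(data, start, min_size=2048, avg_size=8192, max_size=65536):
    """Return the exclusive end offset of the chunk that begins at `start`."""
    remaining = len(data) - start
    if remaining <= min_size:
        return len(data)  # too short to split further
    end = start + min(remaining, max_size)
    mid = start + min(remaining, avg_size)
    bits = avg_size.bit_length() - 1       # log2(avg_size) when it is a power of two
    mask_hard = top_bits_mask(bits + 2)    # stricter test before the average size
    mask_easy = top_bits_mask(bits - 2)    # looser test after the average size
    fp = 0
    # Cut-point skipping: bytes below min_size are never hashed or tested.
    for i in range(start + min_size, mid):
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64  # Gear rolling hash step
        if fp & mask_hard == 0:
            return i + 1
    for i in range(mid, end):
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if fp & mask_easy == 0:
            return i + 1
    return end  # forced cut at max_size or at end of input


def chunk(data, **sizes):
    """Split `data` (bytes) into a list of content-defined chunks."""
    out, pos = [], 0
    while pos < len(data):
        cut = next_cut(data, pos, **sizes)
        out.append(data[pos:cut])
        pos = cut
    return out
```

Calling `chunk(payload)` on a `bytes` object returns chunks that concatenate back to `payload`. Because each left shift pushes older bytes out of the 64-bit hash, a cut decision depends only on the most recent bytes of content, which is what lets chunk boundaries realign after an insertion earlier in the stream.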

Bibliographic record

  • Source
  • Author affiliations

    Harbin Inst Technol Shenzhen 518055 Peoples R China|Cyberspace Secur Res Ctr Peng Cheng Lab Shenzhen 518055 Peoples R China|Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

    Harbin Inst Technol Shenzhen 518055 Peoples R China|Cyberspace Secur Res Ctr Peng Cheng Lab Shenzhen 518055 Peoples R China;

    Univ Texas Arlington Dept Comp Sci & Engn Arlington TX 76019 USA;

    Huazhong Univ Sci & Technol Sch Comp Sci & Tech Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

    Harbin Inst Technol Shenzhen 518055 Peoples R China|Cyberspace Secur Res Ctr Peng Cheng Lab Shenzhen 518055 Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Tech Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Tech Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Tech Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Tech Wuhan Natl Lab Optoelect Wuhan 430074 Peoples R China;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Microsoft Windows; Gears; Power capacitors; Redundancy; Acceleration; Throughput; Distributed databases; Data deduplication; content-defined chunking; storage system; performance evaluation;
