Analyzing Data Properties Using Statistical Sampling Techniques: Illustrated on Scientific File Formats and Compression Features

Abstract

Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing the relevance of individual file formats allows the relevant formats to be optimized and also helps define procurement benchmarks that cover them. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a small set of data. Some of those studies claim that the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many such experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest both by file count and by occupied storage space. We demonstrate that, on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the costs of such studies. The contributions of this paper are: (1) a systematic investigation of the inherent analysis error when operating only on a subset of data, (2) a demonstration of methods that help future studies mitigate this error, and (3) an illustration of the approach in a study of scientific file types and compression for a data center.
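
To make the sampling idea in the abstract concrete, the following Python sketch estimates the share of a file type in two ways: by file count, using a uniform random sample, and by occupied storage space, using a sample drawn with probability proportional to file size. It is not the paper's implementation. The function names, the synthetic population, the size-proportional sampling scheme, and the normal-approximation margin of error are all assumptions made for this illustration.

```python
import math
import random
from collections import Counter

# Hypothetical file types and log-size parameters; NOT taken from the paper.
TYPE_WEIGHTS = {"netcdf": 0.2, "grib": 0.1, "other": 0.7}
TYPE_LOG_MEAN = {"netcdf": 16.0, "grib": 14.0, "other": 10.0}


def make_synthetic_population(num_files, rng):
    """Build a synthetic list of (file_type, size_in_bytes) tuples."""
    types = rng.choices(list(TYPE_WEIGHTS), weights=list(TYPE_WEIGHTS.values()), k=num_files)
    return [(t, rng.lognormvariate(TYPE_LOG_MEAN[t], 1.0)) for t in types]


def estimate_by_count(files, fraction, rng):
    """Estimate each type's share of the file count from a uniform sample."""
    n = max(1, round(len(files) * fraction))
    sample = rng.sample(files, n)  # uniform, without replacement
    counts = Counter(ftype for ftype, _size in sample)
    return {t: c / n for t, c in counts.items()}, n


def estimate_by_size(files, fraction, rng):
    """Estimate each type's share of occupied space.

    Files are drawn with replacement, with probability proportional to
    their size, so the share of a type among the draws estimates its
    share of the total data volume.
    """
    n = max(1, round(len(files) * fraction))
    sizes = [size for _ftype, size in files]
    sample = rng.choices(files, weights=sizes, k=n)
    counts = Counter(ftype for ftype, _size in sample)
    return {t: c / n for t, c in counts.items()}, n


def margin_of_error(p, n, z=1.96):
    """Normal-approximation half-width of a ~95% interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    rng = random.Random(42)
    files = make_synthetic_population(100_000, rng)
    by_count, n = estimate_by_count(files, 0.01, rng)
    by_size, m = estimate_by_size(files, 0.01, rng)
    for t in TYPE_WEIGHTS:
        pc, ps = by_count.get(t, 0.0), by_size.get(t, 0.0)
        print(f"{t}: count share {pc:.3f} +/- {margin_of_error(pc, n):.3f}, "
              f"size share {ps:.3f} +/- {margin_of_error(ps, m):.3f}")
```

The two estimators are kept separate because a type that is rare by file count can still dominate the occupied space. A uniform sample of files would mostly pick small files and estimate volume-based quantities poorly, which is why this sketch switches to size-proportional draws for the second estimate.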