Analyzing Data Properties Using Statistical Sampling Techniques: Illustrated on Scientific File Formats and Compression Features



Abstract

Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats, and it also helps a procurement define benchmarks that cover these formats. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many such experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest with respect to the number of files and to the occupied storage space. It will be demonstrated that on our production system, scanning 1 % of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the costs of such studies. The contributions of this paper are: (1) the systematic investigation of the inherent analysis error when operating only on a subset of data, (2) the demonstration of methods that help future studies to mitigate this error, (3) the illustration of the approach on a study of scientific file types and compression for a data center.
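
The sampling idea in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the synthetic population, the helper names (make_population, is_netcdf, estimate_by_count, estimate_by_volume), and the normal-approximation 95 % interval are all hypothetical choices made for this example. It estimates the share of a file type once by file count (simple random sample) and once by occupied storage space (sampling with probability proportional to file size).

```python
import math
import random


def make_population(n_files=100_000):
    """Build a synthetic (hypothetical) file list of (name, size_in_bytes)."""
    files = []
    for i in range(n_files):
        if i % 5 == 0:
            files.append((f"f{i}.nc", 1000))   # every 5th file: large "NetCDF" file
        else:
            files.append((f"f{i}.txt", 100))   # all others: small text file
    return files


def is_netcdf(entry):
    """Hypothetical property of interest: the file name ends in .nc."""
    name, _size = entry
    return name.endswith(".nc")


def _proportion_with_interval(hits, n):
    """Sample proportion and half-width of a normal-approximation 95% interval."""
    p = hits / n
    return p, 1.96 * math.sqrt(p * (1.0 - p) / n)


def estimate_by_count(files, predicate, n, rng):
    """Estimate the fraction of FILES satisfying predicate.

    Uses a simple random sample of n files drawn without replacement.
    """
    sample = rng.sample(files, n)
    hits = sum(1 for entry in sample if predicate(entry))
    return _proportion_with_interval(hits, n)


def estimate_by_volume(files, predicate, n, rng):
    """Estimate the fraction of STORAGE VOLUME held by files satisfying predicate.

    Draws n files with replacement, each with probability proportional to its
    size, so the share of hits in the sample estimates the share of bytes.
    """
    sizes = [size for _name, size in files]
    sample = rng.choices(files, weights=sizes, k=n)
    hits = sum(1 for entry in sample if predicate(entry))
    return _proportion_with_interval(hits, n)


rng = random.Random(42)          # fixed seed so the run is repeatable
population = make_population()
n = len(population) // 100       # scan 1 % of the files

p_count, hw_count = estimate_by_count(population, is_netcdf, n, rng)
p_volume, hw_volume = estimate_by_volume(population, is_netcdf, n, rng)
print(f"share by file count: {p_count:.3f} +/- {hw_count:.3f}")
print(f"share by volume:     {p_volume:.3f} +/- {hw_volume:.3f}")
```

In this synthetic population the true shares are 0.2 by file count and 20/28 (about 0.714) by volume, so the two estimates can be compared against known values. The gap between them shows why the two quantities need different sampling schemes.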


Bibliographic details

Source
International supercomputing conference international workshops; International Workshop on OpenPOWER for HPC; Workshop on performance scalability of storage systems; International Workshop on performance portable programming models for accelerators; Workshop on application performance on intel xeon phi - being prepared for KNL beyond; Workshop on HPC I/O in the data center; International Workshop on communication architectures at extreme scale; Workshop on exascale multi/many core computing systems; Workshop on virtualization in high-performance cloud computing | 2016 | pp. 130-141 | 12 pages
Conference location: Frankfurt (DE)
Author
Julian M. Kunkel
Author affiliation

Deutsches Klimarechenzentrum, Bundesstrasse 45a, 20146 Hamburg, Germany

Original format: PDF
Keywords
Scientific data; Compression; Analyzing data properties


Similar documents


1. Analyzing the Extracted File Metadata Evidences from Suspicious Nodes in DFXML format using Clustering Techniques [J] . Shruti B. Yagnik, Binod C. Agrawal. International Journal of Applied Engineering Research . 2018, Issue 23aPta1

2. An In-Class Experiment to Illustrate the Importance of Sampling Techniques and Statistical Analysis of Data to Quantitative Analysis Students [J] . JudithAnn R. Hartman, Daniel W. Bacon, Wayne C. Wolsey. Journal of Chemical Education . 2000, Issue 8

3. Core Scientific Dataset Model: A lightweight and portable model and file format for multi-dimensional scientific data [J] . Deepansh J. Srivastava, Thomas Vosegaard, Dominique Massiot. PLoS One . 2020, Issue 1

4. Analyzing Data Properties Using Statistical Sampling Techniques - Illustrated on Scientific File Formats and Compression Features [C] . Julian M. Kunkel ISC High Performance Conference . 2016

5. Data locality techniques in an active cluster file system designed for scientific workflows. [D] . Donnelly, Patrick Joseph. 2016

6. Dataset for file fragment classification of video file formats [O] . Narges Sadeghi, Mohadeseh Fahiminia, Mehdi Teimouri 2020

7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data-set. The mutation profile of this conserved sequence is illustrated, where '_' indicates this base is unchanged; DEL indicates this base is lost; INS X indicates a new base X is inserted in front of this base. (B) Several repeated elements patterns are listed. (C) In the first column, the top five DNA motifs, mined by meme-chip tools (Machanick & Bailey, 2011) are illustrated. The resemblant conserved sequences, found by the CFSP algorithm are listed in the second column. In the third column, the position-specific scoring matrices, which are transformed from mutational information are listed. The similarity between meme motif and resemblant conserved sequence with PSSM format was calculated via a stamp motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs is displayed in the fourth column. (D) One motif is selected in each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below. (E) There are additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER, Target: SREBF1) collected from https://www.encodeproject.org. The top two motifs are selected in each file using meme tools, and the corresponding motifs found by our algorithm are listed below. [O] . -1

