NetCube: A Scalable Tool for Fast Data Mining and Compression

Abstract

We propose a novel method of computing and storing DataCubes. Our idea is to use Bayesian networks, which can generate approximate counts for any query combination of attribute values and "don't cares." A Bayesian network represents the underlying joint probability distribution of the data that were used to generate it. By means of such a network, the proposed method, NetCube, exploits correlations among attributes. Our proposed preprocessing algorithm scales linearly with the size of the database, and it is also parallelizable with a straightforward parallel implementation. Moreover, we give an algorithm to estimate counts of arbitrary queries that is fast (constant in the database size). Experimental results show that NetCubes are fast to generate and use (a few minutes of preprocessing time per 100,000 records and less than a second of query time), achieve excellent compression (at least 1800:1 compression ratios on real data), and have low reconstruction error (less than 5% on average). Moreover, our method naturally allows for visualization and data mining at no extra cost.
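To make the counting idea in the abstract concrete, here is a minimal sketch of how a Bayesian network could answer a count query with "don't cares": multiply the conditional probabilities for the specified attributes, sum over every value of the unspecified ones, and scale by the number of records. This is not the authors' implementation. The attribute names `a`, `b`, `c`, the fixed network structure, the toy records, and the brute-force enumeration over "don't care" attributes are all assumptions made for illustration; the paper's own structure-learning and inference algorithms are not shown on this page.

```python
from collections import Counter
from itertools import product


def fit_cpts(records, parents):
    """Estimate P(attr | parents) by relative frequency (one pass per attribute)."""
    cpts = {}
    for attr, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[attr]) for r in records)
        marg = Counter(tuple(r[p] for p in pa) for r in records)
        cpts[attr] = {key: n / marg[key[0]] for key, n in joint.items()}
    return cpts


def joint_prob(assignment, parents, cpts):
    """Probability of one full assignment: product of the local conditionals."""
    prob = 1.0
    for attr, pa in parents.items():
        key = (tuple(assignment[p] for p in pa), assignment[attr])
        prob *= cpts[attr].get(key, 0.0)  # unseen combinations get probability 0
    return prob


def estimate_count(query, n_records, domains, parents, cpts):
    """Approximate count for a query; attributes missing from it are "don't cares"."""
    free = [a for a in domains if a not in query]
    total = 0.0
    # Brute-force marginalization: fine for a toy example, not for large networks.
    for values in product(*(domains[a] for a in free)):
        assignment = {**query, **dict(zip(free, values))}
        total += joint_prob(assignment, parents, cpts)
    return n_records * total


# Hypothetical toy data: 100 records over three binary attributes.
records = (
    [{"a": 1, "b": 1, "c": 0}] * 30
    + [{"a": 1, "b": 0, "c": 1}] * 10
    + [{"a": 0, "b": 1, "c": 1}] * 20
    + [{"a": 0, "b": 0, "c": 0}] * 40
)
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
# Assumed structure: b depends on a; c is modeled as independent of both.
parents = {"a": (), "b": ("a",), "c": ()}

cpts = fit_cpts(records, parents)
n = len(records)

est_ab = estimate_count({"a": 1, "b": 1}, n, domains, parents, cpts)
est_ac = estimate_count({"a": 1, "c": 1}, n, domains, parents, cpts)
true_ac = sum(1 for r in records if r["a"] == 1 and r["c"] == 1)
print(est_ab, est_ac, true_ac)
```

Once the conditional probability tables are fitted, a query never touches the records again, which is what makes the query time independent of the database size. In this toy setup the query on `a` and `b` is reproduced essentially exactly because the assumed structure captures their dependence, while the query on `a` and `c` is only approximate (about 12 against a true count of 10) because the sketch models `c` as independent.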

Bibliographic record

Source
Twenty-Seventh International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy | 2001 | pp. 311-320 | 10 pages
Conference location: Roma (IT)
Authors
Dimitris Margaritis; Christos Faloutsos; Sebastian Thrun
Affiliation

Computer Science Dept., Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Original format: PDF
Language: English
Chinese Library Classification: Automation technology; computer technology

Similar literature

1. MFCompress: a compression tool for FASTA and multi-FASTA data [J]. Pinho Armando J., Pratas Diogo. Bioinformatics, 2014, issue 1.
2. FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark [J]. Cheng Feng, Yang Zhe. Journal of supercomputing, 2019, issue 5.
3. NetCube: A Scalable Tool for Fast Data Mining and Compression [C]. Sebastian Thrun, Christos Faloutsos, Dimitris Margaritis. International conference on very large data bases, 2001.
4. Large Scale Archaeological Satellite Classification and Data Mining Tools [D]. Canham, Kelly. 2012.
5. MFCompress: a compression tool for FASTA and multi-FASTA data [O]. Armando J. Pinho, Diogo Pratas.
6. Bolt: Accelerated Data Mining with Fast Vector Compression [O]. Blalock, Davis W, Guttag, John V. 2017.
7. Tri-Plots: Scalable Tools for Multidimensional Data Mining [R]. Traina, A., Traina, C., Papadimitriou, S. 2001.
