IEEE Transactions on Big Data

Sparse Computation for Large-Scale Data Mining


Abstract

Leading machine learning techniques rely on inputs in the form of pairwise similarities between objects in the data set. The number of pairwise similarities grows quadratically in the size of the data set, which poses a challenge in terms of scalability. One way to achieve practical efficiency for similarity-based techniques is to sparsify the similarity matrix. However, existing sparsification approaches consider the complete similarity matrix and remove some of the non-zero entries. This requires quadratic time and storage and is thus intractable for large-scale data sets. We introduce here a method called sparse computation that generates a sparse similarity matrix containing only relevant similarities, without first computing all pairwise similarities. The relevant similarities are identified by projecting the data onto a low-dimensional space in which groups of objects that share the same grid neighborhood are deemed of potential high similarity, whereas pairs of objects that do not share a neighborhood are considered dissimilar and their similarities are not computed. The projection is performed efficiently even for massively large data sets. We apply sparse computation to the k-nearest neighbors algorithm (KNN), to the graph-based machine learning techniques of supervised normalized cut and K-supervised normalized cut (SNC and KSNC), and to support vector machines with radial basis function kernels (SVM), on real-world classification problems. Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in tuning and testing times, while having a minimal effect (and often none) on accuracy. The low-dimensional projection is of further use in massively large data sets, where the grid structure allows easy identification of groups of "almost identical" objects. Such groups of objects are then replaced by representatives, thus reducing the size of the matrix.
This approach is effective, as illustrated here for data sets comprising up to 8.5 million objects.
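The grid-neighborhood idea described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the choice of PCA for the low-dimensional projection, the grid resolution, and the Gaussian similarity function are all assumptions made here for concreteness.

```python
# Sketch of sparse computation: compute similarities only for pairs of
# objects that fall in the same or adjacent grid cells after projecting
# the data onto a low-dimensional space.
# Assumptions (not from the paper): PCA projection, a uniform grid with
# n_bins cells per dimension, and a Gaussian (RBF) similarity.
import numpy as np
from collections import defaultdict

def sparse_similarities(X, n_dims=2, n_bins=4, sigma=1.0):
    """Return {(i, j): similarity} (i < j) only for pairs that share a
    grid neighborhood in the projected space; all other pairs are
    treated as dissimilar and never computed."""
    Xc = X - X.mean(axis=0)
    # Low-dimensional projection: top principal components via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:n_dims].T
    # Assign each object to a cell of a uniform grid over the projection.
    lo, hi = P.min(axis=0), P.max(axis=0)
    cells = np.floor((P - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    cells = np.clip(cells, 0, n_bins - 1)
    buckets = defaultdict(list)
    for i, c in enumerate(map(tuple, cells)):
        buckets[c].append(i)
    # Candidate pairs: same cell or any of the adjacent cells (offsets
    # in {-1, 0, +1} per projected dimension).
    sims = {}
    offsets = [np.array(o) - 1 for o in np.ndindex(*(3,) * n_dims)]
    for cell, members in buckets.items():
        for off in offsets:
            nb = tuple(np.array(cell) + off)
            for i in members:
                for j in buckets.get(nb, []):
                    if i < j:
                        d2 = np.sum((X[i] - X[j]) ** 2)
                        sims[(i, j)] = np.exp(-d2 / (2 * sigma ** 2))
    return sims
```

Only the retained entries are ever evaluated, so for well-clustered data the number of similarity computations grows far more slowly than the quadratic cost of the full matrix; the sparse dictionary can then be fed to a similarity-based method such as KNN or a graph-cut technique.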
