IEEE Transactions on Big Data

Sparse Computation for Large-Scale Data Mining


Abstract

Leading machine learning techniques rely on inputs in the form of pairwise similarities between objects in the data set. The number of pairwise similarities grows quadratically in the size of the data set, which poses a challenge in terms of scalability. One way to achieve practical efficiency for similarity-based techniques is to sparsify the similarity matrix. However, existing sparsification approaches consider the complete similarity matrix and remove some of the non-zero entries. This requires quadratic time and storage and is thus intractable for large-scale data sets. We introduce here a method called sparse computation that generates a sparse similarity matrix containing only relevant similarities, without first computing all pairwise similarities. The relevant similarities are identified by projecting the data onto a low-dimensional space in which groups of objects that share the same grid neighborhood are deemed of potential high similarity, whereas pairs of objects that do not share a neighborhood are considered dissimilar and their similarities are not computed. The projection is performed efficiently even for massively large data sets. We apply sparse computation to the k-nearest neighbors algorithm (KNN), to the graph-based machine learning techniques of supervised normalized cut and K-supervised normalized cut (SNC and KSNC), and to support vector machines with radial basis function kernels (SVM), on real-world classification problems. Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in tuning and testing times, while having a minimal effect (and often none) on accuracy. The low-dimensional projection is of further use in massively large data sets, where the grid structure allows easy identification of groups of "almost identical" objects. Such groups of objects are then replaced by representatives, thus reducing the size of the matrix.
This approach is effective, as illustrated here for data sets comprising up to 8.5 million objects.
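The grid-neighborhood idea described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the choice of PCA for the low-dimensional projection, the grid resolution, and the Gaussian similarity function are all assumptions made here for concreteness.

```python
# Sketch of sparse computation: compute similarities only for pairs of
# objects that fall in the same or adjacent grid cells after projecting
# the data onto a low-dimensional space.
# Assumptions (not from the paper): PCA projection, a uniform grid with
# n_bins cells per dimension, and a Gaussian (RBF) similarity.
import numpy as np
from collections import defaultdict

def sparse_similarities(X, n_dims=2, n_bins=4, sigma=1.0):
    """Return {(i, j): similarity} (i < j) only for pairs that share a
    grid neighborhood in the projected space; all other pairs are
    treated as dissimilar and never computed."""
    Xc = X - X.mean(axis=0)
    # Low-dimensional projection: top principal components via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:n_dims].T
    # Assign each object to a cell of a uniform grid over the projection.
    lo, hi = P.min(axis=0), P.max(axis=0)
    cells = np.floor((P - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    cells = np.clip(cells, 0, n_bins - 1)
    buckets = defaultdict(list)
    for i, c in enumerate(map(tuple, cells)):
        buckets[c].append(i)
    # Candidate pairs: same cell or any of the adjacent cells (offsets
    # in {-1, 0, +1} per projected dimension).
    sims = {}
    offsets = [np.array(o) - 1 for o in np.ndindex(*(3,) * n_dims)]
    for cell, members in buckets.items():
        for off in offsets:
            nb = tuple(np.array(cell) + off)
            for i in members:
                for j in buckets.get(nb, []):
                    if i < j:
                        d2 = np.sum((X[i] - X[j]) ** 2)
                        sims[(i, j)] = np.exp(-d2 / (2 * sigma ** 2))
    return sims
```

Only the retained entries are ever evaluated, so for well-clustered data the number of similarity computations grows far more slowly than the quadratic cost of the full matrix; the sparse dictionary can then be fed to a similarity-based method such as KNN or a graph-cut technique.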
