Journal: Proceedings of the National Academy of Sciences of the United States of America

Spectral Methods In Machine Learning And New Strategies For Very Large Datasets


Abstract

Spectral methods are of fundamental importance in statistics and machine learning because they underlie algorithms ranging from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm in which the kernel induces a probability distribution on its set of partitions, whereas the second, based on sorting, selects a partition deterministically. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.
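To make the Nyström idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It approximates a positive-semidefinite matrix G from a subset of its columns as C W^+ C^T, where C holds the selected columns and W is the corresponding principal submatrix. The abstract does not state the sampling distribution or the sorting key, so the two selection rules here (uniform sampling, and sorting by diagonal entries) are illustrative assumptions, and the function names `nystrom`, `select_uniform`, `select_by_diagonal`, and `relative_error` are hypothetical.

```python
import numpy as np


def nystrom(G, idx):
    """Nystrom approximation of a PSD matrix G from the columns in idx.

    Returns C @ pinv(W) @ C.T, where C = G[:, idx] and W = G[idx, idx].
    """
    idx = np.asarray(idx)
    C = G[:, idx]                 # n x k block of selected columns
    W = G[np.ix_(idx, idx)]       # k x k principal submatrix
    return C @ np.linalg.pinv(W) @ C.T


def select_uniform(n, k, rng):
    """Randomized stand-in: k distinct indices drawn uniformly.

    Assumption for illustration only; the abstract describes a
    kernel-induced distribution that is not specified here.
    """
    return rng.choice(n, size=k, replace=False)


def select_by_diagonal(G, k):
    """Deterministic stand-in: indices of the k largest diagonal entries.

    Assumption for illustration only; the abstract says the second
    strategy is based on sorting but does not state the sort key.
    """
    return np.argsort(np.diag(G))[::-1][:k]


def relative_error(G, A):
    """Relative Frobenius-norm error of approximation A to G."""
    return np.linalg.norm(G - A, "fro") / np.linalg.norm(G, "fro")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))
    G = X @ X.T                   # PSD Gram matrix of rank at most 5
    for k in (2, 5):
        for name, idx in (
            ("uniform", select_uniform(G.shape[0], k, rng)),
            ("diagonal", select_by_diagonal(G, k)),
        ):
            err = relative_error(G, nystrom(G, idx))
            print(f"k={k} {name:8s} relative error = {err:.3e}")
```

The cost of this sketch is dominated by the pseudoinverse of the k x k block and the products with the n x k block, so it avoids the cubic cost in n of a full spectral decomposition when k is much smaller than n. In practice one would keep the factors C and W rather than forming the full n x n product.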
