Spectral Methods In Machine Learning And New Strategies For Very Large Datasets

Mohamed-Ali Belabbas; Patrick J. Wolfe

Abstract

Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach, based on sorting, provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.
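To make the Nyström idea concrete, the following is a minimal sketch, not the authors' implementation. It approximates a positive-semidefinite Gram matrix from a chosen subset of its columns. The abstract does not spell out the two selection rules, so the selectors below are stand-ins assumed purely for illustration: uniform random sampling for the randomized strategy and largest-diagonal sorting for the deterministic one. All function names (`nystrom_approximation`, `select_uniform`, `select_largest_diagonal`) and the synthetic data are hypothetical.

```python
import numpy as np


def nystrom_approximation(G, idx, rcond=1e-10):
    """Approximate a PSD matrix G from the columns listed in idx."""
    idx = np.asarray(idx)
    C = G[:, idx]                # n x k: selected columns of G
    W = G[np.ix_(idx, idx)]      # k x k: principal submatrix on the subset
    # The pseudoinverse handles a rank-deficient W; rcond drops tiny singular values.
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T


def select_uniform(n, k, rng):
    """Stand-in randomized rule (assumption): k distinct indices drawn uniformly."""
    return rng.choice(n, size=k, replace=False)


def select_largest_diagonal(G, k):
    """Stand-in deterministic rule (assumption): the k largest diagonal entries."""
    return np.argsort(np.diag(G))[::-1][:k]


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))    # synthetic data: 200 points in 5 dimensions
G = X @ X.T                          # linear-kernel Gram matrix, PSD with rank 5

k = 10
subsets = {
    "uniform": select_uniform(G.shape[0], k, rng),
    "largest_diagonal": select_largest_diagonal(G, k),
}

errors = {}
for name, idx in subsets.items():
    G_hat = nystrom_approximation(G, idx)
    errors[name] = np.linalg.norm(G - G_hat, "fro") / np.linalg.norm(G, "fro")
    print(f"{name}: relative Frobenius error = {errors[name]:.2e}")
```

Only a k x k matrix is inverted here, which is where the savings over a full spectral decomposition come from. In this toy case the Gram matrix has rank 5 and k = 10, so either subset should reproduce it up to rounding error. For kernels whose rank exceeds k, the choice of subset determines the approximation quality, and that is the problem the two strategies in the abstract address.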

Bibliographic Record

Source
Proceedings of the National Academy of Sciences of the United States of America | 2009, Issue 2 | pp. 369-374 | 6 pages
Authors
Mohamed-Ali Belabbas; Patrick J. Wolfe
Author affiliation

Department of Statistics, School of Engineering and Applied Sciences, Oxford Street, Harvard University, Cambridge, MA 02138;
Indexed in: Science Citation Index (SCI); MEDLINE; Chemical Abstracts (CA)
Original format: PDF
Language: English
Chinese Library Classification: (none listed)
Keywords
statistical data analysis; kernel methods; low-rank approximation;