Non-parametric detection of meaningless distances in high dimensional data

Ata Kaban


Abstract

Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high dimensional data processing, analysis, retrieval, and indexing, all of which rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite dimensional case and in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test and detect problematic cases more rigorously than is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.
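
The phenomenon described in the abstract can be illustrated with a small numerical sketch. The Python code below is not the paper's test; it is a minimal illustration under assumed conditions: i.i.d. uniform synthetic data, Euclidean distance, and a single random query point. The function names (`distance_stats`, `chebyshev_tail`, `contrast_by_dimension`) are hypothetical. It reports the relative contrast (D_max - D_min) / D_min and the sample relative variance Var[D] / E[D]^2 of the distances, then plugs the latter into the generic Chebyshev inequality P(|D - E[D]| >= eps * E[D]) <= Var[D] / (eps^2 * E[D]^2). Because sample estimates are substituted for the true moments, the last figure is a plug-in estimate rather than a rigorous bound.

```python
import numpy as np


def distance_stats(points, query):
    """Return (relative contrast, relative variance) of the Euclidean
    distances from `query` to each row of `points`."""
    d = np.linalg.norm(points - query, axis=1)
    contrast = (d.max() - d.min()) / d.min()
    rel_var = d.var() / d.mean() ** 2
    return contrast, rel_var


def chebyshev_tail(rel_var, eps):
    """Generic Chebyshev bound on P(|D - E[D]| >= eps * E[D]), capped at 1."""
    return min(1.0, rel_var / eps ** 2)


def contrast_by_dimension(dims, n_points=1000, seed=0):
    """Draw uniform synthetic data (an assumption made for illustration)
    in each dimension and collect the distance statistics."""
    rng = np.random.default_rng(seed)
    results = {}
    for dim in dims:
        points = rng.random((n_points, dim))
        query = rng.random(dim)
        results[dim] = distance_stats(points, query)
    return results


results = contrast_by_dimension([2, 10, 100, 1000])
for dim, (contrast, rel_var) in results.items():
    print(f"dim={dim:5d}  contrast={contrast:8.3f}  rel_var={rel_var:.5f}  "
          f"chebyshev(eps=0.1)={chebyshev_tail(rel_var, 0.1):.3f}")
```

For data of this kind, both the relative contrast and the relative variance should shrink as the dimension grows, which is the sense in which nearest and farthest neighbours become hard to tell apart.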


Bibliographic details

Source
Statistics and computing | 2012, Issue 2 | pp. 375-385 | 11 pages
Author
Ata Kaban
Author affiliation

School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK

Original format: PDF
Language: English
Keywords
high dimensional data; curse of dimensionality; distance concentration; nearest neighbour; Chebyshev bound; statistical test


Similar literature


1. Asymptotic distribution-free change-point detection based on interpoint distances for high-dimensional data [J]. Li Jun. Journal of nonparametric statistics. 2020, Issue 1a2
2. Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection [J]. Guansong Pang, Longbing Cao, Ling Chen. SIGKDD explorations. 2018, Issue Udisk
3. Non-parametric estimation of data dimensionality prior to data compression: the case of the human development index [J]. David Canning, Declan French, Michael Moore. Journal of applied statistics. 2013, Issue 9a10
4. An Unbiased Distance-Based Outlier Detection Approach for High-Dimensional Data [C]. Hoang Vu Nguyen, Vivekanand Gopalkrishnan, Ira Assent. International conference on database systems for advanced applications; DASFAA 2011. 2011
5. Novel Metrics and Theoretical Properties of Nearest-Neighbor Distance-Based Feature Selection in High-Dimensional Bioinformatics Data [D]. Dawkins, Bryan A. 2020
6. A robustness study of parametric and non-parametric tests in model-based multifactor dimensionality reduction for epistasis detection [O]. Jestinah M Mahachie John, François Van Lishout, Elena S Gusareva, 2013
7. Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection [O]. Guansong Pang, Longbing Cao, Ling Chen, 2018
