Journal: Statistics and Computing

Non-parametric detection of meaningless distances in high dimensional data

Abstract

Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high-dimensional data processing, analysis, retrieval, and indexing, all of which rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite-dimensional, so the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely for the possibly high but finite-dimensional case, in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test for and detect problematic cases more rigorously than is currently possible, and we demonstrate the approach on both synthetic data and ten real-world data sets from different domains.
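
The vanishing contrast described in the abstract can be observed empirically with a minimal sketch. This is not the paper's distribution-free bound; it only measures the relative contrast (d_max - d_min) / d_min of Euclidean distances from a random query to a random sample. The uniform distribution on the unit cube, the sample sizes, and the function names `relative_contrast` and `mean_contrast` are all assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(points, query):
    """Return (d_max - d_min) / d_min for Euclidean distances from query to points."""
    dists = [math.dist(query, p) for p in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def mean_contrast(dim, n_points=200, n_trials=20, seed=0):
    """Average relative contrast over several random queries in [0, 1]^dim.

    Assumption for illustration: coordinates are i.i.d. uniform on [0, 1].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
        query = [rng.random() for _ in range(dim)]
        total += relative_contrast(points, query)
    return total / n_trials


if __name__ == "__main__":
    # The printed contrast should shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(mean_contrast(dim), 3))
```

For i.i.d. coordinates such as these, the average contrast should drop as the dimension grows, which is the situation in which nearest-neighbour queries lose discriminative power. A sample-based check of this kind says nothing about tail probabilities, which is the gap the abstract's bounds are meant to address.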