Journal: Evolutionary Computation

Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis



Abstract

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
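As a rough illustration of the idea in the abstract, the sketch below uses a similarity function built from a small subset of features to cluster instances through a graph. It is not the authors' method: `example_similarity` is a hand-written stand-in for an evolved expression tree, and the nearest-neighbour graph with connected components is only an assumed, simplified form of graph-based clustering. All names and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence

Instance = Sequence[float]
Similarity = Callable[[Instance, Instance], float]


def example_similarity(a: Instance, b: Instance) -> float:
    """Hypothetical stand-in for an evolved tree (not taken from the paper).

    Feature selection: only features 0 and 2 are used; feature 1 is ignored.
    Feature construction: the differences are combined with + and max.
    Higher values mean "more similar".
    """
    d0 = abs(a[0] - b[0])
    d2 = abs(a[2] - b[2])
    return -(d0 + max(d0, d2))


def cluster_by_nearest_neighbour(data: List[Instance], sim: Similarity) -> List[int]:
    """Link each instance to its most similar neighbour, then label the
    connected components of the resulting graph as clusters."""
    n = len(data)
    parent = list(range(n))  # union-find forest over instance indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best = max((j for j in range(n) if j != i),
                   key=lambda j: sim(data[i], data[j]))
        parent[find(i)] = find(best)  # add edge i -- best

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy data: features 0 and 2 separate two groups; feature 1 is noise.
toy_data = [
    (0.0, 9.0, 0.1), (0.2, -5.0, 0.0), (0.1, 3.0, 0.2),
    (5.0, 8.0, 5.1), (5.2, -4.0, 5.0), (5.1, 2.0, 5.3),
]
toy_labels = cluster_by_nearest_neighbour(toy_data, example_similarity)
print(toy_labels)
```

In this toy setting the function ignores the noisy feature, so the two groups are recovered. In the approach described above, genetic programming would search for such functions automatically instead of relying on a hand-written one.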
