Journal: Expert Systems with Applications

Categorical data clustering: What similarity measure to recommend?

Abstract

Clustering categorical data requires choosing an adequate similarity measure. The literature offers several, ranging from measures based on simple matching to more complex ones based on entropy. This raises the question: is there a similarity measure whose characteristics offer more stability and also provide satisfactory results on databases with categorical variables? To answer it, this work compared nine similarity measures using the TaxMap clustering mechanism and evaluated the resulting clusterings with four quality measures: NCC, Entropy, Compactness, and Silhouette Index. Tests were performed on 15 databases of categorical data, of distinct sizes and contexts, extracted from public repositories. Analysis of the results by means of a pairwise ranking showed that Gower's coefficient, the simplest similarity measure presented in this work, obtained the best overall performance. It was considered the ideal measure because it provided satisfactory results for the databases considered.
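To illustrate why the abstract calls Gower's coefficient the simplest measure, here is a minimal Python sketch. It assumes purely categorical attributes with equal weights and no missing values, in which case the coefficient reduces to the fraction of attributes on which two records agree. The abstract does not give the paper's exact formulation, so the function name `gower_similarity` and the sample records are hypothetical.

```python
from typing import Sequence


def gower_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Gower similarity for two purely categorical records.

    Assumption: equal attribute weights and no missing values, so the
    score is simply (number of matching attributes) / (number of attributes).
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if len(a) == 0:
        raise ValueError("records must have at least one attribute")
    # Each attribute contributes 1 when the categories match, 0 otherwise.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical records with three categorical attributes each.
r1 = ("red", "round", "small")
r2 = ("red", "square", "small")

# Two of the three attributes agree.
print(gower_similarity(r1, r2))
```

A clustering method would compute this score for every pair of records and group records with high mutual similarity. The clustering step itself (TaxMap in the study) is not sketched here.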
