Categorical data clustering: What similarity measure to recommend?

Tiago R.L. dos Santos; Luis E. Zarate

Abstract

Clustering categorical data poses the challenge of choosing the most adequate similarity measure. The literature presents several such measures, ranging from those based on simple matching to more complex ones based on entropy. This raises the question: is there a similarity measure whose characteristics offer more stability and that also provides satisfactory results on databases involving categorical variables? To answer it, this work compared nine similarity measures using the TaxMap clustering mechanism and evaluated the resulting clusterings with four quality measures: NCC, Entropy, Compactness and Silhouette Index. Tests were performed on 15 databases of distinct sizes and contexts containing categorical data extracted from public repositories. Analysis of the test results by means of a pairwise ranking showed that the Gower coefficient, the simplest similarity measure presented in this work, obtained the best overall performance. It was considered the ideal measure since it provided satisfactory results for the databases considered.
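To make the idea of a similarity measure for categorical records concrete, here is a minimal Python sketch. It assumes purely categorical attributes, equal attribute weights and no missing values; under those assumptions the Gower coefficient reduces to the fraction of attributes on which two records agree (simple matching). The function names and the toy records are hypothetical, and the paper's exact formulation is not given on this page.

```python
from typing import Hashable, List, Sequence


def gower_categorical(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of attributes on which two records agree.

    Assumption: all attributes are categorical, equally weighted and
    non-missing, so the Gower coefficient equals simple matching.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if len(a) == 0:
        raise ValueError("records must have at least one attribute")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def similarity_matrix(records: Sequence[Sequence[Hashable]]) -> List[List[float]]:
    """Pairwise similarities, the usual input to a clustering procedure."""
    return [[gower_categorical(r, s) for s in records] for r in records]


# Hypothetical toy data: three records with three categorical attributes.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
sim = similarity_matrix(records)
```

In this toy data the first two records agree on two of three attributes, while the first and third agree on none, so `sim` ranks the first pair as the more similar one.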


Bibliographic record

Source: Expert Systems with Applications, 2015, Issue 3, pp. 1247-1260 (14 pages)
Authors: Tiago R.L. dos Santos; Luis E. Zarate
Author affiliation (both authors): Department of Computer Science, Pontifical Catholic University of Minas Gerais, Av. Dom Jose Gaspar 500, Coracao Eucaristico, Belo Horizonte, 30535-610 MG, Brazil
Original format: PDF
Language: English
Keywords: Categorical data; Clustering; Clustering criterion; Clustering goal; Similarity

Similar literature

1. Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering [J]. Sulc Zdenek, Rezankova Hana. Journal of Classification. 2019, Issue 1
2. Term Frequency Based Cosine Similarity Measure for Clustering Categorical Data using Hierarchical Algorithm [J]. S. Anitha Elavarasi, J. Akilandeswari. Research Journal of Applied Science, Engineering and Technology. 2015, Issue 7
3. Generalized similarity measure for categorical data clustering [C]. Shruti Sharma, Manoj Singh. International Conference on Advances in Computing, Communications and Informatics. 2016
4. Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement [D]. Foss, Andrew Philip Ogilvie. 2002
5. GO functional similarity clustering depends on similarity measure clustering method and annotation completeness [O]. Meng Liu, Paul D. Thomas. 2019
6. A Novel Similarity Measure for Clustering Categorical Data Sets [O]. Rishi Sayal. 2015
