Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters

Abstract

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning a specific dataset is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-Clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward's minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward's and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in identifying the correct number of clusters, identifying outliers, and determining the real cluster memberships. We applied CC to samples, in order to identify disease subtypes, and to gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.
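
The idea of combining Ward's minimum variance and Complete-linkage into a partial clustering can be illustrated with a minimal Python sketch. This is not the published CC procedure and not the API of the R package: the function name `consensus_partial_clustering`, its parameters `k_ward` and `k_complete`, and the overlap rule are assumptions made for illustration, and both cluster counts are fixed by hand here rather than estimated automatically. The sketch cuts both dendrograms with SciPy, keeps in each Ward cluster only the points that fall in the Complete-linkage cluster it overlaps most, and leaves every other point unassigned (label 0).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def consensus_partial_clustering(X, k_ward, k_complete):
    """Hypothetical helper: keep points on which two hierarchical cuts agree.

    Returns integer labels; 0 marks points left unassigned.
    Simplified illustration only, not the authors' algorithm.
    """
    # Cut a Ward dendrogram into k_ward clusters.
    ward = fcluster(linkage(X, method="ward"), t=k_ward, criterion="maxclust")
    # Cut a complete-linkage dendrogram into k_complete clusters.
    comp = fcluster(linkage(X, method="complete"), t=k_complete, criterion="maxclust")

    labels = np.zeros(len(X), dtype=int)  # 0 = unassigned
    for new_id, w in enumerate(np.unique(ward), start=1):
        in_w = ward == w
        # Complete-linkage cluster sharing the most points with this Ward cluster.
        ids, counts = np.unique(comp[in_w], return_counts=True)
        best = ids[np.argmax(counts)]
        # Only the intersection is kept; the rest of the Ward cluster stays 0.
        labels[in_w & (comp == best)] = new_id
    return labels


# Toy data (assumed for illustration): two tight groups and one distant point.
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
blob_b = rng.normal([10.0, 0.0], 0.3, size=(20, 2))
outlier = np.array([[5.0, 8.0]])
X = np.vstack([blob_a, blob_b, outlier])

labels = consensus_partial_clustering(X, k_ward=2, k_complete=3)
```

In this toy setup the Ward cut is forced to absorb the distant point into one of the two groups, while the Complete-linkage cut with one extra cluster isolates it, so the intersection leaves that point unassigned. The choice of cluster counts, which is made by hand in the sketch, is the part the abstract says CC estimates automatically.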