Robust methods for locating multiple dense regions in complex datasets.


Abstract

In classical clustering, each data point is assigned to at least one cluster. However, in many real-world problems, only a small subset of the data clusters well, while the rest shows little or no clustering tendency. For such situations, this thesis presents several techniques that cluster only a subset of the data into one or more groupings.

We first develop a very general parametric approach called Bregman Bubble Clustering that can find multiple dense regions and can scale to very large datasets. By combining a fast, iterative relocation-based approach with Pressurization, a novel concept for improving local search, Bregman Bubble Clustering extends density-based clustering to a much larger set of problems. We also develop a seeding algorithm that automatically determines the number of clusters and makes the results deterministic.

We then describe a more focused non-parametric alternative called Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast, hierarchical, density-based clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select among clusters of different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criterion. The Auto-HDS framework also provides a simple yet powerful 2-D visualization of the cluster hierarchy that is useful for further exploring the dense clusters in high-dimensional datasets. We also developed a robust, memory-efficient, platform-independent, open-source, Java-based implementation of Auto-HDS called Gene DIVER (Gene Density Interactive Visual Explorer) that provides interactive clustering capabilities for high-throughput biological datasets.

For problems where finding small dense regions is important, the parametric approach is applicable to a wide variety of scenarios and is scalable to very large datasets. On the other hand, Auto-HDS, the non-parametric approach, provides a powerful visualization, a compact clustering hierarchy, and interactive clustering: properties that are useful for biologists interested in finding and understanding small dense clusters of genes. Together, the two approaches greatly extend the scope of density-based clustering in three different dimensions: the diversity of problems to which density-based clustering can now be applied, the expanded capability to quickly understand and analyze the clusters in the data, and the scale of the problems that are now within reach of modest computing resources.
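
The idea of clustering only a subset of the data by iterative relocation can be illustrated with a small sketch. This is not the thesis's algorithm: it is a simplified, k-means-style loop that assumes squared Euclidean distance (one example of a Bregman divergence), a user-supplied number of points to cluster `s`, and user-supplied starting centers. It omits Pressurization and the seeding algorithm entirely. The function name `partial_kmeans` and all parameters are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_kmeans(points, centers, s, iters=20):
    """Cluster only the s points closest to their nearest center.

    Illustrative assumption, not the thesis's method: all other points
    stay unassigned and receive the label -1.
    """
    centers = [list(c) for c in centers]
    labels = [-1] * len(points)
    for _ in range(iters):
        # For every point, find its nearest center and the distance to it.
        nearest = []
        for i, p in enumerate(points):
            d, j = min((sq_dist(p, c), j) for j, c in enumerate(centers))
            nearest.append((d, i, j))
        nearest.sort()

        # Assign only the s closest points; the rest remain unassigned.
        new_labels = [-1] * len(points)
        for _, i, j in nearest[:s]:
            new_labels[i] = j

        # Relocate each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]

        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Two small dense groups plus scattered background points.
dense_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
dense_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
background = [(2.5, 9.0), (9.0, 1.0), (-4.0, 6.0)]
points = dense_a + dense_b + background

labels, centers = partial_kmeans(points, [(0.5, 0.5), (4.5, 4.5)], s=8)
print(labels)  # the three background points keep the label -1
```

In this toy setup, the eight dense points are assigned to two clusters and the three background points are left out, which is the behavior the abstract describes at a high level. How `s`, the number of clusters, and the starting centers should be chosen is exactly what the thesis's seeding and model selection methods address, and this sketch makes no attempt at that.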

Bibliographic details

  • Author

    Gupta, Gunjan Kumar

  • Author affiliation

    The University of Texas at Austin

  • Degree-granting institution: The University of Texas at Austin
  • Subject: Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 243 p.
  • Total pages: 243
  • Original format: PDF
  • Text language: English
