Robust methods for locating multiple dense regions in complex datasets.

Abstract

In classical clustering, each data point is assigned to at least one cluster. However, in many real-world problems, only a small subset of the data clusters well, while the rest shows little or no clustering tendency. For such situations, this thesis presents several techniques that cluster only a subset of the data into one or more groupings.

We first develop a very general parametric approach called Bregman Bubble Clustering that can find multiple dense regions and can scale to very large datasets. By combining a fast approach based on iterative relocation with Pressurization, a novel concept for improving local search, Bregman Bubble Clustering extends density-based clustering to a much larger set of problems. We also develop a seeding algorithm that can automatically determine the number of clusters and make the results deterministic.

We then describe a more focused non-parametric alternative called Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast, hierarchical, density-based clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select between clusters of different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criterion. The Auto-HDS framework also provides a simple yet powerful 2-D visualization of the hierarchy of clusters that is useful for further exploring the dense clusters in high-dimensional datasets. We also developed a robust, memory-efficient, platform-independent, open-source Java implementation of Auto-HDS called Gene DIVER (Gene Density Interactive Visual Explorer) that provides interactive clustering capabilities for high-throughput biological datasets.

For problems where finding small dense regions is important, the parametric approach is applicable to a wide variety of scenarios and is scalable to very large datasets. On the other hand, Auto-HDS, the non-parametric approach, provides a powerful visualization, a compact clustering hierarchy, and interactive clustering: properties that are useful for biologists interested in finding and understanding small dense clusters of genes. Together, the two approaches greatly extend the scope of density-based clustering in three different dimensions: the diversity of problems that density-based clustering can now be used with, the expanded capability to quickly understand and analyze the clusters in the data, and the scale of the problems that are now within reach of modest computing resources.
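
The idea of clustering only a subset of the data by iterative relocation can be illustrated with a small sketch. This is not the thesis's Bregman Bubble Clustering algorithm: it omits Pressurization and the seeding step, and it assumes squared Euclidean distance (one particular Bregman divergence) as the cost. The function name `partial_kmeans` and the parameter `s` (the number of points to cluster) are hypothetical and chosen only for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as the assumed divergence."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_kmeans(points, centers, s, iters=50):
    """Cluster only the s points that lie closest to their nearest center.

    Hypothetical sketch: each pass (1) finds the nearest center for every
    point, (2) keeps only the s cheapest assignments, and (3) moves each
    center to the mean of the points it kept. Returns (centers, labels),
    where labels[i] is a cluster index or None for an unclustered point.
    """
    centers = [list(c) for c in centers]
    labels = [None] * len(points)
    for _ in range(iters):
        # Nearest center and its cost for every point.
        nearest = []
        for i, p in enumerate(points):
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            nearest.append((sq_dist(p, centers[j]), i, j))
        # Keep only the s cheapest points; the rest stay unassigned.
        nearest.sort()
        new_labels = [None] * len(points)
        for _, i, j in nearest[:s]:
            new_labels[i] = j
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of the points it kept.
        for j in range(len(centers)):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Two tight pairs of points plus one far-away outlier; cluster only 4 of 5.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
centers, labels = partial_kmeans(points, [(0, 0), (10, 10)], s=4)
print(labels)   # the outlier at (50, 50) is left unassigned (None)
print(centers)
```

In this toy run the outlier never pulls a center toward itself, because it is excluded before the centers are recomputed. Choosing `s` and the initial centers is left to the caller here, whereas the abstract describes a seeding algorithm that determines the number of clusters automatically.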

Bibliographic record

Author: Gupta, Gunjan Kumar
Author affiliation: The University of Texas at Austin
Degree-granting institution: The University of Texas at Austin
Subject: Biology, Bioinformatics; Computer Science
Degree: Ph.D.
Year: 2006
Pages: 243 p.
Total pages: 243
Original format: PDF
Language: English
