Information-theoretic and hypothesis-based clustering in bioinformatics.

Abstract

Many machine learning problems in biology involve clustering data generated in complex or incompletely understood ways. Processes such as protein and viral evolution are difficult to model, involving complex mechanisms and constraints at multiple levels. This thesis presents a family of clustering algorithms, based on the Information Bottleneck method, that cluster such datasets by imposing constraints related to statistical tests of their known properties. The first algorithm clusters continuous data; we apply it to amino acid profiles to derive a compact discrete representation that preserves much of their information. This discretization yields an easily interpretable textual representation of amino acid profiles. It also greatly improves the speed of profile-profile alignment and makes it possible to index large profile databases. The second algorithm clusters discrete sequences while constraining mutual information between sequence positions within each cluster. We apply it to the problem of finding population substructure in viral and human SNP data, showing it to be competitive with or superior to current approaches.

Biological datasets often strain the limits of modern computers, and advances in biotechnology promise to generate even more data in the future as computational power increases. We therefore present a randomized clustering algorithm for discrete sequences that is similar to the previous algorithm but scalable to much larger datasets. This clustering algorithm relies on statistical tests to perform structure learning, an approach that has the added benefit of naturally limiting model complexity. We use this algorithm to produce detailed phylogenies of large DNA mobile element families. Our results provide a more detailed picture of their history and of their important role in genomic evolution.
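The quantity the second algorithm constrains, mutual information between sequence positions within a cluster, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `mutual_information` and the toy sequences are assumptions made for illustration, and the estimate is the plain empirical (plug-in) mutual information in bits.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Empirical mutual information (bits) between positions i and j.

    Hypothetical helper for illustration; seqs is a list of
    equal-length strings over a discrete alphabet.
    """
    n = len(seqs)
    joint = Counter((s[i], s[j]) for s in seqs)   # counts of symbol pairs
    count_i = Counter(s[i] for s in seqs)         # marginal counts at i
    count_j = Counter(s[j] for s in seqs)         # marginal counts at j
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi

# Toy data (assumed): positions 0 and 1 are fully dependent in `linked`
# and independent in `mixed`.
linked = ["AC", "AC", "GT", "GT"]
mixed = ["AC", "AT", "GC", "GT"]

print(mutual_information(linked, 0, 1))
print(mutual_information(mixed, 0, 1))
```

In this toy case the dependent pair carries one bit of mutual information and the independent pair carries none. A cluster whose positions show little residual mutual information is one in which the positions look approximately independent given the cluster label.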


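Bibliographic record

Author: O'Rourke, Sean Michael
Author affiliation: University of California, San Diego
Degree-granting institution: University of California, San Diego
Subjects: Biology, Bioinformatics; Computer Science; Artificial Intelligence
Degree: Ph.D.
Year: 2009
Pages: 118 p. (118 total)
Original format: PDF
Language of text: English
Chinese Library Classification: Artificial intelligence theory; Automation technology, computer technology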
