
Information-theoretic and hypothesis-based clustering in bioinformatics.


Abstract

Many machine learning problems in biology involve clustering data generated in complex or incompletely understood ways. Processes such as protein and viral evolution are difficult to model, involving complex mechanisms and constraints at multiple levels. This thesis presents a family of clustering algorithms, based on the Information Bottleneck method, to cluster such datasets by imposing constraints related to statistical tests of their known properties. The first algorithm clusters continuous data; we apply it to amino acid profiles to derive a compact discrete representation that preserves much of their information. This discretization yields an easily interpretable textual representation of amino acid profiles. It also greatly improves the speed of profile-profile alignment, and makes it possible to index large profile databases. The second algorithm clusters discrete sequences while constraining mutual information between sequence positions within each cluster. We apply it to the problem of finding population substructure in viral and human SNP data, showing it to be competitive with or superior to current approaches.

Biological datasets often strain the limits of modern computers, and advances in biotechnology promise to generate even more data in the future as computational power increases. We therefore present a randomized clustering algorithm for discrete sequences that is similar to the previous algorithm but scalable to much larger datasets. This clustering algorithm relies on statistical tests to perform structure learning, an approach that has the added benefit of naturally limiting model complexity. We use this algorithm to produce detailed phylogenies of large DNA mobile element families. Our results provide a more detailed picture of their history, and their important role in genomic evolution.
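The abstract describes the second algorithm as constraining mutual information between sequence positions within each cluster. As a rough illustration of that quantity only, the sketch below estimates the empirical mutual information between two positions of a set of aligned discrete sequences. It is not the thesis's algorithm; the function name `mutual_information` and the toy sequences are assumptions introduced here for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Empirical mutual information (in bits) between two aligned columns.

    Hypothetical helper for illustration; not taken from the thesis.
    Each column holds one symbol per sequence, in the same sequence order.
    """
    if len(column_a) != len(column_b):
        raise ValueError("columns must have the same length")
    n = len(column_a)
    if n == 0:
        raise ValueError("columns must be non-empty")

    joint = Counter(zip(column_a, column_b))  # counts of symbol pairs
    marg_a = Counter(column_a)                # counts at the first position
    marg_b = Counter(column_b)                # counts at the second position

    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written in terms of counts
        mi += (count / n) * log2(count * n / (marg_a[a] * marg_b[b]))
    return mi


# Toy cluster (assumed data): position 0 fully determines position 1.
linked = ["AC", "AC", "GT", "GT"]
mi_linked = mutual_information([s[0] for s in linked], [s[1] for s in linked])

# Toy cluster (assumed data): the two positions vary independently.
independent = ["AC", "AT", "GC", "GT"]
mi_independent = mutual_information(
    [s[0] for s in independent], [s[1] for s in independent]
)
```

In the first toy cluster the two positions are perfectly dependent, so the estimate is high; in the second they are independent, so it is zero. A cluster whose positions show little residual dependence is the kind of property such a constraint would favor, under the assumptions stated above.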

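Bibliographic record

  • Author

    O'Rourke, Sean Michael

  • Author affiliation

    University of California, San Diego

  • Degree-granting institution: University of California, San Diego
  • Subject: Biology Bioinformatics; Computer Science; Artificial Intelligence
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 118 p.
  • Total pages: 118
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Artificial intelligence theory; automation technology, computer technology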
