Journal: BMC Genomics

Scaling statistical multiple sequence alignment to large datasets



Abstract

Background: Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.

Results: We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.

Conclusions: Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.
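
The abstract does not say which accuracy measure is used. As one illustration of scoring an estimated alignment against a simulated ground truth, here is a minimal sketch that assumes a sum-of-pairs style score: the fraction of residue pairs aligned in the true alignment that are also aligned in the estimate. The function names `homologous_pairs` and `sp_score` and the dictionary input format are hypothetical, chosen for this example only.

```python
from itertools import combinations


def homologous_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    `alignment` maps sequence name -> gapped string ("-" marks a gap);
    all strings must have the same length. A residue is identified by
    (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    counters = {name: 0 for name in names}  # residues seen so far per sequence
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        # Every two residues sharing a column count as one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(estimated, truth):
    """Fraction of true aligned pairs recovered by the estimated alignment."""
    true_pairs = homologous_pairs(truth)
    if not true_pairs:
        return 1.0  # nothing to recover (assumed convention)
    return len(true_pairs & homologous_pairs(estimated)) / len(true_pairs)


if __name__ == "__main__":
    truth = {"a": "AC-G", "b": "ACTG"}
    estimated = {"a": "ACG-", "b": "ACTG"}
    print(sp_score(estimated, truth))
```

In this toy case the estimate misplaces the final residue of sequence `a`, so it recovers two of the three true pairs. Because both alignments are traversed with the sequence names in the same sorted order, the pair tuples are directly comparable across alignments.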
