Journal: BMC Bioinformatics

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map


Abstract

Background: Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this step, it is essential to better characterize the errors that state-of-the-art aligners typically make. For this purpose, we here introduce two tools: the complete-likelihood score and the position-shift map.

Results: The logarithm of the total probability of an MSA under a stochastic model of sequence evolution along a time axis via substitutions, insertions and deletions (called the “complete-likelihood score” here) can serve as an ideal score of the MSA. A position-shift map, which maps the difference in each residue’s position between two MSAs onto one of them, can clearly visualize where and how MSA errors occurred and help disentangle composite errors. To characterize MSA errors using these tools, we constructed three sets of simulated MSAs of selectively neutral mammalian DNA sequences, with small, moderate and large divergences, under a stochastic evolutionary model with an empirically common power-law insertion/deletion length distribution. Then, we reconstructed MSAs using MAFFT and Prank as representative state-of-the-art single-optimum-search aligners. About 40–99% of the hundreds of thousands of gapped segments were involved in alignment errors. In a substantial fraction of the erroneously reconstructed segments, from about 1/4 to over 3/4, the MSAs reconstructed by each aligner showed complete-likelihood scores not lower than those of the true MSAs. Of the remaining errors, a majority of those made by an iterative option of MAFFT showed discrepancies between the aligner-specific score and the complete-likelihood score, and a majority of those made by Prank seemed to be due to inadequate exploration of the MSA space. Analyses by position-shift maps indicated that the true MSAs lie in considerable neighborhoods of the reconstructed MSAs in about 80–99% of the erroneous segments for small and moderate divergences, but in only a minority for large divergences.

Conclusions: The results of this study suggest that measures to further improve the accuracy of reconstructed MSAs would differ substantially depending on the type of aligner. They also re-emphasize the importance of obtaining a probability distribution of fairly likely MSAs, instead of just searching for a single optimum MSA.
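
As an illustration of the position-shift map described in the abstract, the following minimal Python sketch computes, for every residue in one MSA, how far its column has moved in a second MSA of the same sequences. It is not the authors' implementation. The function names, the use of "-" as the gap character, and the sign convention (column in the second MSA minus column in the first) are assumptions made only for this example.

```python
from typing import List, Optional

GAP = "-"  # assumed gap character


def residue_columns(row: str) -> List[int]:
    """Column index occupied by each residue (non-gap character) of an aligned row."""
    return [col for col, ch in enumerate(row) if ch != GAP]


def position_shift_map(msa_a: List[str], msa_b: List[str]) -> List[List[Optional[int]]]:
    """Map onto msa_a the column shift of each residue between msa_a and msa_b.

    Each entry is (column in msa_b) - (column in msa_a) for the residue at that
    cell of msa_a, or None where msa_a has a gap. Rows must be in the same
    order and contain the same ungapped sequences in both alignments.
    """
    if len(msa_a) != len(msa_b):
        raise ValueError("the two MSAs must contain the same number of sequences")
    shift_map: List[List[Optional[int]]] = []
    for row_a, row_b in zip(msa_a, msa_b):
        if row_a.replace(GAP, "") != row_b.replace(GAP, ""):
            raise ValueError("rows must align the same underlying sequence")
        cols_b = residue_columns(row_b)
        shifts: List[Optional[int]] = []
        k = 0  # index of the current residue within the ungapped sequence
        for col_a, ch in enumerate(row_a):
            if ch == GAP:
                shifts.append(None)
            else:
                shifts.append(cols_b[k] - col_a)
                k += 1
        shift_map.append(shifts)
    return shift_map


# Toy example: the second sequence's gap is placed one column apart in the two MSAs.
reconstructed = ["ACGTA", "AC-TA"]
true_msa = ["ACGTA", "ACT-A"]
for row, shifts in zip(reconstructed, position_shift_map(reconstructed, true_msa)):
    print(row, shifts)
```

Cells with a nonzero shift mark the residues that are placed differently in the two alignments, which is how such a map can localize an error. Runs of equal shifts in neighboring cells suggest a block that moved as a unit.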
