PLoS Genetics

SPA: A Probabilistic Algorithm for Spliced Alignment


Abstract

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5′ and 3′ ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. 
Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personalimwegen/cgi-bin/spa.cgi.
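The trade-off described in the abstract, between the prior probability of a gene structure and the likelihood of the sequencing errors it implies, can be illustrated with a small sketch. This is not the SPA implementation: every function name and every numeric value below is a hypothetical placeholder, the splice-site priors are not normalized, and the intron and exon length priors mentioned in the abstract are left out for brevity. SPA is described as estimating its parameters from a set of cDNAs, not using fixed constants like these.

```python
import math

# Hypothetical, unnormalized placeholder values; NOT parameters estimated by SPA.
SPLICE_SITE_LOG_PRIOR = {
    ("GT", "AG"): math.log(0.98),   # canonical donor/acceptor pair
    ("GC", "AG"): math.log(0.015),  # a known rarer pair
}
OTHER_SITE_LOG_PRIOR = math.log(0.0005)  # any other boundary pair
MISMATCH_RATE = 0.005       # per-base misincorporation rate (assumed)
INDEL_OPEN_RATE = 0.001     # probability of starting an insertion/deletion (assumed)
INDEL_EXTEND_RATE = 0.3     # probability of extending it by one base (assumed)


def log_prior(introns):
    """Log prior of a gene structure from its (donor, acceptor) dinucleotide pairs."""
    return sum(
        SPLICE_SITE_LOG_PRIOR.get(site, OTHER_SITE_LOG_PRIOR) for site in introns
    )


def log_likelihood(matches, mismatches, indel_lengths):
    """Log likelihood of the sequencing differences implied by an alignment."""
    ll = matches * math.log(1 - MISMATCH_RATE)
    ll += mismatches * math.log(MISMATCH_RATE)
    for length in indel_lengths:
        # Geometric length model: one opening event plus (length - 1) extensions.
        ll += math.log(INDEL_OPEN_RATE) + (length - 1) * math.log(INDEL_EXTEND_RATE)
    return ll


def log_posterior(candidate):
    """Unnormalized log posterior: log prior plus log likelihood."""
    return log_prior(candidate["introns"]) + log_likelihood(
        candidate["matches"], candidate["mismatches"], candidate["indel_lengths"]
    )


# Three candidate alignments of the same 1,000-base cDNA around one intron.
canonical_one_error = {
    "introns": [("GT", "AG")], "matches": 999, "mismatches": 1, "indel_lengths": [],
}
canonical_two_errors = {
    "introns": [("GT", "AG")], "matches": 998, "mismatches": 2, "indel_lengths": [],
}
noncanonical_no_error = {
    "introns": [("CT", "AC")], "matches": 1000, "mismatches": 0, "indel_lengths": [],
}

best = max(
    [canonical_one_error, noncanonical_no_error], key=log_posterior
)
```

With these placeholder values, one mismatch is a cheaper explanation than a non-canonical splice boundary, so the canonical alignment scores higher; with two mismatches the balance tips the other way. An ad hoc scoring scheme has no principled way to set that crossover point, whereas in a probabilistic model it follows from the estimated error rates and splice-site frequencies.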
