Journal: Journal of Biomolecular Structure and Dynamics

Filtering redundancies for sequence similarity search programs.



Abstract

Most biologists now use database scanning programs such as BLAST and FASTA for the post-genomic processing of DNA or protein sequence information, in particular to retrieve the structure or function of uncharacterized proteins. Unfortunately, their results can be polluted by identical alignments (called redundancies) that come from the same protein or DNA sequence being present in different entries of the database. This makes it difficult to use the listed alignments efficiently. Pretreatment of databases has been proposed to suppress strictly identical entries. However, many identical alignments still remain, because redundancies can occur locally: for entries that correspond to different fragments of the same sequence, or for entries that correspond to highly homologous sequences differing in only a few residues, such as orthologous proteins. In the present work, we show that redundant alignments can indeed be numerous even with a pretreated non-redundant data bank, reaching as much as 60% of the output results depending on the query and the bank. The accuracy and efficiency of post-genomic work will therefore be greatly increased if these redundancies are removed. To solve this previously unaddressed problem, we have developed an algorithm that allows the efficient and safe suppression of all redundancies with no loss of information. This algorithm is based on various filtering steps that we describe here in the context of the Automat similarity search program, and such an algorithm should also be added to other similarity search programs (BLAST, FASTA, etc.).
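The abstract does not spell out the filtering steps, so the following is only a minimal sketch of the general idea, not the authors' algorithm or the Automat implementation. It assumes a redundancy means two hits whose aligned query region and aligned subject segment are strictly identical, and it keeps one representative per group while recording the other entry identifiers so that no information is lost. The `Alignment` and `AlignmentGroup` types, their fields, and the function name are hypothetical and introduced here for illustration.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alignment:
    """One hit from a similarity search (hypothetical record layout)."""
    entry_id: str      # database entry the hit came from
    query_start: int   # start of the aligned region in the query
    query_end: int     # end of the aligned region in the query
    query_aln: str     # aligned query segment, gaps included
    subject_aln: str   # aligned subject segment, gaps included
    score: float


@dataclass
class AlignmentGroup:
    """A representative alignment plus the entries that repeat it."""
    representative: Alignment
    other_entries: list = field(default_factory=list)


def collapse_identical_alignments(alignments):
    """Group alignments that are identical apart from their source entry.

    Assumption: "identical" means same query coordinates and same aligned
    strings. The first hit seen becomes the representative; later duplicates
    are reduced to their entry IDs, so every source entry stays traceable.
    """
    groups = OrderedDict()
    for aln in alignments:
        key = (aln.query_start, aln.query_end, aln.query_aln, aln.subject_aln)
        if key in groups:
            groups[key].other_entries.append(aln.entry_id)
        else:
            groups[key] = AlignmentGroup(representative=aln)
    return list(groups.values())


# Illustrative input: ENTRY_A and ENTRY_B yield the same local alignment,
# ENTRY_C differs by one residue and is therefore kept as a separate group.
hits = [
    Alignment("ENTRY_A", 10, 15, "MKTAYI", "MKTAYI", 32.0),
    Alignment("ENTRY_B", 10, 15, "MKTAYI", "MKTAYI", 32.0),
    Alignment("ENTRY_C", 10, 15, "MKTAYI", "MKSAYI", 27.0),
]
groups = collapse_identical_alignments(hits)
for group in groups:
    rep = group.representative
    print(rep.entry_id, rep.subject_aln, "also in:", group.other_entries)
```

Keying on the aligned strings rather than on whole database entries is what lets this sketch catch local redundancies, such as fragments of one sequence stored under different entries, which pretreatment that removes only strictly identical entries would miss.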
