Word Decoding of Protein Amino Acid Sequences with Availability Analysis: A Linguistic Approach

Abstract

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information relates to structure and function remains enigmatic. In this study, we show that at least part of the sequence information can be extracted by treating the amino acid sequences of proteins as a collection of English words, based on the working hypothesis that protein sequences are composed of short constituent amino acid sequences (SCSs), or “words”. We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when the low-rank tails are excluded. Compared with natural English and “compressed” English without spaces between words, the amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., “key words”) and vice versa. In fact, the highest availability peak within a given protein sequence often corresponds directly to a sequence motif. The amino acid composition of high-availability sites within motifs differs from that of entire motifs and of all protein sequences, suggesting the possible functional importance of specific SCSs and their constituent amino acids within motifs. We anticipate that our availability-based word decoding approach will complement sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
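The rank-frequency analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how SCSs are delimited or how the exponent is fitted, so everything here is an assumption: "words" are taken to be overlapping substrings of a fixed length `k`, the exponent is estimated by an ordinary least-squares fit on log-log axes (a rough estimate, not a rigorous power-law test), and the function names and toy sequences are hypothetical. Availability scores are not defined in the abstract and are not modeled.

```python
import math
from collections import Counter


def count_words(sequences, k):
    """Count overlapping length-k substrings (assumed stand-in for SCSs)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs; rank 1 is the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def fit_exponent(pairs):
    """Least-squares slope of log(frequency) against log(rank), negated.

    For an ideal Zipf distribution, frequency ~ rank**(-s), this returns s.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Made-up toy sequences, for illustration only (not real proteins).
toy_sequences = ["MKTAYIAKQR", "MKTAYLLKQR", "GAKTAYIAMK"]
toy_counts = count_words(toy_sequences, k=3)
toy_pairs = rank_frequency(toy_counts)
toy_exponent = fit_exponent(toy_pairs)
```

On a real proteome, the same functions would be run over all sequences for a range of `k`, and the fit restricted to the approximately linear part of the log-log plot, since the abstract notes that the tails deviate from the power law.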