Journal: IEEE Transactions on Information Theory

Capacity and Expressiveness of Genomic Tandem Duplication



Abstract

The majority of the human genome consists of repeated sequences. An important type of repeated sequence common in the human genome is the tandem repeat, in which identical copies appear next to each other. For example, a tandem repeat made of two adjacent copies of a length-2 block may be generated from a shorter sequence by a tandem duplication of length 2. In this paper, we investigate the possibility of generating a large number of sequences from a seed, i.e., a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system’s generating power. Our results include exact capacity values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the expressiveness of a tandem duplication system as its capability of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated 1906 result by Axel Thue, which presents a construction for ternary squarefree sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems are not fully expressive, regardless of the seed and the bound on duplication length; i.e., they cannot generate all strings even as substrings of other strings. The alphabet of size 4 is of particular interest because it is the size of the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that, for these systems, duplication lengths play a more significant role than the seed in generating a large number of sequences.
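The tandem duplication operation described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's formal construction: the function names `tandem_duplicate` and `descendants`, the example strings, and the breadth-first enumeration are all hypothetical choices made here. The sketch copies a block of a given length and inserts the copy immediately after the original, then enumerates every string reachable from a seed when the duplication length is bounded.

```python
from collections import deque


def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Copy s[start:start+length] and insert the copy right after it."""
    if length <= 0 or start < 0 or start + length > len(s):
        raise ValueError("duplicated block must lie inside the string")
    end = start + length
    return s[:end] + s[start:end] + s[end:]


def descendants(seed: str, max_dup_len: int, max_len: int) -> set[str]:
    """All strings of length <= max_len reachable from the seed using
    tandem duplications of length at most max_dup_len (breadth-first)."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        s = queue.popleft()
        for length in range(1, max_dup_len + 1):
            if len(s) + length > max_len:
                break  # longer duplications would exceed the length cap
            for start in range(len(s) - length + 1):
                t = tandem_duplicate(s, start, length)
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen


# Illustrative strings chosen for this sketch: duplicating the block "TG"
# (length 2) creates the tandem repeat "TGTG".
print(tandem_duplicate("AGTCTGC", 4, 2))  # AGTCTGTGC

# Count descendants of a short hypothetical seed, grouped by length.
reachable = descendants("ACGT", max_dup_len=2, max_len=8)
for n in range(4, 9):
    print(n, sum(1 for t in reachable if len(t) == n))
```

Counting descendants by length, as in the final loop, gives an informal feel for the growth rate that the notion of capacity is meant to quantify; the length cap `max_len` is an assumption added here only to keep the enumeration finite.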
