
Distributional properties of inversions and segmentation algorithms for RNA sequences.


Abstract

Ribonucleic acid (RNA) is a long single-stranded molecule made up of four types of nucleotide bases: Adenine (A), Cytosine (C), Guanine (G) and Uracil (U). It folds back on itself and forms C-G and A-U complementary base pairs. The set of such hydrogen-bonded pairs in an RNA molecule is called its secondary structure. Knowing the secondary structure of an RNA is useful for understanding its biological function. Prediction of RNA secondary structure from the nucleotide sequence has been an important bioinformatics problem for over two decades.

The work in this thesis is motivated by the need to improve the accuracy and efficiency of secondary structure prediction for long RNA molecules. It involves investigating the distribution of inversions in random nucleotide sequences. An inversion is a string of nucleotide bases in an RNA sequence followed closely by its inverted complementary sequence downstream. It is the essential building block of any secondary structure. In this study, I focused on a random variable representing the number of inversions in an independent and identically distributed (i.i.d.) letter sequence, sampled from the nucleotide alphabet {A, C, G, U} with base composition {pA, pC, pG, pU}. I derived a recursive expression for calculating its mean, obtained simulated values for its variance, and demonstrated that this random variable can be reasonably approximated by a Poisson random variable over a range of inversion parameter values.

Predicting RNA secondary structure is a complicated process. It requires so much computer time and memory that detailed predictions are often impractical even for sequences only several hundred bases long. Yet there exist RNA molecules of biological interest (e.g., in some viral genomes) that contain over a thousand bases. To overcome the limitations in computing resources, Taufer et al. (2008) developed an approach based on grid computing. The idea of this approach is to segment a long RNA sequence into smaller chunks and send them to different computers on the grid for individual predictions. The individual predictions are then assembled to give the prediction for the original molecule. I developed two algorithms for segmenting a long RNA sequence into small chunks. Both algorithms attempt to identify areas in the sequence with a high concentration of inversions and to keep each such area within a single chunk in order to reduce information loss. The effect of these two segmentation algorithms on secondary structure prediction accuracy is tested on a data set from the Rfam database of RNA sequences with known secondary structures.
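The inversion count described in the abstract can be illustrated with a small simulation. The sketch below is not the thesis's method: the parameters `stem_len`, `min_gap` and `max_gap`, and the rule of counting every (start, gap) pair whose stem is followed by its inverted complement, are assumptions made here for illustration, since the abstract does not give the exact inversion parameters. Under this simplified definition the mean follows from linearity of expectation, which gives a check on the simulated values; a sample variance close to the mean would be consistent with a Poisson approximation.

```python
import random
from statistics import mean, variance

# Only the C-G and A-U pairs named in the abstract are treated as complementary.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Return the inverted complementary sequence of s."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def count_inversions(seq, stem_len=4, min_gap=3, max_gap=8):
    """Count (start, gap) pairs where seq[start:start+stem_len] is followed,
    after `gap` bases, by its inverted complement.
    All three parameters are hypothetical illustration values."""
    n = len(seq)
    count = 0
    for i in range(n - 2 * stem_len - min_gap + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for gap in range(min_gap, max_gap + 1):
            j = i + stem_len + gap
            if j + stem_len > n:
                break
            if seq[j:j + stem_len] == target:
                count += 1
    return count


def expected_mean(n, probs=(0.25, 0.25, 0.25, 0.25),
                  stem_len=4, min_gap=3, max_gap=8):
    """Exact mean of the simplified count for an i.i.d. sequence.
    probs is (pA, pC, pG, pU); a single position pairs with probability
    q = 2 * (pA * pU + pC * pG), and the two stems never overlap."""
    p_a, p_c, p_g, p_u = probs
    q = 2 * (p_a * p_u + p_c * p_g)
    positions = sum(max(0, n - 2 * stem_len - g + 1)
                    for g in range(min_gap, max_gap + 1))
    return positions * q ** stem_len


def simulate(n=200, trials=1000, probs=(0.25, 0.25, 0.25, 0.25), seed=0, **kw):
    """Sample mean and sample variance of the count over random sequences."""
    rng = random.Random(seed)
    counts = [
        count_inversions("".join(rng.choices("ACGU", weights=probs, k=n)), **kw)
        for _ in range(trials)
    ]
    return mean(counts), variance(counts)


if __name__ == "__main__":
    m, v = simulate()
    print("simulated mean:", m, "variance:", v, "exact mean:", expected_mean(200))
```

Running the script prints the simulated mean and variance next to the exact mean for a sequence of length 200 with uniform base composition; changing `probs` shows how the base composition shifts the count.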

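Bibliographic record

  • Author affiliation

    The University of Texas at El Paso.

  • Degree-granting institution: The University of Texas at El Paso.
  • Discipline: Statistics; Biology, Bioinformatics.
  • Degree: M.S.
  • Year: 2011
  • Pages: 101 p.
  • Total pages: 101
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Linguistics
  • Keywords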
