International Conference on Bioinformatics and Computational Biology

Convolutional neural net learns promoter sequence features driving transcription strength

Abstract

Promoters drive gene expression and help regulate cellular responses to the environment. In recent research, machine learning models have been developed to predict a bacterial promoter's transcriptional initiation rate, although these models utilize expert-labeled sequence elements across a defined set of DNA building blocks. The generalizability of these methods is therefore limited by the necessary labeling of the specific components studied. As a result, current models have not been used to predict the transcriptional initiation rates of promoters with generalized nucleotide sequences. If generalizable models existed, they could greatly facilitate the design of synthetic genetic circuits with well-controlled transcription rates in bacteria. To address these limitations, we used a convolutional neural network (CNN) to predict a promoter's transcriptional initiation rate directly from its DNA nucleotide sequence. We first evaluated the model on a published promoter component dataset. Trained using only the sequence as input, our model fits held-out test data with R² = 0.90, comparable to published models that fit expert-labeled sequence elements. We produced a new promoter strength dataset including non-repetitive promoters with high sequence variation and not limited to combinations of discrete expert-labeled components. Our CNN trained on this more varied dataset fits held-out promoter strength with R² = 0.61. Previously-published models are intractable on a dataset like this with highly diverse inputs. The CNN outperforms classical approach baselines like LASSO on a bag of words for promoter sequence elements (R² = 0.42). We applied recent machine learning approaches to quantify the contribution of individual nucleotides to the CNN's promoter strength prediction. Learning directly from DNA sequence, our model identified the consensus -35 and -10 hexamer regions as well as the discriminator element as key contributors to σ⁷⁰ promoter strength. It also replicated a finding that a perfect consensus sequence match does not yield the strongest promoter. The model's ability to independently learn biologically-relevant information directly from sequence, while performing similarly to or better than classical methods, makes it appealing for further prediction optimization and research into generalizability. This approach may be useful for synthetic promoter design, as well as for sequence feature identification.
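The abstract does not specify how sequences are encoded or how the baseline features are built, so the following is only a minimal sketch under stated assumptions: one-hot encoding is assumed as the CNN input (a common choice, not confirmed by the source), overlapping k-mer counts stand in for the "bag of words" baseline features, and R² is computed as the standard coefficient of determination. The function names `one_hot`, `kmer_counts`, and `r_squared` are hypothetical and chosen for illustration.

```python
from collections import Counter

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 indicators in A, C, G, T order.

    Assumption: one-hot input for the CNN; characters outside ACGT
    (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def kmer_counts(seq, k=6):
    """Bag-of-words features: counts of every overlapping k-mer.

    Assumption: k = 6 is an illustrative default, not a value from the source.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For example, `one_hot("AC")` gives one row per nucleotide, `kmer_counts("TTGACATTGACA")` counts each overlapping hexamer in a toy sequence, and `r_squared(measured, predicted)` scores held-out predictions against measured strengths. A regression model such as LASSO would be fit on the k-mer counts, while the CNN would consume the one-hot matrix directly.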
