...
Journal: BMC Medical Genomics

A machine learning framework for genotyping the structural variations with copy number variant



Abstract

Genotyping structural variations is an important computational problem in next-generation sequencing data analysis. In cancer genomes, however, copy number variants (CNVs) often coexist with other types of structural variation, which significantly reduces the accuracy of existing genotyping methods. A CNV region shows biased sequencing coverage and variant allele frequency, which leads genotyping approaches to misinterpret a heterozygote as a homozygote. Other data signals, such as split-mapped reads and abnormal reads, are also misjudged because of the CNV. Genotyping structural variations that overlap a CNV is therefore a complicated computational problem that must consider multiple features and their interactions. Here we propose a computational method for genotyping indels in CNV regions, which uses a machine learning framework to comprehensively incorporate a set of data features and their interactions. We extracted fifteen kinds of classification features as input. Unlike in the traditional genotyping problem, a variant here may fall into one of five types: normal homozygote, homozygous variant, heterozygous variant without CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild haplotype. The Multiclass Relevance Vector Machine (M-RVM) was used as the machine learning framework, combined with the distribution characteristics of the features. We applied the proposed method to both simulated and real data and compared it with existing popular software, including Gindel, Facets, and GATK, and with other machine learning cores: Support Vector Machine, Lagrange-SVM with one-vs-one (OVO) multiclass classification, Naïve Bayes, and BP Neural Network. The results demonstrated that the proposed method outperforms the others in accuracy, stability, and efficiency.
This work shows that genotyping structural variations in CNV regions cannot be solved as a traditional genotyping problem. More features should be used to complete the five-category task efficiently. According to the results, the proposed method can be a practical algorithm for correctly genotyping structural variations with CNV in next-generation sequencing data. The source code has been uploaded to https://github.com/TrinaZ/Mixgenotype for academic use only.
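
The bias described in the abstract can be illustrated with a small idealized calculation. The sketch below is not the authors' method and uses none of their fifteen features. It only assumes a pure diploid sample in which the variant allele frequency (VAF) equals mutated copies divided by total copies. The function names, the `extra_copies` parameter, and the nearest-value "naive" caller are all hypothetical, introduced for illustration.

```python
from fractions import Fraction


def genotype_classes(extra_copies):
    """Map each of the five classes named in the abstract to illustrative
    (mutated, wild-type) copy numbers. The copy numbers are assumptions."""
    return {
        "normal homozygote": (0, 2),
        "homozygous variant": (2, 0),
        "heterozygous variant without CNV": (1, 1),
        "heterozygous variant with CNV on the mutated haplotype": (1 + extra_copies, 1),
        "heterozygous variant with CNV on the wild haplotype": (1, 1 + extra_copies),
    }


def expected_vaf(mutated, wild):
    """Idealized VAF: fraction of copies that carry the variant."""
    return Fraction(mutated, mutated + wild)


# A CNV-unaware caller that only knows the three classical genotypes and
# picks whichever expected VAF (0, 1/2, 1) is nearest to the observed one.
NAIVE_TARGETS = {
    Fraction(0): "normal homozygote",
    Fraction(1, 2): "heterozygous variant",
    Fraction(1): "homozygous variant",
}


def naive_call(vaf):
    nearest = min(NAIVE_TARGETS, key=lambda target: abs(vaf - target))
    return NAIVE_TARGETS[nearest]


# Assumed amplification of three extra copies on one haplotype.
for name, (mutated, wild) in genotype_classes(extra_copies=3).items():
    vaf = expected_vaf(mutated, wild)
    relative_depth = (mutated + wild) / 2  # coverage relative to diploid
    print(f"{name}: VAF={float(vaf):.2f}, "
          f"relative depth={relative_depth:.1f}, naive call={naive_call(vaf)}")
```

Under these assumptions, a heterozygous variant whose mutated haplotype is amplified has a VAF of 4/5 and is called a homozygous variant by the naive caller, while one whose wild haplotype is amplified has a VAF of 1/5 and is called a normal homozygote. This is the heterozygote-to-homozygote confusion that motivates treating the problem as a five-category task with additional features.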
