A machine learning framework for genotyping the structural variations with copy number variant

Tian Zheng; Xiaoyan Zhu; Xuanping Zhang; Zhongmeng Zhao; Xin Yi; Jiayin Wang; Hongle Li

Abstract

Genotyping of structural variations is an important computational problem in next-generation sequencing data analysis. However, in cancer genomes, copy number variants (CNVs) often coexist with other types of structural variations, which significantly reduces the accuracy of existing genotyping methods. A CNV region shows bias in sequencing coverage and variant allelic frequency, which leads genotyping approaches to misinterpret a heterozygote as a homozygote. Other data signals, such as split-mapped reads and abnormal reads, are also misjudged because of the CNV. Genotyping structural variations with CNV is therefore a complicated computational problem that should consider multiple features and their interactions. Here we propose a computational method for genotyping indels in CNV regions, which introduces a machine learning framework to comprehensively incorporate a set of data features and their interactions. We extracted fifteen kinds of classification features as input. Unlike in the traditional genotyping problem, the variant here may fall into one of five types: normal homozygote, homozygous variant, heterozygous variant without CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild haplotype. The Multiclass Relevance Vector Machine (M-RVM) was used as the machine learning framework, combined with the distribution characteristics of the features. We applied the proposed method to both simulated and real data, and compared it with existing popular software, including Gindel, Facets, and GATK, and with other machine learning cores: Support Vector Machine, Lagrange-SVM with OVO (one-versus-one) multiclass classification, Naïve Bayes, and BP Neural Network. The results demonstrated that the proposed method outperforms the others in accuracy, stability, and efficiency.
This work shows that genotyping of structural variations in CNV regions cannot be solved as a traditional genotyping problem. More features should be used to complete the five-category task efficiently. According to the results, the proposed method can be a practical algorithm for correctly genotyping structural variations with CNV in next-generation sequencing data. The source code has been uploaded to https://github.com/TrinaZ/Mixgenotype for academic use only.
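
The allelic-frequency bias described above can be illustrated with an idealized calculation. The sketch below is not the authors' code and is not one of their fifteen features. It assumes a pure sample, a diploid baseline, and a single-copy gain as the CNV; the names `expected_vaf`, `CATEGORIES`, and `vaf_table` are hypothetical.

```python
from fractions import Fraction


def expected_vaf(mutant_copies: int, wild_copies: int) -> Fraction:
    """Idealized variant allelic frequency: mutant copies / total copies."""
    total = mutant_copies + wild_copies
    if total == 0:
        raise ValueError("at least one haplotype copy is required")
    return Fraction(mutant_copies, total)


# (mutant copies, wild-type copies) for each of the five categories.
# Assumption for illustration only: the CNV is a single-copy gain.
CATEGORIES = {
    "normal homozygote": (0, 2),
    "homozygous variant": (2, 0),
    "heterozygous, no CNV": (1, 1),
    "heterozygous, CNV on mutated haplotype": (2, 1),
    "heterozygous, CNV on wild haplotype": (1, 2),
}


def vaf_table() -> dict:
    """Map each category name to its idealized expected VAF."""
    return {name: expected_vaf(m, w) for name, (m, w) in CATEGORIES.items()}


if __name__ == "__main__":
    for name, vaf in vaf_table().items():
        print(f"{name}: expected VAF = {vaf} ({float(vaf):.2f})")
```

Under these assumptions a heterozygote no longer sits at 1/2: a gain on the mutated haplotype moves the expected value to 2/3, and a gain on the wild haplotype moves it to 1/3. A larger gain, for example `expected_vaf(4, 1)`, gives 4/5, which moves toward the homozygous value and suggests how a frequency-based caller could mistake a heterozygote for a homozygote.
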
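
For a five-category task, the OVO (one-versus-one) scheme that the abstract mentions for one of the SVM baselines trains one binary classifier per pair of classes and takes a majority vote. The sketch below shows only that voting structure. It is not M-RVM and not the authors' implementation: the pairwise "classifier" is a toy nearest-centroid rule on a single hypothetical feature, and the labels and centroid values are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical short labels for the five categories named in the abstract.
LABELS = (
    "normal_hom",
    "hom_variant",
    "het_no_cnv",
    "het_cnv_mutated",
    "het_cnv_wild",
)

# Illustrative one-feature centroids (idealized allelic frequencies).
# These are assumptions for the sketch, not values from the paper.
CENTROIDS = {
    "normal_hom": 0.0,
    "hom_variant": 1.0,
    "het_no_cnv": 0.5,
    "het_cnv_mutated": 2 / 3,
    "het_cnv_wild": 1 / 3,
}


def ovo_pairs(labels):
    """Enumerate every unordered pair of classes: k * (k - 1) / 2 pairs."""
    return list(combinations(labels, 2))


def pairwise_vote(x, a, b):
    """Toy binary classifier for the pair (a, b): pick the nearer centroid."""
    return a if abs(x - CENTROIDS[a]) <= abs(x - CENTROIDS[b]) else b


def ovo_predict(x, labels=LABELS):
    """Run all pairwise classifiers and return the class with the most votes."""
    votes = Counter(pairwise_vote(x, a, b) for a, b in ovo_pairs(labels))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(len(ovo_pairs(LABELS)))  # number of pairwise classifiers
    print(ovo_predict(0.64))
```

With five classes this yields ten pairwise classifiers. In a real setting each pairwise rule would be a trained model over many features rather than a fixed threshold on one value.
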

Bibliographic details

Source
BMC Medical Genomics, 2020, Issue 6, 15 pages
Authors
Tian Zheng; Xiaoyan Zhu; Xuanping Zhang; Zhongmeng Zhao; Xin Yi; Jiayin Wang; Hongle Li
Original format
PDF
Keywords
Cancer genomics; NGS data analysis; Genotyping structural variation; Copy number variant; Multiclass relevance vector machine
