Journal: Algorithms for Molecular Biology

Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations

Abstract

Background: Through the wealth of information they contain, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Because approaches that analyze single variants one at a time are limited, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants both individually and in conjunction with one another. Constructing models that account for such subsets of variants requires methods that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the very large number of variants, constructing these types of models has so far been computationally infeasible.

Results: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space-efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. On this dataset, we also show that greedy RLS has better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS.

Conclusions: Greedy RLS is the first known implementation of a machine-learning-based method with the capability to conduct wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.
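To make the idea of wrapper-based greedy forward selection with a regularized least-squares (RLS) learner concrete, the following is a minimal, naive sketch in Python with NumPy. It is not the greedy RLS algorithm from the paper and does not use the RLScore API: it refits the model from scratch for every candidate feature, so it lacks the matrix short-cuts that give the linear running time described above. The function names, the regularization parameter `lam`, and the use of leave-one-out squared error as the selection criterion are all assumptions made for illustration; the abstract does not state which criterion is used.

```python
import numpy as np


def loo_squared_error(X, y, lam):
    """Leave-one-out squared error of RLS (ridge regression without intercept).

    Uses the standard hat-matrix identity: the LOO residual for example i
    equals (y_i - yhat_i) / (1 - h_ii), so no model is refit n times.
    """
    n, k = X.shape
    # G = (X^T X + lam * I)^{-1} X^T, shape (k, n)
    G = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)
    fitted = X @ (G @ y)
    # Diagonal of the hat matrix X @ G, computed without forming the n x n matrix
    h = np.einsum("ij,ji->i", X, G)
    loo_residuals = (y - fitted) / (1.0 - h)
    return float(np.sum(loo_residuals ** 2))


def greedy_forward_selection(X, y, n_select, lam=1.0):
    """Naive wrapper-based selection: at each step, add the single feature
    that gives the lowest LOO error together with those already selected."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_err = None, np.inf
        for j in remaining:
            err = loo_squared_error(X[:, selected + [j]], y, lam)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


# Synthetic stand-in for standardized genotype data (hypothetical, not WTCCC):
# only features 3 and 7 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 20))
y_demo = 2.0 * X_demo[:, 3] - 3.0 * X_demo[:, 7] + 0.1 * rng.standard_normal(200)
print(greedy_forward_selection(X_demo, y_demo, n_select=2, lam=1.0))
```

Because every candidate is evaluated by a full refit, the cost of this sketch grows much faster than linearly in the number of selected features, which is why an approach like this is impractical for hundreds of thousands of variants and why the short-cuts described in the abstract matter.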