Journal: Journal of Bioinformatics and Computational Biology

Classification of large microarray datasets using fast random forest construction.



Abstract

Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.
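The two ingredients the abstract names, bootstrap samples and restricted subsets of attributes, can be illustrated with a minimal sketch. This is not the optimized implementation the abstract proposes. It assumes single-split decision stumps in place of full trees, and it counts how often each attribute is chosen for a split as a crude stand-in for a proper variable-importance measure. All function names (`fit_stump`, `fit_forest`, `predict`, `split_counts`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, features):
    """Best single-threshold split, searched only over `features`."""
    fallback = majority(y, None)
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        majority(left, fallback), majority(right, fallback))
    return best[1:]

def fit_forest(X, y, n_trees=25, mtry=None, seed=0):
    """Stump ensemble: one bootstrap sample and one attribute subset per stump."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = mtry or max(1, int(p ** 0.5))  # assumed default: sqrt(p) attributes
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: draw n rows with replacement
        feats = rng.sample(range(p), mtry)          # restricted subset of attributes
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

def split_counts(forest):
    """How often each attribute was chosen for a split (crude importance proxy)."""
    return Counter(f for f, _, _, _ in forest)

# Toy data: column 0 separates the classes; columns 1-3 carry no signal.
X = [[0.0, 5.0, 5.0, 5.0]] * 10 + [[1.0, 5.0, 5.0, 5.0]] * 10
y = ["A"] * 10 + ["B"] * 10
forest = fit_forest(X, y, n_trees=101, mtry=3)
print(predict(forest, [0.0, 5.0, 5.0, 5.0]), predict(forest, [1.0, 5.0, 5.0, 5.0]))
print(split_counts(forest).most_common())
```

Because each stump sees only `mtry` of the attributes, some stumps never see the informative column and vote from the class balance of their bootstrap sample. The majority vote is what absorbs those weak members.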
