Journal of Bioinformatics and Computational Biology

Multi-factorial analysis of class prediction error: Estimating optimal number of biomarkers for various classification rules


Abstract

Classifiers based on machine learning and statistical models have increasingly been used with more complex, high-dimensional biological data obtained from high-throughput technologies. Understanding how the various factors associated with large and complex microarray datasets affect the predictive performance of classifiers is computationally intensive and under-investigated, yet vital in determining the optimal number of biomarkers for classification purposes aimed at improved detection, diagnosis, and therapeutic monitoring of diseases. We investigate the impact of microarray-based data characteristics on the predictive performance of various classification rules using simulation studies. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. The optimal number of biomarkers for a classification problem should therefore be estimated taking all of these factors into account. A database of average generalization errors is built for various combinations of these factors. The database can be used to estimate the optimal number of biomarkers for a given level of predictive accuracy as a function of these factors. Examples show that curves from actual biological data resemble those from simulated data with corresponding levels of data characteristics. An R package, optBiomarker, implementing the method is freely available for academic use from the Comprehensive R Archive Network.
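The idea of tabulating average generalization errors and then reading off the smallest number of biomarkers that reaches a target error can be illustrated with a small simulation. The sketch below is only an illustration under strong assumptions and is not the optBiomarker API: all function names are hypothetical, markers are simulated as independent (no correlation between biomarkers), a single standard deviation `sd` stands in for combined biological and technical variability, there is no replication, and a simple nearest-centroid rule is used as a stand-in for the four classifiers named in the abstract.

```python
import random


def simulate_class(n_samples, n_markers, mean, sd, rng):
    """Draw n_samples profiles; every marker is an independent Normal(mean, sd)."""
    return [[rng.gauss(mean, sd) for _ in range(n_markers)]
            for _ in range(n_samples)]


def centroid(samples):
    """Per-marker mean of a list of profiles."""
    return [sum(col) / len(col) for col in zip(*samples)]


def sq_dist(x, c):
    """Squared Euclidean distance between a profile and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def estimate_error(n_markers, n_train, log_fold_change, sd,
                   n_test=200, n_reps=20, seed=0):
    """Average test error of a nearest-centroid rule over n_reps simulated datasets.

    n_train and n_test are per-class sample counts (assumption of this sketch).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Train: estimate one centroid per class from a small simulated sample.
        c0 = centroid(simulate_class(n_train, n_markers, 0.0, sd, rng))
        c1 = centroid(simulate_class(n_train, n_markers, log_fold_change, sd, rng))
        # Test: fresh samples from the same two populations.
        test0 = simulate_class(n_test, n_markers, 0.0, sd, rng)
        test1 = simulate_class(n_test, n_markers, log_fold_change, sd, rng)
        wrong = sum(sq_dist(x, c1) < sq_dist(x, c0) for x in test0)
        wrong += sum(sq_dist(x, c0) <= sq_dist(x, c1) for x in test1)
        total += wrong / (2 * n_test)
    return total / n_reps


def build_error_table(marker_counts, train_sizes, log_fold_change, sd):
    """Tabulate average error for each (number of markers, training size) pair."""
    return {(m, n): estimate_error(m, n, log_fold_change, sd)
            for m in marker_counts for n in train_sizes}


def optimal_number(table, n_train, target_error):
    """Smallest marker count whose tabulated error meets the target, else None."""
    hits = [m for (m, n), err in table.items()
            if n == n_train and err <= target_error]
    return min(hits) if hits else None


if __name__ == "__main__":
    table = build_error_table([1, 5, 20], [5, 10], log_fold_change=1.0, sd=1.0)
    for key in sorted(table):
        print(key, round(table[key], 3))
    print(optimal_number(table, n_train=10, target_error=0.1))
```

In this toy setting the tabulated error falls as more informative markers are added and as the training set grows, which is the kind of dependence the abstract describes. A fuller simulation would also vary correlation, replication and the split between biological and technical variability.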
