Journal of Bioinformatics and Computational Biology

Evaluating feature-selection stability in next-generation proteomics


Abstract

Identifying features that are both reproducible and relevant is a major challenge in biological research, and one that is well documented for genomics data. Using a proposed set of three reliability benchmarks, we find that the issue also exists in proteomics for commonly used feature-selection methods, e.g. the t-test and recursive feature elimination. Moreover, due to high test variability, selecting the top proteins based on p-value ranks does not improve reproducibility, even when selection is restricted to high-abundance proteins. Statistical testing based on networks is believed to be more robust, but this does not always hold true: the commonly used hypergeometric enrichment, which tests for enrichment of protein subnets, performs abysmally due to its dependence on unstable protein pre-selection steps. We demonstrate here for the first time the utility on proteomics data of a novel suite of network-based algorithms called rank-based network algorithms (RBNAs). These were originally introduced and tested extensively on genomics data. We show here that they are highly stable and reproducible, and that they select relevant features when applied to proteomics data. It is also evident from these results that statistical feature testing on protein expression data should be executed with due caution. Careless use of networks does not resolve poor-performance issues, and can even mislead. We recommend augmenting statistical feature-selection methods with concurrent analysis of stability and reproducibility to improve the quality of the selected features prior to experimental validation.
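
The recommendation to check stability alongside feature selection can be illustrated with a minimal sketch. It is not the paper's benchmark suite: the abstract does not define the three reliability benchmarks, so the sketch assumes one common stability measure, the mean pairwise Jaccard overlap of top-k feature sets chosen on random subsamples. Features are ranked by the absolute Welch t-statistic as a stand-in for p-value ranking, and every function name, parameter, and the simulated data are hypothetical.

```python
import random
import statistics
from itertools import combinations


def welch_t(a, b):
    """Welch's t-statistic for two lists of numbers (each needs >= 2 values)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def top_k_features(group_a, group_b, k):
    """Indices of the k features with the largest |t|.

    Rows are samples, columns are features (e.g. protein abundances).
    Ranking by |t| is an assumed proxy for ranking by p-value.
    """
    n_features = len(group_a[0])
    scores = []
    for j in range(n_features):
        col_a = [row[j] for row in group_a]
        col_b = [row[j] for row in group_b]
        scores.append((abs(welch_t(col_a, col_b)), j))
    scores.sort(reverse=True)
    return frozenset(j for _, j in scores[:k])


def jaccard(s, t):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def selection_stability(group_a, group_b, k, subsample, n_resamples=20, seed=0):
    """Mean pairwise Jaccard overlap of top-k sets across random subsamples."""
    rng = random.Random(seed)
    selections = []
    for _ in range(n_resamples):
        sub_a = rng.sample(group_a, subsample)  # subsample without replacement
        sub_b = rng.sample(group_b, subsample)
        selections.append(top_k_features(sub_a, sub_b, k))
    pairs = list(combinations(selections, 2))
    return sum(jaccard(s, t) for s, t in pairs) / len(pairs)


def simulate(n_samples, n_features, n_true, effect, seed=0):
    """Hypothetical two-class data: the first n_true features are shifted by `effect`."""
    rng = random.Random(seed)
    group_a = [[rng.gauss(0, 1) for _ in range(n_features)]
               for _ in range(n_samples)]
    group_b = [[rng.gauss(effect if j < n_true else 0, 1) for j in range(n_features)]
               for _ in range(n_samples)]
    return group_a, group_b


if __name__ == "__main__":
    a, b = simulate(n_samples=10, n_features=200, n_true=10, effect=1.0)
    print(selection_stability(a, b, k=10, subsample=6))
```

A score near 1 means the same features are selected regardless of which samples are drawn, and a score near 0 means the selection is driven largely by sampling noise. Running the sketch with different `effect` and `subsample` values gives a feel for how a top-k list can change between subsamples when the sample size is small.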
