Journal: International Journal of Molecular Sciences

Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique



Abstract

(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years, yet recent studies show that the choice of imputation algorithm makes little difference to classification. Some scholars therefore argue that selecting informative genes for downstream classification matters more than how MVs are imputed. However, most feature-selection (FS) algorithms require prior imputation, and the impact of that imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS method is introduced for gene-expression data. To cope with the small sample sizes typical of gene-expression data, a heuristic called recursive element aggregation is proposed. The approach handles incomplete data directly, without any imputation method or missing-data assumptions. The most informative genes are selected through a threshold, after which a best-first search strategy is used to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset affects subsequent FS in classification tasks. By conducting FS directly on incomplete data, our method avoids the potential disturbance that MV imputation causes to subsequent FS procedures. An experiment on the small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes, besides sharing many genes with the two existing methods it was compared against.
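The two-stage idea in the abstract (filter genes without imputing, then search over subsets) can be illustrated with a rough sketch. This is not the authors' algorithm: it uses a plain Pearson chi-square statistic on median-binarized values, computed only over the samples where a gene is observed, and it does not implement the modified test or recursive element aggregation. The function names (`chi2_observed`, `filter_genes`, `best_first_search`), the `patience` stopping rule, and the caller-supplied `evaluate` function (for instance, a cross-validated accuracy) are all assumptions made for illustration.

```python
import heapq
import math


def chi2_observed(values, labels):
    """Pearson chi-square between one gene and the class labels,
    using only samples where the gene is observed (not None/NaN)."""
    pairs = [(v, y) for v, y in zip(values, labels)
             if v is not None and not math.isnan(v)]
    if len(pairs) < 2:
        return 0.0
    observed = sorted(v for v, _ in pairs)
    cut = observed[len(observed) // 2]  # binarize at the (upper) median
    table = {}
    for v, y in pairs:
        key = (int(v >= cut), y)
        table[key] = table.get(key, 0) + 1
    rows = sorted({b for b, _ in table})
    cols = sorted({y for _, y in table})
    n = len(pairs)
    row_tot = {b: sum(table.get((b, y), 0) for y in cols) for b in rows}
    col_tot = {y: sum(table.get((b, y), 0) for b in rows) for y in cols}
    stat = 0.0
    for b in rows:
        for y in cols:
            expected = row_tot[b] * col_tot[y] / n
            stat += (table.get((b, y), 0) - expected) ** 2 / expected
    return stat


def filter_genes(X, labels, threshold):
    """X is a list of samples; each sample is a list of gene values,
    with None marking a missing entry. Returns (kept indices, scores)."""
    n_genes = len(X[0])
    scores = [chi2_observed([row[j] for row in X], labels)
              for j in range(n_genes)]
    kept = [j for j, s in enumerate(scores) if s >= threshold]
    return kept, scores


def best_first_search(candidates, evaluate, patience=5):
    """Forward best-first search over feature subsets. Always expands the
    highest-scoring open subset; stops after `patience` consecutive
    expansions that fail to beat the best score seen so far."""
    start = frozenset()
    best, best_score = start, evaluate(start)
    frontier = [(-best_score, 0, start)]  # max-heap via negated score
    visited = {start}
    counter, stale = 1, 0  # counter breaks ties between equal scores
    while frontier and stale < patience:
        neg_score, _, subset = heapq.heappop(frontier)
        if -neg_score > best_score:
            best, best_score, stale = subset, -neg_score, 0
        else:
            stale += 1
        for g in candidates:
            child = subset | {g}
            if child not in visited:
                visited.add(child)
                heapq.heappush(frontier, (-evaluate(child), counter, child))
                counter += 1
    return sorted(best), best_score
```

In use, `filter_genes` would be run first with a chosen threshold, and the surviving gene indices passed as `candidates` to `best_first_search`. Because each gene is scored only on its own observed samples, no imputed value ever enters the statistic, which is the property the abstract emphasizes.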
