Journal: BMC Medical Genomics

TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection

Abstract

Background: One of the challenges in classifying cancer tissue samples from gene expression data is to establish an effective method for selecting a parsimonious set of informative genes. Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and Prediction Analysis of Microarrays (PAM) are four popular classifiers with comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes, while TSP and k-TSP always use an even number of genes. In addition, k-TSP selects distinct gene pairs by simply combining the top-ranking pairs, without considering that the gene set with the best discriminative power may not be a combination of pairs. The k-TSP algorithm also requires the user to specify an upper bound on the number of gene pairs. Here we introduce a computational algorithm that addresses these problems: the Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier, abbreviated TSG.

Results: The TSG classifier starts with the top two genes and sequentially adds further genes to the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes, selected by cross-validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms the TSP family classifiers by a large margin on most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers: easy interpretation, invariance to monotone transformations, frequent selection of a small number of informative genes that allows follow-up studies, and resistance to sampling variation because its operations are performed within each sample.

Conclusions: Redefining the gene set scores and the classification rules of the TSP family classifiers to incorporate sample size information can lead to better selection of informative genes and better classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.
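
The rank-based property that the abstract attributes to the TSP family can be illustrated with a minimal sketch of the standard TSP pair score, the absolute difference between the two classes in how often gene i is expressed below gene j within a sample. This is not the TSG algorithm: the abstract does not spell out the chi-square-based score or the sequential gene-addition rule, so neither is reproduced here. The function names and the toy data below are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def tsp_score(X, y, i, j):
    """Return |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for gene pair (i, j)."""
    freqs = []
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Fraction of samples in this class where gene i is below gene j.
        freqs.append(sum(x[i] < x[j] for x in rows) / len(rows))
    return abs(freqs[0] - freqs[1])


def top_scoring_pair(X, y):
    """Return the gene pair with the highest score (first pair wins on ties)."""
    n_genes = len(X[0])
    return max(combinations(range(n_genes), 2),
               key=lambda pair: tsp_score(X, y, *pair))


# Made-up toy data: rows are samples, columns are three hypothetical genes.
X = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 1.0],
    [1.5, 4.0, 2.0],
    [7.0, 2.0, 1.5],
    [8.0, 1.0, 2.5],
    [6.0, 3.0, 4.0],
]
y = [0, 0, 0, 1, 1, 1]

best = top_scoring_pair(X, y)

# A strictly increasing transform (here, log) keeps every within-sample
# ordering, so the scores and the selected pair should not change.
X_log = [[math.log(v) for v in row] for row in X]
best_log = top_scoring_pair(X_log, y)

print(best, tsp_score(X, y, *best))
print(best_log, tsp_score(X_log, y, *best_log))
```

Because the score depends only on comparisons made inside each sample, any strictly increasing transformation of the expression values leaves it unchanged, which is the invariance the abstract refers to.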