Journal: Proteins: Structure, Function, and Genetics

Protein classification based on text document classification techniques.

Abstract

The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) classifiers using alignment-based features have suggested that classifiers as complex as SVMs are needed to attain high accuracy. Here, by analogy with document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained accuracies of 93.0% and 92.4% in level I and level II subfamily classification, respectively, whereas the reported SVM accuracies are 88.4% and 86.3%. This corresponds to a 39.7% and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, we show that our method generalizes to other protein families by applying it to the superfamily of nuclear receptors, where it attains 94.5%, 97.8% and 93.6% accuracy in family, level I and level II subfamily classification, respectively.
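The approach described in the abstract (n-gram counts, chi-square feature selection, Naive Bayes) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the n-gram lengths, the exact chi-square formulation, the number of selected features, or the Naive Bayes variant, so the sketch assumes bigrams, a 2x2 presence-versus-class chi-square statistic maximized over classes, and a multinomial model with add-one smoothing. The toy sequences, the class labels "A" and "B", and all function names are hypothetical. The last helper reproduces the residual-error arithmetic from the accuracies quoted in the abstract.

```python
import math
from collections import Counter

def ngram_counts(seq, n=2):
    """Count overlapping n-grams (peptide substrings of length n)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def chi_square_scores(count_vectors, labels):
    """Score each n-gram by its largest 2x2 chi-square statistic over classes
    (n-gram present/absent vs. sequence in/out of the class). Assumed variant."""
    total = len(labels)
    class_sizes = Counter(labels)
    present = Counter()     # n-gram -> number of sequences containing it
    present_in = Counter()  # (n-gram, class) -> sequences of that class containing it
    for vec, label in zip(count_vectors, labels):
        for gram in vec:
            present[gram] += 1
            present_in[gram, label] += 1
    scores = {}
    for gram, n_present in present.items():
        best = 0.0
        for cls, n_cls in class_sizes.items():
            a = present_in[gram, cls]  # in class, has n-gram
            b = n_present - a          # outside class, has n-gram
            c = n_cls - a              # in class, lacks n-gram
            d = total - n_cls - b      # outside class, lacks n-gram
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, total * (a * d - b * c) ** 2 / denom)
        scores[gram] = best
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring n-grams."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def train_naive_bayes(count_vectors, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected n-grams."""
    priors = Counter(labels)
    totals = {cls: Counter() for cls in priors}
    for vec, label in zip(count_vectors, labels):
        for gram, k in vec.items():
            if gram in vocab:
                totals[label][gram] += k
    model = {}
    for cls, grams in totals.items():
        denom = sum(grams.values()) + len(vocab)
        log_prior = math.log(priors[cls] / len(labels))
        log_like = {g: math.log((grams[g] + 1) / denom) for g in vocab}
        model[cls] = (log_prior, log_like)
    return model

def predict(model, vec):
    """Return the class with the highest posterior log-score."""
    def score(cls):
        log_prior, log_like = model[cls]
        return log_prior + sum(k * log_like[g] for g, k in vec.items() if g in log_like)
    return max(model, key=score)

def residual_error_reduction(acc_new, acc_old):
    """Percentage of the old classifier's error removed by the new one."""
    return 100.0 * (acc_new - acc_old) / (100.0 - acc_old)

# Hypothetical toy data: made-up sequences, not real GPCRs.
train = [("MKWLLAAWW", "A"), ("WWKLAWWMA", "A"),
         ("MKGGPLGGS", "B"), ("GGSPLMGGK", "B")]
vectors = [ngram_counts(seq) for seq, _ in train]
labels = [label for _, label in train]
vocab = select_features(chi_square_scores(vectors, labels), k=6)
model = train_naive_bayes(vectors, labels, vocab)

print(predict(model, ngram_counts("AWWLKWW")))
print(predict(model, ngram_counts("PGGSGGL")))
# Accuracies taken from the abstract: Naive Bayes vs. reported SVM.
print(round(residual_error_reduction(93.0, 88.4), 1))  # level I
print(round(residual_error_reduction(92.4, 86.3), 1))  # level II
```

On real data the same steps would more likely be built from an established library; the pure-Python version here only aims to make each step of the pipeline visible.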