Protein classification based on text document classification techniques.

Cheng BY; Carbonell JG; Klein-Seetharaman J

Abstract

The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) classifiers using alignment-based features have suggested that classifiers as complex as SVM are needed to attain high accuracy. Here, by analogy with document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained accuracies of 93.0% and 92.4% in level I and level II subfamily classification, respectively, while SVM has reported accuracies of 88.4% and 86.3%. This is a 39.7% and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, we show that our method generalizes to other protein families by applying it to the superfamily of nuclear receptors, where it attains 94.5%, 97.8% and 93.6% accuracy in family, level I and level II subfamily classification, respectively.
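To make the approach in the abstract concrete, the following is a minimal sketch of n-gram counting, chi-square feature selection, and a Naive Bayes classifier, followed by the residual-error arithmetic behind the 39.7% and 44.5% figures. It is not the authors' implementation. The abstract does not give the details, so several choices here are assumptions: bigrams (n = 2), chi-square computed on n-gram presence or absence, feature ranking by the maximum score over classes, a multinomial model with add-one smoothing, and every function name. The sequences and labels "A" and "B" are invented toy data, not GPCR sequences.

```python
import math
from collections import Counter

def ngram_counts(seq, n=2):
    """Count overlapping n-grams (peptides of length n) in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def chi_square(feature, docs, labels, target):
    """Chi-square of n-gram presence vs. membership in class `target`
    (2x2 contingency table; using presence/absence is an assumption)."""
    a = b = c = d = 0
    for counts, label in zip(docs, labels):
        present, in_class = feature in counts, label == target
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return total * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k n-grams with the highest chi-square over any class."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(f, docs, labels, c) for c in classes)
             for f in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda f: (-score[f], f))[:k]

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = math.log(len(class_docs) / len(docs))
        totals = Counter()
        for d in class_docs:
            for f in features:
                totals[f] += d[f]
        denom = sum(totals.values()) + len(features)
        cond[c] = {f: math.log((totals[f] + 1) / denom) for f in features}
    return prior, cond

def predict(model, counts):
    """Return the class with the highest log posterior."""
    prior, cond = model
    return max(prior, key=lambda c: prior[c] + sum(
        counts[f] * logp for f, logp in cond[c].items()))

def residual_error_reduction(baseline_acc, new_acc):
    """Percent of the baseline's remaining error removed (accuracies in %)."""
    return (new_acc - baseline_acc) / (100.0 - baseline_acc) * 100.0

# Toy training set: invented sequences with invented class labels.
train = [("AAKAAKAAK", "A"), ("KAAKAAKA", "A"),
         ("GGSGGSGGS", "B"), ("SGGSGGSG", "B")]
docs = [ngram_counts(seq) for seq, _ in train]
labels = [label for _, label in train]
features = select_features(docs, labels, k=4)
model = train_nb(docs, labels, features)

print(predict(model, ngram_counts("AAKAAK")))
print(round(residual_error_reduction(88.4, 93.0), 1))  # level I figures
print(round(residual_error_reduction(86.3, 92.4), 1))  # level II figures
```

The last two lines apply the formula (new accuracy minus baseline accuracy) divided by (100 minus baseline accuracy) to the accuracies quoted in the abstract, which is how a gain of 4.6 points over SVM's 88.4% corresponds to removing roughly 40% of the remaining error.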

Bibliographic details

Source
Proteins: Structure, Function, and Genetics | 2005, Issue 4 | 16 pages
Authors
Cheng BY; Carbonell JG; Klein-Seetharaman J
Original format: PDF
Language: English
Chinese Library Classification: Basic medicine
Keywords
Classification of information; Proteins; Protein measurement; Protein classes
