An Ensemble Tf-Idf Based Approach to Protein Function Prediction via Sequence Segmentation

Ashish Ranjan; David Fernández-Baca; Sudhakar Tripathi; Akshay Deepak

Abstract

This paper explores the use of variants of tf-idf-based descriptors, namely length-normalized tf-idf and log-normalized tf-idf, combined with a segmentation technique, for efficient modeling of variable-length protein sequences. The proposed solution, ProtVecGen-Ensemble, is an ensemble of three models trained on differently segmented datasets constructed from an input dataset containing complete protein sequences. Evaluations using biological process (BP) and molecular function (MF) datasets demonstrate that the proposed feature set is not only superior to its contemporaries but also produces more consistent results with respect to variation in sequence lengths. Improvements of +6.07% (BP) and +7.56% (MF) over the state-of-the-art tf-idf-based MLDA feature set were obtained. The best results were achieved when ProtVecGen-Ensemble was combined with ProtVecGen-Plus, the state-of-the-art method for protein function prediction, resulting in improvements of +8.90% (BP) and +11.28% (MF) over MLDA and +1.49% (BP) and +2.07% (MF) over ProtVecGen-Plus + MLDA. To capture the performance consistency with respect to sequence lengths, we have defined a variance-based metric, with lower values indicating better performance. On this metric, the proposed ProtVecGen-Ensemble + ProtVecGen-Plus framework resulted in reductions of 56.85% (BP) and 56.08% (MF) relative to MLDA and 10.37% (BP) and 26.48% (MF) relative to ProtVecGen-Plus + MLDA.
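The abstract does not give formulas, so the following is only a minimal sketch of what length-normalized and log-normalized tf-idf descriptors over segmented sequences could look like. Everything specific here is an assumption made for illustration and not taken from the paper: the use of overlapping 3-mers as terms, the smoothed idf formula, non-overlapping fixed-size segments, the function names, and the reading of the variance-based metric as the population variance of scores across sequence-length bins.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Overlapping k-mers of an amino-acid sequence (assumed 'terms')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def segments(seq, size):
    """Non-overlapping fixed-size segments; the last one may be shorter."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def idf(corpus, k=3):
    """Smoothed inverse document frequency over a list of sequences."""
    n = len(corpus)
    df = Counter()
    for seq in corpus:
        df.update(set(kmers(seq, k)))  # count each term once per sequence
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(seq, idf_map, k=3, variant="length"):
    """Sparse tf-idf vector for one sequence or segment.

    variant="length": tf = count / number of k-mers in the sequence
    variant="log":    tf = 1 + ln(count)
    """
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    vec = {}
    for term, c in counts.items():
        if term not in idf_map:
            continue  # ignore terms unseen when idf was fitted
        tf = c / total if variant == "length" else 1.0 + math.log(c)
        vec[term] = tf * idf_map[term]
    return vec


def length_consistency(scores_by_length_bin):
    """Population variance of per-bin scores; lower means more consistent."""
    n = len(scores_by_length_bin)
    mean = sum(scores_by_length_bin) / n
    return sum((s - mean) ** 2 for s in scores_by_length_bin) / n


if __name__ == "__main__":
    corpus = ["MKVLAAGMKV", "MKVAAA"]  # toy sequences, not real data
    idf_map = idf(corpus)
    # Hypothetical segment sizes: one segmented dataset per ensemble member.
    for size in (4, 5, 6):
        for seg in segments(corpus[0], size):
            print(size, seg, tfidf(seg, idf_map, variant="log"))
```

In an ensemble along these lines, each segment size would yield its own training set and model, and the per-model predictions would then be combined; how the paper combines them is not stated in the abstract.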

Bibliographic details

Source
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022, Issue 5, pp. 2685-2696 (12 pages)
Authors
Ashish Ranjan; David Fernández-Baca; Sudhakar Tripathi; Akshay Deepak
Author affiliations

Department of Computer Science & Engineering, National Institute of Technology Patna, Patna, Bihar, India;

Department of Computer Science, Iowa State University, Ames, IA, USA;

Department of Information Technology, REC Ambedkar Nagar, Akbarpur, Uttar Pradesh, India;

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Bioinformatics
Keywords
Proteins; Amino acids; Protein sequence; Predictive models; Task analysis; Location awareness; Feature extraction;
