Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics

An Ensemble Tf-Idf Based Approach to Protein Function Prediction via Sequence Segmentation


Abstract

This paper explores variants of tf-idf-based descriptors, namely length-normalized tf-idf and log-normalized tf-idf, combined with a segmentation technique, for efficient modeling of variable-length protein sequences. The proposed solution, ProtVecGen-Ensemble, is an ensemble of three models trained on differently segmented datasets constructed from an input dataset containing complete protein sequences. Evaluations on biological process (BP) and molecular function (MF) datasets demonstrate that the proposed feature set not only outperforms its contemporaries but also produces more consistent results as sequence length varies. It improves on the state-of-the-art tf-idf-based MLDA feature set by +6.07% (BP) and +7.56% (MF). The best results were achieved when ProtVecGen-Ensemble was combined with ProtVecGen-Plus, the state-of-the-art method for protein function prediction, resulting in improvements of +8.90% (BP) and +11.28% (MF) over MLDA and +1.49% (BP) and +2.07% (MF) over ProtVecGen-Plus+MLDA. To capture performance consistency with respect to sequence length, we define a variance-based metric, with lower values indicating better performance. On this metric, the proposed ProtVecGen-Ensemble + ProtVecGen-Plus framework achieved reductions of 56.85% (BP) and 56.08% (MF) relative to MLDA and 10.37% (BP) and 26.48% (MF) relative to ProtVecGen-Plus+MLDA.
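As a rough illustration of the descriptors named in the abstract, the sketch below computes length-normalized and log-normalized tf-idf weights over the k-mers of a segmented protein sequence. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the k-mer size of 3, the non-overlapping segmentation, the smoothed idf, the two tf definitions (count divided by total k-mers, and 1 + ln(count)), and all function names. It is not the ProtVecGen-Ensemble implementation.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Overlapping k-mers ("words") of a protein sequence (k=3 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def segment(seq, size):
    """Split a sequence into consecutive, non-overlapping segments of at most
    `size` residues (an assumed scheme; the paper may segment differently)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def idf_table(docs, k=3):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(kmers(doc, k)))  # each document counts once per term
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf(seq, idf, k=3, mode="length"):
    """tf-idf weights for one sequence or segment.

    mode="length": tf = count / total number of k-mers (assumed definition)
    mode="log":    tf = 1 + ln(count)                  (assumed definition)
    Terms missing from the idf table are skipped.
    """
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    weights = {}
    for term, count in counts.items():
        if term not in idf:
            continue
        tf = count / total if mode == "length" else 1.0 + math.log(count)
        weights[term] = tf * idf[term]
    return weights


# Toy sequences, invented for this example only.
corpus = ["MKTAYIAKQR", "MKTMKT", "GGGGGG"]
idf = idf_table(corpus)

# One feature dictionary per segment; the segment size 4 is hypothetical.
for piece in segment(corpus[0], 4):
    print(piece, tfidf(piece, idf, mode="length"), tfidf(piece, idf, mode="log"))
```

Under these assumptions, an ensemble in the spirit of the abstract would repeat the feature extraction with three different segment sizes and train one model per segmented dataset; how the three models' outputs are combined is not stated in the abstract and is left out here.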
