Journal: Database

A robust data-driven approach for gene ontology annotation

Abstract

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work on this task as well as the experimental results obtained after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained F1 scores of 22.1% and 35.7% by incorporating bigram features in RDE learning. On both the development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC over classical supervised learning methods, e.g. support vector machines and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence, using a ranking function that combines cosine similarity with the frequency of GO terms in documents, together with a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that incorporating frequency information and hierarchy filtering substantially improved performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
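
The abstract describes the GO term ranking function only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It scores each candidate GO term by the cosine similarity between bag-of-words vectors of the evidence sentence and the GO term name, then mixes in the term's relative frequency. The linear interpolation, the weight `alpha`, the function names, and the toy frequency counts are all assumptions made for illustration; the hierarchy-based filtering step is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_go_terms(sentence, go_terms, term_freq, alpha=0.3):
    """Rank GO terms for one evidence sentence.

    go_terms:  dict mapping GO id -> term name
    term_freq: dict mapping GO id -> how often the term occurs in documents
    alpha:     weight of the frequency component (assumed, not from the paper)
    """
    sent_vec = Counter(tokenize(sentence))
    total = sum(term_freq.values()) or 1
    scored = []
    for go_id, name in go_terms.items():
        similarity = cosine(sent_vec, Counter(tokenize(name)))
        frequency = term_freq.get(go_id, 0) / total
        # Assumed combination: linear interpolation of the two signals.
        score = (1 - alpha) * similarity + alpha * frequency
        scored.append((go_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: the GO ids and names are real, the counts are invented.
go_terms = {
    "GO:0016301": "kinase activity",
    "GO:0005634": "nucleus",
    "GO:0003677": "DNA binding",
}
term_freq = {"GO:0016301": 3, "GO:0005634": 10, "GO:0003677": 5}

ranked = rank_go_terms(
    "The protein shows kinase activity in vitro", go_terms, term_freq
)
print(ranked[0][0])
```

In this toy setting the frequency component breaks ties among terms that share no words with the sentence, while lexical overlap still dominates the ranking. A real system would need to tune how the two signals are balanced.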