Database: The Journal of Biological Databases and Curation

Automatic query generation using word embeddings for retrieving passages describing experimental methods


Abstract

Information about the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central to many biological processes. The experimental techniques used to verify PPIs are vital for characterizing the identified PPIs and assessing their reliability. Much of the information about PPIs and the associated experimental methods is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages that describe experimental methods for physical interactions between proteins as an information retrieval task. The baseline system is based on query matching, where the queries are generated from the names (including synonyms) of the experimental methods in the Proteomics Standards Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods that expand the baseline queries with additional relevant terms. The first method is supervised: the most salient terms for each experimental method are obtained by applying the term frequency–relevance frequency (tf.rf) metric to 13 articles from our manually annotated data set of 30 full-text articles, which we make publicly available. The second method is unsupervised: the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained from a large unlabeled full-text corpus. The proposed methods are evaluated on a test set consisting of the remaining 17 articles. Both methods obtain higher recall than the baseline, at the cost of lower precision. In addition to higher recall, the word-embedding-based approach achieves a higher F-measure than both the baseline and the tf.rf-based method.
We also show that incorporating gene name and interaction keyword identification improves the precision and F-measure scores of all three evaluated methods. The tf.rf-based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word-embedding-based approach is a novel contribution of this article. Database URL:
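
The supervised expansion step can be illustrated with a minimal sketch. It assumes the commonly used definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive and negative passages that contain a term; the abstract does not give the formula, so this definition is an assumption. The function names (`tf_rf_scores`, `expand_query`), the choice to sum term frequency over all positive passages, and the toy passages are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def tf_rf_scores(positive_docs, negative_docs):
    """Score every term that occurs in the positive passages with tf * rf.

    Assumed definition: rf = log2(2 + a / max(1, c)), where a and c are the
    numbers of positive and negative passages containing the term.
    tf is taken here as the total count over all positive passages.
    """
    tf = Counter(tok for doc in positive_docs for tok in doc)
    pos_df = Counter(tok for doc in positive_docs for tok in set(doc))
    neg_df = Counter(tok for doc in negative_docs for tok in set(doc))

    scores = {}
    for term, freq in tf.items():
        a = pos_df[term]
        c = neg_df[term]  # Counter returns 0 for unseen terms
        rf = math.log2(2 + a / max(1, c))
        scores[term] = freq * rf
    return scores


def expand_query(base_terms, scores, k=3):
    """Append the k highest-scoring terms not already in the base query."""
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    extra = [t for t in ranked if t not in base_terms][:k]
    return list(base_terms) + extra


# Invented toy passages: "positive" ones mention the target method,
# "negative" ones mention other methods.
positive = [
    ["two", "hybrid", "screen", "bait", "prey"],
    ["yeast", "two", "hybrid", "bait"],
]
negative = [
    ["western", "blot", "antibody"],
    ["screen", "antibody", "lysate"],
]

scores = tf_rf_scores(positive, negative)
query = expand_query(["two", "hybrid"], scores, k=2)
print(query)
```

In this toy setup a term such as "bait", which appears in both positive passages and in no negative one, is ranked ahead of "screen", which also appears in a negative passage. An embedding-based variant would replace the tf.rf ranking with nearest neighbours of the method names in the embedding space.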
