Journal: Database

Integrating text mining into the MGI biocuration workflow


Abstract

A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language processing (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in the formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen ~1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we demonstrate the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.
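
The gene indexing task described in the abstract (find the mouse genes mentioned in an article, then associate their symbols with the article's reference number) can be illustrated with a minimal dictionary-based tagging sketch in Python. This is an assumption-laden toy, not the MGI pipeline and not how the Open Biomedical Annotator or ProMiner work internally: the three-entry dictionary, the `REF:` identifiers, and the function names `tag_genes` and `index_articles` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-dictionary (illustrative only): official symbol -> terms
# that should be normalized to that symbol.
GENE_DICTIONARY = {
    "Pax6": ["Pax6", "paired box 6"],
    "Trp53": ["Trp53", "p53"],
    "Shh": ["Shh", "sonic hedgehog"],
}


def build_pattern(dictionary):
    """Compile one regex that matches any dictionary term as a whole token."""
    term_to_symbol = {}
    for symbol, terms in dictionary.items():
        for term in terms:
            term_to_symbol[term.lower()] = symbol
    # Try longer terms first so multi-word names win over shorter overlaps.
    alternatives = sorted(term_to_symbol, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in alternatives)
    # The lookarounds reject matches embedded in longer alphanumeric tokens.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])", re.IGNORECASE
    )
    return pattern, term_to_symbol


def tag_genes(text, dictionary=GENE_DICTIONARY):
    """Return the sorted official symbols whose terms occur in the text."""
    pattern, term_to_symbol = build_pattern(dictionary)
    found = {term_to_symbol[m.group(1).lower()] for m in pattern.finditer(text)}
    return sorted(found)


def index_articles(articles, dictionary=GENE_DICTIONARY):
    """Map each article reference ID to the gene symbols tagged in its text."""
    return {ref_id: tag_genes(text, dictionary) for ref_id, text in articles.items()}


if __name__ == "__main__":
    # Hypothetical reference IDs and sentences, for demonstration only.
    articles = {
        "REF:0001": "We show that p53 regulates Shh expression in the limb bud.",
        "REF:0002": "Pax6 mutants were examined; Pax60 is not a dictionary term.",
    }
    for ref_id, symbols in index_articles(articles).items():
        print(ref_id, symbols)
```

Matching here is case-insensitive and purely lexical, so ambiguous synonyms and genes from other species would be tagged as well. That limitation is consistent with the abstract's expectation that such tools assist curators rather than replace them.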
