Journal of Bioinformatics and Computational Biology

Generation of a large gene/protein lexicon by morphological pattern analysis



Abstract

The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE® documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.
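
As a rough illustration of what grouping names by shared morphological features could look like, the sketch below assigns each name a coarse character-shape signature and collects names with the same signature into one class. The signature scheme, the function names `shape` and `group_by_shape`, and the sample names are assumptions made for this example; the abstract does not specify how the 193 classes were generated, and the ILP step is not shown. The last lines only re-derive the filter fraction from the two counts reported in the abstract.

```python
from collections import defaultdict


def shape(name: str) -> str:
    """Return a coarse shape signature (hypothetical feature, not the paper's).

    Uppercase letters map to "A", lowercase to "a", digits to "0"; any other
    character is kept as-is. Runs of the same symbol are collapsed.
    """
    out = []
    for ch in name:
        if ch.isupper():
            sym = "A"
        elif ch.islower():
            sym = "a"
        elif ch.isdigit():
            sym = "0"
        else:
            sym = ch
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def group_by_shape(names):
    """Group names into classes keyed by their shape signature."""
    classes = defaultdict(list)
    for n in names:
        classes[shape(n)].append(n)
    return dict(classes)


# Illustrative names only; not drawn from the lexicon itself.
names = ["BRCA1", "TP53", "p53", "interleukin-2", "cyclin-1"]
classes = group_by_shape(names)
for signature, members in classes.items():
    print(signature, members)

# Counts reported in the abstract: subset selected by the ILP criteria,
# and the number of names left after the false positive filter.
selected = 1_240_462
kept = 1_145_913
removed = selected - kept
print(f"removed {removed} names ({removed / selected:.1%} of the selected subset)")
```

In a setup like this, each signature class would then be handed to a learner that decides which members are gene/protein names. The final line shows that the reported counts correspond to roughly 7.6% of the selected subset being removed, which matches the 8% stated in the abstract after rounding.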
