Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics

Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set

Abstract

Automatically extracting chemical names from text has significant value for biomedical and life science research. A major barrier to this task is the difficulty of obtaining a sizable, good-quality data set to train a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and semantic components of chemical names follow a Zipfian distribution, as is the case in many natural languages.
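
The idea of generating labeled training documents from an incomplete dictionary can be illustrated with a minimal sketch. The abstract does not describe the authors' generator, so everything here is an assumption made for illustration: the function name `generate_document`, the `filler_words` list, the `p_chemical` mixing probability, uniform random sampling, and the BIO label scheme (`B-CHEM`, `I-CHEM`, `O`) are all hypothetical choices, not the method from the paper.

```python
import random


def generate_document(dictionary, filler_words, n_tokens=30, p_chemical=0.2, seed=None):
    """Build one synthetic document and its token-level BIO labels.

    Hypothetical sketch: at each step, either insert a chemical name drawn
    uniformly from the (incomplete) dictionary or a single filler word.
    Returns (tokens, labels) with at least n_tokens tokens.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    tokens, labels = [], []
    while len(tokens) < n_tokens:
        if rng.random() < p_chemical:
            # A dictionary entry may span several whitespace-separated tokens.
            name_tokens = rng.choice(dictionary).split()
            tokens.extend(name_tokens)
            labels.extend(["B-CHEM"] + ["I-CHEM"] * (len(name_tokens) - 1))
        else:
            tokens.append(rng.choice(filler_words))
            labels.append("O")
    return tokens, labels


if __name__ == "__main__":
    chemical_names = ["acetic acid", "ethanol", "sodium chloride"]
    filler = ["the", "sample", "was", "mixed", "with", "and", "heated"]
    doc_tokens, doc_labels = generate_document(chemical_names, filler, n_tokens=12, seed=0)
    for token, label in zip(doc_tokens, doc_labels):
        print(f"{token}\t{label}")
```

Because the labels are produced together with the text, no manual annotation is needed; the resulting token and label pairs could be fed to any sequence-labeling model. A realistic generator would presumably also need to produce names outside the dictionary and more natural surrounding text, which this sketch does not attempt.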
