Journal: ACS Omega

Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding

Abstract

In recent years, data-driven methods and artificial intelligence have been widely used in chemoinformatics and materials informatics, where success depends critically on the availability of large quantities of high-quality training data. A potential way to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulations. Compared with other domains where natural language processing techniques have succeeded, the chemical literature contains a large proportion of multiword phrases, which creates additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain that identifies multiword chemical terms and trains word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method effectively and accurately identifies multiword chemical terms from 119,166 chemical patents and preserves the semantic meaning of chemical phrases more robustly and precisely than the conventional approach, which represents the constituent single words first and combines them afterward. Because accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for using the large volume of chemical literature in future data-driven studies.
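
The abstract does not spell out how multiword terms are identified, so the following is only a minimal sketch of the general idea of phrase-level preprocessing, not the authors' method. It assumes a simple count-based collocation score and a toy corpus; the function name `merge_phrases`, the parameters `min_count` and `threshold`, and the example sentences are all hypothetical. Adjacent word pairs that co-occur more often than their individual frequencies would suggest are joined into one token, so a later embedding step would learn one vector for the whole phrase instead of combining vectors of its constituent words.

```python
from collections import Counter


def merge_phrases(sentences, min_count=2, threshold=1.0):
    """Greedily join adjacent word pairs with a high collocation score.

    Assumed score (illustrative only):
        (count(a, b) - min_count) * N / (count(a) * count(b))
    where N is the total number of tokens in the corpus.
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    total = sum(unigrams.values())

    def score(a, b):
        return (bigrams[(a, b)] - min_count) * total / (unigrams[a] * unigrams[b])

    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            # Merge the pair starting at i if it scores above the threshold.
            if i + 1 < len(s) and score(s[i], s[i + 1]) > threshold:
                out.append(s[i] + "_" + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["sodium", "chloride", "was", "added"],
    ["sodium", "chloride", "was", "dissolved"],
    ["the", "sodium", "chloride", "solution", "was", "stirred"],
    ["water", "was", "added"],
]
for sentence in merge_phrases(corpus):
    print(sentence)
```

The merged token lists could then be passed to any word-embedding trainer, which would treat `sodium_chloride` as a single vocabulary entry. Real chemical names often span more than two words, so a practical pipeline would likely repeat the merge pass or use domain-specific rules; that choice is outside what the abstract states.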
