Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding

Liyuan Huang; Chen Ling

Abstract

In recent years, data-driven methods and artificial intelligence have been widely used in cheminformatics and materials informatics, where success depends critically on the availability of large quantities of high-quality training data. One potential way to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulation. Compared with other domains where natural language processing techniques have established successes, the chemical literature contains a large proportion of multiword phrases, which create additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain that identifies multiword chemical terms and trains word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method effectively and accurately identifies multiword chemical terms from 119,166 chemical patents, and that it is more robust and precise in preserving the semantic meaning of chemical phrases than the conventional approach, which represents the constituent single words first and combines them afterward. Because the accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for utilizing the large volume of chemical literature in future data-driven studies.
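
The abstract does not spell out how multiword terms are identified, so the following is only a minimal sketch of the general idea of phrase-level preprocessing, not the authors' procedure. It assumes a generic count-based collocation score (discounted bigram count divided by the product of the unigram counts) as a stand-in for the identification step. The function name `merge_phrases`, the parameters `min_count`, `threshold`, and `delta`, and the toy corpus are all hypothetical.

```python
from collections import Counter


def merge_phrases(sentences, min_count=2, threshold=0.2, delta=1):
    """Join adjacent token pairs that co-occur often enough to count as a phrase.

    This is an assumed, generic collocation heuristic used for illustration;
    it is not the identification method of the paper.
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(
        (a, b) for sent in sentences for a, b in zip(sent, sent[1:])
    )

    def score(a, b):
        # Discounted co-occurrence score; delta penalizes rare pairs.
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                if bigrams[(a, b)] >= min_count and score(a, b) > threshold:
                    # Replace the pair with a single phrase-level token.
                    out.append(a + "_" + b)
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged


# Toy corpus (hypothetical), already tokenized and lowercased.
corpus = [
    ["the", "sodium", "chloride", "solution", "was", "heated"],
    ["sodium", "chloride", "was", "added", "to", "the", "mixture"],
    ["the", "mixture", "was", "stirred", "with", "acetic", "acid"],
    ["acetic", "acid", "was", "removed", "from", "the", "solution"],
]
phrased = merge_phrases(corpus)
print(phrased[0])
```

Once phrases such as `sodium_chloride` are single tokens, any word-embedding trainer run on `phrased` learns one vector per phrase directly, in contrast to the conventional approach of embedding the constituent words first and combining their vectors afterward.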

Bibliographic details

Source: ACS Omega, 2019, Issue 20, 10 pages
Authors: Liyuan Huang; Chen Ling
Original format: PDF
Chinese Library Classification: Chemistry
