Journal of the American Society for Information Science and Technology

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation


Abstract

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system processes unsegmented input text one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary should be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system, which means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue improving both the precision and the efficiency of our system in the future.
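To make the approach concrete, here is a minimal sketch of the kind of system the abstract describes: a beam search that scans unsegmented text one character at a time and consults a character-level language model to decide whether to insert a space. All names here (CharNGramLM, segment, the beam width, the add-one smoothing) are illustrative assumptions rather than the authors' implementation; the paper's system also operates at the byte level and can swap in a recurrent neural network as the language model, neither of which is shown.

```python
import math
from collections import defaultdict

class CharNGramLM:
    """Toy character-level n-gram language model with add-one smoothing.
    (Hypothetical stand-in for the paper's n-gram / RNN components.)"""

    def __init__(self, n=4):
        self.n = n
        self.counts = defaultdict(int)          # (context, char) -> count
        self.context_counts = defaultdict(int)  # context -> count
        self.vocab = set()

    def train(self, text):
        padded = "^" * (self.n - 1) + text
        for i in range(len(text)):
            context, char = padded[i:i + self.n - 1], padded[i + self.n - 1]
            self.counts[(context, char)] += 1
            self.context_counts[context] += 1
            self.vocab.add(char)

    def logprob(self, history, char):
        # Condition on the last n-1 characters, padding short histories.
        context = ("^" * (self.n - 1) + history)[-(self.n - 1):]
        num = self.counts[(context, char)] + 1
        den = self.context_counts[context] + len(self.vocab) + 1
        return math.log(num / den)

def segment(text, lm, beam_width=8):
    """Beam search over boundary decisions: each hypothesis is a segmented
    prefix; at every input character we extend it both with and without a
    preceding space, score the extension with the LM, and prune."""
    beam = [("", 0.0)]  # (segmented prefix, cumulative log-probability)
    for ch in text:
        candidates = []
        for prefix, score in beam:
            # No boundary before the very first character.
            options = [ch] if not prefix else [ch, " " + ch]
            for ext in options:
                new_prefix, s = prefix, score
                for c in ext:
                    s += lm.logprob(new_prefix, c)
                    new_prefix += c
                candidates.append((new_prefix, s))
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][0]
```

Because the language model is trained on ordinary space-delimited text, the hypothesis that inserts a space scores higher exactly where a boundary is plausible, so after lm.train(...) on text from a matching domain, a call like segment("theresultsshow", lm) can recover "the results show".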
