Journal of the American Society for Information Science and Technology

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation


Abstract

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system processes unsegmented input text one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary should be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system, which means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue improving both the precision and the efficiency of our system in the future.
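To make the approach concrete, here is a minimal sketch of the kind of system the abstract describes: a beam search that scans unsegmented text one character at a time and consults a character-level language model to decide whether to insert a space. All names here (CharNGramLM, segment, the beam width, the add-one smoothing) are illustrative assumptions rather than the authors' implementation; the paper's system also operates at the byte level and can swap in a recurrent neural network as the language model, neither of which is shown.

```python
import math
from collections import defaultdict

class CharNGramLM:
    """Toy character-level n-gram language model with add-one smoothing.
    (Hypothetical stand-in for the paper's n-gram / RNN components.)"""

    def __init__(self, n=4):
        self.n = n
        self.counts = defaultdict(int)          # (context, char) -> count
        self.context_counts = defaultdict(int)  # context -> count
        self.vocab = set()

    def train(self, text):
        padded = "^" * (self.n - 1) + text
        for i in range(len(text)):
            context, char = padded[i:i + self.n - 1], padded[i + self.n - 1]
            self.counts[(context, char)] += 1
            self.context_counts[context] += 1
            self.vocab.add(char)

    def logprob(self, history, char):
        # Condition on the last n-1 characters, padding short histories.
        context = ("^" * (self.n - 1) + history)[-(self.n - 1):]
        num = self.counts[(context, char)] + 1
        den = self.context_counts[context] + len(self.vocab) + 1
        return math.log(num / den)

def segment(text, lm, beam_width=8):
    """Beam search over boundary decisions: each hypothesis is a segmented
    prefix; at every input character we extend it both with and without a
    preceding space, score the extension with the LM, and prune."""
    beam = [("", 0.0)]  # (segmented prefix, cumulative log-probability)
    for ch in text:
        candidates = []
        for prefix, score in beam:
            # No boundary before the very first character.
            options = [ch] if not prefix else [ch, " " + ch]
            for ext in options:
                new_prefix, s = prefix, score
                for c in ext:
                    s += lm.logprob(new_prefix, c)
                    new_prefix += c
                candidates.append((new_prefix, s))
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][0]
```

Because the language model is trained on ordinary space-delimited text, the hypothesis that inserts a space scores higher exactly where a boundary is plausible, so after lm.train(...) on text from a matching domain, a call like segment("theresultsshow", lm) can recover "the results show".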
