Improving Lemmatization of Non-Standard Languages with Joint Learning

Abstract

Lemmatization of standard languages is concerned with (ⅰ) abstracting over morphological differences and (ⅱ) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (ⅲ): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture, which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state of the art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. We also test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than both a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.
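The joint training mentioned in the abstract can be pictured as optimizing a weighted sum of two losses: one for decoding the lemma character by character and one for language modeling on top of the shared sentence encoder. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `lm_weight` parameter, and its default value are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target indices.

    probs:   list of probability distributions (one list per step)
    targets: list of gold indices, one per step
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def joint_loss(lemma_probs, lemma_targets, lm_probs, lm_targets, lm_weight=0.2):
    """Combine the lemmatization loss with an auxiliary language-modeling loss.

    lm_weight is a hypothetical hyperparameter (assumption, not from the source)
    controlling how much the auxiliary task contributes.
    """
    lemma_nll = cross_entropy(lemma_probs, lemma_targets)  # lemma decoder loss
    lm_nll = cross_entropy(lm_probs, lm_targets)           # auxiliary LM loss
    return lemma_nll + lm_weight * lm_nll
```

Setting `lm_weight` to zero in this sketch recovers plain lemmatization training, which makes the contribution of the auxiliary task easy to isolate.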
