Journal: ACM Transactions on Asian Language Information Processing
Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages

Abstract

The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.
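
The idea of using semantic similarity as prior information for stem discovery can be illustrated with a minimal sketch. The abstract does not say how the prior is computed, so everything below is an assumption made for illustration: candidate stems are taken to be prefixes of the word form that have an embedding, cosine similarity is rescaled to [0, 1], and the scores are normalised into a distribution. The function names (`cosine`, `stem_prior`), the parameter `min_stem_len`, and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def stem_prior(word, embeddings, min_stem_len=2):
    """Return {candidate_stem: prior} over prefixes of `word` that have an embedding.

    Assumption (not from the paper): the prior is proportional to the
    cosine similarity between word form and candidate stem, rescaled to [0, 1].
    """
    word_vec = embeddings[word]
    scores = {}
    for end in range(min_stem_len, len(word) + 1):
        stem = word[:end]
        if stem in embeddings:
            # Shift cosine from [-1, 1] to [0, 1] so it can serve as a weight.
            scores[stem] = (cosine(word_vec, embeddings[stem]) + 1.0) / 2.0
    total = sum(scores.values())
    if total == 0.0:
        # Degenerate case: fall back to a uniform prior over the candidates.
        return {stem: 1.0 / len(scores) for stem in scores}
    return {stem: score / total for stem, score in scores.items()}


# Hypothetical toy embeddings for Turkish: "elmalar" (apples), "elma" (apple),
# and "el" (hand), which is a prefix of "elmalar" but semantically unrelated.
toy_embeddings = {
    "elmalar": [0.9, 0.1],
    "elma": [1.0, 0.0],
    "el": [0.0, 1.0],
}

prior = stem_prior("elmalar", toy_embeddings)
for stem, p in sorted(prior.items(), key=lambda item: -item[1]):
    print(f"{stem}: {p:.3f}")
```

With these toy vectors the semantically related candidate "elma" receives more prior mass than the unrelated prefix "el". The unsegmented word form is also a candidate and scores highest on similarity alone, which is why such a prior would be combined with the other model components (for example the tag, stem, and affix dependencies the abstract mentions) rather than used by itself.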