International Conference on Computational Linguistics

Unsupervised Deep Language and Dialect Identification for Short Texts


Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods, and our model has outperformed state-of-the-art LI and DI systems in supervised settings.
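
The abstract describes a model that jointly learns character-level attention, sentence embeddings, and cluster assignments. The snippet below is a minimal illustrative sketch of that kind of architecture in PyTorch, not the authors' UDLDI implementation: the module name CharAttentionClusterer, all layer sizes, the byte-level input encoding, and the Student-t soft-assignment step (borrowed from DEC-style clustering) are assumptions made only for illustration.

```python
# Minimal sketch (assumptions noted above): a character-level self-attention
# encoder that produces sentence embeddings and soft cluster assignments.
import torch
import torch.nn as nn

class CharAttentionClusterer(nn.Module):
    def __init__(self, vocab_size=256, emb_dim=64, n_heads=4, n_clusters=5):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Self-attention over character positions captures character
        # co-occurrence patterns that separate closely related languages.
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        # Learnable cluster centroids, one per (unknown) language/dialect.
        self.centroids = nn.Parameter(torch.randn(n_clusters, emb_dim))

    def forward(self, char_ids, pad_mask):
        x = self.char_emb(char_ids)                          # (B, L, D)
        h, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        # Mean-pool non-padding positions into a sentence embedding.
        keep = (~pad_mask).unsqueeze(-1).float()
        sent = (h * keep).sum(1) / keep.sum(1).clamp(min=1)
        # Student-t style soft assignment to centroids (DEC-like, an assumption).
        dist = torch.cdist(sent, self.centroids)             # (B, K)
        q = (1.0 + dist.pow(2)).pow(-1)
        q = q / q.sum(dim=1, keepdim=True)
        return sent, q

# Tiny usage example on byte-encoded short texts (hypothetical data).
texts = ["hola que tal", "ola que tal", "hello there"]
max_len = max(len(t) for t in texts)
ids = torch.zeros(len(texts), max_len, dtype=torch.long)
for i, t in enumerate(texts):
    ids[i, :len(t)] = torch.tensor(list(t.encode("utf-8")))
mask = ids == 0
model = CharAttentionClusterer()
emb, assign = model(ids, mask)
print(emb.shape, assign.shape)  # torch.Size([3, 64]) torch.Size([3, 5])
```

In a full unsupervised setup of this kind, the soft assignments q would typically be sharpened into a target distribution and the model trained to match it, so that embeddings and clusters are refined together; only the forward pass is shown here.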
