International Conference on Computational Linguistics

Unsupervised Deep Language and Dialect Identification for Short Texts


Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods, and our model has outperformed state-of-the-art LI and DI systems in supervised settings.
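
The abstract describes a model that jointly learns character-level attention, sentence embeddings, and cluster assignments. The snippet below is a minimal illustrative sketch of that kind of architecture in PyTorch, not the authors' UDLDI implementation: the module name CharAttentionClusterer, all layer sizes, the byte-level input encoding, and the Student-t soft-assignment step (borrowed from DEC-style clustering) are assumptions made only for illustration.

```python
# Minimal sketch (assumptions noted above): a character-level self-attention
# encoder that produces sentence embeddings and soft cluster assignments.
import torch
import torch.nn as nn

class CharAttentionClusterer(nn.Module):
    def __init__(self, vocab_size=256, emb_dim=64, n_heads=4, n_clusters=5):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Self-attention over character positions captures character
        # co-occurrence patterns that separate closely related languages.
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        # Learnable cluster centroids, one per (unknown) language/dialect.
        self.centroids = nn.Parameter(torch.randn(n_clusters, emb_dim))

    def forward(self, char_ids, pad_mask):
        x = self.char_emb(char_ids)                          # (B, L, D)
        h, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        # Mean-pool non-padding positions into a sentence embedding.
        keep = (~pad_mask).unsqueeze(-1).float()
        sent = (h * keep).sum(1) / keep.sum(1).clamp(min=1)
        # Student-t style soft assignment to centroids (DEC-like, an assumption).
        dist = torch.cdist(sent, self.centroids)             # (B, K)
        q = (1.0 + dist.pow(2)).pow(-1)
        q = q / q.sum(dim=1, keepdim=True)
        return sent, q

# Tiny usage example on byte-encoded short texts (hypothetical data).
texts = ["hola que tal", "ola que tal", "hello there"]
max_len = max(len(t) for t in texts)
ids = torch.zeros(len(texts), max_len, dtype=torch.long)
for i, t in enumerate(texts):
    ids[i, :len(t)] = torch.tensor(list(t.encode("utf-8")))
mask = ids == 0
model = CharAttentionClusterer()
emb, assign = model(ids, mask)
print(emb.shape, assign.shape)  # torch.Size([3, 64]) torch.Size([3, 5])
```

In a full unsupervised setup of this kind, the soft assignments q would typically be sharpened into a target distribution and the model trained to match it, so that embeddings and clusters are refined together; only the forward pass is shown here.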
