Computer speech and language

Multi-level embeddings for processing Arabic social media contents



Abstract

Embeddings are very popular representations that allow computing semantic and syntactic similarities between linguistic units from text co-occurrence matrices. Units range from character n-grams to words, and can include more coarse-grained units such as sentences and documents. Recently, multi-level embeddings combining representations from different units have been proposed as an alternative to single-level embeddings, in order to account for the internal structure of words (i.e., morphology) and to help systems generalise over out-of-vocabulary words. These representations, whether pre-trained or learned, have been shown to be quite effective, outperforming word-level baselines in several NLP tasks such as machine translation, part-of-speech tagging and named entity recognition. Our aim here is to contribute to this line of research by proposing, for the first time in Arabic NLP, an in-depth study of the impact of various subword configurations, ranging from characters to character n-grams (including words), for social media text classification. We propose several neural architectures to learn character, subword and word embeddings, as well as a combination of these three levels, exploring different composition functions to obtain the final representation of a given text. To evaluate the effectiveness of these representations, we perform extrinsic evaluations on three text classification tasks (sentiment analysis, emotion detection and irony detection) while accounting for different Arabic varieties (Modern Standard Arabic and the Levantine and Maghrebi dialects). For each task, we experiment with well-known dialect-agnostic and dialect-specific datasets, including those recently used in shared tasks, to better compare our results with those reported in previous studies on the same datasets.
The results show that the multi-level embeddings we propose outperform current static and contextualised embeddings, as well as the best-performing state-of-the-art models, in sentiment and emotion detection. In addition, we achieve competitive results in irony detection. Our models also perform best across dialects, although we observe that different dialects require different composition configurations. Finally, we show that performance tends to increase further when the multi-level representations are coupled with task-specific features.
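The multi-level idea described above — representing a token at the character, character-n-gram and word levels and composing these levels into one vector — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the lookup-table interface, the toy 4-dimensional vectors, and the choice of averaging within a level and concatenating across levels are all assumptions made for the example.

```python
# Hedged sketch of multi-level embedding composition (illustrative only).
# Each level's vectors are averaged, then the three levels are concatenated.

def char_ngrams(word, n):
    """All character n-grams of a word, with boundary markers < and >."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def multi_level_embedding(word, char_table, ngram_table, word_table, n=3, dim=4):
    """Concatenate three levels: mean character vectors, mean n-gram
    vectors, and the word vector.  An out-of-vocabulary word gets a zero
    word vector, but the subword levels still produce a representation --
    the motivation for multi-level embeddings."""
    zero = [0.0] * dim
    char_vecs = [char_table.get(c, zero) for c in word]
    ngram_vecs = [ngram_table.get(g, zero) for g in char_ngrams(word, n)]
    word_vec = word_table.get(word, zero)
    # List concatenation here plays the role of vector concatenation.
    return average(char_vecs) + average(ngram_vecs) + word_vec
```

In this sketch each level contributes a `dim`-sized slice, so the final representation has `3 * dim` components; a classifier for sentiment, emotion or irony detection would then consume these composed vectors.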