Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Machine Normalization: Bringing Social Media Text from Non-Standard to Standard Form

Abstract

User-generated text in social media communication (SMC) is mainly characterized by its non-standard form. It may contain code-switching (CS) text, a widespread phenomenon in SMC, as well as noisy elements typical of written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors are a barrier to text mining applications. Common text mining tools are designed for standard use of standard languages and cannot handle other forms, especially text written in social media. To overcome these problems, we present a solution for normalizing non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step is CS normalization, from multiple languages to one, through a machine translation-like approach. This step relies on a linguistic approach to CS that automatically identifies the translation source and target languages, without human intervention. The remaining processing operations are the normalization of SMC special expressions and the spelling correction of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach to word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end to text mining, enabling the analysis of noisy SMC text. The experiments conducted show that our system performs better than the baselines considered.
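The processing steps named in the abstract (normalization of SMC special expressions, spelling correction of out-of-vocabulary words, and CS normalization from several languages to one) can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not the authors' system: the lexicons, the bilingual dictionary, the abbreviation table, and all function names are hypothetical. The target language is chosen here by a simple majority count of tokens, which stands in for the linguistic approach to CS that the abstract mentions but does not detail. Word sense disambiguation is left out.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy resources; the abstract does not name the actual lexicons or tools.
LEXICON = {
    "en": {"i", "will", "see", "you", "tomorrow", "at", "the", "house", "please", "great"},
    "fr": {"je", "demain", "maison", "merci", "la"},
}
# Hypothetical bilingual dictionary, keyed by (source language, target language).
BILINGUAL = {
    ("fr", "en"): {"je": "i", "demain": "tomorrow", "maison": "house",
                   "merci": "thanks", "la": "the"},
}
# Hypothetical table of SMC special expressions (abbreviations and the like).
SMC_EXPRESSIONS = {"u": "you", "2moro": "tomorrow", "pls": "please", "gr8": "great"}


def expand_smc(tokens):
    """Replace known SMC special expressions with their standard form."""
    return [SMC_EXPRESSIONS.get(tok, tok) for tok in tokens]


def identify_language(token):
    """Return the first language whose lexicon contains the token, else None."""
    for lang, vocab in LEXICON.items():
        if token in vocab:
            return lang
    return None


def correct_spelling(token):
    """Map an out-of-vocabulary token to its closest lexicon entry, if any is close enough."""
    vocabulary = sorted(set().union(*LEXICON.values()))
    matches = get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token


def normalize(text):
    """Toy normalization: expand SMC expressions, fix spelling, then map all tokens to one language."""
    tokens = expand_smc(text.lower().split())
    tokens = [tok if identify_language(tok) else correct_spelling(tok) for tok in tokens]
    langs = [identify_language(tok) for tok in tokens]

    # Assumption: the majority language of the sentence is taken as the translation target.
    counts = Counter(lang for lang in langs if lang)
    if not counts:
        return " ".join(tokens)
    target = counts.most_common(1)[0][0]

    normalized = []
    for tok, lang in zip(tokens, langs):
        if lang and lang != target:
            # Word-by-word dictionary lookup; unknown words are kept unchanged.
            tok = BILINGUAL.get((lang, target), {}).get(tok, tok)
        normalized.append(tok)
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize("i will see u demain at la maison"))
```

A word-by-word lookup like this ignores word order and ambiguity, which is the gap the abstract's knowledge-based word sense translation disambiguation is meant to address.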