Journal: Information Processing & Management

Effective language identification of forum texts based on statistical approaches


Abstract

This investigation deals with language identification of noisy texts, which can be the first step of many natural language processing or information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance in practice is not convincing. In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. For the first, five language identification algorithms are proposed and implemented, namely: character based identification (CBA), word based identification (WBA), special characters based identification (SCA), a sequential hybrid algorithm (HA1) and a parallel hybrid algorithm (HA2). For the second, we use 11 similarity measures combined with several types of character N-grams. For evaluation, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, the proposed approaches are compared experimentally with reference language identification tools such as LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches outperform the baseline language identification methods on forum texts.
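
To make the nearest prototype idea concrete, here is a minimal sketch, not the authors' implementation. It assumes one character trigram profile per language as the prototype and cosine similarity as the measure; the abstract does not name its 11 similarity measures, so cosine is only a stand-in. The function names and the tiny training sentences are hypothetical and exist only for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased text."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_prototypes(samples, n=3):
    """Merge the labelled texts of each language into one n-gram profile."""
    prototypes = {}
    for lang, text in samples:
        prototypes.setdefault(lang, Counter()).update(char_ngrams(text, n))
    return prototypes


def identify(text, prototypes, n=3):
    """Return the language whose prototype is most similar to the text."""
    profile = char_ngrams(text, n)
    return max(prototypes, key=lambda lang: cosine(profile, prototypes[lang]))


# Hypothetical toy training data, far smaller than any realistic corpus.
SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog and this is the way that we think"),
    ("fr", "le renard brun saute par dessus le chien paresseux et c'est la vie que nous avons"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und das ist nicht der weg"),
]

PROTOTYPES = build_prototypes(SAMPLES)
```

With these toy prototypes, `identify("this is the dog", PROTOTYPES)` returns `"en"`. Swapping `cosine` for another similarity function, or changing `n`, gives other variants of the same scheme, which is presumably how several measures and N-gram types could be compared.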
