Venue: Workshop on multilingual and cross-lingual methods in NLP 2016

Leveraging Data-Driven Methods in Word-Level Language Identification for a Multilingual Alpine Heritage Corpus

Abstract

This paper presents a data-driven, simple cluster-and-label approach using optimized count-based methods for word-level language identification for a large domain-specific multilingual diachronic corpus of periodicals published at least yearly between 1864 and 2014 in Switzerland. Our system requires no annotated data or training, only minimal human effort in evaluating and labeling 50 clusters for a corpus of almost 40 million tokens. Despite being unsupervised, our results show an accuracy that is comparable to the corpus annotations which result from an existing code-switching algorithm and the combined usage of two supervised systems using character and byte n-gram models (Volk and Clematide, 2014).
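The general cluster-and-label idea can be illustrated with a minimal sketch. This is not the authors' system: the abstract does not say which count-based features or which clustering algorithm were used, so the neighbor-count vectors, the cosine-based k-means, the toy sentences, and every function name below are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
import math
import random


def context_vectors(sentences, window=1):
    """Count-based representation (assumed): neighbor counts per word type."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            ctx = vecs[word]  # ensures every word type gets an entry
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[sent[j]] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vecs, k, iters=10, seed=0):
    """Cosine-based k-means over word types; returns {word: cluster_id}."""
    rng = random.Random(seed)
    words = sorted(vecs)
    centroids = [Counter(vecs[w]) for w in rng.sample(words, k)]
    assign = {}
    for _ in range(iters):
        assign = {
            w: max(range(k), key=lambda c: cosine(vecs[w], centroids[c]))
            for w in words
        }
        for c in range(k):
            total = Counter()
            for w in words:
                if assign[w] == c:
                    total.update(vecs[w])
            if total:  # keep the old centroid if the cluster is empty
                centroids[c] = total
    return assign


def label_tokens(sentences, assign, cluster_labels):
    """Propagate the human-assigned cluster label to every token."""
    return [
        [(w, cluster_labels.get(assign.get(w), "unknown")) for w in sent]
        for sent in sentences
    ]


# Toy data (hypothetical); the real corpus has almost 40 million tokens.
sentences = [
    ["der", "berg", "ist", "hoch"],
    ["la", "montagne", "est", "haute"],
    ["der", "weg", "ist", "lang"],
    ["la", "route", "est", "longue"],
]
assign = kmeans(context_vectors(sentences), k=2)
# In practice a human inspects each cluster and names its language;
# the placeholder names here stand in for that manual step.
cluster_labels = {0: "cluster_0", 1: "cluster_1"}
tagged = label_tokens(sentences, assign, cluster_labels)
```

The point of the design is that the only manual work is naming the clusters (50 of them in the paper), after which each token inherits the label of its word type's cluster.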
