Leveraging Data-Driven Methods in Word-Level Language Identification for a Multilingual Alpine Heritage Corpus


Abstract

This paper presents a data-driven, simple cluster-and-label approach using optimized count-based methods for word-level language identification for a large domain-specific multilingual diachronic corpus of periodicals published at least yearly between 1864 and 2014 in Switzerland. Our system requires no annotated data or training, only minimal human effort in evaluating and labeling 50 clusters for a corpus of almost 40 million tokens. Despite being unsupervised, our results show an accuracy that is comparable to the corpus annotations which result from an existing code switching algorithm and the combined usage of two supervised systems using character and byte n-gram models (Volk and Clematide, 2014).
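
The cluster-and-label idea described above can be illustrated with a minimal sketch. The abstract does not specify the features or the clustering algorithm, so everything here is an assumption made for illustration: the toy corpus, windowed co-occurrence counts as features, farthest-first seeding, a basic cosine k-means, and the helper names `context_counts`, `cluster`, and `tag` are all hypothetical. The abstract reports 50 clusters; the toy example uses 2.

```python
import math
from collections import Counter, defaultdict


def context_counts(sentences, window=2):
    """For each word type, count the word types seen within `window` positions."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[word][sent[j]] += 1
    return dict(counts)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def farthest_first(vectors, k):
    """Deterministic seeding: start at the first type, then add the least similar one."""
    words = list(vectors)
    chosen = [words[0]]
    while len(chosen) < k:
        rest = [w for w in words if w not in chosen]
        chosen.append(
            min(rest, key=lambda w: max(cosine(vectors[w], vectors[c]) for c in chosen))
        )
    return chosen


def cluster(vectors, k, iterations=10):
    """Basic k-means with cosine similarity over the count vectors (an assumed clusterer)."""
    centroids = [Counter(vectors[w]) for w in farthest_first(vectors, k)]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            w: max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for w, vec in vectors.items()
        }
        for c in range(k):
            members = [vectors[w] for w, a in assignment.items() if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = sum(members, Counter())
    return assignment


def tag(tokens, assignment, labels):
    """Give each token the language label of its type's cluster ("unk" if unseen)."""
    return [(t, labels.get(assignment.get(t), "unk")) for t in tokens]


# Toy unannotated corpus (hypothetical data, not from the paper).
corpus = [
    ["der", "berg", "ist", "hoch"],
    ["der", "gletscher", "ist", "kalt"],
    ["la", "montagne", "est", "haute"],
    ["la", "neige", "est", "froide"],
]
assignment = cluster(context_counts(corpus), k=2)

# The only manual step: inspect each cluster and name its language.
labels = {assignment["berg"]: "de", assignment["neige"]: "fr"}

# Word-level tagging of a mixed-language token sequence.
tagged = tag(["der", "berg", "la", "neige"], assignment, labels)
```

On this toy corpus the two languages share no vocabulary, so the clusters separate cleanly and the mixed sequence is tagged word by word. Real data with shared tokens, names, and code-switched sentences would need more clusters and a more careful labeling pass.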


Bibliographic details

Source: Workshop on multilingual and cross-lingual methods in NLP 2016 | 2016 | pp. 45-54 | 10 pages
Conference location: San Diego, CA (US)
Author: Ada Wan
Affiliation: University of Zurich
Original format: PDF
Language of text: English

Similar literature


1. Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification [J]. Basu Joyanta, Khan Soma, Roy Rajib. Circuits, systems and signal processing. 2021, No. 10.
2. Towards an integrated second-language pedagogy for foreign and community/heritage languages in multilingual Britain [J]. Jim Anderson. Language Learning Journal. 2008, No. 1.
3. Multilingual awareness and heritage language education: children's multimodal representations of their multilingualism [J]. Melo-Pfeifer Silvia. Language Awareness. 2015, No. 3.
4. Leveraging Data-Driven Methods in Word-Level Language Identification for a Multilingual Alpine Heritage Corpus [C]. Ada Wan. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
5. Preservation of the Nom heritage: Keyboard input methods for presenting the Vietnamese Quoc Ngu' and Chu' Nom on multilingual Web pages [D]. LeDynh, Bot L. (Le Dinh Bot). 2004.
6. Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods [O]. Kokil Jaidka, Salvatore Giorgi, H. Andrew Schwartz. 2020.
7. Detecting Code-Switching in a Multilingual Alpine Heritage Corpus [O]. Martin Volk, Simon Clematide. 2015.
