Journal: Language Resources and Evaluation

Modeling under-resourced languages for speech recognition


Abstract

One particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required because the models should estimate the probability of all possible word sequences. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent in other morphologically rich languages. In this paper we present methods and evaluations on four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
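
To make the word-form problem concrete, here is a minimal Python sketch. It is not the method from the paper: the stems, the case endings, and the `segment` helper are hypothetical and chosen only for illustration. It shows how a word-level vocabulary grows with every stem and ending combination, while a subword inventory stays small.

```python
# Toy illustration only: the stems and endings below are a small hand-picked
# sample, not data, units, or an algorithm taken from the paper.
from itertools import product

stems = ["talo", "auto", "kirja"]                 # house, car, book
suffixes = ["", "ssa", "sta", "lla", "lta", "n"]  # a few Finnish case endings

# A word-level model treats every stem+ending combination as a separate word.
word_forms = {stem + suffix for stem, suffix in product(stems, suffixes)}

# A subword-level model only needs the stems and the non-empty endings as units.
subword_units = set(stems) | {"+" + s for s in suffixes if s}


def segment(word):
    """Split a word into [stem] or [stem, +ending] using the toy inventories."""
    for stem in stems:
        if word.startswith(stem):
            rest = word[len(stem):]
            if rest == "":
                return [stem]
            if rest in suffixes:
                return [stem, "+" + rest]
    return None  # the word is not covered by the toy inventories


print(len(word_forms), len(subword_units))  # 18 word forms vs. 8 subword units
print(segment("talossa"))                   # ['talo', '+ssa']
```

With only three stems and six endings the gap is already 18 to 8. Real Finnish or Estonian words can stack several endings, so the number of distinct word forms grows much faster than the number of units needed to build them.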