Computer Speech and Language

Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian


Abstract

We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
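
For reference, the class-based factorization that both the n-gram and the neural network models in the abstract rely on replaces the word history with a class history; a minimal statement, assuming a hard clustering c(·) that assigns each word to exactly one class (notation ours, not the paper's):

    P(w_i \mid w_{i-n+1}^{i-1}) \approx P\big(c(w_i) \mid c(w_{i-n+1}), \dots, c(w_{i-1})\big) \cdot P\big(w_i \mid c(w_i)\big)

The class n-gram models the context, while the membership term P(w_i | c(w_i)) recovers the word identity. With a few thousand classes in place of several million word types, the n-gram statistics become far denser and the output layer of a neural network model far cheaper to compute.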
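The merging, splitting and exchange procedures mentioned above refine the initial morphology-derived classes by greedily moving words between classes so as to increase the class bigram likelihood of the training data. Below is a minimal sketch of one exchange pass, assuming hard classes and in-memory bigram counts; the function and variable names are illustrative rather than taken from the paper, and a production implementation would update only the counts affected by a move instead of rescoring the whole model after every tentative reassignment:

    from collections import defaultdict
    from math import log

    def class_bigram_ll(bigram_counts, assign):
        # Class-dependent part of the bigram log-likelihood:
        #   sum_{c,d} N(c,d) log N(c,d)
        #   - sum_c N(c,*) log N(c,*) - sum_d N(*,d) log N(*,d)
        # The word unigram term is constant under reassignment and omitted.
        pair = defaultdict(int)   # N(c,d): class bigram counts
        left = defaultdict(int)   # N(c,*): class counts as predecessor
        right = defaultdict(int)  # N(*,d): class counts as successor
        for (w1, w2), n in bigram_counts.items():
            pair[(assign[w1], assign[w2])] += n
            left[assign[w1]] += n
            right[assign[w2]] += n
        return (sum(n * log(n) for n in pair.values())
                - sum(n * log(n) for n in left.values())
                - sum(n * log(n) for n in right.values()))

    def exchange_pass(vocab, classes, bigram_counts, assign):
        # One greedy pass: try moving each word to every other class and
        # keep the move that most improves the likelihood, if any does.
        for w in vocab:
            best_c = assign[w]
            best_ll = class_bigram_ll(bigram_counts, assign)
            for cand in classes:
                if cand == best_c:
                    continue
                assign[w] = cand
                ll = class_bigram_ll(bigram_counts, assign)
                if ll > best_ll:
                    best_c, best_ll = cand, ll
            assign[w] = best_c
        return assign

In this reading, the morphological analyzer's classes serve as the initial assignment, the "suitable constraints" mentioned in the abstract restrict which moves are legal, and passes are repeated until no single move improves the likelihood.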
