Computer Speech and Language

Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian


Abstract

We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
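
For reference, the class-based factorization that both the n-gram and the neural network models in the abstract rely on replaces the word history with a class history; a minimal statement, assuming a hard clustering c(·) that assigns each word to exactly one class (notation ours, not the paper's):

    P(w_i \mid w_{i-n+1}^{i-1}) \approx P\big(c(w_i) \mid c(w_{i-n+1}), \dots, c(w_{i-1})\big) \cdot P\big(w_i \mid c(w_i)\big)

The class n-gram models the context, while the membership term P(w_i | c(w_i)) recovers the word identity. With a few thousand classes in place of several million word types, the n-gram statistics become far denser and the output layer of a neural network model far cheaper to compute.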
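The merging, splitting and exchange procedures mentioned above refine the initial morphology-derived classes by greedily moving words between classes so as to increase the class bigram likelihood of the training data. Below is a minimal sketch of one exchange pass, assuming hard classes and in-memory bigram counts; the function and variable names are illustrative rather than taken from the paper, and a production implementation would update only the counts affected by a move instead of rescoring the whole model after every tentative reassignment:

    from collections import defaultdict
    from math import log

    def class_bigram_ll(bigram_counts, assign):
        # Class-dependent part of the bigram log-likelihood:
        #   sum_{c,d} N(c,d) log N(c,d)
        #   - sum_c N(c,*) log N(c,*) - sum_d N(*,d) log N(*,d)
        # The word unigram term is constant under reassignment and omitted.
        pair = defaultdict(int)   # N(c,d): class bigram counts
        left = defaultdict(int)   # N(c,*): class counts as predecessor
        right = defaultdict(int)  # N(*,d): class counts as successor
        for (w1, w2), n in bigram_counts.items():
            pair[(assign[w1], assign[w2])] += n
            left[assign[w1]] += n
            right[assign[w2]] += n
        return (sum(n * log(n) for n in pair.values())
                - sum(n * log(n) for n in left.values())
                - sum(n * log(n) for n in right.values()))

    def exchange_pass(vocab, classes, bigram_counts, assign):
        # One greedy pass: try moving each word to every other class and
        # keep the move that most improves the likelihood, if any does.
        for w in vocab:
            best_c = assign[w]
            best_ll = class_bigram_ll(bigram_counts, assign)
            for cand in classes:
                if cand == best_c:
                    continue
                assign[w] = cand
                ll = class_bigram_ll(bigram_counts, assign)
                if ll > best_ll:
                    best_c, best_ll = cand, ll
            assign[w] = best_c
        return assign

In this reading, the morphological analyzer's classes serve as the initial assignment, the "suitable constraints" mentioned in the abstract restrict which moves are legal, and passes are repeated until no single move improves the likelihood.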
