Building Statistical Language Models for Persian Continuous Speech Recognition Systems Using the Peykare Corpus

Mohammad Bahrani; Hossein Sameti

Abstract

In this paper, we build statistical language models for Persian using the Peykare corpus and incorporate them into a Persian continuous speech recognition (CSR) system. First, we unify the different orthographies of words to make the corpus text consistent, and we reduce the number of part-of-speech (POS) tags used in the corpus by manual clustering. We then build word-based and class-based n-gram language models from the unified, reduced-tag-set corpus. For the class-based models we use several methods, including a new method called LGM-based word clustering. We also present the procedure for incorporating the language models into a Persian CSR system. With these language models, we achieved absolute reductions of up to 13.2% in word error rate.
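
The abstract does not give the model equations, so the following is only a minimal sketch of the standard class-based bigram factorization, P(w | w_prev) = P(c | c_prev) * P(w | c), and not the authors' implementation. The toy English sentences, the three class labels, and the function names `train_class_bigram` and `class_bigram_prob` are all assumptions made for illustration. The sketch uses unsmoothed maximum-likelihood counts and does not attempt the LGM-based clustering named above.

```python
from collections import Counter

def train_class_bigram(sentences, word_to_class):
    """Collect the counts needed for an unsmoothed class-based bigram model."""
    class_bigrams = Counter()  # count(c_prev, c)
    class_context = Counter()  # count(c_prev) when used as a bigram history
    class_totals = Counter()   # count(c) over all tokens
    word_counts = Counter()    # count(w) over all tokens
    for sentence in sentences:
        classes = [word_to_class[w] for w in sentence]
        for w, c in zip(sentence, classes):
            word_counts[w] += 1
            class_totals[c] += 1
        for c_prev, c in zip(classes, classes[1:]):
            class_bigrams[(c_prev, c)] += 1
            class_context[c_prev] += 1
    return class_bigrams, class_context, class_totals, word_counts

def class_bigram_prob(w_prev, w, model, word_to_class):
    """Return P(c | c_prev) * P(w | c), with each word in exactly one class."""
    class_bigrams, class_context, class_totals, word_counts = model
    c_prev, c = word_to_class[w_prev], word_to_class[w]
    if class_context[c_prev] == 0 or class_totals[c] == 0:
        return 0.0  # no smoothing in this sketch
    p_class = class_bigrams[(c_prev, c)] / class_context[c_prev]
    p_word = word_counts[w] / class_totals[c]
    return p_class * p_word

# Hypothetical toy data: placeholder words with hand-assigned classes.
word_to_class = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN",
    "sleeps": "VERB", "barks": "VERB",
}
sentences = [
    ["the", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
    ["the", "dog", "barks"],
]
model = train_class_bigram(sentences, word_to_class)

# "a cat" never occurs in the toy data, yet it receives a nonzero
# probability because DET -> NOUN transitions were observed.
p_a_cat = class_bigram_prob("a", "cat", model, word_to_class)
print(p_a_cat)
```

The point of the factorization is that word pairs unseen in training can still receive probability mass through their classes, which is one common motivation for class-based models when the training corpus is small relative to the vocabulary.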


Bibliographic details

Source: International Journal of Computer Processing of Languages, 2011, No. 1, pp. 1-20 (20 pages)

Authors: Mohammad Bahrani; Hossein Sameti

Affiliation (both authors): Speech Processing Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Original format: PDF
Language: English
Keywords: statistical language models; continuous speech recognition; Peykare corpus; Persian language

