International Journal of Speech Technology

Deep neural network training for whispered speech recognition using small databases and generative model sampling

Abstract

State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. DNN-HMM hybrid systems have been shown to outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications, an improvement mainly attributed to the ability of DNNs to model more complex data structures. However, a sufficient number of data samples is a key requirement for training a high-accuracy DNN as a discriminative model, and this barrier makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method that produces a large amount of pseudo-samples while requiring only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distribution. Next, random sampling is used to generate a large number of pseudo-samples from the UBM. Frame-Shuffling is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences so that they better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN-HMM acoustic model of a speech recognizer. The proposed method is evaluated on small sets of neutral and whispered speech drawn from the UT-Vocal Effort II corpus. Incorporating the generated pseudo-samples in the training process considerably reduces the phoneme error rates (PERs) of a DNN-HMM based speech recognizer, yielding relative PER improvements of 79.0% and 45.6% for the neutral-neutral and whisper-whisper training/test scenarios, respectively.
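The abstract's pipeline (UBM fitting, random sampling, temporal smoothing, pooling with the real data) is concrete enough to sketch. The following Python sketch is a hypothetical illustration, not the authors' implementation: scikit-learn's GaussianMixture stands in for the UBM, the function names (train_ubm, sample_pseudo_frames, smooth_trajectories) are invented for this example, and a moving-average filter stands in for the paper's Frame-Shuffling step, whose exact procedure the abstract does not spell out.

```python
# Minimal sketch of UBM-based pseudo-sample generation, assuming cepstral
# feature matrices of shape (num_frames, num_ceps), e.g. 13-dim MFCCs.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    # Fit a diagonal-covariance GMM on the pooled training frames; this plays
    # the role of the universal background model (UBM), i.e. a parametric
    # estimate of the feature distribution.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(features)
    return ubm

def sample_pseudo_frames(ubm, n_frames, seed=0):
    # Draw random pseudo-frames from the UBM. sklearn returns samples grouped
    # by mixture component, so shuffle them into a random order; the result
    # matches the overall feature distribution but has no temporal structure.
    frames, _ = ubm.sample(n_frames)
    rng = np.random.default_rng(seed)
    rng.shuffle(frames)
    return frames

def smooth_trajectories(frames, window=5):
    # Moving-average smoothing of each cepstral trajectory along time, so
    # consecutive pseudo-frames vary gradually as in natural speech. This is
    # only a stand-in for the paper's Frame-Shuffling step.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda traj: np.convolve(traj, kernel, mode="same"), axis=0, arr=frames)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_feats = rng.normal(size=(2000, 13))     # placeholder for real MFCCs
    ubm = train_ubm(real_feats)
    pseudo = smooth_trajectories(sample_pseudo_frames(ubm, 10_000))
    augmented = np.vstack([real_feats, pseudo])  # pool for DNN-HMM training
    print(augmented.shape)                       # (12000, 13)
```

In the paper's setting, the augmented pool at the end would feed the DNN-HMM acoustic model training; the final vstack here merely marks where that pooling happens.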
