International Journal of Speech Technology

Deep neural network training for whispered speech recognition using small databases and generative model sampling

Abstract

State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. DNN-HMM hybrid systems have been shown to outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications, an improvement mainly attributed to the ability of DNNs to model more complex data structures. However, a sufficient number of data samples is a key requirement for training a high-accuracy DNN as a discriminative model, and this barrier makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method that produces a large amount of pseudo-samples while requiring only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distribution. Next, random sampling is used to generate a large number of pseudo-samples from the UBM. Frame-Shuffling is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences so that they better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN-HMM acoustic model of a speech recognizer. The proposed method is evaluated on small sets of neutral and whispered speech drawn from the UT-Vocal Effort II corpus. Incorporating the generated pseudo-samples in the training process considerably reduces the phoneme error rates (PERs) of a DNN-HMM based speech recognizer, yielding relative PER improvements of 79.0% and 45.6% for the neutral-neutral and whisper-whisper training/test scenarios, respectively.
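The abstract's pipeline (UBM fitting, random sampling, temporal smoothing, pooling with the real data) is concrete enough to sketch. The following Python sketch is a hypothetical illustration, not the authors' implementation: scikit-learn's GaussianMixture stands in for the UBM, the function names (train_ubm, sample_pseudo_frames, smooth_trajectories) are invented for this example, and a moving-average filter stands in for the paper's Frame-Shuffling step, whose exact procedure the abstract does not spell out.

```python
# Minimal sketch of UBM-based pseudo-sample generation, assuming cepstral
# feature matrices of shape (num_frames, num_ceps), e.g. 13-dim MFCCs.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    # Fit a diagonal-covariance GMM on the pooled training frames; this plays
    # the role of the universal background model (UBM), i.e. a parametric
    # estimate of the feature distribution.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(features)
    return ubm

def sample_pseudo_frames(ubm, n_frames, seed=0):
    # Draw random pseudo-frames from the UBM. sklearn returns samples grouped
    # by mixture component, so shuffle them into a random order; the result
    # matches the overall feature distribution but has no temporal structure.
    frames, _ = ubm.sample(n_frames)
    rng = np.random.default_rng(seed)
    rng.shuffle(frames)
    return frames

def smooth_trajectories(frames, window=5):
    # Moving-average smoothing of each cepstral trajectory along time, so
    # consecutive pseudo-frames vary gradually as in natural speech. This is
    # only a stand-in for the paper's Frame-Shuffling step.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda traj: np.convolve(traj, kernel, mode="same"), axis=0, arr=frames)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_feats = rng.normal(size=(2000, 13))     # placeholder for real MFCCs
    ubm = train_ubm(real_feats)
    pseudo = smooth_trajectories(sample_pseudo_frames(ubm, 10_000))
    augmented = np.vstack([real_feats, pseudo])  # pool for DNN-HMM training
    print(augmented.shape)                       # (12000, 13)
```

In the paper's setting, the augmented pool at the end would feed the DNN-HMM acoustic model training; the final vstack here merely marks where that pooling happens.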
