Journal: IEICE Transactions on Information and Systems
Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation

Abstract

Deep neural networks (DNNs) have achieved significant success in automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation with limited data remains a major challenge for DNN-based systems because of their enormous number of free parameters. In this paper, we propose a filterbank-incorporated DNN that combines a filterbank layer, which represents the filter shapes and center frequencies, with a DNN-based acoustic model. Whereas most systems use pre-defined mel-scale filterbank features as the input acoustic features to DNNs, the filterbank layer and the subsequent layers of the proposed model are trained jointly, exploiting the advantages of hierarchical feature extraction. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing the number of parameters. Optimizing one type of parameter corresponds to Vocal Tract Length Normalization (VTLN), and optimizing another type corresponds to feature-space Maximum Likelihood Linear Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Because the filterbank layer consists of only a few parameters, it is well suited to adaptation with limited data. In experiments, filterbank-incorporated DNNs proved effective for speaker and gender adaptation with limited adaptation data. Experimental results on the CSJ task show that adapting the proposed model with 10 utterances yields a 5.8% word error reduction ratio relative to the unadapted model.
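To make the idea of a parameterized filterbank layer more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the filter shape or the exact parameterization, so the Gaussian shape, the three per-filter parameters (`centers`, `widths`, `gains`), the mel-scale initialization, and the `warp_centers` helper are all illustrative assumptions. The sketch only shows the forward computation and why such a layer has very few parameters; joint training with the DNN is omitted.

```python
import math

def mel(f_hz):
    """Convert frequency in Hz to the mel scale (a common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def inv_mel(m):
    """Convert a mel-scale value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def init_mel_centers(num_filters, f_max_hz):
    """Center frequencies equally spaced on the mel scale (assumed initialization)."""
    step = mel(f_max_hz) / (num_filters + 1)
    return [inv_mel(step * (i + 1)) for i in range(num_filters)]

def filterbank_layer(power_spectrum, bin_hz, centers, widths, gains):
    """Log filterbank outputs using Gaussian-shaped filters (shape is an assumption).

    Each filter has three parameters: center frequency, bandwidth, and gain.
    """
    outputs = []
    for c, w, g in zip(centers, widths, gains):
        acc = 0.0
        for k, p in enumerate(power_spectrum):
            f = k * bin_hz  # frequency of spectral bin k
            acc += g * math.exp(-0.5 * ((f - c) / w) ** 2) * p
        outputs.append(math.log(acc + 1e-10))  # small floor avoids log(0)
    return outputs

def warp_centers(centers, alpha):
    """Hypothetical helper: scale all center frequencies by one warp factor."""
    return [alpha * c for c in centers]

# Toy usage: 4 filters over a flat 65-bin spectrum covering 0 to 8 kHz.
num_filters = 4
spectrum = [1.0] * 65
bin_hz = 125.0
centers = init_mel_centers(num_filters, 8000.0)
widths = [200.0] * num_filters
gains = [1.0] * num_filters

features = filterbank_layer(spectrum, bin_hz, centers, widths, gains)
num_params = len(centers) + len(widths) + len(gains)  # 3 parameters per filter
```

In this sketch the whole layer has only `3 * num_filters` adaptable parameters, which illustrates why adapting such a layer can be feasible with a handful of utterances. Scaling the center frequencies, as `warp_centers` does, rescales the frequency axis in a way loosely reminiscent of VTLN; the abstract does not state which parameter type the paper associates with which technique.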
