Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference


Abstract

A new speaker-adaptation technique for deep neural network (DNN)-based speech synthesis, which requires only speech data without orthographic transcriptions, is proposed. The technique is based on a DNN speech-synthesis model that takes speaker, gender, and age as additional inputs and outputs the acoustic parameters of the corresponding voices from text, so that a multi-speaker model can be constructed and speaker adaptation performed. It uses a new input code that represents acoustic similarity to each of the training speakers as a probability. This input code, called a "speaker-similarity vector," is obtained by concatenating the posterior probabilities calculated from a model of each training speaker. GMM-UBM or i-vector/PLDA, which are widely used in text-independent speaker verification, represent the speaker models, since they can be used without text information. Text and the speaker-similarity vectors of the training speakers are first used as input to train a multi-speaker speech-synthesis model that outputs the acoustic parameters of the training speakers. A speaker-similarity vector for an unknown target speaker is then estimated from a small amount of that speaker's speech data on the basis of the separately trained speaker models. Inputting the estimated vector into the multi-speaker model is expected to generate synthetic speech that resembles the target speaker's voice. In objective and subjective experiments, the adaptation performance of the proposed technique was evaluated using not only studio-quality adaptation data but also low-quality (i.e., noisy and reverberant) data. The results indicate that the proposed technique makes it possible to rapidly construct a voice for the target speaker in DNN-based speech synthesis.
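The key object in the abstract is the speaker-similarity vector: per-speaker posterior probabilities computed from text-independent speaker models and concatenated into an input code. Below is a minimal Python sketch of that idea, assuming per-speaker GMMs fit directly with scikit-learn stand in for the MAP-adapted GMM-UBM (or i-vector/PLDA) models used in the paper; all function names, feature dimensions, and data here are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_speaker_models(features_per_speaker, n_components=16, seed=0):
    """Fit one GMM per training speaker on that speaker's acoustic
    features (e.g., MFCC frames). The paper MAP-adapts speaker models
    from a UBM; independent GMMs are a simplification for illustration."""
    models = []
    for feats in features_per_speaker:            # feats: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models.append(gmm)
    return models

def speaker_similarity_vector(utterance_feats, speaker_models):
    """Concatenated posterior over training speakers, assuming a uniform
    speaker prior: a softmax over average frame log-likelihoods, which is
    a common score-normalization heuristic, not the paper's exact recipe."""
    loglikes = np.array([gmm.score(utterance_feats) for gmm in speaker_models])
    loglikes -= loglikes.max()                    # numerical stability
    posteriors = np.exp(loglikes)
    return posteriors / posteriors.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for per-speaker MFCCs: 3 speakers, 500 frames, 13 dims
    train_feats = [rng.normal(loc=i, size=(500, 13)) for i in range(3)]
    models = fit_speaker_models(train_feats)
    # A few seconds of untranscribed speech from an unseen target speaker
    target_feats = rng.normal(loc=0.4, size=(200, 13))
    sim_vec = speaker_similarity_vector(target_feats, models)
    print("speaker-similarity vector:", np.round(sim_vec, 3))
    # This vector would be concatenated with linguistic features from the
    # input text and fed to the multi-speaker DNN synthesis model.
```

Because the similarity vector is estimated from untranscribed audio alone, the same function serves both training (vectors for known speakers) and adaptation (a vector for the unseen target speaker), which is what makes the adaptation unsupervised.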
