Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference


Abstract

A new speaker-adaptation technique for deep neural network (DNN)-based speech synthesis, which requires only speech data without orthographic transcriptions, is proposed. The technique is based on a DNN speech-synthesis model that takes speaker, gender, and age as additional inputs and outputs the acoustic parameters of the corresponding voices from text, so that a multi-speaker model can be constructed and speaker adaptation performed. It uses a new input code that represents acoustic similarity to each of the training speakers as a probability. This input code, called a "speaker-similarity vector," is obtained by concatenating the posterior probabilities calculated from a model of each training speaker. GMM-UBM or i-vector/PLDA, which are widely used in text-independent speaker verification, represent the speaker models, since they can be used without text information. Text and the speaker-similarity vectors of the training speakers are first used as input to train a multi-speaker speech-synthesis model that outputs the acoustic parameters of the training speakers. A speaker-similarity vector for an unknown target speaker is then estimated from a small amount of that speaker's speech data on the basis of the separately trained speaker models. Inputting the estimated vector into the multi-speaker model is expected to generate synthetic speech that resembles the target speaker's voice. In objective and subjective experiments, the adaptation performance of the proposed technique was evaluated using not only studio-quality adaptation data but also low-quality (i.e., noisy and reverberant) data. The results indicate that the proposed technique makes it possible to rapidly construct a voice for the target speaker in DNN-based speech synthesis.
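The key object in the abstract is the speaker-similarity vector: per-speaker posterior probabilities computed from text-independent speaker models and concatenated into an input code. Below is a minimal Python sketch of that idea, assuming per-speaker GMMs fit directly with scikit-learn stand in for the MAP-adapted GMM-UBM (or i-vector/PLDA) models used in the paper; all function names, feature dimensions, and data here are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_speaker_models(features_per_speaker, n_components=16, seed=0):
    """Fit one GMM per training speaker on that speaker's acoustic
    features (e.g., MFCC frames). The paper MAP-adapts speaker models
    from a UBM; independent GMMs are a simplification for illustration."""
    models = []
    for feats in features_per_speaker:            # feats: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models.append(gmm)
    return models

def speaker_similarity_vector(utterance_feats, speaker_models):
    """Concatenated posterior over training speakers, assuming a uniform
    speaker prior: a softmax over average frame log-likelihoods, which is
    a common score-normalization heuristic, not the paper's exact recipe."""
    loglikes = np.array([gmm.score(utterance_feats) for gmm in speaker_models])
    loglikes -= loglikes.max()                    # numerical stability
    posteriors = np.exp(loglikes)
    return posteriors / posteriors.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for per-speaker MFCCs: 3 speakers, 500 frames, 13 dims
    train_feats = [rng.normal(loc=i, size=(500, 13)) for i in range(3)]
    models = fit_speaker_models(train_feats)
    # A few seconds of untranscribed speech from an unseen target speaker
    target_feats = rng.normal(loc=0.4, size=(200, 13))
    sim_vec = speaker_similarity_vector(target_feats, models)
    print("speaker-similarity vector:", np.round(sim_vec, 3))
    # This vector would be concatenated with linguistic features from the
    # input text and fed to the multi-speaker DNN synthesis model.
```

Because the similarity vector is estimated from untranscribed audio alone, the same function serves both training (vectors for known speakers) and adaptation (a vector for the unseen target speaker), which is what makes the adaptation unsupervised.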
