Prosodic Clustering for Phoneme-Level Prosody Control in End-to-End Speech Synthesis


Abstract

This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
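The discretization and note-substitution steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "unsupervised clustering", so the use of 1-D k-means here is an assumption, as are the function names (`kmeans_1d`, `assign_labels`, `nearest_note`), the cluster count, and the made-up F0 values. The note mapping assumes equal temperament with A4 = 440 Hz.

```python
import math

def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means (Lloyd's algorithm); k-means itself is an assumption."""
    ordered = sorted(values)
    n = len(ordered)
    # Start from evenly spaced quantiles so the result is deterministic.
    centroids = [ordered[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[idx].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(b) / len(b) if b else centroids[j] for j, b in enumerate(buckets)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def assign_labels(values, centroids):
    """Map each value to the index of its nearest centroid (the prosodic label)."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j])) for v in values]

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(f0_hz):
    """Return (name, frequency in Hz) of the equal-tempered note closest to f0_hz."""
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    return name, 440.0 * 2 ** ((midi - 69) / 12)

# Hypothetical per-phoneme mean F0 values in Hz (illustrative, not from the paper).
phoneme_f0 = [98.0, 102.0, 110.0, 131.0, 128.0, 135.0, 164.0, 170.0, 160.0]

centroids = kmeans_1d(phoneme_f0, k=3)          # one centroid per F0 cluster
labels = assign_labels(phoneme_f0, centroids)   # discrete label per phoneme
notes = [nearest_note(c) for c in centroids]    # centroids snapped to musical notes

print(labels)
print([name for name, _ in notes])
```

In this sketch the label sequence plays the role of the prosodic label sequence that the abstract says is fed alongside the phoneme sequence, and the same clustering could be applied separately to phoneme durations. Snapping each centroid to its nearest note is one simple way to read "replacing the F0 cluster centroids with musical notes"; the paper may choose notes differently.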


Bibliographic record

Source
IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 5719-5723 (5 pages)
Authors
Alexandra Vioni; Myrsini Christidou; Nikolaos Ellinas; Georgios Vamvoukakis; Panos Kakoulidis; Taehoon Kim; June Sig Sung; Hyoungmin Park; Aimilios Chalamandaris; Pirros Tsiakoulis
Original format
PDF
Keywords
Training; Signal processing algorithms; Music; Signal processing; Feature extraction; Control systems; Encoding

