CAMNet: A controllable acoustic model for efficient, expressive, high-quality text-to-speech

Alvarez Jesus Monge; Francois Holly; Sung Hosang; Choi Seungdo; Jeong Jonghoon; Choo Kihyun; Min Kyoungbo; Park Sangjun

Abstract

Spoken language is becoming one of the key components of human-machine interaction, both to send information to the machine (e.g. voice control) and to receive information from it (e.g. virtual assistants). In this scenario, text-to-speech (TTS) models have become an essential artificial intelligence capability. Even though this interaction can be based on neutral-style speech, generating speech with different styles, pitches and speaking rates may improve user experience. With this in view, this paper presents CAMNet, a controllable acoustic model for efficient, expressive, high-quality TTS. CAMNet is based on deep convolutional TTS (DCTTS), a state-of-the-art acoustic model which is efficient and produces neutral speech. DCTTS was first adapted to generate Bark cepstrum acoustic features in order to integrate well with the LPCNet (linear prediction coefficient) neural vocoder and to remove the reduction factor, which demanded the presence of an upsampling network before the vocoder; i.e. the CAMNet output can be fed directly into LPCNet. Next, style transfer functionality was added by means of a novel characterisation of the prosodic information from the Bark cepstrum acoustic features and a new approach to inject this information into the convolutional layers. Finally, controllability is provided via a variational autoencoder module which creates a smoothed, disentangled latent space that allows interpolation and extrapolation of reference styles as well as independent and simultaneous control of two generative factors: pitch and speaking rate. Moreover, this controllability is implemented using a simple offset-based approach. To sum up, CAMNet is an efficient acoustic model which provides simple but consistent controllability over coarse-grained expression, pitch and speaking rate while still providing high-quality synthesised speech. © 2021 Elsevier Ltd. All rights reserved.
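
The abstract mentions interpolation and extrapolation of reference styles in a latent space, plus offset-based control of pitch and speaking rate. The snippet below is only a minimal sketch of what such operations could look like on plain vectors; it is not the authors' implementation. The function names `interpolate` and `apply_offset`, the three-dimensional latent vectors, and the assignment of pitch and speaking rate to dimensions 0 and 1 are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

# Hypothetical layout: which latent dimension controls which factor is an
# assumption for this sketch, not something stated in the abstract.
PITCH_DIM = 0
RATE_DIM = 1


def interpolate(z_a: Sequence[float], z_b: Sequence[float], alpha: float) -> List[float]:
    """Linearly blend two latent style vectors.

    alpha in [0, 1] interpolates between z_a and z_b;
    alpha outside that range extrapolates beyond them.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def apply_offset(z: Sequence[float], dim: int, offset: float) -> List[float]:
    """Return a copy of z with a scalar offset added to one dimension."""
    out = list(z)
    out[dim] += offset
    return out


# Made-up latent codes standing in for two reference styles.
neutral = [0.0, 0.0, 0.5]
expressive = [1.0, 2.0, 0.5]

halfway = interpolate(neutral, expressive, 0.5)      # between the two styles
exaggerated = interpolate(neutral, expressive, 1.5)  # beyond the second style

# Shift the two assumed factor dimensions independently, in one pass.
adjusted = apply_offset(apply_offset(halfway, PITCH_DIM, 0.25), RATE_DIM, -0.5)
```

Because each offset touches a single dimension, the two adjustments do not interfere with each other in this toy setting; whether a trained model behaves this cleanly depends on how well its latent space is disentangled.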

Bibliographic record

Source
Applied Acoustics | 2022, Issue 1 | 108439.1-108439.11 | 11 pages
Authors
Alvarez Jesus Monge; Francois Holly; Sung Hosang; Choi Seungdo; Jeong Jonghoon; Choo Kihyun; Min Kyoungbo; Park Sangjun
Author affiliations

Samsung Res UK, Commun House,South St, Staines Upon Thames TW18 4QE, England;

Samsung Res, Speech Proc Lab, 56 Seongchon Gil, Seoul, South Korea;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
Text-to-speech; Expressive TTS; Acoustic model; VAE; Disentanglement; Speech synthesis;
