Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Abstract

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper we investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English, along with comparisons against deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems here also use neural autoregressive (AR) probabilistic modeling and a neural vocoder, in the same way as the sequence-to-sequence systems do, for a fair and in-depth analysis. We investigated the systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see whether our findings generalize across languages. Our experiments on Japanese demonstrated that the Tacotron TTS systems with increased parameter size and input of phonemes and accentual-type labels outperformed the DNN-based pipeline systems using the complicated linguistic features, and that their encoder could learn to compensate for the lack of rich linguistic features. Our experiments on English demonstrated that, when using a suitable encoder, the Tacotron TTS system with characters as input can disambiguate pronunciations and produce speech as natural as that of the systems using phonemes. However, we also found that the encoder could not perfectly learn English stressed syllables from characters, which resulted in a flatter fundamental frequency. In summary, these experimental results suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should use a powerful encoder when it takes characters as input, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
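To make the "encoder from Tacotron or Tacotron2" distinction in the abstract concrete, the sketch below shows a Tacotron2-style encoder (symbol embedding, convolutional stack, bidirectional LSTM) that maps character or phoneme IDs to hidden states consumed by an attention-based decoder. This is a minimal PyTorch illustration based on the publicly described Tacotron2 architecture, not the authors' code; the class name, layer sizes, and symbol-inventory size are illustrative assumptions, and the original Tacotron encoder would instead use a prenet followed by a CBHG module.

```python
# Minimal sketch of a Tacotron2-style encoder (assumed hyperparameters).
import torch
import torch.nn as nn

class Tacotron2StyleEncoder(nn.Module):
    """Character/phoneme embedding -> 3 conv layers -> bidirectional LSTM."""
    def __init__(self, num_symbols, emb_dim=512, conv_channels=512, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, emb_dim)
        layers, in_ch = [], emb_dim
        for _ in range(3):
            layers += [nn.Conv1d(in_ch, conv_channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(conv_channels),
                       nn.ReLU(),
                       nn.Dropout(0.5)]
            in_ch = conv_channels
        self.convs = nn.Sequential(*layers)
        self.blstm = nn.LSTM(conv_channels, lstm_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, symbol_ids):                # (batch, time) int IDs
        x = self.embedding(symbol_ids)            # (batch, time, emb_dim)
        x = self.convs(x.transpose(1, 2))         # convolve over the time axis
        x, _ = self.blstm(x.transpose(1, 2))      # (batch, time, 2 * lstm_dim)
        return x

# Usage: encode a dummy batch of phoneme-ID sequences (80 symbols assumed).
encoder = Tacotron2StyleEncoder(num_symbols=80)
hidden = encoder(torch.randint(0, 80, (2, 50)))   # -> torch.Size([2, 50, 512])
```

The convolution-plus-BiLSTM stack is what gives a character-input system the wider receptive field the paper argues is needed; a weaker encoder over raw characters is where pronunciation and stress errors would be expected to arise.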

Record details

  • Source
    Computer speech and language | 2021, Issue 5 | pp. 101183.1-101183.18 | 18 pages
  • Author affiliations

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa 240-0793, Japan;

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan;

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa 240-0793, Japan; Centre for Speech Technology Research, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • Keywords

    Text-to-speech synthesis; Deep learning; Sequence-to-sequence model; End-to-end learning; Tacotron;

