Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Abstract

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper we investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English, along with comparisons against deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems here also use neural autoregressive (AR) probabilistic modeling and a neural vocoder, in the same way as the sequence-to-sequence systems do, for a fair and in-depth analysis. We investigated the systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see whether our findings generalize across languages. Our experiments on Japanese demonstrated that the Tacotron TTS systems with increased parameter size and input of phonemes and accentual-type labels outperformed the DNN-based pipeline systems using the complicated linguistic features, and that their encoder could learn to compensate for the lack of rich linguistic features. Our experiments on English demonstrated that, when using a suitable encoder, the Tacotron TTS system with characters as input can disambiguate pronunciations and produce speech as natural as that of the systems using phonemes. However, we also found that the encoder could not perfectly learn English stressed syllables from characters, which resulted in a flatter fundamental frequency. In summary, these experimental results suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should use a powerful encoder when it takes characters as input, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
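To make the "encoder from Tacotron or Tacotron2" distinction in the abstract concrete, the sketch below shows a Tacotron2-style encoder (symbol embedding, convolutional stack, bidirectional LSTM) that maps character or phoneme IDs to hidden states consumed by an attention-based decoder. This is a minimal PyTorch illustration based on the publicly described Tacotron2 architecture, not the authors' code; the class name, layer sizes, and symbol-inventory size are illustrative assumptions, and the original Tacotron encoder would instead use a prenet followed by a CBHG module.

```python
# Minimal sketch of a Tacotron2-style encoder (assumed hyperparameters).
import torch
import torch.nn as nn

class Tacotron2StyleEncoder(nn.Module):
    """Character/phoneme embedding -> 3 conv layers -> bidirectional LSTM."""
    def __init__(self, num_symbols, emb_dim=512, conv_channels=512, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, emb_dim)
        layers, in_ch = [], emb_dim
        for _ in range(3):
            layers += [nn.Conv1d(in_ch, conv_channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(conv_channels),
                       nn.ReLU(),
                       nn.Dropout(0.5)]
            in_ch = conv_channels
        self.convs = nn.Sequential(*layers)
        self.blstm = nn.LSTM(conv_channels, lstm_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, symbol_ids):                # (batch, time) int IDs
        x = self.embedding(symbol_ids)            # (batch, time, emb_dim)
        x = self.convs(x.transpose(1, 2))         # convolve over the time axis
        x, _ = self.blstm(x.transpose(1, 2))      # (batch, time, 2 * lstm_dim)
        return x

# Usage: encode a dummy batch of phoneme-ID sequences (80 symbols assumed).
encoder = Tacotron2StyleEncoder(num_symbols=80)
hidden = encoder(torch.randint(0, 80, (2, 50)))   # -> torch.Size([2, 50, 512])
```

The convolution-plus-BiLSTM stack is what gives a character-input system the wider receptive field the paper argues is needed; a weaker encoder over raw characters is where pronunciation and stress errors would be expected to arise.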

Record details

  • Source
    Computer speech and language | 2021, Issue 5 | pp. 101183.1-101183.18 | 18 pages
  • Author affiliations

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa 240-0793, Japan;

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan;

    National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa 240-0793, Japan; Centre for Speech Technology Research, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • Keywords

    Text-to-speech synthesis; Deep learning; Sequence-to-sequence model; End-to-end learning; Tacotron;

