IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Automatic Singing Transcription Based on Encoder-decoder Recurrent Neural Networks with a Weakly-supervised Attention Mechanism

Abstract

This paper describes a neural singing transcription method that estimates a sequence of musical notes directly from the audio signal of a singing voice in an end-to-end manner, without time-aligned training data. A conventional approach to singing transcription is to perform vocal F0 estimation followed by musical note estimation. The performance of this approach, however, is severely limited because F0 estimation errors propagate to the note estimation step and rich acoustic information cannot be used. In addition, splitting continuous singing-voice signals into segments corresponding to musical notes to produce precise time-aligned transcriptions is difficult and time-consuming. To solve these problems, we use an encoder-decoder model with an attention mechanism that can automatically learn an input-output alignment and mapping, even from non-aligned training data. The main challenge of our study is to estimate temporal categories (note values) in addition to instantaneous categories (pitches). We thus propose a novel loss function on the attention weights of time-aligned notes for semi-supervised alignment training. By gradually reducing the weight of this loss function, a better input-output alignment can be learned much more quickly. We show that our method performs well on isolated singing voices in popular music.
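The abstract does not give the exact form of the loss on the attention weights, but its two ingredients can be sketched: a cross-entropy between each note's attention distribution and a target distribution derived from time-aligned annotations, plus a weight that is gradually reduced over training. The following is a minimal illustrative sketch under those assumptions; all function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def alignment_loss(attn, note_mask, eps=1e-8):
    """Hypothetical alignment loss: cross-entropy between each note's
    attention distribution over input frames and a target distribution
    built from a binary time-aligned mask (1 where the note sounds).

    attn:      (num_notes, num_frames), each row sums to 1
    note_mask: (num_notes, num_frames), binary alignment mask
    """
    # Turn the binary mask into a per-note probability distribution.
    target = note_mask / (note_mask.sum(axis=1, keepdims=True) + eps)
    # Mean negative log-likelihood of the annotated frames.
    return float(-(target * np.log(attn + eps)).sum() / attn.shape[0])

def annealed_weight(epoch, init=1.0, decay=0.9):
    """Exponentially shrink the alignment-loss weight so the model
    gradually comes to rely on its own learned alignment."""
    return init * decay ** epoch

# Demo: attention that agrees with the annotation incurs a lower loss
# than attention concentrated on the wrong frames.
mask = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)
good = np.array([[0.45, 0.45, 0.05, 0.05],
                 [0.05, 0.05, 0.45, 0.45]])
bad = good[:, ::-1].copy()  # attention flipped onto the wrong frames
loss_good = alignment_loss(good, mask)
loss_bad = alignment_loss(bad, mask)
```

In a full training loop, `annealed_weight(epoch) * alignment_loss(...)` would be added to the main transcription loss, so the alignment supervision dominates early training and fades out later.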
