IEEE International Conference on Acoustics, Speech and Signal Processing

Deep Encoded Linguistic and Acoustic Cues for Attention Based End to End Speech Emotion Recognition



Abstract

An end-to-end model with convolutional layers and a multi-head self-attention mechanism is proposed for the Speech Emotion Recognition (SER) task. As inputs, we propose to use both deep encoded linguistic features, which carry the language-related context of emotion, and audio spectrograms, which represent acoustic cues. To obtain the deep linguistic feature representation, we use outputs from the intermediate layers of a pre-trained Automatic Speech Recognition (ASR) model, with the layer to use selected empirically. The influence of acoustic and linguistic features, separately and in combination, on emotion recognition in different scenarios (scripted and spontaneous recordings of emotional speech) has been studied. Extensive experiments on the standard IEMOCAP database are conducted to investigate the efficacy of the proposed approach. To address class imbalance, we carried out downsampling and ensembling, which further improved SER accuracy. Overall, we observe that acoustic features perform best for improvised recordings, which is due to the spontaneity of the speech and its weaker linguistic correlation. The linguistic features, however, are found to be effective for the scripted scenario as well as for the combined scenario (scripted and improvised recordings together), which reflects more linguistic information in the spoken utterances.
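The abstract does not say how the downsampling and ensembling were carried out. The following is only a minimal sketch of one common way to combine the two, assuming random undersampling of every class to the size of the rarest class, one model trained per balanced subset, and a majority vote over the models' predictions. The function names `balanced_subsets` and `majority_vote` and the emotion labels in the usage note are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets lists of sample indices, each class-balanced.

    Assumption: every subset keeps as many samples per class as the
    rarest class has, drawn at random without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Size of the rarest class sets the per-class quota.
    target = min(len(idxs) for idxs in by_class.values())
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for label in sorted(by_class):
            subset.extend(rng.sample(by_class[label], target))
        subsets.append(sorted(subset))
    return subsets


def majority_vote(predictions):
    """Combine per-model predictions (one list per model) by majority vote."""
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # Ties resolve to the tied label that appears first in model order.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined
```

In use, each index list from `balanced_subsets(labels, n_subsets=3)` would select the training utterances for one model, and `majority_vote` would merge the predictions of those models on the test utterances, for example over illustrative labels such as "neutral", "happy", "sad", and "angry".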
