Deep Encoded Linguistic and Acoustic Cues for Attention Based End to End Speech Emotion Recognition



Abstract

An end-to-end model with convolutional layers and a multi-head self-attention mechanism is proposed for the Speech Emotion Recognition (SER) task. As inputs, we propose to use both deep encoded linguistic features, which carry the language-related context of emotion, and audio spectrograms, which represent the acoustic cues. To obtain the deep linguistic feature representation, we use outputs from the intermediate layers of a pre-trained Automatic Speech Recognition (ASR) model, with the layer selected empirically. The influence of acoustic and linguistic features, separately and in combination, on emotion recognition in different scenarios (scripted and spontaneous recordings of emotional speech) has been studied. Extensive experiments on the standard IEMOCAP database are conducted to investigate the efficacy of the proposed approach. To address class imbalance, we carried out downsampling and ensembling, which further improved SER accuracy. Overall, we observe that the acoustic features perform best for improvised recordings, which is due to the spontaneity of the speech and its weaker linguistic correlation. The linguistic features, in contrast, are found to be effective for the scripted scenario as well as for the combined scenario (scripted and improvised recordings together), which reflects more linguistic information in the spoken utterances.
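The abstract mentions downsampling and ensembling as the remedy for class imbalance but gives no procedure. The Python sketch below shows one common way such a scheme can be set up, under stated assumptions: each ensemble member is trained on a subset in which every class is randomly cut down to the size of the rarest class, and the members' predictions are combined by majority vote. The function names `balanced_subsets` and `majority_vote`, the seed handling, and the voting rule are illustrative choices, not the authors' published method.

```python
import random
from collections import Counter, defaultdict


def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets index lists, each holding equal counts per class.

    Assumption: every class is downsampled to the size of the rarest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for idxs in by_class.values():
            # Draw `target` distinct utterances from this class.
            subset.extend(rng.sample(idxs, target))
        subsets.append(sorted(subset))
    return subsets


def majority_vote(member_predictions):
    """Combine per-model label lists into one label per utterance.

    Ties go to the label that appears first among the votes.
    """
    combined = []
    for votes in zip(*member_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical label distribution, for illustration only.
labels = ["neutral"] * 6 + ["happy"] * 3 + ["angry"] * 2
subsets = balanced_subsets(labels, n_subsets=3)
# Each subset would train one ensemble member (training itself is omitted).
final = majority_vote([
    ["happy", "angry"],
    ["happy", "neutral"],
    ["sad", "neutral"],
])
```

Here each subset contains two utterances per class, and `final` is `["happy", "neutral"]`. Drawing a fresh random subset for each member lets the ensemble as a whole see more of the majority-class data than any single downsampled model would.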


Bibliographic record

Source
IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 7189-7193 (5 pages)
Authors
Swapnil Bhosale; Rupayan Chakraborty; Sunil Kumar Kopparapu
Original format: PDF
Keywords
Speech emotion recognition; End-to-End system; multi-head self attention; linguistic features; acoustic features

Similar literature

1. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model [J]. Wei Pengcheng, Zhao Yu. Personal and Ubiquitous Computing. 2019, issue 3-4.
2. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues [J]. Florian Eyben, Martin Woellmer, Alex Graves. Journal on Multimodal User Interfaces. 2010, issue 1-2.
3. Deep Encoded Linguistic and Acoustic Cues for Attention Based End to End Speech Emotion Recognition [C]. Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu. IEEE International Conference on Acoustics, Speech and Signal Processing. 2020.
4. Perceiving speech in context: Compensation for contextual variability during acoustic cue encoding and categorization [D]. Toscano, Joseph Christopher. 2011.
5. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels [O]. Santiago-Omar Caballero-Morales. 2013.
6. On-line Emotion Recognition in a 3-D Activation-Valence-Time Continuum using Acoustic and Linguistic Cues [O]. Eyben, F., Wollmer, M., Graves, A. 2010.
