IEEE Transactions on Multimedia

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Abstract

The intrinsic dependencies between the audio and visual modalities extracted from video content, together with well-established film grammar (i.e., domain knowledge), are important for emotion video recognition and regression. However, these cues have yet to be exploited successfully. We therefore propose a multimodal deep regression Bayesian network (MMDRBN) to capture the relationship between the audio and visual modalities for emotion video tagging, and we then modify the structure of the MMDRBN to incorporate domain knowledge. A regression Bayesian network (RBN) consists of one latent layer, one visible layer, and directed links from the latent layer to the visible layer. An RBN can fully represent the data, since it captures dependencies not only among the visible variables but also among the latent variables given the visible variables. To build the MMDRBN, we first learn several layers of RBNs from the audio and visual modalities, and then stack these RBNs to form two deep networks. A joint representation is obtained from the top layers of the two deep networks, capturing the deep dependencies between the audio and visual modalities. We also summarize the main audio and visual elements that filmmakers use to convey emotion and formulate them as a semantically meaningful middle-level representation, i.e., attributes. Through these attributes, we construct the knowledge-augmented MMDRBN, which learns a hybrid middle-level video representation from the video data and the summarized attributes. Experimental results for both emotion recognition and regression on the LIRIS-ACCEDE database demonstrate that the proposed model successfully captures the intrinsic connections between the audio and visual modalities and integrates the middle-level representation learned from video data with the semantic attributes summarized from film grammar. It thus achieves superior performance on emotion video tagging compared with state-of-the-art methods.
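To make the two-stream stacking-and-fusion idea concrete, below is a minimal PyTorch sketch, not the authors' implementation. Exact RBN inference is more involved; here each modality-specific stack of RBNs is approximated by sigmoid recognition layers (a common mean-field-style simplification), the two top-layer representations are concatenated into a joint representation, and a linear head regresses an emotion score such as valence. All layer sizes and the names StackedEncoder and MultimodalFusionSketch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    """Feed-forward stand-in for a stack of RBNs over one modality.

    Each RBN is a directed model (latent -> visible); its posterior over
    latents given visibles is approximated here by a sigmoid layer.
    """
    def __init__(self, sizes):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.sigmoid(layer(x))  # approximate posterior of one RBN layer
        return x

class MultimodalFusionSketch(nn.Module):
    """Two modality-specific stacks fused at their top layers, as in MMDRBN."""
    def __init__(self, audio_sizes, visual_sizes, joint_dim, out_dim=1):
        super().__init__()
        self.audio = StackedEncoder(audio_sizes)
        self.visual = StackedEncoder(visual_sizes)
        top = audio_sizes[-1] + visual_sizes[-1]
        self.joint = nn.Linear(top, joint_dim)      # joint audio-visual representation
        self.head = nn.Linear(joint_dim, out_dim)   # emotion regression head (e.g., valence)

    def forward(self, audio_feat, visual_feat):
        h = torch.cat([self.audio(audio_feat), self.visual(visual_feat)], dim=-1)
        z = torch.sigmoid(self.joint(h))
        return self.head(z)

# Toy usage with random tensors standing in for audio/visual descriptors.
model = MultimodalFusionSketch([64, 32, 16], [128, 64, 16], joint_dim=24)
audio = torch.randn(8, 64)
visual = torch.randn(8, 128)
print(model(audio, visual).shape)  # torch.Size([8, 1])
```

The knowledge-augmented variant described in the abstract would additionally supervise part of the joint layer with the film-grammar attributes, yielding the hybrid middle-level representation; that supervision is omitted here for brevity.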
