IEEE Transactions on Multimedia

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Abstract

The intrinsic dependencies between the audio and visual modalities extracted from video content, together with well-established film grammar (i.e., domain knowledge), are important for emotion video recognition and regression. However, these cues have yet to be exploited successfully. We therefore propose a multimodal deep regression Bayesian network (MMDRBN) to capture the relationship between the audio and visual modalities for emotion video tagging, and we then modify the structure of the MMDRBN to incorporate domain knowledge. A regression Bayesian network (RBN) consists of one latent layer, one visible layer, and directed links from the latent layer to the visible layer. An RBN can fully represent the data, since it captures dependencies not only among the visible variables but also among the latent variables given the visible variables. To build the MMDRBN, we first learn several layers of RBNs from the audio and visual modalities, and then stack these RBNs to form two deep networks. A joint representation is obtained from the top layers of the two deep networks, capturing the deep dependencies between the audio and visual modalities. We also summarize the main audio and visual elements that filmmakers use to convey emotion and formulate them as a semantically meaningful middle-level representation, i.e., attributes. Through these attributes, we construct the knowledge-augmented MMDRBN, which learns a hybrid middle-level video representation from the video data and the summarized attributes. Experimental results for both emotion recognition and regression on the LIRIS-ACCEDE database demonstrate that the proposed model successfully captures the intrinsic connections between the audio and visual modalities and integrates the middle-level representation learned from video data with the semantic attributes summarized from film grammar. It thus achieves superior performance on emotion video tagging compared with state-of-the-art methods.
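To make the two-stream stacking-and-fusion idea concrete, below is a minimal PyTorch sketch, not the authors' implementation. Exact RBN inference is more involved; here each modality-specific stack of RBNs is approximated by sigmoid recognition layers (a common mean-field-style simplification), the two top-layer representations are concatenated into a joint representation, and a linear head regresses an emotion score such as valence. All layer sizes and the names StackedEncoder and MultimodalFusionSketch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    """Feed-forward stand-in for a stack of RBNs over one modality.

    Each RBN is a directed model (latent -> visible); its posterior over
    latents given visibles is approximated here by a sigmoid layer.
    """
    def __init__(self, sizes):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.sigmoid(layer(x))  # approximate posterior of one RBN layer
        return x

class MultimodalFusionSketch(nn.Module):
    """Two modality-specific stacks fused at their top layers, as in MMDRBN."""
    def __init__(self, audio_sizes, visual_sizes, joint_dim, out_dim=1):
        super().__init__()
        self.audio = StackedEncoder(audio_sizes)
        self.visual = StackedEncoder(visual_sizes)
        top = audio_sizes[-1] + visual_sizes[-1]
        self.joint = nn.Linear(top, joint_dim)      # joint audio-visual representation
        self.head = nn.Linear(joint_dim, out_dim)   # emotion regression head (e.g., valence)

    def forward(self, audio_feat, visual_feat):
        h = torch.cat([self.audio(audio_feat), self.visual(visual_feat)], dim=-1)
        z = torch.sigmoid(self.joint(h))
        return self.head(z)

# Toy usage with random tensors standing in for audio/visual descriptors.
model = MultimodalFusionSketch([64, 32, 16], [128, 64, 16], joint_dim=24)
audio = torch.randn(8, 64)
visual = torch.randn(8, 128)
print(model(audio, visual).shape)  # torch.Size([8, 1])
```

The knowledge-augmented variant described in the abstract would additionally supervise part of the joint layer with the film-grammar attributes, yielding the hybrid middle-level representation; that supervision is omitted here for brevity.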
