
Deep Visual-semantic for Crowded Video Understanding

Abstract

Visual-semantic features play a vital role in crowded video understanding. Convolutional Neural Networks (CNNs) have achieved a significant breakthrough in learning representations from images. However, learning visual-semantic features, and extracting them effectively for video analysis, remains a challenging task. In this study, we propose a novel visual-semantic method that captures both appearance and dynamic representations. In particular, we propose a spatial context method based on fractional Fisher vector (FV) encoding of CNN features, which we regard as our main contribution. In addition, to capture temporal context information, we also apply the fractional encoding method to dynamic images. Experimental results on the WWW crowd video dataset demonstrate that the proposed method outperforms the state of the art.
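
The abstract leaves the encoding pipeline implicit, but FV encoding of CNN features conventionally treats each spatial cell of a convolutional feature map as a local descriptor, aggregates the cells under a Gaussian mixture model, and power-normalizes the result. The Python sketch below illustrates that standard pipeline; the alpha exponent stands in for the "fractional" normalization (alpha = 0.5 recovers the common improved FV), and all names and shapes here are illustrative assumptions, not the authors' code.

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm, alpha=0.5):
    # Encode local descriptors (N x D) as a Fisher vector under a fitted
    # diagonal-covariance GMM, then apply fractional power normalization.
    N, D = descriptors.shape
    q = gmm.predict_proba(descriptors)                  # N x K soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / np.sqrt(var[k])  # standardized residuals
        g_mu = (q[:, k:k+1] * diff).sum(0) / (N * np.sqrt(pi[k]))
        g_var = (q[:, k:k+1] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi[k]))
        parts += [g_mu, g_var]
    fv = np.concatenate(parts)                          # 2 * K * D dimensions
    fv = np.sign(fv) * np.abs(fv) ** alpha              # fractional normalization
    return fv / (np.linalg.norm(fv) + 1e-12)            # L2 normalization

# Toy usage: fit the GMM on held-out descriptors, then encode one frame's
# 14 x 14 grid of 64-dim conv cells into a fixed-length vector.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=16, covariance_type="diag")
gmm.fit(rng.standard_normal((5000, 64)))
print(fisher_vector(rng.standard_normal((196, 64)), gmm, alpha=0.3).shape)  # (2048,)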
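
On the temporal side, "dynamic images" are conventionally built by approximate rank pooling (Bilen et al., CVPR 2016), which collapses a clip into a single frame-sized image whose pixel values encode temporal order, so the same 2-D CNN-plus-FV pipeline can be reused on it. The sketch below shows that standard construction; whether this paper uses these exact coefficients is an assumption.

import numpy as np

def dynamic_image(frames):
    # Approximate rank pooling: weight each frame of a (T, H, W, C) clip
    # by a closed-form coefficient derived from harmonic numbers, then sum.
    T = frames.shape[0]
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))  # H[0..T]
    t = np.arange(1, T + 1)
    alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    di = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so an image CNN can consume the result.
    di = 255 * (di - di.min()) / (di.max() - di.min() + 1e-12)
    return di.astype(np.uint8)

clip = np.random.rand(16, 224, 224, 3)    # toy 16-frame clip
print(dynamic_image(clip).shape)          # (224, 224, 3)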
