Information Processing & Management

SLTFNet: A spatial and language-temporal tensor fusion network for video moment retrieval



Abstract

This paper focuses on temporal retrieval of activities in videos via sentence queries. Given a sentence query describing an activity, temporal moment retrieval aims at localizing the temporal segment within the video that best matches the textual query. This is a general yet challenging task, as it requires comprehension of both video and language. Existing research predominantly employs coarse frame-level features as the visual representation, obscuring the specific details (e.g., the desired objects "girl", "cup" and the action "pour") within the video that may provide critical cues for localizing the desired moment. In this paper, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) approach to resolve these issues. Specifically, the SLTF method first takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features "girl", "cup") by spatial attention. Then we encode the sequence of local features on consecutive frames with an LSTM network, which can capture the motion information and interactions among these objects (e.g., the interaction "pour" involving these two objects). Meanwhile, language-temporal attention is utilized to emphasize the keywords based on moment context information. Thereafter, a tensor fusion network learns both the intra-modality and inter-modality dynamics, which enhances the learning of the moment-query representation. Therefore, our proposed two attention sub-networks can adaptively recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query for retrieving the desired moment. Experimental results on three public benchmark datasets (TACOS, Charades-STA, and DiDeMo) show that the SLTF model significantly outperforms current state-of-the-art approaches and demonstrate the benefits of the new components incorporated into SLTF.
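The abstract outlines four computational stages: spatial attention over object-level features within each frame, LSTM encoding of the attended features across consecutive frames, language-temporal attention over the query words conditioned on the moment context, and a tensor fusion network over the resulting moment and query representations. The PyTorch sketch below illustrates one way these stages could fit together; all module names, dimensionalities, and the exact attention and fusion formulations are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of the SLTF pipeline described in the abstract.
# Dimensions and layer choices are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SLTFSketch(nn.Module):
    def __init__(self, obj_dim=1024, word_dim=300, hidden=256):
        super().__init__()
        # Spatial attention: scores each object-level feature within a frame.
        self.spatial_score = nn.Linear(obj_dim, 1)
        # Temporal encoder over the attended per-frame features (motion / interactions).
        self.video_lstm = nn.LSTM(obj_dim, hidden, batch_first=True)
        # Query encoder plus language-temporal attention conditioned on the moment context.
        self.query_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.word_score = nn.Linear(2 * hidden, 1)
        # Scoring head on top of the tensor-fused moment-query representation.
        self.score = nn.Linear((hidden + 1) * (hidden + 1), 1)

    def forward(self, objects, words):
        # objects: (B, T, N, obj_dim) object features for T frames, N objects per frame
        # words:   (B, L, word_dim)   word embeddings of the sentence query
        B, T, N, D = objects.shape

        # Spatial attention: emphasize the most relevant objects in each frame.
        alpha = F.softmax(self.spatial_score(objects), dim=2)        # (B, T, N, 1)
        frame_feat = (alpha * objects).sum(dim=2)                    # (B, T, D)

        # Temporal encoding of the attended object features.
        video_states, _ = self.video_lstm(frame_feat)                # (B, T, H)
        moment = video_states[:, -1]                                 # (B, H) moment context

        # Language-temporal attention: weight query words given the moment context.
        word_states, _ = self.query_lstm(words)                      # (B, L, H)
        ctx = moment.unsqueeze(1).expand(-1, word_states.size(1), -1)
        beta = F.softmax(self.word_score(torch.cat([word_states, ctx], dim=-1)), dim=1)
        query = (beta * word_states).sum(dim=1)                      # (B, H)

        # Tensor fusion: outer product of (moment, 1) and (query, 1) keeps both
        # intra-modality (uni-modal) and inter-modality (bimodal) terms.
        ones = moment.new_ones(B, 1)
        m = torch.cat([moment, ones], dim=1)                         # (B, H+1)
        q = torch.cat([query, ones], dim=1)                          # (B, H+1)
        fused = torch.bmm(m.unsqueeze(2), q.unsqueeze(1)).flatten(1) # (B, (H+1)^2)

        return self.score(fused).squeeze(-1)                         # relevance score per moment
```

Appending a constant 1 before the outer product is the standard tensor-fusion trick for retaining the uni-modal terms alongside the cross-modal ones in a single flattened representation, which a linear scoring head can then consume when ranking candidate moments against the query.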
