Information Processing & Management

SLTFNet: A spatial and language-temporal tensor fusion network for video moment retrieval



Abstract

This paper focuses on temporal retrieval of activities in videos via sentence queries. Given a sentence query describing an activity, temporal moment retrieval aims at localizing the temporal segment within the video that best matches the textual query. This is a general yet challenging task, as it requires comprehension of both video and language. Existing research predominantly employs coarse frame-level features as the visual representation, obscuring the specific details (e.g., the desired objects "girl", "cup" and the action "pour") within the video that may provide critical cues for localizing the desired moment. In this paper, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) approach to resolve these issues. Specifically, the SLTF method first takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features "girl", "cup") by spatial attention. Then we encode the sequence of local features on consecutive frames with an LSTM network, which can capture the motion information and interactions among these objects (e.g., the interaction "pour" involving these two objects). Meanwhile, language-temporal attention is utilized to emphasize the keywords based on moment context information. Thereafter, a tensor fusion network learns both the intra-modality and inter-modality dynamics, which enhances the learning of the moment-query representation. Therefore, our proposed two attention sub-networks can adaptively recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query for retrieving the desired moment. Experimental results on three public benchmark datasets (TACOS, Charades-STA, and DiDeMo) show that the SLTF model significantly outperforms current state-of-the-art approaches and demonstrate the benefits of the new components incorporated into SLTF.
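The abstract outlines four computational stages: spatial attention over object-level features within each frame, LSTM encoding of the attended features across consecutive frames, language-temporal attention over the query words conditioned on the moment context, and a tensor fusion network over the resulting moment and query representations. The PyTorch sketch below illustrates one way these stages could fit together; all module names, dimensionalities, and the exact attention and fusion formulations are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of the SLTF pipeline described in the abstract.
# Dimensions and layer choices are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SLTFSketch(nn.Module):
    def __init__(self, obj_dim=1024, word_dim=300, hidden=256):
        super().__init__()
        # Spatial attention: scores each object-level feature within a frame.
        self.spatial_score = nn.Linear(obj_dim, 1)
        # Temporal encoder over the attended per-frame features (motion / interactions).
        self.video_lstm = nn.LSTM(obj_dim, hidden, batch_first=True)
        # Query encoder plus language-temporal attention conditioned on the moment context.
        self.query_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.word_score = nn.Linear(2 * hidden, 1)
        # Scoring head on top of the tensor-fused moment-query representation.
        self.score = nn.Linear((hidden + 1) * (hidden + 1), 1)

    def forward(self, objects, words):
        # objects: (B, T, N, obj_dim) object features for T frames, N objects per frame
        # words:   (B, L, word_dim)   word embeddings of the sentence query
        B, T, N, D = objects.shape

        # Spatial attention: emphasize the most relevant objects in each frame.
        alpha = F.softmax(self.spatial_score(objects), dim=2)        # (B, T, N, 1)
        frame_feat = (alpha * objects).sum(dim=2)                    # (B, T, D)

        # Temporal encoding of the attended object features.
        video_states, _ = self.video_lstm(frame_feat)                # (B, T, H)
        moment = video_states[:, -1]                                 # (B, H) moment context

        # Language-temporal attention: weight query words given the moment context.
        word_states, _ = self.query_lstm(words)                      # (B, L, H)
        ctx = moment.unsqueeze(1).expand(-1, word_states.size(1), -1)
        beta = F.softmax(self.word_score(torch.cat([word_states, ctx], dim=-1)), dim=1)
        query = (beta * word_states).sum(dim=1)                      # (B, H)

        # Tensor fusion: outer product of (moment, 1) and (query, 1) keeps both
        # intra-modality (uni-modal) and inter-modality (bimodal) terms.
        ones = moment.new_ones(B, 1)
        m = torch.cat([moment, ones], dim=1)                         # (B, H+1)
        q = torch.cat([query, ones], dim=1)                          # (B, H+1)
        fused = torch.bmm(m.unsqueeze(2), q.unsqueeze(1)).flatten(1) # (B, (H+1)^2)

        return self.score(fused).squeeze(-1)                         # relevance score per moment
```

Appending a constant 1 before the outer product is the standard tensor-fusion trick for retaining the uni-modal terms alongside the cross-modal ones in a single flattened representation, which a linear scoring head can then consume when ranking candidate moments against the query.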
