AAAI Conference on Artificial Intelligence

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering


Abstract

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a positional self-attention that computes the response at each position by attending to all positions within the same sequence, and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and encode the question and the video in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize a co-attention mechanism that simultaneously models "what words to listen to" (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
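The core operation the abstract describes (attending to all positions within a sequence and adding representations of absolute positions) can be sketched as follows. This is a minimal illustration with hypothetical shapes and Transformer-style sinusoidal encodings, not the authors' PSAC implementation:

```python
import numpy as np

def positional_self_attention(x):
    """Each position attends to all positions of the same sequence,
    after absolute positional encodings are added to the features.
    A minimal sketch -- not the paper's actual architecture."""
    seq_len, d = x.shape
    # Sinusoidal absolute positional encodings (an assumption; the
    # paper only states that absolute-position representations are added).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    h = x + pe
    # Scaled dot-product attention over all positions in parallel:
    # no recurrence, so long-range dependencies are one step apart.
    scores = h @ h.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

# e.g. 8 video-frame features of dimension 16 (hypothetical sizes)
features = np.random.randn(8, 16)
out = positional_self_attention(features)
print(out.shape)  # (8, 16)
```

Because the attention weights are computed for all positions at once (a matrix product rather than a sequential scan), the question and video encoders can run in parallel, which is the source of the speed advantage over RNN-based methods claimed above.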
