AAAI Conference on Artificial Intelligence

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering


Abstract

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a positional self-attention that computes the response at each position by attending to all positions within the same sequence, and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and encode the question and the video in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize a co-attention mechanism that simultaneously models "what words to listen to" (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
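The core operation the abstract describes (attending to all positions within a sequence and adding representations of absolute positions) can be sketched as follows. This is a minimal illustration with hypothetical shapes and Transformer-style sinusoidal encodings, not the authors' PSAC implementation:

```python
import numpy as np

def positional_self_attention(x):
    """Each position attends to all positions of the same sequence,
    after absolute positional encodings are added to the features.
    A minimal sketch -- not the paper's actual architecture."""
    seq_len, d = x.shape
    # Sinusoidal absolute positional encodings (an assumption; the
    # paper only states that absolute-position representations are added).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    h = x + pe
    # Scaled dot-product attention over all positions in parallel:
    # no recurrence, so long-range dependencies are one step apart.
    scores = h @ h.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

# e.g. 8 video-frame features of dimension 16 (hypothetical sizes)
features = np.random.randn(8, 16)
out = positional_self_attention(features)
print(out.shape)  # (8, 16)
```

Because the attention weights are computed for all positions at once (a matrix product rather than a sequential scan), the question and video encoders can run in parallel, which is the source of the speed advantage over RNN-based methods claimed above.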
