The World in My Mind: Visual Dialog with Adversarial Multi-modal Feature Encoding

Abstract

Visual Dialog is a multi-modal task that requires a model to participate in a multi-turn human dialog grounded on an image and to generate correct, human-like responses. In this paper, we propose a novel Adversarial Multi-modal Feature Encoding (AMFE) framework for effective and robust auxiliary training of visual dialog systems. AMFE forces the language-encoding part of a model to generate hidden states whose distribution is closely related to the distribution of real-world image features, yielding language features that inherently carry general knowledge from both modalities. This helps the model generate responses that are both more correct and more general, at reasonably low time cost. Experimental results show that AMFE steadily brings performance gains to different models on different scales of data. Our method outperforms both the supervised learning baselines and other fine-tuning methods, achieving state-of-the-art results on most metrics of the VisDial v0.5/v0.9 generative tasks.
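The core mechanism the abstract describes, aligning the language encoder's hidden-state distribution with the distribution of real image features via an adversarial objective, can be sketched in miniature. The following is a hypothetical toy illustration, not the authors' implementation: a logistic-regression discriminator tries to tell image features from language-encoder hidden states, while the encoder is updated to fool it. All dimensions, learning rates, and distributions are illustrative assumptions.

```python
import numpy as np

# Toy sketch of adversarial feature alignment (assumptions, not AMFE's actual
# architecture): the encoder's feature distribution is pulled toward the
# image-feature distribution by training against a discriminator.

rng = np.random.default_rng(0)
dim = 8                      # toy feature dimension
lr_d, lr_g = 0.1, 0.01       # discriminator / encoder step sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_image_feats(n):
    # Stand-in for CNN features of real images: Gaussian with mean 1.0.
    return rng.normal(loc=1.0, scale=0.5, size=(n, dim))

# Toy "language encoder": a linear map plus bias over token embeddings.
W_enc = rng.normal(scale=0.1, size=(dim, dim))
b_enc = np.zeros(dim)

# Discriminator: logistic regression (label 1 = image, 0 = language feature).
w_d = np.zeros(dim)
b_d = 0.0

for step in range(300):
    tokens = rng.normal(size=(64, dim))     # stand-in dialog embeddings
    lang = tokens @ W_enc + b_enc           # encoder hidden states
    img = sample_image_feats(64)

    # Discriminator update: binary cross-entropy on real vs. encoded features.
    for x, y in ((img, 1.0), (lang, 0.0)):
        p = sigmoid(x @ w_d + b_d)
        g = p - y                           # dBCE/dlogit
        w_d -= lr_d * (x.T @ g) / len(x)
        b_d -= lr_d * g.mean()

    # Encoder update: fool the discriminator (target label 1 for lang feats).
    p = sigmoid(lang @ w_d + b_d)
    g_lang = (p - 1.0)[:, None] * w_d       # backprop through the logit
    W_enc -= lr_g * (tokens.T @ g_lang) / len(tokens)
    b_enc -= lr_g * g_lang.mean(axis=0)

# After training, the encoder bias has drifted toward the image-feature mean,
# i.e. the language-feature distribution moved toward the image distribution.
print(round(float(b_enc.mean()), 2))
```

In the paper's setting, the "encoder" would be the language-encoding part of a full visual dialog model and the "image features" would come from a pretrained visual backbone; the sketch only shows the direction of the adversarial pressure.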