IEEE/CVF Conference on Computer Vision and Pattern Recognition

MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model



Abstract

Driven by growing concern about diet and health, food computing has attracted enormous attention from both industry and the research community. One of the most popular research topics in this domain is food retrieval, due to its profound influence on health-oriented applications. In this paper, we focus on the task of cross-modal retrieval between food images and cooking recipes. We present the Modality-Consistent Embedding Network (MCEN), which learns modality-invariant representations by projecting images and texts into the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables that explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes embeddings of the different modalities independently at inference time for the sake of efficiency. Extensive experimental results demonstrate that the proposed MCEN outperforms all existing approaches on the benchmark Recipe1M dataset while requiring less computation.
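The retrieval setup described in the abstract can be sketched as a two-tower model: each modality has its own encoder, both project into one shared embedding space, and retrieval reduces to nearest-neighbor search over cosine similarity. The sketch below is a minimal illustration with random projection weights and a reparameterized latent sample, not the authors' MCEN architecture; all names (`TwoTowerEncoder`, `sample_latent`) and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-8):
    """Unit-normalize rows so dot products equal cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps.
    In a model like MCEN, such stochastic latents couple the two
    modalities during training only; inference stays deterministic."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

class TwoTowerEncoder:
    """Projects image and recipe features into one shared embedding space.
    Each tower can run independently, which is what makes inference cheap."""
    def __init__(self, img_dim, txt_dim, emb_dim):
        self.W_img = rng.standard_normal((img_dim, emb_dim)) / np.sqrt(img_dim)
        self.W_txt = rng.standard_normal((txt_dim, emb_dim)) / np.sqrt(txt_dim)

    def encode_image(self, x):
        return l2_normalize(x @ self.W_img)

    def encode_text(self, t):
        return l2_normalize(t @ self.W_txt)

enc = TwoTowerEncoder(img_dim=512, txt_dim=300, emb_dim=128)
imgs = rng.standard_normal((4, 512))   # stand-in dish-image features
txts = rng.standard_normal((4, 300))   # stand-in recipe-text features

# Embeddings are computed independently per modality, as at inference time.
img_emb = enc.encode_image(imgs)
txt_emb = enc.encode_text(txts)

# Cosine-similarity matrix; row i ranks all recipes for image i.
sim = img_emb @ txt_emb.T              # shape (4, 4)
ranking = np.argsort(-sim, axis=1)
```

Because the embeddings are unit-normalized, the matrix product directly yields cosine similarities, and image-to-recipe retrieval is just a row-wise sort of `sim`.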

