
Visual-Linguistic Semantic Alignment: Fusing Human Gaze and Spoken Narratives for Image Region Annotation


Abstract

Advanced image-based application systems such as image retrieval and visual question answering depend heavily on semantic image region annotation. However, improvements in image region annotation are limited by our inability to understand how humans, the end users, process these images and image regions. In this work, we expand a framework for capturing image region annotations in which the interpretation of an image is influenced by the end user's visual perception skills, conceptual knowledge, and task-oriented goals. Human image understanding is reflected by individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations (e.g., gaze, text) remain a challenge. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal conceptual elements of image inspection tasks. We propose that there exists a meaningful relation between gaze, spoken narratives, and image content. Using unsupervised bitext alignment, we create meaningful mappings between participants' eye movements (which reveal key areas of images) and spoken descriptions of those images. The resulting alignments are then used to annotate image regions with concept labels. Our alignment accuracy exceeds that of baseline alignments obtained using both simultaneous and fixed-delay temporal correspondence. Additionally, a comparison of alignment accuracy between a method that identifies clusters in the images based on eye movements and a method that identifies clusters using image features shows that the two approaches perform well on different types of images and concept labels. This suggests that an image annotation framework could integrate information from more than one technique to handle heterogeneous images. The resulting alignments can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. We demonstrate the applicability of the proposed framework with two datasets: one consisting of general-domain images and the other consisting of images from the medical domain. This work is an important contribution toward the highly challenging problem of fusing human-elicited multimodal data sources, a problem that will become increasingly important as low-resource scenarios become more common.
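The abstract describes the alignment strategies only at a high level. As an illustration of the two temporal baselines (simultaneous and fixed-delay correspondence) and of what an unsupervised bitext alignment between gaze clusters and spoken words could look like, here is a minimal Python sketch in the spirit of IBM Model 1, a standard bitext aligner; the data layout, function names, and the 0.5 s delay are hypothetical, and this is not claimed to be the dissertation's actual method.

```python
# Illustrative sketch only: data layout, names, and the default delay are
# hypothetical assumptions, not details taken from the dissertation.
from collections import defaultdict

def fixed_delay_baseline(fixations, words, delay=0.5):
    """Baseline: align each spoken word to the gaze cluster fixated `delay`
    seconds before the word's onset.

    fixations: time-sorted list of (onset_seconds, cluster_id)
    words:     list of (onset_seconds, token)
    Setting delay=0.0 gives the simultaneous-correspondence baseline."""
    alignments = []
    for w_onset, token in words:
        shifted = w_onset - delay
        active = [cid for t, cid in fixations if t <= shifted]
        alignments.append((token, active[-1] if active else None))
    return alignments

def bitext_align(trials, n_iters=10):
    """Unsupervised bitext-style alignment in the spirit of IBM Model 1:
    treat each trial's fixation-cluster sequence as the 'source sentence'
    and its word sequence as the 'target sentence', and learn
    p(word | cluster) by EM.

    trials: list of (cluster_seq, word_seq) pairs, one per image viewing."""
    # Uniform initialisation over all within-trial (word, cluster) pairs.
    t = {(w, c): 1.0
         for clusters, words in trials for w in words for c in clusters}
    for _ in range(n_iters):
        counts = defaultdict(float)   # expected (word, cluster) counts
        totals = defaultdict(float)   # expected per-cluster marginals
        for clusters, words in trials:
            for w in words:
                z = sum(t[(w, c)] for c in clusters)  # E-step normaliser
                for c in clusters:
                    frac = t[(w, c)] / z
                    counts[(w, c)] += frac
                    totals[c] += frac
        # M-step: re-estimate translation probabilities.
        t = {(w, c): counts[(w, c)] / totals[c] for (w, c) in counts}
    return t

# Each word can then be linked to its argmax cluster, and that cluster's
# image region annotated with the word as a candidate concept label.
```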

Bibliographic details

  • Author: Vaidyanathan, Preethi.
  • Affiliation: Rochester Institute of Technology.
  • Degree-granting institution: Rochester Institute of Technology.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 109 p.
  • Total pages: 109
  • Format: PDF
  • Language: eng
  • CLC classification: Public buildings
  • Keywords:

