
Visual-Linguistic Semantic Alignment: Fusing Human Gaze and Spoken Narratives for Image Region Annotation


Abstract

Advanced image-based application systems such as image retrieval and visual question answering depend heavily on semantic image region annotation. However, improvements in image region annotation are limited by our inability to understand how humans, the end users, process these images and image regions. In this work, we expand a framework for capturing image region annotations in which the interpretation of an image is influenced by the end user's visual perception skills, conceptual knowledge, and task-oriented goals. Human image understanding is reflected by individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations (e.g., gaze, text) remain a challenge. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal conceptual elements of image inspection tasks. We propose that there exists a meaningful relation between gaze, spoken narratives, and image content. Using unsupervised bitext alignment, we create meaningful mappings between participants' eye movements (which reveal key areas of images) and spoken descriptions of those images. The resulting alignments are then used to annotate image regions with concept labels. Our alignment accuracy exceeds that of baseline alignments obtained using both simultaneous and fixed-delay temporal correspondence. Additionally, a comparison of alignment accuracy between a method that identifies clusters in the images based on eye movements and a method that identifies clusters using image features shows that the two approaches perform well on different types of images and concept labels. This suggests that an image annotation framework could integrate information from more than one technique to handle heterogeneous images. The resulting alignments can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. We demonstrate the applicability of the proposed framework with two datasets: one consisting of general-domain images and the other consisting of images from the medical domain. This work is an important contribution toward the highly challenging problem of fusing human-elicited multimodal data sources, a problem that will become increasingly important as low-resource scenarios become more common.
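The abstract describes the alignment strategies only at a high level. As an illustration of the two temporal baselines (simultaneous and fixed-delay correspondence) and of what an unsupervised bitext alignment between gaze clusters and spoken words could look like, here is a minimal Python sketch in the spirit of IBM Model 1, a standard bitext aligner; the data layout, function names, and the 0.5 s delay are hypothetical, and this is not claimed to be the dissertation's actual method.

```python
# Illustrative sketch only: data layout, names, and the default delay are
# hypothetical assumptions, not details taken from the dissertation.
from collections import defaultdict

def fixed_delay_baseline(fixations, words, delay=0.5):
    """Baseline: align each spoken word to the gaze cluster fixated `delay`
    seconds before the word's onset.

    fixations: time-sorted list of (onset_seconds, cluster_id)
    words:     list of (onset_seconds, token)
    Setting delay=0.0 gives the simultaneous-correspondence baseline."""
    alignments = []
    for w_onset, token in words:
        shifted = w_onset - delay
        active = [cid for t, cid in fixations if t <= shifted]
        alignments.append((token, active[-1] if active else None))
    return alignments

def bitext_align(trials, n_iters=10):
    """Unsupervised bitext-style alignment in the spirit of IBM Model 1:
    treat each trial's fixation-cluster sequence as the 'source sentence'
    and its word sequence as the 'target sentence', and learn
    p(word | cluster) by EM.

    trials: list of (cluster_seq, word_seq) pairs, one per image viewing."""
    # Uniform initialisation over all within-trial (word, cluster) pairs.
    t = {(w, c): 1.0
         for clusters, words in trials for w in words for c in clusters}
    for _ in range(n_iters):
        counts = defaultdict(float)   # expected (word, cluster) counts
        totals = defaultdict(float)   # expected per-cluster marginals
        for clusters, words in trials:
            for w in words:
                z = sum(t[(w, c)] for c in clusters)  # E-step normaliser
                for c in clusters:
                    frac = t[(w, c)] / z
                    counts[(w, c)] += frac
                    totals[c] += frac
        # M-step: re-estimate translation probabilities.
        t = {(w, c): counts[(w, c)] / totals[c] for (w, c) in counts}
    return t

# Each word can then be linked to its argmax cluster, and that cluster's
# image region annotated with the word as a candidate concept label.
```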

Bibliographic details

  • Author: Vaidyanathan, Preethi.
  • Affiliation: Rochester Institute of Technology.
  • Degree-granting institution: Rochester Institute of Technology.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 109 p.
  • Total pages: 109
  • Format: PDF
  • Language: eng
  • CLC classification: Public buildings
  • Keywords:

