LEARNING SPOKEN WORDS FROM MULTISENSORY INPUT

Abstract

Speech recognition and speech translation are traditionally addressed by processing acoustic signals, while nonlinguistic information is typically not used. In this paper, we present a new method that explores spoken word learning from naturally co-occurring multisensory information in a dyadic (two-person) conversation. It has been observed that listeners have a strong tendency to look toward the objects referred to by the speaker during a conversation. In light of this, we propose to use eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects in different languages can be translated based on their visual semantics. We have developed a multimodal learning system and report the results of experiments that use speech and video, in concert with eye movement records, as training data.
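The core idea is gaze-aligned co-occurrence between spoken word tokens and visually attended objects. The sketch below is a deliberately simplified illustration of that idea, not the system described in the paper: it assumes speech has already been segmented into time-stamped word-like units, that video processing plus eye tracking yields time-stamped fixated-object labels, and that a fixed time window (here 0.5 s) defines co-occurrence. All names (build_lexicon, best_object, translate) are hypothetical. It builds an audio-visual lexicon from co-occurrence counts and translates object names across languages by matching lexicon entries through their shared visual categories.

# Hypothetical sketch of gaze-guided audio-visual word learning (Python).
# Not the paper's actual algorithm; see the assumptions stated above.
from collections import defaultdict

def build_lexicon(word_events, gaze_events, window=0.5):
    """Associate spoken word tokens with visually attended objects.

    word_events: list of (timestamp_sec, word_token) from segmented speech
    gaze_events: list of (timestamp_sec, fixated_object_label) from eye tracking
    window:      max time offset (sec) for a word/gaze pair to count as co-occurring
    Returns {word_token: {object_label: count}}.
    """
    lexicon = defaultdict(lambda: defaultdict(int))
    for t_w, word in word_events:
        for t_g, obj in gaze_events:
            if abs(t_w - t_g) <= window:
                lexicon[word][obj] += 1
    return lexicon

def best_object(lexicon, word):
    """Most frequently co-attended object for a spoken word, or None."""
    counts = lexicon.get(word, {})
    return max(counts, key=counts.get) if counts else None

def translate(word, lex_src, lex_tgt):
    """Map a source-language word to a target-language word whose
    lexicon entry points to the same visual object."""
    obj = best_object(lex_src, word)
    if obj is None:
        return None
    candidates = {w: c[obj] for w, c in lex_tgt.items() if obj in c}
    return max(candidates, key=candidates.get) if candidates else None

if __name__ == "__main__":
    # Toy example: English and (romanized) Mandarin sessions about the same objects.
    lex_en = build_lexicon([(1.0, "cup"), (3.0, "book")],
                           [(1.1, "CUP"), (3.2, "BOOK")])
    lex_zh = build_lexicon([(2.0, "beizi"), (4.0, "shu")],
                           [(2.1, "CUP"), (4.1, "BOOK")])
    print(translate("cup", lex_en, lex_zh))  # -> "beizi"

In this toy setting, translation never compares the acoustic forms of the two languages directly; the shared visual object category serves as the pivot, which is the role the abstract assigns to visual semantics.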