IEEE Transactions on Multimedia

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion



Abstract

Keyword spotting remains challenging in real-world environments where the noise changes dramatically. In recent studies, audio-visual integration methods have shown clear advantages, since visual speech is not affected by acoustic noise. However, in visual speech recognition, individual utterance mannerisms can cause confusion and false recognition. To address this problem, this paper presents a novel lip descriptor that combines geometry-based and appearance-based features. Specifically, a set of geometry-based features is derived from an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is proposed that measures similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and to adapt to diverse noise conditions; weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets demonstrate that the proposed lip descriptor is competitive with the state of the art. Additionally, the proposed audio-visual keyword spotting (AV-KWS) method based on decision-level fusion significantly improves noise robustness, outperforms feature-level fusion, and adapts to a variety of noisy conditions.
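As a rough illustration of the decision-level fusion described in the abstract, the sketch below combines per-keyword scores from an audio stream and a visual stream through a convex weight predicted by a small neural network. The weight network, its noise-condition input (an SNR estimate here), and all names are hypothetical stand-ins, not the paper's actual architecture.

```python
# Minimal sketch of adaptive decision-level audio-visual fusion.
# Assumption: each stream already produces a per-keyword score vector;
# a tiny network maps a noise-condition feature to a fusion weight.
import torch
import torch.nn as nn

class FusionWeightNet(nn.Module):
    """Maps a noise-condition feature (e.g. an SNR estimate) to a weight in (0, 1)."""
    def __init__(self, in_dim: int = 1, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # constrain the fusion weight to (0, 1)
        )

    def forward(self, condition_feat: torch.Tensor) -> torch.Tensor:
        return self.net(condition_feat)

def fuse_scores(audio_score: torch.Tensor,
                visual_score: torch.Tensor,
                condition_feat: torch.Tensor,
                weight_net: FusionWeightNet) -> torch.Tensor:
    """Convex combination of per-keyword scores from the audio and visual streams."""
    w = weight_net(condition_feat)            # shape: (batch, 1)
    return w * audio_score + (1.0 - w) * visual_score

if __name__ == "__main__":
    torch.manual_seed(0)
    weight_net = FusionWeightNet()
    audio_score = torch.rand(4, 10)    # e.g. 4 utterances, 10 candidate keywords
    visual_score = torch.rand(4, 10)
    snr_estimate = torch.randn(4, 1)   # stand-in for a noise-condition feature
    fused = fuse_scores(audio_score, visual_score, snr_estimate, weight_net)
    print(fused.shape)                 # torch.Size([4, 10])
```

In low-SNR conditions the learned weight would ideally lean toward the visual score, which is the adaptivity the decision-fusion strategy is meant to provide; the actual weighting scheme used in the paper may differ.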