Conference paper: IAPR TC3 Workshop on Pattern Recognition of Social Signals in Human-Computer-Interaction

Audio Visual Speech Recognition Using Deep Recurrent Neural Networks



Abstract

In this work, we propose a training algorithm for an audiovisual automatic speech recognition (AV-ASR) system using a deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for the visual modality helps the model converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality yields a significant improvement in character error rate (CER) at various noise levels, even when the model is trained without noisy data. We also provide a comparison of two fusion methods: feature fusion and decision fusion.
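The dimensionality-reduction step above can be pictured as follows: a deep bottleneck network is trained to predict the CTC-derived frame labels from the raw visual features, and after training only the activations of its narrow middle layer are kept as the reduced visual representation. A minimal forward-pass sketch (all dimensions and weights here are hypothetical stand-ins, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: raw visual frames (dim 100) -> hidden 256 ->
# bottleneck 32. The layers after the bottleneck (up to the frame-label
# output) exist only during training and are discarded afterwards.
D_IN, D_H, D_BN = 100, 256, 32

W1 = rng.normal(scale=0.1, size=(D_IN, D_H))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(D_H, D_BN))   # hidden -> bottleneck weights

def bottleneck_features(x):
    """Return the narrow-layer activations used as reduced visual features."""
    h = np.tanh(x @ W1)      # first hidden layer
    return np.tanh(h @ W2)   # bottleneck layer: the new visual features

frames = rng.normal(size=(5, D_IN))  # 5 dummy visual frames
z = bottleneck_features(frames)
assert z.shape == (5, D_BN)          # 100-dim frames reduced to 32 dims
```

The narrow layer forces the network to compress the visual stream into a compact code that is still predictive of the frame labels, which is what makes these features easier to fuse with the audio stream.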
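The two fusion strategies compared in the abstract differ in where the streams are combined: feature fusion concatenates the per-frame audio and (bottleneck) visual features before the fusion RNN, while decision fusion runs a classifier per stream and combines their per-frame posteriors. A toy NumPy sketch of both (dimensions, random features, and the log-linear combination weight are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features: T frames, audio dim 40, visual dim 32.
T, D_A, D_V, N_CLASSES = 6, 40, 32, 28  # e.g. 26 letters + space + CTC blank

audio = rng.normal(size=(T, D_A))
visual = rng.normal(size=(T, D_V))

# --- Feature fusion: concatenate the streams before the fusion RNN. ---
fused = np.concatenate([audio, visual], axis=1)
assert fused.shape == (T, D_A + D_V)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for each stream's classifier outputs (per-frame posteriors).
p_audio = softmax(rng.normal(size=(T, N_CLASSES)))
p_visual = softmax(rng.normal(size=(T, N_CLASSES)))

# --- Decision fusion: combine posteriors, here with a log-linear
# weighting (weight lam on audio, 1 - lam on visual), then renormalise. ---
lam = 0.7
p_fused = softmax(lam * np.log(p_audio) + (1.0 - lam) * np.log(p_visual))
assert np.allclose(p_fused.sum(axis=1), 1.0)  # valid per-frame distributions
```

Feature fusion lets the fusion RNN learn cross-modal correlations directly, whereas decision fusion keeps the streams independent until the output, which makes per-stream reliability weighting (e.g. down-weighting audio in noise) straightforward.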
