Predicting speech intelligibility with deep neural networks

Computer Speech and Language


Abstract

Highlights

- An automatic speech recognizer using deep neural networks is proposed as a model to predict speech intelligibility (SI).
- The DNN-based model predicts SI in normal-hearing listeners more accurately than four established SI models.
- In contrast to the baseline models, the proposed model predicts intelligibility from the noisy speech signal and does not require separate noise and speech inputs.
- A relevance propagation algorithm shows that DNNs can listen in the dips of modulated maskers.

An accurate objective prediction of human speech intelligibility is of interest for many applications, such as the evaluation of signal-processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained for and compared to the data presented in Schubotz et al. (2016), which comprise eight different additive maskers, ranging from speech-shaped stationary noise to a single-talker interferer, and responses from eight normal-hearing subjects. The task for listeners and ASR is to identify noisy words from a German matrix sentence test in monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing, and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII), the multi-resolution speech envelope power spectrum model (mr-sEPSM), and others. This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript and therefore does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude-modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
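The prediction principle described in the abstract, scoring an ASR transcript against the known word labels of a matrix sentence and reading off the SNR at which half the words are recognized, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the toy SNR grid, and the accuracy values are invented for the example, and the paper's actual SRT estimation may differ (e.g., fitting a full psychometric function rather than interpolating).

```python
# Illustrative sketch: estimate a speech recognition threshold (SRT)
# from ASR word accuracies measured at several SNRs, by locating the
# SNR at which accuracy crosses 50% words correct.

def word_accuracy(reference, hypothesis):
    """Fraction of reference words reproduced at the same position,
    as in a matrix sentence test with a fixed sentence structure."""
    correct = sum(r == h for r, h in zip(reference, hypothesis))
    return correct / len(reference)

def estimate_srt(snrs, accuracies, target=0.5):
    """Linearly interpolate the SNR at which accuracy reaches `target`.
    Assumes accuracy increases monotonically with SNR."""
    points = list(zip(snrs, accuracies))
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if a0 <= target <= a1:
            return s0 + (target - a0) * (s1 - s0) / (a1 - a0)
    raise ValueError("target accuracy not bracketed by the measurements")

# Toy data: word accuracy of hypothetical ASR transcripts per SNR (dB).
snrs = [-12, -9, -6, -3, 0]
accs = [0.05, 0.20, 0.55, 0.85, 0.97]
srt = estimate_srt(snrs, accs)  # SNR at 50% words correct
```

Because only the reference word labels and the ASR output enter this comparison, the measure can be computed from the noisy mixture alone, which is the property that distinguishes the proposed model from baselines requiring separate speech and noise signals.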
