Combining Acoustic Embeddings and Decoding Features for End-of-Utterance Detection in Real-Time Far-Field Speech Recognition Systems



Abstract

We present an end-of-utterance detector for real-time automatic speech recognition in far-field scenarios. The proposed system consists of three components: a long short-term memory (LSTM) neural network trained on acoustic features, an LSTM trained on 1-best recognition hypotheses of the automatic speech recognition (ASR) decoder, and a feedforward deep neural network (DNN) combining embeddings derived from both LSTMs with pause duration features from the ASR decoder. At inference time, lower and upper latency (pause duration) bounds act as safeguards. Within the latency bounds, the utterance end-point is triggered as soon as the DNN posterior reaches a tuned threshold. Our experimental evaluation is carried out on real recordings of natural human interactions with voice-controlled far-field devices. We show that the acoustic embeddings are the single most powerful feature and are particularly suitable for cross-lingual applications. We furthermore show the benefit of ASR decoder features, especially as a low-cost alternative to ASR hypothesis embeddings.
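The inference-time rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `EndpointConfig`, `should_endpoint`, and `first_endpoint` are hypothetical, the default bound and threshold values are placeholders, and the reading of the safeguards (no trigger below the lower bound, forced trigger at the upper bound) is an assumption about how the bounds behave.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class EndpointConfig:
    # All values are illustrative placeholders, not taken from the paper.
    min_pause_ms: float = 200.0   # lower latency bound (assumed: never trigger before this)
    max_pause_ms: float = 1500.0  # upper latency bound (assumed: always trigger at this)
    threshold: float = 0.8        # tuned threshold on the DNN posterior


def should_endpoint(pause_ms: float, posterior: float, cfg: EndpointConfig) -> bool:
    """Decide whether to trigger the utterance end-point for one frame."""
    if pause_ms < cfg.min_pause_ms:
        # Safeguard: pause too short, ignore the posterior.
        return False
    if pause_ms >= cfg.max_pause_ms:
        # Safeguard: pause too long, trigger regardless of the posterior.
        return True
    # Within the bounds, trigger as soon as the posterior reaches the threshold.
    return posterior >= cfg.threshold


def first_endpoint(
    frames: Iterable[Tuple[float, float]], cfg: EndpointConfig
) -> Optional[int]:
    """Return the index of the first (pause_ms, posterior) frame that triggers."""
    for index, (pause_ms, posterior) in enumerate(frames):
        if should_endpoint(pause_ms, posterior, cfg):
            return index
    return None
```

In a streaming setting, `first_endpoint` would be fed one `(pause_ms, posterior)` pair per frame. The posterior is assumed to come from the combining DNN and the pause duration from the ASR decoder.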


Bibliographic details

Source
IEEE International Conference on Acoustics, Speech and Signal Processing | 2018 | pp. 5089-5738 | 5 pages
Authors
Roland Maas; Ariya Rastrow; Chengyuan Ma; Guitang Lan; Kyle Goehner; Gautam Tiwari; Shaun Joseph; Bjorn Hoffmeister;
Original format: PDF
Chinese Library Classification: TN912-53
Keywords
end-pointing; end-of-query detection; turn taking; dialog modeling; online speech recognition;


