Robust indoor speaker recognition in a network of audio and video sensors

Eleonora D'Arca; Neil M. Robertson; James R. Hopgood

Abstract

Situational awareness is achieved naturally by the human senses of sight and hearing in combination. Automatic scene understanding aims to replicate this human ability using microphones and cameras in cooperation. In this paper, audio and video signals are fused and integrated at different levels of semantic abstraction. We detect and track a speaker who is relatively unconstrained, i.e., free to move indoors within an area larger than in comparable reported work, which is usually limited to round-table meetings. The system is relatively simple, consisting of just 4 microphone pairs and a single camera. Results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. System evaluation is performed on both single-modality and multi-modality tracking. The performance improvement given by the audio-video integration and fusion is quantified in terms of tracking precision and accuracy as well as speaker diarisation error rate and precision-recall (recognition). Improvements over the closest works are reported as follows: 56% in sound source localisation computational cost over an audio-only system, 8% in speaker diarisation error rate over an audio-only speaker recognition unit, and 36% on the precision-recall metric over an audio-video dominant speaker recognition method.
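
The abstract names speaker diarisation error rate and precision-recall as evaluation metrics but does not spell out how they are computed. The following is a minimal Python sketch of one common frame-level formulation, not the authors' evaluation protocol. It assumes that reference and hypothesis are aligned frame by frame, that each frame carries at most one speaker label (`None` for silence), and that speaker labels already match between the two sequences. The function name `frame_metrics` and the toy labels are hypothetical.

```python
from typing import Dict, Optional, Sequence


def frame_metrics(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> Dict[str, float]:
    """Score a frame-level 'who is speaking' labelling (illustrative only).

    Assumptions (not taken from the paper): one label per frame, None means
    silence, no overlapping speech, no scoring collar, and speaker labels
    are already mapped between reference and hypothesis.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = correct = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is None and hyp is None:
            continue  # both silent: no contribution to any count
        if hyp is None:
            missed += 1  # speech in the reference, silence in the hypothesis
        elif ref is None:
            false_alarm += 1  # silence in the reference, speech in the hypothesis
        elif ref == hyp:
            correct += 1  # right speaker
        else:
            confusion += 1  # speech detected, wrong speaker

    ref_speech = correct + missed + confusion  # frames with reference speech
    hyp_speech = correct + false_alarm + confusion  # frames with hypothesised speech

    return {
        # Diarisation error rate: all error frames over reference speech frames.
        "der": (missed + false_alarm + confusion) / ref_speech if ref_speech else 0.0,
        # Precision: share of hypothesised speech frames with the right speaker.
        "precision": correct / hyp_speech if hyp_speech else 0.0,
        # Recall: share of reference speech frames with the right speaker.
        "recall": correct / ref_speech if ref_speech else 0.0,
    }


if __name__ == "__main__":
    ref = ["A", "A", "B", None, "B"]
    hyp = ["A", "B", "B", "A", None]
    print(frame_metrics(ref, hyp))
```

In this toy example, two frames are correct, one is a speaker confusion, one is a false alarm and one is missed, so the error rate is 3 out of 4 reference speech frames. A full evaluation would typically also search for the best mapping between reference and hypothesis speaker labels, which this sketch leaves out.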

Bibliographic details

Source
Signal Processing, 2016, issue 12, pp. 137-149 (13 pages)
Authors
Eleonora D'Arca; Neil M. Robertson; James R. Hopgood
Author affiliations

Visionlab, ISSS, Heriot-Watt University, Edinburgh EH14 4AS, UK;

Visionlab, ISSS, Heriot-Watt University, Edinburgh EH14 4AS, UK;

University of Edinburgh, Edinburgh EH9 3JG, UK
Original format: PDF
Language: English
Keywords
Surveillance; Speaker diarisation; Security biometric; Audio-video speaker tracking; Multimodal fusion

Similar literature

1. Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions [J]. Stewart D., Seymour R., Pass A. Cybernetics, IEEE Transactions on. 2014, issue 2
2. Audio-Visual Speaker Recognition for Video Broadcast News [J]. Benoit Maison, Chalapathy Neti, Andrew Senior. Journal of VLSI Signal Processing. 2001, issue 1a2
3. A Low-Complexity Parabolic Lip Contour Model With Speaker Normalization for High-Level Feature Extraction in Noise-Robust Audiovisual Speech Recognition [J]. Borgstrom B.J., Alwan A. IEEE Transactions on Systems, Man, and Cybernetics. Part A, Systems and Humans. 2008, issue 6
4. An Audio-Video Database for Robust Audio-Video Speech Recognition [C]. You Zhang, Thomas S. Huang. World Multiconference on Systemics, Cybernetics and Informatics. 1999
5. Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors [D]. Abtahi, Farnaz. 2018
6. Evaluation of MPEG-7-Based Audio Descriptors for Animal Voice Recognition over Wireless Acoustic Sensor Networks [O]. Joaquín Luque, Diego F. Larios, Enrique Personal. 2016
7. Robust indoor speaker recognition in a network of audio and video sensors [O]. D'Arca, Eleonora, Robertson, Neil M., Hopgood, James R. 2016
8. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment [R]. Hansen, J. H. 2015
