Automatic Speaker Recognition and Diarization in Co-Channel Speech

Abstract

This study investigates various aspects of multi-speaker interference and its impact on speaker recognition. Single-channel multi-speaker speech signals (also known as co-channel speech) comprise a significant portion of speech processing data. Examples of co-channel signals include recordings of multiple speakers in meetings, conversations, and debates. The nuisances of co-channel speech are two-fold: 1) overlapped speech, and 2) non-overlapping speaker interference. In overlap, the direct effect of adding two stochastically similar, non-stationary signals together is to disrupt the performance of speech processing systems that were originally developed for single-speaker audio. For example, in speaker recognition, identifying speakers in overlapped segments is more difficult than in single-speaker signals. Analyses in this study show that introducing overlapped speech increases speaker recognition error rates by an order of magnitude. In addition to its direct impact, overlap has a secondary effect: one speaker forces the other to change his or her speech characteristics.

Different forms of co-channel data are investigated in this study. In scenarios where the focus is on overlap, independent cross-talk is used. Independent cross-talk refers to the summation of independent audio signals from different speakers to simulate overlap. The alternative form of data used in this study is real conversation recordings. Although conversations contain both overlapped and non-overlapped speech, independent cross-talk is a better source of overlap, because overlaps in typical conversations are scarce and non-uniform. Independent cross-talk is obtained from the GRID corpus, which was used in the speech separation challenge as a source of overlapped speech. Real conversations are obtained from the Switchboard telephone conversation corpus. Other real conversational data used throughout this study include the AMI meeting corpus, Prof-lifelog, and UTDrive data. These datasets provide a perspective on environmental noise and co-channel interference in day-to-day speech. Categorizing the datasets in this way allows a meticulous analysis of different aspects of co-channel speech.

Most of the analysis of overlaps is presented in the form of overlap detection techniques. This study proposes two overlap detection methods: 1) a Pyknogram-based method, and 2) Gammatone sub-band frequency modulation (GSFM). Both methods take advantage of the harmonic structure of speech to detect overlaps. Pyknograms do so by enhancing speech harmonics and evaluating their dynamics across time, while GSFM magnifies the presence of multiple harmonics in different sub-bands. The other advancements proposed in this study use back-end modeling techniques to compensate for co-channel speech in real conversational data. These techniques are presented to reduce the impact of interfering speech on speaker-dependent models. Several methods are investigated, each of which proposes a different modification to the popular probabilistic linear discriminant analysis (PLDA) model used in state-of-the-art speaker recognition systems. In addition to model compensation techniques, this study presents CRSS-SpkrDiar, a speaker diarization research platform aimed at tackling conversational co-channel speech data. CRSS-SpkrDiar was developed during this study to facilitate end-to-end co-channel speech analysis. Taken collectively, the speech analysis, proposed features, and algorithmic advancements developed in this study all contribute to an improved understanding of, and a measurable performance gain in, speech and speaker technology for the co-channel speech problem.
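The abstract defines independent cross-talk as the summation of independent audio signals from different speakers. A minimal sketch of that idea follows, assuming both signals are already loaded as NumPy arrays at the same sampling rate. The function name `mix_cross_talk` and the `sir_db` parameter (target-to-interferer power ratio in dB) are hypothetical choices for illustration. The abstract does not say how the dissertation scaled or aligned its mixtures.

```python
import numpy as np


def mix_cross_talk(target, interferer, sir_db=0.0):
    """Sum two independent single-speaker signals to simulate overlap.

    sir_db is an illustrative parameter (not from the source): the desired
    ratio of target power to interferer power, in decibels.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)

    # Keep only the region where both speakers are present.
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    if p_target == 0 or p_interf == 0:
        raise ValueError("both signals must have non-zero power")

    # Scale the interferer so that
    # 10 * log10(p_target / p_scaled_interferer) == sir_db.
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (sir_db / 10.0)))
    return target + gain * interferer
```

Calling `mix_cross_talk(a, b, sir_db=0.0)` on two utterances from different speakers would give an equal-power mixture. Unlike real conversation, the amount and placement of overlap is then fully controlled, which is the property the abstract cites as the reason for using cross-talk in overlap analysis.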

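Bibliographic details

Author: Shokouhi, Navid
Author affiliation: The University of Texas at Dallas
Degree-granting institution: The University of Texas at Dallas
Subject: Electrical engineering; Computer science
Degree: Ph.D.
Year: 2017
Pages: 160 p.
Total pages: 160
Original format: PDF
Language of text: English
Chinese Library Classification: Rehabilitation medicine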

Similar literature

1. On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks [J]. Robert Mertens, Po-Sen Huang, Luke Gottlieb. International Journal of Multimedia Data Engineering & Management. 2012, Issue 3
2. TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition [J]. Li Wenjie, Zhang Pengyuan, Yan Yonghong. Electronics Letters. 2019, Issue 14
3. Speaker indexing based on speaker model selection and automatic speech recognition in discussions [J]. Masafumi Nishida, Yuya Akita, Tatsuya Kawahara. 電子情報通信学会技術研究報告. 音声. Speech. 2002, Issue 530
4. Automatic Speech Recognition of Co-Channel Speech: Integrated Speaker and Speech Recognition Approach [C]. Larry P. Heck, Mark Z. Mao. International Conference on Spoken Language Processing; 20041004-08; Jeju (KR). 2004
5. Accent and speaker recognition for advanced automatic speech recognition [D]. Angkititrakul, Pongtep. 2004
6. Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference [O]. Byeongwook Lee, Kwang-Hyun Cho
7. Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis [O]. Desh Raj, Pavel Denisov, Zhuo Chen. 2021
8. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment [R]. Hansen, J. H. 2015
