Journal: Computer Speech and Language

Semi-supervised speech activity detection with an application to automatic speaker verification

Abstract

We propose a simple speech activity detector (SAD) based on recording-specific Gaussian mixture modeling (GMM) of speech and non-speech frames. We extend the conventional expectation-maximization (EM) algorithm for GMM training using semi-supervised learning. The extension provides a methodology to incorporate unlabeled data into the SAD training process, leading to more accurate statistical models by exploiting the structure of the data distribution. It is a natural fit for off-line applications that may require partial human assistance, or applications that involve processing large quantities of audio data, such as text-independent speaker verification, speaker diarization or audio surveillance. Unlike supervised SADs, the proposed SAD does not require any off-line training data. Rather, it employs initial labels produced from a tiny fraction of a given audio recording with the help of another, simpler SAD (or a human operator). We analyze the proposed SAD with respect to the covariance type, the initialization method for the speech and non-speech classes, the amount of labeled data required for initialization, and the speech features. In experiments with a stand-alone SAD system, we observe increased accuracy on the challenging dataset from the recent NIST OpenSAD evaluation. Our extensive automatic speaker verification (ASV) experiments, including text-independent experiments with NIST 2010 speaker recognition evaluation (SRE) data and text-dependent experiments with the RSR2015 and RedDots corpora, show benefits of the new approach for long speech segments containing non-stationary noise. For the shorter data conditions in the text-dependent experiments, however, simpler unsupervised SADs perform better. Further, we study the impact of SAD misses and false alarms on ASV performance on the NIST 2010 SRE data. Using an empirical cost function of the two SAD errors, we observe that the ASV error rate reaches its minimum around the same SAD operating point irrespective of SAD method and signal-to-noise ratio (SNR). The optimum ASV performance occurs approximately in an SAD operating region where falsely included non-speech is considered 4-5 times more costly than missed speech. Importantly, the proposed semi-supervised SAD is less dependent on the SAD decision threshold than the other contrastive SAD methods.
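
The semi-supervised EM idea described in the abstract can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it assumes a single scalar feature per frame (for example log-energy), one Gaussian per class rather than a full GMM, and synthetic data. The function name `semi_supervised_em` and all parameter names are hypothetical. Labeled frames keep fixed class responsibilities, while unlabeled frames receive posterior responsibilities that are re-estimated at every iteration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(frames, labels, n_iter=20, var_floor=1e-3):
    """Fit one Gaussian per class (0 = non-speech, 1 = speech).

    frames: list of scalar features, one per frame (assumption: 1-D).
    labels: same length; 0 or 1 for labeled frames, None for unlabeled.
    Returns (weights, means, variances, speech_responsibilities).
    """
    means, variances, weights = [], [], []
    # Initialize each class from its labeled frames only.
    for c in (0, 1):
        xs = [x for x, y in zip(frames, labels) if y == c]
        if not xs:
            raise ValueError("need at least one labeled frame per class")
        m = sum(xs) / len(xs)
        v = max(sum((x - m) ** 2 for x in xs) / len(xs), var_floor)
        means.append(m)
        variances.append(v)
        weights.append(0.5)

    resp = [0.5] * len(frames)  # responsibility of the speech class
    for _ in range(n_iter):
        # E-step: labeled frames keep hard responsibilities; unlabeled
        # frames get class posteriors under the current model.
        for i, (x, y) in enumerate(zip(frames, labels)):
            if y is not None:
                resp[i] = float(y)
            else:
                p0 = weights[0] * gaussian_pdf(x, means[0], variances[0])
                p1 = weights[1] * gaussian_pdf(x, means[1], variances[1])
                resp[i] = p1 / (p0 + p1 + 1e-300)
        # M-step: re-estimate each class from all frames, weighted by
        # their responsibilities.
        for c in (0, 1):
            r = [ri if c == 1 else 1.0 - ri for ri in resp]
            total = sum(r)
            means[c] = sum(ri * x for ri, x in zip(r, frames)) / total
            var = sum(ri * (x - means[c]) ** 2 for ri, x in zip(r, frames)) / total
            variances[c] = max(var, var_floor)
            weights[c] = total / len(frames)
    return weights, means, variances, resp


# Synthetic example: 300 non-speech and 300 speech frames, with only
# the first 10 frames of each class labeled.
random.seed(0)
frames = [random.gauss(-5.0, 1.0) for _ in range(300)]
frames += [random.gauss(3.0, 1.5) for _ in range(300)]
labels = [None] * len(frames)
for i in range(10):
    labels[i] = 0
    labels[300 + i] = 1

weights, means, variances, resp = semi_supervised_em(frames, labels)
is_speech = [r > 0.5 for r in resp]  # 0.5 is an arbitrary decision threshold
```

In this toy setting the small labeled subset only anchors which component is speech and which is non-speech; the unlabeled frames do most of the work in shaping the final class models. A fuller version would use multi-component mixtures over multi-dimensional speech features, as the abstract describes.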
