
Robust Back-End Processing for Speaker Verification Under Language and Acoustic Mismatch Conditions


Abstract

Recently, due to the availability of large amounts of data and computation power, there has been a significant rise in Machine Learning/Artificial Intelligence technology. Today, the amount of digital data being generated is huge thanks to smart devices and the Internet of Things. Furthermore, Moore's law has ensured that current hardware can reliably store, analyze, and perform massive amounts of computation in a reasonable amount of time. Machine Learning has many applications across different domains, such as image processing, video processing, data mining, and finance. Among all these applications, the integration of speech technology into mobile and online services has been a major area of research in recent times. Speech, being the primary means of human-to-human interaction, is one of the preferred methods for human-to-machine interaction. Since the beginning of the computer era, scientists, scholars, and artists have dreamed of computers that can hold a natural conversation with humans. Turing's test of computational intelligence and HAL 9000 in the film 2001: A Space Odyssey are examples of this futuristic vision.

The speech signal carries multiple levels of information: what is being spoken, who has spoken it, and the acoustic conditions of the environment in which the speech was uttered. Moreover, speech can be conveniently acquired remotely over a telephone or the internet. Due to these properties, speech technology has seen increasing demand over the past few years. In this study, we focus on the "who has spoken it" part of the speech signal, commonly known as speaker recognition.

Significant advancements have been made in the field of speaker recognition in recent years. However, robustness across mismatched conditions remains a difficult bottleneck. Mismatch can occur between enrollment and test conditions as well as between development and evaluation data. We define evaluation data as the enrollment and test speech utterances, while development data is the data used to train system parameters. Mismatch can arise for a variety of reasons, such as background noise, the communication channel, or the different languages spoken by multilingual speakers. In this study, we propose three methods to compensate for acoustic and language mismatch in a speaker recognition system. The first two methods attempt to reduce mismatch between enrollment and test utterances, while the last method attempts to suppress mismatch between the development and evaluation data of a speaker recognition system:

i) The first method focuses on the language mismatch scenario between enrollment and test conditions. We propose Within-Class Covariance Correction (WCC), which yields significant improvements under the language mismatch condition of a speaker recognition system (a sketch of the idea follows this list).

ii) The second method addresses multi-modality in the development dataset caused by variation in the spoken languages and channels used by a speaker. We show that when a multilingual speaker speaks different languages or uses different microphones, speaker recognition performance suffers. We propose Locally Weighted Linear Discriminant Analysis (LWLDA) to compensate for this drop in performance (see the second sketch below).

iii) The third method enables us to employ unlabeled out-of-domain development data to evaluate speaker recognition trials. We show that when the development dataset closely matches the evaluation trials, we obtain excellent speaker recognition performance; such a development dataset is known as in-domain data. However, when there is acoustic or language mismatch between the development and evaluation data, a sharp drop in performance is observed; such a development dataset is known as out-of-domain data. We propose Unsupervised Probabilistic Feature Transformation (UPFT) to transform out-of-domain data toward in-domain data. Our proposed method has the added advantage of not requiring labeled datasets, which saves a great deal of time, money, and resources (see the final sketch below).
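The abstract does not spell out the exact form of WCC, so the following Python sketch is only an assumed reading: the within-class scatter used by LDA is augmented with the scatter of per-speaker, language-dependent mean shifts, scaled by a hypothetical tuning factor alpha. The function name lda_with_wcc, its signature, and the scaling are illustrative assumptions, not the thesis' reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_with_wcc(ivecs, spk, lang, alpha=0.5, n_dims=150):
    """LDA with an assumed form of Within-Class Covariance Correction:
    the within-class scatter is augmented with the scatter of per-speaker
    language-dependent mean shifts, so the projection learns to discard
    directions that merely encode the spoken language."""
    ivecs, spk, lang = np.asarray(ivecs), np.asarray(spk), np.asarray(lang)
    d = ivecs.shape[1]
    mu = ivecs.mean(axis=0)
    Sw, Sb, shifts = np.zeros((d, d)), np.zeros((d, d)), []
    for s in np.unique(spk):
        Xs = ivecs[spk == s]
        ms = Xs.mean(axis=0)
        Sw += (Xs - ms).T @ (Xs - ms)               # standard within-class scatter
        Sb += len(Xs) * np.outer(ms - mu, ms - mu)  # standard between-class scatter
        langs = np.unique(lang[spk == s])
        if len(langs) > 1:                          # multilingual speaker: collect shifts
            lm = np.stack([ivecs[(spk == s) & (lang == l)].mean(axis=0)
                           for l in langs])
            shifts.extend(lm - lm.mean(axis=0))
    if shifts:
        S = np.stack(shifts)
        Sw += alpha * (S.T @ S)                     # the WCC term (assumed scaling)
    # Generalized LDA eigenproblem; a small ridge keeps Sw positive definite.
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return evecs[:, np.argsort(evals)[::-1][:n_dims]]
```

Projection is then the usual `X @ W` for both development and evaluation i-vectors, with W returned by the function above.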
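Likewise, the exact weighting used in LWLDA is defined in the thesis; the sketch below shows one plausible locally weighted variant in the spirit of local Fisher discriminant analysis. Pairs of speaker means contribute to the between-class scatter with an RBF affinity weight, so nearby (easily confused) classes, such as the multiple modes one speaker creates across languages or microphones, dominate the learned projection. The kernel width sigma and the count-based pair weighting are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lwlda(ivecs, spk, n_dims=150, sigma=10.0):
    """Assumed form of Locally Weighted LDA: the between-class scatter is
    accumulated over pairs of speaker means, down-weighting distant pairs
    with an RBF affinity so that locally confusable classes drive the
    discriminant directions."""
    ivecs, spk = np.asarray(ivecs), np.asarray(spk)
    classes = np.unique(spk)
    d = ivecs.shape[1]
    means = np.stack([ivecs[spk == c].mean(axis=0) for c in classes])
    counts = np.array([(spk == c).sum() for c in classes])
    Sw = np.zeros((d, d))
    for c, m in zip(classes, means):
        Xc = ivecs[spk == c] - m
        Sw += Xc.T @ Xc                                 # pooled within-class scatter
    Sb = np.zeros((d, d))
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            diff = means[i] - means[j]
            w = np.exp(-(diff @ diff) / (2 * sigma**2))  # local affinity weight
            Sb += w * counts[i] * counts[j] * np.outer(diff, diff)
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return evecs[:, np.argsort(evals)[::-1][:n_dims]]
```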
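UPFT itself is a probabilistic transformation defined in the thesis; the stand-in below is plainly not UPFT but a CORAL-style second-order moment match. It illustrates the same goal: mapping unlabeled out-of-domain development vectors toward in-domain statistics without any speaker labels.

```python
import numpy as np
from scipy.linalg import sqrtm

def coral_domain_match(ood, ind, eps=1e-6):
    """CORAL-style stand-in for unsupervised domain compensation (not the
    thesis' UPFT): matches the first- and second-order statistics of
    unlabeled out-of-domain vectors to those of in-domain data."""
    ood, ind = np.asarray(ood), np.asarray(ind)
    d = ood.shape[1]
    mu_o, mu_i = ood.mean(axis=0), ind.mean(axis=0)
    C_o = np.cov(ood, rowvar=False) + eps * np.eye(d)  # out-of-domain covariance
    C_i = np.cov(ind, rowvar=False) + eps * np.eye(d)  # in-domain covariance
    # Whiten with the out-of-domain covariance, then re-color with the
    # in-domain covariance and shift to the in-domain mean.
    A = np.real(sqrtm(C_i)) @ np.linalg.inv(np.real(sqrtm(C_o)))
    return (ood - mu_o) @ A.T + mu_i
```

No labels enter the transform at any point, which mirrors the stated advantage of UPFT: labeling development data is expensive, and an unsupervised mapping avoids that cost entirely.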

Bibliographic Record

  • Author

    Misra, Abhinav

  • Affiliation

    The University of Texas at Dallas

  • Degree-granting institution: The University of Texas at Dallas
  • Subjects: Electrical engineering; Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 91
  • Format: PDF
  • Language: English
  • Date added: 2022-08-17 11:36:46
