NEC-TT System for Mixed-Bandwidth and Multi-Domain Speaker Recognition

Abstract

This paper describes the NEC-TT speaker recognition system designed for the 2018 Speaker Recognition Evaluation (SRE'18) benchmarking. The NEC-TT submission was among the best-performing systems in this latest edition of SRE organized by the National Institute of Standards and Technology (NIST). It comprises multiple sub-systems based on a deep speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) back-end. Speaker embeddings are continuous-valued vector representations that allow easy comparison between speaker voices with simple geometric operations. The effectiveness of deep speaker embeddings relies on the quantity and diversity of the training data. To this end, we rely on data augmentation and mixed-bandwidth training strategies to increase the number of training examples and speakers. By doing so, we not only increase the quantity of the training data but also expand the output softmax layer with a larger number of speaker classes. From a system design perspective, we adopted a two-stage pipeline consisting of a general multi-domain speaker embedding front-end followed by a domain-specific PLDA back-end. This has a significant benefit in commercial deployment since the same speaker embedding front-end could be used with multiple domain-adapted PLDA back-ends to cater to every specific deployment. This paper provides a detailed description and analysis of the design methodology, data augmentation, bandwidth extension, multi-head attention, PLDA adaptation, and other components that have contributed to good performance in NEC-TT's SRE'18 results.
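The abstract's remark that speaker embeddings can be compared "with simple geometric operations" can be illustrated with a minimal sketch. The example below uses cosine similarity, which is only one such operation and is not the PLDA back-end the paper describes. The function names, the toy 4-dimensional vectors, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enroll: Sequence[float], test: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the similarity reaches a (hypothetical) threshold."""
    return cosine_similarity(enroll, test) >= threshold


# Toy vectors standing in for real embeddings (made-up values, not real data).
enroll = [0.9, 0.1, -0.3, 0.4]
test_same = [0.8, 0.2, -0.25, 0.5]    # points in nearly the same direction
test_other = [-0.4, 0.9, 0.6, -0.1]   # points in a very different direction

print(same_speaker(enroll, test_same))
print(same_speaker(enroll, test_other))
```

In the two-stage design the abstract outlines, a scoring step like this would be replaced by a domain-adapted PLDA back-end, while the embedding extractor that produces the vectors stays the same across deployments.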

Bibliographic details

  • Source
    Computer speech and language | 2020, Issue 5 | 101033.1-101033.15 | 15 pages
  • Authors

  • Author affiliations

    Biometrics Research Laboratories NEC Corp. Kanagawa 211-8666 Japan;

    Department of Computer Science Tokyo Institute of Technology Tokyo 152-8552 Japan;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Speaker recognition; benchmark evaluation; domain adaptation;

