Computer Speech and Language

Unsupervised speech representation learning for behavior modeling using triplet enhanced contextualized networks


Abstract

Speech encodes a wealth of information related to human behavior and has been used in a variety of automated behavior recognition tasks. However, extracting behavioral information from speech remains challenging, in part because of inadequate training data resources stemming from the often low occurrence frequencies of specific behavioral patterns. Moreover, supervised behavioral modeling typically relies on domain-specific construct definitions and corresponding manually-annotated data, which makes generalization across domains challenging. In this paper, we exploit the stationary properties of human behavior within an interaction and present a representation learning method that captures behavioral information from speech in an unsupervised way. We hypothesize that nearby segments of speech share the same behavioral context and hence map onto similar underlying behavioral representations. We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework that capture the behavioral context and derive a manifold representation in which speech frames with similar behaviors lie closer together, while frames of different behaviors remain farther apart. The models are trained on movie audio data and validated on diverse domains, including a couples therapy corpus and other publicly collected data (e.g., stand-up comedy). With encouraging results, our proposed framework demonstrates the feasibility of unsupervised learning for cross-domain behavioral modeling.
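
To make the triplet idea concrete, the following is a minimal PyTorch sketch of that component only, not the authors' released code or exact architecture: nearby speech segments are treated as an anchor/positive pair (assumed to share a behavioral context), a temporally distant segment serves as the negative, and a margin loss pulls same-context embeddings together. All class names, layer sizes, and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of a triplet objective over speech-segment embeddings.
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Toy encoder mapping a sequence of acoustic frames to one embedding."""
    def __init__(self, feat_dim=40, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)                # h: (1, batch, hidden)
        return self.proj(h.squeeze(0))    # (batch, emb_dim)

encoder = SegmentEncoder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Dummy batch: anchor and positive stand in for adjacent segments from the same
# interaction; negative stands in for a temporally distant segment.
anchor   = torch.randn(8, 100, 40)
positive = torch.randn(8, 100, 40)
negative = torch.randn(8, 100, 40)

loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

In the paper's formulation this triplet term augments an encoder-decoder (DCN) reconstruction objective to yield the TE-DCN; the sketch above omits the decoder and shows only how a margin-based loss would enforce that frames with similar behavioral context stay closer than frames with different context.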
