Journal: Computer Speech and Language

Impact of Word Error Rate on theme identification task of highly imperfect human-human conversations

Abstract

This paper reviews the impact of word representations and classification methods on theme identification for telephone conversation services whose automatic transcriptions are highly imperfect. We first compare two word-based representations: the classical Term Frequency-Inverse Document Frequency method with the Gini purity criterion (TF-IDF-Gini) and the latent Dirichlet allocation (LDA) approach. We then introduce a classification method that takes advantage of the LDA topic space representation, which emerged as the best word representation. Two assumptions about topic representation led us to choose a Gaussian Process (GP) based method, whose performance is compared with that of a classical Support Vector Machine (SVM) classifier. Experiments showed that the GP approach handles the multiple-theme complexity of a dialogue better in both conditions studied (manual and automatic transcriptions) (Morchid et al., 2014). To better understand the results obtained with the different word representation methods and classification approaches, we then discuss the impact of the discriminative and non-discriminative words extracted by both word representation methods in terms of transcription accuracy (Morchid et al., 2014). Finally, we propose a novel study that evaluates the impact of the Word Error Rate (WER) on the LDA topic space learning process as well as on the theme identification task. This original qualitative study shows that selecting a small subset of the words with the lowest WER (instead of using all the words) allows the system to classify automatic transcriptions better, with an absolute gain of 0.9 points compared with the best performance achieved on this dialogue classification task (precision of 83.3%).
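
For readers unfamiliar with the metric, the following is a minimal sketch of how a Word Error Rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcription and an automatic one, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say how the authors computed WER, nor how they derived a WER for individual words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Illustrative only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.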
