A study in machine learning from imbalanced data for sentence boundary detection in speech

Abstract

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
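The sampling and ensemble ideas named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' system: the one-feature threshold classifier stands in for their decision tree, the single "pause duration" feature and the synthetic data are assumptions made for illustration, and all function names are hypothetical. The sketch balances the classes by random downsampling, averages the posteriors of classifiers trained on different resampled sets, and scores the result with AUC.

```python
import random

def downsample(examples, rng):
    """Balance classes by randomly dropping majority-class examples."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))

def bootstrap(examples, rng):
    """Bootstrap replicate (sampling with replacement), as used in bagging."""
    return [rng.choice(examples) for _ in examples]

def train_stump(examples):
    """Fit a one-feature threshold classifier (a stand-in for a decision tree).

    Returns a function mapping a feature value to an estimated P(boundary).
    """
    best = None
    for t in sorted({x for x, _ in examples}):
        left = [y for x, y in examples if x <= t]
        right = [y for x, y in examples if x > t]
        # Training errors if each side predicts its own majority label.
        err = (min(sum(left), len(left) - sum(left))
               + min(sum(right), len(right) - sum(right)))
        if best is None or err < best[0]:
            best = (err, t, left, right)
    _, t, left, right = best
    # Laplace-smoothed posterior estimates on each side of the threshold.
    p_left = (sum(left) + 1) / (len(left) + 2)
    p_right = (sum(right) + 1) / (len(right) + 2)
    return lambda x: p_left if x <= t else p_right

def ensemble(examples, resample, n_models, seed=0):
    """Average the posteriors of classifiers trained on resampled sets."""
    rng = random.Random(seed)
    models = [train_stump(resample(examples, rng)) for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / len(models)

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n, pos_rate, rng):
    """Synthetic (pause_duration, is_boundary) pairs; purely illustrative."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < pos_rate else 0
        data.append((rng.gauss(0.6 if y else 0.2, 0.2), y))
    return data

def demo():
    """Train on imbalanced data (about 10% boundaries) and return test AUC."""
    rng = random.Random(42)
    train, test = make_data(1000, 0.1, rng), make_data(500, 0.1, rng)
    model = ensemble(train, downsample, n_models=5)
    return auc([model(x) for x, _ in test], [y for _, y in test])
```

Calling `demo()` returns the AUC of the downsampled ensemble on the synthetic test set. Passing `bootstrap` instead of `downsample` to `ensemble` gives plain bagging over the original, imbalanced training set.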

Bibliographic details

  • Source
    Computer speech and language | 2006, Issue 4 | pp. 468-494 | 27 pages
  • Author affiliations

    Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

    Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46530, USA;

    Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA;

    Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

    Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

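  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: computing technology, computer technology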