A study in machine learning from imbalanced data for sentence boundary detection in speech

Yang Liu; Nitesh V. Chawla; Mary P. Harper; Elizabeth Shriberg; Andreas Stolcke

Abstract

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
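The downsampled-ensemble idea and the contrast between error rate and AUC described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' system: the synthetic data, the single hypothetical "pause duration" feature, the one-split decision stump standing in for a full decision tree, and all function names are assumptions made purely for illustration.

```python
import random

def make_data(n, rng, pos_rate=0.1):
    """Synthetic (feature, label) pairs; label 1 = sentence boundary (rare class)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < pos_rate else 0
        # Hypothetical single prosodic feature, e.g. a normalized pause duration.
        x = rng.gauss(1.0 if y else 0.0, 0.5)
        data.append((x, y))
    return data

def downsample(examples, rng):
    """Keep every positive and an equal-size random draw of negatives."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    return pos + rng.sample(neg, min(len(pos), len(neg)))

def fit_stump(examples):
    """Fit a depth-1 decision tree on one feature.

    Returns a function mapping a feature value to the fraction of
    positive training examples in the leaf it falls into.
    """
    best = None
    for t in sorted({x for x, _ in examples}):
        left = [y for x, y in examples if x < t]
        right = [y for x, y in examples if x >= t]
        if not left or not right:
            continue
        # Training errors if each leaf predicts its majority class.
        errors = (min(sum(left), len(left) - sum(left))
                  + min(sum(right), len(right) - sum(right)))
        if best is None or errors < best[0]:
            best = (errors, t, sum(left) / len(left), sum(right) / len(right))
    _, threshold, p_left, p_right = best
    return lambda x: p_right if x >= threshold else p_left

def train_ensemble(train, n_models, rng):
    """Train one stump per independently downsampled training set."""
    return [fit_stump(downsample(train, rng)) for _ in range(n_models)]

def ensemble_score(models, x):
    """Average the leaf posteriors of all ensemble members."""
    return sum(m(x) for m in models) / len(models)

def error_rate(scores, labels, threshold=0.5):
    """Fraction of examples misclassified at a fixed decision threshold."""
    wrong = sum((s >= threshold) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def demo(seed=0):
    """Train on imbalanced synthetic data and report (error rate, AUC) on held-out data."""
    rng = random.Random(seed)
    train, test = make_data(1000, rng), make_data(1000, rng)
    models = train_ensemble(train, 10, rng)
    labels = [y for _, y in test]
    scores = [ensemble_score(models, x) for x, _ in test]
    return error_rate(scores, labels), auc(scores, labels)
```

Calling `demo()` returns both measures for the same set of scores. Error rate depends on a single decision threshold and on the class prior, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two measures can favor different training-set constructions.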

Bibliographic record

Source
Computer Speech and Language | 2006, Issue 4 | pp. 468-494 | 27 pages
Authors
Yang Liu; Nitesh V. Chawla; Mary P. Harper; Elizabeth Shriberg; Andreas Stolcke
Author affiliations

Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46530, USA;

Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA;

Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

Speech Group, International Computer Science Institute, 1947 Center St., Ste 600, Berkeley, CA 94704, USA;

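Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology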

Similar documents

1. A predictive machine learning application in agriculture: Cassava disease detection and classification with imbalanced dataset using convolutional neural networks [J]. G. Sambasivam, Geoffrey Duncan Opiyo. Egyptian Informatics Journal. 2021, Issue 1
2. A new machine learning-based method for android malware detection on imbalanced dataset [J]. Dehkordy Diyana Tehrany, Rasoolzadegan Abbas. Multimedia Tools and Applications. 2021, Issue 16
3. Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced Dataset [J]. International Journal of Cyber Warfare and Terrorism. 2020, Issue 2
4. Using Machine Learning to Cope with Imbalanced Classes in Natural Speech: Evidence from Sentence Boundary and Disfluency Detection [C]. Yang Liu, Elizabeth Shriberg, Andreas Stolcke. International Conference on Spoken Language Processing, Jeju, Korea, 4-8 October 2004
5. Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions [D]. Bloodgood, Michael. 2009
6. Smartwatch-Based Eating Detection: Data Selection for Machine Learning from Imbalanced Data with Imperfect Labels [O]. Simon Stankoski, Marko Jordan, Hristijan Gjoreski. 2021
7. Financial Fraud Detection and Data Mining of Imbalanced Databases using State Space Machine Learning [O]. Sawh Deitra. 2015
