Journal: Pattern Recognition Letters

Effective semi-supervised learning strategies for automatic sentence segmentation



Abstract

The primary objective of the sentence segmentation process is to determine the sentence boundaries in a stream of words output by automatic speech recognizers. Statistical methods developed for sentence segmentation require a significant amount of labeled data, which is time-consuming, labor-intensive, and expensive to produce. In this work, we propose new multi-view semi-supervised learning strategies for the sentence boundary classification problem using lexical, prosodic, and morphological information. The aim is to find effective semi-supervised machine learning strategies when only small sets of sentence-boundary-labeled data are available. We primarily investigate two semi-supervised learning approaches, self-training and co-training. Different example selection strategies were also used for co-training, namely agreement, disagreement, and self-combined. Furthermore, we propose three-view and committee-based algorithms that incorporate the agreement, disagreement, and self-combined strategies using three disjoint feature sets. We present comparative results of the different learning strategies on the sentence segmentation task. The experimental results show that sentence segmentation performance can be greatly improved using the proposed multi-view learning strategies, since the data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies substantially improve the average baseline F-measure from 67.66% to 75.15% for spoken Turkish and from 64.84% to 66.32% for spoken English when only a small set of manually labeled data is available. (c) 2017 Elsevier B.V. All rights reserved.
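
The abstract does not spell out how the co-training selection strategies work, so the following is only a minimal sketch of one plausible reading of the "agreement" strategy: two classifiers, each trained on its own feature view, label the unlabeled pool, and an example is added to the training set only when both views predict the same label with enough confidence. Everything here is an assumption made for illustration: the one-dimensional views (standing in for, say, a lexical score and a prosodic score), the nearest-centroid classifier, and the names `train_centroid`, `predict`, `co_train_agreement`, and `min_margin` are hypothetical and are not taken from the paper.

```python
from statistics import mean


def train_centroid(values, labels):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class."""
    return {c: mean(v for v, y in zip(values, labels) if y == c)
            for c in set(labels)}


def predict(model, value):
    """Return (label, margin) for a two-class model.

    The margin is the gap between the distances to the two centroids
    and serves as a crude confidence score (assumption, not the paper's).
    """
    dists = sorted((abs(value - m), c) for c, m in model.items())
    (d_near, label), (d_far, _) = dists[0], dists[1]
    return label, d_far - d_near


def co_train_agreement(labeled, unlabeled, rounds=5, min_margin=0.5):
    """Two-view co-training with an agreement-based selection rule.

    labeled:   list of ((view_a, view_b), label)
    unlabeled: list of (view_a, view_b)
    Returns the enlarged labeled set and the remaining unlabeled pool.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        ys = [y for _, y in labeled]
        model_a = train_centroid([x[0] for x, _ in labeled], ys)
        model_b = train_centroid([x[1] for x, _ in labeled], ys)
        remaining, added = [], 0
        for x in pool:
            label_a, margin_a = predict(model_a, x[0])
            label_b, margin_b = predict(model_b, x[1])
            # Agreement: both views vote the same way and both are confident.
            if label_a == label_b and min(margin_a, margin_b) >= min_margin:
                labeled.append((x, label_a))
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    return labeled, pool


if __name__ == "__main__":
    seed = [((0.0, 0.1), 0), ((1.0, 0.9), 1)]  # 0 = no boundary, 1 = boundary
    pool = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.95), (0.5, 0.5)]
    grown, left = co_train_agreement(seed, pool)
    print(len(grown), left)
```

In this toy run, the two examples on which both views agree confidently are absorbed into the labeled set, while the example with conflicting views and the ambiguous one stay in the pool. A disagreement-style rule would instead target examples where the views conflict; how the paper defines that rule and the self-combined one is not stated in the abstract.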
