Journal of Logic and Computation

Learning structural dependencies of words in the Zipfian Tail


Abstract

This article uses semi-supervised Expectation Maximization (EM) to learn lexico-syntactic dependencies, i.e. associations between words and the structures that occur with them. Due to Zipfian distributions in language, such dependencies are extremely sparse in labelled data, and unlabelled data are the only source for learning them. Specifically, we learn sparse lexical parameters of a generative parsing model (a Probabilistic Context-Free Grammar, PCFG) that is initially estimated over the Penn Treebank. Our lexical parameters are similar to supertags: they are fine-grained, and encode complex structural information at the pre-terminal level. Our goal is to use unlabelled data to learn these for words that are rare or unseen in the labelled data. We get large error reductions (up to 17.5%) in parsing ambiguous structures associated with unseen verbs, the most important case of learning lexico-structural dependencies, resulting in a statistically significant improvement in labelled bracketing score of the treebank PCFG. Our semi-supervised method incorporates structural and lexical priors from the labelled data to guide estimation from unlabelled data, and is the first successful use of semi-supervised EM to improve a generative structured model already trained over large labelled data. The method scales well to larger amounts of unlabelled data, and also gives substantial error reductions (up to 11.5%) for models trained on smaller amounts of labelled data, making it relevant to low-resource languages with small treebanks as well.
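The general idea of re-estimating a rare word's lexical parameters from unlabelled data, while a prior from labelled data guides the estimate, can be illustrated with a toy sketch. This is not the article's model or implementation: the function `em_lexical_params`, the two supertag-like categories, the context features, and all probabilities below are hypothetical, and the "parser" is reduced to a fixed table of P(context | tag). The sketch only shows the shape of an EM loop in which prior pseudo-counts from labelled data are added in the M-step.

```python
def em_lexical_params(prior, likelihood, contexts, prior_weight=1.0, iterations=20):
    """Toy EM estimate of P(tag | word) for one rare word.

    prior        -- dict tag -> P(tag), assumed to come from labelled data
    likelihood   -- dict tag -> {context: P(context | tag)}, held fixed here
    contexts     -- observed contexts of the word in unlabelled data
    prior_weight -- pseudo-count mass given to the prior (hypothetical knob)
    """
    tags = list(prior)
    theta = dict(prior)  # start from the labelled-data prior
    for _ in range(iterations):
        expected = {t: 0.0 for t in tags}
        # E-step: posterior over tags for each unlabelled occurrence
        for c in contexts:
            scores = {t: theta[t] * likelihood[t].get(c, 0.0) for t in tags}
            z = sum(scores.values())
            if z == 0.0:
                continue  # no tag explains this context; skip it
            for t in tags:
                expected[t] += scores[t] / z
        # M-step: expected counts plus prior pseudo-counts, renormalised
        total = sum(expected.values()) + prior_weight
        theta = {t: (expected[t] + prior_weight * prior[t]) / total for t in tags}
    return theta


# Hypothetical example: an unseen verb that is mostly followed by an NP.
prior = {"transitive": 0.5, "intransitive": 0.5}
likelihood = {
    "transitive": {"NP_follows": 0.9, "no_NP": 0.1},
    "intransitive": {"NP_follows": 0.2, "no_NP": 0.8},
}
contexts = ["NP_follows"] * 8 + ["no_NP"] * 2
print(em_lexical_params(prior, likelihood, contexts))
```

In this sketch, a larger `prior_weight` keeps the estimate closer to the labelled-data prior, while more unlabelled occurrences let the expected counts dominate. With no unlabelled occurrences at all, the estimate stays at the prior.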
