
Preparing, restructuring, and augmenting a French treebank: lexicalised parsers or coherent treebanks?


Abstract

We present the Modified French Treebank (MFT), a completely revamped French Treebank derived from the Paris 7 Treebank (P7T), which is cleaner, more coherent, has several transformed structures, and introduces new linguistic analyses. To determine the effect of these changes, we investigate how the MFT fares in statistical parsing. Probabilistic parsers trained on the MFT training set (currently 3800 trees) already perform better than their counterparts trained on five times as much P7T data (18,548 trees), providing an extreme example of the importance of data quality over quantity in statistical parsing. Moreover, regression analysis on the learning curve of parsers trained on the MFT leads to the prediction that parsers trained on the full projected 18,548-tree MFT training set will far outscore their counterparts trained on the full P7T. These analyses also show how problematic data can lead to problematic conclusions; in particular, we find that lexicalisation in the probabilistic parsing of French is probably not as crucial as was once thought (Arun and Keller, 2005).
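The prediction for the full 18,548-tree MFT training set rests on fitting and extrapolating a learning curve. A minimal sketch of that kind of extrapolation is shown below, assuming a hypothetical inverse power-law functional form and placeholder F-scores; the abstract does not specify the regression model or report these data points, so both are illustrative assumptions.

```python
# Sketch of learning-curve extrapolation; data points and functional form are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (training-set size, labelled F-score) pairs for parsers trained on MFT subsets.
sizes = np.array([500, 1000, 2000, 3000, 3800], dtype=float)
fscores = np.array([74.0, 77.5, 80.2, 81.6, 82.3])

def learning_curve(n, a, b, c):
    """Inverse power law: performance approaches the asymptote a as the data size n grows."""
    return a - b * n ** (-c)

# Fit the curve to the observed points, starting from a rough initial guess.
params, _ = curve_fit(learning_curve, sizes, fscores, p0=[90.0, 100.0, 0.5], maxfev=10000)

# Extrapolate to the full projected MFT training set of 18,548 trees.
predicted = learning_curve(18548.0, *params)
print(f"Fitted asymptote: {params[0]:.1f}, predicted F-score at 18,548 trees: {predicted:.1f}")
```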
