Published in: International Joint Conference on Natural Language Processing; Conference on Empirical Methods in Natural Language Processing

BiPaR: A Bilingual Parallel Dataset for Multilingual and Cross-lingual Reading Comprehension on Novels

Abstract

This paper presents BiPaR, a bilingual parallel novel-style machine reading comprehension (MRC) dataset, developed to support multilingual and cross-lingual reading comprehension. The biggest difference between BiPaR and existing reading comprehension datasets is that each triple (Passage, Question, Answer) in BiPaR is written in parallel in two languages. We collect 3,667 bilingual parallel paragraphs from Chinese and English novels, from which we construct 14,668 parallel question-answer pairs via crowdsourced workers following a strict quality control procedure. We analyze BiPaR in depth and find that it offers good diversity in question prefixes, answer types, and relationships between questions and passages. We also observe that answering questions about novels requires reading comprehension skills such as coreference resolution, multi-sentence reasoning, and understanding of implicit causality. With BiPaR, we build monolingual, multilingual, and cross-lingual MRC baseline models. Even for the relatively simple monolingual MRC task on this dataset, experiments show that a strong BERT baseline is over 30 points behind human performance in terms of both EM and F1 score, indicating that BiPaR provides a challenging testbed for monolingual, multilingual, and cross-lingual MRC on novels.
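The EM and F1 scores mentioned in the abstract are commonly computed per question by comparing a predicted answer span with a reference answer. The following is a minimal sketch of that comparison, assuming a SQuAD-style token-overlap definition. The abstract does not specify the exact normalization, so the function names (`tokens`, `exact_match`, `f1_score`) and the whitespace tokenization are illustrative assumptions, not the authors' evaluation script. Chinese answers would need character-level or word-segmented tokens instead of whitespace splitting.

```python
from collections import Counter


def tokens(text):
    # Assumption: lowercase and split on whitespace (English only).
    return text.lower().split()


def exact_match(prediction, reference):
    # 1.0 if the token sequences are identical, otherwise 0.0.
    return float(tokens(prediction) == tokens(reference))


def f1_score(prediction, reference):
    # Token-level F1: harmonic mean of precision and recall
    # over the multiset of tokens shared by both answers.
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are then typically the average of these per-question values, expressed as percentages, which is the scale on which a gap of "over 30 points" would be read.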
