Conference: International Joint Conference on Natural Language Processing

PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts


Abstract

We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.
