Journal: BMC Medical Informatics and Decision Making

ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application


Abstract

Rising global cancer incidence has a serious health impact on countries worldwide. Knowledge-powered health systems in different languages would enhance clinicians' healthcare practice, patients' health management, and public health literacy. A high-quality corpus containing cancer information is a necessary foundation for cancer education. Massive unstructured information resources exist in clinical narratives, electronic health records (EHR), and similar sources, but they can be used to train AI models only after being transformed into a structured corpus. However, the scarcity of multilingual cancer corpora limits intelligent processing, such as machine translation in medical scenarios. We therefore created a cancer-specific cross-lingual corpus and opened it to the public for academic use.

Aiming to build an English-Chinese cancer parallel corpus, we developed a seven-step workflow: data retrieval, data parsing, data processing, corpus implementation, assessment verification, corpus release, and application. We applied the workflow to PDQ (Physician Data Query), a cross-lingual, comprehensive, and authoritative cancer information resource. We constructed, validated, and released the parallel corpus, named ECCParaCorp, and made it openly accessible online.

The proposed English-Chinese Cancer Parallel Corpus (ECCParaCorp) consists of 6685 aligned text pairs in XML, Excel, and CSV formats: 5190 sentence pairs, 1083 phrase pairs, and 412 word pairs. It covers six cancers (breast, liver, lung, esophageal, colorectal, and stomach cancer) and three cancer themes (prevention, screening, and treatment). All data in the parallel corpus are online and available for users to browse and download (http://www.phoc.org.cn/ECCParaCorp/).

ECCParaCorp is an openly accessible, cross-lingual parallel corpus focused on cancer. It would help redress the scarcity of multilingual corpus resources, bridge the gap between human-readable information and machine-understandable data resources, and serve as a preparatory data foundation for intelligent technology applications, e.g. cancer-related machine translation, development of cancer systems for medical education, and disease-oriented knowledge extraction.
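As a rough illustration of how a CSV export of such a corpus might be consumed, the sketch below loads aligned pairs and tallies them by granularity. The abstract does not describe the file schema, so the column names (`level`, `cancer`, `theme`, `en`, `zh`) and the three sample rows are hypothetical placeholders, not taken from ECCParaCorp. Only the reported totals (5190, 1083, 412, and 6685) come from the abstract.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real ECCParaCorp column names and content may differ.
SAMPLE_CSV = """level,cancer,theme,en,zh
sentence,breast cancer,screening,Example English sentence.,示例中文句子。
phrase,lung cancer,prevention,example phrase,示例短语
word,liver cancer,treatment,example,示例
"""

# Pair counts reported in the abstract.
REPORTED = {"sentence": 5190, "phrase": 1083, "word": 412}
REPORTED_TOTAL = 6685


def load_pairs(text):
    """Parse CSV text into dict rows, skipping rows where either side is empty."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["en"].strip() and row["zh"].strip()]


def count_by(pairs, field):
    """Count aligned pairs grouped by the given column."""
    return Counter(row[field] for row in pairs)


pairs = load_pairs(SAMPLE_CSV)
level_counts = count_by(pairs, "level")

# Consistency check on the figures in the abstract: the parts sum to the total.
totals_consistent = sum(REPORTED.values()) == REPORTED_TOTAL

if __name__ == "__main__":
    print(dict(level_counts))
    print("reported totals consistent:", totals_consistent)
```

With a downloaded file, the same `load_pairs` logic could read from `open(path, encoding="utf-8", newline="")` instead of an in-memory string, after adjusting the column names to the actual header.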
