BMC Medical Informatics and Decision Making

Transformers-sklearn: a toolkit for medical language understanding with transformer-based models


Abstract

The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of beginning to use transformer-based models in medical language understanding and to extend the capability of the scikit-learn toolkit to deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits.

In transformers-sklearn, three Python classes were implemented: BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class provides three methods: fit for fine-tuning a transformer-based model on the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus, so newcomers only need to prepare the dataset; the model framework and training methods are predefined in transformers-sklearn.

We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average size of our scripts is 45 lines per task, one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers.

The proposed toolkit can help newcomers easily address medical language understanding tasks in the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
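To make the workflow concrete, here is a minimal usage sketch of the fit/score/predict pattern described above. The class name BERTologyClassifier and the parameters model_name_or_path and model_type are taken from the abstract; the import path and the assumption that fit and predict accept plain Python lists of texts and labels in scikit-learn style are illustrative guesses rather than confirmed details of the released toolkit.

    # Hedged sketch: the import path and the (texts, labels) input
    # format are assumptions, not confirmed by the abstract.
    from transformers_sklearn import BERTologyClassifier

    # Toy training and test data in scikit-learn style.
    X_train = ["This trial evaluates a new insulin regimen.",
               "Patients reported mild headache after dosing."]
    y_train = ["trial_design", "adverse_event"]
    X_test = ["Subjects reported nausea during the study."]
    y_test = ["adverse_event"]

    # The two parameters named in the abstract select the backbone model.
    clf = BERTologyClassifier(model_type="bert",
                              model_name_or_path="bert-base-cased")

    clf.fit(X_train, y_train)          # fine-tune the pretrained model
    print(clf.score(X_test, y_test))   # evaluate the fine-tuned model
    print(clf.predict(X_test))         # predict labels for new texts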
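The NER task follows the same three-method pattern. Below is a hedged sketch for BERTologyNERClassifier, assuming the annotated corpus is passed as token sequences paired with BIO-style tag sequences; since the abstract says the input format is generated automatically from the annotated corpus, the exact shape shown here is an assumption.

    # Hedged sketch: BIO tags and list-of-token-lists input are assumptions.
    from transformers_sklearn import BERTologyNERClassifier

    # BC5CDR-style toy examples: chemical and disease mentions.
    X_train = [["Naloxone", "reverses", "the", "antihypertensive", "effect"],
               ["Lidocaine", "caused", "cardiac", "asystole"]]
    y_train = [["B-Chemical", "O", "O", "O", "O"],
               ["B-Chemical", "O", "B-Disease", "I-Disease"]]

    ner = BERTologyNERClassifier(model_type="bert",
                                 model_name_or_path="bert-base-cased")
    ner.fit(X_train, y_train)                         # fine-tune on tagged tokens
    print(ner.predict([["Aspirin", "induced", "asthma"]]))  # tag new tokens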
