Journal: BMC Medical Informatics and Decision Making

A clinical text classification paradigm using weak supervision and deep representation


Abstract

Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort. We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Because the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the effectiveness of the paradigm on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models using this paradigm: Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN). Precision, recall, and F1 score are used as metrics to evaluate performance. CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms.
We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. By leveraging weak supervision and deep representation, the proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and engineering features when applying machine learning to clinical text classification. The experiments validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
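
The paradigm described in the abstract (rule-generated labels, then embedding features, then a trained classifier) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the keyword rules, the two-dimensional `TOY_EMBEDDINGS` table standing in for pre-trained word embeddings, the example notes, and the nearest-centroid classifier, which replaces the SVM, RF, MLPNN, and CNN models the study actually tests.

```python
import re

# Hypothetical keyword rules standing in for the rule-based NLP algorithm.
SMOKING_TERMS = re.compile(r"\b(smokes|smoker|smoking)\b", re.IGNORECASE)
NEGATIONS = re.compile(r"\b(denies|never|non)\b", re.IGNORECASE)


def rule_label(text):
    """Weak label: 1 = smoker, 0 = non-smoker, None = the rules abstain."""
    if not SMOKING_TERMS.search(text):
        return None
    return 0 if NEGATIONS.search(text) else 1


# Hand-made 2-d vectors; a stand-in for real pre-trained word embeddings.
TOY_EMBEDDINGS = {
    "smokes": (0.9, 0.1), "smoker": (0.8, 0.2), "smoking": (0.7, 0.1),
    "cigarettes": (0.9, 0.2), "pack": (0.8, 0.1), "daily": (0.3, 0.0),
    "denies": (-0.9, 0.1), "never": (-0.8, 0.2), "non": (-0.9, 0.0),
}


def embed(text):
    """Represent a note as the mean of its known word vectors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train_centroids(notes):
    """Fit one centroid per weak label; notes the rules skip are ignored."""
    sums, counts = {}, {}
    for note in notes:
        label = rule_label(note)
        if label is None:
            continue
        vec = embed(note)
        prev = sums.get(label, (0.0,) * len(vec))
        sums[label] = tuple(p + v for p, v in zip(prev, vec))
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(s / counts[lab] for s in sums[lab]) for lab in sums}


def predict(centroids, text):
    """Assign the label whose centroid is closest to the note's embedding."""
    vec = embed(text)
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
    )


notes = [
    "Patient smokes one pack daily.",
    "Current smoker.",
    "Patient denies smoking.",
    "Never a smoker.",
    "Follow-up for hip pain.",
]
centroids = train_centroids(notes)

unseen = "Two cigarettes daily."
print(rule_label(unseen), predict(centroids, unseen))  # prints: None 1
```

In this toy setup the rules abstain on the unseen note because it contains none of their keywords, while the classifier trained on the weak labels still assigns it to the smoker class through embedding similarity. That is only a small-scale analogue of the abstract's observation that a trained model can capture patterns beyond the rules that labeled its training data.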