Journal: Journal of Biomedical Informatics

Semi-supervised clinical text classification with Laplacian SVMs: An application to cancer case management


Abstract

Objective: To compare linear and Laplacian SVMs on a clinical text classification task, and to evaluate the effect of unlabeled training data on Laplacian SVM performance.

Background: Developing machine-learning-based clinical text classifiers requires labeled training data, obtained via manual review by clinicians. Because labeling is laborious and expensive, training data sets in the clinical domain are of limited size. In contrast, electronic medical record (EMR) systems contain hundreds of thousands of unlabeled notes that supervised machine learning approaches do not use. Semi-supervised learning algorithms use both labeled and unlabeled data to train classifiers, and can outperform their supervised counterparts.

Methods: We trained support vector machines (SVMs) and Laplacian SVMs on a training reference standard of 820 abdominal CT, MRI, and ultrasound reports labeled for the presence of potentially malignant liver lesions that require follow-up (positive class prevalence 77%). The Laplacian SVM used 19,845 randomly sampled unlabeled notes in addition to the training reference standard. We evaluated SVMs and Laplacian SVMs on a test set of 520 labeled reports.

Results: The Laplacian SVM trained on labeled and unlabeled radiology reports significantly outperformed supervised SVMs (macro-F1 0.773 vs. 0.741, sensitivity 0.943 vs. 0.911, positive predictive value 0.877 vs. 0.883). Performance improved with the number of labeled and unlabeled notes used to train the Laplacian SVM (Pearson's ρ = 0.529 for the correlation between number of unlabeled notes and macro-F1 score). These results suggest that practical semi-supervised methods such as the Laplacian SVM can leverage the large, unlabeled corpora that reside within EMRs to improve clinical text classification.
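The three metrics reported in the Results (macro-F1, sensitivity, positive predictive value) can be illustrated with a minimal Python sketch. It assumes binary labels (1 = lesion requiring follow-up, 0 = otherwise) and assumes macro-F1 means the unweighted mean of the per-class F1 scores; the abstract does not spell out its exact definition. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, PPV and macro-F1 for binary labels (1 = positive, 0 = negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts with respect to the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def f1(hits: int, false_alarms: int, misses: int) -> float:
        denom = 2 * hits + false_alarms + misses
        return 2 * hits / denom if denom else 0.0

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall of the positive class
    ppv = tp / (tp + fp) if (tp + fp) else 0.0          # precision of the positive class

    f1_pos = f1(tp, fp, fn)
    # For the negative class the roles swap: its hits are tn, its false
    # alarms are fn, and its misses are fp.
    f1_neg = f1(tn, fn, fp)

    # Assumption: macro-F1 is the unweighted mean of the two per-class F1 scores.
    macro_f1 = (f1_pos + f1_neg) / 2
    return {"sensitivity": sensitivity, "ppv": ppv, "macro_f1": macro_f1}


# Hypothetical toy labels, for illustration only (not data from the study).
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
print(binary_metrics(toy_true, toy_pred))
```

Because macro-F1 weights both classes equally, it is sensitive to performance on the minority (negative) class, which matters here given the 77% positive class prevalence.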
