Journal: JMIR Medical Informatics

Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach


Abstract

Background: Clinical trials are an important step in introducing new interventions into clinical practice because they generate data on safety and efficacy. Trial participants need to be similar so that the findings can be attributed to the interventions studied and not to other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that all participants must share. Unfortunately, eligibility criteria are often too complex to be translated directly into executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time consuming, which delays the recruitment process.

Objective: Track 1 of the 2018 National Natural Language Processing Clinical Challenge focused on the task of cohort selection for clinical trials, aiming to answer the following question: Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials? The task required the participating systems to analyze longitudinal patient records to determine whether the corresponding patients met the given eligibility criteria. We aimed to describe a system developed to address this task.

Methods: Our system consisted of 13 classifiers, one for each eligibility criterion. All classifiers used a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such a representation, a pattern-matching approach was used to extract context-sensitive features. These features were embedded back into the text as lexically distinguishable tokens, which were consequently featured in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances was available to learn from. A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria.

Results: The system was evaluated using the microaveraged F measure. Four machine learning algorithms (support vector machine, logistic regression, naïve Bayesian classifier, and gradient tree boosting [GTB]) were evaluated on the training data using 10-fold cross-validation. Overall, GTB demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. The final evaluation was performed on previously unseen test data. On average, the F measure of 89.04% was comparable to 3 of the top-ranked performances in the shared task (91.11%, 90.28%, and 90.21%). With an F measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50%, and 70.81%) in identifying patients with advanced coronary artery disease.

Conclusions: The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when they are trained on a relatively small dataset.
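The abstract names three techniques that are easier to follow with a small example: embedding pattern-matched features back into the text as distinct tokens before building a bag of words, oversampling to balance the training data, and scoring with a microaveraged F measure. The sketch below is an illustration only, not the authors' implementation. The two regular expressions, the token names (`HBA1C_IN_RANGE`, `ALCOHOL_NEGATED`), the HbA1c range, and all function names are hypothetical, and the abstract does not say how counts were pooled for microaveraging.

```python
import random
import re
from collections import Counter

# Hypothetical context-sensitive patterns. Each match is replaced by a
# lexically distinct token so that a bag-of-words model can still see it.
PATTERNS = [
    # A numeric HbA1c value; the 6.5-9.5 range is an assumed threshold.
    (re.compile(r"\bhba1c\s*(?:of|:|=)?\s*(\d+(?:\.\d+)?)\s*%?"),
     lambda m: " HBA1C_IN_RANGE " if 6.5 <= float(m.group(1)) <= 9.5
     else " HBA1C_OUT_OF_RANGE "),
    # A negated mention, which a plain bag of words would lose.
    (re.compile(r"\b(?:denies|no history of)\s+(?:alcohol|etoh)\s+(?:abuse|use)\b"),
     lambda m: " ALCOHOL_NEGATED "),
]

def embed_features(text):
    """Lowercase the text, then substitute upper-case feature tokens."""
    text = text.lower()
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def bag_of_words(text):
    """Count tokens after the feature tokens have been embedded."""
    return Counter(re.findall(r"[A-Za-z0-9_]+", embed_features(text)))

def oversample(instances, labels, seed=0):
    """Randomly duplicate minority-class instances until classes are equal."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(instances, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced

def micro_f1(counts):
    """Pool (tp, fp, fn) tuples across classifiers, then compute F1 once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `bag_of_words("HbA1c of 8.2% noted. Patient denies alcohol abuse.")` yields counts for `HBA1C_IN_RANGE`, `noted`, `patient`, and `ALCOHOL_NEGATED`, so both the numeric comparison and the negation survive the loss of word order. The resulting counts could then be fed to any of the classifiers the abstract lists.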
