Journal: JMIR Medical Informatics

Interpretability and Class Imbalance in Prediction Models for Pain Volatility in Manage My Pain App Users: Analysis Using Feature Selection and Majority Voting Methods

Abstract

Background: Pain volatility is an important factor in chronic pain experience and adaptation. Previously, we employed machine-learning methods to define and predict pain volatility levels from users of the Manage My Pain app. Reducing the number of features is important to help increase interpretability of such prediction models. Prediction results also need to be consolidated from multiple random subsamples to address the class imbalance issue.

Objective: This study aimed to: (1) increase the interpretability of previously developed pain volatility models by identifying the most important features that distinguish high from low volatility users; and (2) consolidate prediction results from models derived from multiple random subsamples while addressing the class imbalance issue.

Methods: A total of 132 features were extracted from the first month of app use to develop machine learning–based models for predicting pain volatility at the sixth month of app use. Three feature selection methods were applied to identify features that were significantly better predictors than other members of the large features set used for developing the prediction models: (1) Gini impurity criterion; (2) information gain criterion; and (3) Boruta. We then combined the three groups of important features determined by these algorithms to produce the final list of important features. Three machine learning methods were then employed to conduct prediction experiments using the selected important features: (1) logistic regression with ridge estimators; (2) logistic regression with least absolute shrinkage and selection operator; and (3) random forests. Multiple random under-sampling of the majority class was conducted to address class imbalance in the dataset. Subsequently, a majority voting approach was employed to consolidate prediction results from these multiple subsamples. The total number of users included in this study was 879, with a total number of 391,255 pain records.

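The under-sampling and majority-voting steps described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation. The class sizes (694 low, 185 high) come from the abstract, but the single synthetic feature, the toy threshold classifier `fit_threshold`, the helper names, and the choice of 5 subsamples are all assumptions made for the example. The abstract does not say how many subsamples were drawn or how held-out evaluation was arranged, so the sketch simply predicts on the full toy dataset.

```python
import random
from collections import Counter

def undersample(labels, rng):
    """Return indices of a balanced subsample: every minority-class row
    plus an equally sized random draw from the majority class."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx + rng.sample(majority_idx, len(minority_idx))

def fit_threshold(xs, ys):
    """Hypothetical stand-in for a real model: the midpoint between class means."""
    low = [x for x, y in zip(xs, ys) if y == 0]
    high = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(low) / len(low) + sum(high) / len(high)) / 2

def majority_vote(predictions_per_model):
    """Consolidate one prediction list per model into one label per user."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

rng = random.Random(0)
labels = [0] * 694 + [1] * 185  # 0 = low volatility, 1 = high volatility
# Synthetic feature values, invented purely for illustration.
features = [rng.gauss(1.0 if y == 0 else 2.5, 0.5) for y in labels]

n_models = 5  # assumed; an odd count avoids tied votes for two classes
all_predictions, subsample_sizes = [], []
for _ in range(n_models):
    idx = undersample(labels, rng)
    subsample_sizes.append(len(idx))
    threshold = fit_threshold([features[i] for i in idx],
                              [labels[i] for i in idx])
    all_predictions.append([int(x > threshold) for x in features])

consolidated = majority_vote(all_predictions)
```

Each subsample keeps all 185 high-volatility users and draws a fresh 185 of the 694 low-volatility users, so every model trains on balanced data while different majority-class users contribute across models.
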
Results: A threshold of 1.6 was established using clustering methods to differentiate between 2 classes: low volatility (n=694) and high volatility (n=185). The overall prediction accuracy is approximately 70% for both random forests and logistic regression models when using 132 features. Overall, 9 important features were identified using 3 feature selection methods. Of these 9 features, 2 are from the app use category and the other 7 are related to pain statistics. After consolidating models that were developed using random subsamples by majority voting, logistic regression models performed equally well using 132 or 9 features. Random forests performed better than logistic regression methods in predicting the high volatility class. The consolidated accuracy of random forests does not drop significantly (601/879; 68.4% vs 618/879; 70.3%) when only 9 important features are included in the prediction model.

Conclusions: We employed feature selection methods to identify important features in predicting future pain volatility. To address class imbalance, we consolidated models that were developed using multiple random subsamples by majority voting. Reducing the number of features did not result in a significant decrease in the consolidated prediction accuracy.
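
As a quick check on the figures reported in the Results, the percentages can be recomputed from the stated counts. The variable names are illustrative. Assigning 601/879 to the 9-feature model and 618/879 to the 132-feature model is an inference from the word "drop", not something the abstract states explicitly. The last value is added context rather than a result from the study: it is the accuracy a trivial rule that always predicts "low volatility" would reach on these class sizes.

```python
def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

n_low, n_high = 694, 185          # class sizes from the abstract
n_total = n_low + n_high          # 879 users

acc_9_features = pct(601, n_total)     # random forests, 9 features (assumed order)
acc_132_features = pct(618, n_total)   # random forests, 132 features (assumed order)
high_share = pct(n_high, n_total)      # share of high-volatility users
always_low_accuracy = pct(n_low, n_total)  # trivial majority-class rule

print(n_total, acc_9_features, acc_132_features, high_share, always_low_accuracy)
```

Because roughly four in five users fall in the low-volatility class, overall accuracy alone can hide poor performance on the high-volatility class. This is consistent with the abstract reporting the high-volatility class separately.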
