The Journal of Graduate Medical Education

Assessing the Reliability of Performance Assessment Scores: Some Considerations in Selecting an Appropriate Framework



Abstract

The incorporation of learner assessments has become part and parcel of the accreditation process over the past few decades as a means of evaluating program or instructional effectiveness.1 Given the high stakes associated with assessments, not only for individual candidate-based decisions but also for programs as a whole, it is critical to ensure that scores based on any tool meet certain psychometric standards. At its most elemental level, any test score is intended to reflect the competency domain(s) presumed to underlie the assessment. For example, if a candidate obtains a score of 90% on a direct observation tool, this might be interpreted as reflecting "strong" patient care, even though that inference is, in all likelihood, based on a small number of encounters. Given that high-stakes decisions may rest on such observational tools, it is critical that the sample of performance reflect the candidate's true ability in that competency. Reliability refers to the extent to which performance on any assessment (ie, in a restricted number of encounters) is indicative of the candidate's true competency level (ie, in an infinite number of encounters).2 An "unreliable" assessment (ie, one that does not reflect the candidate's true competency level) could have dire consequences not only for the physician's medical education but also for the accreditation of the postgraduate program. Because testing time is restricted, any assessment encompasses a limited sample of encounters that theoretically represents the domain of interest. The selection of 10 patients for inclusion in a direct observation assessment, for example, might be dictated by 3 hours of available testing time. However, one could conceive of many different sets of 10 patients that could have been selected.
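The point about different possible sets of 10 patients can be made concrete with a short simulation. This sketch is illustrative only: the "true" ability of 0.8 and the pass/fail (Bernoulli) scoring of each encounter are assumptions introduced here, not details from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

true_ability = 0.8      # hypothetical probability of a satisfactory encounter
n_encounters = 10       # encounters per assessment, as in the 3-hour example
n_replications = 1000   # alternative sets of 10 patients that could be drawn

# Each replication is one possible assessment: 10 pass/fail encounters,
# summarized as the proportion of satisfactory encounters (the "score").
observed_scores = rng.binomial(n_encounters, true_ability,
                               size=n_replications) / n_encounters

print(f"true ability:        {true_ability:.2f}")
print(f"mean observed score: {observed_scores.mean():.3f}")
print(f"SD across samples:   {observed_scores.std(ddof=1):.3f}")
```

Under these assumptions the standard deviation across samples is roughly sqrt(0.8 × 0.2 / 10) ≈ 0.13, so observed scores anywhere from 70% to 90% are routine even though the candidate's ability never changed; this is the sampling variability that a reliability analysis must quantify.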
The program director who is reviewing a candidate's score of 90% with these 10 patients is not interested in restricting his or her interpretation of that "strong" performance to these 10 specific encounters, but rather generalizes the statement to the (theoretically infinite) pool of encounters from which the sample of 10 was drawn. Yet several sources of measurement error can detract from the accuracy, or precision, with which performance on a restricted sample of encounters generalizes to the broader domain. With performance assessments, the examiners, the setting, and other factors (in addition to the restricted sample of encounters) can affect a candidate's score. Reliability allows us to estimate how well a score on any assessment (ie, a sample of performance) generalizes to the broader domain(s) of interest. In the previous example, how accurately does a score of 90%, obtained in 10 patient encounters scored by 10 examiners, generalize to all possible patient encounters and physician examiners? This generalization is quantified with a reliability coefficient. Note that patients and examiners are sources of measurement error, given that a candidate's true score or ability level should not depend on the particular sample of patients or examiners encountered. A candidate's true ability level should be invariant across all these sources of measurement error, or facets. In reality, all of these sources detract from reliability, for example, when the patient encounters selected for an examination are unrepresentative or the examiners are poorly trained. Commonly, Cronbach's α coefficient is computed as the reliability estimate, largely because it is readily available in most statistical software packages.3 However, the use of Cronbach's α with examinations that are affected by several sources of measurement error, such as performance-based assessments, is ill-advised.
Specifically, this coefficient does not partition all sources of measurement error in the computation of the reliability estimate; rather, it is restricted to only 1 facet (ie, "patient encounters" in the previous example). Cronbach's α can thus yield a very misleading (spurious) reliability estimate.
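A generalizability-theory analysis, by contrast, partitions each facet's variance separately. The sketch below runs a one-facet persons × raters G study on invented numbers (real G studies typically involve more facets and dedicated software); it shows how a systematic rater-severity effect, invisible to relative coefficients in the spirit of α, lowers the dependability coefficient that matters for absolute pass/fail decisions.

```python
import numpy as np

def one_facet_g_study(scores):
    """Variance components for a fully crossed persons x raters design.

    Returns (var_person, var_rater, var_residual, e_rho2, phi): e_rho2 is
    the generalizability coefficient (relative decisions) and phi the
    dependability coefficient (absolute decisions), both for the observed
    number of raters.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_r = scores.shape
    grand = scores.mean()
    ss_person = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_rater = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ((scores - grand) ** 2).sum() - ss_person - ss_rater
    ms_residual = ss_residual / ((n_p - 1) * (n_r - 1))
    var_res = ms_residual  # person x rater interaction confounded with error
    var_p = max((ss_person / (n_p - 1) - ms_residual) / n_r, 0.0)
    var_r = max((ss_rater / (n_r - 1) - ms_residual) / n_p, 0.0)
    e_rho2 = var_p / (var_p + var_res / n_r)
    phi = var_p / (var_p + (var_r + var_res) / n_r)
    return var_p, var_r, var_res, e_rho2, phi

# Invented scores: 3 candidates, 2 raters; rater 2 is 1 point more lenient.
var_p, var_r, var_res, e_rho2, phi = one_facet_g_study([[8, 9], [6, 7], [4, 5]])
print(f"E-rho^2 (relative): {e_rho2:.3f}")  # ranking of candidates is stable
print(f"Phi (absolute):     {phi:.3f}")     # rater severity now counts as error
```

Here the relative coefficient is a perfect 1.0 because both raters rank the candidates identically, while the dependability coefficient drops below 1.0 once the rater-severity component enters the error term; a single-facet index such as α would report only the former.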