Consensus Conference Follow‐up: Inter‐rater Reliability Assessment of the Best Evidence in Emergency Medicine (BEEM) Rater Scale, a Medical Literature Rating Tool for Emergency Physicians

Abstract

BACKGROUND: Studies published in general and specialty medical journals have the potential to improve emergency medicine (EM) practice, but awareness of this evidence can be delayed because emergency physicians (EPs) are unlikely to read most of these journals. Also, not all published studies are intended for, or ready for, application in clinical practice. The authors developed "Best Evidence in Emergency Medicine" (BEEM) to ameliorate these problems by searching for, identifying, appraising, and translating potentially practice-changing studies for EPs. An initial step in the BEEM process is the BEEM rater scale, a novel tool for EPs to collectively evaluate the relative clinical relevance of EM-related studies found in more than 120 journals. The BEEM rater process was designed to serve as a clinical relevance filter to identify those studies with the greatest potential to affect EM practice. Therefore, only those studies identified by BEEM raters as having the highest clinical relevance are selected for the subsequent critical appraisal process and, if found methodologically sound, are promoted as the best evidence in EM.

OBJECTIVES: The primary objective was to measure the inter-rater reliability (IRR) of the BEEM rater scale. Secondary objectives were to determine the minimum number of EP raters needed for the BEEM rater scale to achieve acceptable reliability and to compare the performance of the scale against a previously published evidence rating system, the McMaster Online Rating of Evidence (MORE), in an EP population.

METHODS: The authors electronically distributed the title, conclusion, and a PubMed link for 23 recently published studies related to EM to a volunteer group of 134 EPs. The volunteers answered two demographic questions and rated the articles using one of two randomly assigned seven-point Likert scales, the BEEM rater scale (n = 68) or the MORE scale (n = 66), over two separate administrations. The IRR of each scale was measured using generalizability theory.

RESULTS: The IRR of the BEEM rater scale ranged from 0.90 (95% confidence interval [CI] = 0.86 to 0.93) to 0.92 (95% CI = 0.89 to 0.94) across administrations. Decision studies showed that a minimum of 12 raters is required for acceptable reliability of the BEEM rater scale. The IRR of the MORE scale was 0.82 to 0.84.

CONCLUSIONS: The BEEM rater scale is a highly reliable, single-question tool that allows a small number of EPs to collectively rate the relative clinical relevance, within the specialty of EM, of recently published studies from a variety of medical journals. It compares favorably with the MORE system because it achieves a high IRR despite requiring raters to read only each article's title and conclusion.
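
The generalizability analysis and decision study named in the methods and results can be illustrated with a minimal sketch. It assumes a fully crossed articles-by-raters design with one rating per cell and uses the relative G coefficient, G(n) = var_article / (var_article + var_residual / n). The abstract does not report the variance components, the exact design, or the threshold used for "acceptable" reliability, so the function names, the example variance values, and the 0.80 target below are all hypothetical and are not taken from the study.

```python
def variance_components(ratings):
    """Estimate article and residual variance from a fully crossed matrix.

    ratings: list of rows, one row per article, one column per rater.
    Uses two-way ANOVA without replication (an assumed design).
    """
    n_a = len(ratings)
    n_r = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_a * n_r)
    article_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n_a for j in range(n_r)]

    ss_article = n_r * sum((m - grand) ** 2 for m in article_means)
    ss_rater = n_a * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_residual = ss_total - ss_article - ss_rater

    ms_article = ss_article / (n_a - 1)
    ms_residual = ss_residual / ((n_a - 1) * (n_r - 1))

    var_residual = ms_residual
    # Negative estimates are truncated at zero, a common convention.
    var_article = max((ms_article - ms_residual) / n_r, 0.0)
    return var_article, var_residual


def g_coefficient(var_article, var_residual, n_raters):
    """Relative G coefficient for the mean rating of n_raters raters."""
    return var_article / (var_article + var_residual / n_raters)


def min_raters(var_article, var_residual, target):
    """Decision study: smallest number of raters whose G reaches target."""
    if var_article <= 0 or not 0 < target < 1:
        raise ValueError("need var_article > 0 and 0 < target < 1")
    n = 1
    while g_coefficient(var_article, var_residual, n) < target:
        n += 1
    return n


# Hypothetical variance components and target, for illustration only.
example_n = min_raters(var_article=1.0, var_residual=2.2, target=0.80)
```

With these illustrative inputs, reliability rises as more raters are averaged, which is the relationship a decision study uses to find the smallest adequate panel size.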