Quality Evaluation Scores are no more Reliable than Gestalt in Evaluating the Quality of Emergency Medicine Blogs: A METRIQ Study
CONSTRUCT: We investigated the quality of emergency medicine (EM) blogs as educational resources.

PURPOSE: Online medical education resources such as blogs are increasingly used by EM trainees and clinicians. However, quality evaluations of these resources using gestalt are unreliable. We investigated the reliability of two previously derived quality evaluation instruments for blogs.

APPROACH: Sixty English-language EM websites that published clinically oriented blog posts between January 1 and February 24, 2016, were identified. A random number generator selected 10 websites, and the 2 most recent clinically oriented blog posts from each site were evaluated using gestalt, the Academic Life in Emergency Medicine (ALiEM) Approved Instructional Resources (AIR) score, and the Medical Education Translational Resources: Impact and Quality (METRIQ-8) score, by a sample of medical students, EM residents, and EM attendings. Each rater evaluated all 20 blog posts with gestalt and 15 of the 20 blog posts with the ALiEM AIR and METRIQ-8 scores. Pearson's correlations were calculated between the average scores for each metric. Single-measure intraclass correlation coefficients (ICCs) evaluated the reliability of each instrument.

RESULTS: Our study included 121 medical students, 88 EM residents, and 100 EM attendings who completed ratings. The average gestalt rating of each blog post correlated strongly with the average scores for ALiEM AIR (r = .94) and METRIQ-8 (r = .91). Single-measure ICCs were fair for gestalt (0.37, IQR 0.25-0.56), ALiEM AIR (0.41, IQR 0.29-0.60) and METRIQ-8 (0.40, IQR 0.28-0.59).

CONCLUSION: The average scores of each blog post correlated strongly with gestalt ratings. However, neither ALiEM AIR nor METRIQ-8 showed higher reliability than gestalt. Improved reliability may be possible through rater training and instrument refinement.
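
The two statistics named in the approach, Pearson's correlation between per-post average scores and a single-measure ICC across raters, can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the study: the abstract does not say which ICC model was used, so the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is assumed; the function names `pearson_r` and `icc_2_1` are hypothetical; the ratings matrix is made-up illustrative data, not study data; and the sketch requires a complete posts-by-raters matrix, whereas in the study each rater scored only 15 of the 20 posts with the two instruments.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_2_1(ratings):
    """Single-measure ICC, two-way random effects, absolute agreement.

    `ratings` is a complete matrix: one row per rated item (blog post),
    one column per rater. The choice of this ICC form is an assumption.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical ratings: 6 items (rows) scored by 4 raters (columns).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

if __name__ == "__main__":
    print("single-measure ICC:", round(icc_2_1(example_ratings), 2))
    # Correlate per-item averages from two hypothetical groups of raters.
    avg_a = [mean(row[:2]) for row in example_ratings]
    avg_b = [mean(row[2:]) for row in example_ratings]
    print("Pearson r of averages:", round(pearson_r(avg_a, avg_b), 2))
```

The sketch also shows why the two results in the abstract can coexist: averaging over many raters smooths out individual disagreement, so per-post averages can correlate strongly even when the single-measure ICC, which describes the reliability of one rater's score, is only fair.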