Journal article

Can Large Language Models Generate High-Quality Short-Answer Assessments? A Comparative Study in Undergraduate Medical Education

Abstract

Background: Generative artificial intelligence (AI) tools such as ChatGPT have the potential to augment the design of examinations and assessments for medical learners, saving time and resources and enabling the production of large volumes of practice problems tailored to learner-specific strengths and weaknesses.

Methods: This study compares the quality of free-text assessment problems and answer keys generated by ChatGPT with those produced by faculty educators for a renal and hematology curriculum subunit. Five expert reviewers evaluated a collection of 21 free-text assessment problems: 9 drawn from historical assessment problems used in an undergraduate medical program and 12 produced with ChatGPT. Reviewers assigned each problem a score from 1 to 5 reflecting its overall quality.

Results: The average quality of ChatGPT-generated problems was greater than that of human-generated problems (4.00 vs. 2.71, p < 0.001). In an ordinal mixed-effects model, human-generated problems had significantly lower odds of receiving higher ratings than ChatGPT-generated problems (β = −2.43, 95% confidence interval −3.34 to −1.51, p < 0.001).

Conclusions: These findings suggest that ChatGPT can assist expert faculty educators in producing assessment tools, with direct benefits to medical learners, although in its current state it cannot entirely replace them.

Authors

Morjaria L; Burns L; Gandhi B; Bracken K; Farooq MS; Levinson AJ; Ngo Q; Sibbald M

Journal

Applied Sciences, Vol. 16, No. 5

Publisher

MDPI

Publication Date

March 1, 2026

DOI

10.3390/app16052535

ISSN

2076-3417
