Journal article

Can Large Language Models Generate High-Quality Short-Answer Assessments? A Comparative Study in Undergraduate Medical Education

Abstract

Background: Generative artificial intelligence (AI) tools such as ChatGPT have the potential to augment the design of examinations and assessments for medical learners, saving time and resources and enabling the production of large volumes of practice problems tailored to learner-specific strengths and weaknesses.

Methods: This study compares the quality of free-text assessment problems and answer keys generated by ChatGPT with those produced by faculty educators for a renal and hematology curriculum subunit. Five expert reviewers evaluated a collection of 21 free-text assessment problems: 9 drawn from historical assessment problems used in an undergraduate medical program and 12 produced with ChatGPT. Reviewers assigned each problem a score from 1 to 5 reflecting its overall quality.

Results: The average quality of ChatGPT-generated problems was greater than that of human-generated problems (4.00 vs. 2.71, p < 0.001). In an ordinal mixed-effects model, human-generated problems had significantly lower odds of receiving higher ratings than ChatGPT-generated problems (β = −2.43, 95% confidence interval −3.34 to −1.51, p < 0.001).

Conclusions: These findings suggest that ChatGPT can assist expert faculty educators in producing assessment tools, with direct benefits to medical learners, although in its current state it cannot entirely replace them.

Authors

Morjaria L; Burns L; Gandhi B; Bracken K; Farooq MS; Levinson AJ; Ngo Q; Sibbald M

Journal

Applied Sciences, Vol. 16, No. 5

Publisher

MDPI

Publication Date

March 1, 2026

DOI

10.3390/app16052535

ISSN

2076-3417
