PURPOSE: To analyze and compare the quality and readability of information on anterior shoulder instability and shoulder stabilization surgery provided by three large language models (LLMs): ChatGPT 4o, ChatGPT Orthopaedic Expert (OE) and Google Gemini.
METHODS: ChatGPT 4o, ChatGPT OE and Google Gemini were used to answer 21 questions commonly asked by patients about anterior shoulder instability. The responses were independently rated by three fellowship-trained orthopaedic surgeons using the validated Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. Assessors were blinded to the source model, and evaluations were performed twice, 3 weeks apart. Readability was measured using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). This study adhered to the TRIPOD-LLM reporting guideline. Statistical analysis included the Friedman test, Wilcoxon signed-rank tests and intraclass correlation coefficients.
RESULTS: Inter-rater reliability among the three surgeons was good to excellent for all LLMs. ChatGPT OE and ChatGPT 4o demonstrated comparable overall performance, each achieving a median QAMAI score of 22, with interquartile ranges (IQRs) of 5.25 and 6.75, respectively. For both models, median (IQR) domain scores were 4 (1) for accuracy, clarity, relevance, completeness and usefulness, and 1 (0) for provision of sources. Google Gemini showed lower scores in all domains except provision of sources (accuracy 3 [1], clarity 3 [1], relevance 3 [1.25], completeness 3 [0.25], sources 3 [3] and usefulness 3 [1.25]), with a median QAMAI score of 19 (5.25) (p < 0.01 vs. each ChatGPT model). Readability was higher for Google Gemini (FRES = 36.96, FKGL = 11.92) than for ChatGPT OE (FRES = 21.90, FKGL = 14.94) and ChatGPT 4o (FRES = 24.24, FKGL = 15.11), indicating easier-to-read content (p < 0.01). There was no significant difference between ChatGPT 4o and ChatGPT OE in overall quality or readability.
CONCLUSIONS: ChatGPT 4o and ChatGPT OE provided significantly higher-quality responses than Google Gemini, although all three models produced responses of good overall quality. However, responses generated by ChatGPT 4o and ChatGPT OE were more difficult to read than those generated by Google Gemini.
LEVEL OF EVIDENCE: Level V, expert opinion.