
ChatGPT models provide higher‐quality but lower‐readability responses than Google Gemini regarding anterior shoulder instability, with no added benefit of the orthopaedic expert plugin

Abstract

PURPOSE: To analyze and compare the quality and readability of information regarding anterior shoulder instability and shoulder stabilization surgery from three large language models (LLMs): ChatGPT 4o, ChatGPT Orthopaedic Expert (OE) and Google Gemini.

METHODS: ChatGPT 4o, ChatGPT OE and Google Gemini were used to answer 21 questions commonly asked by patients about anterior shoulder instability. The responses were independently rated by three fellowship-trained orthopaedic surgeons using the validated Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. Assessors were blinded to the model, and evaluations were performed twice, 3 weeks apart. Readability was measured using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). The study adhered to the TRIPOD-LLM reporting guideline. Statistical analysis included the Friedman test, Wilcoxon signed-rank tests and intraclass correlation coefficients.

RESULTS: Inter-rater reliability among the three surgeons was good or excellent for all LLMs. ChatGPT OE and ChatGPT 4o demonstrated comparable overall performance, each achieving a median QAMAI score of 22, with interquartile ranges (IQRs) of 5.25 and 6.75, respectively. Median (IQR) domain scores were accuracy 4 (1) and 4 (1), clarity 4 (1) and 4 (1), relevance 4 (1) and 4 (1), completeness 4 (1) and 4 (1), provision of sources 1 (0) for both, and usefulness 4 (1) and 4 (1), respectively. Google Gemini scored lower across these domains (accuracy 3 [1], clarity 3 [1], relevance 3 [1.25], completeness 3 [0.25], sources 3 [3] and usefulness 3 [1.25]), with a median QAMAI score of 19 (5.25) (p < 0.01 vs. each ChatGPT model). Readability was higher for Google Gemini (FRES = 36.96, FKGL = 11.92) than for ChatGPT OE (FRES = 21.90, FKGL = 14.94) and ChatGPT 4o (FRES = 24.24, FKGL = 15.11), indicating easier-to-read content (p < 0.01). There was no significant difference between ChatGPT 4o and ChatGPT OE in overall quality or readability.

CONCLUSIONS: ChatGPT 4o and ChatGPT OE provided significantly higher-quality responses than Google Gemini, although all models produced good-quality responses overall. However, responses generated by ChatGPT 4o and ChatGPT OE were more difficult to read than those generated by Google Gemini.

LEVEL OF EVIDENCE: Level V, expert opinion.
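
For context, the two readability scores reported above are fixed linear functions of average sentence length and average syllables per word, and the group comparison pairs a Friedman test across the three related samples with pairwise Wilcoxon signed-rank tests. The Python sketch below illustrates both; it is a minimal illustration, not the authors' analysis pipeline. The vowel-group syllable counter is a crude stand-in for the dedicated readability software typically used in such studies, and the QAMAI score lists are hypothetical.

    import re
    from scipy.stats import friedmanchisquare, wilcoxon

    def count_syllables(word):
        # Rough heuristic: count runs of consecutive vowels.
        # Real studies use validated readability calculators instead.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        wps = len(words) / sentences            # average words per sentence
        spw = syllables / max(1, len(words))    # average syllables per word
        fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
        fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
        return fres, fkgl

    # Hypothetical QAMAI totals for the same questions under each model
    # (illustrative values only, not data from the study).
    gpt4o  = [22, 21, 24, 20, 23, 22, 25]
    gpt_oe = [22, 23, 21, 22, 24, 21, 23]
    gemini = [19, 18, 20, 17, 21, 19, 18]

    # Friedman test across the three related samples, then pairwise
    # Wilcoxon signed-rank tests, mirroring the abstract's analysis.
    print(friedmanchisquare(gpt4o, gpt_oe, gemini))
    print(wilcoxon(gpt4o, gemini))
    print(wilcoxon(gpt_oe, gemini))
    print(wilcoxon(gpt4o, gpt_oe))

On the reported numbers, an FKGL near 15 corresponds to college-level text, whereas Google Gemini's 11.92 sits closer to a high-school senior reading level, which is the gap the FRES and FKGL differences reflect.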

Authors

Skaik K; Omoseni S; Dagher D; Shah D; Fermín TM; Agostinone P; Hantouly A; Khan M

Journal

Knee Surgery, Sports Traumatology, Arthroscopy, Vol. 34, No. 2, pp. 763–775

Publisher

Wiley

Publication Date

February 1, 2026

DOI

10.1002/ksa.70255

ISSN

0942-2056
