Evaluating the performance of five large language models in answering Delphi consensus questions relating to patellar instability and medial patellofemoral ligament reconstruction

Abstract

Purpose: Artificial intelligence (AI) has become increasingly popular over the past several years, with large language models (LLMs) offering the possibility of revolutionizing the way healthcare information is shared with patients. However, to prevent the spread of misinformation, analyzing the accuracy of answers from these LLMs is essential. This study aimed to assess the accuracy of five freely accessible chatbots by specifically evaluating their responses to questions about patellofemoral instability (PFI). The secondary objective was to compare the chatbots and determine which LLM offers the most accurate set of responses.

Methods: Ten questions were selected from a previously published international Delphi consensus study pertaining to patellar instability and posed to ChatGPT4o, Perplexity AI, Bing CoPilot, Claude2, and Google Gemini. Responses were assessed for accuracy using the validated Mika score by eight orthopedic surgeons who had completed fellowship training in sports medicine. Median responses among the eight reviewers for each question were compared using the Kruskal-Wallis and Dunn’s post-hoc tests. Percentages of each Mika score distribution were compared using Pearson’s chi-square test. P-values less than or equal to 0.05 were considered significant. Gwet’s AC2 coefficient, corrected for chance and employing quadratic weights, was calculated to assess inter-rater agreement.

Results: ChatGPT4o and Claude2 had the highest percentage of reviews (38/80, 47.5%) considered to be an “excellent response not requiring clarification”, or a Mika score of 1. Google Gemini had the highest percentage of reviews (17/80, 21.3%) considered to be “unsatisfactory requiring substantial clarification”, or a Mika score of 4 (p < 0.001). The median (interquartile range, IQR) Mika scores were 2 (1) for ChatGPT4o and Perplexity AI, 2 (2) for Bing CoPilot and Claude2, and 3 (2) for Google Gemini. Median responses were not significantly different between ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2; however, all four statistically outperformed Google Gemini (p < 0.05). Inter-rater agreement was classified as moderate (0.40 < AC2 ≤ 0.60) for ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2, while there was no agreement for Google Gemini (AC2 < 0).

Conclusion: Current free-access LLMs (ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2) predominantly provide satisfactory responses requiring minimal clarification to standardized questions relating to patellar instability. Google Gemini statistically underperformed in accuracy relative to the other four LLMs, with most answers requiring moderate clarification. Furthermore, inter-rater agreement was moderate for all LLMs apart from Google Gemini, which had no agreement. These findings support the utility of existing LLMs as an adjunct to physicians and surgeons in providing patients with information pertaining to patellar instability.

Level of evidence: V
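As a reading aid, the reported proportions and agreement categories can be checked with a short Python sketch. It uses only counts stated in the abstract (38/80 and 17/80). The function names `percent` and `agreement_band` are hypothetical, and the moderate band is assumed to mean 0.40 < AC2 ≤ 0.60; bands the abstract does not state are deliberately left unlabeled. This is not the authors' analysis code.

```python
from decimal import Decimal, ROUND_HALF_UP


def percent(count: int, total: int) -> Decimal:
    """Return count/total as a percentage, rounded half-up to one decimal."""
    value = Decimal(count) / Decimal(total) * 100
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def agreement_band(ac2: float) -> str:
    """Map a Gwet's AC2 value to the two bands named in the abstract.

    Assumption: "moderate" means 0.40 < AC2 <= 0.60. Other bands are not
    given in the abstract, so they are reported as unstated.
    """
    if ac2 < 0:
        return "no agreement"
    if 0.40 < ac2 <= 0.60:
        return "moderate"
    return "band not stated in abstract"


# Counts taken directly from the Results paragraph (80 reviews per chatbot:
# 10 questions x 8 reviewers).
print(percent(38, 80))  # Mika score 1 share for ChatGPT4o and Claude2
print(percent(17, 80))  # Mika score 4 share for Google Gemini
```

Half-up rounding is chosen explicitly because 17/80 is exactly 21.25%, and Python's built-in `round` would round that tie to the even digit rather than to the 21.3% reported.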

Authors

Vivekanantha P; Cohen D; Slawaska-Eng D; Nagai K; Tarchala M; Matache B; Hiemstra L; Longstaffe R; Lesniak B; Meena A

Journal

BMC Musculoskeletal Disorders, Vol. 26, No. 1

Publisher

Springer Nature

Publication Date

December 1, 2025

DOI

10.1186/s12891-025-09227-1

ISSN

1471-2474

Associated Experts

Darren de SA

Associate Professor, Faculty of Health Sciences
