A Blinded Comparison of Three Generative Artificial Intelligence Chatbots for Orthopaedic Surgery Therapeutic Questions
Abstract
Objective: To compare the quality of responses from three chatbots (ChatGPT, Bing Chat, and AskOE) across a range of orthopaedic surgery therapeutic treatment questions.

Design: We identified a series of treatment-related questions spanning a range of orthopaedic surgery subspecialties. Each question was entered verbatim into all three chatbots (ChatGPT, Bing Chat, and AskOE), and the responses were evaluated with a standardized rubric.

Participants: Orthopaedic surgery experts affiliated with McMaster University and the University of Toronto reviewed all responses in a blinded fashion.

Outcomes: The primary outcomes were scores on a five-item assessment tool covering clinical correctness, clinical completeness, safety, usefulness, and references. The secondary outcome was the reviewers' preferred response for each question. We performed a mixed effects logistic regression to identify factors associated with selecting a preferred chatbot.

Results: Across all questions and answers, reviewers preferred AskOE significantly more often than both ChatGPT (P<0.001) and Bing Chat (P<0.001). AskOE also received significantly higher total evaluation scores than both ChatGPT (P<0.001) and Bing Chat (P<0.001). Further regression analysis showed that clinical correctness, clinical completeness, usefulness, and references were significantly associated with a preference for AskOE. Across all responses, four were judged to contain major errors: three from ChatGPT and one from AskOE.

Conclusions: Reviewers significantly preferred AskOE over ChatGPT and Bing Chat across a range of orthopaedic therapeutic questions. This technology has important implications for healthcare settings because it provides access to trustworthy answers in orthopaedic surgery.
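The central analysis is a mixed effects logistic regression relating the five rubric items to whether a response was the reviewer's preferred one. The sketch below is a minimal illustration of such a model, assuming hypothetical column names (preferred, correctness, completeness, safety, usefulness, references, reviewer, question) and Python's statsmodels; the abstract does not state which software, model specification, or data layout the authors actually used.

```python
# Minimal sketch of a mixed effects logistic regression, assuming
# hypothetical column names and synthetic data; not the authors' code.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
n = 300  # hypothetical number of reviewer-response observations

# Hypothetical rubric scores (1-5) and a binary "preferred" outcome.
df = pd.DataFrame({
    "preferred": rng.integers(0, 2, n),
    "correctness": rng.integers(1, 6, n),
    "completeness": rng.integers(1, 6, n),
    "safety": rng.integers(1, 6, n),
    "usefulness": rng.integers(1, 6, n),
    "references": rng.integers(1, 6, n),
    "reviewer": rng.integers(0, 10, n),   # grouping factor: reviewer
    "question": rng.integers(0, 30, n),   # grouping factor: question
})

# Fixed effects: the five rubric items. Random intercepts for reviewer
# and question absorb the repeated-measures correlation that arises when
# each reviewer scores multiple responses to the same questions.
model = BinomialBayesMixedGLM.from_formula(
    "preferred ~ correctness + completeness + safety + usefulness + references",
    {"reviewer": "0 + C(reviewer)", "question": "0 + C(question)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes fit of the mixed GLM
print(result.summary())
```

With real data, positive and credibly nonzero coefficients on rubric items would indicate factors associated with a response being preferred, which is the form of evidence the abstract reports for correctness, completeness, usefulness, and references.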