Using a large language model as a third reviewer to augment dual human full‐text screening in orthopaedic systematic reviews

Abstract

Purpose: Large language models (LLMs), a form of artificial intelligence (AI), have emerged as potential tools to augment systematic review workflows. This study aimed to evaluate GPT‐5 as a third reviewer for full‐text screening across orthopaedic subspecialties.

Methods: Three review topics were selected. Python scripts were developed to call the GPT‐5 model via the OpenAI application programming interface (API) and perform full‐text screening using predefined inclusion and exclusion criteria. Two human reviewers simultaneously performed screening based on the same criteria. Performance metrics for GPT‐5 (sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV) and F1 score) were calculated against a gold‐standard inclusion and exclusion list developed by a third human adjudicator. Efficiency metrics included total cost and completion time.

Results: The numbers of full texts screened were 35, 70 and 146 across the three review topics. For topic one, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 100% each. For topic two, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 93.3%, 98.2%, 93.3%, 98.2%, 97.1% and 93.3%, respectively. For topic three, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 93.3%, 100%, 100%, 99.2%, 99.3% and 96.7%, respectively. Time to completion ranged from 18.1 to 58 min. Cost ranged from $0.84 to $3.29 USD.

Conclusion: GPT‐5 demonstrated high diagnostic accuracy as a third reviewer for full‐text screening across three different subspecialties, with high agreement with final consensus adjudication decisions. These findings suggest that modern LLMs can potentially augment dual‐review screening workflows by providing efficient decision support while preserving methodological rigour. However, the small number of included studies within each topic resulted in wide confidence intervals, and additional validation across larger datasets is necessary.

Level of Evidence: Not applicable.
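As an illustration of how the reported performance metrics relate to one another, the following minimal Python sketch computes them from a 2×2 confusion matrix of LLM decisions against the gold‐standard list. The function names and the counts are assumptions made for this example. The abstract does not report the underlying confusion matrices, so the counts below are simply one set that is consistent with the topic two percentages. The Wilson score interval is likewise an assumed choice, since the abstract does not state which confidence interval method was used.

```python
from math import sqrt


def screening_metrics(tp, fp, tn, fn):
    """Return screening performance metrics from confusion-matrix counts.

    tp/fn: gold-standard included studies the model included/excluded.
    tn/fp: gold-standard excluded studies the model excluded/included.
    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts (not taken from the paper): 70 full texts,
# 15 of which are included in the gold-standard list.
counts = {"tp": 14, "fp": 1, "tn": 54, "fn": 1}
metrics = screening_metrics(**counts)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")

# Interval for sensitivity, which depends only on the included studies.
low, high = wilson_interval(counts["tp"], counts["tp"] + counts["fn"])
print(f"sensitivity 95% CI: {low:.1%} to {high:.1%}")
```

With only 15 included studies in this hypothetical example, a single missed study yields a sensitivity interval spanning roughly 70% to 99%, which illustrates why small numbers of included studies produce wide confidence intervals.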