Annual updates of the European Association of Urology - European Society for Pediatric Urology (EAU-ESPU) paediatric urology guidelines: Are large-language models (LLM) better than the usual structured methodology?


abstract

  • INTRODUCTION: The European Association of Urology - European Society for Pediatric Urology (EAU-ESPU) guidelines comprise a comprehensive publication of evidence-based clinical guidelines for the field of paediatric urology. Their goal is to produce recommendations that optimize patient care and to provide an assessment of benefits, harms and possible alternative treatment options. Artificial intelligence (AI) has evolved immensely and is increasingly used in urology. With the emergence of Chat Generative Pre-trained Transformer (ChatGPT) and CoPilot, both large language models (LLMs), a new dimension in AI was reached and more widespread use of AI became possible. OBJECTIVES: The aim of the current study was to test the ability of LLMs to provide a trustworthy update of two chapters of the EAU-ESPU Pediatric Urology Guideline. STUDY DESIGN: Three LLMs (ChatGPT 3.5, ChatGPT 4.0 and CoPilot) were asked to perform a systematic update of the hydrocele and varicocele chapters. For both chapters, two standard conversations were written: one natural, human-style dialogue and one conversation that included minor prompt engineering, i.e. few-shot prompting. Each conversation was run five times by an independent researcher, and the outcomes were scored by two reviewers for accuracy, consistency and reliability using several predefined criteria. RESULTS: A total of sixty conversations were analyzed. All three LLMs were unable to update the guidelines with the recent relevant literature because they lacked access to the correct scientific databases. Furthermore, high variability was seen in the responses provided by the LLMs, although the input text was similar every time. Compared with the human-style dialogues, the use of basic prompting in the structured conversations improved the consistency of the responses. Nevertheless, the reproducibility, consistency and reliability of the updates provided by the LLMs were assessed to be inadequate, despite the use of basic prompting. DISCUSSION: AI and dedicated plug-ins for LLMs are developing at a very fast pace. A specific follow-up project would be to create, in cooperation with AI experts, dedicated plug-ins and advanced prompt engineering for existing LLMs, so that the guidelines can be updated with access to the relevant databases and with correct instructions to follow the guidelines handbook. CONCLUSION: At the moment, LLMs cannot replace the members of the EAU guideline panels in their work of updating the clinical guidelines. They demonstrated inadequate consistency, reliability and accuracy, and were unable to incorporate new literature.
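The abstract mentions that one of the two conversation types used minor prompt engineering in the form of few-shot prompting. As a purely illustrative aside, a minimal sketch of what a few-shot prompt to a chat-style LLM API could look like is given below; the model name, example texts and the OpenAI client usage are assumptions for illustration only, not the authors' actual study protocol.

```python
# Minimal sketch of few-shot prompting (assumed setup, not the published protocol):
# a system instruction plus worked example exchanges are sent before the real
# request, so the model can mimic the expected answer format.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system",
     "content": "You update clinical guideline chapters based on new evidence."},
    # Few-shot examples: hypothetical input/output pairs showing the desired format.
    {"role": "user",
     "content": "Chapter: hydrocele. New evidence: <example abstract>."},
    {"role": "assistant",
     "content": "Recommendation unchanged; evidence level: 2b; rationale: ..."},
    # The actual query follows the examples.
    {"role": "user",
     "content": "Chapter: varicocele. New evidence: <abstracts to appraise>."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```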

authors

  • 't Hoen, LA
  • van Uitert, A
  • Bussmann, M
  • Bezuidenhout, C
  • Ribal, M
  • Canfield, S
  • Yuan, Y
  • Omar, MI
  • Castagnetti, M
  • Burgu, B
  • O'Kelly, F
  • Quaedackers, J
  • Rawashdeh, Y
  • Silay, S
  • Bujons, A
  • Bogaert, G
  • Pakkasjarvi, N
  • Skott, M
  • Kennedy, U
  • Gnech, M
  • Radmayr, C

publication date

  • June 2, 2025