Journal article

From Radiology Findings to Artificial Intelligence (AI) Powered Impressions: A Retrospective Study on the Comparative Performance of Recent Large Language Models

Abstract

Background: Large language models (LLMs), a breakthrough in artificial intelligence, can be leveraged to automatically generate the impression section of radiology reports, a task that otherwise requires time, effort, and training. Our objective was to evaluate the performance of five recent LLMs (GPT-4, GPT-4o mini, Gemini 1.5 – Pro, Gemini 1.5 – Flash, and Llama 3.1) for impression generation.

Methods: In this retrospective study, 100 radiology reports were sampled (20 from each of the groups 0–400, 400–800, 800–1,200, 1,200–2,000, and 2,000–8,000, based on the character count of the findings section) from the publicly available “BioNLP 2023 report summarization” dataset (collected between 2001 and 2016; the training subset of 59,320 reports was considered for sampling), sourced from PhysioNet. Each of the five LLMs was then zero-shot prompted to generate impressions from the findings in the sample. Generated impressions were evaluated (a) subjectively for coherence, comprehensiveness, conciseness, and medical harmfulness by two radiology fellows and a large reasoning model (LRM), Gemini 2.5 – Pro, and (b) objectively using a composite accuracy metric (ROUGE-1, BLEU, and cosine similarity) against the original human expert-generated impressions. The LLMs were ranked by percentage agreement on the subjective scores and by the composite scores. Statistical tests (Friedman test and post-hoc Nemenyi test) were used to assess inter-model differences.

Results: The top-ranked models were Gemini 1.5 – Pro, GPT-4, and Gemini 1.5 – Flash. Performance varied across models for both human and LRM raters (Friedman test: human P < 1.82 × 10⁻⁶; LRM P < 9.10 × 10⁻⁴⁰). Composite accuracy scores were significantly higher for the top three models (0.69, 0.68, 0.68) than for the others (0.65; Nemenyi P < 1.11 × 10⁻¹⁶). The LRM aligned closely with the human raters (2.15% complete disagreement) and identified all human-rated inaccurate impressions.

Conclusions: Gemini 1.5 – Pro outperformed GPT-4 in coherence, comprehensiveness, and medical harmfulness, at lower cost. Human and LRM evaluations were generally consistent, though the LRM was more conservative.
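The composite accuracy metric described in the abstract combines ROUGE-1, BLEU, and cosine similarity. As an illustration only, the sketch below implements simple surface-level (bag-of-words) versions of the three metrics and averages them with equal weights. The paper's actual tokenization, cosine-similarity representation (e.g., TF-IDF or embeddings), and weighting scheme are not specified here, so the `composite_score` helper and its equal weighting are assumptions.

```python
from collections import Counter
import math

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram-overlap precision/recall between reference and candidate."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform n-gram weights and a brevity penalty."""
    ref_tok, cand_tok = reference.lower().split(), candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref_tok[i:i + n]) for i in range(len(ref_tok) - n + 1))
        cand_ngrams = Counter(tuple(cand_tok[i:i + n]) for i in range(len(cand_tok) - n + 1))
        total = sum(cand_ngrams.values())
        match = sum((ref_ngrams & cand_ngrams).values())
        if total == 0 or match == 0:
            return 0.0  # no smoothing: any empty n-gram order zeroes the score
        log_precisions.append(math.log(match / total))
    brevity = min(1.0, math.exp(1 - len(ref_tok) / len(cand_tok)))
    return brevity * math.exp(sum(log_precisions) / max_n)

def cosine_sim(reference: str, candidate: str) -> float:
    """Cosine similarity of bag-of-words term-frequency vectors."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    dot = sum(ref[w] * cand[w] for w in ref)
    norm = math.sqrt(sum(v * v for v in ref.values())) * \
           math.sqrt(sum(v * v for v in cand.values()))
    return dot / norm if norm else 0.0

def composite_score(reference: str, candidate: str) -> float:
    """Unweighted mean of the three metrics (the study's weighting is assumed here)."""
    return (rouge1_f1(reference, candidate)
            + bleu(reference, candidate)
            + cosine_sim(reference, candidate)) / 3
```

Each metric lies in [0, 1], so the unweighted mean does as well; an impression identical to the reference scores 1.0, and scores fall as lexical overlap decreases.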

Authors

Tasneem N; van der Pol CB; Zahoor A; Juggath N; McGowan K; Lokker C; Saha A

Journal

Intelligent Medicine

Publisher

Elsevier

Publication Date

December 1, 2025

DOI

10.1016/j.imed.2025.11.003

ISSN

2096-9376
