Chapter

Mind the Evaluation Gap: Large Language Models for Structured Data Extraction from Radiology Reports

Abstract

Extracting structured labels from free-text radiology reports is essential for building large-scale datasets used to train and evaluate models in medical AI. However, this process is costly, typically requiring expert annotators. Prior efforts often rely on noisy rule-based NLP or use LLMs without directly evaluating their structured data extraction capabilities. In this work, we address this evaluation gap by extensively benchmarking open-weight general and medical LLMs for extracting structured labels from chest X-ray (CXR) reports using a radiologist-verified dataset. We consider two tasks: (1) Disease-Only label extraction, and (2) Location+Disease, where each disease is paired with its anatomical region. We compare fine-tuning and in-context learning (ICL) across models ranging from 0.5B to 72B parameters. Our results show that a fine-tuned 7B model matches the performance of a 72B model in ICL mode. We advocate for rigorous and task-specific evaluation of LLMs in medical AI and highlight open-weight models as privacy-preserving, cost-effective, and clinically deployable solutions.
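To illustrate the task described above, the sketch below shows what Location+Disease extraction via in-context learning might look like in Python. It is not taken from the chapter: the label set, JSON output schema, prompt wording, and the placeholder call_llm function are all assumptions made only for illustration.

```python
import json

# Hypothetical label set for illustration; the chapter's actual label set is not reproduced here.
DISEASES = ["atelectasis", "cardiomegaly", "consolidation", "edema", "pleural effusion"]

PROMPT_TEMPLATE = """You are a radiology assistant. Extract findings from the report below.
Return a JSON list of objects, each with "disease" and "location" keys.
Use only diseases from this list: {diseases}.

Report:
{report}

JSON:"""


def build_prompt(report: str) -> str:
    """Format an in-context-learning style prompt for Location+Disease extraction."""
    return PROMPT_TEMPLATE.format(diseases=", ".join(DISEASES), report=report)


def parse_labels(model_output: str) -> list:
    """Parse the model's JSON output, tolerating extra text around the JSON array."""
    start = model_output.find("[")
    end = model_output.rfind("]") + 1
    if start == -1 or end == 0:
        return []
    try:
        return json.loads(model_output[start:end])
    except json.JSONDecodeError:
        return []


if __name__ == "__main__":
    report = "There is a small right-sided pleural effusion. The heart is enlarged."
    prompt = build_prompt(report)
    # call_llm is a placeholder for whatever open-weight model is being benchmarked;
    # here we substitute a hand-written response purely to exercise the parser.
    fake_response = (
        '[{"disease": "pleural effusion", "location": "right"},'
        ' {"disease": "cardiomegaly", "location": "cardiac silhouette"}]'
    )
    print(parse_labels(fake_response))
```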

Authors

Sabour A; Chu K; Dehaghani ME; Moradi M

Book title

Emerging LLM/LMM Applications in Medical Imaging

Series

Lecture Notes in Computer Science

Volume

16146

Pagination

pp. 19-27

Publisher

Springer Nature

Publication Date

January 1, 2026

DOI

10.1007/978-3-032-07502-4_3