Journal article

Evaluating the Efficacy of Large Language Models for Dizzy History Taking and Peripheral Vestibular Disorder Diagnosis

Abstract

Importance: Vertigo is one of the most frequent presenting symptoms in primary care. However, the complexity of its differential diagnosis and the reliance on clinical history contribute to frequent specialist referrals and diagnostic delays. Large language models (LLMs), such as LLaMA-3.1-8B, offer new potential for assisting in clinical decision-making.

Objective: To assess the utility of a small-scale, open-source LLM in diagnosing peripheral vestibular disorders (PVDs), and to evaluate the impact of synthetic data augmentation on diagnostic accuracy.

Design: Retrospective chart review.

Setting/Participants: The retrospective analysis included adult patients presenting with dizziness to a neuro-otologist at St. Joseph's Healthcare Hamilton between 2018 and 2023. The dataset comprised 100 clinical cases, supplemented with 40 synthetic cases generated using GPT-4. The LLaMA-3.1-8B model was evaluated on the clinical, synthetic, and combined datasets. Diagnostic reasoning approaches, including chain-of-thought reasoning and multi-shot prompting, were employed to optimize model performance.

Main Outcome Measures: Evaluation metrics included top-1 and top-3 diagnostic accuracy, Cohen's kappa for inter-rater agreement, and accuracy in predicting symptom laterality.

Results: The LLaMA-3.1-8B model achieved a top-1 diagnostic accuracy of 60.7% and a top-3 accuracy of 71.4% in the combined dataset. The most frequent diagnosis was Meniere's disease (55.7%), followed by vestibular migraines (9.3%) and labyrinthitis (9.3%). Diagnostic accuracy was highest for benign paroxysmal positional vertigo (90%), followed by Meniere's disease (80.8%). Less common conditions, such as superior canal dehiscence syndrome and vestibular paroxysmia, exhibited lower diagnostic accuracies. Cohen's kappa indicated substantial agreement for symptom side prediction (κ = 0.96) and moderate agreement for diagnosis (κ = 0.41) in the combined dataset.

Conclusions and Relevance: The LLaMA-3.1-8B model demonstrated promising accuracy in diagnosing PVDs. The model's performance highlights its potential to serve as a high-yield screening tool for primary care physicians and general otolaryngologists.
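For readers unfamiliar with the outcome measures named in the abstract, the following is a minimal sketch of how top-k diagnostic accuracy and Cohen's kappa are commonly computed. It is not the authors' evaluation code: the function names and the toy cases are hypothetical and chosen only for illustration, and the numbers it produces have no relation to the study's results.

```python
from collections import Counter


def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of cases whose reference label is among the first k ranked predictions."""
    if not references or len(ranked_predictions) != len(references):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(ref in preds[:k] for preds, ref in zip(ranked_predictions, references))
    return hits / len(references)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap implied by each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data (NOT from the study): clinician diagnoses and model rankings.
reference_dx = ["BPPV", "Meniere's disease", "vestibular migraine", "BPPV"]
model_ranked_dx = [
    ["BPPV", "vestibular migraine", "labyrinthitis"],
    ["vestibular migraine", "Meniere's disease", "BPPV"],
    ["labyrinthitis", "BPPV", "Meniere's disease"],
    ["BPPV", "labyrinthitis", "vestibular migraine"],
]
# Hypothetical symptom-side labels from the clinician and from the model.
clinician_side = ["left", "right", "left", "bilateral"]
model_side = ["left", "right", "right", "bilateral"]

top1 = top_k_accuracy(model_ranked_dx, reference_dx, k=1)
top3 = top_k_accuracy(model_ranked_dx, reference_dx, k=3)
side_kappa = cohen_kappa(clinician_side, model_side)
print(f"top-1 = {top1:.2f}, top-3 = {top3:.2f}, kappa (side) = {side_kappa:.2f}")
```

Top-3 accuracy can never be lower than top-1 accuracy on the same cases, because any case counted as a top-1 hit is also a top-3 hit. Kappa can fall well below raw agreement when a few labels dominate, since chance agreement is then high.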

Authors

Lu B; Misariu A-M; Ham JI; Archibald J; van der Woerd B

Journal

Journal of Otolaryngology, Vol. 54

Publisher

SAGE Publications

Publication Date

January 1, 2025

DOI

10.1177/19160216251377349

ISSN

1916-0208
