Six of one, half a dozen of the other: A measure of multidisciplinary inter/intra-rater reliability of the society for fetal urology and urinary tract dilation grading systems for hydronephrosis
Additional Document Info
INTRODUCTION: The urinary tract dilation (UTD) classification system was introduced to standardize terminology in the reporting of hydronephrosis (HN), and bridge a gap between pre- and postnatal classification such as the Society for Fetal Urology (SFU) grading system. Herein we compare the intra/inter-rater reliability of both grading systems. MATERIALS AND METHODS: SFU (I-IV) and UTD (I-III) grades were independently assigned by 13 raters (9 pediatric urology staff, 2 nephrologists, 2 radiologists), twice, 3 weeks apart, to 50 sagittal postnatal ultrasonographic views of hydronephrotic kidneys. Data regarding ureteral measurements and bladder abnormalities were included to allow proper UTD categorization. Ten images were repeated to assess intra-rater reliability. Krippendorff's alpha coefficient was used to measure overall and by grade intra/inter-rater reliability. Reliability between specialties and training levels were also analyzed. RESULTS: Overall inter-rater reliability was slightly higher for SFU (α = 0.842, 95% CI 0.812-0.879, in session 1; and α = 0.808, 95% CI 0.775-0.839, in session 2) than for UTD (α = 0.774, 95% CI 0.715-0.827, in session 1; and α = 0.679, 95% CI 0.605-0.750, in session 2). Reliability for intermediate grades (SFU II/III and UTD 2) of HN was poor regardless of the system. Reliabilities for SFU and UTD classifications among Urology, Nephrology, and Radiology, as well as between training levels were not significantly different. DISCUSSION: Despite the introduction of HN grading systems to standardize the interpretation and reporting of renal ultrasound in infants with HN, none have been proven superior in allowing clinicians to distinguish between "moderate" grades. While this study demonstrated high reliability in distinguishing between "mild" (SFU I/II and UTD 1) and "severe" (SFU IV and UTD 3) grades of HN, the overall reliability between specialties was poor. This is in keeping with a previous report of modest inter-rater reliability of the SFU system. This drawback is likely explained by the subjective interpretation required to assign grades, which can be impacted by experience, image quality, and scanning technique. As shown in the figure, which demonstrates SFU II (a) and SFU III (b), as assigned by a radiologist, it is possible to make an argument that either of these images can be classified into both categories that were observed during the grading sessions of this study. CONCLUSION: Although both systems have acceptable reliability, the SFU grading system showed higher overall intra/inter-rater reliability regardless of rater specialty than the UTD classification. Inter-rater reliability for SFU grades II/III and UTD 2 was low, highlighting the limitations of both classifications in regards to properly segregating moderate HN grades.