Journal article

Augmenting small tabular health data for training prognostic ensemble machine learning models using generative models

Abstract

Background: Small datasets are common in health research. However, the generalization performance of machine learning models is suboptimal when the training datasets are small. Data augmentation is one solution and is often used for imaging and time series data, but its potential benefits for tabular health data have not been evaluated. Augmentation increases sample size and is seen as a form of regularization that increases the diversity of small datasets, leading models trained on them to perform better on unseen data.

Objectives: Evaluate data augmentation using generative models on tabular health data and assess the impact of diversity versus increasing the sample size.

Methods: Using 13 large health datasets, we performed a simulation to evaluate the impact of data augmentation on the prediction performance of gradient boosted decision tree models for binary classification, as measured by the ROC-AUC (the area under the receiver operating characteristic curve). Four different synthetic data generation models were evaluated. We also built a generalized linear mixed effect model to assess the variable importance for model performance improvements from augmentation. We illustrate the proposed method on seven small real datasets as an application. Augmentation was also compared with resampling, which is a proxy for a larger dataset with minimal impact on diversity.

Results: Augmentation improves prognostic performance for datasets that have higher cardinality categorical variables and lower baseline ROC-AUC. No specific generative model consistently outperformed the others. For the seven small application datasets, augmenting the existing data results in a relative increase in ROC-AUC between 4.31% (ROC-AUC from 0.71 to 0.75) and 43.23% (ROC-AUC from 0.51 to 0.73), with an average 15.55% relative improvement, demonstrating the nontrivial impact of augmentation on small datasets (p = 0.0078). ROC-AUC with augmentation was higher than ROC-AUC with resampling only (p = 0.016). The diversity of augmented datasets was higher than the diversity of resampled datasets (p = 0.046).

Conclusions: This study demonstrates that data augmentation using generative models can have a marked benefit in terms of improved predictive performance for machine learning models on tabular health data, but only for datasets that meet baseline data complexity and predictive performance criteria. Our mixed effect model identified the most influential characteristics of the dataset and can help end-users have a more realistic expectation of the augmentation performance for a new dataset. Furthermore, augmentation performed better on smaller datasets, which is consistent with the argument that greater data diversity due to augmentation is beneficial.

Clinical trial registration: Not applicable.
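
The percentage gains quoted in the results are relative to the baseline ROC-AUC, not absolute differences. The following is a minimal sketch of that arithmetic, assuming the usual definition of relative improvement; the function name `relative_improvement` and the list `reported_pairs` are illustrative and do not come from the article.

```python
def relative_improvement(baseline_auc: float, augmented_auc: float) -> float:
    """Relative change in ROC-AUC, as a percentage of the baseline value."""
    if baseline_auc <= 0:
        raise ValueError("baseline_auc must be positive")
    return 100.0 * (augmented_auc - baseline_auc) / baseline_auc


# Rounded ROC-AUC pairs quoted in the abstract: (baseline, augmented).
# Assumption: these two-decimal values are only approximations of the
# unrounded values the authors worked with.
reported_pairs = [(0.71, 0.75), (0.51, 0.73)]

for baseline, augmented in reported_pairs:
    pct = relative_improvement(baseline, augmented)
    print(f"{baseline:.2f} -> {augmented:.2f}: {pct:.2f}% relative improvement")
```

Because the abstract quotes ROC-AUC values rounded to two decimals, this calculation gives roughly 5.6% and 43.1% rather than the reported 4.31% and 43.23%, which were presumably computed from unrounded values.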

Authors

Liu D; Kababji SE; Mitsakakis N; Pilgram L; Walters TD; Clemons M; Pond GR; El-Hussuna A; Emam KE

Journal

BMC Medical Informatics and Decision Making, Vol. 25, No. 1

Publisher

Springer Nature

Publication Date

December 1, 2025

DOI

10.1186/s12911-025-03266-3

ISSN

1472-6947
