Journal article

Application of machine learning methods in the imputation of heterogeneous co-missing data

Abstract

Objective: Ordinary imputation methods may not be able to handle heterogeneous co-missing data, such as the lung function measures from the spirometry test in population-based studies. This work aims to review and evaluate various statistical and machine learning imputation methods for estimating the prevalence of impaired lung function, such as chronic obstructive pulmonary disease, using data from public surveys on aging studies.

Materials and methods: We examined 70 articles and identified different statistical and machine learning methods used in missing data imputation. We selected and applied samples from a pseudo-population dataset and compared their accuracy in estimating the sample lung disease prevalence.

Results: Unsupervised learning (clustering) methods improve multiple imputations. The k-prototype method outperforms DBSCAN as it can handle categorical data more effectively. Direct imputations based on the predicted values of random forests and artificial neural networks are unsatisfactory.

Conclusion: When combined with multiple imputations, the k-prototype clustering method appears to be the most suitable one for imputing missing spirometry values. Even if the imputation functions are not the same as those used in simulation, the k-prototype method can improve the estimates of the MI methods.
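The general idea of clustering on mixed numeric and categorical covariates and then imputing within each cluster can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several simplifying assumptions: the prototypes are fixed by hand rather than learned iteratively, the missing quantity is a binary impairment flag rather than the spirometry values themselves, imputation is a simple within-cluster random draw from observed values, and only the point estimate is averaged across imputed datasets (no variance pooling). All names (`mixed_distance`, `assign_clusters`, `impute_prevalence`, the record fields, and `gamma`) are hypothetical and chosen for illustration.

```python
import random
from statistics import mean


def mixed_distance(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """k-prototypes-style dissimilarity for mixed data:
    squared Euclidean distance on numeric features plus
    gamma times the number of categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches


def assign_clusters(records, prototypes, gamma=1.0):
    """Assign each record to its nearest prototype (assumed given here;
    a full k-prototypes run would also update the prototypes)."""
    return [
        min(
            range(len(prototypes)),
            key=lambda k: mixed_distance(
                r["num"], r["cat"],
                prototypes[k]["num"], prototypes[k]["cat"], gamma,
            ),
        )
        for r in records
    ]


def impute_prevalence(records, labels, n_imputations=20, seed=0):
    """Fill each missing outcome with a random draw from observed outcomes
    in the same cluster, repeat n_imputations times, and average the
    resulting prevalence estimates."""
    rng = random.Random(seed)
    donors = {}
    for r, k in zip(records, labels):
        if r["impaired"] is not None:
            donors.setdefault(k, []).append(r["impaired"])
    estimates = []
    for _ in range(n_imputations):
        completed = [
            r["impaired"] if r["impaired"] is not None else rng.choice(donors[k])
            for r, k in zip(records, labels)
        ]
        estimates.append(mean(completed))
    return mean(estimates)


# Toy data (invented for illustration): one scaled numeric covariate,
# one categorical covariate, and a binary outcome with missing entries.
records = [
    {"num": [0.10], "cat": ["never"], "impaired": 0},
    {"num": [0.20], "cat": ["never"], "impaired": 0},
    {"num": [0.15], "cat": ["never"], "impaired": None},
    {"num": [0.90], "cat": ["current"], "impaired": 1},
    {"num": [0.80], "cat": ["current"], "impaired": 1},
    {"num": [0.85], "cat": ["current"], "impaired": None},
]
prototypes = [
    {"num": [0.15], "cat": ["never"]},
    {"num": [0.85], "cat": ["current"]},
]

labels = assign_clusters(records, prototypes)
estimate = impute_prevalence(records, labels)
print(labels, estimate)
```

The categorical mismatch term is what lets this kind of dissimilarity use categorical covariates directly, which a purely numeric distance cannot do without an encoding step. The weight `gamma` balances the numeric and categorical contributions and would need tuning on real data.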

Authors

So HY; Ma J; Griffith LE; Balakrishnan N

Journal

Journal of the Japan Statistical Society, Vol. 8, No. 1, pp. 691–720

Publisher

Springer Nature

Publication Date

June 1, 2025

DOI

10.1007/s42081-025-00298-x

ISSN

1882-2754
