Measuring the severity of multi-collinearity in high dimensions
Abstract
Multi-collinearity is widespread in modern statistical applications and,
when ignored, can negatively impact model selection and statistical
inference. Classic tools and measures that were developed for
"$n>p$" data are neither applicable nor interpretable in the high-dimensional
regime. Here we propose 1) new individualized measures that can be used to
visualize patterns of multi-collinearity, and subsequently 2) global measures
to assess the overall burden of multi-collinearity without restricting the
dimensions of the observed data. We applied these measures to genomic data to
investigate patterns of multi-collinearity in genetic variations across
individuals with diverse ancestral backgrounds. The measures were able to
visually distinguish genomic regions of excessive multi-collinearity and
contrast the level of multi-collinearity between different continental
populations.
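
To illustrate why classic "$n>p$" diagnostics break down, the sketch below uses the variance inflation factor (VIF), computed as the diagonal of the inverse sample correlation matrix. VIF is our choice of a representative classic measure, and the helper name `classical_vif` and the simulated Gaussian data are assumptions made for illustration only; none of this is the measure proposed here. When $p>n$ the sample correlation matrix has rank at most $n-1$, so it cannot be inverted and the VIF is undefined.

```python
import numpy as np

def classical_vif(X):
    """Classical VIFs as the diagonal of the inverse sample correlation matrix.

    Returns (vif, rank). vif is None when the correlation matrix is
    rank-deficient, which is always the case when p > n.
    Hypothetical helper for illustration only.
    """
    R = np.corrcoef(X, rowvar=False)      # p x p sample correlation matrix
    p = R.shape[0]
    rank = np.linalg.matrix_rank(R)
    if rank < p:
        return None, rank                 # singular: VIF is not defined
    return np.diag(np.linalg.inv(R)), rank

rng = np.random.default_rng(0)

# "n > p" regime: 200 observations, 5 variables (simulated data).
vif_low, rank_low = classical_vif(rng.standard_normal((200, 5)))

# High-dimensional regime: 20 observations, 50 variables (simulated data).
vif_high, rank_high = classical_vif(rng.standard_normal((20, 50)))

print("n > p: rank =", rank_low, "VIF defined:", vif_low is not None)
print("p > n: rank =", rank_high, "VIF defined:", vif_high is not None)
```

In the second case the rank cannot exceed $n-1=19$, far below $p=50$, so any measure that relies on inverting the correlation matrix has no direct high-dimensional analogue.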