Exploring dimension learning via a penalized probabilistic principal component analysis
Abstract
Establishing a low-dimensional representation of the data leads to efficient
data learning strategies. In many cases, the reduced dimension needs to be
explicitly stated and estimated from the data. We explore the estimation of
dimension in finite samples as a constrained optimization problem, where the
estimated dimension is a maximizer of a penalized profile likelihood criterion
within the framework of a probabilistic principal components analysis. Unlike
other penalized maximization problems that require an "optimal" penalty tuning
parameter, we propose a data-averaging procedure whereby the estimated
dimension emerges as the most favourable choice over a range of plausible
penalty parameters. The proposed heuristic is compared with a large number of
alternative criteria in simulations and in an application to gene expression data.
Extensive simulation studies reveal that no method uniformly dominates
the others and highlight the importance of subject-specific knowledge in
choosing statistical methods for dimension learning. Our application results
also suggest that gene expression data have a higher intrinsic dimension than
previously thought. Overall, our proposed heuristic strikes a good balance and
is the method of choice when model assumptions are moderately violated.
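
As a rough illustration of the idea summarized above, the following sketch selects a dimension by maximizing a penalized PPCA profile log-likelihood over a grid of penalty values and then keeping the dimension chosen most often. It is a minimal sketch under stated assumptions, not the procedure of the paper: the penalty form (penalty weight times the number of free parameters), the majority-vote aggregation, the penalty grid, and the names ppca_profile_loglik, n_free_params and select_dimension are all hypothetical choices made here for illustration. Only the profile log-likelihood itself follows the standard PPCA result.

```python
import numpy as np


def ppca_profile_loglik(eigvals, n, k):
    """Standard PPCA log-likelihood profiled over loadings and noise variance.

    eigvals: eigenvalues of the sample covariance, sorted in decreasing order.
    """
    p = eigvals.size
    sigma2 = eigvals[k:].mean()  # ML noise variance: mean of discarded eigenvalues
    return -0.5 * n * (
        p * np.log(2 * np.pi)
        + np.log(eigvals[:k]).sum()
        + (p - k) * np.log(sigma2)
        + p
    )


def n_free_params(p, k):
    # Assumption: loadings up to rotation, plus one noise variance.
    return p * k - k * (k - 1) / 2 + 1


def select_dimension(X, deltas, k_max=None):
    """Return the dimension selected most often across the penalty grid.

    Assumption: criterion(k) = loglik(k) - delta * n_free_params(k), with a
    simple vote over deltas standing in for the paper's averaging step.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against log(0)
    k_max = p - 1 if k_max is None else k_max
    ks = np.arange(1, k_max + 1)
    loglik = np.array([ppca_profile_loglik(eigvals, n, k) for k in ks])
    n_par = np.array([n_free_params(p, k) for k in ks])
    votes = np.zeros(ks.size, dtype=int)
    for delta in deltas:
        votes[np.argmax(loglik - delta * n_par)] += 1
    return int(ks[np.argmax(votes)]), votes


# Simulated example: 3 latent factors in 10 observed variables.
rng = np.random.default_rng(0)
n, p, k_true = 500, 10, 3
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
W = 5.0 * Q[:, :k_true]  # orthogonal loadings with a strong signal
X = rng.standard_normal((n, k_true)) @ W.T + rng.standard_normal((n, p))
deltas = np.linspace(1.0, np.log(n), 25)  # hypothetical "plausible" range
k_hat, votes = select_dimension(X, deltas)
print(k_hat, votes)
```

Voting across the grid avoids committing to a single tuning value; a weighted average of the criterion over the grid would be another reasonable way to aggregate.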