A Statistical Perspective on the Challenges in Molecular Microbial Biology
Abstract
High-throughput sequencing (HTS) technology makes it possible to identify and
quantify non-culturable microbial organisms in all environments. Microbial
sequences have enhanced our understanding of the human microbiome, the soil and
plant environment, and the marine environment. All molecular microbial data
pose statistical challenges due to contaminant sequences from reagents, batch
effects, unequal sampling, and undetected taxa. Technical biases and
heteroscedasticity have the strongest effects, but strain-level differences
across subjects and environments also make direct differential abundance
testing unwieldy. We introduce a few statistical tools that can
overcome some of these difficulties and demonstrate those tools on an example.
We show how standard statistical methods, such as simple hierarchical mixture
and topic models, can facilitate inferences on latent microbial communities. We
also review some nonparametric Bayesian approaches that combine visualization
and uncertainty quantification. The intersection of molecular microbial biology
and statistics is an exciting new area of research, and we close by listing
some of the important open problems that would benefit from more careful
statistical method development.