Robust estimation of microbial diversity in theory and in practice
Abstract
Quantifying diversity is of central importance for the study of structure,
function and evolution of microbial communities. The estimation of microbial
diversity has received renewed attention with the advent of large-scale
metagenomic studies. Here, we consider what the diversity observed in a sample
tells us about the diversity of the community being sampled. First, we argue
that one cannot reliably estimate the absolute and relative number of microbial
species present in a community without making unsupported assumptions about
species abundance distributions. The reason for this is that sample data do not
contain information about the number of rare species in the tail of species
abundance distributions. We illustrate the difficulty in comparing species
richness estimates by applying Chao's estimator of species richness to a set of
in silico communities: the estimator ranks these communities incorrectly when
large numbers of rare species are present. Next, we extend our analysis to a general family of
diversity metrics ("Hill diversities"), and construct lower and upper estimates
of diversity values consistent with the sample data. The theory generalizes
Chao's estimator, which we retrieve as the lower estimate of species richness.
We show that Shannon and Simpson diversity can be robustly estimated for the in
silico communities. We analyze nine metagenomic data sets from a wide range of
environments, and show that our findings are relevant for empirically sampled
communities. Hence, we recommend the use of Shannon and Simpson diversity
rather than species richness in efforts to quantify and compare microbial
diversity.
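
For orientation, the quantities named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lower and upper estimates constructed in the paper: `chao1` is the classic Chao1 formula computed from singleton and doubleton counts, and `hill_diversity` is the naive plug-in Hill diversity of order q computed directly from sample proportions (q = 0 gives observed richness, q = 1 the exponential of Shannon entropy, q = 2 the inverse Simpson index). The function names and the example counts are hypothetical.

```python
import math


def chao1(counts):
    """Classic Chao1 lower-bound estimate of species richness.

    counts: per-species abundances observed in one sample.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    if doubletons > 0:
        return observed + singletons ** 2 / (2.0 * doubletons)
    # Bias-corrected form, used here when no doubletons were observed.
    return observed + singletons * (singletons - 1) / 2.0


def hill_diversity(counts, q):
    """Naive plug-in Hill diversity of order q from sample proportions."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # The limit q -> 1 is the exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical sample: 8 observed species, 4 singletons, 2 doubletons.
sample = [10, 5, 2, 2, 1, 1, 1, 1]
richness_estimate = chao1(sample)
observed_richness = hill_diversity(sample, 0)
shannon_diversity = hill_diversity(sample, 1)
simpson_diversity = hill_diversity(sample, 2)
```

In this sketch the singleton and doubleton counts enter only the richness estimate, while the q = 1 and q = 2 values are dominated by the abundant species, which is the intuition behind preferring Shannon and Simpson diversity when the rare tail is poorly sampled.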