Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts
Abstract
Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning
models by employing a routing strategy in which each input is processed by a
single "expert" deep learning model. This strategy allows us to scale up the
number of parameters defining the MoE while maintaining sparse activation,
i.e., MoEs only load a small number of their total parameters into GPU VRAM for
the forward pass depending on the input. In this paper, we provide an
approximation and learning-theoretic analysis of mixtures of expert MLPs with
(P)ReLU activation functions. We first prove that for every error level
$\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can
construct a MoMLP model (a Mixture-of-Experts comprising of (P)ReLU MLPs) which
uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only
requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded
in memory. Additionally, we show that MoMLPs can generalize since the entire
MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL,JW\})$, if
there are $L$ experts and each expert has a depth and width of $J$ and $W$,
respectively.