Journal article

Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts

Abstract

Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass, depending on the input. In this paper, we provide an approximation- and learning-theoretic analysis of mixtures of expert Multilayer Perceptrons (MLPs) with (P)ReLU activation functions. We first prove that for every error level ε > 0 and every Lipschitz function f: [0, 1]^n → R, one can construct a MoMLP model (a Mixture-of-Experts comprising (P)ReLU MLPs) which uniformly approximates f to ε accuracy over [0, 1]^n, while only requiring networks of O(ε^{-1}) parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize, since the entire MoMLP model has a (finite) VC dimension of Õ(L max{nL, JW}), if there are L experts and each expert has a depth and width of J and W, respectively.
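
To make the routing strategy described in the abstract concrete, the sketch below shows an illustrative MoMLP-style model in Python: a bank of small PReLU MLP experts plus a hard router that sends each input to a single expert, so only that expert's parameters are used in the forward pass. This is a minimal illustration, not the paper's construction; the nearest-centroid router, the layer widths, and the class names are assumptions made for this example only.

```python
# Minimal sketch (assumed, not the authors' construction): a mixture of
# PReLU MLP experts with hard routing, so only one expert is evaluated
# per input (sparse activation).
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: identity for x >= 0, slope a for x < 0."""
    return np.where(x >= 0, x, a * x)

class PReLUMLP:
    """A small fully connected network with PReLU activations."""
    def __init__(self, widths, rng):
        self.weights = [rng.standard_normal((m, n)) / np.sqrt(n)
                        for n, m in zip(widths[:-1], widths[1:])]
        self.biases = [np.zeros(m) for m in widths[1:]]

    def __call__(self, x):
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = prelu(W @ x + b)
        return self.weights[-1] @ x + self.biases[-1]

class MoMLP:
    """Mixture of PReLU MLP experts with hard (single-expert) routing."""
    def __init__(self, n_in, n_experts, width, depth, rng):
        widths = [n_in] + [width] * depth + [1]
        self.experts = [PReLUMLP(widths, rng) for _ in range(n_experts)]
        # Illustrative router: each expert "owns" a region of [0, 1]^n
        # around a random centroid; inputs go to the nearest centroid.
        self.centroids = rng.uniform(0.0, 1.0, size=(n_experts, n_in))

    def __call__(self, x):
        # Hard routing: pick one expert, evaluate only its parameters.
        j = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        return self.experts[j](x)

rng = np.random.default_rng(0)
model = MoMLP(n_in=3, n_experts=8, width=16, depth=2, rng=rng)
print(model(rng.uniform(0.0, 1.0, size=3)))
```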

Authors

Kratsios A; Borde HSDO; Furuya T; Law MT

Journal

Transactions on Machine Learning Research, Vol. 2025

Publication Date

January 1, 2025
