We introduce a deep learning model that can universally approximate regular
conditional distributions (RCDs). The proposed model operates in three phases:
first, it linearizes inputs from a given metric space $\mathcal{X}$ to
$\mathbb{R}^d$ via a feature map; next, a deep feedforward neural network
processes the linearized features; finally, the network's outputs are
transformed into the $1$-Wasserstein space $\mathcal{P}_1(\mathbb{R}^D)$ via a
probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014).
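For instance, one such probabilistic attention layer, with illustrative width
$N$ and atoms $y_1,\dots,y_N\in\mathbb{R}^D$, maps the network output
$h(x)\in\mathbb{R}^N$ at an input $x$ to the finitely supported measure
\[
	\operatorname{Attention}\big(h(x)\big)
	\;=\;
	\sum_{n=1}^{N}\big[\operatorname{softmax}_N\big(h(x)\big)\big]_n\,\delta_{y_n}
	\;\in\;\mathcal{P}_1(\mathbb{R}^D),
\]
where $\delta_{y_n}$ denotes the point mass at the (trainable) atom $y_n$ and
$\operatorname{softmax}_N$ normalizes $h(x)$ into a probability vector.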
Our model, called the \textit{probabilistic transformer} (PT), can approximate
any continuous function from $\mathbb{R}^d$ to $\mathcal{P}_1(\mathbb{R}^D)$
uniformly on compact sets, quantitatively. We identify two ways in which the PT
avoids the curse of dimensionality when approximating
$\mathcal{P}_1(\mathbb{R}^D)$-valued functions. In the first approach, we build
functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ that can be
efficiently approximated by a PT, uniformly on any given compact subset of
$\mathbb{R}^d$. In the second approach, given any function $f$ in
$C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$, we build compact subsets of
$\mathbb{R}^d$ on which $f$ can be efficiently approximated by a PT.
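In both cases, approximation error is measured in the $1$-Wasserstein distance
$\mathcal{W}_1$ on $\mathcal{P}_1(\mathbb{R}^D)$: a PT $\widehat{P}$
approximates $f\in C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ to accuracy
$\varepsilon>0$ on a compact set $K\subseteq\mathbb{R}^d$ whenever
\[
	\sup_{x\in K}\,\mathcal{W}_1\big(f(x),\widehat{P}(x)\big)\;\le\;\varepsilon .
\]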