Fast Inference for Quantile Regression with Tens of Millions of Observations
Abstract
Big data analytics has opened new avenues in economic research, but the
challenge of analyzing datasets with tens of millions of observations is
substantial. Conventional econometric methods based on extremum estimators
require large amounts of computing resources and memory, which are often not
readily available. In this paper, we focus on linear quantile regression
applied to "ultra-large" datasets, such as U.S. decennial censuses. A fast
inference framework is presented, utilizing stochastic subgradient descent
(S-subGD) updates. The inference procedure handles cross-sectional data
sequentially: (i) updating the parameter estimate with each incoming "new
observation", (ii) aggregating it as a $\textit{Polyak-Ruppert}$ average, and
(iii) computing a pivotal statistic for inference using only a solution path.
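To make steps (i) and (ii) concrete, here is a minimal sketch of a single
S-subGD pass, assuming the standard check-loss subgradient for linear quantile
regression and a polynomially decaying step size $\gamma_t = \gamma_0 t^{-a}$
with $a \in (1/2, 1)$; the function name, the step-size constants, and the
`data_stream` interface are illustrative rather than the paper's notation.

```python
import numpy as np

def s_subgd_quantile(data_stream, d, tau=0.5, gamma0=1.0, a=0.667):
    """One pass of stochastic subgradient descent (S-subGD) for linear
    quantile regression, with a running Polyak-Ruppert average.

    `data_stream` yields (x_t, y_t) pairs one at a time, so memory use
    stays O(d) regardless of the sample size n.
    """
    beta = np.zeros(d)       # current iterate beta_t
    beta_bar = np.zeros(d)   # Polyak-Ruppert average of the iterates
    t = 0
    for x, y in data_stream:
        t += 1
        gamma_t = gamma0 * t ** (-a)   # decaying step size, a in (1/2, 1)
        # Subgradient of the check loss rho_tau(y - x'beta) w.r.t. beta is
        # -(tau - 1{y <= x'beta}) x, so the descent update is:
        indicator = 1.0 if y <= x @ beta else 0.0
        beta = beta + gamma_t * (tau - indicator) * x
        # Step (ii): update the running average without storing past iterates.
        beta_bar += (beta - beta_bar) / t
    return beta_bar
```

With $\tau = 0.5$, a single pass over the stream yields an averaged
median-regression estimate using only $O(d)$ memory.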
The methodology draws on the random-scaling approach from time-series
regression to construct an asymptotically pivotal statistic. The proposed test
statistic is computed in a fully online fashion, and critical values are
obtained without resampling.
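As an illustration of the random-scaling step, the sketch below accumulates,
alongside the same S-subGD recursion, the running moments needed for the
variance functional
$\widehat{V}_n = n^{-2} \sum_{s=1}^{n} s^2 (\bar{\beta}_s - \bar{\beta}_n)(\bar{\beta}_s - \bar{\beta}_n)'$
restricted to one coordinate, and then forms the self-normalized t-statistic.
Expanding the centered sum into three running sums is an implementation choice
assumed here, and all names are illustrative.

```python
import numpy as np

def random_scaling_tstat(data_stream, d, j, beta0_j=0.0, tau=0.5,
                         gamma0=1.0, a=0.667):
    """Fully online t-statistic for H0: beta_j = beta0_j via random scaling.

    Only O(d) state is kept: the iterate, its Polyak-Ruppert average, and
    three scalar running sums for the j-th diagonal entry of V_hat.
    """
    beta = np.zeros(d)
    beta_bar = np.zeros(d)
    A = 0.0   # sum over s of s^2 * beta_bar_{s,j}^2
    b = 0.0   # sum over s of s^2 * beta_bar_{s,j}
    c = 0.0   # sum over s of s^2
    n = 0
    for x, y in data_stream:
        n += 1
        gamma = gamma0 * n ** (-a)
        beta = beta + gamma * (tau - float(y <= x @ beta)) * x
        beta_bar += (beta - beta_bar) / n
        s2 = float(n) ** 2
        A += s2 * beta_bar[j] ** 2
        b += s2 * beta_bar[j]
        c += s2
    # V_hat_jj = n^{-2} sum_s s^2 (beta_bar_{s,j} - beta_bar_{n,j})^2,
    # expanded in terms of the running sums A, b, c:
    V_jj = (A - 2.0 * beta_bar[j] * b + c * beta_bar[j] ** 2) / n ** 2
    return np.sqrt(n) * (beta_bar[j] - beta0_j) / np.sqrt(V_jj)
```

Because the limit of this self-normalized statistic is a pivotal functional of
Brownian motion rather than a normal distribution, its critical values can be
taken from existing tables (e.g., Abadir and Paruolo, 1997) instead of being
resampled.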
resampling. We conduct extensive numerical studies to showcase the
computational merits of our proposed inference. For inference problems as large
as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the
number of regressors, our method generates new insights, surpassing current
inference methods in computation. Our method specifically reveals trends in the
gender gap in the U.S. college wage premium using millions of observations,
while controlling over $10^3$ covariates to mitigate confounding effects.