Finding Outliers in Gaussian Model-Based Clustering
Abstract
Clustering, or unsupervised classification, is a task often plagued by
outliers. Yet there is a paucity of work on handling outliers in clustering.
Outlier identification algorithms tend to fall into three broad categories:
outlier inclusion, outlier trimming, and post hoc outlier identification
methods, with the first two often requiring pre-specification of the number of
outliers. The fact that the sample squared Mahalanobis distance is
beta-distributed is used to derive an approximate distribution for the
log-likelihoods of subset finite Gaussian mixture models, i.e., models fitted
to the data with one point removed. An algorithm is then proposed that
iteratively removes the least plausible point according to the subset
log-likelihoods, deeming it an outlier, until the subset log-likelihoods
adhere to the reference distribution. This results in a trimming method,
called OCLUST, that inherently
estimates the number of outliers.
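
For reference, the beta result alluded to above is the classical fact that,
for x_1, ..., x_n i.i.d. N_p(mu, Sigma) with sample mean and sample covariance
used in D_i^2, the scaled squared Mahalanobis distance n D_i^2 / (n - 1)^2
follows a Beta(p/2, (n - p - 1)/2) distribution. The sketch below is an
illustration of the trimming loop described in the abstract, not the authors'
implementation: it refits a Gaussian mixture with each point left out in turn,
tests the resulting subset log-likelihoods against a reference distribution,
and trims the least plausible point until the test no longer rejects. The
number of clusters G, the reference CDF ref_cdf, and the Kolmogorov-Smirnov
stopping rule are stand-in assumptions for the paper's derived distribution
and stopping criterion.

# A minimal sketch of an OCLUST-style trimming loop; G, ref_cdf, and the
# KS stopping rule are illustrative assumptions, not the paper's exact method.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def subset_log_likelihoods(X, G, seed=0):
    """Total log-likelihood of a G-component GMM refit with each point removed."""
    ll = np.empty(len(X))
    for i in range(len(X)):
        Xi = np.delete(X, i, axis=0)
        gmm = GaussianMixture(n_components=G, random_state=seed).fit(Xi)
        ll[i] = gmm.score(Xi) * len(Xi)  # score() returns the mean log-likelihood
    return ll

def trim_outliers(X, G, ref_cdf, alpha=0.05, max_trim=None):
    """Trim the least plausible point until the subset log-likelihoods
    are consistent with ref_cdf (one-sample KS test at level alpha)."""
    X = np.asarray(X, dtype=float)
    max_trim = max_trim if max_trim is not None else len(X) // 4  # assumed cap
    outliers = []
    while len(outliers) < max_trim:
        ll = subset_log_likelihoods(X, G)
        if stats.kstest(ll, ref_cdf).pvalue > alpha:
            break  # subset log-likelihoods now adhere to the reference
        # Removing the most outlying point raises the refit likelihood the
        # most, so the point with the largest subset log-likelihood is trimmed.
        i = int(np.argmax(ll))
        outliers.append(X[i])
        X = np.delete(X, i, axis=0)
    return X, np.array(outliers)

The number of points trimmed at termination serves as the estimate of the
number of outliers, mirroring the abstract's claim that the method estimates
this quantity inherently rather than requiring it up front.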