Finding Outliers in Gaussian Model-Based Clustering
Journal Articles
Abstract
Clustering, or unsupervised classification, is a task often plagued by
outliers. Yet there is a paucity of work on handling outliers in clustering.
Outlier identification algorithms tend to fall into three broad categories:
outlier inclusion, outlier trimming, and post hoc outlier
identification methods, with the first two often requiring pre-specification
of the number of outliers. The fact that sample Mahalanobis distance is
beta-distributed is used to derive an approximate distribution for the
log-likelihoods of subset finite Gaussian mixture models. An algorithm is then
proposed that iteratively removes the least plausible points, as judged by the
subset log-likelihoods, and deems them outliers, until the subset log-likelihoods
adhere to the reference distribution. This results in a trimming method, called
OCLUST, that inherently estimates the number of outliers.
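
The beta-distribution fact that the abstract builds on can be illustrated for a single Gaussian sample. The sketch below is not the OCLUST algorithm and does not derive the subset log-likelihood distribution. It only assumes the standard result that, for n points drawn from a p-dimensional Gaussian, the squared Mahalanobis distance computed with the sample mean and unbiased sample covariance, scaled by n/(n-1)^2, follows a Beta(p/2, (n-p-1)/2) distribution. The function name `scaled_sq_mahalanobis` and the simulation settings are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def scaled_sq_mahalanobis(X):
    """Return n/(n-1)^2 * D_i^2 for each row of X.

    D_i^2 is the squared Mahalanobis distance of row i, computed with the
    sample mean and the unbiased (n-1 denominator) sample covariance.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Quadratic form (x_i - xbar)^T S^{-1} (x_i - xbar) for every row i.
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(S), centered)
    return n / (n - 1) ** 2 * d2


# Illustrative settings (assumptions, not values from the article).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

u = scaled_sq_mahalanobis(X)

# Reference distribution for the scaled squared distances.
reference = stats.beta(p / 2, (n - p - 1) / 2)

# Informal goodness-of-fit check. The distances within one sample are not
# independent, so the p-value is only a rough guide.
ks = stats.kstest(u, reference.cdf)
print("KS statistic:", ks.statistic)
print("sample mean:", u.mean(), "beta mean:", reference.mean())
```

The sample mean of the scaled distances equals p/(n-1) exactly, which matches the mean of the reference beta distribution. This makes a convenient sanity check before comparing the full distributions.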