Journal article

Clustering gene expression time course data using mixtures of multivariate t-distributions

Abstract

Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.
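As an illustration of the model structure named in the abstract, the following minimal sketch evaluates the log-density of a multivariate t-distribution whose scale matrix is expressed through a modified Cholesky decomposition, T Σ T' = D, with T unit lower triangular and D diagonal, and then combines several such components into a mixture. The function names, argument layout, and pure-Python style are assumptions made for this example and are not taken from the paper; the parameter estimation (the expectation-maximization algorithm), the linear model for the mean, and the covariance constraints described in the abstract are not shown.

```python
import math

def log_t_density_mcd(x, mu, T, d, nu):
    """Log-density of a p-variate t-distribution at x.

    Hypothetical helper: the scale matrix Sigma is given implicitly by
    T Sigma T' = diag(d), where T is unit lower triangular (list of rows)
    and d holds the positive diagonal entries of D.
    """
    p = len(x)
    resid = [xi - mi for xi, mi in zip(x, mu)]
    # r = T (x - mu); since inv(Sigma) = T' D^{-1} T, the Mahalanobis
    # distance reduces to a weighted sum of squares of r.
    r = [sum(T[i][j] * resid[j] for j in range(p)) for i in range(p)]
    delta = sum(ri * ri / di for ri, di in zip(r, d))
    # det(T) = 1, so log|Sigma| = sum of log d_i.
    log_det = sum(math.log(di) for di in d)
    return (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * p * math.log(math.pi * nu)
            - 0.5 * log_det
            - 0.5 * (nu + p) * math.log1p(delta / nu))

def mixture_log_density(x, weights, mus, Ts, ds, nus):
    """Log-density of a finite mixture of the components above.

    Uses the log-sum-exp trick for numerical stability.
    """
    terms = [math.log(w) + log_t_density_mcd(x, m, T, d, nu)
             for w, m, T, d, nu in zip(weights, mus, Ts, ds, nus)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Example with made-up values: two components, three time points each.
T_example = [[1.0, 0.0, 0.0],
             [-0.6, 1.0, 0.0],
             [0.1, -0.5, 1.0]]
value = mixture_log_density(
    [0.2, 0.5, 0.9],
    weights=[0.4, 0.6],
    mus=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    Ts=[T_example, T_example],
    ds=[[1.0, 0.8, 0.5], [1.0, 1.0, 1.0]],
    nus=[5.0, 20.0],
)
```

Working with T and D directly avoids inverting the scale matrix, and in this parameterization constraints such as sharing T or D across components are one way to obtain more parsimonious models.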

Authors

McNicholas PD; Subedi S

Journal

Journal of Statistical Planning and Inference, Vol. 142, No. 5, pp. 1114–1127

Publisher

Elsevier

Publication Date

May 1, 2012

DOI

10.1016/j.jspi.2011.11.026

ISSN

0378-3758
