Model checks for complex
hierarchical models
Alex Lewin and Sylvia Richardson
Imperial College
Centre for Biostatistics
Background and Aims
Many complex models used in bioinformatics
Classification/clustering can be greatly affected by
choice of distributions
Our approach: exploit the structure of the model to
perform predictive checks
hierarchical models generally involve
exchangeability assumptions
mixture models are partially exchangeable
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model
distribution for gene-specific variances
different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Hierarchical mixture model for
gene expression data
ηj
μ,τ
wj
g
σg
Sg
ybarg
w ~ Dirichlet(1,…,1),
various priors for δg, g
δg | η ~ Σwjhj(ηj),
f(μ,τ)
ygr | δg, g N(δg, g2)
Data: paired log
differences between 2
conditions
g = gene
r = replicate
j = mixture component
g2 | μ,τ
differential effect for
gene g
variance for
each gene
Mixture model for gene
expression data
Many mixture models have been
proposed for gene expression data
Set-up is similar to variable selection
prior: point mass + alternative distribution
Particular choices for alternative:
Normal (Lönnstedt and Speed)
Uniform (Parmigiani et al)
many others …
Mixture model for gene
expression data
Allow for asymmetry in over-and under-expressed
genes 3-component mixture model
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
6 knock-out and 5 wildtype mice
MAS5.0 processed data
Mixture model for gene
expression data
Classify each gene into mixture
components using posterior probabilities
Choice of mixture prior affects
classification results
Mixture Prior for δg
Est. w2 (% in null)
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)
0.96
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)
0.68
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+) 0.99
Outline of Talk
Mixture model for gene expression data
Models checks for mixture model
distribution for gene-specific variances
different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Predictive model checks
Predict new data from the model
Use posterior predictive distribution
Condition on hyperparameters (‘mixed predictive’ *
not very conservative)
Get Bayesian p-value for each gene/marker/sample
Use all p-values together (100’s or 1000’s) to assess
model fit
* Gelman, Meng and Stern 1995;
Marshall and Spiegelhalter 2003
Checking distribution for gene
variances
Bayesian p-value for gene g:
pg = Prob(
Smpred
posterior
Smpred
> Sg
obs
Sgobs
All genes are
exchangeable
histogram of p-values
for all genes together
μ,τ
| data )
g
ybarg
σg
Sgobs
post.
pred.
Sgppred
σpre
d
mixed
pred.
Smpred
‘Mixed’ v. ‘posterior’ predictive
Predictive p-values for data simulated from the model
Histograms should be Uniform
Mixed predictive distribution much less conservative
than posterior predictive
Using gene-specific distributions
Using global distribution
Checking different variance models
g2 = 2 for all genes
Model differential
expression between
3 transgenic and 3
wildtype mice
g2 | μ,τ
Gam(μ,τ)
g2 | μ,τ
Gam(μ,τ), μ fixed
g2 | μ,τ logNorm(μ,τ)
Implementation (MCMC)
niter = no. MCMC iterations
m = (no. replicates – 1)/2
pg = 0
for t = 1,…,niter {
σtpred f(μt,τt)
Stmpred Gam( m, m(σtpred)-2 )
pg pg + I[
Stmpred
>
Sgobs
μ,τ
]
}
pg pg / niter
Just two extra parameters predicted
at each iteration
g
ybarg
σg
Sgobs
σpre
d
mixed
pred.
Smpred
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model
distribution for gene-specific variances
different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Checking mixture prior
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
OR
δg | η, zg = j ~ hj(ηj)
j = 1,…,3
P(zg = j) = wj
Model checking: focus on separate
mixture components
Issues for mixture model checking
δg | η, zg = j ~ hj(ηj)
j = 1,…,3
Think about MCMC iterations …
Mixture component is estimated from genes currently
assigned to that component
Can only define p-value for given gene and mix.
component when the gene is assigned to that component
(i.e. condition on zg in p-value)
So check each component using only the genes currently
assigned (i.e. condition on zg in histogram)
Predictive checks for mixture model
Bayesian p-value for gene g and mix. component j:
pgj = Prob( ybargjmpred > ybargobs | data, zg=j )
μ,τ
Genes assigned to the same mix.
component are exchangeable
ηj
histogram of p-values for each
mix. component separately
g
σg
jpred
histogram for component j
made only from genes with large
P(zg = j)
ybarg
Sg
ybargjmpre
wj
d
Condition on classification to check
separate components
Predictive p-values for data simulated from the model
All genes with
P(zg = j) > 0
Only genes with
P(zg = j) > 0.5
Effectively we condition on a best classification
Checking different mixture
distributions
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)
Outer mix.
components
skewed too
much away from
zero
Null component
too narrow
Checking different mixture
distributions
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)
Outer
components
skewed opposite
Null still too
narrow?
Checking different mixture
distributions
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)
Better fit for all
components
wj
ηj
μ,τ
Implementation
g
pgj = 0
ybarg
for t = 1,…,niter {
δjtpred ~ hjt(ηjt)
j = 1,…,3
ybargtmpred N( δjtpred , g2/nrep )
for j = zgt
σg
jpred
Sg
ybargjmpre
d
pgj pgj + I[ ybargtmpred > ybargobs ]
for j = zgt
}
pgj pgj / niter(zg=j)
Need ≈ngenes extra parameters at
each iteration
Summary of model checking
procedure
1.
Find part of model where individuals are assumed to be
exchangeable (so information is shared)
2.
Choose test statistic T (eg. sample mean or variance)
3.
Predict Tpred from distribution for exchangeable
individuals (whole posterior for Tpred)
4.
Compare observed Ti for each individual i to distribution
of Tpred
5.
For checking mixture components, condition on the
best classification
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model
distribution for gene-specific variances
different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Clustering and variable
selection (Tadesse et al. 2005)
yi vector of gene expression for each sample i = 1,…,n
Multi-variate mixture model for clustering samples:
yi | zi = j MVN(ζj, Λj)
P(zi = j) = wj
No. of mix. components (J) is estimated in the model
j = 1,…,J
Aim to select genes which are informative for clustering
the samples
Clustering and variable
selection (Tadesse et al. 2005)
Likelihood conditional on
allocation to mixture:
γ’ = vector of indices of
variables not used to
cluster samples
1 n ( ')
Likelihood | z ~ exp( ( yi ( ') )T (1') ( yi( ') ( ') ))
2 i 1
1
exp( ( yi( ) ( ) )T (1) ( yi( ) ( ) ))
2 iC j
j
Conjugate priors on multivariate
means and covariance matrices
P(γg = 1) = φ
γ = vector of indices of
selected variables
i = sample
g = gene
j = mix. component
Clustering and variable
selection (Tadesse et al. 2005)
J
μj(γ) , Σj(γ)
wj
φ
η(γ), Ω(γ)
yi
y(γ)jpred
Model checking: want to check the distribution for each
mixture component separately (conditional on J)
In addition, need to condition on a given variable
selection
Clearly impossible computationally
i = sample
g = gene
j = mix. component
Computing predictive p-values
1)
Run model with no prediction
2)
Find the best configuration:
3)
set of selected variables (γ)
no. mixture components J
allocation of samples to mixture components zi
Re-run model, with (γ), J and zi fixed, calculated
predictive p-values
pij = Prob( Tjpred > Tiobs | data, zi=j, J, (γ) )
where T = |y|2 (for example)
Conclusions
Choice of model distributions can greatly
influence results of clustering and classification
For models where information is shared across
individuals, predictive checks can be used as an
alternative to cross-validation
Should be possible to do this even for quite
complex models (if you can fit the model, you
can check it)
Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant
Natalia Bochkina, Clare Marshall
Peter Green
Meeting on model checking in Cambridge
David Spiegelhalter
Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/
© Copyright 2025