Model checks for complex hierarchical models Alex Lewin and Sylvia Richardson Imperial College

Model checks for complex
hierarchical models
Alex Lewin and Sylvia Richardson
Imperial College
Centre for Biostatistics
Background and Aims

Many complex models used in bioinformatics

Classification/clustering can be greatly affected by
choice of distributions

Our approach: exploit the structure of the model to
perform predictive checks

hierarchical models generally involve
exchangeability assumptions

mixture models are partially exchangeable
Outline of Talk

Mixture model for gene expression data

Model checks for mixture model


distribution for gene-specific variances

different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Hierarchical mixture model for
gene expression data
ηj
μ,τ
wj
g
σg
Sg
ybarg
w ~ Dirichlet(1,…,1),
various priors for δg, g
δg | η ~ Σwjhj(ηj),
f(μ,τ)
ygr | δg, g  N(δg, g2)
Data: paired log
differences between 2
conditions
g = gene
r = replicate
j = mixture component
g2 | μ,τ 
differential effect for
gene g
variance for
each gene
Mixture model for gene
expression data

Many mixture models have been
proposed for gene expression data

Set-up is similar to variable selection
prior: point mass + alternative distribution

Particular choices for alternative:

Normal (Lönnstedt and Speed)

Uniform (Parmigiani et al)

many others …
Mixture model for gene
expression data
Allow for asymmetry in over-and under-expressed
genes  3-component mixture model
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
6 knock-out and 5 wildtype mice
MAS5.0 processed data
Mixture model for gene
expression data
Classify each gene into mixture
components using posterior probabilities
Choice of mixture prior affects
classification results
Mixture Prior for δg
Est. w2 (% in null)
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)
0.96
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)
0.68
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+) 0.99
Outline of Talk

Mixture model for gene expression data

Models checks for mixture model


distribution for gene-specific variances

different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Predictive model checks

Predict new data from the model

Use posterior predictive distribution

Condition on hyperparameters (‘mixed predictive’ *
 not very conservative)

Get Bayesian p-value for each gene/marker/sample

Use all p-values together (100’s or 1000’s) to assess
model fit
* Gelman, Meng and Stern 1995;
Marshall and Spiegelhalter 2003
Checking distribution for gene
variances
Bayesian p-value for gene g:
pg = Prob(
Smpred
posterior
Smpred
> Sg
obs
Sgobs
All genes are
exchangeable
 histogram of p-values
for all genes together
μ,τ
| data )
g
ybarg
σg
Sgobs
post.
pred.
Sgppred
σpre
d
mixed
pred.
Smpred
‘Mixed’ v. ‘posterior’ predictive

Predictive p-values for data simulated from the model

Histograms should be Uniform

Mixed predictive distribution much less conservative
than posterior predictive
Using gene-specific distributions
Using global distribution
Checking different variance models
g2 = 2 for all genes
Model differential
expression between
3 transgenic and 3
wildtype mice
g2 | μ,τ 
Gam(μ,τ)
g2 | μ,τ 
Gam(μ,τ), μ fixed
g2 | μ,τ  logNorm(μ,τ)
Implementation (MCMC)
niter = no. MCMC iterations
m = (no. replicates – 1)/2
pg = 0
for t = 1,…,niter {
σtpred  f(μt,τt)
Stmpred  Gam( m, m(σtpred)-2 )
pg  pg + I[
Stmpred
>
Sgobs
μ,τ
]
}
pg  pg / niter
Just two extra parameters predicted
at each iteration
g
ybarg
σg
Sgobs
σpre
d
mixed
pred.
Smpred
Outline of Talk

Mixture model for gene expression data

Model checks for mixture model


distribution for gene-specific variances

different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Checking mixture prior
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
OR
δg | η, zg = j ~ hj(ηj)
j = 1,…,3
P(zg = j) = wj
Model checking: focus on separate
mixture components
Issues for mixture model checking
δg | η, zg = j ~ hj(ηj)
j = 1,…,3
Think about MCMC iterations …

Mixture component is estimated from genes currently
assigned to that component

Can only define p-value for given gene and mix.
component when the gene is assigned to that component
(i.e. condition on zg in p-value)

So check each component using only the genes currently
assigned (i.e. condition on zg in histogram)
Predictive checks for mixture model
Bayesian p-value for gene g and mix. component j:
pgj = Prob( ybargjmpred > ybargobs | data, zg=j )
μ,τ
Genes assigned to the same mix.
component are exchangeable
ηj
 histogram of p-values for each
mix. component separately
g
σg
jpred
 histogram for component j
made only from genes with large
P(zg = j)
ybarg
Sg
ybargjmpre
wj
d
Condition on classification to check
separate components
Predictive p-values for data simulated from the model
All genes with
P(zg = j) > 0
Only genes with
P(zg = j) > 0.5
Effectively we condition on a best classification
Checking different mixture
distributions
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)


Outer mix.
components
skewed too
much away from
zero
Null component
too narrow
Checking different mixture
distributions
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)


Outer
components
skewed opposite
Null still too
narrow?
Checking different mixture
distributions
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)

Better fit for all
components
wj
ηj
μ,τ
Implementation
g
pgj = 0
ybarg
for t = 1,…,niter {
 δjtpred ~ hjt(ηjt)
j = 1,…,3
 ybargtmpred  N( δjtpred , g2/nrep )
for j = zgt

σg
jpred
Sg
ybargjmpre
d
pgj  pgj + I[ ybargtmpred > ybargobs ]
for j = zgt
}
pgj  pgj / niter(zg=j)
Need ≈ngenes extra parameters at
each iteration
Summary of model checking
procedure
1.
Find part of model where individuals are assumed to be
exchangeable (so information is shared)
2.
Choose test statistic T (eg. sample mean or variance)
3.
Predict Tpred from distribution for exchangeable
individuals (whole posterior for Tpred)
4.
Compare observed Ti for each individual i to distribution
of Tpred
5.
For checking mixture components, condition on the
best classification
Outline of Talk

Mixture model for gene expression data

Model checks for mixture model


distribution for gene-specific variances

different mixture priors
Future work: model checks for a clustering and
variable selection model (Tadesse et al. 2005)
Clustering and variable
selection (Tadesse et al. 2005)

yi vector of gene expression for each sample i = 1,…,n

Multi-variate mixture model for clustering samples:


yi | zi = j  MVN(ζj, Λj)

P(zi = j) = wj

No. of mix. components (J) is estimated in the model
j = 1,…,J
Aim to select genes which are informative for clustering
the samples
Clustering and variable
selection (Tadesse et al. 2005)
Likelihood conditional on
allocation to mixture:
γ’ = vector of indices of
variables not used to
cluster samples
1 n (  ')
Likelihood | z ~ exp(   ( yi   ( ') )T  (1') ( yi( ')   ( ') ))
2 i 1
1
  exp(   ( yi( )   ( ) )T (1) ( yi( )   ( ) ))
2 iC j
j
Conjugate priors on multivariate
means and covariance matrices
P(γg = 1) = φ
γ = vector of indices of
selected variables
i = sample
g = gene
j = mix. component
Clustering and variable
selection (Tadesse et al. 2005)
J
μj(γ) , Σj(γ)
wj
φ
η(γ), Ω(γ)
yi
y(γ)jpred
Model checking: want to check the distribution for each
mixture component separately (conditional on J)
In addition, need to condition on a given variable
selection
Clearly impossible computationally
i = sample
g = gene
j = mix. component
Computing predictive p-values
1)
Run model with no prediction
2)
Find the best configuration:
3)

set of selected variables (γ)

no. mixture components J

allocation of samples to mixture components zi
Re-run model, with (γ), J and zi fixed, calculated
predictive p-values
pij = Prob( Tjpred > Tiobs | data, zi=j, J, (γ) )
where T = |y|2 (for example)
Conclusions

Choice of model distributions can greatly
influence results of clustering and classification

For models where information is shared across
individuals, predictive checks can be used as an
alternative to cross-validation

Should be possible to do this even for quite
complex models (if you can fit the model, you
can check it)
Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant
Natalia Bochkina, Clare Marshall
Peter Green
Meeting on model checking in Cambridge
David Spiegelhalter
Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/