Model checks for complex hierarchical models
Alex Lewin and Sylvia Richardson
Imperial College Centre for Biostatistics

Background and Aims
Many complex models are used in bioinformatics
Classification/clustering can be greatly affected by the choice of distributions
Our approach: exploit the structure of the model to perform predictive checks
  hierarchical models generally involve exchangeability assumptions
  mixture models are partially exchangeable

Outline of Talk
Mixture model for gene expression data
Model checks for mixture model
  distribution for gene-specific variances
  different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Hierarchical mixture model for gene expression data
Data: paired log differences between 2 conditions
  ygr | δg, σg ~ N(δg, σg²)
  δg | η ~ Σj wj hj(ηj)        (differential effect for gene g)
  σg² | μ,τ ~ f(μ,τ)            (variance for each gene)
  w ~ Dirichlet(1,…,1), various priors for the other hyperparameters
g = gene, r = replicate, j = mixture component
[Diagram: graphical model with nodes ηj, wj, μ,τ, δg, σg, ybarg, Sg]

Mixture model for gene expression data
Many mixture models have been proposed for gene expression data
Set-up is similar to a variable selection prior: point mass + alternative distribution
Particular choices for the alternative:
  Normal (Lönnstedt and Speed)
  Uniform (Parmigiani et al)
  many others …

Mixture model for gene expression data
Allow for asymmetry between over- and under-expressed genes
3-component mixture model:
  δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
Data: 6 knock-out and 5 wildtype mice, MAS5.0 processed data

Mixture model for gene expression data
Classify each gene into mixture components using posterior probabilities
Choice of mixture prior affects classification results

  Mixture prior for δg                              Est. w2 (proportion in null)
  w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)             0.96
  w1Gam-(1.5,η-) + w2δ(0) + w3Gam+(1.5,η+)          0.68
  w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)        0.99

Predictive model checks
Predict new data from the model
Use posterior predictive distribution
Condition on hyperparameters ('mixed predictive' *): not very conservative
Get a Bayesian p-value for each gene/marker/sample
Use all p-values together (100's or 1000's) to assess model fit
* Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

Checking distribution for gene variances
Bayesian p-value for gene g:
  pg = Prob( S^mpred > Sg^obs | data )
All genes are exchangeable, so use a histogram of p-values for all genes together
[Diagram: μ,τ → σg → Sg^obs; posterior predictive Sg^ppred is drawn from σg; mixed predictive S^mpred is drawn from σ^pred, which is drawn from μ,τ]

'Mixed' v. 'posterior' predictive
Predictive p-values for data simulated from the model
Histograms should be Uniform
Mixed predictive distribution is much less conservative than posterior predictive
[Figure: histograms of p-values. Panels: "Using gene-specific distributions" (posterior predictive), "Using global distribution" (mixed predictive)]

Checking different variance models
Model differential expression between 3 transgenic and 3 wildtype mice
[Figure: histograms of p-values. Panels:
  σg² = σ² for all genes
  σg² | μ,τ ~ Gam(μ,τ)
  σg² | μ,τ ~ Gam(μ,τ), μ fixed
  σg² | μ,τ ~ logNorm(μ,τ)]

Implementation (MCMC)
niter = no. MCMC iterations
m = (no. replicates - 1)/2
  pg = 0
  for t = 1,…,niter {
    σt^pred ~ f(μt,τt)
    St^mpred ~ Gam( m, m(σt^pred)^-2 )
    pg ← pg + I[ St^mpred > Sg^obs ]
  }
  pg ← pg / niter
Just two extra parameters predicted at each iteration
[Diagram: μ,τ → σ^pred → S^mpred (mixed predictive), alongside σg → Sg^obs, ybarg]

Checking mixture prior
  δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
OR
  δg | η, zg = j ~ hj(ηj),   j = 1,…,3
  P(zg = j) = wj
Model checking: focus on separate mixture components

Issues for mixture model checking
  δg | η, zg = j ~ hj(ηj),   j = 1,…,3
Think about MCMC iterations …
Mixture component is estimated from the genes currently assigned to that component
Can only define a p-value for a given gene and mix. component when the gene is assigned to that component (i.e. condition on zg in the p-value)
So check each component using only the genes currently assigned to it (i.e. condition on zg in the histogram)

Predictive checks for mixture model
Bayesian p-value for gene g and mix. component j:
  pgj = Prob( ybargj^mpred > ybarg^obs | data, zg = j )
Genes assigned to the same mix. component are exchangeable, so use a histogram of p-values for each mix. component separately
Histogram for component j is made only from genes with large P(zg = j)
[Diagram: graphical model with nodes μ,τ, ηj, wj, δg, σg, δj^pred, ybarg, Sg, ybargj^mpred]

Condition on classification to check separate components
Predictive p-values for data simulated from the model
[Figure: histograms of p-values. Panels: "All genes with P(zg = j) > 0", "Only genes with P(zg = j) > 0.5"]
Effectively we condition on a best classification

Checking different mixture distributions
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)
Outer mix. components skewed too much away from zero
Null component too narrow

Checking different mixture distributions
w1Gam-(1.5,η-) + w2δ(0) + w3Gam+(1.5,η+)
Outer components skewed in the opposite direction
Null still too narrow?

Checking different mixture distributions
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)
Better fit for all components

Implementation
  pgj = 0
  for t = 1,…,niter {
    δjt^pred ~ hj(ηjt),   j = 1,…,3
    ybargt^mpred ~ N( δjt^pred, σg²/nrep )   for j = zgt
    pgj ← pgj + I[ ybargt^mpred > ybarg^obs ]   for j = zgt
  }
  pgj ← pgj / niter(zg = j)
Need ≈ ngenes extra parameters at each iteration
[Diagram: graphical model with nodes wj, ηj, μ,τ, δg, σg, δj^pred, ybarg, Sg, ybargj^mpred]

Summary of model checking procedure
1. Find the part of the model where individuals are assumed to be exchangeable (so information is shared)
2. Choose a test statistic T (e.g. sample mean or variance)
3. Predict Tpred from the distribution for exchangeable individuals (whole posterior for Tpred)
4. Compare the observed Ti for each individual i to the distribution of Tpred
5. For checking mixture components, condition on the best classification

Clustering and variable selection (Tadesse et al. 2005)
yi = vector of gene expression for each sample, i = 1,…,n
Multivariate mixture model for clustering samples:
  yi | zi = j ~ MVN(ζj, Λj),   j = 1,…,J
  P(zi = j) = wj
No. of mix. components (J) is estimated in the model
Aim to select genes which are informative for clustering the samples

Clustering and variable selection (Tadesse et al. 2005)
Likelihood conditional on allocation to mixture components:
  \text{Likelihood} \mid z \;\propto\; \exp\Big( -\tfrac{1}{2} \sum_{i=1}^{n} (y_i^{(\gamma')} - \eta^{(\gamma')})^T \, \Omega_{(\gamma')}^{-1} \, (y_i^{(\gamma')} - \eta^{(\gamma')}) \Big) \times \prod_{j=1}^{J} \exp\Big( -\tfrac{1}{2} \sum_{i \in C_j} (y_i^{(\gamma)} - \mu_j^{(\gamma)})^T \, \Sigma_{j(\gamma)}^{-1} \, (y_i^{(\gamma)} - \mu_j^{(\gamma)}) \Big)
γ = vector of indices of selected variables
γ' = vector of indices of variables not used to cluster samples
Conjugate priors on multivariate means and covariance matrices
P(γg = 1) = φ
i = sample, g = gene, j = mix. component

Clustering and variable selection (Tadesse et al. 2005)
Model checking: want to check the distribution for each mixture component separately (conditional on J)
In addition, need to condition on a given variable selection
Clearly impossible computationally
i = sample, g = gene, j = mix. component
[Diagram: graphical model with nodes J, wj, φ, μj(γ), Σj(γ), η(γ), Ω(γ), yi, y(γ)j^pred]

Computing predictive p-values
1) Run model with no prediction
2) Find the best configuration:
     set of selected variables (γ)
     no. mixture components J
     allocation of samples to mixture components zi
3) Re-run model with (γ), J and zi fixed, and calculate predictive p-values
     pij = Prob( Tj^pred > Ti^obs | data, zi = j, J, (γ) )
     where T = |y|² (for example)

Conclusions
Choice of model distributions can greatly influence results of clustering and classification
For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation
Should be possible to do this even for quite complex models (if you can fit the model, you can check it)

Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant: Natalia Bochkina, Clare Marshall, Peter Green
Meeting on model checking in Cambridge: David Spiegelhalter, Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/
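Appendix sketch: mixed predictive check for gene variances
The loop on the "Implementation (MCMC)" slide can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the authors' software: the function name mixed_predictive_pvalues is hypothetical, the variance distribution f(μ,τ) is assumed to be log-normal with τ a standard deviation on the log scale, and the "posterior draws" of μ and τ are stand-ins fixed at the simulation values rather than output from a fitted model.

```python
import numpy as np


def mixed_predictive_pvalues(s2_obs, mu_draws, tau_draws, n_rep, rng):
    """Estimate p_g = Prob(S2_mpred > S2_obs_g | data) for every gene.

    s2_obs    : observed sample variances, one per gene
    mu_draws  : draws of mu (here stand-ins for MCMC output)
    tau_draws : draws of tau (same length as mu_draws)
    n_rep     : number of replicates per gene
    """
    s2_obs = np.asarray(s2_obs, dtype=float)
    m = (n_rep - 1) / 2.0
    exceed = np.zeros_like(s2_obs)
    for mu_t, tau_t in zip(mu_draws, tau_draws):
        # ASSUMPTION: f(mu, tau) is log-normal for the gene variance.
        sigma2_pred = rng.lognormal(mean=mu_t, sigma=tau_t)
        # Sample variance ~ Gamma(shape m, rate m / sigma2), i.e. scale sigma2 / m.
        s2_mpred = rng.gamma(shape=m, scale=sigma2_pred / m)
        # One shared predicted value is compared with every gene's observed value.
        exceed += s2_mpred > s2_obs
    return exceed / len(mu_draws)


# Data simulated from the assumed model, so the p-values should look Uniform.
rng = np.random.default_rng(0)
n_genes, n_rep, n_iter = 2000, 6, 4000
mu_true, tau_true = -1.0, 0.5
sigma2 = rng.lognormal(mu_true, tau_true, size=n_genes)
m = (n_rep - 1) / 2.0
s2_obs = rng.gamma(shape=m, scale=sigma2 / m)

# Stand-in "posterior draws": fixed at the true values for illustration only.
mu_draws = np.full(n_iter, mu_true)
tau_draws = np.full(n_iter, tau_true)

p = mixed_predictive_pvalues(s2_obs, mu_draws, tau_draws, n_rep, rng)
hist_counts, _ = np.histogram(p, bins=10, range=(0.0, 1.0))
print(hist_counts)
```

Only one predicted variance and one predicted sample variance are drawn per iteration, matching the "just two extra parameters" remark on the slide; in a real analysis the draws of μ and τ would come from the sampler itself.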
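Appendix sketch: predictive check for mixture components
The component-wise p-value pgj, accumulated only over iterations in which gene g is allocated to component j, might be computed along these lines. Everything here is a hedged illustration: component_pvalues and draw_delta are hypothetical names, the allocation draws are held fixed at a known classification, and the component distributions (negative Gamma, narrow Normal, positive Gamma with arbitrary parameters) are assumptions chosen only to make the example run.

```python
import numpy as np


def component_pvalues(ybar_obs, z_draws, delta_pred_draws, sigma2_draws, n_rep, rng):
    """Estimate p_gj = Prob(ybar_mpred > ybar_obs_g | data, z_g = j).

    z_draws          : (n_iter, n_genes) allocation of each gene at each iteration
    delta_pred_draws : (n_iter, n_comp) predicted effect for each component
    sigma2_draws     : (n_iter, n_genes) gene variances at each iteration
    Returns (p, post_prob); p[g, j] is NaN if gene g was never assigned to j.
    """
    n_iter, n_genes = z_draws.shape
    n_comp = delta_pred_draws.shape[1]
    exceed = np.zeros((n_genes, n_comp))
    counts = np.zeros((n_genes, n_comp))
    genes = np.arange(n_genes)
    for t in range(n_iter):
        z_t = z_draws[t]
        # Predict a new mean for each gene from its currently assigned component.
        ybar_pred = rng.normal(delta_pred_draws[t, z_t],
                               np.sqrt(sigma2_draws[t] / n_rep))
        exceed[genes, z_t] += ybar_pred > ybar_obs
        counts[genes, z_t] += 1
    with np.errstate(invalid="ignore", divide="ignore"):
        p = np.where(counts > 0, exceed / counts, np.nan)
    return p, counts / n_iter


rng = np.random.default_rng(1)
n_genes, n_rep, n_iter, n_comp = 300, 5, 500, 3


def draw_delta(size):
    # ASSUMED component distributions: under-expressed, null, over-expressed.
    return np.column_stack([-rng.gamma(1.5, 1.0, size),
                            rng.normal(0.0, 0.05, size),
                            rng.gamma(1.5, 1.0, size)])


z_true = rng.integers(0, n_comp, size=n_genes)
delta_true = draw_delta(n_genes)[np.arange(n_genes), z_true]
sigma2 = np.full(n_genes, 0.2)
ybar_obs = rng.normal(delta_true, np.sqrt(sigma2 / n_rep))

# Stand-ins for MCMC output: allocations and variances held at the true values.
z_draws = np.tile(z_true, (n_iter, 1))
sigma2_draws = np.tile(sigma2, (n_iter, 1))
delta_pred_draws = draw_delta(n_iter)

p, post_prob = component_pvalues(ybar_obs, z_draws, delta_pred_draws,
                                 sigma2_draws, n_rep, rng)
# Histogram for component j uses only genes with P(z_g = j) > 0.5.
for j in range(n_comp):
    keep = post_prob[:, j] > 0.5
    print(j, np.histogram(p[keep, j], bins=5, range=(0.0, 1.0))[0])
```

One predicted mean per gene is drawn at each iteration, which corresponds to the "≈ ngenes extra parameters" noted on the slide. Returning the estimated P(zg = j) alongside the p-values makes it straightforward to restrict each histogram to confidently classified genes.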