Bayesian Inference in the Sample Selection and Two-Part Models

Bayesian Inference in the Sample Selection and Two-Part Models
Martijn van Hasselt
Department of Economics
The University of Western Ontario
February 2008
Abstract
This paper considers two models to deal with an outcome variable that contains a large fraction
of zeros: the sample-selection model and the two-part model. Whereas the sample selection
model allows correlation between the selection and outcome mechanisms, the two-part model
assumes conditional independence. Using a fully parametric Bayesian approach, we present
Markov Chain Monte Carlo (MCMC) algorithms for the model parameters that are based on
data augmentation. With a Gaussian speci…cation of the likelihood the models are, in some
sense, nested. In order to determine which model is more appropriate, inference can focus
on the correlation coe¢ cient. Alternatively, a Bayes factor can be computed. The Bayesian
semiparametric approach uses ‡exible families of distributions. In particular, we present MCMC
schemes based on mixtures of normal distributions and Dirchlet process priors. The various
methods are illustrated using simulated data.
This is still a work in progress and feedback is appreciated. I thank Tony Lancaster for his guidance and William
McCausland for helpful comments and suggestions. All errors are, of course, my own. Contact: [email protected]
1
1
Introduction
In many economic applications a nonnegative outcome variable of interest is typically characterized
by a certain fraction of observations clustered at zero and a distribution of positive values that is
highly skewed. Examples are consumer expenditures on durable goods and labor supply functions
(hours worked). Our paper considers two models that are commonly used in the literature to
analyze this type of data. The essential di¤erence between these models is how they interpret a
zero value of the outcome. In the …rst model, that we will refer to as a sample selection model
(SSM), the data generating process is split up into two stages. The …rst stage describes the selection
mechanism, which determines whether we see a positive outcome or not. Usually, the selection
mechanism takes the form of a structural equation for an underlying latent variable such as utility.
If the latent variable falls below a certain threshold, the outcome variable is zero; if it exceeds this
threshold a positive outcome is observed. The second stage determines the level of the outcome.
If the …rst stage dictates that the outcome should be positive we observe the level determined in
the second stage. Otherwise, we observe a zero. Thus, the zeros represent missing data: we do not
observe what the positive outcome would have been. As an example, consider a consumer who is
considering the purchase of a car. Depending on her income, preferences, the availability of public
transportation and the travel distance to work, she will …rst decide on whether to buy a car at all.
If not, we observe zero expenditures and the potential outcome is unknown. On the other hand,
if she does decide to buy, the potential expenditures are realized (i.e. observed) and may vary
depending on, say, income, travel distance and individual tastes.
Sample selection models, such as the one used in this paper, typically describe potential outcomes which are only partially observed. In this case the observed positive values of the outcome
variable follow a pattern that is derived from the latent structure. One could wonder whether the
potential outcome is an interesting quantity to begin with. Regardless, as a modeling device the
SSM can be used to describe actual outcomes as well. More generally, sample selection occurs when
the data at hand is not a random sample from the population. When making inferences it is then
important to understand the process that generated the sample. Individuals may select themselves
into (or out of) the sample based on a combination of observable quantities and unobserved heterogeneity. If this heterogeneity also a¤ects the outcome variable, inferences using the selected sample
alone may be subject to selection bias.
Early contributions to the sample selection literature are, among others, Gronau (1974) and
Heckman (1979). Gronau’s paper analyzes the potential for selection bias in the labor market when
observed wages are used to make inference about the distribution of wage o¤ers. Heckman (1979)
treats sample selection as a speci…cation error and proposes a two-step estimator that corrects for
omitted variable bias. Since our paper takes a Bayesian approach we will not discuss the frequentist
literature any further.1
The second model in this paper is a two-part model (2PM).2 One of the …rst discussions of the
1
2
Excellent frequentist surveys are Lee (2003) and Vella (1998). The latter focuses on semiparametric models.
The term hurdle model is also used. Wooldridge (2002) and Cameron and Trivedi (2005) contain good discussions
2
2PM goes back to Cragg (1971). As in the sample selection model, two stages are distinguished
in the outcome generating process: a selection mechanism and an outcome level function. The
main di¤erence with the SSM is that the 2PM models the observed outcomes directly, rather than
specifying a latent structure. In this framework a zero does not represent missing data, but can be
interpreted as a corner solution in a consumer optimization problem. An application can be found
in Duan et al. (1983) who use the two-part model to analyze individuals’medical expenditures.
There has been some debate in the health literature as to which model is more appropriate
for describing medical expenditures. Duan et al. (1983) argue that the 2PM is to be preferred
since it models actual as opposed to potential outcomes. Whether we are interested in actual or
potential outcomes depends on the particular application at hand. Regardless, the SSM can also
be used to analyze actual outcomes because the latent structure implies a model for the observed
data. Another claim, e.g. Hay and Olsen (1984), has been that in the parametric (Gaussian)
case the 2PM is actually nested within the SSM. This is shown to be incorrect in Duan et al.
(1984). Moreover, the model parameters are not directly comparable because they have a di¤erent
interpretation in each model. Marginal e¤ects on the observed outcomes are directly available in
the 2PM, whereas in the SSM they involve some nonlinear transformation of the model parameters
and are typically covariate dependent.
The aforementioned debate has prompted many authors to compare the SSM and 2PM. Manning, Duan, and Rogers (1987) conduct an extensive Monte Carlo study and …nd that overall the
2PM performs very well in terms of predictive ability, even if the SSM is the true model generating the data. Given a joint normality assumption the 2PM is observationally equivalent to the
SSM when the cross-equation correlation is zero. In principle the null hypothesis that the 2PM is
the true model can then be tested via a classical t-test. Leung and Yu (1996) present simulation
evidence suggesting that such a test may perform poorly due to near multicollinearity. Dow and
Norton (2003) propose a test based on the di¤erence in empirical mean squared error between the
two models.3
Our paper takes a Bayesian approach to estimating the sample selection and two-part models.
The main contribution is to provide sampling algorithms to approximate the posterior distributions
of interest. We start by considering inference in a fully parametric Bayesian model based on
multivariate normal distributions. This analysis is then extended to allow for more ‡exible families
of distributions. In particular, the focus is on mixtures of normal distributions with either a …xed
or random number of mixture components. We will refer to this case as a semiparametric Bayesian
model.
Bayesian inference in limited dependent variable models can proceed by a combination of Gibbs
sampling (e.g. Casella and George 1992) and data augmentation (e.g. Tanner and Wong 1987).
These methods are useful when either the joint posterior is analytically intractable or di¢ cult to
sample from. Gibbs sampling entails generating consecutive draws from the conditional posterior
of the 2PM and related models.
3
There are some problems with this test as well. First, the choice of null hypothesis is arbitrary and second, the
test su¤ers from the same power problems as the t-test.
3
of each parameter, given the remaining ones. Data augmentation treats missing observations as
additional parameters and samples new values as part of the algorithm.4 The combined algorithm
yields a Markov chain that, under some regularity conditions, has the posterior as its invariant
distribution. Parameter values generated by the chain are then an approximate sample from the
posterior. An excellent treatment of Markov Chain Monte Carlo (MCMC) methods can be found
in Gilks, Richardson, and Spiegelhalter (1996).
Applications of MCMC are becoming widespread. Discrete choice models are treated in Albert
and Chib (1993), McCulloch and Rossi (1994) and McCulloch, Polson, and Rossi (2000). The
analysis of the parametric sample selection model in our paper is most closely related to Li (1998),
Huang (2001) and Munkin and Trivedi (2003), the common element being a simultaneous equations
structure combined with a limited range of the dependent variable. The extension to semiparametric
models draws on a large body of literature on nonparametric Bayesian methods and incorporates
it into a sample selection framework.5 The remainder of this paper is organized as follows. Section
2 presents the SSM and two Gibbs sampling algorithms based on a bivariate normal likelihood.
In section 3 we construct a Gibbs sampler for the 2PM and show how the output may be used to
compute Bayes factors for the purpose of model selection and hypothesis testing. Section 4 discusses
a simulation example. Section 5 and 6 present semiparametric versions of the SSM and construct
the corresponding MCMC algorithms. Finally, section 7 concludes. Some additional details on the
various algorithms can be found in the appendix.
2
The Sample Selection Model
We use the following version of the SSM, which is sometimes referred to as a type 2 Tobit model
(e.g. Amemiya 1985, ch. 10):
Ii = x0i1
1
+ ui1 ;
si = I fIi > 0g ;
mi = x0i2
(
ln yi =
2
mi
+ ui2 ;
if si = 1
1 if si = 0
(2.1)
:
The subscript i denotes the ith observation in a sample of size n. The vectors xi1 and xi2 have k1
and k2 elements, respectively. The equation for Ii is a selection equation: if Ii > 0, then a positive
outcome yi is observed; Ii
0 corresponds to yi = 0. The variable si is simply the indicator
of a positive outcome. The equation for mi represents the logarithm of the potential outcomes.
Potential outcomes are realized only when si = 1. Thus, m is a partially observed, partially latent
4
Thus, data augmentation can be viewed as stochastic imputation of missing values.
To quote Müller and Quintana (2004), “Nonparametric Bayesian inference is an oxymoron and a misnomer”.
The term nonparametric refers to the fact that these methods bear some resemblance to classical nonparametric
techniques, such as kernel smoothing.
5
4
variable. If the outcome yi is zero, and hence si = 0, then mi is unobserved. On the other hand,
if yi is positive and si = 1, the potential outcome mi (= ln yi ) is realized.6,7 To summarize, we
observe the vectors (x0i1 ; x0i2 ; si ) for all i and the values mi that belong to the set fmi : si = 1g. For
the parametric Bayesian analysis of this model it is assumed that the joint distribution of ui1 and
ui2 is bivariate normal:
ui1
ui2
where
!
N (0; ) ;
=
"
2
1
1 2
2
2
1 2
#
;
(2.2)
is the correlation coe¢ cient.
The random variable si has a Bernoulli distribution with
Pr fsi = 1jxi1 ;
where
1g
=
x0i1
1= 1
;
( ) denotes the CDF of the standard normal distribution. Though
1
and
1
are not
separately identi…ed, we retain both parameters for reasons explained shortly. The likelihood for
the n observations can be written as
pSSM (ln yj
1;
2;
) =
n
Y
x0i1
1= 1
si
x0i1
1
1= 1
1 si
(2.3)
i=1
Y
i:yi >0
x0i2
pu2 jI>0 ln yi
;
2
where pu2 jI>0 is the density of ui2 conditional on Ii > 0. Let fN (ajb; c) and FN (ajb; c) denote the
density and CDF, respectively, of a normal random variable with mean b, variance c, evaluated at
the point a. Let
( ) denote the standard normal density function and de…ne u
~i = ln yi
x0i2
2,
so
that
pu2 jI>0 (~
ui ) =
=
=
R1
0
pu2 ;I (~
ui ; I)dI
P (I > 0)
Z 1
pu2 (~
ui )
pIju2 (Ij~
ui )dI
(x0i1 1 = 1 ) 0
~i j0;
fN u
0
(xi1 1 =
1
=
(~
ui =
2
0
(xi1 1 =
2
2
1)
2)
1
FN 0jx0i1
x0i1
1)
6
1+(
p
2 (1
1
1
+(
ui ;
1 = 2 )~
~i
1= 2) u
2)
!
2
1 (1
2
)
:
This version of the sample selection model is widely used in the literature; see, for example, Lee (2003).
The logarithmic transform is very common in this type of models. A discussion of its rationales can be found in
Manning (1998).
7
5
Plugging this back into (2.3) the likelihood becomes
pSSM (ln yj
1;
Y
2; ) =
x0i1
1
1= 1
(2.4)
i:yi =0
Y
1
ln yi
2
x0i2
x0
pi1
1 1
2
2
i:yi >0
Gibbs Sampling in the SSM
1
2
x0i2
(ln yi
p
+
By inspection of the likelihood (2.4) it appears that no choice of prior for (
2
1
2
1;
2)
2;
!
:
) will yield a
tractable posterior distribution. We therefore develop a Gibbs sampling algorithm that simulates
draws from the posterior distribution of ( ; ; ). The updating step for
2,
1
for
2
2
and the covariance
12 .
The implied value of
is then computed as
Our Gibbs sampler involves the unidenti…ed parameters
(
1;
1)
generates new values
1
and
1.
=
12 =( 1 2 ).
The sampled values of
are therefore not informative, in the sense that there is no updating of the prior.8 The out-
put from the algorithm, however, can be used to approximate the posterior of identi…ed parameters
such as
1= 1
and . We follow the approach of McCulloch and Rossi (1994), who apply this idea
in the context of the multinomial Probit model. The main advantage of retaining the unidenti…ed
parameters is that it preserves the natural conjugacy structure in the model, and allows for an
easier approximation of the posterior. We now turn to the Gibbs samplers for the parametric SSM.
Since only the selection indicator si is observed, the variable Ii is latent and treated as an
additional parameter in the algorithm. The same can be said about the unobserved values of mi .
Data-augmentation ’completes’ the data and generates a sequence of (
1;
2;
; I; m) values that
are approximately drawn from their joint posterior. Discarding the values of (I; m) we then obtain
a sample (again, approximately) from the posterior distribution of (
1;
2;
). In what follows all
conditional distributions are to be understood as conditional on the data as well. This conditioning
is omitted for notational simplicity.
The SSM in (2.1) can be written as a SUR model. Let I = (I1 ; : : : ; In )0 , m = (m1 ; : : : ; mn )0 ,
u1 = (u11 ; : : : ; un1 )0 and u2 = (u12 ; : : : ; un2 )0 be n
W
X2
=
"
I
m
2
#
: 2n
3
x012
6 . 7
. 7
= 6
4 . 5:n
x0n2
"
#
=
1
1;
k2 ;
: (k1 + k2 )
1 vectors. De…ne the following matrices:
2
3
x011
6 . 7
. 7
X1 = 6
4 . 5:n
x0n1
X=
1;
"
X1
0
0
X2
u=
2
8
When the prior is improper, the generated values of (
1;
6
1)
"
u1
u2
k1 ;
#
#
: 2n
: 2n
(k1 + k2 );
1:
are a random walk; see McCulloch and Rossi (1994).
Then W = X + u, where E(u) = 0 and V (u) =
In . The likelihood of the normal SUR model
is
p(W j ; ) / j j
n=2
exp
/ j j
n=2
exp
1
(W X )0 (
2
1
tr B 1 ;
2
1
In )(W
X )
(2.5)
where tr( ) is the trace of a square matrix and B is de…ned as
B=
"
(I
X1
0
1 ) (I
X1
1)
(I
X1
0
1 ) (m
X2
2)
(m
X2
0
2 ) (I
X1
1)
(m
X2
0
2 ) (m
X2
2)
#
:
(2.6)
Starting with the conditional posterior of , note that p( jI; m; ; s) = p( jI; m; ) because s is a
function of I. The likelihood in (2.5) can be rewritten as
n=2
p(W j ; ) / j j
X ^;
e = W
S
^ = (X 0 S
1
1
In :
1
=
1h 0
eS
2
exp
X)
1
X 0S
1
1
^)0 X 0 S
e+(
1
i
^) ;
X(
W;
Combining this with a normal N ( 0 ; D0 ) prior for , the posterior is again normal with mean and
variance given by
E ( jW; ) =
D0 1 + X 0 S
1
X
V ( jW; ) =
D0 1 + X 0 S
1
X
1
1
h
D0 1
0
+ X 0S
1
:
i
X^ ;
(2.7)
(2.8)
To sample (Ii ; mi ) we need to distinguish two cases: si = 0 and si = 1. Suppose …rst that si = 1
so that mi is observed and Ii > 0. From (2.2) it follows that Ii , conditional on mi and Ii > 0, has
p
2 , truncated
a normal distribution with mean x0i1 1 + 1 2 1 (mi x0i2 2 ) and variance 1 1
from below at zero:
p (Ii jsi = 1; mi ;
1;
0
2 ; ) = N xi1
If si = 0 then it is known that Ii
can be generated from the N
1+
1 2
1
mi
x0i2
2
;
1
p
1
2
I fIi > 0g :
(2.9)
0 but the actual values (Ii ; mi ) are not observed. A value of Ii
x0i1 1 ;
2
1
distribution truncated from above at zero.9 The value of
9
All draws from truncated normal distributions can easily be obtained through the inverse c.d.f. method, e.g.
Lancaster (2004, p.190-191).
7
mi is a realization of its conditional distribution given Ii :10
) = N x0i1
p (Ii jsi = 0;
1;
2;
p (mi jIi ;
2;
) = N x0i2
1;
2
1;
+
2
1
1
I fIi
1
(2.10)
x0i1
Ii
2
Finally it remains to …nd the conditional posterior of
0g ;
1
;
2
2
2
1
:
(2.11)
. By inspection of the SUR likelihood
(2.5) it can be seen that the inverse Wishart distribution is the natural conjugate prior. If
has
an inverse Wishart distribution with parameter matix H and degrees of freedom v, we will write
W
1 (H; v).
The density of
is given by
p ( jH; v) / j j
(v+3)=2
exp
and the prior mean (e.g. Muirhead, page 97) is (v
1
tr
2
3)
1 H.
1
;
H
v
2;
Multiplying this density with the SUR
likelihood we get
p ( j ; ; I; m) / j j
(n+v+3)=2
exp
1
tr
2
1
(B + H)
where B was de…ned in (2.6). Thus the conditional posterior of
Gibbs sampler can now be summarized as follows:
;
is W
Algorithm 1 (Unidenti…ed Parameters) For given starting values of (
v
1 (B
1;
2;
(2.12)
+ H; n + v). The
2;
; Ig and
fmi : yi = 0g:
1. Sample (
1;
2)
from a multivariate normal with mean (2.7) and variance (2.8);
2. if si = 1, sample Ii from (2.9); if si = 0, sample Ii from (2.10) and mi from (2.11);
3. sample
from (2.12);
4. return to step 1 and repeat.
In order to execute step 3 of the algorithm, note that by de…nition of the inverse Wishart
distribution the precision matrix P (=
parameter matrix (B + H)
1
1)
has a conditional posterior Wishart distribution with
and (n + v) degrees of freedom. A draw of
by …rst generating (n + v) vectors zj from the N2 (0; (B + H)
P
0
1
( n+v
j=1 zj zj ) .
1)
can then be obtained
distribution and computing
=
Algorithm 1 yields a realization of a Markov chain that is informative about the posterior
distribution of (
1= 1;
2; 2;
). A disadvantage of using unidenti…ed parameters is that it may be
di¢ cult to choose appropriate priors for
1= 1
and .11 Our algorithm cannot be trivially modi…ed
10
When si = 0 we could also generate mi …rst and then Ii from its conditional distribution given mi , righttruncated at zero.
11
In other words, the induced prior of ( 1 = 1 ; ) needs to be checked to ensure it is appropriately re‡ecting the
researcher’s beliefs.
8
to satisfy the restriction
1
= 1.12 For that reason we also adopt the approach of McCulloch, Polson,
and Rossi (2000). In essence this entails a reparameterization of the model and placing priors on
the identi…ed parameters directly; see also Koop and Poirier (1997) and Li (1998) for applications
of this idea to a regime switching model and a Tobit model with endogeneity, respectively.
Given the bivariate normality of (ui1 ; ui2 ) and imposing the restriction
2
2
2
var(ui2 jui1 ) =
The covariance matrix can now be written as
"
1
=
12
#
2
12
+
= 1, it follows that
2
12 :
12
2
1
;
and the likelihood of the SSM becomes
pSSM (ln yj
1;
2;
Y
) =
i:yi =0
0
Y
@
i:yi >0
In order to generate draws (
p(
12 ;
jI; m;
1;
2 ).
12 ;
x0i1
1
Y
1
2
@ lnqyi
1=2
2
12
+
0
i:yi >0
x0i1
2
1
2
12
+
q
+
2
x0i2
12 (ln yi
+
2
12
1
2) A
x0i2
2
+
2
12
1
2A
:
(2.13)
) in the Gibbs sampler, we need the conditional posterior
Given (I; m;
1;
2 ),
however, the errors u1 and u2 are known and (
12 ;
2
)
12 ;
2
)
are the parameters of a normal linear regression model:
ui2 =
12 ui1
+
i;
N 0;
i
2
:
Thus, the conditional posterior of interest satis…es
p
12 ;
2
jI; m;
1;
2
= p
2
12 ;
/ p(u1 ; u2 j
ju1 ; u2 ;
12 ;
/ p u1 ; u 2 j
where we take (
12 ;
2
) a priori independent of (
is of the normal-inverse gamma form. We say
parameters (c0 ; d0 ), written as
2
1;
2
( 2 jc0 ; d0 ) =
dc00
( 2)
(c0 )
12
;
1;
2
2)
(
12 ;
2
2
j
1;
2)
12 ;
2
2 ).
The natural conjugate prior for (
12 ;
;
has a prior inverse gamma distribution with
IG(c0 ; d0 ), if
Then
2
1;
2
has a prior gamma distribution G(c0 ; d0 ).
(c0 +1)
e
d0 =
2
:
Although
has an inverse Wishart distribution,
conditional on 1 = 1 does not. Nobile (2000) proposes a
methods to sample the remaining elements of , conditional on the value of a diagonal element. The algorithm in
our paper is based on McCulloch, Polson, and Rossi (2000).
9
The conditional prior of
12
is given by
(
12 j
The reason for prior dependence between
2
; ; g) = N (g;
12
2
and
2
):
is that the induced prior for the correlation
coe¢ cient can be made roughly uniform by an appropriate choice of . Here (c0 ; d0 ; g; ) is a set
of hyperparameters. It is easy to show that the posteriors take the following form:
~
IG(~
c; d);
n+1
c~ = c0 +
;
2
( 12 g)2 1
0
d~ = d0 +
+ (u2
12 u1 ) (u2
2
2
2
jI; m;
1;
2;
and
12 jI; m;
1;
2;
2
g= + u01 u2
;
1 + u0 u
1 1
N
(2.14)
12
12 u1 ) ;
2
1
+ u01 u1
:
(2.15)
The Gibbs sampler with identi…ed parameters, which is similar to Li’s (1998) algorithm, can now
be summarized as follows:
Algorithm 2 (Identi…ed Parameters) For given starting values of (
1;
2;
12 ;
2
; Ig and
fmi : yi = 0g:
1. Sample (
1;
2)
from a multivariate normal with mean (2.7) and variance (2.8);
2. if si = 1, sample Ii from (2.9); if si = 0, sample Ii from (2.10) and mi from (2.11);
3. sample
2
from (2.14) and
12
from (2.15);
4. return to step 1 and repeat.
3
The Two-Part Model
The version of the 2PM we use is
Ii = x01i
ln (yi jsi ) =
1
+ "i1 ;
si = I fIi > 0g ;
(
x0i2 2 + "i2 if si = 1
1
(3.1)
:
if si = 0
For the parametric model we take
"i1
N 0;
2
1
;
"i2
10
N 0;
2
2
:
The selection equation is the same as in the SSM: if Ii > 0, then yi > 0 and the logarithm of yi is
well-de…ned. If Ii
0 then yi = 0. The main di¤erence between the SSM and the 2PM concerns
the errors "i2 and ui2 . In the sample selection model ui2 is an error that corresponds to potential
outcomes. Conditional on Ii > 0 the error then has a nonzero mean that depends on
and x0i1
1.
In contrast, "i2 only a¤ects the logarithm of positive values of expenditures and by construction
E ("i2 jIi > 0) = 0. The 2PM is silent about the joint distribution of ("i1 ; "i2 ) and assumes that
conditional on "i1 >
xi1
1,
the errors "i1 and "i2 are independent.13
The likelihood of the 2PM is
p2P M (ln yj
1;
2;
1;
n
Y
2) =
x0i1
si
1= 1
x0i1
1
1= 1
1 si
(3.2)
i=1
Y
ln yi
1
2
x0i2
2
:
2
i:yi >0
By comparing (2.4) and (3.2) it is clear that the former reduces to the latter when
= 0. This
suggests that in order to discriminate between the SSM and 2PM we can consider inference on the
correlation coe¢ cient.
Gibbs Sampling in the 2PM
Approximating the posterior of (
(or
12 )
(
1;
1;
and (
sample (
2;
2;
1
2)
2
in the two-part model is considerably easier because
is absent from the model. Moreover, the likelihood in (3.2) factors into a Probit term
involving (
2)
1
1;
1;
2)
1
2)
1
and a Gaussian term involving (
2;
2)
2
2 ).
2
2;
If we impose independence between
in the prior, this is carried over to the posterior. As a consequence we can
and (
2)
2
2;
each in their own ’mini Gibbs sampler’.
In the Probit part we impose the restriction
1
= 1, because it reduces the number of steps
needed in the algorithm. The Probit algorithm described here is due to Albert and Chib (1993). The
parameters are (I;
p(
1
jI).14
1)
and it remains to …nd the conditional posteriors p(Ij
Since I = X1
p(Ij
1)
+ u1 and u1
1
= (2 )
e = I
^
1
=
n=2
exp
1
and p(
N (0; In ), it follows that
X1 ^ 1 ;
X10 X1
1 ; s)
1h 0
ee+(
2
1
^ )0 X 0 X1 (
1
1
1
1 jI; s)
=
i
^ ) ;
1
X10 I:
Combining a normal N (b1 ; B1 ) prior distribution for
1
with the likelihood of I given above, we
get
1 jI
13
14
N b1 ; B1 ;
(3.3)
This does not imply that "i1 and "i2 are independent; see Duan et al. (1984) for an example.
The equality follows because s is a function of I.
11
B1 1 + X10 X1
B1 =
1
b1 =
Since Ii j
given
1
1
B1 +
X10 X1
1
;
1
B1 1 b1 + X10 X1 ^ 1 :
has a normal distribution with mean x0i1
1
and unit variance, the distribution of Ii
and si is truncated normal:
p(Ii j ; si = 0) = N x0i1 ; 1 I fIi
0g ;
(3.4)
p(Ii j ; si = 1) = N x0i1 ; 1 I fIi > 0g :
Inference on
2
2
;
+
uses only the subsample in which yi > 0. Let ln y + , X2+ and u+
2 = ln y
all refer to this subsample of size n+ . If the priors are
2
2
IG(c0 ; d0 ) and
2
X2+
N (b2 ; B2 ), then
standard results for the linear model with normal errors yield
2
+
2 j 2 ; ln y
n+
c~ = c0 +
2j
2
;
2
+
2 ; ln y
B2 = B2 1 +
b2 = B2 1 +
~
IG(~
c; d);
(3.5)
1
u+ ;
d~ = d0 + u+0
2 2 2
N b2 ; B2 ;
(3.6)
2 +0 +
2 X2 X2
2 +0 +
2 X2 X2
^ = X +0 X +
2
2
2
1
1
1
;
B2 1 b2 +
2
2
X2+0 X2+ ^ 2 ;
X2+0 ln y + :
The Gibbs sampler in the 2PM can now be summarized as follows:
Algorithm 3 (Two-Part Model) For given starting values of I;
1. Sample
1
from (3.3) and I from (3.4);
2. sample
2
2
from (3.5) and
1;
2;
2
2
:
from (3.6);
3. return to step 1 and repeat.
Bayes Factors
The Bayes factor provides a way to compare di¤erent models on the basis of their prior predictive
distribution.15 Suppose that two competing models, M1 and M2 , are entertained to describe the
outcome y. A model in this context consists of a prior distribution on the appropriate parameters
and a likelihood for the data. Given the prior probabilities
(M1 ) and
(M2 ) of the two models,
the posterior odds ratio is computed as
p (M1 jy)
p (M2 jy)
15
=
p (yjM1 )
p (yjM2 )
(M1 )
(M2 )
Bayes factors have been researched extensively. The article by Kass and Raftery (1995) is an excellent survey.
12
(M1 )
;
(M2 )
= B12
where B12 is the Bayes factor of model 1 versus model 2. In other words, the Bayes factor transforms
the prior odds ratio into the posterior odds ratio. The Bayes factor itself is the ratio of the prior
predictive distributions or marginal likelihoods. Bayes factors larger than 1 indicate support for
model M1 . Conversely, a value of B12 much smaller than 1 suggests that model M2 is more likely.
In what follows we consider two ways to compute the Bayes factor.
Let M1 denote the SSM with the restriction
=
12
= 0 imposed and M2 the unrestricted SSM.
In the parameterization of the model leading up to algorithm 2 (i.e. incorporating the identi…cation
constraint
2
1
= 1) the marginal likelihood mj
Z
mj =
where pj ( j ) and
j(
2
pj (ln yj ;
p(ln yjMj ) of model Mj (j = 1; 2) is given by
;
12 ) j (
;
2
;
12 )d
d 2d
12 ;
) are the likelihood and prior of model j, respectively. Note that p1 simply
follows from p2 by imposing the restriction
= 0. As before, the prior
12
2
for the unrestricted
model is:
0
0
1;
=
2
In M1 only the parameters (
0
2
N ( 0 ; D0 ) ;
IG (c0 ; d0 ) ;
1;
2
2;
1(
) appear, because
2
;
12 j
N g;
12
2
:
= 0. It is therefore natural to take
) = N ( 0 ; D0 ) IG (c0 ; d0 )
=
2(
;
2
):
Using the arguments in Verdinelli and Wasserman (1995), we can now …nd an expression for the
Bayes factor:
B12 =
m1
m2
= p(
12
= 0jy)
Z
p(ln yj ; 2 ; 12 = 0) 1 ( ;
m2 p( 12 = 0jy)
2
)
d d 2:
Plugging in the relation
1
p( 12 = 0jy)
p( ;
p( ;
2
= p( ;
2
=
2
jy; 12 = 0)
; 12 = 0jy)
jy;
12
= 0)
p(ln yj ;
13
2
;
12
m2
= 0)
2(
;
2
;
12
= 0)
;
we get
Z
B12 = p(
12
= 0jy)
= p(
12
= 0jy)E
p(
12
=
2(
= 0jy)
E
12 = 0)
1(
)=
2(
;
2
j
p( ;
2
jy;
= 0)d d
12
2
2
)
2 ( ; ; 12 = 0)
2
1( ; )
;
2
2 ( ; j 12 = 0)
where the expectation is with respect to p( ;
2
2
)
2 ( ; ; 12 = 0)
1(
2
;
2
;
jy;
12
(3.7)
= 0). Note that in the special case where
1(
;
12
= 0. This simple form is usually referred to as the Savage-Dickey density ratio. Our choice
12
= 0), the Bayes factor reduces to the ratio p(
12 jy)= 2 ( 12 ),
evaluated at
of restricted prior is slightly di¤erent, so that we need Verdinelli and Wasserman’s (1995) more
general expression of the Bayes factor.
The expression in (3.7) can be simpli…ed even further:
B12 = p(
= p(
12
12
= 0jy)E
= 0jy)E
; 2)
2
2 ( 12 = 0) 2 ( ; j
1
;
2
2 ( 12 = 0j )
1(
12
= 0)
(3.8)
where the last line follows from Bayes’ rule and prior independence of
12
and . In order to
estimate (3.8) from the Gibbs sampler output, note that
p(
12 jy)
= E p(
12 jy;
2
;
) ;
where the expectation is taken with respect to p( ; 2 jy). Given a sample
n
oT
2
;
;
I
;
m
generated by algorithm 2, the value of p( 12 = 0jy) can be estimated
(t) (t) (t)
(t)
through
t=1
p^(
12
T
1X
= 0jy) =
p(
T
12
= 0jy;
2
(t) ; (t) ; I(t) ; m(t) );
t=1
which requires T evaluations of the density in (2.15) at zero.
The expected value in (3.8) is with respect to p( ;
f
(s) ;
2
S
(s) gs=1
2
jy;
12
= 0).
To simulate a sample
from this distribution, we run algorithm 2 again, but now with
12
…xed at zero.
The expected value is then estimated through
^
E
S
1X
1
=
2
S
2 ( 12 = 0j )
s=1
"
2 ( 12
1
= 0j
We do not consider the Bayes factor for algorithm 1, because when
2
(s) )
#
:
has an inverse Wishart prior,
the elements of (3.8) are hard to estimate. In particular, the marginal prior and posterior of
12
are very complicated. Algorithm 2 has the advantage that all densities that appear in (3.8) are
14
standard, so that B12 can easily be estimated.
A second method to compute the Bayes factor is the one proposed by Chib (1995). Output
from the Gibbs sampling algorithms 2 and 3 can be used to estimate m1 and m2 separately and
compute their ratio.16 We apply Chib’s (1995) method …rst to the sample selection model. The
marginal likelihood m2 can be written as
p ln yj ;
m2 =
2
12 ;
p
;
2
2
12 ;
;
12 ;
2
:
jy
Note that this equation holds for all parameter values in the support of p
speci…c value
;
2
12 ;
;
12 ;
2
jy . Now pick a
, say the sample mean from the Gibbs output. Taking logarithms we get
log m2 = log p ln yj ;
;
log p
Using (2.13) and the prior on
;
12 ;
2
2
12 ;
2
12 ;
+ log
;
2
2
12 ;
jy :
the …rst two terms on the right-hand side can easily be
calculated. It remains to estimate the value of the posterior. To this end, write
log p
;
12 ;
2
jy
2
= log p
jy + log p
+ log p ( j
Since
p
and a sample
(t) ;
2
jy =
Z
2
p
j ;
T
12(t) ; I(t) ; m(t) t=1
p^
2
jy =
12 ; I; m; y
12 ;
p( ;
2
12 j
;y
(3.9)
; y) :
12 ; I; mjy) d
d
12 dIdm
is available from the posterior, we estimate this term by
T
1X
p
T
2
t=1
j
(t) ;
12(t) ; I(t) ; m(t) ; y
;
where each term in the sum requires evaluating the inverse-gamma density in (2.14).17 As for the
second term in (3.9), we have
p
12 j
2
;y =
Z
p
12 j
;
2
; I; m; y p
; I; mj
2
; y d dIdm:
In order to estimate this term, we need a sample from the posterior distribution of ( ; I; m),
given y and
2
. The current sample does not satisfy this condition. Therefore the algorithm
needs to implemented again, this time with
R
(r) ; I(r) ; m(r) r=1
2
…xed at the value
that (approximately) comes from p
16
; I; mj
2
2
. This yields a sequence
; y . The estimate is constructed
Estimating the marginal likelihood from algorithm 1 yields unstable results. Apparently the lack of full identi…cation in that sampler causes problems.
17
Note that the parameter d~ varies with t.
15
as
R
p^
2
12 j
1 X
;y =
p
R
2
12 j (r) ;
r=1
; I(r) ; m(r) ; y ;
where each term involves evaluating (2.15).18 Using similar logic …x
Q
I(q) ; m(q) q=1
run algorithm 2 again, to obtain a sample
12
from p I; mj
=
12 ;
12
2
and
2
=
2
and
; y . The third term
in (3.9) is then estimated as
Q
p^
j
2
12 ;
1 X
;y =
p
Q
j
q=1
2
12 ;
; I(q) ; m(q) ; y ;
where each term in the sum requires evaluating the multivariate normal density with mean (2.7)
and variance (2.8) at the point
. Only the mean varies with q.
Computations in the 2PM are largely similar, so we will be brief. Again the logarithm of the
marginal likelihood m1 is split up into the logarithms of the likelihood, the prior and the posterior,
evaluated at
1;
2;
2
2
. Using (3.2) and the priors for algorithm 3 the …rst two terms are easy to
compute. The value of the posterior is estimated using the Gibbs output. Unlike the SSM there is
no need for additional runs of the sampler. The currently available sample f
1(t) ; I(t) ;
is used to construct the estimates
p^(
p^
1 jy) =
2
2
jy
=
T
1X
p
T
t=1
1 jI(t) ; y
T
1X
p
T
2
2
t=1
where we evaluate (3.3) and (3.5) T times. Note that p(
j
+
2 ; y)
;
= p(
2j 2
be estimated, but only requires a single function evaluation of (3.6).
4
2 gT
(t) t=1
;
2(t) ; y
2j 2
2(t) ;
2; y+)
does not have to
Simulation Example
We now compare algorithms 1, 2 and 3 using some simulated data. A sample of size n = 500 is
generated from the sample selection model in 2.1:
ui1
ui2
Ii =
01
+ xi1
11
+ ui1 ;
mi =
02
+ xi2
12
+ ui2 ;
yi = I fIi > 0g emi ;
!
N (0; ) ;
01
18
=
02 ;
11
=
=
12
"
(4.1)
1
0:5
0:5
1
#
;
= 1:
The mean and variance of the normal distribution in this case both depend on r.
16
Note that
that
10
=
01
controls the probability p0 of observing a zero outcome. We let xi1 ; xi2
0:58 and
10
=
U (0; 3), so
1:50 correspond to p0 = 0:25 and p0 = 0:50, respectively (cf. Leung
and Yu 1996). With the exclusion restriction xi1 6= xi2 the parameter
12
measures the marginal
e¤ect of xi2 on the log-outcome (conditional on a positive outcome) in both the SSM and 2PM. Of
course, the correlation
is absent in the 2PM. We start each algorithm at three di¤erent points.
The number of iterations is 6; 000 with a burn-in period of 3; 000. The …gures in this section are
therefore estimated posterior distributions based on 9; 000 random draws.
β11
4.5
β12
8
4
7
3.5
ρ
3.5
3
6
2.5
3
5
2
2.5
4
2
1.5
3
1.5
0.5
1
0.5
0
1
2
1
0.8
1
1.2
00.8
1.4
0.9
1
1.1
1.2
0
-0.2
0
0.2
0.4
0.6
0.4
0.6
Figure 1: algorithm 1; n = 500; p0 = 0:25
β11
β12
8
4
ρ
3.5
7
3
3.5
6
2.5
3
5
2
2.5
4
2
1.5
3
1.5
0.5
1
0.5
0
1
2
1
0.8
1
1.2
1.4
1.6
00.8
0.9
1
1.1
1.2
0
-0.2
0
Figure 2: algorithm 2; n = 500; p0 = 0:25
17
0.2
β11
4
β12
8
3.5
7
3
6
2.5
5
2
4
1.5
3
1
2
0.5
1
00.6
0.8
1
1.2
00.8
1.4
0.9
1
1.1
1.2
1.3
Figure 3: algorithm 3; n = 500; p0 = 0:25
As appears from …gures 1-3, all three Gibbs samplers produce posterior approximations centered
around the true parameter value. It is interesting to note that Gibbs sampling in the two-part
model does locate the coe¢ cient of the selection equation,
the log-outcome,
22 ,
11 ,
and the marginal e¤ect of xi2 on
despite ignoring the cross-equation correlation.
β11
5
β12
7
ρ
4.5
4.5
4
6
4
3.5
5
3.5
3
3
4
2.5
2.5
2
3
2
1.5
1.5
2
1
1
1
0.5
0.5
0
0.8
1
1.2
1.4
0
0.8
1
1.2
00
0.2
Figure 4: algorithm 1; n = 500; p0 = 0:50
18
0.4
0.6
0.8
β11
5
β12
7
ρ
4.5
4.5
4
6
4
3.5
5
3.5
3
3
4
2.5
2.5
2
3
2
1.5
1.5
2
1
1
1
0.5
0.5
0
0.8
1
1.2
0
1.4
0.8
1
00.2
1.2
0.4
0.6
0.8
Figure 5: algorithm 2; n = 500; p0 = 0:50
β11
4.5
β12
6
4
3.5
5
3
4
2.5
3
2
1.5
2
1
1
0.5
00.7
0.8
0.9
1
1.1
1.2
00.7
1.3
0.8
0.9
1
1.1
1.2
1.3
Figure 6: algorithm 3; n = 500; p0 = 0:50
The results for more severe selection, p0 = 0:50, are given in …gures 4-6. Even with a larger
fraction of zero outcomes all posteriors for
12
and
22
are centered around the true value, though
the posterior variance has increased. Moreover, the posterior of the correlation puts most probability on positive values. In results not reported here, we have found that all algorithms converge
rapidly (as judged by the Gelman-Rubin R-statistic; e.g. Gelman, Carlin, Stern, and Rubin 1995,
chapter 11). In addition, the Bayes factor for testing the restriction
= 0 decisively rejects the
2PM in favor of the SSM.
5
Finite Mixtures and Sample Selection
There are several ways the assumption of bivariate normality in the SSM can be relaxed. Here we
…rst consider using a mixture of normal distributions with a …xed number of mixture components.
19
The sample selection model is given in (2.1) but now
!
ui1
ui2
where cu > 0 is a scaling factor, the
j ’s
j1
=
j
j2
k
X
j N ( j ; cu
j );
j=1
are nonnegative and
!
;
j
=
"
2
j;1
Pk
j
j=1
j;12
2
j;2
j;12
#
= 1, and
:
Mixtures of normals are very ‡exible distributions and even with small values of k (say, 2 or 3) can
display skewness, excess kurtosis and multimodality (e.g. Geweke 2005, chapter 6). Note that the
mixture distribution trivially reduces to (2.2) when k = 1,
j
= 0 and cu = 1.
To construct a Gibbs sampler for this mixture model, let
component selectors. That is, if
mixture component:
ui1
ui2
Since the values of
i
i
!
j
= ( 1; : : : ;
n)
be the vector of
= j then the errors of observation i are drawn from the j th
i
=j
N(
j ; cu
j );
j = 1; : : : ; k:
are not observed, the parameters in the Gibbs sampler are now
0
=(
0
1;
=
0
2 );
k
j j=1 ;
fIi gni=1 ;
;
fmi : yi = 0g ;
k
j j=1 ;
=
cu ;
k
j gj=1 :
=f
We highlight some of the main features of the sampler and leave additional details to appendix
A. Recall that in the SUR formulation W = X + u. Let W(j) ; X(j) ; u(j) be the submatrices
corresponding to the j th mixture component. The SSM likelihood in (2.5) can be written as
p(W j ; ; ; ; cu ) /
(5.1)
k
Q
j=1
where Aj = cu 1
1
j
jcu
jj
nj =2
1
(W
2 (j)
exp
Inj and nj = jfi :
i
X(j)
j
0
nj ) Aj (W(j)
Using straightforward algebra, it can be shown that
j=1
nj )
is prior independent
of , the same will be true for the conditional posterior of .
p(W j ; ; ; ; cu ) /
j
= jgj. Note that once we condition on the component
selectors , the completed data likelihood does not depend on . Provided
k
Q
X(j)
jcu
exp
jj
(
nj =2
1
(
2
20
exp
(
k
1 P
e0 Aj e(j)
2 j=1 (j)
k
^) P X 0 Aj X(j) (
(j)
0
j=1
)
)
^) ;
;
where ^ is the GLS estimator and e(j) the residual:
2
3
k
X
0
^ = 4
Aj X(j) 5
X(j)
1
nj
j
0
Aj (W(j)
X(j)
nj );
j
j=1
j=1
e(j) = W(j)
k
X
X(j) ^:
Combining the likelihood with a N (d0 ; D0 ) prior, the conditional posterior is again normal:
p( jW; ; ; ) = N (d; D);
2
3
k
X
0
d = D 4D0 1 d0 +
Aj X(j) ^5 ;
X(j)
(5.2)
j=1
2
k
X
D = 4D0 1 +
Sampling fIi gni=1 , fmi : yi = 0g and
j=1
3
1
0
X(j)
Aj X(j) 5
:
is similar as before, so we will be brief. For simplicity we
do not list every conditioning argument:
x0i1
Ii jsi = 1;
i
=j
N
Ii jsi = 0;
i
=j
N (x0i1
mi jIi ;
i
=j
N
The component selector
1; : : : ; k. Write the
ith
i
x0i2
1
1
+
+
2
j1
j;12
2 (Ii
j;1
x0i1
+
j2
+
!
2
j)
i
IfIi > 0g;(5.3)
(5.4)
:
(5.5)
= jj g =
j,
j =
observation as the bivariate linear model wi = Xi + ui , where
/
"
Xi =
i
x0i1
0
0
x0i2
#
;
j
ui = (ui1 ; ui2 )0 :
follows from Bayes’rule:
jcu
jj
exp
tribution. If (
2
j)
2
j1 ); cu j;2 (1
1
= jjwi ; ; ; ; ; cu g / p(Ii ; mi j ; ; ; ; cu ;
The parameter
!
2
j2 ); cu j;1 (1
2
has a prior multinomial distribution with Prf
The conditional posterior distribution of
by
x0i2
0g;
wi = (Ii ; mi ) ;
i
+
2
j1 ; cu j;1 )IfIi
0
Prf
j;12
2 (mi
j;2
i
= jg
Prf
1=2
i
= jj g
(5.6)
1
(wi
2cu
Xi
0
j)
j
1
(wi
Xi
j)
:
is a set of multinomial probabilities, which suggests using a Dirichlet prior dis1; : : : ;
k)
= D(
0; : : : ;
0 ),
it is shown in the appendix that the posterior is given
j
D(
0
+ n1 ; : : : ;
21
0
+ nk ):
(5.7)
The use of a uniform Dirichlet prior, i.e.
is the parameter for each state j = 1; : : : ; k, is
0
appropriate here, because the state labels are not identi…ed. This lack of knowledge is re‡ected by
a prior in which the states are exchangeable.
In order to …nd the conditional posterior of (
(
j)
=W
1
(H; v);
(
j;
j)
j jcu ;
The advantage of this choice is that the posterior of (
we take its prior to be
j)
j;
normal-inverse Wishart form. This allows us to sample (
convergence behavior of the Markov
p(
j jW;
chain.19
uj
j)
j;
conditional on ( ; ; cu ) is again of the
j)
jointly, which should improve the
It follows that then
~ j ; v~j );
(H
1 nj uj u0j
1 X
= H+
+
(ui
cu 1 + nj
cu
=
j) :
1
; ; cu ) = W
~j
H
= N (0; cu
(5.8)
uj )0 ;
uj )(ui
i: i =j
1 X
ui ;
nj
i: i =j
v~j
where nj =
j. Also:
Pn
i=1 If i
= v + nj ;
= jg is the size of the subsample that is distributed according to component
p(
j j j ; W;
; ; cu ) = N (bj ; cu Mj ) ;
nj
bj =
uj ;
1 + nj
Mj
Allowing for prior dependence between
j
and
j
=
1 + nj
(5.9)
j:
is often reasonable: if cu
j
is large the errors are
highly variable, making it harder to pin down their location. The parameter
can be thought of
as a tuning parameter that can be used to control the potential multimodality in the distribution
of ui , independently of the variance.
The conditionally conjugate prior distribution for cu is the inverse gamma distribution. In
particular, we take
(cu ) = IG(v=2; d). Since cu enters the prior of
j
and the likelihood, it is
shown in the appendix that
~
p(cu jW; ; ; ; ; ) = IG(~
v ; d);
v
v~ = + n + k;
2
k
n
X
1
1X
1
1
0
0
d~ = d +
+
(ui
j
j(i) ) j(i) (ui
j j
2
2
j=1
19
p(
(5.10)
j(i) );
i=1
Alternatively, if j is a priori independent of j , it would be necessary to sample from p( j j j ; W; ; ; cu ) and
; ; cu ) consecutively. This adds a ’block’to the Gibbs sampler, which slows down convergence.
j j j ; W;
22
where j(i) is the value of j 2 f1; : : : ; kg such that
i
= j. The Gibbs sampler for the mixture of
normals sample selection model can now be summarized as follows:
Algorithm 4 (Mixture SSM) For given starting values of ( ; ; ; ; ; cu ; I) and fmi : yi = 0g:
1. Sample
0
1;
=(
0 0
2)
from (5.2);
2. if si = 1 sample Ii from (5.3); if si = 0 sample Ii from (5.4) and mi from (5.5);
3. sample
from (5.6) for i = 1; : : : ; n;
i
4. sample
=(
5. sample
j
1; : : : ;
k)
from (5.7);
from (5.8 and
from (5.9) for j = 1; : : : ; k;
j
6. sample cu from (5.10);
7. return to step 1 and repeat.
A similar algorithm can be constructed for the two-part model. Some simplications will occur
because of the assumed conditional independence between the selection and outcome equations.
In (3.1) the distributions of both "i1 and "i2 could be modeled as a mixture of normals. One
important point to note is that the use of improper priors in a mixture model leads to improper
posterior distributions (e.g. Roeder and Wasserman 1997). Second, the state labels j are not
identi…ed without further prior information. If the states themselves are the primary focus, for
example a state might represent a certain regime or subpopulation, then the algorithm above is
not appropriate.20 In our case, however, we merely use mixtures as a modeling device, and labeling
issues are not a concern.
Simulation Example
To illustrate the Gibbs sampler of algorithm 4 we generate a sample of size n = 500 from the SSM
in 4.1 with
01
=
02
=
1:50 and
11
=
12
= 1. This corresponds to a fraction of zero outcomes
of about 50%. The errors are generated according to
ui1
ui2
where
1
= ( 1; 1),
2
!
= (0:25; 0:25),
1N ( 1;
1
) + (1
= 0:2 and
1 )N ( 2 ;
);
has variances 1 and covariance 0:5. The
regressors xi1 and xi2 are taken from the U (0; 3) distribution. Since in our experience the sampler
for the mixture model displays slower convergence rates, we run the Markov chain from three
di¤erent starting points for 15; 000 iterations and discard the …rst 10; 000. The posterior density
estimates are thus based on 15; 000 sampled values.
20
For a discussion, see Geweke (2005, chapter 6) and the references cited therein.
23
β11
0.8
β12
7
0.7
ρ
2.5
6
2
0.6
5
0.5
1.5
4
0.4
3
1
0.3
2
0.2
0.5
1
0.1
00
1
2
3
4
5
00.8
1
1.2
1.4
0-0.5
0
0.5
1
Figure 7: algorithm 4; n = 500; p0 = 0:50
γ1
1.4
1.3
1.2
1.1
1
0.9
0.8
0.7
0.6
0.5
0.4 0
0.1
0.2
0.3
0.4
0.5
0.6
Figure 8: mixture weights (
The estimated posterior of
12
0.7
1
0.8
0.9
1
= 0:2)
centered around its true value with a 95% highest posterior density
(HPD) interval of [0:91; 1:17]. Although the correlation is harder to pin down, the posterior puts
substantial probability on positive values, with a 95% HPD of [ 0:07; 0:61]. The coe¢ cient of xi1 in
the selection equation is not well-identi…ed with an 95% HPD of [0:55; 3:16]. Finally, the posterior
of
1
is reasonably accurate. The bimodality occurs because of so-called ’label-switching’ in the
sampler; since the state labels themselves are not identi…ed, it is possible that at some point in the
Markov chain the value of
probability region around 1
1;
2;
1;
2.
1
jumps from a low probability region, around its true value, to a high
1.
The same then happens with the component speci…c parameters
To ensure that this switching happens in a balanced manner and that the mixture
posterior is fully explored, we have incorporated a random permutation step in algorithm 4.21
21
See Frühwirth-Schnatter (2006, page 82)
24
6
Dirichlet Mixtures and Sample Selection
The approach discussed above requires one to choose the number of mixture components beforehand. If the econometrician is uncomfortable doing this, he could use various choices of k and
compare di¤erent models on the basis of their posterior probabilities. Here we explore the use of
Dirichlet process priors in modeling sample selection. Building on a large literature that started
with Ferguson ((1973, 1974)) and Antoniak (1974) we show that some of the existing methodology
can be readily adapted to the models in this paper. The main appeal of using a Dirichlet process
prior lies in the fact that the errors are modeled as a mixture of normals with a random number
of mixture components. Through Bayesian updating we can directly make inference about this
number. Also, the prior allows us to center in some sense the semiparametric model around the
parametric one. The work in this section is based on results from Escobar (1994), Escobar and
West (1995)22 and is closely related to Conley, Hansen, McCulloch, and Rossi (2007), who consider
the use of Dirchlet process priors in an instrumental variables model.
The basic setup can be described as follows. In (2.1) let
ui1
ui2
!
j
i
i jG
Gj ; G0
Here
i
N ( i ; cu
i );
23 .
= ( i;
i );
G;
(6.1)
DP( ; G0 ):
is simply the set of parameters for the normal distribution (apart from the common scale
factor cu ). Our discussion of the SSM in section 2 involved
for
i
i
=
= ( ; ) and specifying a prior
The semiparametric model in (6.1) allows each pair (ui1 ; ui2 ) to have a distinct normal
distribution, conditional on
G is chosen to be a N (
0 ; S0 )
i.
The parameters f i gni=1 are i.i.d. draws from a distribution G. If
distribution, possibly augmented with a hyperprior on (
0 ; S0 ),
then
we have speci…ed a hierarchical normal model, which still imposes a lot of structure on, say, the
marginal distribution of the errors. In particular, it would not allow multimodality or skewness.
Instead, in (6.1) the distribution G itself is treated as unknown and given a Dirichlet process (DP)
prior24 . Thus, G can be viewed as a random probability measure. G0 is a distribution that in some
sense is a prior guess about G. Speci…cally, the marginal prior distribution of
i
is exactly G0 (e.g.
Ferguson 1973, Antoniak 1974). The parameter
re‡ects the prior belief that G0 is the actual
distribution of
! 1.
i.
This belief becomes stronger as
We construct a Gibbs sampler based on …xed values of ( ; G0 ). Additional details on the prior
of
and the form of G0 are discussed in appendix B. Throughout this section the conditioning on
22
Escobar and West (1998), MacEachern (1998) and Müller and Quintana (2004) are excellent reviews of semiparametric modeling with Dirichlet processes.
23
To be precise, the prior of was degenerate at zero and ( ) W 1 (H; v).
24
Suppose is the sample space and fAj gkj=1 is any measurable partition. If G DP( ; G0 ), then the collection
of random probabilities fG(Aj )gkj=1 follows a Dirichlet distribution.
25
( ; G0 ) is implicit. The parameters are now
fIi gni=1 ;
;
fmi : yi = 0g;
n
i gi=1 ;
f
G;
cu
A simpli…cation occurs because G can be integrated out of the posterior (e.g. Escobar 1994), so
that the Gibbs sampler only involves updating the remaining parameters.
= f i gni=1 , is
The likelihood of W , conditional on
p(W j ; ; cu ) /
n
Y
i=1
jcu
ij
1=2
1
(wi
2cu
exp
Combining this with a N (d0 ; D0 ) prior for
jW; ; cu
i
(wi
Xi
i)
:
and collecting terms, it can be shown that
N (d; D);
"
D0 1 + cu 1
D =
1
0
i)
Xi
n
X
1
Xi0
i
Xi
i=1
"
d = D D0 1 d0 + cu 1
n
X
Xi0
#
i
i=1
(6.2)
1
;
#
1
Xi ^ ;
and ^ is the weighted least squares estimator:
^=
"
n
X
1
Xi0
i
Xi
i=1
#
1 n
X
1
Xi0
i
i ):
(wi
i=1
Sampling Ii when si = 1 and (Ii ; mi ) when si = 0 is done by generating draws from the distributions
in (5.3), (5.4) and (5.5), where we now condition on
Let
i
and G
= f 1; : : : ;
i 1 ; i+1 ; : : : ; n g.
ij
That is,
i
i
equals one of the other
=
j
i
given
w. prob.
G0 w. prob.
j ’s
= ( i;
i ),
instead of
i.
Blackwell and MacQueen (1973) show that if
DP( ; G0 ), then the distribution of
(
i
i
i jG
G
with G integrated out is given by
1
+n 1 ;
j 6= i
:
(6.3)
+n 1
with nonzero probability, or is a new value distributed
according to G0 . This property is often referred to as the Pólya urn representation of a sample
from the Dirichlet process. Using Bayes’rule the posterior distribution takes the following form:
ij
Here c
1
i ; W;
(
=
j
w. prob. c
p( i jwi ; Xi ; ; cu ) w. prob. c
1 p(w jX ; ; ; c );
i
i
j u
1 p(w jX ; ; c )
i
i
u
j 6= i
:
(6.4)
is a normalizing constant, p( i jwi ; ; cu ) is the posterior given prior dG0 ( i ) and p(wi jXi ; ; cu )
26
is the marginal likelihood after integrating out
i
with respect to dG0 ( i ):
p( i jwi ; Xi ; ; cu ) / p(wi jXi ; ; i ; cu )dG0 ( i );
Z
p(wi jXi ; ; cu ) =
p(wi jXi ; ; i ; cu )dG0 ( i ):
More details are given in appendix B.
Finally, updating cu conditional on the remaining parameters is exactly the same as in the
…nite mixture case discussed earlier. The Gibbs sampler for the Dirichlet process SSM can now be
summarized as follows:
Algorithm 5 (Dirichlet SSM) For given starting values of ( ; ; I; cu g and fmi : yi = 0g:
1. Sample
from (6.2);
2. if si = 1 sample Ii from (5.3); if si = 0 sample Ii from (5.4) and mi from (5.5); all draws
here are conditional on
3. sample
i
i
= ( i;
i );
from (6.4) for i = 1; : : : ; n;
4. sample cu from (5.10);
5. return to step 1 and repeat.
Algorithm 5 can be extended in several ways. First of all, it is possible to place a prior distribution on
. Recall that
represents the prior belief that G0 is the distribution of
i.
If
is large, then we will see many unique values in , which yields a model with a large number
of mixture components. Alternatively, if
is small, then
will likely see few unique values. In
fact, Antoniak (1974) shows that kn , the number of unique
values in a sample of size n, satis…es
E(kn j )
log(( + n)= ). The limit case
= 0 results in
parametric model in section 2. By placing a prior on
i
=
for all i, which leads to the
it is possible to learn about the number
of mixture components, after seeing the data. Escobar and West (1995) use a convenient gamma
prior which yields a posterior mixture of two gamma distributions. See appendix B for details.
Alternatively, it is possible to specify a prior for kn explicitly (rather than implicitly through ) as
in Escobar (1994).
The Markov chain constructed in algorithm 5 may converge slowly if the Gibbs sampler ’gets
stuck’ at a few …xed values of
c
1 p(w jX ;
i
i
;
j ; cu )
i.
From (6.4) this could happen when the sum (over j 6= i) of
gets large relative to c
1
p(wi jXi ; ; cu ). It is possible to slightly reparame-
terize the model and associated Gibbs sampler, such that at each iteration the distinct values in
are ’remixed’; see West, Müller, and Escobar (1994), MacEachern (1998) and the appendix.
Simulation Example
Consider the same simulation set-up as in section 5, where we now apply algorithm 5. In principle
a density estimate of (ui1 ; ui2 ), computed over a grid of values, can be constructed by adding an
27
n
i gi=1
extra step in algorithm 5: after updating f i ;
sample a new value (
n+1 ;
n+1 )
from (6.4)
and evaluate the density at this new value. Taking an average over all MCMC iterations then yields
a density estimate, from which we may learn about the dependence between ui1 and ui2 .25 Here
we only present the marginal posterior density estimates for the various coe¢ cients.
β01
0.4
β11
0.8
0.3
0.6
0.2
0.4
0.1
0.2
0
-6
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
-4
-4
-2
0
00
2
1
2
β02
-3
-2
3
4
5
1.1
1.2
1.3
β12
-1
0
1
6
5
4
3
2
1
00.8
2
0.9
1
Figure 9: algorithm 5; n = 500; p0 = 0:50; mixture of 2 normals
#mixturecomponents
7000
6000
5000
4000
3000
2000
1000
00
2
4
6
8
10
12
14
16
18
20
Figure 10: histogram of number of mixture components (mixture of 2 normals)
As before the posterior of
12
in …gure 9 is closely centered around its true value. The likelihood
is not very informative about the remaining coe¢ cients. The posterior of the number of mixture
components puts the largest probability on a single component. In this case the mixture components
are apparently not ’separated’enough.
As a second example we generate n = 500 observations from (4.1) where
ui1
ui2
25
!
=
"
1
0:5
0:5
1
#
See Conley, Hansen, McCulloch, and Rossi (2007) for details.
28
vi1
vi2
!
;
where vi1 and vi2 have independent log-chi2 distributions with 1 degree of freedom. The coe¢ cients
are
01
=
02
=
0:50 and
11
=
12
= 1, which again leads to about 50% zero outcomes.
β01
0.5
β11
2
0.4
1.5
0.3
1
0.2
0.5
0.1
0
-4
-2
0
00
2
0.5
1
β02
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1.5
2
2.5
β12
4
3
2
1
-3
-2
-1
0
1
00.4
2
0.6
0.8
1
1.2
Figure 11: algorithm 5; n = 500; p0 = 0:50; log-chi2 errors
#mixturecomponents
3500
3000
2500
2000
1500
1000
500
00
2
4
6
8
10
12
14
16
18
20
Figure 12: histogram of number of mixture components (log-chi2 errors)
Interestingly, the posteriors for the coe¢ cients of the selection equation are now more accurate:
compare …gures 9 and 11. The posterior of the marginal e¤ect
12
of xi2 on the log-outcome is
underestimated. On the other hand the sampler does use a larger number of mixture components
to approximate the bivariate log-chi2 distribution, with the largest posterior probability on 2 and
3 components and zero probability on a single component. In this example there appears to be
a trade-o¤ between modeling the error distribution and obtaining an accurate posterior of the
coe¢ cients of interest.
7
Conclusion
In this paper we have considered models that can be used to describe nonnegative outcome variables.
The sample selection model essentially treats the outcome as a censored quantity, and speci…es a
29
structure for the latent process. The two-part model focuses on the observed outcomes directly.
Given the strong parametric assumption of normal errors, Bayesian inference in both models can
proceed straightforwardly, using a combination of data augmentation and Gibbs sampling. The
MCMC schemes can be formulated with or without an identi…cation restriction. We have found
there to be hardly any di¤erence in the posterior approximation.26
A possible Bayesian semiparametric treatment introduces more ‡exibility in the (joint) error
distribution. When mixtures of normal distributions are used, together with natural conjugate
priors, we can construct a Gibbs sampler by augmenting the parameter space with a set of latent
state variables. Within this MCMC algorithm the number of mixture components is …xed. In
principle di¤erent speci…cations can be compared on the basis of a Bayes factor.
An attractive alternative to comparing many di¤erent models is the use of Dirichlet process
mixtures. We have modeled the errors as having a bivariate normal distribution whose parameters
may di¤er across observations. In this case the Dirichlet process essentially amounts to using a
mixture of normals with an unknown number of mixture components. The data may then be used
to make inference about this number. Our paper has constructed an MCMC algorithm for use in
the sample selection model. The only requirement for tractability is then the choice of conjugate
priors.
Many questions remain that we shall address in future work. First of all, we aim to provide a
thorough comparison between the mixture of normals and Dirichlet process models. A potential
problem with the …nite mixture model is over…tting: when the chosen number of components is
too large, problems arise due to a lack of identi…cation. For a number of mixture components
the parameters are then sampled from the prior and no Bayesian updating can take place. The
Dirichlet mixture approach on the other hand appears to avoid over…tting. Finding and comparing
analytical bounds on the convergence rates of each MCMC algorithm should be useful in this
context. Second, imposing any type of identi…cation restriction in a general mixture model will
typically destroy the natural conjugacy between the prior and likelihood. Reparameterizing the
model may then be helpful and we expect that the more general Metropolis-Hastings algorithm
is needed to approximate the posteriors. Finally, Richardson and Green (1997) have proposed an
alternative method for using mixtures with an unknown number of components. It remains to
contrast their approach with the Dirichlet process model.
26
This is not always the case: McCulloch, Polson, and Rossi (2000) …nd large di¤erences in the autocorrelation
and convergence behavior of the chains, in case of the multinomial probit model.
30
References
Albert, J. H., and S. Chib (1993): “Bayesian Analysis of Binary and Polychotomous Response
Data,” Journal of the American Statistical Association, 88(422), 669–679.
Amemiya, T. (1985): Advanced Econometrics. Harvard University Press.
Antoniak, C. E. (1974): “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” The Annals of Statistics, 2(6), 1152–1174.
Blackwell, D., and J. B. MacQueen (1973): “Ferguson Distributions via Pólya Urn Schemes,”
The Annals of Statistics, 1(2), 353–355.
Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Applications.
Cambridge.
Casella, G., and E. I. George (1992): “Explaining the Gibbs Sampler,”The American Statistician, 46(3), 167–174.
Chib, S. (1995): “Marginal Likelihood from the Gibbs Output,”Journal of the American Statistical
Association, 90(432), 1313–1321.
Conley, T., C. Hansen, R. McCulloch, and P. E. Rossi (2007): “A Semiparametric Bayesian
Approach to the Instrumental Variable Problem,” Graduate School of Business, University of
Chicago Working Paper.
Cragg, J. G. (1971): “Some Statistical Models for Limited Dependent Variables with Application
to the Demand for Durable Goods,” Econometrica, 39(5), 829–844.
Dow, W. H., and E. C. Norton (2003): “Choosing Between and Interpreting the Heckit and
Two-Part Models for Corner Solutions,” Health Services and Outcomes Research Methodology,
4, 5–18.
Duan, N., W. G. Manning, C. N. Morris, and J. P. Newhouse (1983): “A Comparison
of Alternative Models for the Demand for Medical Care,” Journal of Business and Economic
Statistics, 1(2), 115–126.
(1984): “Choosing Between the Sample-Selection Model and the Multi-Part Model,”
Journal of Business and Economic Statistics, 2(3), 283–289.
Escobar, M. D. (1994): “Estimating Normal Means With a Dirichlet Process Prior,” Journal of
the American Statistical Association, 89(425), 268–277.
Escobar, M. D., and M. West (1995): “Bayesian Density Estimation and Inference Using
Mixtures,” Journal of the American Statistical Association, 90(430), 577–588.
31
(1998): “Computing Nonparametric Hierarchical Models,”in Practical Nonparametric and
Semiparametric Bayesian Statistics, ed. by D. Dey, P. Müller, and D. Sinha, pp. 1–22. Springer.
Ferguson, T. S. (1973): “A Bayesian Analysis of Some Nonparametric Problems,” The Annals
of Statistics, 1(2), 209–230.
(1974): “Prior Distributions on Spaces of Probability Measures,”The Annals of Statistics,
2(4), 615–629.
Frühwirth-Schnatter, S. (2006): Finite Mixture and Markov Switching Models. Springer.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995): Bayesian Data Analysis.
Chapman & Hall.
Geweke, J. (2005): Contemporary Bayesian Econometrics and Statistics. Wiley.
Gilks, W., S. Richardson, and D. Spiegelhalter (1996): Markov Chain Monte Carlo in
Practice. Chapman & Hall.
Gronau, R. (1974): “Wage Comparisons –A Selectivity Bias,”The Journal of Political Economy,
82(6), 1119–1143.
Hay, J. W., and R. J. Olsen (1984): “Let Them Eat Cake: A Note on Comparing Alternative
Models of the Demand for Medical Care,” Journal of Business and Economic Statistics, 2(3),
279–289.
Heckman, J. J. (1979): “Sample Selection as a Speci…cation Error,” Econometrica, 47(1), 153–
162.
Huang, H.-C. (2001): “Bayesian Analysis of the SUR Tobit Model,” Applied Economics Letters,
8, 617–622.
Kass, R. E., and A. E. Raftery (1995): “Bayes Factors,” Journal of the American Statistical
Association, 90(430), 773–795.
Koop, G., and D. J. Poirier (1997): “Learning About the Across-Regime Correlation in Switching Regression Models,” Journal of Econometrics, 78, 217–227.
Lancaster, T. (2004): An Introduction to Modern Bayesian Econometrics. Blackwell.
Lee, L. F. (2003): “Self-Selection,” in A Companion to Theoretical Econometrics, ed. by B. H.
Baltagi, chap. 18. Blackwell Publishing.
Leung, S. F., and S. Yu (1996): “On the Choice Between Sample Selection and Two-Part
Models,” Journal of Econometrics, 72, 197–229.
32
Li, K. (1998): “Bayesian Inference in a Simultaneous Equation Model with Limited Dependent
Variables,” Journal of Econometrics, 85, 387–400.
MacEachern, S. N. (1998): “Computational Methods for Mixture of Dirichlet Process Models,”
in Practical Nonparametric and Semiparametric Bayesian Statistics, ed. by D. Dey, P. Müller,
and D. Sinha, pp. 23–44. Springer.
Manning, W., N. Duan, and W. Rogers (1987): “Monte Carlo Evidence on the Choice Between
Sample Selection and Two-Part Models,” Journal of Econometrics, 35, 59–82.
Manning, W. G. (1998): “The Logged Dependent Variable, Heteroscedasticity, and the Retransformation Problem,” Journal of Health Economics, 17, 283–295.
McCulloch, R. E., N. G. Polson, and P. E. Rossi (2000): “A Bayesian Analysis of the
Multinomial Probit Model with Fully Identi…ed Parameters,”Journal of Econometrics, 99, 173–
193.
McCulloch, R. E., and P. E. Rossi (1994): “An Exact Likelihood Analysis of the Multinomial
Probit Model,” Journal of Econometrics, 64, 207–240.
Muirhead, R. J. (1982): Aspects of Multivariate Statistical Theory. Wiley.
Müller, P., and F. A. Quintana (2004): “Nonparametric Bayesian Data Analysis,” Statistical
Science, 19(1), 95–110.
Munkin, M. K., and P. K. Trivedi (2003): “Bayesian Analysis of a Self-Selection Model with
Multiple Outcomes Using Simulation-Based Estimation: An Application to the Demand for
Healthcare,” Journal of Econometrics, 114, 197–220.
Nobile, A. (2000): “Comment: Bayesian Multinomial Probit Models with a Normalization Constraint,” Journal of Econometrics, 99, 335–345.
Richardson, S., and P. J. Green (1997): “On Bayesian Analysis of Mixtures with an Unknown
Number of Components (with discussion),” Journal of the Royal Statistical Society B, 59(4),
731–792.
Roeder, K., and L. Wasserman (1997): “Practical Bayesian Density Estimation Using Mixtures
of Normals,” Journal of the American Statistical Association, 92(439), 894–902.
Tanner, M. A., and W. H. Wong (1987): “The Calculation of Posterior Distributions by Data
Augmentation,” Journal of the American Statistical Association, 82, 528–550.
Vella, F. (1998): “Estimating Models with Sample Selection Bias: a Survey,”Journal of Human
Resources, 33, 127–169.
33
Verdinelli, I., and L. Wasserman (1995): “Computing Bayes Factors Using a Generalization
of the Savage-Dickey Density Ratio,” Journal of the American Statistical Association, 90(430),
614–618.
West, M., P. Müller, and M. D. Escobar (1994): “Hierarchical Priors and Mixture Models,
With Applications in Regression and Density Estimation,” in Aspects of Uncertainty: a Tribute
to D.V. Lindley, ed. by P. Freedman et al., pp. 363–386.
Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data. MIT Press.
34
A
Mixtures of Normals
Sampling
i
To sample
i
from its posterior distribution (5.6) in the mixture model, we can use the following
steps:
1. Calculate
Prf
i
= jjwi ; ; ; ; ; cu g = P
k
j jcu
l=1
2. Calculate the CDF Prf
i
l
jj
jcu
1=2
exp
lj
1=2
n
1
2cu (wi
exp
n
1
2cu (wi
jjwi ; ; ; ; g for j = 1; : : : ; k
1
0
j)
Xi
Xi
j
0
l)
(wi
1
l
Xi
(wi
Xi
o
)
j
1;
3. Generate a random number u from the U (0; 1) distribution;
4. Find j , such that
Prf
and set
Sampling
If
i
i
j
1jwi ; ; ; ; g < u
Prf
=j .
j
has a Dirichlet prior distribution with parameters (
( )=
(k
[ (
0)
k
0 )]
1
0
1
0
1
k
0; : : : ;
I
8
k
<X
:
j=1
The distribution of the component selectors is multinomial:
p( 1 ; : : : ;
where nj =
posterior of
j jwi ; ; ; ; g;
i
Pn
i=1 If i
= jg and
Pk
j=1 nj
nj
)=
n1
1
0 ),
j
its density is given by
9
=
=1 :
;
nk
k ;
= n. If ( ; ; ; cu ) is a priori independent of ( ; ), the
conditional on the completed data W and the remaining parameters satis…es
p( jW; ; ; ; ; cu ) / p(W j ; ; ; ; ; cu )
Conditional on
p( j )
( ):
the likelihood p(W j ; ; ; ; ; cu ) does not depend on , so that
p( jW; ; ; ; ; cu ) = p( j )
/ p( j ) ( );
35
o:
l)
from which (5.7) follows. To generate draws from this distribution, perform the following two steps
P
(e.g. Ferguson 1973): (1) for j = 1; : : : ; k, generate Zj G( 0 +nj ; 1) and (2) set j = Zj = kl=1 Zl .
Sampling (
j;
j)
Multiplying the completed data mixture likelihood in (5.1) with the normal-inverse Wishart prior,
we have
p(
j;
j jW;
; ; cu ) / j
1
tr( j 1 H)
2
1 0
1
j cu j j 1=2 exp
2 cu j j
8
< 1 X
nj =2
(ui
jcu j j
exp
: 2cu
jj
(v+3)=2
exp
j
1
0
j)
j
(ui
i: i =j
The exponent involving
exp
8
<
j
can be rewritten as
2
1 4
(
: 2cu
bj )0 Mj 1 (
j
j;
j jW;
; ; cu ) / p(
j j j ; W;
j
jj
exp
/ p(
u0i
given
j
j
then follows. As for
j,
we can write
; ; cu )
(v+nj +3)=2
8
<
2
1 4
: 2cu
1
tr(
2
exp
X
u0i
1
j
ui
i: i =j
1
j
H)
39
=
b0j Mj 1 bj 5
;
; ; cu ) j j j (v+nj +3)=2
8
0
< 1
X
tr j 1 @H + cu 1
ui u0i
exp
: 2
j j j ; W;
19
2
=
n
j
1
0A
cu
uj uj
;
;
1 + nj
i: i =j
from which (5.8) follows.
39
=
1 5
j ui ; ;
i: i =j
with bj and Mj de…ned in (5.9). The posterior of
p(
b0j Mj 1 bj +
bj )
j
X
9
=
j) :
;
Sampling cu
Given that (cu ) = IG(v=2; d) and cu enters the prior of
j
and the completed data likelihood, we
have
p(cu jW; ; ; ; ; ) / cu (v=2+1) e
d=cu
k
Y
j=1
j cu
36
jj
1=2
exp
1
2 cu
0
j
1
j
j
n
Y
1
(wi Xi
2cu
i=1
8
2
k
< 1
X
0
4d + 1
/ cu (v=2+k+n+1) exp
j
: cu
2
j=1
" n
(
1 1X
1
0
(ui
exp
j(i) ) j(i) (ui
cu 2
jcu
j(i) j
1=2
0
j(i) )
exp
1
j
39
=
5
j
;
#)
j(i) )
i=1
1
j(i) (wi
Xi
j(i) )
;
from which (5.10) follows.
B
Dirichlet Mixtures of Normals
Posterior of
We will use the following shorthand notation for the Polya urn prior in (6.3):
( ij
where
j
i)
=
+n
1
is the measure with unit mass at
p( i j
n
Y
i ; W; ; cu ) /
i=1
/
1
X
1
+n
j6=i
j.
j
1
( i );
Then
p(wi jXi ; ; i ; cu ) ( i j
+n
/
dG0 ( i ) +
i)
dG0 ( i )p(wi jXi ; ; i ; cu ) +
p(wi jXi ; ; i ; cu )dG0 ( i ) +
X
j6=i
X p(wi jXi ; ; i ; cu )
+n 1
j
( i)
j6=i
p(wi jXi ; ;
j ; cu )
j
( i ):
(B.1)
The normalizing constant c satis…es
c =
Z
=
2
4 p(wi jXi ; ; i ; cu )dG0 ( i ) +
p(wi jXi ; ; cu ) +
X
j6=i
p(wi jXi ; ;
X
j6=i
p(wi jXi ; ;
j ; cu )
j ; cu ):
j
3
( i )5 d
i
The posterior (B.1) then becomes
p( i j
i ; W;
; cu ) =
8
1<
c:
= (c
1
X
9
=
p(wi jXi ; ; j ; cu ) j ( i )
;
p(wi jXi ; ; i ; cu )dG0 ( i )
+
p(wi jXi ; ; cu )
j6=i
X
p(wi jXi ; ; cu ))p( i jwi ; Xi ; ) +
(c 1 p(wi jXi ; ;
p(wi jXi ; ; cu )
j6=i
37
j ; cu ))
j
( i );
which yields (6.4).
In order to sample from this distribution three elements are needed: p(wi jXi ; ;
j ; cu ),
p(wi jXi ; ; cu )
and p( i jwi ; Xi ; ; cu ). The …rst is simply the completed data likelihood contribution of observation
i:
p(wi jXi ; ;
j ; cu )
= (2 )
1
jcu
jj
1=2
1
(wi
2cu
exp
For the second component, integrating out
1
0
j)
Xi
j
(wi
Xi
j)
:
(B.2)
from the equation above yields p(wi jXi ; ; cu ). In
j
order to do so analytically, we take
dG0 ( i ) = N ( i ; 0; cu
i)
1
W
(
i ; H; v):
(B.3)
Thus, the base distribution is the product of an inverse Wishart distribution for
H and v) and a conditional normal distribution for
(with parameters
i
(with mean zero and variance cu
i
i ).
This
choice of G0 allows us to center prior beliefs around the parametric SSM in section ?? by taking
= 0 and
= 0.
The conditional joint distribution of wi and
p(wi jXi ; ; i ; cu )dG0 ( i ) = kv jHjv=2 j
ij
cu 1 (2 )
1
cu 1 (2 )
1
is now
(v+3)=2
1
j
i
ij
1=2
1=2
exp
j
ij
1
tr
2
exp
1
i
1
2 cu
exp
1
(wi
2cu
H
i
1
0
i
Xi
i
0
i)
1
i
(wi
Xi
i)
where kv is a normalizing constant:
kv =
2
v
1=2
v 1
2
v
2
:
(B.4)
Collecting all constants and rewriting the exponent as a quadratic function in
p(wi jXi ; ; i ; cu )dG0 ( i ) = cu 2 k~v j
(v+5)=2
ij
exp
1
tr(
2
1
i
i,
it follows that
H)
1 0 1
[u
ui b0i Mi 1 bi g
2cu i i
1
exp
(
bi )0 Mi 1 ( i bi ) ;
2cu i
= wi Xi ;
expf
ui
bi =
1+
ui ;
Mi =
k~v
1+
= (2 )
38
(B.5)
i;
2
1
kv jHjv=2 :
;
Integrating out
p(wi ;
i
we …nd
i jXi ; ; cu ) =
Z
p(wi jXi ; ; i ; cu )dG0 ( i )d
= cu 2 k~v j
ij
(v+5)=2
1 0
[u
2cu i
expf
1
tr(
2
exp
1
i
i
1
i
b0i Mi 1 bi g
ui
kv jHjv=2
j i j (v+4)=2
cu (1 + )
1
exp
tr( i 1 H) + cu 1 u0i
2
= (2 )
= (2 )
where
~i = H +
H
H)
2 cu
j
(1 + )
1=2
ij
1
1
kv jHjv=2
j
cu (1 + )
ij
1
(wi
cu (1 + )
1
p(wi jXi ; ; cu ) =
i
p(wi ;
cu 1 b0i Mi 1 bi
ui
exp
1
tr(
2
1
i
~ i) ;
H
Xi )0 :
Xi )(wi
Using the density of the inverse Wishart distribution
Z
(v+4)=2
1
i
can be integrated out, so that
i jXi ;
; cu )d
=
(2 ) 1 kv jHjv=2
cu (1 + ) kv~ ~ v~=2
Hi
=
(v 1) jHjv=2
;
2 cu (1 + ) ~ v~=2
Hi
i
where v~ = v + 1 and kv~ is de…ned in (B.4). A typical choice of vague prior uses H = "0 I2 for some
large "0 > 0. In that case the determinants can be explicitly calculated. Using the result that for
any jIp + aa0 j = 1 + a0 a for any a 2 Rp , it follows that
p(wi jXi ; ; cu ) =
=
Finally, from (6.4), if the value of
(v 1)
"v0
v+1
2 cu (1 + ) "0 (1 + ["0 cu (1 + )]
(v 1)
1
1+
u0 ui
2 cu "0 (1 + )
cu "0 (1 + ) i
i
is not equal to any other
j
1 u 0 u )(v+1)=2
i i
(v+1)=2
:
it is distributed according to the
posterior arising from G0 ( i ). The chosen form of G0 now allows us to generate values of
i
in two simple steps: …rst generate
generate
i
i
i
and
from its posterior (conditional on (wi ; ; cu )) and then
from its posterior conditional on (
i ; wi ;
39
; cu ).
Multiplying the likelihood (B.2) and prior (B.3) and rearranging the exponent, it follows that
p( i ;
i jwi ; Xi ;
; cu ) / j
(v+4)=2
ij
j
1=2
ij
1
tr
2
1
(
2cu
exp
exp
1
i
~i
H
bi )0 Mi 1 (
i
i
bi ) ;
where bi and Mi were de…ned in (B.5). Thus:
B.1
i jwi ; Xi ;
; cu
W
i j i ; wi ; X i ;
; cu
N
~ i ; v + 1);
(H
cu
ui ;
1+
1+
1
i
):
Remixing Unique Values of
To describe the remixing step for , let k be the number of unique values in , denoted by
f
k
j gj=1 .
De…ne the component selectors
i
Let k
i
be the number of distinct
The posterior of
ij
=j
i ; W; ; cu
ij
(
i
=f
,
i
n
i gi=1
=
values in
as before:
j;
i
j = 1; : : : ; k:
and nj;
i
=
in (6.4) then becomes
=
w. prob. c
j
p( i jwi ; Xi ; ; cu ) w. prob. c
Note that knowledge of
=
1n
1
P
l:l6=i If l
j; i p(wi jXi ;
;
= jg for j = 1; : : : ; k i .
j ; cu );
j = 1; : : : ; k
p(wi jXi ; ; cu )
i
:
(B.6)
is equivalent to knowing ( ; ; k). The remixing algorithm is based on
sampling ( ; k) from its conditional distribution given
, and
from its conditional distribution
given ( ; k). From (B.6) it follows immediately that
Prf
i
= jj
i ; W;
;
; cu g = c
1
nj; i p(wi jXi ; ;
j ; cu );
j = 1; : : : ; k i :
(B.7)
Also, with probability
Prf
set
i
i
= 0j
i ; W;
equal to zero and generate
i
;
; cu g = 1
c
1
k i
X
j=1
nj; i p(wi jXi ; ;
j ; cu );
from p( i jwi ; Xi ; ; cu ). After cycling through for i = 1; : : : ; n,
and potentially relabeling , we obtain a new value of ( ; k). In the prior
represents k i.i.d.
draws from G0 (Antoniak 1974). Then:
p( 1 ; : : : ;
k jW;
(B.8)
; ; k; cu ) /
8
k < Y
Y
j=1
:
i: i =j
40
9
=
p(wi jXi ; ; j ; cu )dG0 ( j ) ;
;
so that the
j ’s
are conditionally independent and
p( j jW; ; ; k; cu ) /
Applying this to (
(
j;
j)
j;
j)
Y
i: i =j
p(wi jXi ; ;
j ; cu )dG0 ( j );
j = 1; : : : ; k:
(B.9)
yields exactly the posterior given by (5.8) and (5.9). Thus, the posterior for
in the …nite mixture model can be used in the Dirichlet model as a remixing distribution:
n
i )gi=1
…rst update f( i ;
and determine the number of unique components, then regenerate the
unique components alone and assign the k values across the sample, according to the component
selectors
i.
Step 3 in algorithm 5 can now be replaced by
3a. Sample
i
from the distribution in (B.7) and (B.8) for i = 1; : : : ; n, and determine the number
of unique components k;
3b. sample (
j;
j)
from (5.8) and (5.9) for j = 1; : : : ; k.
Uncertainty about
It is possible to incorporate uncertainty about
into a Gibbs sampling algorithm, thereby allowing
the data to determine how many mixture components are appropriate. A convenient approach,
due to Escobar and West (1995), is described here. As before let
of unique values of
likelihood p(wi jXi ; ;
p( jW; ;
i.
Let
= f j gkj=1 be the collection
be a priori independent of ( ; cu ) with prior
; cu ; ; k; ) does not depend on . The posterior of
; cu ; ; k) / p(W j ;
( ). Note that the
then satis…es
; cu ; ; k; )p( ; ; kj ; ; cu ) ( )
/ p( j ; k; ; ; cu )p( jk; ; ; cu )p(kj ; ; cu ) ( ):
Given (6.3), the prior distribution of
given (k; ; ; cu ) only depends on the sample size and k.
Also, the random number of mixture components only depends on , so that p(kj ; ; cu ) = p(kj ).
Finally, conditional on k,
not depend on
and cu the unique values
follow the prior distribution G0 , which does
or . With these conditional independence relations, the posterior simpli…es to
p( jW; ;
; ; k) = p( jk)
/ p(kj ) ( )
k ( )
/
( ):
( + n)
Using a property of the beta function B(b1 ; b2 ):
B(b1 ; b2 ) =
Z
1
xb1
1
(1
0
=
(b1 ) (b2 )
;
(b1 + b2 )
41
x)b2
1
dx
the posterior of
can be written as
( )
p( jk) /
k 1(
+ n)
(n)
( )
1
)n
(1
1
d :
0
The posterior corresponds to a joint posterior of
p( ; jk) /
Z
and a latent variable
k 1(
+ n)
)n
(1
(n)
1
2 (0; 1), given by
;
from which it is clear that p( j ; k) is the beta B( +1; n) distribution. The joint posterior suggests
using a G(c1 ; c2 ) distribution as prior for . Then:
c1 +k 1
p( j ; k) /
[c2
+n
(c2 log )
+n
c1 +k 2
e
(c2 log )
1
log ]c1 +k
(c1 + k)
[c2
/
e
G (c1 + k; c2
log ]c1 +k
(c1 + k 1)
log )
1
1
G(c1 + k
1; c2
log );
which is a mixture of two gamma distributions. The mixing probabilities satisfy
p
1
p
=
=
log ]c1 +k
(c1 + k)
c1 + k 1
:
n(c2 log )
[c2
1
n
1
[c2
log ]c1 +k
(c1 + k 1)
1
(B.10)
We can now augment algorithm 5 with the following steps:
4a. sample
from B( + 1; n);
4b. calculate p according to (B.10);
4c. with probability p , sample
from G(c1 + k
1; c2
from G(c1 + k; c2
log );
5. return to step 1 in algorithm 5 and repeat.
42
log ); with probability 1
p , sample