Censored and Sample Selection Models

Li Gan
October 2010
Examples
Censored data examples:
(1) stadium attendance= min(attendance*, capacity)
Here the grouping (observations with attendance rates less than one vs. observations
with attendance rates at one) is clearly based on observed factors.
(2) Top-coding: wealth = min (wealth*, 1 million)
We are interested in understanding how wealth is determined:
wealth* = xβ + u, where u|x ~ log-normal
Corner solution / censored regression (no data observability problem)
(3) Testing scores: The Texas Assessment of Knowledge and Skills (TAKS)
typically has 42 or 43 multiple choice questions. If a student gets all correct,
then he/she gets maximum scores. We are interested in how testing score for
student i at school j is affected by, among others, per student spending at
school j:
score*ij = xijβ + γ·PerStudentSpendingj + uij
Observed scores are: scoreij = min{scoreij* , maximum possible score}
It is also the case that a portion of students may get the minimum scores too.
(4) Testing scores again:
score*ijt = ρ·score*ij,t−1 + xijβ + γ·PerStudentSpendingjt + uijt
In this case, both dependent and independent variables could be censored.
Examples of Sample Selection Models
(5) The classical example of sample selection:
Consider the model. We are interested in how x affects wages:
wage* = xβ + u,
wage = wage* if work = 1
However, we only observe wages from those who are working.
work = 1(zγ + v > 0)
It is likely that some unobserved heterogeneity (such as earning ability, personal
ambition, etc) would affect both the wage equation and the work equation. So the errors
between u and v are correlated.
Intuitively, conditional on observables, the group who work is different from the
group who do not work, because they may have different unobserved heterogeneities.
This is the so-called sample selection.
(6) In surveys, it is typical to see that a large portion of people would not give
continuous responses on their wealth values. For example, when asked about the value of
their stock holdings, 45.2% of people gave no amount.
Can we ignore these people (treat them as missing values)? This is the same
question as asking whether those non-responses are random. In fact, there are two kinds
of non-response: DK (don't know) and RF (refused to answer the question), and each
depends on observables with the indicated signs:
Pr(DK = 1): professional (−), widow (+)
Pr(RF = 1): HS grad (−), college (−), professional (−)
If we allow people to give bracketed responses, a much larger percentage of people
are willing to give bracketed responses.
Differences between censored (or truncated) data and the sample selection data
In general, suppose we have a set of information (x, y, z). The population model is:
y = xβ + u,   (2)
and z are instruments.
Let s be: s = 1 if belong to one sample; and s = 0 if belong to another sample. Note if
s = 0, y is not observed, this is called the truncated sample.
The key difference between Heckman's sample selection and censored regression is
whether the selection is random.
In the classical example of the wage regression for women:
log(wagei) = xiβ + ui   (3)
Case 1: s = 1 observe those women who work.
Case 2: s = 1 observe wages, including some would have the minimum wage, wage0.
Case 3: s = 1 observe those born in odd months of the year.
Case 4: s = 1 observes those with schooling years > 13. Consider the case in which there
is no unobserved ability (this may be true if we observe AFQT scores). In other words,
the error in the wage equation is uncorrelated with all xi.
In each of four cases, the starting point is similar:
E(log(wagei)|si =1) = E(xiβ+ ui | si =1)
= xiβ + E(ui| si =1)
Therefore, the critical issue is whether E(ui | si = 1) = 0. This largely depends on
whether ui and si are correlated.
Case 1: Sample selection problem:
E(log(wagei)|si =1) = E(xiβ+ ui | ziγ + vi > 0)
= xiβ + E(ui| vi > -ziγ )
≠ xiβ, if Cov(ui, vi) ≠ 0.
Therefore, ignoring sample selection would create biased estimates because E(ui|
vi > -ziγ ) would enter into the error term – and it is obviously correlated with the
regressor xi.
Case 2: Censored or truncated data:
E(log(wagei) | si = 1) = E(xiβ + ui | log(wagei) > log(wage0))
= xiβ + E(ui | ui > log(wage0) − xiβ)
≠ xiβ, which obviously holds.
Similarly, ignoring censoring would create biased estimates because E(ui | ui >
log(wage0) − xiβ) would enter into the error term – and it is obviously correlated with the
regressor xi.
Case 3: Random sampling:
E(log(wagei)|si =1) = E(xiβ+ ui | dummy-born-in-odd-monthi = 1)
= xiβ + E(ui| dummy-born-in-odd-monthi = 1)
= xiβ, since ui and dummy-born-in-odd-monthi are independent.
In this case, ignoring the unobserved sample does NOT create any problems in
estimation. This is called random sampling.
Case 4: Sampling based on some observables xi.
E(log(wagei)|si =1) = xiβ+E(ui | schoolingi > 13)
If the original model is correctly specified, then cov(xi, ui) = 0.
Therefore, E(ui | schoolingi > 13) = 0.
Sampling based on observables typically behaves like a random sampling.
However, suppose we only observe whether a person has a college degree (a dummy
variable), and Si = 1 if CollegeDegreei = 1. In this case the selection creates a problem,
because the selection process essentially creates an omitted variable problem: if
CollegeDegreei is correlated with the rest of the regressors xi, this correlation biases
the estimates.
For comparison:
(1) Censored (or truncated) sample can be considered as a special case for sample
selection in which vi = ui, and ziγ= xiβ.
(2) Between the sample selection model and the random selection model, the key is
whether Si and ui are uncorrelated.
Estimation Methods:
A general censored regression model:
yi* = xiβ+ ui
yi = max(0, yi*), or: si = 1(yi* > 0).
There are two methods to estimate such models.
1. The regression method:
To construct regression models, it is necessary to work out the conditional expectation
E(yi | si = 1). It is useful to first work out the conditional expectation for the standard
normal. Suppose ε ~ N(0,1).
E(ε | ε > c) = ∫ ε f(ε | ε > c) dε
= ∫c∞ ε f(ε) / Pr(ε > c) dε
= [1/(1 − Φ(c))] ∫c∞ ε (1/√(2π)) exp(−ε²/2) dε
= [1/(1 − Φ(c))] ∫c∞ (1/√(2π)) exp(−ε²/2) d(ε²/2)
= −[1/(1 − Φ(c))] (1/√(2π)) exp(−ε²/2) |c∞
= φ(c) / (1 − Φ(c))
Similarly, we can work out the case: E(ε| ε < c):
E(ε | ε < c) = ∫ ε f(ε | ε < c) dε
= [1/Φ(c)] ∫−∞c ε f(ε) dε
= [1/Φ(c)] ∫−∞c ε (1/√(2π)) exp(−ε²/2) dε
= −[1/Φ(c)] (1/√(2π)) exp(−ε²/2) |−∞c
= −φ(c) / Φ(c)
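Both truncated-mean results above are easy to check numerically. A minimal Python sketch (the cutoff c = 0.5 and the simulation size are arbitrary choices) compares simulated truncated means against the two formulas:

```python
import math
import numpy as np

def phi(c):
    """Standard normal pdf."""
    return math.exp(-c * c / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(c):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

rng = np.random.default_rng(0)
eps = rng.standard_normal(2_000_000)
c = 0.5

# E(eps | eps > c) should equal phi(c) / (1 - Phi(c))
upper_sim = eps[eps > c].mean()
upper_formula = phi(c) / (1.0 - Phi(c))

# E(eps | eps < c) should equal -phi(c) / Phi(c)
lower_sim = eps[eps < c].mean()
lower_formula = -phi(c) / Phi(c)

print(upper_sim, upper_formula)
print(lower_sim, lower_formula)
```

With this many draws the simulated means agree with the formulas to about three decimal places.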
Given this, we have:
For u ~ N(0, σ²):
E(u | u > c) = σ E(u/σ | u/σ > c/σ) = σ φ(c/σ) / (1 − Φ(c/σ))
Similarly, one can obtain E(u·1(u > c)) = σ φ(c/σ). Therefore:
E(y | x, y > 0) = xβ + E(u | u > −xβ)
= xβ + σ φ(xβ/σ) / Φ(xβ/σ)
The previous equation suggests a nonlinear regression method. For those observations
with yi > 0:
yi = xiβ + σ φ(xiβ/σ)/Φ(xiβ/σ) + ui.   (4)
Discussions:
(1) A somewhat common approach to this problem is a two-step method. In the first
step, one estimates a binary probit model and uses the coefficient estimates from
the first step to calculate the inverse Mills ratio. In the second step, one estimates
equation (4). However, this two-step method has several problems.
(a) Note that in this nonlinear model, the parameter vector β appears in both the
linear part xiβ and the nonlinear part φ(xiβ/σ)/Φ(xiβ/σ). Therefore, it is
necessary to estimate the linear part and the nonlinear part simultaneously to
ensure that these parameter estimates take the same values. Estimating them
separately would not guarantee the same parameter estimates.
(b) Second, using the estimated parameters to generate the inverse Mills ratio
suffers from the usual "forbidden regression" problem.
(c) Even if one can estimate the model consistently by nonlinear least squares,
only a subset of the information is used, so it is less efficient.
Fortunately, the usual likelihood function is simple to estimate and it is also
efficient.
(2) If we are interested in using all observations, suppose that yi is censored at
yi = c and that we observe xi for all yi:
yi* = xiβ + ui
yi = max(c, yi*).
Therefore,
E(yi | xi) = E(yi | xi, yi* ≥ c)·Pr(yi* ≥ c | xi) + E(yi | xi, yi* < c)·Pr(yi* < c | xi)
= [xiβ + σ φ((c − xiβ)/σ) / (1 − Φ((c − xiβ)/σ))]·[1 − Φ((c − xiβ)/σ)] + c·Φ((c − xiβ)/σ)
= xiβ·[1 − Φ((c − xiβ)/σ)] + σ φ((c − xiβ)/σ) + c·Φ((c − xiβ)/σ)
So, a nonlinear regression method that uses all the data is given by (assume c = 0):
yi = xiβ Φ(xiβ/σ) + σ φ(xiβ/σ) + ui   (5)
According to equations (4) and (5), running OLS of yi on xi in either the sub-sample
or the whole sample would lead to biased estimates.
(3) Alternatively, one may apply a Heckman-type two step least squares.
a. Estimate a Probit of y = 0 vs y > 0:
Let the coefficient be γˆ .
b. Estimate a linear regression for the sub-sample y > 0:
y = xβ + λ φ(xγ̂)/Φ(xγ̂) + v
According to (4), the coefficients β and γ should be the same, but this procedure does
not guarantee that the two coefficient vectors are equal. A specification test of H0: β =
γ can be performed here.
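The OLS bias implied by equations (4) and (5) is easy to demonstrate by simulation. In the sketch below (all data-generating values are arbitrary), y is censored at zero and OLS of yi on xi is run in both the yi > 0 sub-sample and the full sample; both slope estimates are attenuated relative to the true β:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1, sigma = -1.0, 2.0, 1.0

x = rng.uniform(0.0, 2.0, n)
ystar = beta0 + beta1 * x + sigma * rng.standard_normal(n)
y = np.maximum(0.0, ystar)                  # censoring at zero

def ols_slope(xv, yv):
    """OLS slope from a regression of yv on a constant and xv."""
    X = np.column_stack([np.ones_like(xv), xv])
    coef, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return coef[1]

slope_sub = ols_slope(x[y > 0], y[y > 0])   # the equation (4) sample
slope_all = ols_slope(x, y)                 # the equation (5) sample
# both slopes are attenuated relative to beta1 = 2 in this design
print(slope_sub, slope_all, "true:", beta1)
```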
2. The Maximum Likelihood Estimation method:
Again, consider a censored regression model:
yi* = xiβ+ ui
yi = max(0, yi*).
The density is given by:
f(yi | xi) = [1 − Φ(xiβ/σ)]^1(yi=0) · [(1/σ) φ((yi − xiβ)/σ)]^1(yi>0)
This is the Tobit model.
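The Tobit log-likelihood is straightforward to code and maximize directly. A sketch on simulated data (the parameterization through log σ to keep σ > 0, and the use of scipy's BFGS minimizer, are implementation choices, not part of the notes):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5_000
beta0, beta1, sigma = -1.0, 2.0, 1.0

x = rng.uniform(0.0, 2.0, n)
y = np.maximum(0.0, beta0 + beta1 * x + sigma * rng.standard_normal(n))
X = np.column_stack([np.ones(n), x])

def neg_loglik(theta):
    b, log_s = theta[:2], theta[2]
    s = np.exp(log_s)
    xb = X @ b
    # 1(y=0): log[1 - Phi(xb/s)];  1(y>0): log[(1/s) phi((y - xb)/s)]
    ll = np.where(y == 0.0,
                  norm.logcdf(-xb / s),
                  norm.logpdf((y - xb) / s) - log_s)
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
b0_hat, b1_hat, s_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_hat, b1_hat, s_hat)  # close to (-1, 2, 1)
```

This uses all observations (censored and uncensored) in one criterion, which is the efficiency argument made above.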
In the charity example (equation (1)), where qi is the amount of money given to
charities, the likelihood function is given by:
f(qi | zi, pi) = [1 − Φ((ziγ − log(pi))/σ)]^1(qi=0) · [(1/σ) φ((log(1+qi) − ziγ + log(pi))/σ)]^1(qi>0)
Other Types of Censoring:
(1) Double censoring
y = a if y* ≤ a;  y = y* = xβ + u if a < y* < b;  y = b if y* ≥ b.   (7)
The density function for (7) is:
f(yi | xi) = [Φ((a − xiβ)/σ)]^1(yi=a) · [1 − Φ((b − xiβ)/σ)]^1(yi=b) · [(1/σ) φ((yi − xiβ)/σ)]^1(a<yi<b)
One can apply MLE for this density function.
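The double-censoring density translates line by line into a log-likelihood. A sketch (the limits a = −1, b = 1 and the simulated data are illustrative; the check simply confirms that the true parameters attain a higher likelihood than perturbed ones):

```python
import numpy as np
from scipy.stats import norm

def double_censored_loglik(y, X, beta, sigma, a, b):
    """Log-likelihood of the two-limit (doubly censored) model:
    y = a if y* <= a, y = y* if a < y* < b, y = b if y* >= b."""
    xb = X @ beta
    at_a = y == a
    at_b = y == b
    inside = ~(at_a | at_b)
    ll = np.zeros_like(y, dtype=float)
    ll[at_a] = norm.logcdf((a - xb[at_a]) / sigma)
    ll[at_b] = norm.logsf((b - xb[at_b]) / sigma)    # log[1 - Phi(.)]
    ll[inside] = norm.logpdf((y[inside] - xb[inside]) / sigma) - np.log(sigma)
    return ll.sum()

# quick check on simulated data
rng = np.random.default_rng(3)
n = 4_000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta, sigma, a, b = np.array([0.0, 1.5]), 1.0, -1.0, 1.0
y = np.clip(X @ beta + sigma * rng.standard_normal(n), a, b)

ll_true = double_censored_loglik(y, X, beta, sigma, a, b)
ll_off = double_censored_loglik(y, X, beta + 0.5, sigma, a, b)
print(ll_true > ll_off)
```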
(2) Endogenous explanatory variable model with censoring:
y1 = max (0, z1δ1+α1y2+u1)
y2= zδ2+ v2
Note that u1 and v2 are correlated. Rewrite the u1 as:
u1 = θv2 +ε1
Plug it into the previous equation:
y1 = max (0, z1δ1+ α1y2+ θv2 +ε1)
This suggests a two-step procedure (similar to the discrete case) – Smith and
Blundell (1986).
Step 1: estimate the model y2=zδ2+ v2, and obtain the residual vˆ2 .
Step 2: estimate a standard Tobit model of
y1 = max(0, z1δ1 + α1y2 + θ·v̂2 + ε1)   (6)
This two-step procedure gives consistent coefficient estimates.
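The Smith–Blundell procedure can be sketched in a short simulation (the data-generating values, and the instrument z2, are invented for illustration): step 1 is OLS of y2 on z to obtain v̂2, and step 2 is a Tobit of y1 on (z1, y2, v̂2):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulate y2 = z*delta2 + v2 and y1 = max(0, delta1*z1 + a1*y2 + u1),
# with u1 = theta*v2 + eps1 so that y2 is endogenous in the y1 equation.
rng = np.random.default_rng(4)
n = 5_000
z1 = rng.uniform(0, 1, n)          # included exogenous variable
z2 = rng.uniform(0, 1, n)          # excluded instrument
v2 = rng.standard_normal(n)
y2 = 0.5 * z1 + 1.0 * z2 + v2
theta, delta1, a1 = 0.8, 1.0, 1.0
y1 = np.maximum(0.0, delta1 * z1 + a1 * y2 + theta * v2
                + rng.standard_normal(n))

# Step 1: OLS of y2 on z, keep residuals v2_hat
Z = np.column_stack([np.ones(n), z1, z2])
d2, *_ = np.linalg.lstsq(Z, y2, rcond=None)
v2_hat = y2 - Z @ d2

# Step 2: Tobit of y1 on (1, z1, y2, v2_hat)
X = np.column_stack([np.ones(n), z1, y2, v2_hat])

def neg_loglik(par):
    b, log_s = par[:4], par[4]
    s = np.exp(log_s)
    xb = X @ b
    ll = np.where(y1 == 0.0,
                  norm.logcdf(-xb / s),
                  norm.logpdf((y1 - xb) / s) - log_s)
    return -ll.sum()

res = minimize(neg_loglik, np.zeros(5), method="BFGS")
print(res.x[:4])  # should be near (0, 1, 1, 0.8) in this simulation
```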
Alternatively, one can apply the full maximum likelihood:
f(y1,y2|z) = f(y1|y2,z) f(y2|z)
The densities are given by:
f(y1 | y2, z) = [1 − Φ((z1δ1 + α1y2 + θ(y2 − zδ2))/σε)]^1(y1=0) · [(1/σε) φ((y1 − z1δ1 − α1y2 − θ(y2 − zδ2))/σε)]^1(y1>0)
and
f(y2 | z) = (1/σv) φ((y2 − zδ2)/σv)
Discussions: Note here we substitute v2 by y2-zδ2. The key reason is that in (6) y1 is
continuous if y1>0. So the continuous part of y1 can be used to figure out the variance.
For a binary y1, this no longer holds.
Sample selection model:
Consider a classical sample selection:
y1* = x1β1 + u1
y2 = 1[x2δ2 + v2 > 0]
y1 = y1* · 1[y2 = 1]
We discuss estimation of the model if:
(a) (x2, y2) are always observed, y1 is observed only when y2 =1.
(b) (u1, v2) is independent of x.
(c) v2 ~ N(0,1)
(d) E(u1|v2) = γ1v2
Note that x2 always has to be observed, while x1 only needs to be observed when y2 = 1.
The classic example is women's labor force participation, in which y1 is the wage that
the woman gets if she is working, and y2 is the labor force participation dummy. We
observe the factors that affect women's labor force participation, such as the number of
kids, husband's income, etc., regardless of whether the woman is working or not.
However, we only observe the woman's wage if she is working (y2 = 1).
Again, there are two methods to estimate this model:
1. Regression Method:
E(y1 | x, v2) = x1β1 + E(u1 | x, v2) = x1β1 + E(u1 | v2) = x1β1 + γ1v2
If γ1 = 0, there is no endogeneity and OLS is fine. However, when γ1 ≠ 0, since v2 is
unobserved, we need to take the conditional expectation (conditioning on y2 = 1):
E(v2 | x, y2 = 1) = E(v2 | x, x2δ2 + v2 > 0)
= E(v2 | x, v2 > −x2δ2)
= φ(x2δ2) / Φ(x2δ2)
The last equality applies the earlier result for the standard normal, E(ε | ε > c) =
φ(c)/(1 − Φ(c)), with c = −x2δ2, using the symmetry φ(−c) = φ(c) and 1 − Φ(−c) = Φ(c).
Therefore,
E(y1 | x, y2 = 1) = x1β1 + γ1 E(v2 | x, y2 = 1)
= x1β1 + γ1 φ(x2δ2)/Φ(x2δ2)   (6)
As before, one can use the non-linear regression method to estimate this model.
However, it is computationally more difficult than MLE and has no advantage over MLE.
Heckman suggests a two-step estimator:
Step 1: Estimate a binary probit model:
Pr(y2=1) = Φ(x2δ2)
The estimated parameters are used to construct the inverse Mills ratio:
φ(x2δ̂2) / Φ(x2δ̂2)
Step 2: Estimate the following regression:
E(y1 | x, y2 = 1) = x1β1 + γ1 φ(x2δ̂2)/Φ(x2δ̂2)
Discussions:
(1) Note that given E(u1 | v2) = γ1v2 and var(v2) = 1, we have Cov(u1, v2) = γ1,
and the correlation coefficient between u1 and v2 is given by: r = γ1/σu.
(2) An important but less often noticed fact is that, in the case where we observe y1
both when y2 = 0 and when y2 = 1, we can develop a specification test:
E(y1 | x, y2 = 0) = x1β1 + γ1 E(v2 | x, y2 = 0)
= x1β1 + γ1 E(v2 | x, v2 < −x2δ2)
= x1β1 + γ1 [−φ(x2δ2) / (1 − Φ(x2δ2))]   (7)
The last equality applies the earlier result E(ε | ε < c) = −φ(c)/Φ(c).
Note that in (6) and (7) the inverse Mills ratio term is different: φ(x2δ2)/Φ(x2δ2)
in (6), and −φ(x2δ2)/(1 − Φ(x2δ2)) in (7). However, their coefficient γ1 is the same.
Therefore, one can estimate both (6) and (7) by applying the Heckman two-step
estimator and test whether the coefficients from (6) and (7) are the same. This test can
serve as a specification test for the model.
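The two-step estimator, and the bias from ignoring selection, can be illustrated in a short simulation (variable names and data-generating values are invented; the probit is estimated by direct maximum likelihood with scipy):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulate the classic selection model:
#   y1 = b0 + b1*x + u1,  y2 = 1(X2*d2 + v2 > 0),  y1 used only if y2 = 1,
#   u1 = g1*v2 + e1  so that Corr(u1, v2) != 0.
rng = np.random.default_rng(5)
n = 20_000
x = rng.uniform(0, 2, n)          # regressor in the wage equation
w = rng.uniform(0, 2, n)          # extra selection variable (exclusion restriction)
v2 = rng.standard_normal(n)
b0, b1, g1 = 0.5, 1.0, 0.8
y2 = (0.5 + 1.0 * w - 1.0 * x + v2 > 0).astype(float)
y1 = b0 + b1 * x + g1 * v2 + 0.5 * rng.standard_normal(n)

X2 = np.column_stack([np.ones(n), w, x])   # selection-equation regressors

# Step 1: probit MLE for Pr(y2 = 1) = Phi(X2 d2)
def probit_nll(d):
    q = 2.0 * y2 - 1.0
    return -norm.logcdf(q * (X2 @ d)).sum()
d_hat = minimize(probit_nll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y1 on (1, x, inverse Mills ratio), selected sample only
xb = X2 @ d_hat
mills = norm.pdf(xb) / norm.cdf(xb)
sel = y2 == 1.0
X1 = np.column_stack([np.ones(n), x, mills])[sel]
coef, *_ = np.linalg.lstsq(X1, y1[sel], rcond=None)
print(coef)  # approximately (0.5, 1.0, 0.8)

# For comparison, naive OLS on the selected sample biases the slope:
Xn = np.column_stack([np.ones(n), x])[sel]
naive, *_ = np.linalg.lstsq(Xn, y1[sel], rcond=None)
print(naive)
```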
2. Maximum Likelihood Estimator
Since we only observe y1 when y2 = 1, so we only need to find out the joint
density of the case (y1, y2=1). The density function for y2 is:
f(y2 | x) = Φ(xδ2)^y2 · (1 − Φ(xδ2))^(1−y2)
The conditional density for y1 is given by:
f(y1 | y2 = 1, x) = Pr(y2 = 1 | y1, x) f(y1 | x) / Pr(y2 = 1 | x)
Note that y1 | x ~ N(xβ, σ1²), and assume that cov(u1, v2) = σ12. We can write:
v2 = (σ12/σ1²)(y1 − xβ) + e2, where e2 ~ N(0, 1 − σ12²/σ1²)
Therefore,
Pr(y2 = 1 | y1, x) = Φ( [xδ2 + (y1 − xβ)σ12/σ1²] / (1 − σ12²/σ1²)^(1/2) )
So the likelihood function is given by:
li(θ) = ln[ Pr(yi2 = 0)^1(yi2=0) · Pr(yi1, yi2 = 1)^1(yi2=1) ]
= ln[ Pr(yi2 = 0)^1(yi2=0) · (f(yi1 | yi2 = 1) Pr(yi2 = 1))^1(yi2=1) ]
= (1 − yi2) ln(1 − Φ(xδ2)) + yi2 ln[ Pr(y2 = 1 | y1, x) f(y1 | x) ]
Applications:
Example: Wage offer function: only those who have jobs have observed wages. Suppose
we are interested in how wage offers are determined, E (wio | xi ) .
The observation rule for the wage is that the observed wage wi equals wio if and only if
the worker is working.
Suppose that wio = exp(xi1β1 + ui1). The decision rule for work is that she would
work if and only if wio > wir, where wir is the reservation wage. To study how wir is
determined, we consider explicitly a labor supply model:
max_h u(wio·h + ai, h)   s.t. 0 ≤ h ≤ 1
where ai is the non-wage income of person i. Differentiating with respect to hours,
du/dh = u1(wio·h + ai, h)·wio + u2(wio·h + ai, h)
where u1 is the marginal utility from income, and u2 is the marginal utility from
working. At h = 0, the person stays out of the labor force if
du/dh |h=0 = u1(ai, 0)·wio + u2(ai, 0) ≤ 0
⇒ The reservation wage is obtained by setting the previous expression to zero:
wir = −u2(ai, 0)/u1(ai, 0)
Therefore, the reservation wage wir will definitely depend on non-labor income ai.
Let wir be determined by the following equation:
wir = exp( xi 2 β 2 + γ 2 ai + ui 2 )
where ui1 and ui2 are independent of (xi1, xi2, ai). xi1 represents the productivity
characteristics, while xi2 are variables that determine the marginal utility of leisure and
income, and ai is the non-wage income. Rewrite the previous equations in logarithm:
log wio = xi1β1 + ui1
log wir = xi2β2 + γ·ai + ui2
The selection rule is given by the difference: log wio − log wir > 0 .
If wir were observed and exogenous, and xi1 were always observed ⇒ censored
regression.
If wir were observed and exogenous, but xi were only observed when wio is
observed ⇒ truncated Tobit.
If wir is not observed, as in most cases, we have:
log wio − log wir = xi1β1 − xi2β2 − γ·ai + ui1 − ui2 = xiδ + vi > 0.
Here the selection process and the wage regression are given by:
Si = 1(xiδ + vi > 0)
log wio = xi1β1 + ui1, observed if Si = 1.
Note in the selection process, vi includes both ui1 and ui2. Therefore, it is obvious
that ui1 and vi are correlated, and clearly the correlation is positive.
Another important point from this model is about what xi should be used. It is
clear that the sample selection model should include all xi. It is also important to notice
that the xi1 in the wage regression should NOT include ai, the non-labor income of the
person.
Example (charitable contributions): we are interested in finding demand for charity
donation.
maxc,q ui(ci, qi) = ci + ai·log(1 + qi), s.t. ci + pi·qi = mi, and qi ≥ 0
where ci is annual consumption, qi is annual charitable giving, and ai is the marginal
utility from giving. In addition, mi is family income, and pi is the dollar price of
charitable contribution, which depends on the marginal tax rate of the person. For
example, for a person with a marginal tax rate of 30%, pi is 0.7.
Plugging the budget constraint into the utility function and taking the derivative with
respect to q, the first-order condition is:
−pi + ai/(1 + qi*) = 0
The solution to this problem is:
qi = 0 if ai/pi ≤ 1
qi = ai/pi − 1 if ai/pi > 1
If we are interested in what characteristics determine charitable giving, we model
ai = exp(ziγ + ui), and we have our estimation model:
log(1 + qi) = max(0, ziγ − log(pi) + ui)   (1)
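The corner solution can be verified numerically: a brute-force grid search over q maximizing ci + ai·log(1 + qi) under the budget constraint should reproduce qi = max(0, ai/pi − 1). A small sketch (the income level and the (ai, pi) pairs are arbitrary):

```python
import numpy as np

def best_q(a, p, m, grid=None):
    """Maximize c + a*log(1+q) s.t. c = m - p*q, q >= 0, by grid search."""
    if grid is None:
        grid = np.linspace(0.0, 10.0, 200_001)
    util = (m - p * grid) + a * np.log1p(grid)
    return grid[np.argmax(util)]

m = 100.0
cases = [(0.5, 1.0), (2.0, 1.0), (3.0, 0.7)]
# compare grid-search optimum to the closed-form corner solution
checks = [(best_q(a, p, m), max(0.0, a / p - 1.0)) for a, p in cases]
print(checks)
```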
Now we are interested in understanding charitable giving among faculty at the
Texas A&M University. There are two ways to obtain a sample.
(1) We randomly draw N people from the university’s accounting office. The
office has detailed information of all faculty members on campus, including their
donation amount.
(2) We randomly draw N faculty members and call them. For each faculty member, we
ask the amount of their charitable giving. Inevitably, some faculty would refuse to
answer this question (recorded as RFi = 1).
Censored sampling: In the first case, we know exactly if a person donates or not. Some
faculty may make zero amount of donation. This is the classic censored model:
log(1+qi) = max(0, ziγ – log(pi)+ui)
Usual censored regression can be applied here.
Selected Sample:
First we assume a simplified case that all faculty whose RFi = 0 have positive
donation amount:
log(1+qi) = ziγ – log(pi)+ui
if RFi = 0.
Suppose we do observe zi and pi for all people.
Pr(RFi = 0) = Pr(miη + αlog(pi)+ vi < 0).
How to estimate this problem? Note it is often the case that RF is not random,
i.e., cov(ui, vi) ≠ 0. Intuitively, those people who refuse to answer are more likely to
donate less than those who gave a response (response could be zero). Therefore, we have:
E(log(1 + qi) | RFi = 0)
= ziγ − log(pi) + E(ui | miη + α·log(pi) + vi < 0)
= ziγ − log(pi) + ρ · [−φ(miη + α·log(pi)) / (1 − Φ(miη + α·log(pi)))]
A standard two-step Heckman approach can be used here.
Now consider the Maximum Likelihood approach. Here we assume that some
faculty may give zero amount of donation (RFi = 0):
We have three cases: (qi > 0, RF = 0) and (qi = 0, RF = 0), and (RF = 1).
Write ui = ρvi + εi.
Case 1: (qi > 0, RFi = 0)
f(qi, RF = 0) = f(qi, miη + α·log(pi) + vi < 0)
= f(qi, vi < −miη − α·log(pi))
= ∫−∞^(−miη−α·log(pi)) (1/σε) φ((log(1+qi) − ziγ + log(pi) − ρvi)/σε) φ(vi) dvi
Case 2: (qi = 0, RF i = 0)
Pr(qi = 0, RF = 0) = Pr(ziγ − log(pi) + ρvi + εi < 0, miη + α·log(pi) + vi < 0)
= Pr(εi < −ziγ + log(pi) − ρvi, vi < −miη − α·log(pi))
= ∫−∞^(−miη−α·log(pi)) Φ((−ziγ + log(pi) − ρvi)/σε) φ(vi) dvi
Case 3: (RFi = 1):
Pr(RFi = 1) = Pr(miη + α·log(pi) + vi > 0) = Φ(miη + α·log(pi))
Therefore, the likelihood for one observation is given by:
f(qi, RFi) = f(qi, RFi = 0)^1(qi>0, RFi=0) · Pr(qi = 0, RFi = 0)^1(qi=0, RFi=0) · Pr(RFi = 1)^1(RFi=1)
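As a sanity check on the three pieces (and on the sign conventions), the probabilities should sum to one: Pr(RF = 1), plus the Case 2 probability, plus the integral of the Case 1 density over s = log(1 + qi) > 0. A sketch with arbitrary parameter values, using numerical integration:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# arbitrary illustrative values of the index terms and error parameters
m_idx = 0.3        # m_i*eta + alpha*log(p_i)
b_idx = 0.4        # z_i*gamma - log(p_i)
rho, sig_e = 0.6, 0.8

# Case 3: Pr(RF = 1) = Phi(m_idx)
p_rf1 = norm.cdf(m_idx)

# Case 2: int_{-inf}^{-m} Phi((-b - rho*v)/sig_e) phi(v) dv
p_q0, _ = integrate.quad(
    lambda v: norm.cdf((-b_idx - rho * v) / sig_e) * norm.pdf(v),
    -np.inf, -m_idx)

# Case 1: integrate the density of s = log(1+q) over s > 0 and v < -m
p_qpos, _ = integrate.dblquad(
    lambda v, s: norm.pdf((s - b_idx - rho * v) / sig_e) / sig_e * norm.pdf(v),
    0.0, np.inf,
    lambda s: -np.inf, lambda s: -m_idx)

total = p_rf1 + p_q0 + p_qpos
print(p_rf1, p_q0, p_qpos, total)  # total should be 1
```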
A further complication: Assume that pi is endogenous. The endogeneity of pi
could come from the fact that choice of qi could affect pi by switching to a different tax
bracket, or pi is measured with error.
Let
log(pi) = xiβ + wi.
Let ui = ρ1wi + ε1i, and vi = ρ2wi + ε2i
First consider the case in which, when RFi = 0, all faculty make a positive amount of
donation, i.e., qi > 0. In this case, we apply the regression model:
E(log(1 + qi) | RFi = 0)
= ziγ − log(pi) + E(ui | miη + α·log(pi) + vi < 0)
= ziγ − log(pi) + E(ρ1wi + ε1i | miη + α·log(pi) + ρ2wi + ε2i < 0)
= ziγ − log(pi) + ρ1 E(wi | miη + α·log(pi) + ρ2wi + ε2i < 0)
= ziγ − log(pi) + ρ1 E(wi | ρ2wi < −miη − α·log(pi) − ε2i)
= ziγ − log(pi) + ρ1 ∫−∞^∞ σw·[−φ((miη + α·log(pi) + ε2i)/(ρ2σw)) / (1 − Φ((miη + α·log(pi) + ε2i)/(ρ2σw)))] (1/σ2) φ(ε2i/σ2) dε2i
The last equality is obtained by first conditioning on ε2i to get the inverse Mills ratio,
and then integrating out ε2i, since it is not observed.
Therefore, it is not easy to estimate such a model using a two-step approach.
Maximum likelihood is probably the only plausible approach to this problem.
There are three endogenous variables, (qi, RFi, pi). We consider the following
transformation:
f (qi , RFi , pi ) = f (qi , RFi | pi ) f ( pi )
The density for f(pi) is easy. So our focus is on the first part: f(qi, RFi|pi). To write
the likelihood function, consider again the three cases as before: (qi > 0, RFi = 0 | pi),
(qi = 0, RFi = 0 | pi), and (RFi = 1 | pi). Here we again assume that some faculty
members make a zero amount of contribution.
Case 1: (qi > 0, RFi = 0)
f(qi, RFi = 0 | pi) = f(qi, miη + α·log(pi) + vi < 0 | pi)
= f(qi, ε2i < −miη − α·log(pi) − ρ2wi | wi)
= (1/σ1) φ((log(1+qi) − ziγ + log(pi) − ρ1wi)/σ1) · [1 − Φ((miη + α·log(pi) + ρ2wi)/σ2)]
= (1/σ1) φ((log(1+qi) − ziγ + log(pi) − ρ1(log(pi) − xiβ))/σ1) · [1 − Φ((miη + α·log(pi) + ρ2(log(pi) − xiβ))/σ2)]
Note that in the last equality we replace wi by log(pi) − xiβ. This is possible because
log(pi) is continuous and there is no constraint on the error term wi. Note also that
conditioning on pi is equivalent to conditioning on wi.
Case 2: (qi = 0, RFi = 0)
Pr(qi = 0, RF = 0) = Pr(ziγ − log(pi) + ρ1wi + ε1i < 0, miη + α·log(pi) + ρ2wi + ε2i < 0)
= Pr(ε1i < −ziγ + log(pi) − ρ1wi, ε2i < −miη − α·log(pi) − ρ2wi)
= Φ((−ziγ + log(pi) − ρ1wi)/σ1) · Φ((−miη − α·log(pi) − ρ2wi)/σ2)
= Φ((−ziγ + log(pi) − ρ1(log(pi) − xiβ))/σ1) · Φ((−miη − α·log(pi) − ρ2(log(pi) − xiβ))/σ2)
Case 3: (RFi = 1)
Pr(RFi = 1) = Pr(miη + α·log(pi) + ρ2wi + ε2i > 0)
= Φ((miη + α·log(pi) + ρ2wi)/σ2)
= Φ((miη + α·log(pi) + ρ2(log(pi) − xiβ))/σ2)
Finally, we need a density for wi (or, in other words, a density for log(pi)).
Therefore, the likelihood is given by:
f(qi, RFi, pi) = f(qi, RFi | pi) f(pi)
= f(qi > 0, RFi = 0)^1(qi>0, RFi=0) · Pr(qi = 0, RFi = 0)^1(qi=0, RFi=0) · Pr(RFi = 1)^1(RFi=1) · (1/σw) φ((log(pi) − xiβ)/σw)
It is interesting to point out that this likelihood function does not involve
integration (other than the cdf of normal density). This property makes the current model
easy to estimate.
Summary
In all previous examples, including examples in the discrete choice part, figuring out the
density is the critical step. In general, there are three steps in figuring out the density.
Step 1: determine the density of “what”.
The “what” should be all potentially endogenous variables.
Step 2: determine the relations between the endogenous variables. Often the error term
in one of the equations is potentially "contaminated". We need to write out the
contaminated random variable as a function of independent random variables.
Step 3: write the joint density as a conditional density and a marginal density. In this
step, it is important to keep in mind the ranges that error terms may lie in.
Example: Discrete/Continuous Model (Dubin and McFadden, Econometrica, 1985)
Consumers face a choice of m mutually exclusive, exhaustive appliance
portfolios, which can be indexed as i = 1, …, m. Portfolio i has a rental price (annualized
cost) ri. Given i, the consumer has a conditional indirect utility function:
u = V(i, y-ri, p1, p2, si, εi, η)
where p1 is the price of electricity, p2 is price of alternative energy sources, y is income, si
is observed attributes of i, εi is unobserved attributes of i, ri is the price of i, η is
unobserved characteristics of the consumer. Electricity and alternative energy
consumption levels, given i, are (by Roy’s identity):
x1 = −[∂V(i, y − ri, p1, p2, si, εi, η)/∂p1] / [∂V(i, y − ri, p1, p2, si, εi, η)/∂y]
x2 = −[∂V(i, y − ri, p1, p2, si, εi, η)/∂p2] / [∂V(i, y − ri, p1, p2, si, εi, η)/∂y]
The probability that portfolio i is chosen:
Pi = Pr{(ε1, …, εm, η): V(i, y − ri, p1, p2, si, εi, η) > V(j, y − rj, p1, p2, sj, εj, η) for j ≠ i}
First consider a demand system that is linear in income:
x1 = α0i + α1p1 + α2p2 + w'γ + βi(y − ri) + η + v1i
The indirect utility that leads to such a demand function is given by:
Vi = (α0i + α1/β + α1p1 + α2p2 + w'γ + βi(y − ri) + η + v1i)·exp(−βp1) + εi
So the probability of choice i is given by:
Pi = Pr (Vi > V j for j ≠ i )
Estimation process:
(1) estimate a discrete choice model
(2) estimate a continuous demand model
The problem of the second stage model is: E(η|i) is not zero.
x1 = α 0i + α1 p1 + α 2 p2 + w' γ + β i ( y − ri ) + η + v1i
Example: vehicle choice and vehicle miles traveled (VMT). The consumer's problem is:
max_{ci, VMTi} u(ci, VMTi)   s.t. (pg/MPGi)·VMTi + ci = y − δki
where ci is the consumption, ki is the cost of owning the vehicle bundle i, VMTi is the
vehicle miles driven, and pg is the price of gasoline, which does not vary over i, MPGi is
the miles per gallon, which does vary across vehicle bundle i. pg/MPGi is the cost per
mile of driving vehicle bundle i.
Let the optimal solution – the indirect utility – be denoted:
Vi = v(pg/MPGi, y − δki; x, η)
where indirect utility is obtained by maximizing the direct utility under budget constraint.
η represents unobserved characteristics, such as preference to driving, and distance from
work, traffic congestion, etc.
The individual chooses vehicle bundle i if and only if Vi > Vj for all j ≠ i (the discrete
choice part of the model).
Next is to specify VMT. The VMT is obtained by Roy’s identity:
VMTi = −[∂v(mpgi, y, zi, η)/∂pi] / [∂v(mpgi, y, zi, η)/∂y] = α0 + α1ipi + β(y − ri) + x'γ + η
Since we only observe the VMTi for the chosen vehicle bundle, the conditional
expectation E(η|p,y,r,x) ≠ 0.
Dubin and McFadden suggest three alternative ways to handle this: one is similar to the
Heckman two-stage model, another is to use instrumental variables, and the third is to
use a reduced-form estimation method:
(1) Heckman-type method:
ln(VMTi) = α0 + α1ip + β(y − ri) + x'γ + E(η | i chosen)   (*)
The expectation E(η | i chosen) is not zero. In the Heckman sample selection
model, this expectation is the inverse Mills ratio multiplied by a constant. Here it is a
function of Pr(i chosen) for all i = 1, …, m.
(2) IVs: use the predicted Pr(i chosen) in the discrete choice part as IVs.
One can rewrite (*) as:
ln(VMT) = α0 + Σi=1..m Di α1i p + β Σi=1..m Di (y − ri) + x'γ + η   (**)
where Di is the dummy indicating if choice i is chosen. Obviously Di is
dependent on η. Dubin and McFadden suggest using Pr(i chosen) from the discrete
choice part as IVs for Di.
(3) Reduced form:
Taking the expectation of (**) (unconditionally on which choice is chosen), we have:
E(ln(VMT)) = α0 + Σi=1..m Pi α1i p + β Σi=1..m Pi (y − ri) + x'γ   (***)
where Pi is the probability of choice i is chosen. E(η) = 0 because the
expectation is taken unconditionally for the full sample (not just the subsample
of those who choose bundle i). Dubin and McFadden suggest using the
estimated Pi, Pˆi , from the discrete choice part to substitute Pi.
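The reduced-form idea can be illustrated with a two-bundle simulation (all functional forms and parameter values here are invented for illustration, and the choice probabilities Pi are computed by Gauss–Hermite quadrature rather than estimated). Replacing the choice dummies Di with Pi removes the correlation with η:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
y = 10.0
a0, a11, a12, beta = 1.0, -0.4, -0.6, 0.1

p1, p2 = rng.uniform(1, 2, n), rng.uniform(1, 2, n)
r1, r2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
eta = rng.standard_normal(n)                 # unobserved taste for driving

# Choice: U1 = -p1 - r1 + eta + e1, U2 = -p2 - r2 + e2, e extreme value
U1 = -p1 - r1 + eta + rng.gumbel(size=n)
U2 = -p2 - r2 + rng.gumbel(size=n)
D1 = (U1 > U2).astype(float)
D2 = 1.0 - D1

lnVMT = (a0 + D1 * a11 * p1 + D2 * a12 * p2
         + beta * (D1 * (y - r1) + D2 * (y - r2)) + eta)

# True probabilities P1 = E_eta[logistic(eta + delta)] via Gauss-Hermite
delta = (p2 + r2) - (p1 + r1)
gh_x, gh_w = np.polynomial.hermite.hermgauss(40)
P1 = (gh_w[:, None] / np.sqrt(np.pi)
      / (1.0 + np.exp(-(np.sqrt(2.0) * gh_x[:, None] + delta[None, :])))
      ).sum(axis=0)
P2 = 1.0 - P1

def ols(Xc, yv):
    coef, *_ = np.linalg.lstsq(Xc, yv, rcond=None)
    return coef

ones = np.ones(n)
# Dummy version: inconsistent, since D depends on eta
bd = ols(np.column_stack([ones, D1 * p1, D2 * p2,
                          D1 * (y - r1) + D2 * (y - r2)]), lnVMT)
# Reduced form (***): replace dummies with probabilities
bp = ols(np.column_stack([ones, P1 * p1, P2 * p2,
                          P1 * (y - r1) + P2 * (y - r2)]), lnVMT)
print(bd)
print(bp)  # close to (1.0, -0.4, -0.6, 0.1)
```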
Final complication: if we only observe shares of vehicle bundles (instead of individual
choices of vehicle bundles):
Sin = ∫η [exp(Vin(η)) / Σj=1..K exp(Vjn(η))] f(η) dη,
ln(VKTin) = α0 + Σj α1j pjn Sjn + β Σj (yn − rjn) Sjn + xn'γ + vin