Censored Regression, Sample Selection, Endogenous Switching, and Treatment-Effect Regression Models Dan Powers Soc385K

Censored Regression, Sample Selection, Endogenous Switching, and
Treatment-Effect Regression Models
Dan Powers
Soc385K
November 26, 2007
Censored Regression Models
Truncated and censored regression models are useful when the dependent variable is observed in
some ranges but not in others. A classical example is Tobin’s (1958) study of household
expenditures, in which expenditures on durable goods were seen to generally increase with
income, but a non-trivial number of households had no expenditures on durable goods. The
problem, in this case, is how to model the response.1
It is clearly inappropriate for a classical linear model due to the presence of many observations
with values of 0. However, a model can be constructed using a latent variable approach. Suppose
that we consider a latent continuous variable Y ∗ that is normally distributed with mean µ and
variance σ 2 . The standard Tobit model is given by
Y ∗ = Xβ + ε
Y =Y∗
if Y ∗ > 0
=0
if Y ∗ ≤ 0.
To estimate this model we need to consider two pieces of the overall likelihood, the probability
that Y ∗ is non-masspoint and the probability that Y ∗ is a masspoint (at zero). Combining these
into a single expression for the joint probability (likelihood) results in the following expression
L=
Y
[1 − Φ(Xβ/σ)]
Y 1
φ[(Y − Xβ)/σ],
σ
y∗>0
y∗≤0
where Φ(·) and φ(·) are, respectively, the cumulative standardized normal distribution function
and standardized normal density function. If we were to use OLS on all observations, we would
get biased estimates. It is also the case that OLS applied to the non-masspoint observations
would yield biased estimates. In this case, the expectation of Y is not equal to Xβ as it would be
in a classical linear regression model. In fact,
E(Y |Y ∗ > 0) = Xβ + E(ε|ε > −Xβ)
The last term is not necessarily equal to zero. If we assume normality of ε, then
E(Y |Y ∗ > 0) = Xβ + σλ(Xβ/σ),
1
We will drop individual subscripts and change notation to simplify things a bit. For example, x0iβ = Xβ.
1
where λ(·) = φ(·)/Φ(·).2
Estimation strategies involve simple (two-step) procedures or ML (there are also other
possibilities). A two step strategy involves first using a probit model on the complete data, coding
Z = 1 if Y > 0, and 0 otherwise.
probit[Pr(Z = 1)] = Xα
b = φ(X α
We then compute λ
b)/Φ(X α
b), and use this quantity as an additional regressor in an OLS
model for the non-masspoint observations. However, it is generally easier ( and results in a more
efficient estimator) to estimate this model using maximum likelihood.
Example. This example models the number of hours worked per year as a function of
educational level, age, and number of children.
. regress hours edLTHS edCOL age children
Source |
SS
df
MS
-------------+-----------------------------Model |
489867964
4
122466991
Residual |
324711042 1121 289661.946
-------------+-----------------------------Total |
814579006 1125 724070.228
Number of obs
F( 4, 1121)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
=
1126
422.79
0.0000
0.6014
0.6000
538.2
-----------------------------------------------------------------------------hours |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------edLTHS | -550.7335
43.67641
-12.61
0.000
-636.4302
-465.0368
edCOL |
687.8252
36.75285
18.71
0.000
615.713
759.9373
age | -1.254959
1.239134
-1.01
0.311
-3.686242
1.176324
children | -210.2569
8.367137
-25.13
0.000
-226.6739
-193.8399
_cons |
1198.584
53.0985
22.57
0.000
1094.4
1302.767
-----------------------------------------------------------------------------. tobit hours edLTHS edCOL age children, ll(0)
Tobit estimates
Number of obs
LR chi2(4)
Prob > chi2
Pseudo R2
Log likelihood = -6335.3063
=
=
=
=
1126
1257.40
0.0000
0.0903
-----------------------------------------------------------------------------hours |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------edLTHS | -837.2152
62.12268
-13.48
0.000
-959.1049
-715.3255
edCOL |
751.6508
44.09414
17.05
0.000
665.1345
838.167
age | -2.339379
1.515065
-1.54
0.123
-5.312059
.6333014
children | -433.9989
16.773
-25.87
0.000
-466.9089
-401.0889
_cons |
1349.601
65.84254
20.50
0.000
1220.413
1478.79
-------------+---------------------------------------------------------------_se |
621.3327
16.2205
(Ancillary parameter)
-----------------------------------------------------------------------------2
See the Appendix for more details.
2
Obs. summary:
342
784
left-censored observations at hours<=0
uncensored observations
We find that OLS underestimates the effects of all of the predictors in this case. Interpret the
results as you would any regression model. The parameter labelled se is the “selection effect,” or
bias correction factor.
Selection Models
The selection model of Heckman (1976) describes an estimation problem resulting from
incomplete data. To motivate this model, consider the problem of estimating a regression for
women’s earnings. We would not have a problem if we observed earnings data for the majority of
working-age women. However, labor-force participation rates in the 1970s, when Heckman worked
on this problem, where hovering around 43 percent (compared to more than 60 percent
currently), and full-time LFP rates were only about 30 in 1970. This is a problem if the goal is to
make inferences about the factors that contribute to women’s earnings. We can generalize this to
any sample. For the data in the example above, we can assume that we would only observe wages
for the subsample of labor-force participants. Participation depends on covariates that are
assumed to affect the reservation wage, or the wage that would need to be offered to induce a
respondent to enter the market. The econometric model postulates that if a wage offer exceeds
the reservation wage, then we will observe a wage for that individual.
Theory. The selection problem is that earnings are observed only for labor force participants;
earnings are unobserved or “latent” for nonparticipants. Economists envision two equations:
Yo = Xβ + u1 ,
wage offered
and
Yr = Xβ + u2 ,
reservation wage
It follows that, Y = Yo if Yo ≥ Yr and Y is missing or 0 otherwise. Selectivity bias refers to the
fact that if we were to estimate the earnings function based on observations for which we have
data, then we get inconsistent (biased) estimates of the effects of variables affecting wage rates.
Heckman’s solution was to view the wage equation as the substantive equation of interest.
Y = Xβ + u1 ,
along with a participation equation,
W ∗ = Zγ + u2 ,
in which a woman works if W ∗ ≥ 0, and
Pr(W = 1|Z) = Φ(Zγ).
It can be shown that,
"
E(u1 |W = 1) = E(u1 |u2 > −Zγ) = E(u1 |u2 < Zγ) = σ12
Note, by the symmetry of the normal distribution, we have:
φ(−z)
φ(z)
=
1 − Φ(−z)
Φ(z)
3
φ(Zγ)
Φ(Zγ)
#
and
φ(−z)
φ(z)
=
Φ(−z)
1 − Φ(z)
Logit models can also be used. As an exercise, show that for a logit model
f (z)
1
=1−p=
.
F (z)
1 + exp(z)
Note that the ratio φ(z)/Φ(z) is referred to as the inverse Mills ratio (or non-selection hazard
rate). and, φ(z)/[1 − Φ(z)] = φ(z)/Φ(−z), is referred to as “selection” hazard rate. From the
modeling standpoint, we have:
W ∗ = Zγ + u2 ,
W = 1, if W ∗ ≥ 0, 0 otherwise
Y ∗ = Xβ + u1 ,
Y ∗ = Y if W = 1.
The equation for W is a probit model; Y ∗ is a latent continuous dependent variable. Extensions
to the model allow for different possibilities for Y ∗ (censored, binary, etc). Here, of course, wages
are treated as continuous. We assume that u1 and u2 are distributed as bivariate normal, with
2 , σ 2 , and σ .3
means zero, (co)variances σu1
12
u2
The nature of the selectivity bias can be shown by evaluating the expectation of Y . In this
case,
E(Y |W ∗ > 0) = E(Y |Zγ + u2 > 0)
= Xβ + E(u1 |u2 > −Zγ)
= Xβ + E(u1 |u2 < Zγ)
φ(Zγ)
= Xβ + σ12
Φ(Zγ)
Both the two-step and ML approaches are used for this model. For the two-step approach, we fit
b = φ(Zb
a probit model to W , generate λ
γ )/Φ(Zb
γ ), and include this as a regressor. For example,
. * Heckman selection model 2-stage (by ’hand’)
.
. probit works married children education age
Probit estimates
Number of obs
LR chi2(4)
Prob > chi2
Pseudo R2
Log likelihood = -1027.0616
=
=
=
=
2000
478.32
0.0000
0.1889
-----------------------------------------------------------------------------works |
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------married |
.4308575
.074208
5.81
0.000
.2854125
.5763025
children |
.4473249
.0287417
15.56
0.000
.3909922
.5036576
education |
.0583645
.0109742
5.32
0.000
.0368555
.0798735
age |
.0347211
.0042293
8.21
0.000
.0264318
.0430105
_cons | -2.467365
.1925635
-12.81
0.000
-2.844782
-2.089948
-----------------------------------------------------------------------------. predict Zgamma, xb
3
Multivariate normality is tractable. Other joint distributions are less convenient.
4
. gen Invmills = normden(Zgamma)/norm(Zgamma)
. gen Selhazard = normden(Zgamma)/( 1 - norm(Zgamma) )
. regress wage education age Invmills
Source |
SS
df
MS
-------------+-----------------------------Model | 14904.6806
3 4968.22688
Residual |
38450.214 1339 28.7156191
-------------+-----------------------------Total | 53354.8946 1342 39.7577456
Number of obs
F( 3, 1339)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
=
1343
173.01
0.0000
0.2793
0.2777
5.3587
-----------------------------------------------------------------------------wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------education |
.9825259
.0504982
19.46
0.000
.8834616
1.08159
age |
.2118695
.0206636
10.25
0.000
.171333
.252406
Invmills |
4.001616
.5771027
6.93
0.000
2.869492
5.133739
_cons |
.7340391
1.166214
0.63
0.529
-1.553766
3.021844
------------------------------------------------------------------------------
We can compare these results to a “naive” regression for LF participants.
. regress wage education age
-----------------------------------------------------------------------------wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------education |
.8965829
.0498061
18.00
0.000
.7988765
.9942893
age |
.1465739
.0187135
7.83
0.000
.109863
.1832848
_cons |
6.084875
.8896182
6.84
0.000
4.339679
7.830071
------------------------------------------------------------------------------
It is clear that the naive regression will underestimate the effects of age and education. Using
the results from the two-step procedure, we can retrieve the correlation between u1 and u2 . In the
b σu1 = 4.00/5.36 = 0.75. Similarly, σ
b We will get
probit model σu2 = 1, so ρb = λ/b
b12 = ρbσ
bu1 = λ.
identical results with the “canned” two step procedure in Stata. However, the standard errors will
be adjusted for the joint estimation of the two equations. In most cases, it is easiest to use the
The default ML estimator. However, it is always a good idea to consider two-step estimators
when full-information maximum likelihood solutions are not available. That is, there is
considerable added value in thinking statistically about joint processes rather than relying on
canned or convenient solutions. The ML solution below is not too different from the two-stage
results presented earlier.
. heckman wage education age, select(works = married children education age)
Heckman selection model
(regression model with sample selection)
Log likelihood = -5178.304
5
Number of obs
Censored obs
Uncensored obs
=
=
=
2000
657
1343
Wald chi2(2)
Prob > chi2
=
=
508.44
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------wage
|
education |
.9899537
.0532565
18.59
0.000
.8855729
1.094334
age |
.2131294
.0206031
10.34
0.000
.1727481
.2535108
_cons |
.4857752
1.077037
0.45
0.652
-1.625179
2.59673
-------------+---------------------------------------------------------------works
|
married |
.4451721
.0673954
6.61
0.000
.3130794
.5772647
children |
.4387068
.0277828
15.79
0.000
.3842534
.4931601
education |
.0557318
.0107349
5.19
0.000
.0346917
.0767718
age |
.0365098
.0041533
8.79
0.000
.0283694
.0446502
_cons | -2.491015
.1893402
-13.16
0.000
-2.862115
-2.119915
-------------+---------------------------------------------------------------/athrho |
.8742086
.1014225
8.62
0.000
.6754241
1.072993
/lnsigma |
1.792559
.027598
64.95
0.000
1.738468
1.84665
-------------+---------------------------------------------------------------rho |
.7035061
.0512264
.5885365
.7905862
sigma |
6.004797
.1657202
5.68862
6.338548
lambda |
4.224412
.3992265
3.441942
5.006881
-----------------------------------------------------------------------------LR test of indep. eqns. (rho = 0):
chi2(1) =
61.20
Prob > chi2 = 0.0000
Switching Regression Models
The switching regressions model is a variant of the classical Heckman selection model. Suppose
that instead of observing a truncated distribution for Y , we observe Y1 and Y0 from two truncated
distributions. Let Z ∗ denote a social position, status, or other characteristic. Assume, for now
that this is a latent variable with the following index function:
Z ∗ = Xγ + v,
Z = 1 if Z ∗ > 0,
Z = 0 otherwise ,
which can be estimated using models for binary data. Then suppose that we observe the following:
Y1 = X1 β1 + u1
if Z = 1
Y0 = X0 β0 + u0
if Z = 0
and
That is, we observe Y1 when Z = 1, in which case Y0 is unobserved, latent, or missing. Similarly,
we observe Y0 when Z = 0, in which case Y1 is missing. Note that in practice, we observe sample
respondents in only one state Z = 1 or Z = 0. In the experimental literature, there is the notion
that those assigned to status 1 are identical to those assigned to status 0, so that there is an
interchangeability across statuses. It then makes sense to ask what the outcome would be if a
respondent in status 1 were assigned to status 0 and vice versa.
The model presented above is referred to as an endogenous switching regressions model. It is
used to address issues of self selection and the estimation of treatment effects when there is
nonrandom allocation of subjects to treatment and control groups as is generally the case with
observational (as opposed to experimental) data.
6
Example
Suppose that a vocational/academic training program has two tracks or specializations consisting
of demography and theory. Students select into the tracks based on the evaluations of teachers
(grades in theory and demography) in addition to their own choices (past courses), as well as
tastes and preferences for the subject matter.
The probability that a student enters a particular track is not necessarily independent of the
expected outcome in that track due to the “self-selection” problem. That is, selection into track is
not determined by a random mechanism as it would be in a controlled experiment. It may be the
case that the choice is not simply a function of observed characteristics and that random factors
affect the choice of track and the eventual outcome in the track. If we were to compare differences
in outcomes (taken to be earnings in this case) between members of both tracks via a simple
difference in means, part of this difference may be due to selectivity, making it difficult to know
the “true” effect of the specialization (or track) on the outcome. To summarize, conventional
models ignore the possibility that common unmeasured factors affect track assignment and the
rewards accruing to individuals in chosen specializations. Individuals may enter the track based
on expected or perceived rewards. From a statistical standpoint, this means that the errors in Z
are correlated with the errors in Y1 and Y0 .
Model Specification
Let the regression function for placement into the demography specialty be
Z ∗ = Xγ + v,
where Z = 1 if the individual chooses demography, Z = 0 if the individual chooses theory. Let Y1
be the rewards or earnings for demographers and let Y0 be the earnings for theorists. Assume, for
simplicity, that the effects of exogenous variables (prestige of university, demand and supply
factors, etc.) on earnings in each field are equal yielding the following earnings functions:
Y1 = α1 + Xβ + u1
if Z = 1 demographer
and
Y0 = α0 + Xβ + u0
if Z = 0 theorist
Because, of the selection problem (the failure to observe Y0 when Z = 1 and the failure to observe
Y1 when Z = 0), we need to write these outcomes in a selection-equation format, much like the
Heckman model. Taking expectations of the outcome equations, we can find the expected
earnings for a demographer who self-selected into demography.
E(Y1 |Z = 1) = E(Y1 |Z ∗ > 0) = E(Y1 |Xγ + v > 0)
= E(Y1 |v > −Xγ)
= α1 + Xβ + E(u1 |v < Xγ)
φ(Xγ)
= α1 + Xβ + σ1v
,
Φ(Xγ)
7
which follows due to the truncation of the distribution of Y1 from below. Similarly, the expected
earnings for a theorist who self-selects into theory is
E(Y0 |Z = 0) = E(Y0 |Z ∗ < 0) = E(Y0 |Xγ + v < 0)
= E(Y0 |v < −Xγ)
= α0 + Xβ + E(u0 |v > Xγ)
φ(Xγ)
= α0 + Xβ − σ0v
,
1 − Φ(Xγ)
which follows from truncation of Y0 ’s distribution from above. You should recognize the
additional model terms as the inverse Mills ratio and the selection hazard rate. For two-stage
estimation of this model, we substitute these expressions (which are based on a first stage probit
model) into the second stage regression model. It should be noted that under this specification,
the outcome becomes two variables for each individual. Of course, we typically only observe the
outcome associated with the position (or track) the individual actually enters. However, using
this formulation, we can consider the possibility that both the observed outcomes and the
hypothetical outcomes that individuals would have if they entered different tracks would affect
their assignment to tracks. Direct ML is an alternative to the two-stage approach. Letting
αj + Xβ = Xβj for j = 0, 1, the log likelihood for this model is,
X
Y
1
log(2πσ1v ) −
log L =
log{Φ(η1 )} −
2
Z=1
X
1
Y
+
log{Φ(η0 )} −
log(2πσ0v ) −
2
Z=0
where
η1 =
Xγ + (Y − Xβ1 )σ1v /σ12
p
1 − ρ21v
η0 =
Xγ + (Y − Xβ0 )σ0v /σ02
p
1 − ρ20v
and
Note that: ρjv =
− Xβ1
σ1
− Xβ0
σ0
2 2 ,
σjv
, for j = 0, 1.
σj
Interpreting the Covariance Parameters
The covariance parameters tell the direction and magnitude of the selection effects.
• Case 1— If σ0v < 0 and σ1v > 0, this is a case of positive selection into both tracks. This
implies that the mean income of those who enter demography is greater than the population
average for demographers and the mean income of those entering theory is higher than the
population mean income of theorists. This implies that if demographers were to be placed
in (or assigned to) the theory track, they would have lower incomes than those who entered
the theory track. This pattern of covariances is consistent with the notion of two latent
dimensions along which individuals can be evaluated, each of which is best suited to one
of the positions. In summary, those who enter either program tend to do better than the
population average in the respective groups (positive selection).
8
• Case 2— If σ0v > 0 and σ1v > 0, this is a case where the mean income of those who choose
demography is greater than the population average for demographers, but the mean income
for those who choose theory is less than the population average for theorists. Those who
choose demography are better than average in both demography and theory—but better in
demography than theory. Those who choose theory are below average in both demography
and theory, but they are better in theory than demography. If persons who entered the
demography track were placed in the theory track, they would do better than the persons
who in fact entered the theory track. This is a case of positive selection into regime 1 and
negative selection into regime 0.
• Case 3— If σ0v < 0 and σ1v < 0, this mirrors case 2.
Self selectivity. Typically Case 2 occurs when σ1 is very large relative to σ0 . Individuals with
higher skills enter professions with the greater variance in earnings.
Selection and Treatment Effects
We can evaluate the treatment effect (the effects of tracking, program placement, specialization)
as the difference in conditional expectations
T (Z) = E(Y1 |X, Z = 1) − E(Y0 |X, Z = 0) = α1 − α0 ,
which is simplified by the constraints placed on β (i.e., that β0 = β1 . Due to the presence of the
additional model terms φ(·)/Φ(·) and φ(·)/[1 − Φ(·)], our estimates of α1 , α0 and β will likely be
different than they would be had we not controlled for selectivity. Thus, treatment effects would
be biased in the presence of nonrandom selection into regimes (also called endogeneity bias).
Example 1 Using Stata
We now show how to estimate this model using two-stage procedures and a way to trick the
Heckman model into estimating a switching regression model. We use the following Stata code to
generate the (artificial) data for this example.
clear
*log using switchreg.log, replace
set seed 12309
matrix m=(0,0,0)
matrix c=(1, .5, .5\.5,1,0\.5,0,1)
drawnorm u1 u2 u3, n(1000) means(m) corr(c)
* generate some x-variables
gen x1=uniform() - .50
gen x2=uniform() + .25
* generate latent regime selection equation
gen zstar = -1 + 1.0*x1 + 1.25*x2 + u1
* generate observed counterpart
gen zobs = (zstar > 0)
sum zobs
* generate y’s
gen y1s = 20 + 5.75*x1 + 5*u2 if zobs==1
gen y0s = 10 + 5.75*x1 + 4*u3 if zobs==0
* write this data for use by other programs
replace y1s = 0 if y1s == .
9
replace y0s = 0 if y0s == .
gen y = y1s+y0s
egen id=fill(1 2)
outfile id y x1 x2 zobs using swregsim.raw, replace
We could estimate the model using this code. However, we may want to use the same data to
estimate the model using special-purpose software such as aML (Applied Maximum Likelihood),
which offers numerous advantages when estimating complex multi-process models. We will
provide details later. For now, we will estimate the endogenous switching model in Stata.
clear
infile id y x1 x2 zobs using swregsim.raw
gen y0s = y if zobs==0
gen y1s = y if zobs==1
gen z0 = zobs==0
gen z1 = zobs==1
probit z1 x1 x2
predict xb1, xb
gen m11 =
normden(xb1)/(norm(xb1))
gen m00 =
-normden(xb1)/(norm(-xb1))
* use usual formula for two-step Heckman model (twice)
regress y x1 m00 if zobs==0
regress y x1 m11 if zobs==1
The two-step procedure gives the following results.
. regress y x1 m00 if zobs==0
Source |
SS
df
MS
-------------+-----------------------------Model | 1609.68059
2 804.840293
Residual | 6471.80983
514 12.5910697
-------------+-----------------------------Total | 8081.49041
516 15.6618031
Number of obs
F( 2,
514)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
=
517
63.92
0.0000
0.1992
0.1961
3.5484
-----------------------------------------------------------------------------y |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------x1 |
5.413118
.6954651
7.78
0.000
4.046814
6.779422
m00 | -1.408397
.7325275
-1.92
0.055
-2.847513
.0307189
_cons |
10.5414
.5541568
19.02
0.000
9.452709
11.63009
-----------------------------------------------------------------------------. regress y x1 m11 if zobs==1
Source |
SS
df
MS
-------------+-----------------------------Model | 629.518624
2 314.759312
Residual | 9159.29437
480 19.0818633
-------------+-----------------------------Total |
9788.813
482 20.3087407
10
Number of obs
F( 2,
480)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
=
483
16.50
0.0000
0.0643
0.0604
4.3683
-----------------------------------------------------------------------------y |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------x1 |
5.035281
.8915633
5.65
0.000
3.283431
6.78713
m11 |
2.429054
.9070122
2.68
0.008
.6468493
4.21126
_cons |
20.25713
.7405297
27.35
0.000
18.80205
21.71221
------------------------------------------------------------------------------
Estimates of Selection Effects. Letting λ1 = φ(Xβ)/Φ(Xβ) and λ0 = −φ(Xβ)/Φ(−Xβ), we
can retrieve the covariances and correlations as follows:
b0 σ
σ
bv0 = λ
b0 = (−1.41)(3.55) = −5.0
b
σ
bv1 = λ1 σ
b1 = (2.43)(4.37) = 10.6
ρbv0 = −0.40
ρbv1 = 0.56
Thus, we find the case (Case 1) of positive selection into both tracks. The ML solution that
uses the heckman procedure twice is similar. However, we must use a different probit model for
each estimation (unlike the two-step approach used above). As a result, we will have to change
the sign of ρ in the Y0 equation when interpreting the results. Including the first stage probit
model twice will have some effect on the estimates of the selection parameters. In either case, it is
not possible to constrain the effects of X to be the same across separate models. However, it is
not difficult to write a Stata program to do this, as we will see later.
* Trick Heckman: Because it uses only phi/PHI, we need to
* change signs of effects involving z0 (including rhoV0)
*
* change sign for rho and Z0 terms
heckman y x1, select(z0 = x1 x2)
* OK
heckman y x1, select(z1 = x1 x2)
Results.
Heckman selection model
(regression model with sample selection)
Log likelihood = -2013.432
Number of obs
Censored obs
Uncensored obs
=
=
=
1000
483
517
Wald chi2(1)
Prob > chi2
=
=
69.13
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------y0s
|
x1 |
5.561111
.6688425
8.31
0.000
4.250204
6.872018
_cons |
10.7338
.4810195
22.31
0.000
9.79102
11.67658
-------------+---------------------------------------------------------------z0
|
x1 | -1.042018
.1448909
-7.19
0.000
-1.325999
-.7580367
x2 | -1.293231
.1446665
-8.94
0.000
-1.576772
-1.009689
_cons |
1.030022
.1175535
8.76
0.000
.7996216
1.260423
-------------+---------------------------------------------------------------/athrho |
.3244194
.1790397
1.81
0.070
-.0264918
.6753307
/lnsigma |
1.29397
.0433012
29.88
0.000
1.209101
1.378839
11
-------------+---------------------------------------------------------------rho |
.3134975
.1614435
-.0264857
.5884755
sigma |
3.647238
.1579296
3.350473
3.970289
lambda |
1.1434
.6243263
-.0802569
2.367057
-----------------------------------------------------------------------------LR test of indep. eqns. (rho = 0):
chi2(1) =
3.12
Prob > chi2 = 0.0774
-----------------------------------------------------------------------------. heckman y1s x1, select(z1 = x1 x2)
Heckman selection model
(regression model with sample selection)
Log likelihood = -2021.482
Number of obs
Censored obs
Uncensored obs
=
=
=
1000
517
483
Wald chi2(1)
Prob > chi2
=
=
33.71
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------y1s
|
x1 |
5.2987
.9126277
5.81
0.000
3.509982
7.087417
_cons |
19.93652
.7172913
27.79
0.000
18.53065
21.34238
-------------+---------------------------------------------------------------z1
|
x1 |
1.03737
.1439349
7.21
0.000
.7552631
1.319477
x2 |
1.279762
.1446453
8.85
0.000
.9962629
1.563262
_cons |
-1.01894
.1172908
-8.69
0.000
-1.248825
-.7890542
-------------+---------------------------------------------------------------/athrho |
.6657136
.2135947
3.12
0.002
.2470756
1.084352
/lnsigma |
1.581906
.0678032
23.33
0.000
1.449014
1.714798
-------------+---------------------------------------------------------------rho |
.5821532
.141207
.2421677
.7948073
sigma |
4.864216
.3298095
4.258912
5.555551
lambda |
2.831719
.8606427
1.144891
4.518548
-----------------------------------------------------------------------------LR test of indep. eqns. (rho = 0):
chi2(1) =
8.68
Prob > chi2 = 0.0032
A more convenient approach is to use the add on procedure movestay, written by M. Lokshin
(DECRG, The World Bank) and Z. Sajaya (Stanford University), which was designed specifically
for this type of endogenous switching regressions model.
.movestay (y = x1)(y = x1), select(zobs=x1 x2)
Endogenous switching regression model
Number of obs
Wald chi2(1)
Prob > chi2
Log likelihood = -3408.6387
=
=
=
1000
33.69
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------y_1
|
x1 |
5.281593
.9098828
5.80
0.000
3.498255
7.06493
_cons |
19.94938
.7152708
27.89
0.000
18.54747
21.35128
12
-------------+---------------------------------------------------------------y_0
|
x1 |
5.564055
.6678406
8.33
0.000
4.255112
6.872999
_cons |
10.73037
.4826484
22.23
0.000
9.784397
11.67634
-------------+---------------------------------------------------------------zobs
|
x1 |
1.033465
.1439562
7.18
0.000
.7513159
1.315614
x2 |
1.288041
.1437439
8.96
0.000
1.006308
1.569774
_cons | -1.024566
.1167285
-8.78
0.000
-1.253349
-.7957818
-------------+----------------------------------------------------------------------------+---------------------------------------------------------------sigma_1 |
4.859087
.3278679
4.257157
5.546125
sigma_2 |
3.647968
.1584643
3.350238
3.972157
rho_1 |
.5799307
.1413935
.2400231
.7931687
rho_2 | -.3143474
.1617676
-.589687
.0264554
-----------------------------------------------------------------------------LR test of indep. eqns. :
chi2(1) =
11.80
Prob > chi2 = 0.0006
------------------------------------------------------------------------------
We note that the results are very similar to running heckman twice. However, this approach uses
full information maximum likelihood so the standard errors are correct. A drawback with either
of these approaches is that it is not possible to constrain covariate effects to be the same across
the different regimes (tracks). For example, we have no compelling reason to assume that the
effects of X are different for theorists and demographers. To invoke constrained estimation, we
need to modify the movestay routine to allow this. It is convenient to write a new program swreg
as a single call to ML as follows:
cap prog drop swreg
prog def swreg
version 8
*set trace on
args todo b lnf
#delimit ;
tempname xb1 xb2 xb3 sig1 sig2 rho1 rho2 rr1 rr2;
mleval ‘xb1’
=‘b’, eq(1);
mleval ‘xb2’
=‘b’, eq(2);
mleval ‘xb3’
=‘b’, eq(3);
mleval ‘sig1’
=‘b’, eq(4) scalar;
mleval ‘sig2’
=‘b’, eq(5) scalar;
mleval ‘rho1’
=‘b’, eq(6) scalar;
mleval ‘rho2’
=‘b’, eq(7) scalar;
scalar ‘rr1’
= 1/sqrt(1-‘rho1’^2);
scalar ‘rr2’
= 1/sqrt(1-‘rho2’^2);
tempvar eta1 eta2 lf1 eps1 eps2;
qui generate double ‘eta1’ = (‘xb3’ + ($ML_y1 - ‘xb1’) * ‘rho1’/‘sig1’)*‘rr1’;
qui replace ‘eta1’=-37 if (‘eta1’<-37);
qui generate double ‘eta2’ = (‘xb3’ + ($ML_y2 - ‘xb2’) * ‘rho2’/‘sig2’)*‘rr2’;
qui replace ‘eta2’= 37 if (‘eta2’> 37);
qui generate double ‘eps1’ = $ML_y1 - ‘xb1’;
qui generate double ‘eps2’ = $ML_y2 - ‘xb2’;
tempname const1 const2;
scalar const1=0.5*ln(2*_pi*‘sig1’^2);
scalar const2=0.5*ln(2*_pi*‘sig2’^2);
13
qui
mlsum ‘lnf’ = cond($ML_y3==1, ln(norm( ‘eta1’))-const1-0.5*(‘eps1’/‘sig1’)^2,
ln(norm(- ‘eta2’))-const2-0.5*(‘eps2’/‘sig2’)^2);
if (‘todo’==0 | ‘lnf’==.) exit;
#delimit cr
end
*
* Get starting values for b1, b2, b3, sigma1, and sigma2.
*
qui probit zobs x1 x2
matrix b3 = e(b)
predict xg, xb
qui gen lam1 = normden(xg)/norm(xg)
qui gen lam2 = -normden(xg)/norm(-xg)
qui reg y x1 if zobs==1
matrix b1 = e(b)
scalar s1 = e(rmse)
qui reg y x1 if zobs==0
matrix b2 = e(b)
scalar s2 = e(rmse)
*
* Obtain starting values for rho1 & rho2. [optional]
*
qui reg y x1 lam1 if zobs==1
scalar rho1 = _b[lam1]/s1
qui reg y x1 lam2 if zobs==0
scalar rho2 = _b[lam2]/s2
matrix b = (b1, b2, b3, s1, s2, rho1, rho2)
cons def 1 [y1]x1 = [y0]x1
ml model d0 swreg (y1: y = x1)(y0: y = x1)(z:zobs=x1 x2) ///
(sigma1:)(sigma2:)(rho1:)(rho2:), const(1)
ml init b, copy
ml maximize
We can now estimate the model with any number of constraints on the parameters. However, I
have made some modifications which, while offering some flexibility, will not guarantee that the
σ’s and ρ’s will not be “out of bounds.” This means that we will require good starting values
based on previous regressions as shown above. It is always a good idea to compare the results of
complicated models to those from alternative approaches. A comparison is provided below.
Comparison of Results.
Heckman Twice
movestay
swreg
aML
---------------------------------------------------------------------------------------x0
5.41 (0.695)
5.57 (0.668)
5.46 (0.526)
5.45 (0.546)
x1
5.04 (0.892)
5.28 (0.910)
(constrained)
(constrained)
constant0
10.54 (0.554)
10.73 (0.483)
10.69 (0.453)
10.68 (0.505)
constant1
20.26 (0.740)
19.95 (0.715)
19.86 (0.610)
19.87 (0.601)
sigma0
3.55
3.86 (0.159)
3.66 (0.157)
3.66 (0.167)
sigma1
4.36
4.86 (0.328)
4.89 (0.297)
4.89 (0.296)
rho0
-0.40
-0.31 (0.162)
-0.33 (0.151)
-0.33 (0.168)
rho1
0.56
0.58 (0.141)
0.60 (0.120)
0.60 (0.121)
14
Example 2 using aML
It is best to use full information ML on this problem. Outside of the movestay and the
custom-built ML routine above (swreg), there are only a couple of canned programs that can
estimate this kind of model. The most user-friendly of these special-purpose programs is aML
(Applied Maximum Likelihood), which is a general purpose package for models with mixed
structures (as well as multiple levels). This is a free package written by people at the Rand
corporation. I encourage everyone to learn it, or at least become familiar with the idea of
multi-process models. It can do everything from structural equation modeling to multilevel
models, and is the only package to allow users to mix both of these kinds of models. For more
information see www.applied-ml.com.
Below is the setup for an endogenous switching regression using aML. The script and the
results are shown. Note that the setup requires some additional effort on the part of the user to
provide details about the correct joint (trivariate normal) distribution for the error terms and the
starting values. The results are similar to those reported above, with minor differences due to the
fact that we have constrained the effect of X1 to be the same across regimes. Note that the aML
results compare favorably to the swreg routine that I wrote, which is not surprising since the same
parameter constraints are used by both programs. Interpretation of the error correlations is more
straightforward in ML as these measures reflect the direct association between Y and Pr(Z ∗ > 0).
dsn = swregsim.dat;
define regressor set alphaX; var= 1 x1 x2;
define regressor set betaX; var= x1; define par mu0; define par mu1;
define normal distribution; dim = 3; number of integration points = 15;
name = v; name = u0; name = u1;
probit model;
outcome=z; model = regset alphaX + res(draw=1,ref=v);
continuous model; keep if z==0;
outcome=y; model = par mu0 + regset betaX +
res(draw=1,ref=u0);
continuous model; keep if z==1;
outcome=y; model = par mu1 + regset betaX + res(draw=1,ref=u1);
starting values;
Zcons
T
-1.
Zx1
T
1.0
Zx2
T
1.2
Yx1
T
5
mu0
T
10.213598521
mu1
T
19.539468801
sigV
F
1.0
sigU0
T
3.9396435831
sigU1
T
4.6505359691
rhoVU0
T
0
rhoVU1
T
0
rhoU0U1
F
0.0;
We can compare the results to a model that constrains the selection effects. For this, we simply
modify the script above and constrain the ρ terms to zero.
15
Zcons
Zx1
Zx2
Yx1
mu0
mu1
sigV
sigU0
sigU1
rhoVU0
swregsim0
-1.4434 ***
(0.1703)
1.4746 ***
(0.2079)
1.8103 ***
(0.2102)
5.2060 ***
(0.4319)
11.5075 ***
(0.1580)
22.0471 ***
(0.2056)
1.0000
swregsim
-1.0195 ***
(0.1164)
1.0420 ***
(0.1400)
1.2812 ***
(0.1438)
5.4452 ***
(0.5475)
10.6783 ***-------|
(0.5053)
|----Compare Differences
19.8692 ***-------|
(0.6010)
1.0000
3.5632 ***
(0.1123)
4.4109 ***
(0.1450)
0.0000
3.6594 ***
(0.1666)
4.8919 ***
(0.2956)
-0.3304 **
(0.1683)
0.5950 ***
(0.1206)
0.0000
rhoVU1
0.0000
rhoU0U1
0.0000
ln-L
-3418.90
-3408.67
Here (as in all the other models) we find that taking account of self-selection results in a smaller
estimated track effect (b
α1 − α
b0 ), which is 10.55 in the conventional model and 9.19 in the
endogenous switching model. Thus, we find that earnings differences are overstated in the
conventional model by about 1.4 units.
Next we examine how we might evaluate the expected earnings if individuals were assigned to
statuses other than the ones they in fact entered. This might be useful from a policy standpoint.
For example, if a sudden paradigm shift makes all past theory obsolete, theorists would be forced
to become demographers (as opposed to cab drivers). The expected outcome for demographers if
they would have entered theory is:
E(Y1 |Z ∗ < 0) = α1 + Xβ1 + E(u1 |v > Xγ)
φ(Xγ)
.
= α1 + Xβ1 − σ1v
1 − Φ(Xγ)
Similarly, the expected outcome for theorists had they entered demography would be
E(Y0 |Z ∗ > 0) = α0 + Xβ0 + E(u0 |v < Xγ)
φ(Xγ)
.
= α0 + Xβ0 + σ0v
Φ(Xγ)
Note that for the general model, we can allow X to vary by track as is the default in procedures
like heckman-twice and movestay. However, for simplicity, and without a compelling reason to
expect that returns to X differ by group, let us assume that respondents do not differ except with
regard to mean earnings, that is, both the endowments (X) and returns to X (β) are identical for
a hypothetical demographer and theorist. Then the only difference is between the means, which
16
are 20 income units for demographers and 10 income units for theorists. Assume, for convenience,
that Xγ = 1, σ1v = 5 and σ0v = −4, which are similar to our estimated values. Then
E(Y1 |Z ∗ > 0) = E(Y for demographer|in demography track) = 21.44
E(Y0 |Z ∗ < 0) = E(Y for theorist|in theory track) = 16.10
E(Y0 |Z ∗ > 0) = E(Y for theorist|in demography track) = 8.85
E(Y1 |Z ∗ < 0) = E(Y for demographer|in theory track) = 12.37
Verify these results as an exercise.
Example Using Real Data from the NELS.
In this example a switching regression model is used to assess the effect of becoming sexually
active prior to the senior year in high school on 12th-grade math achievement test scores using
the NELS data. About 70% of the sample had sex before 12th grade. We wish to examine
possible endogeneity of first sex (i.e., common unmeasured factors affect sexual initiation and
academic achievement) and measure the “treatment” effect controlling for differential selectivity
into “virgin” and “nonvirgin” statuses.
In this case, the selection process is best modeled using a hazard model for the risk of first sex
at age t, which is the kind of model that I discuss in another course (SOC386L). However, to be
consistent with the previous discussion, we will use a probit model instead. This involves a probit
selection equation with latent outcome Z ∗ and achievement outcome (math test score) Y .
Z ∗ = Xα + v,
Z = 1 if Z ∗ > 0, 0 otherwise
Y0 = µ0 + Xβ + u0 ,
if virgin
Y1 = µ1 + Xβ + u1 ,
if nonvirgin
where v, u1 , and u0 follow a trivariate normal distribution. Note that cov(u1 , u0 ) = 0 because a
person can’t be observed in both regimes (virgin or nonvirgin). The sign and magnitude of
cov(v, u1 ) and cov(v, u0 ) give the magnitude and direction of the selection bias and µ1 − µ0 gives
the treatment effect. Note that the heterogeneity component v in the first sex equation cannot be
identified because 1st sex happens only once and we have a single-level data structure. So we fix
the variance in heterogeneity in sexual initiation at 1 but model the correlation between random
components of both process. The sign and magnitude of cov(v, u1 ) and cov(v, u0 ) give the
magnitude and direction of the selection bias and µ1 − µ0 gives the treatment effect.
aML Model Code
The model is easiest to set up using specialized software (aML).
dsn = ach0.dat;
define regressor set AlphaX;
var = 1 black intact;
define regressor set Beta1X;
var = black cath pub urb faminc;
define par mu1;
define par mu2;
17
define normal distribution; dim=3;
number of integration points = 10;
name=v; name=u0; name=u1;
probit model;
outcome=(1-censor);
model = regset AlphaX + res(draw=1,ref=v);
continuous model; keep if censor==1; /* virgins */
outcome=y; model=par mu1 +
regset Beta1X + res(draw=1,ref=u0);
continuous model; keep if censor==0; /* nonvirgins */
outcome=y; model=par mu2 +
regset Beta1X + res(draw=1,ref=u1);
starting values;
Const
T
1.3419743976
Blk
T
.23864170709
Intact
T
-0.5956678435
Black
T
-6.0047121792
CathSch
T
-2.8252736269
PubSch
T
-5.5770334811
Urb
T
.24995452413
Inc
T
1.6443668192
Mu
0
T
41.045145274
Mu
1
T
37.437397243
sigv
T
1.0
sigu0
T
12.993118301
sigu1
T
12.871405018
r12
T
0.0
r13
T
0.0
r23
F
0.0
;
Results
Model ach_0
Model
ach_1
Probit(Nonvirgin)
Intercpt
1.5337 ***
1.5101 ***
(0.0639)
(0.0667)
Blk
0.2968 ***
0.3013 ***
(0.1038)
(0.1039)
Intact
-0.7107 ***
-0.6856 ***
(0.0706)
(0.0734)
sigv
1.0000
1.0000
-------------------------------------Regression (Test Score)
Black
-5.7804 ***
(0.6328)
-6.2059 ***
(0.6717)
18
CathSch
PubSch
Urb
Inc
Constant
Constant
sigu0
sigu1
r12
-2.8012 ***
(0.9320)
-5.6172 ***
(0.8032)
0.3078
(0.4574)
1.6290 ***
(0.0811)
41.2275 ***
(1.3009)
36.7348 ***
(1.2628)
12.9935 ***
(0.2817)
12.7756 ***
(0.1533)
0.0000
r13
0.0000
r23
0.0000
ln-L
-20446.85
-2.7273 ***
(0.9324)
-5.5940 ***
(0.8035)
0.2987
(0.4578)
1.6667 ***
(0.0831)
40.6370 *** Virgins
(2.0136)
37.9446 *** Nonvirgins
(1.3864)
12.9974 ***
(0.2822)
13.0431 ***
(0.2918)
-0.0188
(0.1268)
-0.4255 **
(0.1938)
0.0000
TREATMENT EFFECT
Difference
exog
endog
4.492
2.692
-20446.52
NOTE: Asymptotic standard errors in parentheses;
Significance: ’*’=10%; ’**’=5%; ’***’=1%.
NOTE: all standard errors are based on numerical Hessians.
Thus, we find that the gain in 12th grade math test scores from being a virgin are overstated by a
bit under two points under the assumption that first sex is exogenous. The estimate of the
correlation (r13) between v and u1 is negative and statistically different from 0, which implies
that nonvirgins have higher test scores than would a population of students who are assigned at
random to nonvirgin status.
Other Approaches
As we saw, the switching regressions approach is well-suited for estimating treatment effects
under endogenous (i.e., self-selected) assignment to treatment and control groups. Next, we
illustrate a simpler approach to estimating endogenous treatment effects.
You have probably thought that if we are interested in the effect track membership on
earnings, why not simply include a dummy variable in our earnings model as,
Y = α + Xβ + Zδ + u
Of course, now you know that δ will not measure the value of a specialization in demography,
because, as we all know, a person who chooses the demography track will have high earnings
whether or not they chose demography as a specialty (due to innate abilities that we can dream
about, but cannot hope to measure). This means that our estimate of δ will overstate the true
treatment effect (of specializing in demography).
19
The appropriate model in this case is,
Z ∗ = Xγ + v
Y ∗ = α + Xβ + Zδ + u
We assume, as before, that u and v are bivariate normal with zero means and variance covariance
2
matrix
σv σuv
Σvu =
σu2
.
We can estimate this model using Stata’s treatreg procedure.
. treatreg y x1,treat(z1=x1 x2)
Treatment effects model -- MLE
Number of obs
=
1000
Log likelihood = -3429.6537
Wald chi2(2)
Prob > chi2
=
=
534.37
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------y
|
x1 |
5.230804
.5684312
9.20
0.000
4.1167
6.344909
z1 |
9.754676
.9276166
10.52
0.000
7.936581
11.57277
_cons |
11.88237
.4639237
25.61
0.000
10.97309
12.79164
-------------+---------------------------------------------------------------z1
|
x1 |
1.045804
.1447645
7.22
0.000
.7620707
1.329537
x2 |
1.286205
.1456226
8.83
0.000
1.00079
1.57162
_cons | -1.027875
.1182486
-8.69
0.000
-1.259638
-.7961118
-------------+---------------------------------------------------------------/athrho |
.1368113
.1480757
0.92
0.356
-.1534118
.4270344
/lnsigma |
1.389886
.0250241
55.54
0.000
1.34084
1.438933
-------------+---------------------------------------------------------------rho |
.1359641
.1453384
-.1522195
.40284
sigma |
4.014394
.1004568
3.822253
4.216194
lambda |
.5458133
.5897084
-.6099939
1.701621
-----------------------------------------------------------------------------LR test of indep. eqns. (rho = 0):
chi2(1) =
0.87
Prob > chi2 = 0.3512
------------------------------------------------------------------------------
Note that our estimate of the treatment effect is 9.75, which compares to an estimate of 10.57
from a naive regression. Heckman suggested a two-stage approach for this model, which is referred
to as the dummy endogenous regression model with structural shift. He suggested to obtain
predicted values for Pr(Z = 1|X) from a probit model, say pbZ , and include this as an additional
regressor in a regression model. Following this procedure for these data gives a treatment effect of
9.70. The quality of this estimate is likely to depend on the predictive ability of the Z equation
(which is too good because we simulated this data). Note that the estimate of ρ of 0.136 is not
significant, and differs from the two estimates we obtained using the switching regression
approach. This is because this single estimate is a weighted average of ρv0 and ρv1 , which have
20
opposite signs. For this reason it is informative about the general issue of endogeneity but not of
the direction of selection, except to suggest there is a slight self-selection effect into demography.
The set-up for this model in aML is straightforward, and gives results identical to treatreg.
However, it should be noted that aML will accommodate different types of outcome variables, not
simply continuous response variables.
dsn = swregsim.dat;
/* specify regressor set for regime */
define regressor set alphaX;
var= 1 x1 x2;
/* specify regressor set of outcome */
define regressor set betaX;
var= 1 x1 z;
/* set up equation error structure */
define normal distribution; dim = 2; number of integration points = 15;
name = v;
name = u;
probit model;
outcome=z;
model = regset alphaX + res(draw=1,ref=v);
continuous model;
outcome=y;
model = regset betaX +
res(draw=1,ref=u);
starting values;
Zcons T -1.0
Zx1
T 1.0
Zx2
T 1.2
mu0
T 10.213598521
Yx1
T 5
Yz
T 10
sigV F 1.0
sigU T 5
rhoVU T 0
;
======================================================================
=
ESTIMATION CONVERGED SUCCESSFULLY
=
=
RESULTS OF ESTIMATION
=
======================================================================
Log Likelihood: -3429.6545
BHHH-based, non-corrected
Parameter
Free?
Estimate
Std Err
T-statistic
1
2
3
4
5
Zcons
Zx1
Zx2
mu0
Yx1
T
T
T
T
T
-1.0267557313
1.044635091
1.2845931836
11.874803358
5.2317064567
.12061251389
.14656887929
.14862568326
.48747752746
.58004436879
21
-8.5128
7.1273
8.6431
24.3597
9.0195
6
7
8
9
Yz
sigV
sigU
rhoVU
T
F
T
T
9.761195825
1.0
4.0137859425
.13538399022
.95280957924
10.2446
.10137260868
.15030444659
39.5944
0.9007
======================================================================
Extensions to Binary Responses
So far we have considered continuous variables as the substantive outcomes. What if the outcome
is binary? In this case, we can use variants of bivariate probit models. The bivariate probit
consists of two probit equations with correlated errors
Z ∗ = Xα + u
Y ∗ = Xβ + v,
where u and v are distributed bivariate normal with 0 means and unit variances and correlation ρ.
A complication arises in the situation when Z (the realization of Z ∗ ) appears as an endogenous
regressor in the Y ∗ equation. In this case, we need identifying restrictions, meaning that some of
the X’s in the Z ∗ equation must not appear in the Y ∗ equation in order to strongly identify the
effect of Z. Consider the example
Z ∗ = α0 + α1 X1 + α2 X2 + u
Y ∗ = β0 + γZ + β1 X2 + v.
We can use ML or a two-stage approach. The two-stage approach uses a probit model for Z ∗ and
generates a predicted probability pZ that is used as a regressor in the Y ∗ equation.
. probit z x1 x2
Probit estimates
Number of obs
LR chi2(2)
Prob > chi2
Pseudo R2
Log likelihood = -617.56905
=
=
=
=
1000
149.04
0.0000
0.1077
-----------------------------------------------------------------------------z |
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------x1 |
1.079573
.1479664
7.30
0.000
.7895645
1.369582
x2 |
1.468242
.1482265
9.91
0.000
1.177724
1.758761
_cons | -1.174744
.1200381
-9.79
0.000
-1.410015
-.939474
-----------------------------------------------------------------------------. predict pz
(option p assumed; Pr(z))
. probit y z x2
Probit estimates
Number of obs
LR chi2(2)
Prob > chi2
Pseudo R2
Log likelihood = -338.77391
22
=
=
=
=
1000
693.33
0.0000
0.5058
-----------------------------------------------------------------------------y |
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------z |
2.545253
.1240325
20.52
0.000
2.302154
2.788352
x2 |
.3813336
.192569
1.98
0.048
.0039052
.758762
_cons | -1.106581
.1469428
-7.53
0.000
-1.394583
-.8185779
-----------------------------------------------------------------------------. probit y pz x2
Probit estimates
Number of obs
LR chi2(2)
Prob > chi2
Pseudo R2
Log likelihood = -640.96636
=
=
=
=
1000
88.95
0.0000
0.0649
-----------------------------------------------------------------------------y |
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------pz |
1.362025
.3685597
3.70
0.000
.6396615
2.084389
x2 |
.5274081
.2360944
2.23
0.025
.0646715
.9901446
_cons | -.8794941
.1194334
-7.36
0.000
-1.113579
-.645409
------------------------------------------------------------------------------
ML Estimation of the Bivariate Probit. For full ML, we need to construct the likelihood
function from all possible outcomes. Define:
P11 = Pr(Z = 1, Y = 1) = F (αX, γ + βX, ρ)
P10 = Pr(Z = 1, Y = 0) = F (αX, −γ − βX, −ρ)
P01 = Pr(Z = 0, Y = 1) = F (−αX, βX, −ρ)
P00 = Pr(Z = 0, Y = 0) = F (−αX, −γ − βX, ρ),
where F (·, ·, ρ) is the standard bivariate normal cdf. Then,
n X
log L =
ZY log P11 + Z(1 − Y ) log P10 + (1 − Z)Y log P01 + (1 − Z)(1 − Y ) log P00 .
i=1
. biprobit (z x1 x2) (y z x2)
Seemingly unrelated bivariate probit
Log likelihood =
Number of obs
Wald chi2(4)
Prob > chi2
-953.4767
=
=
=
1000
162.45
0.0000
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------z
|
x1 |
1.082444
.1475838
7.33
0.000
.7931849
1.371703
x2 |
1.467911
.1479426
9.92
0.000
1.177949
1.757873
_cons | -1.172706
.1198119
-9.79
0.000
-1.407533
-.9378794
-------------+---------------------------------------------------------------y
|
23
z |
1.4282
.4488288
3.18
0.001
.548512
2.307888
x2 |
.8401728
.2299791
3.65
0.000
.3894219
1.290924
_cons | -.9710812
.1523623
-6.37
0.000
-1.269706
-.6724566
-------------+---------------------------------------------------------------/athrho |
.6271636
.2447731
2.56
0.010
.1474171
1.10691
-------------+---------------------------------------------------------------rho |
.556096
.1690788
.1463584
.8029675
-----------------------------------------------------------------------------Likelihood-ratio test of rho=0:
chi2(1) = 5.73253
Prob > chi2 = 0.0167
Extension to Switching Regressions. This model can be readily adapted to a switching
regressions context. Let the regime equation be
Z ∗ = Xα + v
and the structural equations,
Y0∗ = Xβ0 + u0
if Z = 0
Y1∗ = Xβ1 + u1
if Z = 1
and
Define:
P1 = Pr(Z = 0, Y0 = 1) = F (−Xα, Xβ0 ; −ρ0 )
P2 = Pr(Z = 0, Y0 = 0) = F (−Xα, −Xβ0 ; ρ0 )
P3 = Pr(Z = 1, Y1 = 1) = F (Xα, Xβ1 ; ρ1 )
P4 = Pr(Z = 1, Y1 = 0) = F (Xα, −Xβ1 ; −ρ1 , )
where F (·, ·, ρ) is the standard bivariate normal cdf. Then,
n X
log L =
(1 − Z){Y0 log P1 + (1 − Y0 ) log P2 } + Z{Y1 log P3 + (1 − Y1 ) log P4 } .
i=1
This model cannot be estimated directly in Stata. However, it is not too difficult to program.
Here we define a new procedure called swbvprob for switching bivariate probit regressions.
cap prog drop swbvprob
prog def swbvprob
*set trace on
args todo b lnf
tempvar xb0 xb1 xg rho0 rho1 y0 y1 z lf
mleval ‘xb0’ = ‘b’, eq(1)
mleval ‘xb1’ = ‘b’, eq(2)
mleval ‘xg’ = ‘b’, eq(3)
mleval ‘rho0’ = ‘b’, eq(4)
mleval ‘rho1’ = ‘b’, eq(5)
* scalar ‘rho0’ = 0
* scalar ‘rho1’ = 0
local y0
"$ML_y1"
local y1
"$ML_y2"
local z
"$ML_y3"
qui gen double ‘lf’ = binorm(‘xb0’,-‘xg’, -‘rho0’) if (‘y0’==1 & ‘z’==0)
qui replace ‘lf’ = binorm(-‘xb0’,-‘xg’, ‘rho0’) if (‘y0’==0 & ‘z’==0)
qui replace ‘lf’ = binorm(‘xb1’,‘xg’, ‘rho1’) if (‘y1’==1 & ‘z’==1)
24
qui replace ‘lf’ = binorm(-‘xb1’,‘xg’, -‘rho1’) if (‘y1’==0 & ‘z’==1)
qui mlsum ‘lnf’ = log(‘lf’)
end
*
* run program
*
probit y0 x1
matrix b1 = e(b)
probit y1 x1
matrix b2 = e(b)
probit zobs x1 x2
matrix b3 = e(b)
matrix b = (b1,b2,b3,0,0)
cons def 1 [eq1]x1=[eq2]x1
ml model d0 swbvprob (eq1: y0 = x1)(eq2: y1 = x1) (eq3:zobs=x1 x2) (eq4:) (eq5:), constr(1)
ml init .83 1.65 .83 1.11 1.0 1.3 -1 .2 .5, copy
ml search eq4
0 0.3
ml search eq5
0 0.6
ml maximize
This routine requires good starting values. After a few iterations, we get the following estimates.
Number of obs
Wald chi2(0)
Prob > chi2
Log likelihood = -840.17903
=
=
=
1000
.
.
( 1) [eq1]x1 - [eq2]x1 = 0
-----------------------------------------------------------------------------|
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------------eq1
|
x1 |
.8274907
.2589903
3.20
0.001
.3198791
1.335102
_cons |
1.648328
.2526571
6.52
0.000
1.153129
2.143527
-------------+---------------------------------------------------------------eq2
|
x1 |
.8274907
.2589903
3.20
0.001
.3198791
1.335102
_cons |
1.122869
.3617985
3.10
0.002
.4137572
1.831981
-------------+---------------------------------------------------------------eq3
|
x1 |
1.057045
.1444147
7.32
0.000
.7739977
1.340093
x2 |
1.2843
.1458508
8.81
0.000
.9984372
1.570162
_cons | -1.023984
.1183104
-8.66
0.000
-1.255869
-.7921003
-------------+---------------------------------------------------------------eq4
|
_cons |
.1915166
.4360708
0.44
0.661
-.6631665
1.0462
-------------+---------------------------------------------------------------eq5
|
_cons |
.5463507
.2407519
2.27
0.023
.0744857
1.018216
------------------------------------------------------------------------------
This model is much easier to estimate in aML. Below is the aML script to estimate this model.
Note that we need to fix the variances to 1 as required by the normalization of the probit model.
For this model, we constrain the slopes to be equal across regimes. Then the treatment effect can
be estimated as a difference in intercepts.
25
/* data set */
dsn = tvswregsim.dat;
/* specify regressor set for regime */
define regressor set alphaX;
var= 1 x1 x2;
/* specify regressor set of outcome */
define regressor set betaX;
var= x1;
define par mu0;
define par mu1;
define
name =
name =
name =
normal distribution; dim = 3; number of integration points = 15;
v;
u0;
u1;
probit model;
outcome=z;
model = regset alphaX + res(draw=1,ref=v);
probit model; keep if z==0;
outcome=y;
model = par mu0 + regset betaX +
res(draw=1,ref=u0);
probit model; keep if z==1;
outcome=y;
model = par mu1 + regset betaX + res(draw=1,ref=u1);
starting values;
Zcons
T
-1.0242318948
Zx1
T
1.0453925122
Zx2
T
1.284764978
Yx1
T
.63345352653
mu0
T
1.512191648
mu1
T
1.7127904538
sigV
F
1.0
sigU0
F
1.0
sigU1
F
1.0
rhoVU0
T
0.0
rhoVU1
T
0.0
Log Likelihood:
Parameter
1
2
3
4
5
6
Zcons
Zx1
Zx2
Yx1
mu0
mu1
-840.1798
Free?
Estimate
T
T
T
T
T
T
-1.0240495122
1.0566987099
1.2844424075
0.8319107477
1.6486932406
1.1107510248
BHHH-based, non-corrected
Std Err
T-statistic
.11971602174
0.1457274833
.14802401634
.27142110864
.23151806228
.44377621134
26
-8.5540
7.2512
8.6773
3.0650
7.1212
2.5030
7
8
9
10
11
12
sigV
sigU0
sigU1
rhoVU0
rhoVU1
rhoU0U1
F
F
F
T
T
F
1.0
1.0
1.0
.18989034371
.55400851566
0.0
.37980849053
.29657219387
0.5000
1.8680
Comparison of Results.
x1
constant0
constant1
rho1
rho2
swbvprob
0.827 (0.259)
1.648 (0.253)
1.123 (0.362)
0.192 (0.436)
0.546 (0.241)
aML
0.832 (0.271)
1.649 (0.232)
1.111 (0.444)
0.190 (0.380)
0.554 (0.297)
Appendix
Truncated and Censored Distributions. This material is taken almost verbatim from the
great book on limited dependent variables by G.S. Maddala (1983).4
Define F (y) to be the normal distribution function, which is also denoted by Φ(y). Let us also
define f (y) to be the normal density function, dF (y)/dy, which is also referred to as φ(y). For a
normal distribution with mean µ and variance σ, we scale the arguments of these functions, so
that
y−µ
F (y) → F
σ
and
1
f (y) → f
σ
y−µ
σ
.
Mean and Variance of the Truncated Normal Suppose that Z is a standard normal
variate and we consider the truncated distribution of Z > b. We say that the distribution of Z is
truncated from below at point b. This is depicted in the first panel of Figure 1.
Assuming standardized arguments, the mean and variance of the truncated distribution are
given by
ordinate at Z = b
f (b)
=
= M1
E(Z|Z > b) =
1 − F (b)
right-hand tail area
and
var(Z|Z > b) = 1 − M1 (M1 − b).
In the original metric,
E(Y |Y > µ + bσ) = µ + σM1
and
var(Y |Y > µ + bσ) = {1 − M1 (M1 − b)}σ 2 .
Suppose that Z is censored from above at point a, as depicted in the second panel of Figure 1.
Then,
−f (a)
E(Z|Z < a) =
= M2
F (a)
4
G.S. Maddala (1983) Limited Dependent and Qualitative Variables in Econometrics, New York:Cambridge Univ.
Press.
27
b
a
Z
Z
Figure 1: Left Truncation (from below at b) and Right Truncation (from above at a) of a Standard
Normal Distribution
and
var(Z|Z < a) = 1 − M2 (M2 − a)
The transformation back to the original metric (Y ) is the same as earlier.
If Y has a normal distribution with mean µ and variance σ 2 (instead of N(0,1)), then
substitute (Y − µ)/σ, (b − µ)/σ, and (a − µ)/σ, for Z, b, and a in the expressions above. It is also
possible to have double truncation (from above and below).
Use of the Quantities in Modeling. These quantities are used extensively in selection and
switching regression models. For example, we often need to know certain conditional expectations
of the form,
E(u1 |u2 < c)
and
E(u1 |u2 > c),
where u1 and u2 are distributed bivariate normal with mean 0, unit variances, and covariance σ12 .
Note that E(u1 |u2 ) = σ12 u2 . Thus,
E(u1 |u2 < c) = σ12 E(u2 |u2 < c) = −σ12
and
E(u1 |u2 > c) = σ12 E(u2 |u2 > c) = σ12
f (c)
F (c)
f (c)
1 − F (c)
Simulation Checks. You can do a simple simulation to check the analytic expressions against
some test data. Here we find the means for various truncated distributions
# verifying the properties of truncated distributions
# simulate normal with mean=100 std. dev=20
mu
<- 100
sigma <- 20
n <- 30000
28
y <- rnorm(n,mu,sigma)
# suppose that distribution is truncated from below at B and we want E(Y|Y>B)
B <- 80
# standardize
b <- (B-mu)/sigma
# result will be in terms of standardized Y
EY.gt80 <- dnorm(b)/(1-pnorm(b))
# scale back to original
mu + EY.gt80*sigma
# empirically
mean(y[y>B])
# suppose that the distribution is truncated above at A and we want E(Y|Y<A).
A <- 110
# standardize
a <- (A-mu)/sigma
EY.lt110 <- -dnorm(a)/pnorm(a)
mu + EY.lt110*sigma
mean(y[y<A])
Adapt this program to check the expressions for the variances against the simulated data. For one
simulation from a normal distribution with mean 100 and standard deviation 20, I get
E(Y |Y > 80) = 105.75 analytically, and 105.73 based on the simulated distribution, with variance
251.87 analytically and 252.49 (simulated).
29