A Penalized Likelihood Estimation Approach to Semiparametric Sample Selection Binary Response Modelling

A Penalized Likelihood Estimation Approach to
Semiparametric Sample Selection Binary Response
Modelling∗
Giampiero Marra
Rosalba Radice
Dept of Statistical Science
Dept of Health Services Research & Policy
University College London
London School of Hygiene & Tropical Medicine
Gower Street, London WC1E 6BT, U.K.
Tavistock Place, London WC1H 9SH, U.K.
[email protected]
[email protected]
April 30, 2012
Abstract
Sample selection bias plays an important role when estimating the effects of covariates on an outcome variable using survey research data. Statistical methods addressing
this issue typically concern situations where the outcome is continuous. Here, we are
interested in the case where the response is binary. Moreover, continuous covariates
may have a nonlinear relationship to the outcome. We introduce two statistical methods
for the (separate or simultaneous) estimation of two binary regression models involving semiparametric predictors in the presence of sample selection bias. This is achieved
via an adaptation of the Heckman procedure, and a simultaneous equation estimation
scheme. Both approaches are based on the penalized likelihood estimation framework.
The problems of identification and inference are also discussed. The empirical properties of the proposed methods are studied through an extensive simulation study. As an
application, we analyse data from the American National Election Study with the aim of
more realistically quantifying public support for school integration.
Keywords: Binary responses; Bivariate probit; Heckman-type estimation approach; Penalized regression spline; Sample selection bias; Survey.
∗ Research
Report No. 315, Department of Statistical Science, University College London. Date: May 2012.
1
1 Introduction
Sample selection bias is an issue that has plagued much of survey research (e.g. Groves et al.,
2002). In public surveys, for example, it may be the case that some individuals choose not to
answer some specific questions because they feel that their opinion might paint them in an
unfavorable light. This leaves us with a self-selected sample. So if the interest is quantifying
the relationship between various demographic and socio-economic characteristics and an
outcome variable in the population as a whole, then using the responding subsample is
likely to produce biased estimates. This source of bias is known in the literature as sample
selection bias (Heckman, 1979; Dubin and Rivers, 1990).
To fix ideas, let us consider a study of public opinion polls on school integration
(http://www.electionstudies.org/). The main question was if respondents support
government intervention to ensure that black and white children go to the same school. In
particular, individuals were first asked if they had an opinion on the integration question
(0 = no, 1 = yes) and then what that opinion was (0 = no integration, 1 = yes integration).
Information on individual demographic and socio-economic characteristics was also recorded.
If the respondents who opposed government involvement in school integration chose not
to answer the integration question because they felt their opinion might be perceived as
socially unacceptable, then it might have been the case that the sample of individuals who
provided an opinion differed in systematic ways from the sample of non-respondents. To
clarify this (often misunderstood) concept, let us characterize each individual by some observable and unobservable features or confounders. If the responding and nonresponding
subsamples have similar characteristics, then the issue of sample selection bias does not
arise since the average (observable and unobservable) features of the responding sample
are similar to those of the population. If the decision to answer is no longer random, due
to potentially differing characteristics between the responding and nonresponding samples,
then sample selection bias arises if some of the determinants of the decision to answer also
influence the outcome/response itself. When the relationship between the decision to respond and outcome is only through observables, it is possible to correct for sample selection
bias by controlling for these variables in the outcome equation. However, in the presence
of unobservables influencing the decision to answer and the outcome, controlling only for
observables is clearly insufficient. That is, if some individuals are part of the responding
subsample because of their unobservable features, then regardless of whether the observables and unobservables are correlated in the overall population they will be in the selected
sample (e.g. Dubin and Rivers, 1990). This means that ignoring the potential correlation
between the unobservable factors influencing the decision to answer and the outcome can
lead to biased and inconsistent estimates of the covariate impacts in the outcome equation.
2
Note that there are a number of examples where the issue of sample selection bias occurs.
These include the study of prevention program for high school dropouts (Montmarquette et
al., 2001), family-related factors for foster approval (Cuddeback et al., 2004), and risk factors
for HIV prevalence (Bärnighausen et al., 2011), to name a few.
Statistical methods correcting for the phenomenon described in the previous paragraph
have been developed. Most of these works concern models where the response variable is
continuous (e.g. Chib et al., 2009; Heckman, 1979; Green, 2007; Wiesenfarth and Kneib, 2010,
and references therein). Here, we are interested in the case where the response is binary.
The only procedures currently available to fit a sample selection binary response model are
those presented in Van de Ven and Van Praag (1981) and Green (2007). These involve the
(separate or simultaneous) estimation of two binary regression models for the outcome and
selection equations. The outcome equation is used to examine the substantive question of
interest, whereas the selection equation is used to detect selection bias and obtain unbiased
estimates of the covariate effects in the outcome equation. One potential drawback to the
application of the techniques by Van de Ven and Van Praag (1981) and Green (2007) is the
lack of flexibility in handling the possible presence of nonlinear relationships. The need for
methods flexibly modelling regressor effects arises from the observation that all model parameter estimates are biased and inconsistent when the relationship between the observable
confounders and the outcome is mismodelled as a result of the use of linear or low-order
polynomial terms for example (Chib et al., 2009; Marra and Radice, 2011a). This is problematic as it may prevent the researcher from recognizing a strong covariate effect or, more
generally, revealing interesting relationships.
In this article, we discuss an extension of the procedure presented in Van de Ven and
Van Praag (1981) which allows for flexible functional dependence of the binary response
variables on continuous covariates. We also introduce a penalized likelihood estimation
approach for a simultaneous system of two binary equations which include smooth functions of continuous covariates. The methods discussed here are implemented in the freely
available R package SemiParBIVProbit (Marra and Radice, 2011b). To the best of our
knowledge, no other computational alternatives which consider a semiparametric sample
selection binary response model are available in the literature. It may be argued that the
model setup adopted here is pretty similar to that of Chib et al. (2009) and Wiesenfarth
and Kneib (2010) except for the fact that, in the current context, the outcome variable is binary. This suggests that the Bayesian estimation schemes introduced by these authors could
be easily extended to estimate the model considered in this paper. We did not pursue a
Bayesian approach because the computational developments by Chib et al. (2009) are not
available for public use, and because the approach by Wiesenfarth and Kneib (2010) suffers
from convergence problems due to the persistent autocorrelations of the parameters sam3
pled in the Markov chain Monte Carlo algorithm. As Wiesenfarth and Kneib (2010) point
out, this issue is common in Bayesian sample selection models and the autocorrelations do
not disappear even with long simulation runs and considerable thinning. Such an issue becomes even more severe when the correlation between the outcome and selection equations
is high.
The empirical properties of the methods are studied through an extensive simulation
study. The proposals are then illustrated using the above-mentioned data on public opinion
polls on school integration. Overall, the results indicate that the introduced approaches are
effective for flexibly estimating covariate impacts in the presence of selection bias, and for
more realistically quantifying public support for school integration.
2 Methods
In this section, we describe the model structure of a semiparametric binary response model
with sample selection, discuss two possible strategies for parameter estimation and address
the problems of identification and inference.
2.1 The model
The model consists of a first equation, known as selection equation, and a second outcome
equation determining the response variable of interest. The selection equation, expressed
using the latent variable representation, is given as
∗
+
y1i
= x1i
θ1 +
K1
∑
k 1 =1
f 1k1 (z1k1 i ) + ε 1i , i = 1, . . . n,
(1)
∗ is a latent continuous variable which is related to
where n denotes the sample size, and y1i
∗ > 0). The outcome equation is given as
its observable counterpart y1i through the rule 1(y1i
∗
+
y2i
= x2i
θ2 +
K2
∑
k 2 =1
f 2k2 (z2k2 i ) + ε 2i ,
(2)
where y2i is determined according to the rule
y2i =
(
∗
1 if (y2i
> 0 & y1i = 1)
∗
0 if (y2i
< 0 & y1i = 1)
4
.
+
+
+
In (1), x1i
= 1, x12i
, . . . , x1P
represents the ith row vector of x1+ , the n × P1 model matrix
1i
for any strictly parametric model components (such as dummy and categorical variables),
with corresponding parameter vector θ1 . The f 1k1 are unknown smooth functions of the continuous covariates z1k1 i , and K1 denotes the number of regressors z1k1 i . These components
are represented using the regression spline approach (see next section). Each smooth term
may also be multiplied by some predictor, yielding a ‘varying coefficients’ model (Hastie
and Tibshirani, 1993). Moreover, smooth functions of two covariates such as f 11,12 (z11i , z12i )
+
can be also considered (e.g. Wood, 2006, pp. 154-167). Similarly in (2), x2i
is the ith row vector of the ns × P2 model matrix x2+ , with coefficient vector θ2 , the f 2k2 are unknown smooth
terms of the continuous regressors z2k2 i , K2 denotes the number of regressors z2k2 i and ns
the size of the selected sample. For identification purposes, smooth terms are subject to
constraints such as ∑i f k (zki ) = 0 for all k. The error terms (ε 1i , ε 2i ) are assumed to follow
the bivariate normal distribution
!
" # "
#!
ε 1i
0
1
ρ
iid
∼N
,
,
(3)
ε 2i
0
ρ 1
where ρ is the correlation coefficient. The error variances are normalized to unity. This is a
conventional normalization that is necessary because the parameters in the model can only
be identified up to a scale coefficient (e.g. Greene 2007, p. 686).
2.1.1
Smooth function representation
A popular and effective way of representing smooth functions of continuous covariates is
the regression spline approach. The basic idea is to approximate a generic function f k (zki )
by a linear combination of known spline basis functions, bkj (zki ), and regression parameters,
β kj ,
Jk
f k (zki ) =
∑ β kj bkj (zki ) = Bk (zki )βk ,
j =1
where J is the number of spline bases and hence regression coefficients used to represent
f , Bk (zki ) represents the ith 1 × Jk row vector consisting of the basis functions evaluated at
the observation zki , i.e. Bk (zki ) = bk1 (zki ), bk2 (zki ), . . . , bkJ (zki ) , and βk is the corresponding parameter vector. Calculating Bk (zki ) for each i yields Jk curves encompassing different
degrees of complexity which multiplied by some real valued parameter vector βk and then
summed give a curve estimate for f k (zk ). Basis functions are usually chosen to have convenient mathematical properties and good numerical stability. Possible choices are B-splines,
cubic regression and thin plate regression splines. We will employ the latter splines although, in the one dimension case, they all lead to very similar results. See, e.g., Ruppert et
5
al. (2003) for a more detailed introduction. Based on the result above, equations (1) and (2)
can written as
∗
+
y1i
= x1i
θ1 + B1i β1 + ε 1i , i = 1, . . . , n,
(4)
and
where Bvi
1, 2.
∗
+
y2i
= x2i
θ2 + B2i β2 + ε 2i , i = 1, . . . , ns ,
(5)
T , β T , . . . , β T ), for v =
= Bv1 (zv1i ), Bv2 (zv2i ), . . . , BvK1 (zvKv i ) and βvT = (βv1
v2
vKv
2.2 Heckman-type estimation approach
Sample selection bias can be dealt with by using a device which Heckman (1979) introduced
for an analogous problem in binary-choice regression, and that Van de Ven and Van Praag
(1981) employed in the context of a fully parametric probit model with sample selection.
Here, we use the same device but in the context of semiparametric probit analysis.
The population regression function for (5) can be written as
+
∗ +
|x2i , B2i ) = x2i
θ2 + B2i β2 ,
E (y2i
∗ > 0)
while the regression function for the subsample of complete observations (i.e. when y1i
is
∗
+
∗
+
∗ +
, B2i , y1i
> 0).
(6)
θ2 + B2i β2 + E (ε 2i |x2i
> 0) = x2i
E (y2i
|x2i , B2i , y1i
Given that (ε 1i , ε 2i ) are assumed to follow a standardized bivariate normal distribution with
correlation coefficient ρ, we have that
+
∗
E (ε 2i |x2i
, B2i , y1i
> 0) = ρϑi
(7)
+
where ϑi = φ(η1i )/Φ(η1i ), and is typically called inverse Mills ratio, η1i = x1i
θ1 + B1i β1 ,
and φ and Φ are the density function and distribution function of a standardized normal.
Regression equation (6) can be therefore written as
+
∗
θ2 + B2i β2 + ρϑi + e
ε 2i ,
= x2i
y2i
(8)
∗ > 0) = 0 and E (e
∗ > 0) = τ 2 . Based on the derivations by Heckman
where E (e
ε 2i |y1i
ε22i |y1i
i
(1979), τi2 = 1 + ρ2 ϑi (−η1i − ϑi ). It is then clear to see that the coefficient estimates in (5)
obtained using a nonrandom selected subsample are biased if ρ 6= 0. This can be thought
of as arising from an ordinary specification error with the conditional mean (7) deleted as
a covariate in the model. Including ϑi as an explanatory variable, as in equation (8), would
6
rectify this situation. In practice we do not know ϑi , but it is possible to obtain a consistent
estimate of it based on the estimated coefficients of selection equation (4). After dividing
∗ > 0) = τ 2 , and using a selected sample, we can obtain
equation (8) by τi , because E (e
ε22i |y1i
i
unbiased parameter estimates using the model
+
∗
∗
β2 + ρ(ϑi /τi ) + ε¯ 2i ,
= (x2i
/τi )θ2 + B2i
y2i
(9)
∗ includes the quantities corresponding to the spline bases for the smooth functions
where B2i
∗ > 0) = 0, E ( ε¯2 | y∗ > 0) = 1, and ϑ and τ can
of the covariates z2k2 i rescaled by τi , E (ε¯ 2i |y1i
i
i
2i 1i
be consistently estimated as described above.
The algorithm to fit model (9) can be summarised as follows:
step 1 Fit a probit model for sample selection equation (4) so to obtain consistent estimates
of ηb1i and hence ϑbi , for all i.
step 2 Using the selected sample only, obtain a consistent estimate of ρ by fitting a linear
b
probability model
q for equation (8), where ϑi is replaced with ϑi for all i.
step 3 Estimate τbi via 1 + ρb2 ϑbi (−ηb1i − ϑbi ) where ϑbi , ηb1i and ρb are obtained in steps 1 and 2.
step 4 Using the selected sample only and after rescaling all variables in the model by the τbi ,
fit a probit model for equation (9).
Remark 1. Equation (9) is fitted using a probit model even if the normality assumption of
ε¯ 2i , which is necessary for consistency, is clearly not met (e.g. Van de Ven and Van Praag,
1981). Hence, the results obtained from step 4 will be affected by some bias in the estimates.
In addition, the method can produce an estimate for ρ which is not in the range [−1, 1].
Remark 2. Standard errors of the parameter estimates in (9) are not realistic in that, for
example, they do not account for that additional source of sampling variability introduced
because the ϑi and τi have to be estimated.
Remark 3. The models used in the steps outlined above may be fitted using unpenalized
parameter estimation procedures. However, because of the flexible model specification considered here, this is likely to result in smooth function estimates that are too wiggly to produce realistic and reliable results. This issue can be overcome by penalized estimation,
R
where an objective function is augmented by a penalty term, such as ∑k λk f k00 (zk )2 dzk ,
measuring the (second-order, in this case) roughness of the smooth terms in the model. The
λk are smoothing parameters controlling the trade-off between fit and smoothness. In practice, since regression splines are linear in their model parameters, such a penalty can be
expressed as a quadratic form in the generic parameter vector β (containing the coefficients
R
of all smooth terms in the model), i.e. ∑k λk f k00 (zk )2 dzk = β T (∑k λk Sk ) β, where the Sk are
positive semi-definite known matrices. Depending on the model employed, the regression
parameters can be estimated by either minimization of a penalized least squares criterion or
7
maximization of a penalized log-likelihood function. The λk can be selected via a prediction
error or maximum likelihood criterion (e.g. Ruppert et al., 2003, Ch. 8).
2.3 Bivariate probit estimation approach
Since it is assumed that the errors (ε 1i , ε 2i ) follow a standardized bivariate normal distribution with correlation coefficient ρ, simultaneous equation parameter estimation is advisable.
Let us define the linear predictors ηvi = x+
vi θv + Bvi βv for v = 1, 2. Recognizing that in the
sample selection context the data identify only the three possible events (y1i = 1, y2i = 1),
(y1i = 1, y2i = 0) and (y1i = 0), with the following probabilities
+
+
P (y1i = 1, y2i = 1|x1i
, B1i , x2i
, B2i ) = p11i = Φ2 (η1i , η2i ; ρ),
+
+
P (y1i = 1, y2i = 0|x1i
, B1i , x2i
, B2i ) = p10i = Φ(η1i ) − Φ2 (η1i , η2i ; ρ),
+
P (y1i = 0|x1i
, B1i ) = p0i = Φ(−η1i ),
where Φ2 is the distribution function of a standardized bivariate normal with correlation ρ,
the log-likelihood function is
n
`(δ ) =
∑ {y1i y2i log( p11i ) + y1i (1 − y2i ) log( p10i ) + (1 − y1i ) log( p0i )} ,
i =1
where δ T = (δ1T , δ2T , ρ), and δvT = (θvT , βvT ) for v = 1, 2.
As explained in the previous section, in a smoothing context, it is necessary to penalize
the regression spline coefficients in order to suppress that part of smooth term complexity
which has no support from the data. The model is therefore fitted by maximization of the
penalized log-likelihood
1
(10)
` p (δ ) = `(δ ) − β T Sλ β,
2
where β T = (β1T , β2T ) and Sλ = ∑2v=1 ∑kKvv=1 λkv Skv . Note that because ρ is bounded in
[−1, 1], we use the common transform for correlation ρ∗ = tanh−1 (ρ) = (1/2) log {(1 + ρ) / (1 − ρ)},
so that [−1, 1] is mapped to the real line. Given values for the λkv , we seek to maximize (10).
In practice, this can be achieved by Fisher scoring iterating
∗ −1 [ a ]
δˆ [a+1] = δˆ [a] + (I [a] + Sλ
) (g − Sλ∗ δˆ [a] )
(11)
∗
until convergence, where a is the iteration index and Sλ
denotes the overall block-diagonal
penalty matrix made up of the λkv Skv and 0 corresponding to any strictly parametric model
components, since such terms are not penalized. The gradient vector g is defined by two
subvectors g1 = ∂`(δ )/∂δ1 and g2 = ∂`(δ )/∂δ2 , and a scalar g3∗ = ∂`(δ )/∂ρ∗ , while the
8
Fisher information matrix has the following 3 × 3 matrix block structure



I = −E 

∂2 `(δ )
∂δ1 ∂δ2T
∂2 `(δ )
∂δ2 ∂δ2T
∂2 `(δ )
∂ρ∗ ∂δ2
∂2 `(δ )
∂δ1 ∂δ1T
∂2 `(δ )
∂δ2 ∂δ1T
∂2 `(δ )
∂ρ∗ ∂δ1
∂2 `(δ )
∂δ1 ∂ρ∗
∂2 `(δ )
∂δ2 ∂ρ∗
∂2 `(δ )
∂ρ∗2



.

The analytical expressions for g and I in the three main parameter space dimensions are
given in the Appendix. Step (11) is implemented using a trust region algorithm with eigendecomposition of I at each iteration (e.g. Nocedal and Wright, 1999, Section 4.2). This
proved to be faster and more reliable than the standard approaches adopted in the literature
to estimate likelihood-based models.
In (11), the smoothing parameters are fixed at some values. This is because joint estimation of δ and λ (containing all the λkv ) via maximization of (10) would clearly lead to severe
overfitting since the highest value for ` p (δ ) would be obtained when λ = 0. Hence the need
to estimate λ using an appropriate criterion.
2.3.1
Smoothing parameter estimation
Smoothing parameter selection can, in principle, be achieved by direct grid search optimization of, e.g., a prediction error criterion. However, if the model has more than two
or three smooth terms, then this typically becomes computationally burdensome, hence
making the model building process difficult in most applied contexts. There are a number
of techniques for automatic multiple smoothing parameter selection within the penalized
likelihood framework. Without claim of exhaustiveness, these include the performanceoriented iteration method originally proposed by Gu (1992) and mixed model approach to
penalized regression spline estimation (e.g. Ruppert et al., 2003). The former applies the
generalized cross validation or unbiased risk estimator (UBRE), introduced by Craven and
Wahba (1979), to each working linear model of the penalized iteratively re-weighted least
squares (P-IRLS) scheme used to fit the model. The latter consists of viewing the λkv as
variance components so that they can be estimated, e.g., by restricted maximum likelihood.
Here, we adapt Gu’s approach to the current context.
Given values for the λkv and considering that Fisher scoring step (11) can be written in
P-IRLS form, δˆ [a+1] is the solution to the problem
√ [ a]
∗
δ w.r.t. δ,
minimize k W (z[a] − Xδ )k2 + δ T Sλ
(12)
√ T√
√
where W is any iterative weight non-diagonal matrix square root such that W W =
W, zi is a 3-dimensional pseudodata vector given as zi = Xi δ [a] + Wi−1 di , where di =
9
∗ and η ∗ = ρ∗ , W is the 3 × 3 matrix
∂`(δ )i /∂η1i , ∂`(δ )i /∂η2i , ∂`(δ )i /∂η3i
i
3i



Wi = − E 

∂2 `(δ )i
2
∂η1i
2
∂ `(δ )i
∂η2i ∂η1i
∂2 `(δ )i
∗ ∂η
∂η3i
1i
∂2 `(δ )i
∂η1i ∂η2i
∂2 `(δ )i
2
∂η2i
2
∂ `(δ )i
∗ ∂η
∂η3i
2i
∂2 `(δ )i
∗
∂η1i ∂η3i
2
∂ `(δ )i
∗
∂η2i ∂η3i
2
∂ `(δ )i
∗2
∂η3i



,

and, assuming without loss of generality that the spline basis dimensions for the smooth
terms in the model are all equal to J, Xi is a 3 × {( P1 + K1 × J ) + ( P2 + K2 × J ) + 1} block
+
+
diagonal matrix, i.e. Xi = diag (x1i
, B1i ), (x2i
, B2i ), 1 . The superscript [ a] has been suppressed from di , zi , and Wi to avoid clutter.
The λkv should be selected so that the estimated smooth functions are as close as possible
to the true functions. Given an estimate for δ, multiple smoothing parameter estimation for
problem (12) can be achieved by minimization of the approximate UBRE score, also thought
of as an approximate rescaled Akaike information criterion,
Vuw (λ) =
2
1 √
k W(z − Xδ )k2 − 1 + tr(Aλ ),
n∗
n∗
(13)
where the working linear model quantities are estimated in the Fisher scoring step, n∗ = 3n,
∗ −1 T
Aλ = X(XT WX + Sλ
) X W is the hat matrix, and tr(Aλ ) represents the estimated degrees
of freedom of the penalized model. For each working linear model of the P-IRLS iteration,
Vuw (λ) is minimized with respect to λ. The two steps, one for δ the other for λ, are iterated until convergence. Here, this approach is implemented employing the approach by
Wood (2004), which is based on Newton’s method and can evaluate score (13) and their
derivatives in a way that is both computationally efficient and stable. Generally speaking,
√
Q and
this is achieved using WX = QR, obtained by pivoted QR decomposition,
" where
#
R
R are defined in the usual manner, and a singular value decomposition
= UDVT ,
L
∗
∗
where L is any matrix square root of Sλ
such that LT L = Sλ
, obtained for instance by eigendecomposition, the columns of U are those of an orthogonal matrix, V is an orthogonal
matrix, and D is a diagonal matrix of singular values which are very useful to detect numerical rank deficiency of the fitting problem. Based on this, evaluation of tr(Aλ ) for new
trial values of the smoothing parameters can be made relatively cheap, and the derivatives
of Vuw with respect to λ can be stably and efficiently evaluated. Note that minimization of
the score is with respect to λ∗ = log(λ) since the smoothing parameters must be positive.
See Wood (2004) for further details.
Remark 1. In the context of simultaneous equation estimation methods, the use of Fisher
10
scoring is recommended because the Wi are positive-definite over a larger region of the
parameter space as compared to those obtained by Newton-Raphson. This is crucial given
√
that W and W−1 (via z), obtained by eigen-decomposition, are needed in (13).
Remark 2. Because W is a non-diagonal matrix of dimension n∗ × n∗ , computation can
√
√
quickly become prohibitive, even for small problems. To calculate W−1 d, Wz and WX so
that the computational load and storage demand of the algorithm is reduced considerably,
the band structure of W is exploited. Hence, the working linear model in (13) is formed in
O(n∗ (m + 2)) rather than O(n2∗ (m + 2)) operations, where m is the number of columns of X.
Remark 3. As opposed to the Heckman-type approach, simultaneous estimation of all
model parameters via the bivariate probit scheme introduced here does not rely on approximations and does not require the use of quantities estimated in previous stages, hence
the procedure will yield unbiased and consistent estimates. Moreover, standard errors do
not need to be corrected since the approach does not require the use of the inverse Mills
ratio, and all parameters in δ are estimated jointly. All this convenience comes at expense
of computational cost and stability. However, these can be satisfactorily dealt with by taking the precautions described in last two sections, and suppling the Heckman estimates
(obtained using the procedure of Section 2.2) as starting values in the bivariate probit estimation scheme.
2.4 Empirical identification
The parameters of the approach described in Section 2.2 are formally identified even if
+
+
, B2i , because of the nonlinearity of the inverse Mills ratio (e.g. Puhani,
x1i
, B1i = x2i
2000). In practice, however, this will often result in substantial collinearity between ϑbi and
the other covariates in the outcome equation. This is especially true when the variation in
ηb1i will be such that the nonlinearity of the inverse Mills ratio does not play a major role,
case in which ϑbi would be approximated quite well by a linear function of the covariates
in the model. This collinearity can lead to large standard errors and instability in the estimates. The parameters of the method detailed in Section 2.3 are also formally identified, but
with the advantage of not dealing with the drawbacks deriving from the use of the inverse
Mills ratio. However, in the absence of equation-specific regressors the likelihood function
may not vary significantly over a wide region around the mode, hence causing numerical
difficulties.
This suggests that, although the model parameters are formally identified, empirical
identification is better achieved if the exclusion restriction (ER) on the covariates in the two
equations holds (e.g. Little, 1985; Puhani, 2000). That is, the regressors in the selection
equation should contain at least one or more regressors than those included in the outcome
11
equation. The simulation study in Section 3 will also shed light on this issue, particularly as
to when the exclusion restriction is crucial to identify empirically the model parameters.
2.5 Inference
The methods described in Section 2 rely on penalized log-likelihood estimation. Within this
framework, inferential theory is not standard. This is because of the presence of smoothing
penalties which undermines the usefulness of classic frequentist results for practical modelling (e.g. Wood, 2006). Solutions to this problem have been proposed. In this section, we
show how to construct confidence intervals for the components in a semiparametric sample
selection model adapting some of the results available in the literature.
The well known Bayesian ‘confidence’ intervals originally proposed by Wahba (1983)
in the univariate spline model context, and then generalized to the non-constant width
component-wise interval case when dealing with Gaussian and non-Gaussian data are typically used to reliably represent the uncertainty of smooth functions (e.g. Gu, 2002). An
interesting feature of these intervals is that they have close to nominal ‘across-the-function’
frequentist coverage probabilities, despite their derivation is Bayesian (Nychka, 1988; Marra
and Wood, in press). To better understand this point, let us consider a generic smooth model
component f (zi ). Intervals can be constructed seeking some constants Ci and A, such that
1
ACP = E
n
(
∑ I(| fˆ(zi ) − f (zi )| ≤ qα/2 A/
i
p
Ci )
)
= 1 − α,
(14)
where ‘ACP’ denotes ‘Average Coverage Probability’, I is an indicator function, α is a constant between 0 and 1, and qα/2 is the α/2 critical point from a standard normal distribution. Defining b(z) = E { fˆ(z)} − f (z) and v(z) = fˆ(z) − E { fˆ(z)}, so that fˆ − f = b + v,
and I to be a random variable uniformly distributed on {1, 2, . . . , n}, we have that ACP =
√
√
Pr (| B + V | ≤ qα/2 A), where B = C I b(z I ) and V = C I v(z I ). At this point, it is necessary
to find the distribution of B + V and values for the Ci and A so that requirement (14) is met.
As theoretically shown in Marra and Wood (in press), in the context of non-Gaussian response models involving several smooth components, such a requirement is approximately
met by the posterior distribution
ˆ Vδ ),
δ |yvN
˙ (δ,
(15)
where, for the approach described in Section 2.3, y refers to the response vectors, δˆ is the
∗ )−1 . Note that the same distributional result can be used
estimate of δ and Vδ = (I + Sλ
for the models described in Section 2.2. Given result (15), confidence intervals for linear
and nonlinear functions of the model parameters can be easily obtained. For any strictly
12
parametric model components, using (15) to obtain confidence intervals is equivalent to
using classic likelihood results. This is because such model terms are not penalized. It is
important to stress that there is no contradiction in fitting the sample selection model via
penalized log-likelihood estimation and then constructing confidence intervals following a
Bayesian approach, and such a procedure has been employed many times in the literature
(e.g. Gu, 2002; Marra and Radice, 2011a; Wood, 2006). Alternatively, one could employ a
fully Bayesian approach. However, as explained in the introductory section, this is unlikely
to be successful when used for parameter estimation.
Result (15) can produce intervals with close to nominal coverage probabilities for the
model components when using the estimation scheme described in Section 2.3. However,
this is not true for the approach detailed in Section 2.2, for the reasons given in Remark
2. As a solution, posterior simulation can be employed (e.g. Wood, 2006). Specifically, we
propose to adjust the intervals using the following algorithm:
• Let the parameter vector and covariance Bayesian matrix estimated in step 1 be δˆ1 and
ˆ δ . Draw Ns random vectors from N (δˆ1 , V
ˆ δ ) and then calculate the corresponding
V
1
1
∗
∗
b
Ns values ηb and ϑ , for all i.
1i
i
ols , δˆ ols , . . . , δˆ ols and V
ˆ ols , V
ˆ ols , . . . , V
ˆ ols . For each
• Fit Ns step 2 models to obtain δˆ2,1
2,2
2,Ns
δ2,1
δ2,2
δ2,Ns
parameter vector and covariance matrix combination, draw Ns random vectors from
the corresponding Gaussian distribution and then calculate the Ns2 values ρb∗ .
∗ and ρ
b∗ .
• Calculate the Ns2 values τbi∗ , for all i, using ϑbi∗ , ηb1i
• Fit Ns2 step 4 models, using each of the ϑbi∗ and τbi∗ combination, to obtain δˆ2,1 , δˆ2,2 , . . . , δˆ2,Ns2
ˆ δ ,V
ˆ δ ,...,V
ˆ δ . For each parameter vector and covariance matrix combinaand V
2,2
2,1
2,Ns2
tion, draw Nd random vectors from the corresponding Gaussian distribution so to
obtain Ns2 × Nd random draws from which approximate intervals for the component
functions of model (9) can be constructed.
This procedure can account for the extra source of variability introduced via the quantities
calculated in the steps 1–3 of the Heckman-type estimation approach. Simulation experience suggests that small values for Ns and Nd , say 20 and 100, will be tolerable in practice.
A similar scheme has been previously employed by Marra and Radice (in press) who found
close to nominal across-the-function observed coverage probabilities for the model components.
3 Simulation study
To gain insight into the empirical properties of the estimation approaches detailed in the previous sections, a Monte Carlo simulation study was conducted. All computations were per13
formed in the R environment (R Development Core Team, 2011) using the package SemiParBIVProbit
which implements the ideas discussed in this article (Marra and Radice, 2011b).
3.1 Design and model fitting details
The sampling experiments were based on the model
∗
y1i
= θ11 + θ12 xi+ + f 11 (z1i ) + f 12 (z2i ) + ε 1i
∗
y2i
= θ21 + θ22 xi+ + f 21 (z1i ) + ε 2i
,
0.0
0.2
0.4
0.6
0.8
1.0
0.0
−0.2
f21(z1)
−0.8
−0.6
−0.4
0.2
f12(z2)
−2
−0.4 −0.2
0.0
0
−1
f11(z1)
0.4
1
0.6
0.2
0.8
where the binary outcomes y1i and y2i were determined according to the rules described in
Section 2.1. The test functions were f 11 (z1i ) = −0.7 4z1i + 2.5z21i + 0.7 sin(5z1i ) + cos(7.5z1i ) ,
f 12 (z2i ) = −0.4 {−0.3 − 1.6z2i + sin(5z2i )}, and f 21 (z1i ) = 0.6 {exp(z1i ) + sin(2.9z1i )}; these
are displayed in Figure 1. (θ12 , θ22 ) was set to (2.5, −1.5). To generate binary values for y1i
so that approximately 50% of the total number of observations were selected to fit the outcome equation, and values for y2i which appeared approximately the same number of times,
(θ11 , θ21 ) was set to (0.58, −0.68). As for xi+ , z1i and z2i , using rmvnorm() in the package
mvtnorm, three uniform covariates on (0, 1) were obtained generating standardized multivariate random draws with correlation 0.5 and then applying pnorm() (Gentle, 2003). The
regressor xi+ was dichotomized using round(). Standardized bivariate normal errors with
correlations ρ = (±0.1, ±0.5, ±0.9) were considered, and sample sizes set to 1000, 3000 and
5000. In a full factorial design fashion, 1000 replications of each combination of parameter
settings were obtained.
0.0
0.2
0.4
z1
0.6
0.8
1.0
0.0
z2
0.2
0.4
0.6
0.8
1.0
z1
Figure 1: The test functions used in the simulation studies.
To explore empirically the issue of identification discussed in Section 2.4, we fitted the
model using the data generating process described above but with no exclusion restriction
(noER), i.e. without including z2i in the selection equation. We also considered the case
14
in which approximately 75% of the total number of observations were selected to fit the
outcome equation. This was achieved setting θ11 to 1.66.
In the two estimation approaches, the smooth components were represented using penalized thin plate regression splines with basis dimensions equal to 10 and penalties based
on second-order derivatives (Wood, 2006, pp. 154-160). In cross-sectional studies, 10 bases
typically suffice to represent reasonably well smooth functions, although sensitivity analysis using fewer or more spline bases is advisable in applied work. Models were also fitted
neglecting the sample selection issue, i.e. simply fitting equation (5) on the selected sample
(henceforth, this will be referred to as naive approach).
3.2 Results
In this section, we only show a subset of results; these are representative of all empirical
findings. As pointed out in the introduction, the selection equation is not affected by sample
selection bias, hence we focus on the estimation results for the outcome equation.
Figures 2, 3 and 4 present the boxplots of the estimates of θ22 , those of ρ, and the empirical root mean squared errors (RMSE) of fb21 (z1 ) when employing the naive, Heckman-type
and bivariate probit estimation approaches, ER holds and approximately 50% of the total
number of observations are available to fit the outcome equation. Figure 5 shows the estimated smooth functions for f 21 (z1 ) averaged over the simulation runs, whereas Figures 6
and 7 give the results for the estimates of θ22 and RMSE of fb21 (z1 ) when ER does not hold
and approximately 50% and 75% of the total number of observations are available to fit the
outcome equation. As, e.g., in Wiesenfarth and Kneibr(2010), based on the estimates for 200
n
o2
b21 (z1i ) − f 21 (z1i ) .
fixed covariate values, RMSE( fb21 ) was calculated as ∑200
f
b =1
The results can be summarized as follows:
• Overall, Figures 2 and 3 show that bivariate probit outperforms both the naive and
Heckman approaches in terms of bias and precision of θb22 and ρb. Specifically, as ρ
increases (i.e. as the sample selection issue becomes more pronounced) the accuracy
of the naive and Heckman estimates deteriorates quickly, with the naive being the
worst. In all cases, as n increases the bivariate probit estimates converge to their true
values; this is not observed in the Heckman estimates for the reasons given in Remark
1 of Section 2.2.
• The RMSEs and average fits of f 21 (z1 ), in Figures 4 and 5, indicate that the bivariate
probit method performs better than the other two approaches, although the RMSEs of
the former are characterized by more extreme values especially when ρ = 0.1. When
sample selection bias is fairly negligible (i.e. ρ is low), none of the methods outperforms the others although the bivariate probit and Heckman estimation schemes are
15
obviously inefficient. As ρ increases the naive estimates are severely biased and not reliable, whereas those of bivariate probit and Heckman are consistently more accurate
and precise.
• The results from the scenarios with noER, in Figures 6 and 7, suggest that when an extra regressor does not enter the selection equation and about half of the total number
of observations are available for the outcome equation, the estimates of all methods
are biased and not reliable. As the percentage of selected observations increases the
bivariate probit scheme strongly outperforms naive and slightly Heckman. This indicates that, under this circumstance and provided that assumption (3) is met, ER is not
necessary to achieve empirical identification, at least for bivariate probit.
Results from scenarios with ER
n=1000
n=3000
n=5000
0
ρ=0.1
−3
ρ=0.1
−2
ρ=0.1
−1
−4
−5
n=1000
n=3000
n=5000
0
−2
ρ=0.5
ρ=0.5
θ^22
ρ=0.5
−1
−3
−4
−5
n=1000
n=3000
n=5000
0
ρ=0.9
−3
ρ=0.9
−2
ρ=0.9
−1
−4
−5
Naive
Heckman
Biv probit
Naive
Heckman
Biv probit
Naive
Heckman
Biv probit
methods
Figure 2: Boxplots of the parameter estimates of θ22 based on 1000 replications obtained from the experiments
with exclusion restriction (ER), when employing the naive, Heckman-type and bivariate probit estimation approaches and approximately 50% of the total number of observations are available to fit the outcome equation.
The true value (dashed lines) is −1.5. ρ and n denote the correlation between the errors of the selection and
outcome equations, and the sample size. See Section 3.1 for further details.
Average coverage probabilities of the 95% confidence intervals for θ22 and f 21 (z1 ), constructed as described in Section 2.5, were calculated. Ns and Nd were set to 20 and 100.
For θ22 , the coverages rates of the Heckman approach were below the nominal level, with
values going from 0.93 to 0.80 as ρ was increasing. This was especially due to the bias in
the estimates highlighted in Figure 2. Despite the accuracy of the bivariate probit method
in estimating θ22 , rates were poor with values ranging from 0.72 to 0.86 as n was increasing.
A similar behaviour has already been observed in the literature when employing bivariate
16
Results from scenarios with ER
n=1000
n=3000
n=5000
ρ=0.1
1
ρ=0.1
2
ρ=0.1
3
0
−1
n=1000
n=3000
n=5000
2
ρ=0.5
ρ=0.5
^
ρ
ρ=0.5
3
1
0
n=1000
n=3000
−1
n=5000
ρ=0.9
1
ρ=0.9
2
ρ=0.9
3
0
−1
Heckman
Biv probit
Heckman
Biv probit
Heckman
Biv probit
methods
Figure 3: Boxplots of the parameter estimates of ρ based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the Heckman-type and bivariate probit estimation
approaches. The dashed lines indicate the true values. Further details are given in the caption of Figure 2.
Results from scenarios with ER
n=1000
n=3000
n=5000
0.8
ρ=0.1
ρ=0.1
0.4
ρ=0.1
0.6
0.2
0.0
n=1000
n=3000
n=5000
^
RMSE( f 21)
0.8
ρ=0.5
ρ=0.5
ρ=0.5
0.6
0.4
0.2
n=1000
n=3000
0.0
n=5000
0.8
ρ=0.9
ρ=0.9
0.4
ρ=0.9
0.6
0.2
0.0
Naive
Heckman
Biv probit
Naive
Heckman
Biv probit
Naive
Heckman
Biv probit
methods
Figure 4: Boxplots of the empirical root mean squared errors (RMSE) of fb21 (z1 ) based on 1000 replications
obtained from the experiments with exclusion restriction (ER), when employing the naive, Heckman-type and
bivariate probit estimation approaches. Further details are given in the caption of Figure 2.
17
Results from scenarios with ER
0.0
0.2
n=1000
0.4
0.6
0.8
1.0
n=3000
n=5000
1.0
ρ=0.1
ρ=0.1
0.0
ρ=0.1
0.5
−0.5
−1.0
^
average f 21(z1)
n=1000
n=3000
n=5000
1.0
ρ=0.5
ρ=0.5
ρ=0.5
0.5
0.0
−0.5
−1.0
n=1000
n=3000
n=5000
1.0
ρ=0.9
ρ=0.9
0.0
ρ=0.9
0.5
−0.5
−1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
z1
Figure 5: Average fits calculated using the estimated smooth functions for f 21 (z1 ) based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the naive (grey lines),
Heckman-type (dashed lines) and bivariate probit (dotted lines) estimation approaches. The true function is
represented by the black solid lines. Note that in some cases the curves differ only minimally, hence they can
hardly be distinguished. Further details are given in the caption of Figure 2.
Results from scenarios with noER
n=1000
n=3000
n=5000
4
ρ=0.1
ρ=0.1
0
ρ=0.1
2
−2
−4
n=1000
n=3000
n=5000
4
ρ=0.5
ρ=0.5
θ^22
ρ=0.5
2
0
−2
−4
n=1000
n=3000
n=5000
4
ρ=0.9
ρ=0.9
0
ρ=0.9
2
−2
−4
50%
75%
50%
75%
50%
75%
percentage of observations in the selected sample
Figure 6: Boxplots of the parameter estimates of θ22 based on 1000 replications obtained from the experiments
with no exclusion restriction (noER), when employing the naive (white boxplots), Heckman-type (grey) and
bivariate probit (darker grey) estimation approaches, and when approximately 50% and 75% of the total number of observations are available to fit the outcome equation. The true value (dashed lines) is −1.5. Further
details are given in the caption of Figure 2.
18
Results from scenarios with noER
n=1000
n=3000
n=5000
2.5
ρ=0.1
ρ=0.1
1.5
ρ=0.1
2.0
1.0
0.5
n=1000
n=3000
n=5000
^
RMSE( f 21)
2.5
ρ=0.5
ρ=0.5
ρ=0.5
2.0
1.5
1.0
0.5
n=1000
n=3000
n=5000
2.5
ρ=0.9
ρ=0.9
1.5
ρ=0.9
2.0
1.0
0.5
50%
75%
50%
75%
50%
75%
percentage of observations in the selected sample
Figure 7: Boxplots of the empirical root mean squared errors (RMSE) of fb21 (z1 ) based on 1000 replications obtained from the experiments with no exclusion restriction (noER), when employing the naive (white boxplots),
Heckman-type (grey) and bivariate probit (darker grey) estimation approaches, and when approximately 50%
and 75% of the total number of observations are available to fit the outcome equation. Further details are given
in the caption of Figure 2.
probit estimation schemes and it is not yet clear why this is the case (Chiburis et al., 2011,
and references therein). Here, nonparametric bootstrapped confidence intervals based on
199 replications appeared to provide an effective and simple fix for undercoverage; rates
were in the interval (0.91, 0.97). For f 21 (z1 ), the bivariate probit approach yielded close to
nominal coverages, with values in the range (0.93, 0.97). Interestingly, for the Heckman
method, confidence intervals calculated without employing the correction procedure described in Section 2.5 produced rates which were not as low as expected, with values ranging from to 0.92 to 0.94. Corrected intervals offered marginal improvements, increasing the
rates by 0.02 points.
A number of alternative design settings were tried out using different nonlinear functions and changing the impact of the parametric components; the resulting conclusions were
fairly similar to those reported here. In addition, in the spirit of Little (1985), to gain further
insights into the identification problem, we carried out additional simulation experiments
where the model errors were generated according to a bivariate Student-t distribution. This
was achieved using rmvt() in the R package mvtnorm. Degrees of freedom equal to 3 and
15 were considered. All remaining design and model fitting settings were the same as those
described in the previous section. As expected, the results (available upon request) were
worse than those presented in this section. However, the use of ER helped to obtain better
19
estimates although still not as good as those produced when assumption (3) is met. These
findings were in agreement with those by Marra and Radice (2011a) obtained in a similar
context.
4 Application
In this section, we illustrate the proposed methods using data on public opinion polls on
school integration. As argued, e.g., in Verba et al. (1995), opinion polls may represent the
view of the public well. However, in certain situations, they may poorly reflect collective
public sentiment because some individuals choose not to respond to some specific questions
as they feel that their opinion may be perceived as socially unacceptable. When some individuals are not willing to show their views, polls measuring collective opinion on sensitive
topics typically provide biased estimates of preferences in the population a whole. A number of studies have recognized the presence of this phenomenon which can be problematic
for policy information (e.g. Groves et al., 2002). A survey of public opinion polls on school
integration conducted in the USA is one such study (Berinsky, 1999).
4.1 American national election study
We use data from the American National Election Survey (ANES, http://www.electionstudies.o
conducted in 1992, on public opinion polls on school integration. As mentioned in the introduction, the main question was whether respondents support government intervention
to ensure that black and white children go to the same school. About 700 individuals were
first asked if they had an opinion on the integration question (0 = no, 1 = yes) and then
what that opinion was (0 = no integration, 1 = yes integration). This gave respondents an
opportunity to opt out of the question answering process at an early stage. 64.57% of the
individuals chose to answer the integration question. Among these, the proportion of ‘yes’
answers was 46.43%. The dataset also included information on individual demographic and
socio-economic characteristics. The variables considered were age (in years), educ (number of years of education), sex (0 = female, 1 = male), race (0 = black, 1 = white), reg
(0 = North-Central, 1 = North-East, 2 = South, 3 = West), child (number of children in
the household), discpol (0 = never discuss politics, 1 = discuss politics), and perslett
which was a binary variable indicating whether the interviewer attempted to convert a respondent who initially refused to participate in the survey.
We analyzed the ANES dataset using the naive, Heckman-type and bivariate probit estimation approaches with the same model fitting settings as those described in the simulation
study section. The outcome equation included the variables sex, race, reg, and child as
20
parametric components, and smooth functions of age and educ. Following Berinsky (1999),
the selection equation included the same components as those of the outcome equation plus
discpol and perslett. child was included as a parametric component because it did
not have enough unique covariate values to justify the use of a smooth function.
4.2 Results and interpretation
Table 1 and Figure 8 report the parametric and smooth function estimates for the outcome
equation and, for completeness, also those for the selection equation, when applying the
three approaches on the ANES dataset. In the selection equation, the estimated effect of
discpol is the strongest and significant at the 5% level. The effect of age is linear for both
the bivariate probit and Heckman approaches. The function estimates of educ are linear
and nonlinear for bivariate probit and Heckman, respectively, and the bivariate probit intervals do not contain the Heckman estimated curve for a part of the range of covariate
values. In the outcome equation, the parameter of race exhibits the strongest effect, and is
nearly significant for bivariate probit. The effects of age and educ show different degrees
of nonlinearity across the three methods, and for age the bivariate probit intervals do not
contain the naive function estimate for a small part of the covariate value range. The estimates of ρ, which are important to ascertain the presence of selection bias, are high and, for
bivariate probit, statistically significant. This indicates that the process by which individuals
decide to answer the integration question is connected to the process by which they decide
what the answer is. The positive sign of ρb suggests that the unobservable factors which lead
individuals to take part in the survey also lead them to take a more supportive stance on
the integration issue.
A comparison between the outcome equation parametric estimates obtained from the
naive approach and those from the Heckman and bivariate probit methods indicates that,
while none of them change sign, the substantive power of some parameters is altered by
the correction for the selection bias present in the question-answering process. For instance,
the bivariate probit coefficient of race decreases by about 14% percent relative to that of
the naive approach. However, the movement of the coefficient estimates of (Intercept)
is more impressive. The naive estimate is positive, while it is negative for Heckman and
bivariate probit. This means that once selection effects are accounted for, respondents are
more likely to oppose school integration than the naive estimate suggests. This result is
consistent with that of Berinsky (1999) who also found that correcting for selection bias
decreases the probability of supporting government efforts to integrate schools.
The nonlinear impacts of age and educ obtained using the bivariate probit approach are
consistent with the interpretations that the effect of age on school integration is negative up
21
to a certain age (56-58 years) after which it becomes positive, and that educ has a positive
impact throughout the whole range of covariate values. Different interpretations are obVariable
Naive
Heckman
Bivariate probit
-
0.12 (−0.30, 0.54)
0.04 (−0.17, 0.25)
−0.23 (−0.56, 0.09)
0.27 (−0.04, 0.58)
0.19 (−0.09, 0.47)
0.32 (0.02, 0.62)
0.08 (−0.03, 0.20)
0.33 (0.08, 0.57)
0.27 (−0.03, 0.57)
0.06 (−0.45, 0.58)
0.05 (−0.19, 0.27)
−0.22 (−0.57, 0.10)
0.26 (−0.06, 0.59)
0.19 (−0.12, 0.46)
0.31 (0.01, 0.63)
0.08 (−0.06, 0.21)
0.32 (0.05, 0.58)
0.26 (−0.05, 0.58)
0.26 (−0.20, 0.73)
−0.11 (−0.37, 0.15)
−0.77 (−1.17, −0.37)
0.44 (0.06, 0.83)
0.36 (0.00, 0.72)
0.40 (0.03, 0.78)
0.11 (−0.03, 0.25)
−0.22 (−1.08, 0.62)
−0.07 (−0.32, 0.17)
−0.71 (−1.21, −0.25)
0.45 (−0.04, 0.98)
0.34 (−0.13, 0.85)
0.43 (−0.04, 0.95)
0.12 (−0.03, 0.30)
−0.30 (−1.15, 0.57)
−0.07 (−0.47, 0.33)
−0.66 (−1.29, −0.05)
0.46 (−0.03, 0.91)
0.35 (−0.10, 0.82)
0.44 (−0.05, 0.89)
0.13 (−0.33, 0.55)
-
0.88 (−0.28, 2.03)
0.84 (0.45, 0.99)
Selection Eq.
(Intercept)
sex.male
race.white
reg.northeast
reg.south
reg.west
child
discpol
perslett
Outcome Eq.
(Intercept)
sex.male
race.white
reg.northeast
reg.south
reg.west
child
ρ
Table 1: Parametric estimates obtained applying the naive, Heckman-type and bivariate probit estimation
approaches on the ANES dataset described in Section 4.1. Within parentheses are 95% confidence intervals
calculated using the procedure described in Section 2.5 with Ns = 20 and Nd = 100 for Heckman, and nonparametric bootstrap based on 199 replications for bivariate probit. Note that results from the naive approach
concern only the outcome equation.
tained when looking at the naive and Heckman smooth term estimates. These results could
offer new empirical insights if analysed in light of the sociological and economic literature,
but this is beyond the scope of this article.
To gauge the aggregate effects of the selection bias in the question-answering process, we
estimated the mean predicted probabilities of giving a supportive response under the three
methods. These were 0.39, 0.15 and 0.09 for the naive, Heckman-type and bivariate probit
approaches. So, predicted support for school integration is much lower when selection bias
is accounted for, hence indicating that expressed opinion on the school integration question
is a poor barometer of underlying support for integrationist policy. Based on our simulation
study, estimation results from the bivariate probit approach can be regarded as the most
accurate.
22
0.5
0.0
−1.0
−0.5
f(educ) − selection eq.
0.4
0.2
0.0
−0.2
−0.4
f(age) − selection eq.
0.6
Note that the goodness of the results presented in this section relies especially on whether
the assumption of normality is met. For the simultaneous equation estimation approach, a
possibility would be to employ a score test of bivariate normality whose density of the errors under the alternative hypothesis is based on a type AA bivariate Gram Charlier series
with 9 additional parameters (Chiburis, 2010; Lee, 1984). However, it is not clear how this
test could be satisfactorily extended to the penalized framework considered here.
20
30
40
50
60
70
80
5
15
0.0
f(educ) − outcome eq.
−1.0
−0.5
0.6
0.4
0.2
0.0
−0.2
−0.4
f(age) − outcome eq.
10
educ
0.5
age
20
30
40
50
60
70
80
5
age
10
15
educ
Figure 8: Smooth function estimates obtained applying the naive (grey lines), Heckman-type (dotted lines)
and bivariate probit (solid lines) estimation approaches on the ANES dataset described in Section 4.1. The
results are reported on the scale of the linear predictors of the selection and outcome equations. The dashed
lines represent 95% Bayesian ‘confidence’ intervals calculated from the bivariate probit estimates. The ‘rug
plot’, at the bottom of each graph, shows the covariate values. Note that: in the left-top plot the Heckman and
bivariate probit smooth estimates can not be distinguished; to avoid clutter, corrected confidence intervals for
the Heckman approach have not been reported; results from the naive approach concern only the outcome
equation. Due to the identifiability constraints, the estimated curves pass through zero.
5 Discussion
In this article, we introduced two statistical methods for the (separate or simultaneous) estimation of two binary regression models involving semiparametric predictors in the presence of sample selection bias. The problems of identification and inference have also been
discussed. The methods have been illustrated using data on public opinion polls on school
integration. The application of the proposed approaches revealed that predicted support for
school integration is much lower when selection bias is accounted for. This finding could
23
not have been obtained otherwise. The approaches presented here have a great potential of
applicability in several research areas including economics and epidemiology.
The results of our simulation study suggested that the bivariate probit and Heckmantype estimation approaches are effective for flexibly estimating covariate impacts in the
presence of selection bias, with the former being more accurate and reliable. We also shed
light on the issue of empirical identification. The main finding was that, in the absence of
exclusion restriction and provided that the assumption of normality is met, if the proportion of selected observations is fairly high then empirical identification is equally achieved,
especially when using the bivariate probit estimation method.
Since maximum likelihood estimation schemes are typically sensitive to model error
misspecification, extensions of our proposals allowing for different joint distributions of
the model errors seem feasible adopting either a copula approach (e.g. Nelsen, 2006) or a
nonparametric distribution function estimation framework (e.g. Gallant and Nychka, 1987;
Klein and Spady, 1993). To accommodate more complex data structures arising, for instance,
in longitudinal studies, future research will also focus on extending the methods presented
in this article to allow for random effects in the linear predictors.
Appendix
The expressions for the gradient vector and Fisher information matrix that are referred to in
Section 2.3 are given here. The elements of the gradient vector are defined as
g1
g2
g3∗
y1i y2i
y1i (1 − y2i ) (1 − y1i )
= ∑ φ(η1i ) Φ( Ai )
+ {1 − Φ( Ai )}
−
X1i ,
p11i
p10i
p0i
i =1
n
y1i y2i y1i (1 − y2i )
= ∑ φ(η2i ) Φ( Bi )
−
X2i ,
p11i
p10i
i =1
n
y1i y2i y1i (1 − y2i )
∂ρ
= ∑ φ2 (η1i , η2i ; ρ)
−
,
p11i
p10i
∂ρ∗
i =1
n
where Xvi = x+
norvi , Bvi for v = 1, 2, φ2 is the density function of a standardized bivariate
p
2
∗
∗
∗
mal with correlation ρ, ∂ρ/∂ρ = 4 exp (2ρ )/ {exp (2ρ ) + 1} , Ai = (η2i − ρη1i )/ (1 − ρ2 ),
p
and Bi = (η1i − ρη2i )/ (1 − ρ2 ). The remaining quantities are defined in Section 2.3. The
24
elements of the Fisher information matrix are given as
I 11 =
I 22 =
∗
I 33
=
I 12 =
∗
I 13
=
∗
I 23
=
1
1
2 1
∑ φ(η1i ) Φ( Ai ) p11i + {1 − Φ( Ai )} p10i + p0i XT1i X1i ,
i =1
n
1
1
2
∑ {φ(η2i )Φ( Bi )} p11i + p10i XT2i X2i ,
i =1
n
1
∂ρ 2
1
2
∑ φ2 (η1i , η2i ; ρ) p11i + p10i ∂ρ∗ ,
i =1
n
Φ( Ai )Φ( Bi ) {1 − Φ( Ai )} Φ( Bi ) T
−
X1i X2i ,
∑ φ(η1i )φ(η2i )
p
p
11i
10i
i =1
n
1
1
∂ρ
∑ φ(η1i )φ2 (η1i , η2i ; ρ) Φ( Ai ) p11i − {1 − Φ( Ai )} p10i ∂ρ∗ X1i ,
i =1
n
1
∂ρ
1
X2i .
∑ φ(η2i )φ2 (η1i , η2i ; ρ) Φ( Bi ) p11i + p10i
∂ρ∗
i =1
n
2
2
References
[1] Bärnighausen, T., Bor, J., Wandira-Kazibwe, S., and Canning, D. (2011). Correcting
HIV Prevalence Estimates for Survey Nonparticipation Using Heckman-type Selection
Models. Epidemiology, 22, 27–35.
[2] Berinsky, A. J. (1999). The Two Faces of Public Opinion. American Journal of Political
Science, 43, 1209–1230.
[3] Chib, S., Greenberg, E., and Jeliazkov, I. (2009). Estimation of Semiparametric Models in the Presence of Endogeneity and Sample Selection. Journal of Computational and
Graphical Statistics, 18, 321–348.
[4] Chiburis, R. C. (2010). Score Tests of Normality in Bivariate Probit Models: Comment.
Working paper.
Available at https://webspace.utexas.edu/rcc485/www/research.html.
[5] Chiburis, R. C., Das, J., and Lokshin, M. (2011). A Practical Comparison of the Bivariate
Probit and Linear IV Estimators. World Bank Policy Research, Paper 5601.
[6] Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.
[7] Cuddeback, G., Wilson, E., Orme, J. G., and Combs-Orme, T. (2004). Detecting and
Statistically Correcting Sample Selection Bias. Journal of Social Service Research, 30, 19–
33.
25
[8] Dubin, J. A., and Rivers, D. (1990). Selection Bias in Linear Regression, Logit and Probit
Models. Sociological Methods and Research, 18, 360–390.
[9] Gallant, A. R., and Nychka, D. W. (1987). Semi-Nonparametric Maximum Likelihood
Estimation. Econometrica, 55, 363–390.
[10] Gentle, J. E. (2003). Random Number Generation and Monte Carlo Methods. London:
Springer-Verlag.
[11] Greene, W. H. (2007). Econometric Analysis. New York: Prentice Hall.
[12] Groves, R. M., Dillman, D. A., Eltinge, J. L., and Little, R. J. A. (2002). Survey Nonresponse. New York: Wiley.
[13] Gu, C. (1992). Cross validating non-Gaussian data. Journal of Computational and Graphical Statistics, 1, 169–179.
[14] Gu, C. (2002). Smoothing Spline ANOVA Models. London: Springer-Verlag.
[15] Hastie, T., and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal
Statistical Society Series B, 55, 757–796.
[16] Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47,
153–162.
[17] Klein, R., and Spady, R. (1993). An Efficient Semiparametric Estimator of the Binary
Choice Model. Econometrica, 61, 387–421.
[18] Lee, L. F. (1984). Tests for the bivariate normal distribution in econometric models with
selectivity. Econometrica, 52, 843–863.
[19] Little, R. (1985). A note about models for selectivity bias. Econometrica, 53, 1469–1474.
[20] Marra, G., and Radice, R. (2011a). Estimation of a semiparametric recursive bivariate
probit model in the presence of endogeneity. Canadian Journal of Statistics, 39, 259–279.
[21] Marra,
G.,
and
Radice,
R.
(2011b).
SemiParBIVProbit:
parametric
Bivariate
Probit
Modelling.
R
package
version
http://CRAN.R-project.org/package=SemiParBIVProbit.
Semi2.0-4.1,
[22] Marra, G., and Radice, R. (in press). A Flexible Instrumental Variable Approach. Statistical Modelling, available at www.ucl.ac.uk/statistics/research/reports.
[23] Marra, G., and Wood, S. N. (in press). Coverage Properties of Confidence Intervals for
Generalized Additive Model Components. Scandinavian Journal of Statistics, available
at www.ucl.ac.uk/statistics/research/reports.
26
[24] Montmarquettea, C., Mahseredjiana, S., and Houle, R. (2001). The determinants of university dropouts: a bivariate probability model with sample selection. Economics of Education Review, 20, 475–484.
[25] Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer.
[26] Nocedal, J., and Wright, S. J. (1999). Numerical Optimization. New York: Springer-Verlag.
[27] Nychka, D. (1988). Bayesian Confidence Intervals for Smoothing Splines. Journal of the
American Statistical Association, 83, 1134–1143.
[28] Puhani, P. A. (2000). The Heckman correction for sample selection and its critique. Journal of Economic Surveys, 14, 53–68.
[29] R Development Core Team (2011). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
[30] Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. London:
Cambridge University Press.
[31] Van de Ven, W. P. M. M., and Van Praag, B. M. S. (1981). The demand for deductibles in
private health insurance: a probit model with sample selection. Journal of Econometrics,
17, 229–252.
[32] Verba, S., Schlozman, K. L., and Brady, H. E. (1995). Voice and Equality: Civic Voluntarism
in American Politics. Cambridge: Harvard University Press.
[33] Wahba, G. (1983). Bayesian ‘Confidence Intervals’ for the Cross-Validated Smoothing
Spline. Journal of the Royal Statistical Society Series B, 45, 133–150.
[34] Wiesenfarth, M., and Kneib, T. (2010). Bayesian geoadditive sample selection models.
Journal of the Royal Statistical Society Series C, 59, 381–404.
[35] Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for
generalized additive models. Journal of the American Statistical Association, 99, 673–686.
[36] Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. London: Chapman & Hall.
27