A Penalized Likelihood Estimation Approach to Semiparametric Sample Selection Binary Response Modelling∗ Giampiero Marra Rosalba Radice Dept of Statistical Science Dept of Health Services Research & Policy University College London London School of Hygiene & Tropical Medicine Gower Street, London WC1E 6BT, U.K. Tavistock Place, London WC1H 9SH, U.K. [email protected] [email protected] April 30, 2012 Abstract Sample selection bias plays an important role when estimating the effects of covariates on an outcome variable using survey research data. Statistical methods addressing this issue typically concern situations where the outcome is continuous. Here, we are interested in the case where the response is binary. Moreover, continuous covariates may have a nonlinear relationship to the outcome. We introduce two statistical methods for the (separate or simultaneous) estimation of two binary regression models involving semiparametric predictors in the presence of sample selection bias. This is achieved via an adaptation of the Heckman procedure, and a simultaneous equation estimation scheme. Both approaches are based on the penalized likelihood estimation framework. The problems of identification and inference are also discussed. The empirical properties of the proposed methods are studied through an extensive simulation study. As an application, we analyse data from the American National Election Study with the aim of more realistically quantifying public support for school integration. Keywords: Binary responses; Bivariate probit; Heckman-type estimation approach; Penalized regression spline; Sample selection bias; Survey. ∗ Research Report No. 315, Department of Statistical Science, University College London. Date: May 2012. 1 1 Introduction Sample selection bias is an issue that has plagued much of survey research (e.g. Groves et al., 2002). In public surveys, for example, it may be the case that some individuals choose not to answer some specific questions because they feel that their opinion might paint them in an unfavorable light. This leaves us with a self-selected sample. So if the interest is quantifying the relationship between various demographic and socio-economic characteristics and an outcome variable in the population as a whole, then using the responding subsample is likely to produce biased estimates. This source of bias is known in the literature as sample selection bias (Heckman, 1979; Dubin and Rivers, 1990). To fix ideas, let us consider a study of public opinion polls on school integration (http://www.electionstudies.org/). The main question was if respondents support government intervention to ensure that black and white children go to the same school. In particular, individuals were first asked if they had an opinion on the integration question (0 = no, 1 = yes) and then what that opinion was (0 = no integration, 1 = yes integration). Information on individual demographic and socio-economic characteristics was also recorded. If the respondents who opposed government involvement in school integration chose not to answer the integration question because they felt their opinion might be perceived as socially unacceptable, then it might have been the case that the sample of individuals who provided an opinion differed in systematic ways from the sample of non-respondents. To clarify this (often misunderstood) concept, let us characterize each individual by some observable and unobservable features or confounders. If the responding and nonresponding subsamples have similar characteristics, then the issue of sample selection bias does not arise since the average (observable and unobservable) features of the responding sample are similar to those of the population. If the decision to answer is no longer random, due to potentially differing characteristics between the responding and nonresponding samples, then sample selection bias arises if some of the determinants of the decision to answer also influence the outcome/response itself. When the relationship between the decision to respond and outcome is only through observables, it is possible to correct for sample selection bias by controlling for these variables in the outcome equation. However, in the presence of unobservables influencing the decision to answer and the outcome, controlling only for observables is clearly insufficient. That is, if some individuals are part of the responding subsample because of their unobservable features, then regardless of whether the observables and unobservables are correlated in the overall population they will be in the selected sample (e.g. Dubin and Rivers, 1990). This means that ignoring the potential correlation between the unobservable factors influencing the decision to answer and the outcome can lead to biased and inconsistent estimates of the covariate impacts in the outcome equation. 2 Note that there are a number of examples where the issue of sample selection bias occurs. These include the study of prevention program for high school dropouts (Montmarquette et al., 2001), family-related factors for foster approval (Cuddeback et al., 2004), and risk factors for HIV prevalence (Bärnighausen et al., 2011), to name a few. Statistical methods correcting for the phenomenon described in the previous paragraph have been developed. Most of these works concern models where the response variable is continuous (e.g. Chib et al., 2009; Heckman, 1979; Green, 2007; Wiesenfarth and Kneib, 2010, and references therein). Here, we are interested in the case where the response is binary. The only procedures currently available to fit a sample selection binary response model are those presented in Van de Ven and Van Praag (1981) and Green (2007). These involve the (separate or simultaneous) estimation of two binary regression models for the outcome and selection equations. The outcome equation is used to examine the substantive question of interest, whereas the selection equation is used to detect selection bias and obtain unbiased estimates of the covariate effects in the outcome equation. One potential drawback to the application of the techniques by Van de Ven and Van Praag (1981) and Green (2007) is the lack of flexibility in handling the possible presence of nonlinear relationships. The need for methods flexibly modelling regressor effects arises from the observation that all model parameter estimates are biased and inconsistent when the relationship between the observable confounders and the outcome is mismodelled as a result of the use of linear or low-order polynomial terms for example (Chib et al., 2009; Marra and Radice, 2011a). This is problematic as it may prevent the researcher from recognizing a strong covariate effect or, more generally, revealing interesting relationships. In this article, we discuss an extension of the procedure presented in Van de Ven and Van Praag (1981) which allows for flexible functional dependence of the binary response variables on continuous covariates. We also introduce a penalized likelihood estimation approach for a simultaneous system of two binary equations which include smooth functions of continuous covariates. The methods discussed here are implemented in the freely available R package SemiParBIVProbit (Marra and Radice, 2011b). To the best of our knowledge, no other computational alternatives which consider a semiparametric sample selection binary response model are available in the literature. It may be argued that the model setup adopted here is pretty similar to that of Chib et al. (2009) and Wiesenfarth and Kneib (2010) except for the fact that, in the current context, the outcome variable is binary. This suggests that the Bayesian estimation schemes introduced by these authors could be easily extended to estimate the model considered in this paper. We did not pursue a Bayesian approach because the computational developments by Chib et al. (2009) are not available for public use, and because the approach by Wiesenfarth and Kneib (2010) suffers from convergence problems due to the persistent autocorrelations of the parameters sam3 pled in the Markov chain Monte Carlo algorithm. As Wiesenfarth and Kneib (2010) point out, this issue is common in Bayesian sample selection models and the autocorrelations do not disappear even with long simulation runs and considerable thinning. Such an issue becomes even more severe when the correlation between the outcome and selection equations is high. The empirical properties of the methods are studied through an extensive simulation study. The proposals are then illustrated using the above-mentioned data on public opinion polls on school integration. Overall, the results indicate that the introduced approaches are effective for flexibly estimating covariate impacts in the presence of selection bias, and for more realistically quantifying public support for school integration. 2 Methods In this section, we describe the model structure of a semiparametric binary response model with sample selection, discuss two possible strategies for parameter estimation and address the problems of identification and inference. 2.1 The model The model consists of a first equation, known as selection equation, and a second outcome equation determining the response variable of interest. The selection equation, expressed using the latent variable representation, is given as ∗ + y1i = x1i θ1 + K1 ∑ k 1 =1 f 1k1 (z1k1 i ) + ε 1i , i = 1, . . . n, (1) ∗ is a latent continuous variable which is related to where n denotes the sample size, and y1i ∗ > 0). The outcome equation is given as its observable counterpart y1i through the rule 1(y1i ∗ + y2i = x2i θ2 + K2 ∑ k 2 =1 f 2k2 (z2k2 i ) + ε 2i , (2) where y2i is determined according to the rule y2i = ( ∗ 1 if (y2i > 0 & y1i = 1) ∗ 0 if (y2i < 0 & y1i = 1) 4 . + + + In (1), x1i = 1, x12i , . . . , x1P represents the ith row vector of x1+ , the n × P1 model matrix 1i for any strictly parametric model components (such as dummy and categorical variables), with corresponding parameter vector θ1 . The f 1k1 are unknown smooth functions of the continuous covariates z1k1 i , and K1 denotes the number of regressors z1k1 i . These components are represented using the regression spline approach (see next section). Each smooth term may also be multiplied by some predictor, yielding a ‘varying coefficients’ model (Hastie and Tibshirani, 1993). Moreover, smooth functions of two covariates such as f 11,12 (z11i , z12i ) + can be also considered (e.g. Wood, 2006, pp. 154-167). Similarly in (2), x2i is the ith row vector of the ns × P2 model matrix x2+ , with coefficient vector θ2 , the f 2k2 are unknown smooth terms of the continuous regressors z2k2 i , K2 denotes the number of regressors z2k2 i and ns the size of the selected sample. For identification purposes, smooth terms are subject to constraints such as ∑i f k (zki ) = 0 for all k. The error terms (ε 1i , ε 2i ) are assumed to follow the bivariate normal distribution ! " # " #! ε 1i 0 1 ρ iid ∼N , , (3) ε 2i 0 ρ 1 where ρ is the correlation coefficient. The error variances are normalized to unity. This is a conventional normalization that is necessary because the parameters in the model can only be identified up to a scale coefficient (e.g. Greene 2007, p. 686). 2.1.1 Smooth function representation A popular and effective way of representing smooth functions of continuous covariates is the regression spline approach. The basic idea is to approximate a generic function f k (zki ) by a linear combination of known spline basis functions, bkj (zki ), and regression parameters, β kj , Jk f k (zki ) = ∑ β kj bkj (zki ) = Bk (zki )βk , j =1 where J is the number of spline bases and hence regression coefficients used to represent f , Bk (zki ) represents the ith 1 × Jk row vector consisting of the basis functions evaluated at the observation zki , i.e. Bk (zki ) = bk1 (zki ), bk2 (zki ), . . . , bkJ (zki ) , and βk is the corresponding parameter vector. Calculating Bk (zki ) for each i yields Jk curves encompassing different degrees of complexity which multiplied by some real valued parameter vector βk and then summed give a curve estimate for f k (zk ). Basis functions are usually chosen to have convenient mathematical properties and good numerical stability. Possible choices are B-splines, cubic regression and thin plate regression splines. We will employ the latter splines although, in the one dimension case, they all lead to very similar results. See, e.g., Ruppert et 5 al. (2003) for a more detailed introduction. Based on the result above, equations (1) and (2) can written as ∗ + y1i = x1i θ1 + B1i β1 + ε 1i , i = 1, . . . , n, (4) and where Bvi 1, 2. ∗ + y2i = x2i θ2 + B2i β2 + ε 2i , i = 1, . . . , ns , (5) T , β T , . . . , β T ), for v = = Bv1 (zv1i ), Bv2 (zv2i ), . . . , BvK1 (zvKv i ) and βvT = (βv1 v2 vKv 2.2 Heckman-type estimation approach Sample selection bias can be dealt with by using a device which Heckman (1979) introduced for an analogous problem in binary-choice regression, and that Van de Ven and Van Praag (1981) employed in the context of a fully parametric probit model with sample selection. Here, we use the same device but in the context of semiparametric probit analysis. The population regression function for (5) can be written as + ∗ + |x2i , B2i ) = x2i θ2 + B2i β2 , E (y2i ∗ > 0) while the regression function for the subsample of complete observations (i.e. when y1i is ∗ + ∗ + ∗ + , B2i , y1i > 0). (6) θ2 + B2i β2 + E (ε 2i |x2i > 0) = x2i E (y2i |x2i , B2i , y1i Given that (ε 1i , ε 2i ) are assumed to follow a standardized bivariate normal distribution with correlation coefficient ρ, we have that + ∗ E (ε 2i |x2i , B2i , y1i > 0) = ρϑi (7) + where ϑi = φ(η1i )/Φ(η1i ), and is typically called inverse Mills ratio, η1i = x1i θ1 + B1i β1 , and φ and Φ are the density function and distribution function of a standardized normal. Regression equation (6) can be therefore written as + ∗ θ2 + B2i β2 + ρϑi + e ε 2i , = x2i y2i (8) ∗ > 0) = 0 and E (e ∗ > 0) = τ 2 . Based on the derivations by Heckman where E (e ε 2i |y1i ε22i |y1i i (1979), τi2 = 1 + ρ2 ϑi (−η1i − ϑi ). It is then clear to see that the coefficient estimates in (5) obtained using a nonrandom selected subsample are biased if ρ 6= 0. This can be thought of as arising from an ordinary specification error with the conditional mean (7) deleted as a covariate in the model. Including ϑi as an explanatory variable, as in equation (8), would 6 rectify this situation. In practice we do not know ϑi , but it is possible to obtain a consistent estimate of it based on the estimated coefficients of selection equation (4). After dividing ∗ > 0) = τ 2 , and using a selected sample, we can obtain equation (8) by τi , because E (e ε22i |y1i i unbiased parameter estimates using the model + ∗ ∗ β2 + ρ(ϑi /τi ) + ε¯ 2i , = (x2i /τi )θ2 + B2i y2i (9) ∗ includes the quantities corresponding to the spline bases for the smooth functions where B2i ∗ > 0) = 0, E ( ε¯2 | y∗ > 0) = 1, and ϑ and τ can of the covariates z2k2 i rescaled by τi , E (ε¯ 2i |y1i i i 2i 1i be consistently estimated as described above. The algorithm to fit model (9) can be summarised as follows: step 1 Fit a probit model for sample selection equation (4) so to obtain consistent estimates of ηb1i and hence ϑbi , for all i. step 2 Using the selected sample only, obtain a consistent estimate of ρ by fitting a linear b probability model q for equation (8), where ϑi is replaced with ϑi for all i. step 3 Estimate τbi via 1 + ρb2 ϑbi (−ηb1i − ϑbi ) where ϑbi , ηb1i and ρb are obtained in steps 1 and 2. step 4 Using the selected sample only and after rescaling all variables in the model by the τbi , fit a probit model for equation (9). Remark 1. Equation (9) is fitted using a probit model even if the normality assumption of ε¯ 2i , which is necessary for consistency, is clearly not met (e.g. Van de Ven and Van Praag, 1981). Hence, the results obtained from step 4 will be affected by some bias in the estimates. In addition, the method can produce an estimate for ρ which is not in the range [−1, 1]. Remark 2. Standard errors of the parameter estimates in (9) are not realistic in that, for example, they do not account for that additional source of sampling variability introduced because the ϑi and τi have to be estimated. Remark 3. The models used in the steps outlined above may be fitted using unpenalized parameter estimation procedures. However, because of the flexible model specification considered here, this is likely to result in smooth function estimates that are too wiggly to produce realistic and reliable results. This issue can be overcome by penalized estimation, R where an objective function is augmented by a penalty term, such as ∑k λk f k00 (zk )2 dzk , measuring the (second-order, in this case) roughness of the smooth terms in the model. The λk are smoothing parameters controlling the trade-off between fit and smoothness. In practice, since regression splines are linear in their model parameters, such a penalty can be expressed as a quadratic form in the generic parameter vector β (containing the coefficients R of all smooth terms in the model), i.e. ∑k λk f k00 (zk )2 dzk = β T (∑k λk Sk ) β, where the Sk are positive semi-definite known matrices. Depending on the model employed, the regression parameters can be estimated by either minimization of a penalized least squares criterion or 7 maximization of a penalized log-likelihood function. The λk can be selected via a prediction error or maximum likelihood criterion (e.g. Ruppert et al., 2003, Ch. 8). 2.3 Bivariate probit estimation approach Since it is assumed that the errors (ε 1i , ε 2i ) follow a standardized bivariate normal distribution with correlation coefficient ρ, simultaneous equation parameter estimation is advisable. Let us define the linear predictors ηvi = x+ vi θv + Bvi βv for v = 1, 2. Recognizing that in the sample selection context the data identify only the three possible events (y1i = 1, y2i = 1), (y1i = 1, y2i = 0) and (y1i = 0), with the following probabilities + + P (y1i = 1, y2i = 1|x1i , B1i , x2i , B2i ) = p11i = Φ2 (η1i , η2i ; ρ), + + P (y1i = 1, y2i = 0|x1i , B1i , x2i , B2i ) = p10i = Φ(η1i ) − Φ2 (η1i , η2i ; ρ), + P (y1i = 0|x1i , B1i ) = p0i = Φ(−η1i ), where Φ2 is the distribution function of a standardized bivariate normal with correlation ρ, the log-likelihood function is n `(δ ) = ∑ {y1i y2i log( p11i ) + y1i (1 − y2i ) log( p10i ) + (1 − y1i ) log( p0i )} , i =1 where δ T = (δ1T , δ2T , ρ), and δvT = (θvT , βvT ) for v = 1, 2. As explained in the previous section, in a smoothing context, it is necessary to penalize the regression spline coefficients in order to suppress that part of smooth term complexity which has no support from the data. The model is therefore fitted by maximization of the penalized log-likelihood 1 (10) ` p (δ ) = `(δ ) − β T Sλ β, 2 where β T = (β1T , β2T ) and Sλ = ∑2v=1 ∑kKvv=1 λkv Skv . Note that because ρ is bounded in [−1, 1], we use the common transform for correlation ρ∗ = tanh−1 (ρ) = (1/2) log {(1 + ρ) / (1 − ρ)}, so that [−1, 1] is mapped to the real line. Given values for the λkv , we seek to maximize (10). In practice, this can be achieved by Fisher scoring iterating ∗ −1 [ a ] δˆ [a+1] = δˆ [a] + (I [a] + Sλ ) (g − Sλ∗ δˆ [a] ) (11) ∗ until convergence, where a is the iteration index and Sλ denotes the overall block-diagonal penalty matrix made up of the λkv Skv and 0 corresponding to any strictly parametric model components, since such terms are not penalized. The gradient vector g is defined by two subvectors g1 = ∂`(δ )/∂δ1 and g2 = ∂`(δ )/∂δ2 , and a scalar g3∗ = ∂`(δ )/∂ρ∗ , while the 8 Fisher information matrix has the following 3 × 3 matrix block structure    I = −E   ∂2 `(δ ) ∂δ1 ∂δ2T ∂2 `(δ ) ∂δ2 ∂δ2T ∂2 `(δ ) ∂ρ∗ ∂δ2 ∂2 `(δ ) ∂δ1 ∂δ1T ∂2 `(δ ) ∂δ2 ∂δ1T ∂2 `(δ ) ∂ρ∗ ∂δ1 ∂2 `(δ ) ∂δ1 ∂ρ∗ ∂2 `(δ ) ∂δ2 ∂ρ∗ ∂2 `(δ ) ∂ρ∗2    .  The analytical expressions for g and I in the three main parameter space dimensions are given in the Appendix. Step (11) is implemented using a trust region algorithm with eigendecomposition of I at each iteration (e.g. Nocedal and Wright, 1999, Section 4.2). This proved to be faster and more reliable than the standard approaches adopted in the literature to estimate likelihood-based models. In (11), the smoothing parameters are fixed at some values. This is because joint estimation of δ and λ (containing all the λkv ) via maximization of (10) would clearly lead to severe overfitting since the highest value for ` p (δ ) would be obtained when λ = 0. Hence the need to estimate λ using an appropriate criterion. 2.3.1 Smoothing parameter estimation Smoothing parameter selection can, in principle, be achieved by direct grid search optimization of, e.g., a prediction error criterion. However, if the model has more than two or three smooth terms, then this typically becomes computationally burdensome, hence making the model building process difficult in most applied contexts. There are a number of techniques for automatic multiple smoothing parameter selection within the penalized likelihood framework. Without claim of exhaustiveness, these include the performanceoriented iteration method originally proposed by Gu (1992) and mixed model approach to penalized regression spline estimation (e.g. Ruppert et al., 2003). The former applies the generalized cross validation or unbiased risk estimator (UBRE), introduced by Craven and Wahba (1979), to each working linear model of the penalized iteratively re-weighted least squares (P-IRLS) scheme used to fit the model. The latter consists of viewing the λkv as variance components so that they can be estimated, e.g., by restricted maximum likelihood. Here, we adapt Gu’s approach to the current context. Given values for the λkv and considering that Fisher scoring step (11) can be written in P-IRLS form, δˆ [a+1] is the solution to the problem √ [ a] ∗ δ w.r.t. δ, minimize k W (z[a] − Xδ )k2 + δ T Sλ (12) √ T√ √ where W is any iterative weight non-diagonal matrix square root such that W W = W, zi is a 3-dimensional pseudodata vector given as zi = Xi δ [a] + Wi−1 di , where di = 9 ∗ and η ∗ = ρ∗ , W is the 3 × 3 matrix ∂`(δ )i /∂η1i , ∂`(δ )i /∂η2i , ∂`(δ )i /∂η3i i 3i    Wi = − E   ∂2 `(δ )i 2 ∂η1i 2 ∂ `(δ )i ∂η2i ∂η1i ∂2 `(δ )i ∗ ∂η ∂η3i 1i ∂2 `(δ )i ∂η1i ∂η2i ∂2 `(δ )i 2 ∂η2i 2 ∂ `(δ )i ∗ ∂η ∂η3i 2i ∂2 `(δ )i ∗ ∂η1i ∂η3i 2 ∂ `(δ )i ∗ ∂η2i ∂η3i 2 ∂ `(δ )i ∗2 ∂η3i    ,  and, assuming without loss of generality that the spline basis dimensions for the smooth terms in the model are all equal to J, Xi is a 3 × {( P1 + K1 × J ) + ( P2 + K2 × J ) + 1} block + + diagonal matrix, i.e. Xi = diag (x1i , B1i ), (x2i , B2i ), 1 . The superscript [ a] has been suppressed from di , zi , and Wi to avoid clutter. The λkv should be selected so that the estimated smooth functions are as close as possible to the true functions. Given an estimate for δ, multiple smoothing parameter estimation for problem (12) can be achieved by minimization of the approximate UBRE score, also thought of as an approximate rescaled Akaike information criterion, Vuw (λ) = 2 1 √ k W(z − Xδ )k2 − 1 + tr(Aλ ), n∗ n∗ (13) where the working linear model quantities are estimated in the Fisher scoring step, n∗ = 3n, ∗ −1 T Aλ = X(XT WX + Sλ ) X W is the hat matrix, and tr(Aλ ) represents the estimated degrees of freedom of the penalized model. For each working linear model of the P-IRLS iteration, Vuw (λ) is minimized with respect to λ. The two steps, one for δ the other for λ, are iterated until convergence. Here, this approach is implemented employing the approach by Wood (2004), which is based on Newton’s method and can evaluate score (13) and their derivatives in a way that is both computationally efficient and stable. Generally speaking, √ Q and this is achieved using WX = QR, obtained by pivoted QR decomposition, " where # R R are defined in the usual manner, and a singular value decomposition = UDVT , L ∗ ∗ where L is any matrix square root of Sλ such that LT L = Sλ , obtained for instance by eigendecomposition, the columns of U are those of an orthogonal matrix, V is an orthogonal matrix, and D is a diagonal matrix of singular values which are very useful to detect numerical rank deficiency of the fitting problem. Based on this, evaluation of tr(Aλ ) for new trial values of the smoothing parameters can be made relatively cheap, and the derivatives of Vuw with respect to λ can be stably and efficiently evaluated. Note that minimization of the score is with respect to λ∗ = log(λ) since the smoothing parameters must be positive. See Wood (2004) for further details. Remark 1. In the context of simultaneous equation estimation methods, the use of Fisher 10 scoring is recommended because the Wi are positive-definite over a larger region of the parameter space as compared to those obtained by Newton-Raphson. This is crucial given √ that W and W−1 (via z), obtained by eigen-decomposition, are needed in (13). Remark 2. Because W is a non-diagonal matrix of dimension n∗ × n∗ , computation can √ √ quickly become prohibitive, even for small problems. To calculate W−1 d, Wz and WX so that the computational load and storage demand of the algorithm is reduced considerably, the band structure of W is exploited. Hence, the working linear model in (13) is formed in O(n∗ (m + 2)) rather than O(n2∗ (m + 2)) operations, where m is the number of columns of X. Remark 3. As opposed to the Heckman-type approach, simultaneous estimation of all model parameters via the bivariate probit scheme introduced here does not rely on approximations and does not require the use of quantities estimated in previous stages, hence the procedure will yield unbiased and consistent estimates. Moreover, standard errors do not need to be corrected since the approach does not require the use of the inverse Mills ratio, and all parameters in δ are estimated jointly. All this convenience comes at expense of computational cost and stability. However, these can be satisfactorily dealt with by taking the precautions described in last two sections, and suppling the Heckman estimates (obtained using the procedure of Section 2.2) as starting values in the bivariate probit estimation scheme. 2.4 Empirical identification The parameters of the approach described in Section 2.2 are formally identified even if + + , B2i , because of the nonlinearity of the inverse Mills ratio (e.g. Puhani, x1i , B1i = x2i 2000). In practice, however, this will often result in substantial collinearity between ϑbi and the other covariates in the outcome equation. This is especially true when the variation in ηb1i will be such that the nonlinearity of the inverse Mills ratio does not play a major role, case in which ϑbi would be approximated quite well by a linear function of the covariates in the model. This collinearity can lead to large standard errors and instability in the estimates. The parameters of the method detailed in Section 2.3 are also formally identified, but with the advantage of not dealing with the drawbacks deriving from the use of the inverse Mills ratio. However, in the absence of equation-specific regressors the likelihood function may not vary significantly over a wide region around the mode, hence causing numerical difficulties. This suggests that, although the model parameters are formally identified, empirical identification is better achieved if the exclusion restriction (ER) on the covariates in the two equations holds (e.g. Little, 1985; Puhani, 2000). That is, the regressors in the selection equation should contain at least one or more regressors than those included in the outcome 11 equation. The simulation study in Section 3 will also shed light on this issue, particularly as to when the exclusion restriction is crucial to identify empirically the model parameters. 2.5 Inference The methods described in Section 2 rely on penalized log-likelihood estimation. Within this framework, inferential theory is not standard. This is because of the presence of smoothing penalties which undermines the usefulness of classic frequentist results for practical modelling (e.g. Wood, 2006). Solutions to this problem have been proposed. In this section, we show how to construct confidence intervals for the components in a semiparametric sample selection model adapting some of the results available in the literature. The well known Bayesian ‘confidence’ intervals originally proposed by Wahba (1983) in the univariate spline model context, and then generalized to the non-constant width component-wise interval case when dealing with Gaussian and non-Gaussian data are typically used to reliably represent the uncertainty of smooth functions (e.g. Gu, 2002). An interesting feature of these intervals is that they have close to nominal ‘across-the-function’ frequentist coverage probabilities, despite their derivation is Bayesian (Nychka, 1988; Marra and Wood, in press). To better understand this point, let us consider a generic smooth model component f (zi ). Intervals can be constructed seeking some constants Ci and A, such that 1 ACP = E n ( ∑ I(| fˆ(zi ) − f (zi )| ≤ qα/2 A/ i p Ci ) ) = 1 − α, (14) where ‘ACP’ denotes ‘Average Coverage Probability’, I is an indicator function, α is a constant between 0 and 1, and qα/2 is the α/2 critical point from a standard normal distribution. Defining b(z) = E { fˆ(z)} − f (z) and v(z) = fˆ(z) − E { fˆ(z)}, so that fˆ − f = b + v, and I to be a random variable uniformly distributed on {1, 2, . . . , n}, we have that ACP = √ √ Pr (| B + V | ≤ qα/2 A), where B = C I b(z I ) and V = C I v(z I ). At this point, it is necessary to find the distribution of B + V and values for the Ci and A so that requirement (14) is met. As theoretically shown in Marra and Wood (in press), in the context of non-Gaussian response models involving several smooth components, such a requirement is approximately met by the posterior distribution ˆ Vδ ), δ |yvN ˙ (δ, (15) where, for the approach described in Section 2.3, y refers to the response vectors, δˆ is the ∗ )−1 . Note that the same distributional result can be used estimate of δ and Vδ = (I + Sλ for the models described in Section 2.2. Given result (15), confidence intervals for linear and nonlinear functions of the model parameters can be easily obtained. For any strictly 12 parametric model components, using (15) to obtain confidence intervals is equivalent to using classic likelihood results. This is because such model terms are not penalized. It is important to stress that there is no contradiction in fitting the sample selection model via penalized log-likelihood estimation and then constructing confidence intervals following a Bayesian approach, and such a procedure has been employed many times in the literature (e.g. Gu, 2002; Marra and Radice, 2011a; Wood, 2006). Alternatively, one could employ a fully Bayesian approach. However, as explained in the introductory section, this is unlikely to be successful when used for parameter estimation. Result (15) can produce intervals with close to nominal coverage probabilities for the model components when using the estimation scheme described in Section 2.3. However, this is not true for the approach detailed in Section 2.2, for the reasons given in Remark 2. As a solution, posterior simulation can be employed (e.g. Wood, 2006). Specifically, we propose to adjust the intervals using the following algorithm: • Let the parameter vector and covariance Bayesian matrix estimated in step 1 be δˆ1 and ˆ δ . Draw Ns random vectors from N (δˆ1 , V ˆ δ ) and then calculate the corresponding V 1 1 ∗ ∗ b Ns values ηb and ϑ , for all i. 1i i ols , δˆ ols , . . . , δˆ ols and V ˆ ols , V ˆ ols , . . . , V ˆ ols . For each • Fit Ns step 2 models to obtain δˆ2,1 2,2 2,Ns δ2,1 δ2,2 δ2,Ns parameter vector and covariance matrix combination, draw Ns random vectors from the corresponding Gaussian distribution and then calculate the Ns2 values ρb∗ . ∗ and ρ b∗ . • Calculate the Ns2 values τbi∗ , for all i, using ϑbi∗ , ηb1i • Fit Ns2 step 4 models, using each of the ϑbi∗ and τbi∗ combination, to obtain δˆ2,1 , δˆ2,2 , . . . , δˆ2,Ns2 ˆ δ ,V ˆ δ ,...,V ˆ δ . For each parameter vector and covariance matrix combinaand V 2,2 2,1 2,Ns2 tion, draw Nd random vectors from the corresponding Gaussian distribution so to obtain Ns2 × Nd random draws from which approximate intervals for the component functions of model (9) can be constructed. This procedure can account for the extra source of variability introduced via the quantities calculated in the steps 1–3 of the Heckman-type estimation approach. Simulation experience suggests that small values for Ns and Nd , say 20 and 100, will be tolerable in practice. A similar scheme has been previously employed by Marra and Radice (in press) who found close to nominal across-the-function observed coverage probabilities for the model components. 3 Simulation study To gain insight into the empirical properties of the estimation approaches detailed in the previous sections, a Monte Carlo simulation study was conducted. All computations were per13 formed in the R environment (R Development Core Team, 2011) using the package SemiParBIVProbit which implements the ideas discussed in this article (Marra and Radice, 2011b). 3.1 Design and model fitting details The sampling experiments were based on the model ∗ y1i = θ11 + θ12 xi+ + f 11 (z1i ) + f 12 (z2i ) + ε 1i ∗ y2i = θ21 + θ22 xi+ + f 21 (z1i ) + ε 2i , 0.0 0.2 0.4 0.6 0.8 1.0 0.0 −0.2 f21(z1) −0.8 −0.6 −0.4 0.2 f12(z2) −2 −0.4 −0.2 0.0 0 −1 f11(z1) 0.4 1 0.6 0.2 0.8 where the binary outcomes y1i and y2i were determined according to the rules described in Section 2.1. The test functions were f 11 (z1i ) = −0.7 4z1i + 2.5z21i + 0.7 sin(5z1i ) + cos(7.5z1i ) , f 12 (z2i ) = −0.4 {−0.3 − 1.6z2i + sin(5z2i )}, and f 21 (z1i ) = 0.6 {exp(z1i ) + sin(2.9z1i )}; these are displayed in Figure 1. (θ12 , θ22 ) was set to (2.5, −1.5). To generate binary values for y1i so that approximately 50% of the total number of observations were selected to fit the outcome equation, and values for y2i which appeared approximately the same number of times, (θ11 , θ21 ) was set to (0.58, −0.68). As for xi+ , z1i and z2i , using rmvnorm() in the package mvtnorm, three uniform covariates on (0, 1) were obtained generating standardized multivariate random draws with correlation 0.5 and then applying pnorm() (Gentle, 2003). The regressor xi+ was dichotomized using round(). Standardized bivariate normal errors with correlations ρ = (±0.1, ±0.5, ±0.9) were considered, and sample sizes set to 1000, 3000 and 5000. In a full factorial design fashion, 1000 replications of each combination of parameter settings were obtained. 0.0 0.2 0.4 z1 0.6 0.8 1.0 0.0 z2 0.2 0.4 0.6 0.8 1.0 z1 Figure 1: The test functions used in the simulation studies. To explore empirically the issue of identification discussed in Section 2.4, we fitted the model using the data generating process described above but with no exclusion restriction (noER), i.e. without including z2i in the selection equation. We also considered the case 14 in which approximately 75% of the total number of observations were selected to fit the outcome equation. This was achieved setting θ11 to 1.66. In the two estimation approaches, the smooth components were represented using penalized thin plate regression splines with basis dimensions equal to 10 and penalties based on second-order derivatives (Wood, 2006, pp. 154-160). In cross-sectional studies, 10 bases typically suffice to represent reasonably well smooth functions, although sensitivity analysis using fewer or more spline bases is advisable in applied work. Models were also fitted neglecting the sample selection issue, i.e. simply fitting equation (5) on the selected sample (henceforth, this will be referred to as naive approach). 3.2 Results In this section, we only show a subset of results; these are representative of all empirical findings. As pointed out in the introduction, the selection equation is not affected by sample selection bias, hence we focus on the estimation results for the outcome equation. Figures 2, 3 and 4 present the boxplots of the estimates of θ22 , those of ρ, and the empirical root mean squared errors (RMSE) of fb21 (z1 ) when employing the naive, Heckman-type and bivariate probit estimation approaches, ER holds and approximately 50% of the total number of observations are available to fit the outcome equation. Figure 5 shows the estimated smooth functions for f 21 (z1 ) averaged over the simulation runs, whereas Figures 6 and 7 give the results for the estimates of θ22 and RMSE of fb21 (z1 ) when ER does not hold and approximately 50% and 75% of the total number of observations are available to fit the outcome equation. As, e.g., in Wiesenfarth and Kneibr(2010), based on the estimates for 200 n o2 b21 (z1i ) − f 21 (z1i ) . fixed covariate values, RMSE( fb21 ) was calculated as ∑200 f b =1 The results can be summarized as follows: • Overall, Figures 2 and 3 show that bivariate probit outperforms both the naive and Heckman approaches in terms of bias and precision of θb22 and ρb. Specifically, as ρ increases (i.e. as the sample selection issue becomes more pronounced) the accuracy of the naive and Heckman estimates deteriorates quickly, with the naive being the worst. In all cases, as n increases the bivariate probit estimates converge to their true values; this is not observed in the Heckman estimates for the reasons given in Remark 1 of Section 2.2. • The RMSEs and average fits of f 21 (z1 ), in Figures 4 and 5, indicate that the bivariate probit method performs better than the other two approaches, although the RMSEs of the former are characterized by more extreme values especially when ρ = 0.1. When sample selection bias is fairly negligible (i.e. ρ is low), none of the methods outperforms the others although the bivariate probit and Heckman estimation schemes are 15 obviously inefficient. As ρ increases the naive estimates are severely biased and not reliable, whereas those of bivariate probit and Heckman are consistently more accurate and precise. • The results from the scenarios with noER, in Figures 6 and 7, suggest that when an extra regressor does not enter the selection equation and about half of the total number of observations are available for the outcome equation, the estimates of all methods are biased and not reliable. As the percentage of selected observations increases the bivariate probit scheme strongly outperforms naive and slightly Heckman. This indicates that, under this circumstance and provided that assumption (3) is met, ER is not necessary to achieve empirical identification, at least for bivariate probit. Results from scenarios with ER n=1000 n=3000 n=5000 0 ρ=0.1 −3 ρ=0.1 −2 ρ=0.1 −1 −4 −5 n=1000 n=3000 n=5000 0 −2 ρ=0.5 ρ=0.5 θ^22 ρ=0.5 −1 −3 −4 −5 n=1000 n=3000 n=5000 0 ρ=0.9 −3 ρ=0.9 −2 ρ=0.9 −1 −4 −5 Naive Heckman Biv probit Naive Heckman Biv probit Naive Heckman Biv probit methods Figure 2: Boxplots of the parameter estimates of θ22 based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the naive, Heckman-type and bivariate probit estimation approaches and approximately 50% of the total number of observations are available to fit the outcome equation. The true value (dashed lines) is −1.5. ρ and n denote the correlation between the errors of the selection and outcome equations, and the sample size. See Section 3.1 for further details. Average coverage probabilities of the 95% confidence intervals for θ22 and f 21 (z1 ), constructed as described in Section 2.5, were calculated. Ns and Nd were set to 20 and 100. For θ22 , the coverages rates of the Heckman approach were below the nominal level, with values going from 0.93 to 0.80 as ρ was increasing. This was especially due to the bias in the estimates highlighted in Figure 2. Despite the accuracy of the bivariate probit method in estimating θ22 , rates were poor with values ranging from 0.72 to 0.86 as n was increasing. A similar behaviour has already been observed in the literature when employing bivariate 16 Results from scenarios with ER n=1000 n=3000 n=5000 ρ=0.1 1 ρ=0.1 2 ρ=0.1 3 0 −1 n=1000 n=3000 n=5000 2 ρ=0.5 ρ=0.5 ^ ρ ρ=0.5 3 1 0 n=1000 n=3000 −1 n=5000 ρ=0.9 1 ρ=0.9 2 ρ=0.9 3 0 −1 Heckman Biv probit Heckman Biv probit Heckman Biv probit methods Figure 3: Boxplots of the parameter estimates of ρ based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the Heckman-type and bivariate probit estimation approaches. The dashed lines indicate the true values. Further details are given in the caption of Figure 2. Results from scenarios with ER n=1000 n=3000 n=5000 0.8 ρ=0.1 ρ=0.1 0.4 ρ=0.1 0.6 0.2 0.0 n=1000 n=3000 n=5000 ^ RMSE( f 21) 0.8 ρ=0.5 ρ=0.5 ρ=0.5 0.6 0.4 0.2 n=1000 n=3000 0.0 n=5000 0.8 ρ=0.9 ρ=0.9 0.4 ρ=0.9 0.6 0.2 0.0 Naive Heckman Biv probit Naive Heckman Biv probit Naive Heckman Biv probit methods Figure 4: Boxplots of the empirical root mean squared errors (RMSE) of fb21 (z1 ) based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the naive, Heckman-type and bivariate probit estimation approaches. Further details are given in the caption of Figure 2. 17 Results from scenarios with ER 0.0 0.2 n=1000 0.4 0.6 0.8 1.0 n=3000 n=5000 1.0 ρ=0.1 ρ=0.1 0.0 ρ=0.1 0.5 −0.5 −1.0 ^ average f 21(z1) n=1000 n=3000 n=5000 1.0 ρ=0.5 ρ=0.5 ρ=0.5 0.5 0.0 −0.5 −1.0 n=1000 n=3000 n=5000 1.0 ρ=0.9 ρ=0.9 0.0 ρ=0.9 0.5 −0.5 −1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 z1 Figure 5: Average fits calculated using the estimated smooth functions for f 21 (z1 ) based on 1000 replications obtained from the experiments with exclusion restriction (ER), when employing the naive (grey lines), Heckman-type (dashed lines) and bivariate probit (dotted lines) estimation approaches. The true function is represented by the black solid lines. Note that in some cases the curves differ only minimally, hence they can hardly be distinguished. Further details are given in the caption of Figure 2. Results from scenarios with noER n=1000 n=3000 n=5000 4 ρ=0.1 ρ=0.1 0 ρ=0.1 2 −2 −4 n=1000 n=3000 n=5000 4 ρ=0.5 ρ=0.5 θ^22 ρ=0.5 2 0 −2 −4 n=1000 n=3000 n=5000 4 ρ=0.9 ρ=0.9 0 ρ=0.9 2 −2 −4 50% 75% 50% 75% 50% 75% percentage of observations in the selected sample Figure 6: Boxplots of the parameter estimates of θ22 based on 1000 replications obtained from the experiments with no exclusion restriction (noER), when employing the naive (white boxplots), Heckman-type (grey) and bivariate probit (darker grey) estimation approaches, and when approximately 50% and 75% of the total number of observations are available to fit the outcome equation. The true value (dashed lines) is −1.5. Further details are given in the caption of Figure 2. 18 Results from scenarios with noER n=1000 n=3000 n=5000 2.5 ρ=0.1 ρ=0.1 1.5 ρ=0.1 2.0 1.0 0.5 n=1000 n=3000 n=5000 ^ RMSE( f 21) 2.5 ρ=0.5 ρ=0.5 ρ=0.5 2.0 1.5 1.0 0.5 n=1000 n=3000 n=5000 2.5 ρ=0.9 ρ=0.9 1.5 ρ=0.9 2.0 1.0 0.5 50% 75% 50% 75% 50% 75% percentage of observations in the selected sample Figure 7: Boxplots of the empirical root mean squared errors (RMSE) of fb21 (z1 ) based on 1000 replications obtained from the experiments with no exclusion restriction (noER), when employing the naive (white boxplots), Heckman-type (grey) and bivariate probit (darker grey) estimation approaches, and when approximately 50% and 75% of the total number of observations are available to fit the outcome equation. Further details are given in the caption of Figure 2. probit estimation schemes and it is not yet clear why this is the case (Chiburis et al., 2011, and references therein). Here, nonparametric bootstrapped confidence intervals based on 199 replications appeared to provide an effective and simple fix for undercoverage; rates were in the interval (0.91, 0.97). For f 21 (z1 ), the bivariate probit approach yielded close to nominal coverages, with values in the range (0.93, 0.97). Interestingly, for the Heckman method, confidence intervals calculated without employing the correction procedure described in Section 2.5 produced rates which were not as low as expected, with values ranging from to 0.92 to 0.94. Corrected intervals offered marginal improvements, increasing the rates by 0.02 points. A number of alternative design settings were tried out using different nonlinear functions and changing the impact of the parametric components; the resulting conclusions were fairly similar to those reported here. In addition, in the spirit of Little (1985), to gain further insights into the identification problem, we carried out additional simulation experiments where the model errors were generated according to a bivariate Student-t distribution. This was achieved using rmvt() in the R package mvtnorm. Degrees of freedom equal to 3 and 15 were considered. All remaining design and model fitting settings were the same as those described in the previous section. As expected, the results (available upon request) were worse than those presented in this section. However, the use of ER helped to obtain better 19 estimates although still not as good as those produced when assumption (3) is met. These findings were in agreement with those by Marra and Radice (2011a) obtained in a similar context. 4 Application In this section, we illustrate the proposed methods using data on public opinion polls on school integration. As argued, e.g., in Verba et al. (1995), opinion polls may represent the view of the public well. However, in certain situations, they may poorly reflect collective public sentiment because some individuals choose not to respond to some specific questions as they feel that their opinion may be perceived as socially unacceptable. When some individuals are not willing to show their views, polls measuring collective opinion on sensitive topics typically provide biased estimates of preferences in the population a whole. A number of studies have recognized the presence of this phenomenon which can be problematic for policy information (e.g. Groves et al., 2002). A survey of public opinion polls on school integration conducted in the USA is one such study (Berinsky, 1999). 4.1 American national election study We use data from the American National Election Survey (ANES, http://www.electionstudies.o conducted in 1992, on public opinion polls on school integration. As mentioned in the introduction, the main question was whether respondents support government intervention to ensure that black and white children go to the same school. About 700 individuals were first asked if they had an opinion on the integration question (0 = no, 1 = yes) and then what that opinion was (0 = no integration, 1 = yes integration). This gave respondents an opportunity to opt out of the question answering process at an early stage. 64.57% of the individuals chose to answer the integration question. Among these, the proportion of ‘yes’ answers was 46.43%. The dataset also included information on individual demographic and socio-economic characteristics. The variables considered were age (in years), educ (number of years of education), sex (0 = female, 1 = male), race (0 = black, 1 = white), reg (0 = North-Central, 1 = North-East, 2 = South, 3 = West), child (number of children in the household), discpol (0 = never discuss politics, 1 = discuss politics), and perslett which was a binary variable indicating whether the interviewer attempted to convert a respondent who initially refused to participate in the survey. We analyzed the ANES dataset using the naive, Heckman-type and bivariate probit estimation approaches with the same model fitting settings as those described in the simulation study section. The outcome equation included the variables sex, race, reg, and child as 20 parametric components, and smooth functions of age and educ. Following Berinsky (1999), the selection equation included the same components as those of the outcome equation plus discpol and perslett. child was included as a parametric component because it did not have enough unique covariate values to justify the use of a smooth function. 4.2 Results and interpretation Table 1 and Figure 8 report the parametric and smooth function estimates for the outcome equation and, for completeness, also those for the selection equation, when applying the three approaches on the ANES dataset. In the selection equation, the estimated effect of discpol is the strongest and significant at the 5% level. The effect of age is linear for both the bivariate probit and Heckman approaches. The function estimates of educ are linear and nonlinear for bivariate probit and Heckman, respectively, and the bivariate probit intervals do not contain the Heckman estimated curve for a part of the range of covariate values. In the outcome equation, the parameter of race exhibits the strongest effect, and is nearly significant for bivariate probit. The effects of age and educ show different degrees of nonlinearity across the three methods, and for age the bivariate probit intervals do not contain the naive function estimate for a small part of the covariate value range. The estimates of ρ, which are important to ascertain the presence of selection bias, are high and, for bivariate probit, statistically significant. This indicates that the process by which individuals decide to answer the integration question is connected to the process by which they decide what the answer is. The positive sign of ρb suggests that the unobservable factors which lead individuals to take part in the survey also lead them to take a more supportive stance on the integration issue. A comparison between the outcome equation parametric estimates obtained from the naive approach and those from the Heckman and bivariate probit methods indicates that, while none of them change sign, the substantive power of some parameters is altered by the correction for the selection bias present in the question-answering process. For instance, the bivariate probit coefficient of race decreases by about 14% percent relative to that of the naive approach. However, the movement of the coefficient estimates of (Intercept) is more impressive. The naive estimate is positive, while it is negative for Heckman and bivariate probit. This means that once selection effects are accounted for, respondents are more likely to oppose school integration than the naive estimate suggests. This result is consistent with that of Berinsky (1999) who also found that correcting for selection bias decreases the probability of supporting government efforts to integrate schools. The nonlinear impacts of age and educ obtained using the bivariate probit approach are consistent with the interpretations that the effect of age on school integration is negative up 21 to a certain age (56-58 years) after which it becomes positive, and that educ has a positive impact throughout the whole range of covariate values. Different interpretations are obVariable Naive Heckman Bivariate probit - 0.12 (−0.30, 0.54) 0.04 (−0.17, 0.25) −0.23 (−0.56, 0.09) 0.27 (−0.04, 0.58) 0.19 (−0.09, 0.47) 0.32 (0.02, 0.62) 0.08 (−0.03, 0.20) 0.33 (0.08, 0.57) 0.27 (−0.03, 0.57) 0.06 (−0.45, 0.58) 0.05 (−0.19, 0.27) −0.22 (−0.57, 0.10) 0.26 (−0.06, 0.59) 0.19 (−0.12, 0.46) 0.31 (0.01, 0.63) 0.08 (−0.06, 0.21) 0.32 (0.05, 0.58) 0.26 (−0.05, 0.58) 0.26 (−0.20, 0.73) −0.11 (−0.37, 0.15) −0.77 (−1.17, −0.37) 0.44 (0.06, 0.83) 0.36 (0.00, 0.72) 0.40 (0.03, 0.78) 0.11 (−0.03, 0.25) −0.22 (−1.08, 0.62) −0.07 (−0.32, 0.17) −0.71 (−1.21, −0.25) 0.45 (−0.04, 0.98) 0.34 (−0.13, 0.85) 0.43 (−0.04, 0.95) 0.12 (−0.03, 0.30) −0.30 (−1.15, 0.57) −0.07 (−0.47, 0.33) −0.66 (−1.29, −0.05) 0.46 (−0.03, 0.91) 0.35 (−0.10, 0.82) 0.44 (−0.05, 0.89) 0.13 (−0.33, 0.55) - 0.88 (−0.28, 2.03) 0.84 (0.45, 0.99) Selection Eq. (Intercept) sex.male race.white reg.northeast reg.south reg.west child discpol perslett Outcome Eq. (Intercept) sex.male race.white reg.northeast reg.south reg.west child ρ Table 1: Parametric estimates obtained applying the naive, Heckman-type and bivariate probit estimation approaches on the ANES dataset described in Section 4.1. Within parentheses are 95% confidence intervals calculated using the procedure described in Section 2.5 with Ns = 20 and Nd = 100 for Heckman, and nonparametric bootstrap based on 199 replications for bivariate probit. Note that results from the naive approach concern only the outcome equation. tained when looking at the naive and Heckman smooth term estimates. These results could offer new empirical insights if analysed in light of the sociological and economic literature, but this is beyond the scope of this article. To gauge the aggregate effects of the selection bias in the question-answering process, we estimated the mean predicted probabilities of giving a supportive response under the three methods. These were 0.39, 0.15 and 0.09 for the naive, Heckman-type and bivariate probit approaches. So, predicted support for school integration is much lower when selection bias is accounted for, hence indicating that expressed opinion on the school integration question is a poor barometer of underlying support for integrationist policy. Based on our simulation study, estimation results from the bivariate probit approach can be regarded as the most accurate. 22 0.5 0.0 −1.0 −0.5 f(educ) − selection eq. 0.4 0.2 0.0 −0.2 −0.4 f(age) − selection eq. 0.6 Note that the goodness of the results presented in this section relies especially on whether the assumption of normality is met. For the simultaneous equation estimation approach, a possibility would be to employ a score test of bivariate normality whose density of the errors under the alternative hypothesis is based on a type AA bivariate Gram Charlier series with 9 additional parameters (Chiburis, 2010; Lee, 1984). However, it is not clear how this test could be satisfactorily extended to the penalized framework considered here. 20 30 40 50 60 70 80 5 15 0.0 f(educ) − outcome eq. −1.0 −0.5 0.6 0.4 0.2 0.0 −0.2 −0.4 f(age) − outcome eq. 10 educ 0.5 age 20 30 40 50 60 70 80 5 age 10 15 educ Figure 8: Smooth function estimates obtained applying the naive (grey lines), Heckman-type (dotted lines) and bivariate probit (solid lines) estimation approaches on the ANES dataset described in Section 4.1. The results are reported on the scale of the linear predictors of the selection and outcome equations. The dashed lines represent 95% Bayesian ‘confidence’ intervals calculated from the bivariate probit estimates. The ‘rug plot’, at the bottom of each graph, shows the covariate values. Note that: in the left-top plot the Heckman and bivariate probit smooth estimates can not be distinguished; to avoid clutter, corrected confidence intervals for the Heckman approach have not been reported; results from the naive approach concern only the outcome equation. Due to the identifiability constraints, the estimated curves pass through zero. 5 Discussion In this article, we introduced two statistical methods for the (separate or simultaneous) estimation of two binary regression models involving semiparametric predictors in the presence of sample selection bias. The problems of identification and inference have also been discussed. The methods have been illustrated using data on public opinion polls on school integration. The application of the proposed approaches revealed that predicted support for school integration is much lower when selection bias is accounted for. This finding could 23 not have been obtained otherwise. The approaches presented here have a great potential of applicability in several research areas including economics and epidemiology. The results of our simulation study suggested that the bivariate probit and Heckmantype estimation approaches are effective for flexibly estimating covariate impacts in the presence of selection bias, with the former being more accurate and reliable. We also shed light on the issue of empirical identification. The main finding was that, in the absence of exclusion restriction and provided that the assumption of normality is met, if the proportion of selected observations is fairly high then empirical identification is equally achieved, especially when using the bivariate probit estimation method. Since maximum likelihood estimation schemes are typically sensitive to model error misspecification, extensions of our proposals allowing for different joint distributions of the model errors seem feasible adopting either a copula approach (e.g. Nelsen, 2006) or a nonparametric distribution function estimation framework (e.g. Gallant and Nychka, 1987; Klein and Spady, 1993). To accommodate more complex data structures arising, for instance, in longitudinal studies, future research will also focus on extending the methods presented in this article to allow for random effects in the linear predictors. Appendix The expressions for the gradient vector and Fisher information matrix that are referred to in Section 2.3 are given here. The elements of the gradient vector are defined as g1 g2 g3∗ y1i y2i y1i (1 − y2i ) (1 − y1i ) = ∑ φ(η1i ) Φ( Ai ) + {1 − Φ( Ai )} − X1i , p11i p10i p0i i =1 n y1i y2i y1i (1 − y2i ) = ∑ φ(η2i ) Φ( Bi ) − X2i , p11i p10i i =1 n y1i y2i y1i (1 − y2i ) ∂ρ = ∑ φ2 (η1i , η2i ; ρ) − , p11i p10i ∂ρ∗ i =1 n where Xvi = x+ norvi , Bvi for v = 1, 2, φ2 is the density function of a standardized bivariate p 2 ∗ ∗ ∗ mal with correlation ρ, ∂ρ/∂ρ = 4 exp (2ρ )/ {exp (2ρ ) + 1} , Ai = (η2i − ρη1i )/ (1 − ρ2 ), p and Bi = (η1i − ρη2i )/ (1 − ρ2 ). The remaining quantities are defined in Section 2.3. The 24 elements of the Fisher information matrix are given as I 11 = I 22 = ∗ I 33 = I 12 = ∗ I 13 = ∗ I 23 = 1 1 2 1 ∑ φ(η1i ) Φ( Ai ) p11i + {1 − Φ( Ai )} p10i + p0i XT1i X1i , i =1 n 1 1 2 ∑ {φ(η2i )Φ( Bi )} p11i + p10i XT2i X2i , i =1 n 1 ∂ρ 2 1 2 ∑ φ2 (η1i , η2i ; ρ) p11i + p10i ∂ρ∗ , i =1 n Φ( Ai )Φ( Bi ) {1 − Φ( Ai )} Φ( Bi ) T − X1i X2i , ∑ φ(η1i )φ(η2i ) p p 11i 10i i =1 n 1 1 ∂ρ ∑ φ(η1i )φ2 (η1i , η2i ; ρ) Φ( Ai ) p11i − {1 − Φ( Ai )} p10i ∂ρ∗ X1i , i =1 n 1 ∂ρ 1 X2i . ∑ φ(η2i )φ2 (η1i , η2i ; ρ) Φ( Bi ) p11i + p10i ∂ρ∗ i =1 n 2 2 References [1] Bärnighausen, T., Bor, J., Wandira-Kazibwe, S., and Canning, D. (2011). Correcting HIV Prevalence Estimates for Survey Nonparticipation Using Heckman-type Selection Models. Epidemiology, 22, 27–35. [2] Berinsky, A. J. (1999). The Two Faces of Public Opinion. American Journal of Political Science, 43, 1209–1230. [3] Chib, S., Greenberg, E., and Jeliazkov, I. (2009). Estimation of Semiparametric Models in the Presence of Endogeneity and Sample Selection. Journal of Computational and Graphical Statistics, 18, 321–348. [4] Chiburis, R. C. (2010). Score Tests of Normality in Bivariate Probit Models: Comment. Working paper. Available at https://webspace.utexas.edu/rcc485/www/research.html. [5] Chiburis, R. C., Das, J., and Lokshin, M. (2011). A Practical Comparison of the Bivariate Probit and Linear IV Estimators. World Bank Policy Research, Paper 5601. [6] Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403. [7] Cuddeback, G., Wilson, E., Orme, J. G., and Combs-Orme, T. (2004). Detecting and Statistically Correcting Sample Selection Bias. Journal of Social Service Research, 30, 19– 33. 25 [8] Dubin, J. A., and Rivers, D. (1990). Selection Bias in Linear Regression, Logit and Probit Models. Sociological Methods and Research, 18, 360–390. [9] Gallant, A. R., and Nychka, D. W. (1987). Semi-Nonparametric Maximum Likelihood Estimation. Econometrica, 55, 363–390. [10] Gentle, J. E. (2003). Random Number Generation and Monte Carlo Methods. London: Springer-Verlag. [11] Greene, W. H. (2007). Econometric Analysis. New York: Prentice Hall. [12] Groves, R. M., Dillman, D. A., Eltinge, J. L., and Little, R. J. A. (2002). Survey Nonresponse. New York: Wiley. [13] Gu, C. (1992). Cross validating non-Gaussian data. Journal of Computational and Graphical Statistics, 1, 169–179. [14] Gu, C. (2002). Smoothing Spline ANOVA Models. London: Springer-Verlag. [15] Hastie, T., and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society Series B, 55, 757–796. [16] Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–162. [17] Klein, R., and Spady, R. (1993). An Efficient Semiparametric Estimator of the Binary Choice Model. Econometrica, 61, 387–421. [18] Lee, L. F. (1984). Tests for the bivariate normal distribution in econometric models with selectivity. Econometrica, 52, 843–863. [19] Little, R. (1985). A note about models for selectivity bias. Econometrica, 53, 1469–1474. [20] Marra, G., and Radice, R. (2011a). Estimation of a semiparametric recursive bivariate probit model in the presence of endogeneity. Canadian Journal of Statistics, 39, 259–279. [21] Marra, G., and Radice, R. (2011b). SemiParBIVProbit: parametric Bivariate Probit Modelling. R package version http://CRAN.R-project.org/package=SemiParBIVProbit. Semi2.0-4.1, [22] Marra, G., and Radice, R. (in press). A Flexible Instrumental Variable Approach. Statistical Modelling, available at www.ucl.ac.uk/statistics/research/reports. [23] Marra, G., and Wood, S. N. (in press). Coverage Properties of Confidence Intervals for Generalized Additive Model Components. Scandinavian Journal of Statistics, available at www.ucl.ac.uk/statistics/research/reports. 26 [24] Montmarquettea, C., Mahseredjiana, S., and Houle, R. (2001). The determinants of university dropouts: a bivariate probability model with sample selection. Economics of Education Review, 20, 475–484. [25] Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer. [26] Nocedal, J., and Wright, S. J. (1999). Numerical Optimization. New York: Springer-Verlag. [27] Nychka, D. (1988). Bayesian Confidence Intervals for Smoothing Splines. Journal of the American Statistical Association, 83, 1134–1143. [28] Puhani, P. A. (2000). The Heckman correction for sample selection and its critique. Journal of Economic Surveys, 14, 53–68. [29] R Development Core Team (2011). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. [30] Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. London: Cambridge University Press. [31] Van de Ven, W. P. M. M., and Van Praag, B. M. S. (1981). The demand for deductibles in private health insurance: a probit model with sample selection. Journal of Econometrics, 17, 229–252. [32] Verba, S., Schlozman, K. L., and Brady, H. E. (1995). Voice and Equality: Civic Voluntarism in American Politics. Cambridge: Harvard University Press. [33] Wahba, G. (1983). Bayesian ‘Confidence Intervals’ for the Cross-Validated Smoothing Spline. Journal of the Royal Statistical Society Series B, 45, 133–150. [34] Wiesenfarth, M., and Kneib, T. (2010). Bayesian geoadditive sample selection models. Journal of the Royal Statistical Society Series C, 59, 381–404. [35] Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–686. [36] Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. London: Chapman & Hall. 27