Sample Selection Regression Models (Ch. 17)
Microeconometrics, Michael Gerfin, Fall 2008

Until now we have always assumed that a random sample is available. Now we cover cases where no random sample is available.

There are two distinct cases:
- the sample was collected/selected according to some value of y
- the sample is selected by the behaviour of the population under consideration (self-selection)

We focus on the second case.

Examples

Family wealth function
Effect of a pension plan on wealth accumulation:
$y = \beta_0 + \beta_1\,\mathit{plan} + \beta_2 x + u$
The sample only contains people with wealth less than 100'000
→ selection on the basis of y

Wage function
Estimation of a wage function for the working-age population. But wages are only observed for workers
→ y is only observable for a subsample which is defined by another variable (working)
→ self-selection: the decision to work depends on the wage

When Can Sample Selection Be Ignored?

Simply put: if selection is based on an exogenous right-hand-side variable, it does not affect the consistency of OLS.
However, the precision of the estimates decreases (standard errors get larger).

Example

set obs 10000
g x = uniform()
g u = invnorm(uniform())
g y = 1 + x + u

. reg y x if x > 0.25

      Source |       SS       df       MS              Number of obs =    7508
-------------+------------------------------           F(  1,  7506) =  282.79
       Model |  289.307451     1  289.307451           Prob > F      =  0.0000
    Residual |  7679.02682  7506   1.0230518           R-squared     =  0.0363
-------------+------------------------------           Adj R-squared =  0.0362
       Total |  7968.33427  7507  1.06145388           Root MSE      =  1.0115

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    .911556   .0542066    16.82   0.000     .8052959    1.017816
       _cons |   1.051498   .0361101    29.12   0.000     .9807125    1.122284
------------------------------------------------------------------------------

. reg y x if x > 0.50

      Source |       SS       df       MS              Number of obs =    5069
-------------+------------------------------           F(  1,  5067) =   95.72
       Model |  97.2936697     1  97.2936697           Prob > F      =  0.0000
    Residual |   5150.4905  5067   1.0164773           R-squared     =  0.0185
-------------+------------------------------           Adj R-squared =  0.0183
       Total |  5247.78416  5068  1.03547438           Root MSE      =  1.0082

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9652273   .0986589     9.78   0.000     .7718133    1.158641
       _cons |    1.00676   .0755266    13.33   0.000     .8586954    1.154825
------------------------------------------------------------------------------

. reg y x if x > 0.75

      Source |       SS       df       MS              Number of obs =    2583
-------------+------------------------------           F(  1,  2581) =   11.09
       Model |  10.8707227     1  10.8707227           Prob > F      =  0.0009
    Residual |  2530.00844  2581  .980243489           R-squared     =  0.0043
-------------+------------------------------           Adj R-squared =  0.0039
       Total |  2540.87917  2582  .984074038           Root MSE      =  .99007

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .8980481   .2696729     3.33   0.001     .3692508    1.426845
       _cons |   1.069994   .2364745     4.52   0.000     .6062953    1.533693
------------------------------------------------------------------------------

In all three subsamples the slope estimate stays close to the true value of 1, while its standard error rises from .054 to .270 as the selected range of x shrinks.

Self Selection

Sample selection is not the result of the sample design but of decisions made by members of the population (self-selection).
Setting: exogenous explanatory variables.
Classic example: labour force participation and wages.
We want to know $E(w_i \mid x_i)$ for a person randomly drawn from the population ($w$: wage), but $w$ is only observed for working people.

Model of labour supply: the decision to work is based on the difference between the market wage and the reservation wage.
Assume that
$w_i = \exp(x_{i1}\beta_1 + u_{i1})$
$w_i^r = \exp(x_{i2}\beta_2 + \gamma_2 a_i + u_{i2})$
with $(u_{i1}, u_{i2})$ independent of $(x_{i1}, x_{i2}, a_i)$. $x_{i1}$ contains productivity characteristics and $x_{i2}$ contains characteristics that determine the marginal utility of leisure and income (there may be an overlap).
$\log w_i = x_{i1}\beta_1 + u_{i1}$

But the wage is only observed if $w_i > w_i^r$, i.e.
$\log w_i - \log w_i^r = x_{i1}\beta_1 - x_{i2}\beta_2 - \gamma_2 a_i + u_{i1} - u_{i2} \equiv x_i\delta_2 + v_{i2} > 0$
Problem: $w_i^r$ is not observed and depends on $x_{i2}$ and $u_{i2}$
→ we need another estimation procedure

Notation: drop the subscript $i$; $y_1 \equiv \log w$ and $y_2$ is a binary indicator.
(1) $y_1 = x_1\beta_1 + u_1$
(2) $y_2 = 1[x\delta_2 + v_2 > 0]$
(2) is a probit if $v_2$ is normally distributed.

Assumptions 17.1:
(a) $(x, y_2)$ are always observed, $y_1$ is only observed if $y_2 = 1$;
(b) $(u_1, v_2)$ is independent of $x$ with zero mean;
(c) $v_2 \sim N(0,1)$; and
(d) $E(u_1 \mid v_2) = \gamma_1 v_2$

(a) describes the selection process; (b) is a strong exogeneity assumption; (c) is necessary to derive a conditional expectation given the selected sample; and (d) requires linearity of the regression of $u_1$ on $v_2$.
(d) always holds if $(u_1, v_2)$ is bivariate normal (but it is not necessary to assume that $u_1$ is normally distributed).

This model is also called the Tobit Type 2 model. It is important to recognise that we are not dealing with a corner solution for $y_1$
→ $y_1$ must not be set to zero for estimation (it is missing)

Estimation of Selection Model

Let $(y_1, y_2, x, u_1, v_2)$ denote a random draw from the population. Given the selection rule we can hope to estimate $E(y_1 \mid x, y_2 = 1)$ and $P(y_2 = 1 \mid x)$.

How does $E(y_1 \mid x, y_2 = 1)$ depend on $\beta_1$? First,
(3) $E(y_1 \mid x, v_2) = x_1\beta_1 + E(u_1 \mid x, v_2) = x_1\beta_1 + E(u_1 \mid v_2) = x_1\beta_1 + \gamma_1 v_2$
where the second equality follows because $(u_1, v_2)$ is independent of $x$.

If $\gamma_1 = 0$ → no selection problem!

What if $\gamma_1 \neq 0$? Because $y_2$ is a function of $(x, v_2)$, iterated expectations applied to (3) give
$E(y_1 \mid x, y_2) = E[E(y_1 \mid x, v_2) \mid x, y_2] = x_1\beta_1 + \gamma_1 E(v_2 \mid x, y_2) = x_1\beta_1 + \gamma_1 h(x, y_2)$
where $h(x, y_2) = E(v_2 \mid x, y_2)$.

If we knew $h(x, y_2)$, we could estimate $\beta_1$ and $\gamma_1$ from the regression of $y_1$ on $x_1$ and $h(x, y_2)$ (in the selected sample).
In the selected sample $y_2 = 1$ → we only have to find $h(x, 1)$:
$h(x, 1) = E(v_2 \mid v_2 > -x\delta_2) = \lambda(x\delta_2)$, where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills ratio.

This implies
(4) $E(y_1 \mid x, y_2 = 1) = x_1\beta_1 + \gamma_1\lambda(x\delta_2)$

From (4) it is obvious that OLS of $y_1$ on $x_1$ in the selected sample omits the term $\lambda(x\delta_2)$
→ omitted variable bias

(4) also shows a way to estimate $\beta_1$ consistently. Heckman (1979) has shown that $\beta_1$ and $\gamma_1$ can be estimated consistently in the selected sample by regressing $y_1$ on $x_1$ and $\lambda(x\delta_2)$. But $\delta_2$ is unknown and must be estimated in a first step (using a probit).

Heckman Estimator

Step 1: Estimate the probit model
(5) $P(y_{i2} = 1 \mid x_i) = \Phi(x_i\delta_2)$
using all observations.
Obtain $\hat\lambda_{i2} \equiv \lambda(x_i\hat\delta_2)$.

Step 2: Estimate $\hat\beta_1$ and $\hat\gamma_1$ using OLS in the selected sample
(6) $y_{i1} = x_{i1}\beta_1 + \gamma_1\hat\lambda_{i2} + u_i$

This estimator is consistent and asymptotically normally distributed.

Simple test for selection bias: under H0 (no selection bias), $\gamma_1 = 0$ in (6) → t-test for $\gamma_1$.
IMPORTANT: this test is only valid if the model is correctly specified (distributional assumptions).

If $\gamma_1 \neq 0$ the standard errors of $\hat\beta_1$ must be corrected
- for heteroskedasticity
- because $\delta_2$ has been estimated in the first step
Stata does this for you if you use the command heckman.

Data generation for selection problem

set obs 10000
g x = uniform()
g z = uniform()
matrix c = (4, 1 \ 1, 1)          /* covariance matrix of u1, v2 */
drawnorm u1 v2, n(10000) cov(c)   /* correlated error terms */
g y1 = 1 + x + u1
g y2star = 0.5 + 0.5*x + 0.5*z + v2
g y2 = y2star>0.6                 /* selection indicator */
replace y1 = . if y2==0           /* set y1 to missing if not selected */

With this covariance matrix, $\gamma_1 = \mathrm{Cov}(u_1, v_2)/\mathrm{Var}(v_2) = 1$; the true intercept and slope of the $y_1$ equation are both 1.

OLS

. reg y1 x

      Source |       SS       df       MS              Number of obs =    6531
  .
  .
------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .7547057   .0804123     9.39   0.000     .5970712    .9123402
       _cons |   1.676609   .0482139    34.77   0.000     1.582094    1.771124

Heckman Two Step Estimator

. heckman y1 x, select(y2 = x z) twostep

Heckman selection model -- two-step estimates   Number of obs      =     10000
(regression model with sample selection)        Censored obs       =      3469
                                                Uncensored obs     =      6531

                                                Wald chi2(2)       =    215.55
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1           |
           x |   1.076284   .1234465     8.72   0.000     .8343331    1.318235
       _cons |   .8734129   .2295979     3.80   0.000     .4234092    1.323416
-------------+----------------------------------------------------------------
y2           |
           x |   .5332074   .0451385    11.81   0.000     .4447375    .6216772
           z |   .5002837   .0452514    11.06   0.000     .4115926    .5889748
       _cons |  -.1151823   .0340753    -3.38   0.001    -.1819686   -.0483959
-------------+----------------------------------------------------------------
mills        |
      lambda |   1.146936   .3192996     3.59   0.000     .5211204    1.772752
-------------+----------------------------------------------------------------
         rho |    0.55818
       sigma |  2.0547643
      lambda |  1.1469362   .3192996
------------------------------------------------------------------------------

Theoretically, it is not necessary that $x_1$ is a strict subset of $x$
→ $\beta_1$ is identified even if $x = x_1$ (because $\lambda$ is a nonlinear function of $x$)
However, in practice $\lambda$ is often almost a linear function of $x$
→ severe multicollinearity
→ very imprecise estimates
⇒ Strong recommendation: you should have at least one element in $x$ that is not in $x_1$ (exclusion restriction)

[Figure: Relation between xβ and λ. Horizontal axis: xb, from -4 to 4; vertical axis: lambda, from 0 to 3.]

Simulation continued

Here z is also included in the outcome equation, so there is no exclusion restriction; all standard errors in the $y_1$ equation become very large.

. heckman y1 x z, select(y2 = x z) twostep

Heckman selection model -- two-step estimates   Number of obs      =     10000
(regression model with sample selection)        Censored obs       =      3469
                                                Uncensored obs     =      6531

                                                Wald chi2(4)       =    333.21
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1           |
           x |   .4166927   .9797537     0.43   0.671    -1.503589    2.336975
           z |   -.623502   .9195849    -0.68   0.498    -2.425855    1.178851
       _cons |   2.828842   2.889762     0.98   0.328    -2.834987     8.49267
-------------+----------------------------------------------------------------
y2           |
           x |   .5332074   .0451385    11.81   0.000     .4447375    .6216772
           z |   .5002837   .0452514    11.06   0.000     .4115926    .5889748
       _cons |  -.1151823   .0340753    -3.38   0.001    -.1819686   -.0483959
-------------+----------------------------------------------------------------
mills        |
      lambda |  -1.171361   3.428838    -0.34   0.733    -7.891761    5.549039
-------------+----------------------------------------------------------------
         rho |   -0.56807
       sigma |  2.0620111
      lambda | -1.1713608   3.428838
------------------------------------------------------------------------------

Generate Table 17.1

                        OLS                 Heckman 2 step
wage equation
educ                    0.107 (7.60)**      0.109 (7.03)**
exper                   0.042 (3.15)**      0.044 (2.70)**
expersq                -0.001 (2.06)*      -0.001 (1.96)
mills:lambda                                0.032 (0.24)
Constant               -0.522 (2.63)**     -0.578 (1.90)
selection equation
inlf:educ                                   0.131 (5.18)**
inlf:exper                                  0.123 (6.59)**
inlf:expersq                               -0.002 (3.15)**
inlf:age                                   -0.053 (6.23)**
inlf:kidslt6                               -0.868 (7.33)**
inlf:kidsge6                                0.036 (0.83)
inlf:nwifeinc                              -0.012 (2.48)*
inlf:Constant                               0.270 (0.53)
lambda                                      .032 (0.24)
sigma                                       .663
Observations            428                 753
R-squared               0.16
Absolute value of t statistics in parentheses
* significant at 5%; ** significant at 1%

Second example: data from the Swiss expenditure survey 1998
. reg lnwage edu* age* foreign city

      Source |       SS       df       MS              Number of obs =    3414
-------------+------------------------------           F(  6,  3407) =  108.52
       Model |  123.740716     6  20.6234526           Prob > F      =  0.0000
    Residual |  647.467676  3407   .19004041           R-squared     =  0.1605
-------------+------------------------------           Adj R-squared =  0.1590
       Total |  771.208392  3413  .225962025           Root MSE      =  .43594

------------------------------------------------------------------------------
      lnwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edu_l |  -.2510206   .0249717   -10.05   0.000    -.2999816   -.2020595
       edu_h |   .2751949   .0171604    16.04   0.000     .2415491    .3088407
         age |   .0448688   .0058037     7.73   0.000     .0334897     .056248
        age2 |  -.0049732   .0007318    -6.80   0.000    -.0064081   -.0035384
     foreign |  -.0843714   .0222799    -3.79   0.000    -.1280547   -.0406882
        city |   .0520015   .0166524     3.12   0.002     .0193518    .0846511
       _cons |   2.293195   .1104736    20.76   0.000     2.076594    2.509796

Heckman selection model -- two-step estimates   Number of obs      =      4800
(regression model with sample selection)        Censored obs       =      1386
                                                Uncensored obs     =      3414

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
       edu_l |  -.2454693    .025113    -9.77   0.000    -.2946899   -.1962487
       edu_h |   .2690675   .0174234    15.44   0.000     .2349183    .3032168
         age |   .0464142   .0058536     7.93   0.000     .0349413     .057887
        age2 |  -.0051244   .0007356    -6.97   0.000    -.0065662   -.0036827
     foreign |  -.0847045   .0222888    -3.80   0.000    -.1283897   -.0410193
        city |    .047618   .0167895     2.84   0.005     .0147112    .0805248
       _cons |   2.284004   .1106135    20.65   0.000     2.067206    2.500803
-------------+----------------------------------------------------------------
ilf          |
       edu_l |  -.2140455   .0631543    -3.39   0.001    -.3378256   -.0902654
       edu_h |   .2711794   .0531317     5.10   0.000     .1670431    .3753157
         age |   .1491991   .0189459     7.87   0.000     .1120657    .1863324
        age2 |  -.0208317   .0023369    -8.91   0.000    -.0254119   -.0162514
     foreign |   .0905475   .0649259     1.39   0.163    -.0367049       .2178
        city |   .0058761   .0455646     0.13   0.897    -.0834288     .095181
       inc_0 |   -.239319   .0354921    -6.74   0.000    -.3088821   -.1697558
        kids |   -.480615   .0231543   -20.76   0.000    -.5259967   -.4352334
     married |  -.4891778   .0878587    -5.57   0.000    -.6613777   -.3169779
       _cons |  -.7185278   .3602989    -1.99   0.046    -1.424701    -.012355
-------------+----------------------------------------------------------------
mills        |
      lambda |   -.056072   .0271142    -2.07   0.039    -.1092149    -.002929
-------------+----------------------------------------------------------------
         rho |   -0.12842
       sigma |  .43663547
      lambda | -.05607195   .0271142

Predictions after estimation of selection models

Often selection models are used to predict the dependent variable for the observations not in the selected subsample.
Example: expected wage of nonworkers

Correct prediction:
$E(y_{i1} \mid x_i) = x_{i1}\hat\beta_1$
and NOT
$E(y_{i1} \mid x_i, y_{i2} = 1) = x_{i1}\hat\beta_1 + \hat\gamma_1\hat\lambda_{i2} \neq E(y_{i1} \mid x_i)$

Stata

heckman lnlohn ...            /* selection model for ln(wage) */
predict lnlohn_pred, e(.,.)   /* prediction of ln(wage) */
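
Two-step estimation and prediction by hand (sketch)

The two steps of the Heckman estimator and the prediction rule above can be traced by hand on simulated data. The following do-file is a minimal sketch, not the lecture's own code: it assumes the same data-generating process as the simulation in these slides, uses current Stata function names (runiform(), normalden(), normal()), and the seed and the variable names xd2, lambda, y1_uncond and y1_cond are arbitrary choices made for this illustration.

```stata
* Sketch only: Heckman two-step by hand (assumed DGP, illustrative names)
clear
set seed 12345                         // arbitrary seed (assumption)
set obs 10000
generate x = runiform()
generate z = runiform()                // appears only in the selection equation
matrix C = (4, 1 \ 1, 1)               // Var(u1)=4, Cov(u1,v2)=1, Var(v2)=1
drawnorm u1 v2, cov(C)                 // correlated errors, so gamma1 = 1
generate y2 = (0.5 + 0.5*x + 0.5*z + v2 > 0.6)
generate y1 = 1 + x + u1 if y2 == 1    // y1 is missing when not selected

* Step 1: probit on ALL observations, then the inverse Mills ratio
probit y2 x z
predict xd2, xb                        // estimated index x*delta2
generate lambda = normalden(xd2) / normal(xd2)

* Step 2: OLS on the selected sample with lambda as additional regressor
regress y1 x lambda if y2 == 1

* Predictions for every observation, selected or not
generate y1_uncond = _b[_cons] + _b[x]*x            // E(y1 | x)
generate y1_cond   = y1_uncond + _b[lambda]*lambda  // E(y1 | x, y2 = 1)
```

The point estimates from Step 2 should match those of heckman y1 x, select(y2 = x z) twostep, but the standard errors reported by regress are not corrected for the estimated first step, so inference should rely on heckman. After heckman, predict with the xb option returns the unconditional prediction and predict with the ycond option returns the prediction conditional on selection.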