ECONOMETRICS II (ECO 2401)
Victor Aguirregabiria
Winter 2015

TOPIC 2: BINARY CHOICE MODELS

1. Introduction
2. BCM with cross-sectional data
   2.1. Threshold model
   2.2. Interpretation in terms of utility maximization
   2.3. Probit and Logit models
   2.4. Testing hypotheses on parameters
   2.5. Measures of goodness of fit
   2.6. Partial effects and average partial effects in BCM
   2.7. BCM as a regression model
   2.8. Misspecification of binary choice models
   2.9. Specification tests based on generalized residuals
   2.10. Semiparametric methods
   2.11. BCM with endogenous regressors
3. BCM with panel data
   3.1. Static models: a) fixed effects estimators; b) random effects estimators
   3.2. Dynamic models: a) fixed effects estimators; b) random effects estimators

1. INTRODUCTION

Econometric discrete choice models, or qualitative response models, are models where the dependent variable takes a discrete and finite set of values. Many economic decisions involve choices among discrete alternatives.

(1) Labor economics: labor force participation; unionization; occupational choice; migration; retirement; job matching (hiring/firing workers); strikes.
(2) Population and family economics: number of children; contraceptive choice; marriage; divorce.
(3) Industrial organization: entry and exit in a market; product choice in a differentiated product market; purchase of durable goods; firms' choice of location.
(4) Education economics: the decision to go to college.
(5) Political economy: voting.

Some classifications of Discrete Choice Models (DCM) that have relevant implications for the econometric analysis are:
a) Type of data: cross-section / panel
b) Number of choice alternatives: binomial or binary / multinomial
c) Specification assumptions: parametric / semiparametric
d) Dynamic / static
e) Single-agent / games

2. BCM WITH CROSS-SECTIONAL DATA

We are interested in the occurrence or non-occurrence of a certain event (e.g., "an individual is unemployed", "a worker is unionized", "a firm invests in R&D") and how this event depends on some explanatory variables X. Define the binary variable Y such that:

   Y = 1 if the event occurs
   Y = 0 if it does not

Define the Conditional Choice Probability (CCP): P(x) ≡ Pr(Y = 1 | X = x). Note that E(Y | X = x) = P(x). A BCM is a parametric model for the conditional expectation E(Y | X = x), which is also the CCP P(x).

Reduced-form model for the CCP. In some empirical applications the researcher may be interested in the CCP P(x) just as a predictor of Y given X = x, not in a causal-effect interpretation of the model. In that case the researcher can simply choose a flexible specification of P(x). For instance,

   P(x) = F(x'β)

where F(.) is a known function that maps the index x'β into the probability space [0, 1], e.g., F(.) is a CDF.

Model with explicit specification of unobservables. Many times we are interested in the causal effect of X on Y. Then it is useful to consider a model that relates Y with X and with the variables that are unobservable to the researcher, ε, and that makes assumptions about the relationship between X and ε:

   Y = g(X, β, ε)

Since Y is a discrete variable, it should respond in a discrete way (i.e., not continuously) to changes in (X, β, ε). That is, g(.) should be a function that maps the continuous variables in β, ε, or X into the binary set {0, 1}. In principle, this condition rules out the linear regression model (i.e., Y = X'β + ε) as a valid model for a binary dependent variable. We discuss this point in detail in Sections 2.5 and 2.6 below.
2.1. THRESHOLD MODELS

A popular specification of g(X, β, ε) that appears naturally in many economic applications is the threshold function:

   Y = g(X, β, ε) = 1 if Y*(X, β, ε) ≥ 0
                    0 if Y*(X, β, ε) < 0

Y*(X, β, ε) is a real-valued function called the latent variable. Note that setting the threshold at 0 is an innocuous normalization because Y*(X, β, ε) always includes a constant term. A common specification of the latent threshold function is:

   Y*(X, β, ε) = X'β − ε

where β is a K × 1 vector of parameters. Therefore, the model is:

   Y = 1 if ε ≤ X'β
       0 if ε > X'β

We can also represent the model using the indicator function 1{A}, where 1{A} = 1 if A is true and 1{A} = 0 if A is false:

   Y = 1{ε ≤ X'β}

When ε is independent of X and has CDF F(.), we have that:

   P(x) = Pr(Y* ≥ 0 | X = x) = Pr(ε ≤ x'β) = F(x'β)

The relationship between the conditional probability P(x) and the index x'β depends on the distribution of ε:

   If ε is N(0,1):    F(x'β) = Φ(x'β)
   If ε is Logistic:  F(x'β) = exp(x'β) / [1 + exp(x'β)]

Interpretation of the parameters

We know that in a linear regression model, Y = X'β + ε, when ε is (mean) independent of X we have E(Y|X=x) = x'β and:

   ∂E(Y|X=x)/∂x_k = β_k                          if X_k is continuous
   E(Y|X=x+δ_k) − E(Y|X=x) = β_k                 if X_k is discrete

with δ_k a vector of 0s except at position k, where we have a 1. In a BCM, we have that:

   ∂E(Y|X=x)/∂x_k = β_k f(x'β)                   if X_k is continuous
   E(Y|X=x+δ_k) − E(Y|X=x) = F(x'β + β_k) − F(x'β)   if X_k is discrete

DCM and models with non-additive unobservables

Discrete choice models belong to a class of nonlinear econometric models where the unobservables (error term) enter the model in a non-additive form: Y = g(X, ε), where g(.,.) is a function that is not additive in ε, e.g., g(X, ε + c) ≠ g(X, ε) + c. In DCMs this non-additivity is a natural implication of the discrete nature of the dependent variable. In this class of models, the "Average Partial Effect" is different from the "Partial Effect at the Average". A linear-regression approach typically provides estimates of the "Partial Effect at the Average". We will discuss why for some empirical questions we are interested in estimating the "Average Partial Effect" and not the "Partial Effect at the Average".

Interpretation of the parameters (2)

The main difference between the LRM and the BCM in the interpretation of ∂E(Y|X=x)/∂x_k is that in the LRM the partial effects are constant across x, while in the BCM they depend on the individual's characteristics, and more specifically on the individual's propensity or probability of Y = 1 given X. Taking into account that P(x) = F(x'β) and ∂E(Y|X=x)/∂x_k = β_k f(x'β), we have that:

   As x'β → −∞:  P(x) → 0 and ∂E(Y|X=x)/∂x_k → 0
   As x'β → +∞:  P(x) → 1 and ∂E(Y|X=x)/∂x_k → 0
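As an illustration of these formulas (this code is not part of the original notes; the coefficient vector and the evaluation point below are hypothetical), the following sketch evaluates the CCP F(x'β) and the continuous-regressor partial effects β_k f(x'β) for the probit and logit specifications:

```python
import numpy as np
from scipy.stats import norm, logistic

# Hypothetical coefficients (constant, x1, x2) and a hypothetical evaluation point
beta = np.array([-0.5, 1.0, 0.3])
x = np.array([1.0, 0.2, -1.0])
index = x @ beta

# Probit: F = standard normal CDF, f = standard normal PDF
p_probit = norm.cdf(index)
pe_probit = beta * norm.pdf(index)          # beta_k * f(x'beta)

# Logit: F = logistic CDF, and f = F * (1 - F)
p_logit = logistic.cdf(index)
pe_logit = beta * p_logit * (1 - p_logit)

print(p_probit, pe_probit)
print(p_logit, pe_logit)
```

Note how the partial effects are largest when the index x'β is close to zero (P(x) close to 1/2) and vanish as P(x) approaches 0 or 1, as stated above.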
2.2. INTERPRETATION IN TERMS OF UTILITY MAXIMIZATION

Example 1: Consider an individual who has to decide whether to purchase a certain durable good (e.g., an iPhone). Suppose that the purchased quantity is either one or zero, so Y ∈ {0, 1} is the indicator of purchasing the durable good. The utility function is U(C, Y), where C represents consumption of the composite good. More specifically:

   U(C, Y) = u(C) + Y {Z'γ₁ − ε}

where γ₁ is a vector of parameters, u(.) is an increasing function, Z is a vector of characteristics observable to the econometrician, such as age and education, and ε is a zero-mean random variable that is individual-specific.

The individual's decision problem is to maximize U(C, Y) subject to the budget constraint C + P·Y ≤ M, where P is the price of the good and M is the individual's disposable income. We can represent this decision problem as the maximization of {U(M, 0), U(M − P, 1)}. Therefore, the optimal choice is Y = 1 iff U(M − P, 1) > U(M, 0), or:

   Y = 1  ⟺  u(M − P) + Z'γ₁ − ε > u(M)

For instance, suppose that u(C) = α₁C + α₂C², with α₁ ≥ 0 and α₂ ≤ 0. Then,

   Y = 1  ⟺  −α₁P − α₂P[2M − P] + Z'γ₁ − ε > 0  ⟺  X'β − ε > 0

where X' = (−P, −P[2M − P], Z') and β' = (α₁, α₂, γ₁'). Conditional on {Z, P, M}, the probability that an individual purchases the product is:

   Pr(Y = 1 | Z, P, M) = F( [−P, −P(2M − P), Z']β )

Example 2: Y = indicator of the event "individual goes to college". X = {HS grades; family income; parents' education; scholarships}. Let U₀ and U₁ be the utilities associated with choosing Y = 0 (no college) and Y = 1 (college), respectively. Consider the following specification of these utility functions:

   U₀ = X'β₀ + ε₀
   U₁ = X'β₁ + ε₁

If the individual maximizes her utility, then:

   {Y = 1}  ⟺  {U₁ ≥ U₀}  ⟺  ε ≤ X'β

where β = β₁ − β₀ and ε = ε₀ − ε₁.

2.3. PROBIT AND LOGIT MODELS

To complete the parametric specification of the model we should make an assumption about the distribution of the disturbance ε_i. The most common assumptions in the literature are:

   Probit model: ε ~ N(0,1), so F(x'β) = Φ(x'β)
   Logit model:  ε ~ Logistic, so F(x'β) = exp(x'β) / [1 + exp(x'β)]

Maximum likelihood estimation. Let {y_i, x_i : i = 1, 2, ..., n} be a random sample of (Y, X). The likelihood function is:

   L(β) = Pr(y₁, y₂, ..., y_n | x₁, x₂, ..., x_n) = ∏_{i=1}^{n} Pr(y_i | x_i)
        = ∏_{y_i = 1} F(x_i'β) · ∏_{y_i = 0} [1 − F(x_i'β)]

The log-likelihood function:

   l(β) = Σ_{i=1}^{n} l_i(β) = Σ_{i=1}^{n} y_i ln F(x_i'β) + (1 − y_i) ln[1 − F(x_i'β)]

This likelihood is continuous and twice differentiable if F(.) is. For the Probit and Logit models, the likelihood is also globally concave. The MLE is the value β̂ that solves the likelihood equations:

   ∂l(β̂)/∂β = Σ_{i=1}^{n} ∂l_i(β̂)/∂β
             = Σ_{i=1}^{n} [ x_i f(x_i'β̂) / (F(x_i'β̂)[1 − F(x_i'β̂)]) ] [ y_i − F(x_i'β̂) ] = 0

∂l_i(β̂)/∂β is called the score of observation i. For the Probit model the likelihood equations are:

   Σ_{i=1}^{n} [ x_i φ(x_i'β̂) / (Φ(x_i'β̂)[1 − Φ(x_i'β̂)]) ] [ y_i − Φ(x_i'β̂) ] = 0

And for the Logit model the likelihood equations are:

   Σ_{i=1}^{n} x_i [ y_i − exp(x_i'β̂)/(1 + exp(x_i'β̂)) ] = 0

because for the logistic distribution f(ε) = F(ε)[1 − F(ε)].

Computation of the MLE

There is no closed-form expression for the MLE. We have to calculate β̂ numerically using an iterative algorithm. The most common iterative algorithms to obtain the MLE are Newton-Raphson and BHHH. Given that the likelihood is globally concave, both algorithms converge to the unique maximum regardless of the initial value used to initialize the algorithm.

Newton-Raphson iterations:

   β̂_{K+1} = β̂_K − [ Σ_{i=1}^{n} ∂²l_i/∂β∂β' |_{β̂_K} ]⁻¹ [ Σ_{i=1}^{n} ∂l_i/∂β |_{β̂_K} ]

BHHH iterations:

   β̂_{K+1} = β̂_K + [ Σ_{i=1}^{n} (∂l_i/∂β)(∂l_i/∂β') |_{β̂_K} ]⁻¹ [ Σ_{i=1}^{n} ∂l_i/∂β |_{β̂_K} ]

Note that, at the true value of β, in the population:

   plim_{n→∞} (1/n) Σ_{i=1}^{n} [ −∂²l_i(β)/∂β∂β' ] = plim_{n→∞} (1/n) Σ_{i=1}^{n} [ ∂l_i(β)/∂β · ∂l_i(β)/∂β' ]

i.e., Fisher's information matrix.

Asymptotic properties of the MLE

If the model is correctly specified, √n (β̂ − β) →d N(0, V), where

   V = E[ ∂l_i(β)/∂β · ∂l_i(β)/∂β' ]⁻¹ = E[ f(X'β)² / (F(X'β)[1 − F(X'β)]) · XX' ]⁻¹

A consistent estimate of V is obtained by substituting E_X(.) by the sample mean and β by β̂, such that:

   Var̂(β̂) = V̂/n = [ Σ_{i=1}^{n} f(x_i'β̂)² / (F(x_i'β̂)[1 − F(x_i'β̂)]) · x_i x_i' ]⁻¹
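A minimal sketch of the Newton-Raphson iteration for the logit MLE (not the course's code; the simulated data-generating process at the bottom is purely illustrative):

```python
import numpy as np

def logit_mle_newton(y, X, tol=1e-8, max_iter=100):
    """Newton-Raphson for the logit MLE. y is a 0/1 array, X includes a constant."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))           # F(x'beta)
        score = X.T @ (y - p)                           # logit likelihood equations
        hessian = -X.T @ (X * (p * (1 - p))[:, None])   # second derivatives
        step = np.linalg.solve(hessian, score)
        beta = beta - step                              # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative simulated sample: Y = 1{x'beta - eps >= 0} with logistic eps
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = ((X @ np.array([0.5, 1.0]) - rng.logistic(size=n)) >= 0).astype(float)
print(logit_mle_newton(y, X))   # should be close to (0.5, 1.0)
```

Because the logit log-likelihood is globally concave, the iteration converges to the unique maximizer from any starting value, as noted above.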
2.4. TESTING HYPOTHESES ON PARAMETERS AND REPORTING ESTIMATION RESULTS

Wald, LM and LR tests as usual for MLE.

Reporting estimation results: for some applications the estimated partial effects can be more informative than the estimates of the parameters. The partial effect can be evaluated at the mean value of the regressors, x̄. The estimated partial effect for explanatory variable k (evaluated at the sample mean x̄) is:

   PÊ_k = β̂_k f(x̄'β̂)                    if X_k is continuous
   PÊ_k = F(x̄'β̂ + β̂_k) − F(x̄'β̂)         if X_k is discrete

However, in some applications we may be more interested in Average Partial Effects than in Partial Effects evaluated at the mean. We come back to this point in Sections 2.5 and 2.6 below.

Example: Default in the payment of college student loans, Knapp and Seaks (REStat, 1992). Sample: 1834 college students in Pennsylvania who got a student loan and left college in the academic year 1984-1985.

   Variable                          β̂ (s.e.)         Partial effect (in % points)
   Graduation dummy                  -1.090 (0.121)    -9.9
   Parents' income (in thousand $)   -0.018 (0.004)    -0.2
   Loan amount (in thousand $)        0.026 (0.020)    +0.3
   College cost (in thousand $)       0.085 (0.061)    +0.9

2.5. MEASURES OF GOODNESS OF FIT

Standard residuals. In a BCM, after the estimation of β, we cannot obtain residuals for the unobservable ε. Note that the "standard" residual ε̂_i is such that y_i = 1{x_i'β̂ − ε̂_i ≥ 0}. If y_i = 1, we know that ε̂_i ≤ x_i'β̂; if y_i = 0, we know that ε̂_i > x_i'β̂; but we do not know the exact value of ε̂_i. This is a relevant issue for the identification of the distribution of ε and for the distributional assumptions. However, it is not an issue for obtaining goodness-of-fit measures from the estimated model.

Define the following fitted values: P̂_i = F(x_i'β̂), and ŷ_i = 0 if P̂_i < 0.5, ŷ_i = 1 if P̂_i ≥ 0.5. Common measures of goodness of fit are:

   - Log-likelihood function: l(β̂)
   - Number of wrong predictions: Σ_{i=1}^{n} (y_i − ŷ_i)²
   - Pseudo-R-squares: the square of the correlation between y_i and P̂_i; the square of the correlation between y_i and ŷ_i
   - Weighted RSS: Σ_{i=1}^{n} (y_i − P̂_i)² / [P̂_i (1 − P̂_i)]
   - Likelihood Ratio Index (or McFadden's R-square): LRI = 1 − l(β̂)/l₀, where l₀ is the log-likelihood when all parameters except the constant term are zero. It is simple to prove that l₀ = n₀ ln(n₀) + n₁ ln(n₁) − n ln(n), where n₀ = # obs with y_i = 0 and n₁ = # obs with y_i = 1.
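A short sketch (not from the notes) computing these fit measures for a vector of fitted probabilities P̂_i = F(x_i'β̂), whatever model produced them:

```python
import numpy as np

def fit_measures(y, p_hat):
    """Goodness-of-fit measures for a fitted binary choice model."""
    n = len(y)
    n1 = y.sum()
    n0 = n - n1
    loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    l0 = n0 * np.log(n0) + n1 * np.log(n1) - n * np.log(n)   # constant-only log-likelihood
    lri = 1 - loglik / l0                                    # McFadden's R-square
    y_pred = (p_hat >= 0.5).astype(float)
    wrong = int(np.sum((y - y_pred) ** 2))                   # number of wrong predictions
    pseudo_r2 = np.corrcoef(y, p_hat)[0, 1] ** 2             # squared corr(y_i, P_hat_i)
    wrss = np.sum((y - p_hat) ** 2 / (p_hat * (1 - p_hat)))  # weighted RSS
    return {"loglik": loglik, "LRI": lri, "wrong": wrong,
            "pseudo_R2": pseudo_r2, "weighted_RSS": wrss}
```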
BCM as a nonlinear regression model: generalized residuals

   E(Y | X = x) = F(x'β)
   var(Y | X = x) = F(x'β) [1 − F(x'β)]

Therefore, we can write Y = F(X'β) + u, where E(u | X = x) = 0 and var(u | X = x) = F(x'β)[1 − F(x'β)]. Given an estimate β̂, we can obtain the (generalized) residuals û_i = y_i − P̂_i = y_i − F(x_i'β̂).

2.6. PARTIAL EFFECTS AND AVERAGE PARTIAL EFFECTS IN BCM

Is it reasonable (good econometric practice) to use a linear regression model when the dependent variable is binary? Under what conditions, or for which types of empirical questions? To answer these questions we first have to define the concepts of "Partial Effect", "Average Partial Effect", and "Partial Effect at the Average".

In econometrics we are typically interested in ceteris paribus effects: how Y changes when a variable X_k changes, keeping constant the rest of the variables. This type of ceteris paribus effect is called the Partial Effect of X_k on Y. Define PE_k(X₀, ε₀) as the Partial Effect given that the initial value of (X, ε) is (X₀, ε₀) and we change X to X₀ + δ_k, where δ_k is a vector of zeros at every position except at position k, where we have a 1.

In the general model, we have that:

   PE_k(X₀, ε₀) = g(X₀ + δ_k, β, ε₀) − g(X₀, β, ε₀)

The conditional Average Partial Effect APE_k(X₀) is defined as PE_k(X₀, ε₀) averaged over the distribution of the unobservables ε but conditional on X₀:

   APE_k(X₀) = ∫ PE_k(X₀, ε) dF(ε)

The unconditional Average Partial Effect APE_k is defined as PE_k(X₀, ε₀) averaged both over the distribution of the unobservables ε and over the distribution of the observables X:

   APE_k = ∫∫ PE_k(X, ε) dF(ε) dF_X(X)

It is important to distinguish between the average partial effect APE_k and the Partial Effect at the average individual. The latter is:

   PE_k(X₀ = E(X), ε₀ = E(ε)) = g(E(X) + δ_k, β, E(ε)) − g(E(X), β, E(ε))

In a linear regression model (LRM), individuals are assumed to be homogeneous in terms of partial effects: g(X, β, ε) = X'β + ε, and therefore:

   PE_k(X₀, ε₀) = APE_k(X₀) = APE_k = β_k

More precisely, in a LRM (without random coefficients) we can allow for interactions between observable variables such that partial effects may vary across individuals according to observable characteristics. However, partial effects do not depend on unobservables. LRMs with random coefficients allow for unobserved heterogeneity in partial effects, and therefore in those models the Average Partial Effect is not equal to the Partial Effect at the average.

The BCM is a class of models where the difference between the Average Partial Effect and the Partial Effect at the average appears naturally as a result of the binary nature of the dependent variable. In a BCM, we have that:

   PE_k(X₀, ε₀) = 1{ε₀ ≤ [X₀ + δ_k]'β} − 1{ε₀ ≤ X₀'β}

where 1{.} is the indicator function. Partial effects at the individual level depend on the individual's X and ε. This is an important property of the BCM, and it derives naturally from the discrete nature of the dependent variable. In a BCM, the APEs are:

   APE_k(X₀) = F([X₀ + δ_k]'β) − F(X₀'β)

The marginal partial effect is similar to the partial effect but for the case in which δ_k represents a marginal change in a continuous variable X_k. In that case:

   AMPE_k(X₀) = β_k f(X₀'β)

where f(.) is the PDF of ε. The AMPE at the average individual is:

   AMPE_k(X₀ = E(X)) = β_k f(E(X)'β)

In a BCM, partial effects vary over individuals and, in general, the APE can be very different from the Partial Effect at the average. This is a property that distinguishes the BCM from the linear regression model. When our main interest is to estimate the PE for the average individual, then we can use a LRM for the binary variable Y: in large samples, the estimates will not be very different from the estimate of the same effect from a BCM. However, most of the time in economics we are interested in the APE and not in the PE for the average individual. If that is the case, using a LRM for a binary Y is a very bad choice, because that model imposes the very implausible (even impossible) restriction that PEs do not depend on the unobservables.
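The following sketch (not from the notes; probit is assumed for concreteness and the coefficients are hypothetical) computes the average marginal partial effect of regressor k and the corresponding effect at the average individual; with substantial dispersion in x_i'β the two can differ considerably:

```python
import numpy as np
from scipy.stats import norm

def ape_vs_pea(X, beta, k):
    """Average marginal partial effect of regressor k vs. the effect at the average x (probit)."""
    ape = np.mean(beta[k] * norm.pdf(X @ beta))        # averaged over the sample distribution of X
    pea = beta[k] * norm.pdf(X.mean(axis=0) @ beta)    # evaluated at the average individual
    return ape, pea

# Illustrative example with hypothetical coefficients and widely dispersed x
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(10000), 3.0 * rng.normal(size=10000)])
print(ape_vs_pea(X, np.array([0.5, 1.0]), k=1))
```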
Example: School attendance of children from poor families

Suppose that we are interested in the determinants of elementary school attendance of kids (Y) from poor families:

   Y = kids in the family attend school (regularly)

We are interested in evaluating the effects of a public program that tries to encourage school attendance by providing a subsidy that is linked to school attendance, e.g., the PROGRESA program in Mexico since 1997. We have data on {Y, S, X}, where S is the amount of the subsidy, with S = 0 for families in the control group and S = $M for families in the experimental group, and X contains family socioeconomic characteristics.

Example: School attendance of children from poor families (2)

We estimate the BCM:

   Y = 1{ε ≤ α S + X'β}

Let α̂ and β̂ be the estimated parameters, and P̂_i = F(α̂ s_i + x_i'β̂) the estimated probability of school attendance for family i. The Partial Effect of receiving the subsidy for individual i is:

   PE(x_i, ε_i) = 1{ε_i ≤ α M + x_i'β} − 1{ε_i ≤ x_i'β}

If α M ≥ 0, this effect can only be zero or one.

Example: School attendance of children from poor families (3)

Even if we knew the true values of α and β, we could not obtain PE_i because we cannot estimate ε_i. However, we can estimate the average partial effect of the subsidy for a family with characteristics x_i, APE(x_i):

   APÊ(x_i) = F(α̂ M + x_i'β̂) − F(x_i'β̂) ≈ f(x_i'β̂) α̂ M

And the estimated increase in the number of kids attending school because of the program is:

   Δ Kids Attending School = Σ_{i=1}^{n} 1{s_i > 0} APÊ(x_i)

We could also estimate the "counterfactual" effect of the hypothetical application of the policy to a population of H families:

   Δ Kids Attending School = H [ (1/n) Σ_{i=1}^{n} APÊ(x_i) ] = H × (mean of the APÊ(x_i))

Example: School attendance of children from poor families (4)

The effect of the policy for the average family is:

   APÊ(x̄) = F(α̂ M + x̄'β̂) − F(x̄'β̂) ≈ f(x̄'β̂) α̂ M

If we use this partial effect for the average household to extrapolate the effect of the policy in the actual experiment, we get:

   APÊ(x̄) × [ Σ_{i=1}^{n} 1{s_i > 0} ]

And if we make this same extrapolation for the hypothetical application of the policy to a population of H families, the predicted effect is H × APÊ(x̄).

Example: School attendance of children from poor families (5)

In general, the mean of the APÊ(x_i) and APÊ(x̄) can be quite different. The magnitude of this difference depends on the variance or dispersion of x_i'β̂ (i.e., on the level of cross-family heterogeneity in the propensity to send kids to school), and on the magnitude of α̂ M.

But even in the hypothetical case that the average APÊ and APÊ(x̄) were similar, we may be interested in estimating APE(x) for different groups of families according to x. For instance, suppose that APÊ(x_i) is very close to zero for almost every family i, but is very large for families with very low income, who represent only 1% of the population. This information is very useful to target the policy.
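A minimal sketch of these calculations (not from the notes; a probit link and all function and variable names are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

def program_effect(X, treated, alpha_hat, beta_hat, M):
    """Predicted increase in attendance: family-level APEs vs. extrapolating the average family."""
    ape_i = norm.cdf(alpha_hat * M + X @ beta_hat) - norm.cdf(X @ beta_hat)   # APE(x_i)
    effect_from_apes = np.sum(ape_i[treated])                 # sum of APE(x_i) over treated families
    x_bar = X.mean(axis=0)
    ape_avg_family = norm.cdf(alpha_hat * M + x_bar @ beta_hat) - norm.cdf(x_bar @ beta_hat)
    effect_from_avg_family = ape_avg_family * treated.sum()   # extrapolation using APE(x_bar)
    return effect_from_apes, effect_from_avg_family
```

Here `treated` is assumed to be a boolean array indicating the families with s_i > 0.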
2.7. BCM AS A REGRESSION MODEL

A regression model is a statistical model that specifies how the conditional mean E(Y|X) depends on X, i.e., it specifies a function m(X, β) for E(Y|X):

   E(Y|X) = m(X, β)

This implies that Y = m(X, β) + u, where u is a disturbance or unobservable variable that, by construction, is mean independent of X, i.e., E(u|X) = 0. When m(X, β) = X'β, we have a linear regression model. When m(X, β) is nonlinear in the parameters, we have a nonlinear regression model, e.g., m(X, β) = exp{X'β}, or m(X, β) = β₁ [X₁^β₂ + X₁^β₃]^β₄.

When Y is binary, we have that:

   E(Y|X) = 1 · Pr(Y = 1|X) + 0 · Pr(Y = 0|X) = Pr(Y = 1|X)

Therefore, a BCM for Pr(Y = 1|X) is also a regression model for E(Y|X). According to the threshold BCM, E(Y|X) = F(X'β), and therefore

   Y = F(X'β) + u

where, by construction, u is mean independent of X. Therefore, in this context, we can justify using a linear regression model (LRM) for the binary dependent variable Y as a first-order (linear) approximation to the function F(X'β) for X around its mean E(X):

   F(X'β) ≈ F(E(X)'β) + (X − E(X))'β · f(E(X)'β)

Let X = (1, X₁), where 1 represents the constant term and X₁ the rest of the regressors. Then X'β = β₀ + X₁'β₁ and E(X)'β = β₀ + E(X₁)'β₁, and (X − E(X))'β = (X₁ − E(X₁))'β₁. Substituting these expressions into the approximation above, we have:

   F(X'β) ≈ α₀ + X₁'α₁

where α₀ = F(E(X)'β) − f(E(X)'β) E(X₁)'β₁ and α₁ = f(E(X)'β) β₁. Note that α₁ = β₁ f(E(X)'β) is the AMPE for the average individual. Therefore, we can use a linear regression model for the binary variable Y. This type of model is called the Linear Probability Model:

   Y = X'α + u

and the slopes have a clear interpretation as (marginal) partial effects for the average individual. OLS estimation of this LRM provides consistent estimates of α.

The main limitation of the Linear Probability Model is that it does not provide any information about the APE for individuals other than the average individual, or about the unconditional APE (which depends on the conditional APEs of all the individuals). This limitation is particularly serious in BCMs where the APEs F([X + δ]'β) − F(X'β) vary significantly over X. In that case, the APEs of a significant group of individuals, and the unconditional APE, can be very different from the APE for the average individual.

2.8. MISSPECIFICATION OF BCM

Remember that in the linear regression model a necessary and sufficient condition for consistency of the OLS estimator is that E(ε|x) = 0. That is, heteroscedasticity, autocorrelation and non-normality of the error term do not affect the consistency of the OLS estimator as long as E(ε|x) = 0. However, in the context of discrete choice models, the consistency of the MLE depends crucially on our assumptions about ε_i. If ε_i is heteroscedastic, or if it has a CDF that is not the one that we have assumed, then the MLE is no longer consistent. The reason is that our assumption about ε_i affects not only second and higher moments of y_i, but also its conditional mean.

Suppose that the true model is such that ε_i is iid with CDF F. Then:

   True model:       y_i = F(x_i'β) + u_i,    where E(u_i | x_i) = 0

Instead, we assume that ε_i is iid N(0,1) [Probit]. Then:

   Estimated model:  y_i = Φ(x_i'β) + u*_i,   where u*_i = u_i + F(x_i'β) − Φ(x_i'β)

It is clear that, if F ≠ Φ, then E(u*_i | x) ≠ 0, and the MLE using Φ(x_i'β) is inconsistent.

Suppose that the researcher is not particularly interested in the estimates of β but only in the estimated probabilities P(x_i): for instance, a car insurance company that is only interested in the probability of an accident for an individual with characteristics x_i. In this case, the main issue is the consistency of P̂(x_i), not the consistency of β̂. One might think that misspecification of F is not a big issue in this case. However, that is not true. Misspecification of F(.) can generate important biases both in the estimator of β and in the estimator of P(.).

Horowitz's Monte Carlo experiment: Suppose that the true probabilities are P(x_i) and the researcher estimates a logit model. How close are the P̂_Logit(x) to the true P(x)? Horowitz (Handbook of Statistics, 1993) performed a Monte Carlo study to answer this question. He considered different cases for P(x): (1) homoscedastic probit; (2) Student-t; (3) uniform; (4) heteroscedastic logit; (5) heteroscedastic probit; (6) bimodal distribution of ε. The main results are: (a) the errors are small when the true distribution of ε is unimodal and homoscedastic; (b) the errors can be very large when the true distribution of ε is bimodal or heteroscedastic.

Summary of Horowitz's Monte Carlo study:

   True model                E_x |P(x) − P̂_Logit(x)|    max_x |P(x) − P̂_Logit(x)|
   Homosced. and unimodal    0.01                       0.02
   Bimodal                   0.05                       0.20
   Heteroscedastic           0.10                       0.30
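A small Monte Carlo in the spirit of Horowitz's experiment (a sketch, not his exact design): the true CCP is a heteroscedastic probit, but we fit a standard logit and look at how far the fitted probabilities are from the true ones.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20000
x1 = rng.normal(size=n)
X = sm.add_constant(x1)

# True model: heteroscedastic probit, with the error standard deviation depending on x1
index = 0.5 + 1.0 * x1
sigma = np.exp(0.75 * x1)
p_true = norm.cdf(index / sigma)
y = (rng.uniform(size=n) < p_true).astype(float)

# Misspecified estimator: standard (homoscedastic) logit
p_logit = sm.Logit(y, X).fit(disp=0).predict(X)

print("mean abs error:", np.mean(np.abs(p_true - p_logit)))
print("max abs error: ", np.max(np.abs(p_true - p_logit)))
```

With a design like this the maximum error is typically large, consistent with the heteroscedastic row of the table above.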
2.9. SPECIFICATION TESTS BASED ON GENERALIZED RESIDUALS

In the LRM, we typically test assumptions on the error term ε by using the residuals ε̂_i = y_i − x_i'β̂. In the BCM we cannot obtain residuals for ε_i, but we can get residuals for the error term u_i in the regression-like representation of the BCM:

   y_i = F(x_i'β) + u_i

We can get the residuals û_i = y_i − F(x_i'β̂) and the standardized residuals:

   û_i = [ y_i − F(x_i'β̂) ] / sqrt( F(x_i'β̂) (1 − F(x_i'β̂)) )

Under the null hypothesis that the model is correctly specified, we have that

   u_i / sqrt( F(x_i'β) (1 − F(x_i'β)) )

should be independent of x_i with zero mean. By testing the independence of the residuals û_i and x_i, we test the correct specification of the model.

GENERAL PURPOSE SPECIFICATION TEST

Given the standardized residuals û_i and the estimated CCPs P̂_i = F(x_i'β̂), we run the OLS regression:

   û_i = γ₀ + γ₁ P̂_i + γ₂ (P̂_i)² + ... + γ_q (P̂_i)^q + e_i

Define the statistic LM = n·R², where R² is the R-square coefficient from the previous regression. Under the null hypothesis (the model is correctly specified), LM is asymptotically distributed as χ²_{q−1}.

TEST OF HETEROSCEDASTICITY IN BCM

Consider the BCM Y = 1{X'β − ε ≥ 0}, where ε | X ~ N(0, exp{X̃'α}) and X̃ is the vector X without the constant term. We are interested in testing the null hypothesis of homoscedasticity, which is equivalent to testing H₀: α = 0.

A possible approach is to estimate β and α by MLE. That approach is computationally demanding because the log-likelihood of this model is no longer globally concave in (β, α). Instead, we can estimate the standard probit model under the null hypothesis of α = 0 and use an LM test for the null. The LM statistic is:

   LM = [ ∂ log L(β̂, α=0)/∂(β,α) ]'  [ Var( ∂ log L(β̂, α=0)/∂(β,α) ) ]⁻¹  [ ∂ log L(β̂, α=0)/∂(β,α) ]

Under H₀, LM is asymptotically distributed as χ²_{dim(α)}. Davidson and MacKinnon (JE, 1984) show that this LM statistic can be obtained as the output of a simple auxiliary regression: LM = n·R², where R² is the R-square coefficient from the regression

   û_i = x*_i'γ₁ + z_i'γ₂ + e_i

where:

   û_i  = [ y_i − Φ(x_i'β̂) ] / sqrt( Φ(x_i'β̂) (1 − Φ(x_i'β̂)) )
   x*_i = x_i φ(x_i'β̂) / sqrt( Φ(x_i'β̂) (1 − Φ(x_i'β̂)) )
   z_i  = x̃_i (x_i'β̂) φ(x_i'β̂) / sqrt( Φ(x_i'β̂) (1 − Φ(x_i'β̂)) )
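A sketch of the auxiliary-regression version of this LM test (not the authors' code; the nR² form follows the notes, with the R² computed as the uncentered R² of a regression without a constant):

```python
import numpy as np
from scipy.stats import norm, chi2

def lm_het_test(y, X, beta_hat, X_tilde):
    """LM test of H0: homoscedasticity in a probit, via the auxiliary regression.
    X includes the constant; X_tilde is X without the constant."""
    xb = X @ beta_hat
    F, f = norm.cdf(xb), norm.pdf(xb)
    s = np.sqrt(F * (1 - F))
    u = (y - F) / s                              # standardized generalized residuals
    Xstar = X * (f / s)[:, None]                 # regressors associated with beta
    Z = X_tilde * (xb * f / s)[:, None]          # regressors associated with alpha
    W = np.column_stack([Xstar, Z])
    coef, *_ = np.linalg.lstsq(W, u, rcond=None)
    fitted = W @ coef
    r2 = (fitted @ fitted) / (u @ u)             # uncentered R-square
    lm = len(y) * r2
    pval = 1 - chi2.cdf(lm, df=X_tilde.shape[1])
    return lm, pval
```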
2.10. ADAPTIVE (SEMIPARAMETRIC) ESTIMATION OF BCM

The consistency of the ML estimator of probit or logit models relies on the correct specification of the probability distribution of the unobservable ε. That is, the consistency of the MLE in BC models is not robust to misspecification of the CDF of ε. This property contrasts with the consistency of OLS in the linear regression model: OLS is the MLE when ε is normally distributed, but it is also consistent when ε is not normal, and even asymptotically efficient (if ε is homoscedastic and not serially correlated). In econometrics, this type of robust estimators are called ADAPTIVE ESTIMATORS.

Are there adaptive estimators of the BCM which are robust to different properties of the unobserved error term, such as heteroscedasticity, serial correlation, or the particular functional form of the distribution of the error? We consider four adaptive estimators of the BCM: (1) the Least Absolute Deviations (LAD) estimator; (2) Manski's Maximum Score Estimator; (3) Horowitz's Smoothed Maximum Score Estimator; (4) the Klein and Spady estimator.

1. Least Absolute Deviations (LAD) estimation

LAD is an estimation method that is adaptive for a very general class of econometric models. Remember that Least Squares (LS) estimation (linear or nonlinear) is based on the following property of the mean. Let μ ≡ E(Y). Then,

   μ = arg min_c E[(Y − c)²]

The LS estimator is based on the sample counterpart of this property of the mean:

   μ̂ = arg min_c (1/n) Σ_{i=1}^{n} (y_i − c)²

We have that μ̂ →p μ.

Least Absolute Deviations (LAD) (3)

Similarly, LAD estimation is based on the following property of the median. If m ≡ median(Y), then

   m = arg min_c E(|Y − c|)

The LAD estimator is based on the sample counterpart of this property of the median:

   m̂ = arg min_c (1/n) Σ_{i=1}^{n} |y_i − c|

We have that m̂ →p m.

Least Absolute Deviations (LAD) (4)

Consider the general econometric model Y = f(X, β, ε), where f is a known function, X is a vector of observable explanatory variables, ε is an unobservable variable, and β is a vector of parameters. The assumptions that define this class of models are: (A1) the function f is known and monotonic in ε; (A2) median(ε | X) = 0.

Least Absolute Deviations (LAD) (5)

Under assumptions (A1) and (A2), we have that median(Y | X) = f(X, β, 0). Based on this condition, the true value of β satisfies:

   β = arg min_c E( |Y − f(X, c, 0)| )

The LAD estimator is based on the sample counterpart of this property:

   β̂_LAD = arg min_β Σ_{i=1}^{n} |y_i − f(x_i, β, 0)|

The LAD estimator minimizes the sum of absolute deviations of y_i with respect to its median f(x_i, β, 0).

Least Absolute Deviations (LAD) (6)

Under assumptions (A1) and (A2), the LAD estimator is consistent. Therefore, LAD is a general type of semiparametric estimator for nonlinear econometric models. If the function f is continuously differentiable in β, then the LAD estimator is: (a) root-n consistent; (b) asymptotically normal; (c) it has a simple expression for its asymptotic variance that is simple to estimate; (d) we can use standard gradient optimization methods to compute β̂_LAD. If the function f is NOT continuous in β, LAD is still consistent but, in general, properties (a) to (d) do not hold.

2. Manski's Maximum Score Estimator

Consider the BCM Y = 1{X'β − ε ≥ 0} where we assume that median(ε | X) = 0. That is, ε is median independent of X, and the median is zero. Other than median(ε | X) = 0, no other assumption is made on the distribution of ε. If we knew β, a "natural" predictor of Y would be 1{X'β ≥ 0} because: (a) the support of 1{X'β ≥ 0} is the same as the support of Y, i.e., {0, 1}; (b) median(Y | X) = 1{X'β ≥ 0}.

Maximum Score Estimator (MSE) (2)

We have a correct prediction when either Y = 1 and X'β ≥ 0, or Y = 0 and X'β < 0. Given a sample {y_i, x_i : i = 1, 2, ..., n}, consider the following sample criterion function:

   S(β) = Σ_{i=1}^{n} y_i 1{x_i'β ≥ 0} + (1 − y_i) 1{x_i'β < 0}

This criterion function gives the number of correct predictions for a given value of β. We call it the Score function.

Maximum Score Estimator (MSE) (3)

The Maximum Score Estimator (MSE) is the value of β that maximizes the score function:

   β̂_MSE = arg max_β S(β)

Under median(ε | X) = 0, the MSE is a consistent estimator of β. Therefore, the MSE is an estimator that is robust to heteroscedasticity, serial correlation, and to any form of the distribution of ε. In that sense, the MSE has similar properties to OLS in a linear regression model under the mean independence assumption E(ε | X) = 0.
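Because S(β) is a step function, the MSE is usually computed by search rather than by gradient methods. The following sketch (not from the notes) does this for a model with a constant and one regressor whose coefficient is normalized to +1 (a scale normalization is always needed for the MSE), exploiting the fact that S only changes at the points α = −x_i:

```python
import numpy as np

def max_score_intercept(y, x):
    """Maximum score estimator of the intercept in Y = 1{alpha + x - eps >= 0},
    with the slope on x normalized to +1."""
    grid = np.concatenate(([np.min(-x) - 1.0], np.sort(-x)))   # one point per step of S(alpha)
    scores = [np.sum(y * (a + x >= 0) + (1 - y) * (a + x < 0)) for a in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
```

As discussed below, the maximizer is generally a whole interval in finite samples; this sketch simply reports one point of that set.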
Equivalence of LAD and MSE

Before we discuss other properties of the MSE, it is interesting to show that, for the BCM, the MSE and the LAD are identical estimators. Let LAD(β) be the LAD criterion function and let S(β) be the score function. We now show that LAD(β) = n − S(β), and therefore minimizing LAD(β) is equivalent to maximizing S(β), such that the MSE is the LAD estimator:

   LAD(β) = Σ_{i=1}^{n} | y_i − 1{x_i'β ≥ 0} |
          = Σ_{i=1}^{n} 1{y_i = 1 and x_i'β < 0} + 1{y_i = 0 and x_i'β ≥ 0}
          = Σ_{i=1}^{n} y_i 1{x_i'β < 0} + (1 − y_i) 1{x_i'β ≥ 0}
          = Σ_{i=1}^{n} y_i [1 − 1{x_i'β ≥ 0}] + (1 − y_i) [1 − 1{x_i'β < 0}]
          = n − Σ_{i=1}^{n} [ y_i 1{x_i'β ≥ 0} + (1 − y_i) 1{x_i'β < 0} ]
          = n − S(β)

Properties of the MSE

Note that the score function S(β) is discontinuous and not differentiable in β: 1{x_i'β ≥ 0} is a step function, and this implies that S(β) is also a step function.

Example. Consider the model Y = 1{α + X ≥ 0}. We have a sample of n = 4 observations, (x_i, y_i) = {(x₁, 0), (x₂, 0), (x₃, 1), (x₄, 1)}, with 0 < x₁ < x₂ < x₃ < x₄. The score function is:

   S(α) = 1{α + x₁ < 0} + 1{α + x₂ < 0} + 1{α + x₃ ≥ 0} + 1{α + x₄ ≥ 0}

or

   S(α) = 2 if α < −x₄
          3 if α ∈ [−x₄, −x₃)
          4 if α ∈ [−x₃, −x₂)
          3 if α ∈ [−x₂, −x₁)
          2 if α ≥ −x₁

There is not a single value of α that maximizes S(α) but a whole interval, [−x₃, −x₂). As the sample size increases, the length of this interval gets smaller.

Properties of the MSE (cont.)

The discontinuity of S(β) does not affect the consistency of the MSE, but it has several important implications: (a) we cannot use standard gradient-based methods to search for the MSE; (b) if the sample size is not large enough, there may not be a unique value of β that maximizes S(β) — the maximizer of S(β) can be a whole (compact) set in the space of β; (c) the MSE is not asymptotically normal; it has a non-standard distribution; (d) the rate of convergence of the MSE to the true β is lower than root-n: it is n^(1/3).

3. Horowitz's Smoothed Maximum Score Estimator

Limitations (a) to (d) of the MSE motivate the use of the smoothed MSE proposed by Horowitz. First, note that the score function S(β) can be written as follows:

   S(β) = Σ_{i=1}^{n} y_i 1{x_i'β ≥ 0} + (1 − y_i) 1{x_i'β < 0}
        = Σ_{i=1}^{n} y_i 1{x_i'β ≥ 0} + (1 − y_i) [1 − 1{x_i'β ≥ 0}]
        = Σ_{i=1}^{n} (1 − y_i) + Σ_{i=1}^{n} (2y_i − 1) 1{x_i'β ≥ 0}

Smoothed Maximum Score Estimator (2)

Therefore, maximizing S(β) is equivalent to maximizing Σ_{i=1}^{n} (2y_i − 1) 1{x_i'β ≥ 0}, and:

   β̂_MSE = arg max_β Σ_{i=1}^{n} (2y_i − 1) 1{x_i'β ≥ 0}

Limitations (a)-(d) of the MSE are due to the fact that 1{x_i'β ≥ 0} is discontinuous in β. Horowitz proposes to replace 1{x_i'β ≥ 0} by the function Φ(x_i'β / b_n), where Φ(.) is the CDF of the standard normal and b_n is a bandwidth parameter such that: (1) b_n → 0 as n → ∞; and (2) n·b_n → ∞ as n → ∞. That is, b_n goes to zero but more slowly than 1/n.

Smoothed Maximum Score Estimator (3)

The Smoothed MSE is defined as:

   β̂_SMSE = arg max_β Σ_{i=1}^{n} (2y_i − 1) Φ(x_i'β / b_n)

As n → ∞ and b_n → 0, the function Φ(x_i'β / b_n) converges to 1{x_i'β ≥ 0}, and the criterion function converges to the score function. This implies the consistency of β̂_SMSE. Under the additional condition that n·b_n → ∞ as n → ∞, this estimator is asymptotically normal, n^r-consistent with r ∈ [2/5, 1/2], and it can be computed using standard gradient search methods because the criterion function is continuously differentiable.

Smoothed MSE in Stata

See Blevins, J. R. and S. Khan (2013): "Distribution-Free Estimation of Heteroskedastic Binary Response Models in Stata," Stata Journal 13, 588-602. Blevins and Khan have created a command in Stata, dfbr (for distribution-free binary response), that implements the smoothed MSE and other methods for the estimation of BCM with a nonparametric specification of the distribution of ε.
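A sketch of the smoothed-MSE objective and its maximization with a gradient-based optimizer (not from the notes; the scale normalization used here — last coefficient set to +1 — and the bandwidth value are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def smoothed_mse(y, X, b_n=0.1):
    """Horowitz-style smoothed maximum score: maximize sum_i (2y_i - 1) * Phi(x_i'beta / b_n)."""
    d = 2 * y - 1
    def neg_objective(theta):
        beta = np.concatenate([theta, [1.0]])      # scale normalization on the last regressor
        return -np.sum(d * norm.cdf((X @ beta) / b_n))
    theta0 = np.zeros(X.shape[1] - 1)
    res = minimize(neg_objective, theta0, method="BFGS")
    return np.concatenate([res.x, [1.0]])
```

In practice one would experiment with the bandwidth b_n (and with several starting values), since the smoothed criterion is not globally concave.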
NONPARAMETRIC IDENTIFICATION OF F(ε|X)

Once we have estimated the vector of parameters β using an adaptive method such as the smoothed MSE, we may want to estimate Average Partial Effects (APE) for different individuals in the sample or out of the sample (for different values of x_i). As shown above, to estimate APEs for individuals who are not the average individual in the sample (or for some other average or marginal individual), we need to estimate the distribution of ε. Given β and our assumption that median(ε|X) = 0, is the CDF F(ε|X) nonparametrically identified? No, not without further assumptions. More specifically, no if the only assumption is median independence between ε and X.

Matzkin (Econometrica, 1992). A sufficient condition for the identification of F is:

   (a) X'β = Z + X̃'β̃, where ε and Z are independent;
   (b) conditional on X̃, Z has variation over the whole real line;
   (c) ε is median independent of X̃ (but we do not need full independence).

Proof: The CCP function P(z, x̃) ≡ Pr(Y = 1 | Z = z, X̃ = x̃) is nonparametrically identified from the data at every (z, x̃). Suppose that β̃ has been identified/estimated (e.g., by the MSE). Given any x̃₀ and any ε₀ ∈ R, we can define the value z₀ = ε₀ − x̃₀'β̃. Then:

   F(ε₀ | x̃₀) ≡ Pr(ε ≤ ε₀ | X̃ = x̃₀) = Pr(ε ≤ z₀ + x̃₀'β̃ | X̃ = x̃₀) = P(z₀, x̃₀)

That is, for any (x̃₀, ε₀) we can always define a value z₀ such that the empirical CCP P(z₀, x̃₀) gives us the CDF of ε, F(ε₀ | x̃₀).

EFFICIENT SEMIPARAMETRIC ESTIMATION

Consider the BCM Y = 1{ε ≤ X'β} where: (a) ε is not completely independent of X; instead, Var(ε|X) = σ²(X'β), i.e., there may be heteroscedasticity; (b) ε / σ(X'β) is independent of X with a CDF F(.) that is continuous and strictly increasing. According to this model:

   P(x) = Pr(Y = 1 | X = x) = F( x'β / σ(x'β) )

We define G(x'β) ≡ F( x'β / σ(x'β) ).

Klein and Spady propose a semiparametric maximum likelihood estimator of β and of the function G(.). The log-likelihood function is:

   l(β, G) = Σ_{i=1}^{n} y_i ln G(x_i'β) + (1 − y_i) ln[1 − G(x_i'β)]

And the KS estimator is defined as:

   (β̂_KS, Ĝ_KS) = arg max_{β, G} l(β, G)

The difficult issue here is that G is not a finite-dimensional vector of parameters, but a real-valued function, i.e., an infinite-dimensional vector of parameters. This is not a standard MLE, and both its computation and the derivation of its asymptotic properties are non-standard problems. Under mild regularity conditions, Klein and Spady show that the estimator is consistent and asymptotically normal. The estimator of β is root-n consistent. Also, it is asymptotically efficient within the class of semiparametric estimators.

The procedure starts with an initial guess of the function G. Let Ĝ₀ be this initial guess; for instance, Ĝ₀ can be Φ, i.e., we postulate a probit model with homoscedasticity. Then, at every iteration K ≥ 1 we perform two steps.

Step 1: Estimate β given Ĝ_{K−1}:

   β̂_K = arg max_β l(β, Ĝ_{K−1})

This is a standard MLE (or quasi-MLE).

Step 2: Given β̂_K, obtain a new Ĝ_K using a kernel (Nadaraya-Watson) estimator:

   Ĝ_K(z) = [ Σ_{i=1}^{n} y_i K( (x_i'β̂_K − z)/b_n ) ] / [ Σ_{i=1}^{n} K( (x_i'β̂_K − z)/b_n ) ]

where b_n is a bandwidth parameter. This is a nonparametric estimator of E(Y | X'β̂_K = z), and we know that E(Y | X'β = z) = Pr(Y = 1 | X'β = z) = G(z).

The algorithm iterates until convergence, e.g., until ||β̂_K − β̂_{K−1}|| < 10⁻⁶.
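A simplified sketch of this two-step iteration (not the exact Klein-Spady estimator: it omits trimming and the leave-one-out kernel, uses a Gaussian kernel, normalizes the last coefficient to +1, and uses a derivative-free optimizer; all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def klein_spady_iteration(y, X, b_n=0.2, max_iter=20, tol=1e-6):
    """Alternate between quasi-ML given G_hat (Step 1) and a Nadaraya-Watson update of G_hat (Step 2)."""
    def make_G(beta):
        index = X @ beta                            # x_i' beta_hat from the previous iteration
        def G(z):
            w = norm.pdf((index[None, :] - np.atleast_1d(z)[:, None]) / b_n)
            return np.clip((w @ y) / w.sum(axis=1), 1e-4, 1 - 1e-4)
        return G

    beta = np.concatenate([np.zeros(X.shape[1] - 1), [1.0]])
    for _ in range(max_iter):
        G = make_G(beta)                            # Step 2 output of the previous iteration
        def neg_loglik(theta):
            g = G(X @ np.concatenate([theta, [1.0]]))
            return -np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))
        res = minimize(neg_loglik, beta[:-1], method="Nelder-Mead")   # Step 1
        beta_new = np.concatenate([res.x, [1.0]])
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```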
2.11. BCM WITH CONTINUOUS ENDOGENOUS REGRESSORS

Rivers and Vuong (JE, 1988). Consider the model:

   (1)  Y = 1{ X'β + γW + ε > 0 }
   (2)  W = Z'δ + u

where ε and u are independent of X and Z, but cov(ε, u) ≠ 0, and therefore ε and W are not independent. Suppose that (ε, u) are jointly normal. Then, we have that:

   ε = α u + ξ

where (a) α = σ_εu / σ²_u; (b) ξ is normally distributed as N(0, σ²_ε(1 − ρ²)), where ρ is the correlation between ε and u; (c) ξ is independent of u; (d) since ε is independent of X and Z, ξ is independent of X, Z, and u, and therefore it is independent of W.

Then, we can write the probit model:

   Y = 1{ X'β + γW + αu + ξ > 0 }

And given that ξ is normally distributed and independent of X, W, and u, we have that:

   Pr(Y = 1 | X, W, u) = Φ( (X'β + γW + αu) / σ_ξ )

We do not know u, but we can obtain a consistent estimate of u as the residual û = W − Z'δ̂. Rivers and Vuong (1988) propose the following procedure:

   Step 1: Estimate the regression of W on Z and obtain the residual û.
   Step 2: Run a probit for Y on X, W and û.

This is in fact the method in the Stata command "ivprobit". Using this procedure we obtain consistent estimates of β, γ, and α. Note that α ≠ 0 if and only if cov(ε, u) ≠ 0. Therefore, a t-test of H₀: α = 0 is a test of the endogeneity of W.

Blundell and Powell (Review of Economic Studies, 2004), "Endogeneity in Semiparametric Binary Response Models", extend the Rivers-Vuong method to models where the distribution of the unobservables, ε and u, is nonparametrically specified. [TBW]
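A sketch of the two-step control-function procedure (not the ivprobit implementation itself; function and variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

def rivers_vuong_two_step(y, X, w, Z):
    """Rivers-Vuong two-step probit with a continuous endogenous regressor.
    X: exogenous regressors (incl. constant), w: endogenous regressor, Z: instruments (incl. constant)."""
    # Step 1: OLS of W on Z; keep the residual u_hat as the control function
    first = sm.OLS(w, Z).fit()
    u_hat = w - first.predict(Z)
    # Step 2: probit of Y on X, W, u_hat; the t-statistic on u_hat tests the endogeneity of W
    second = sm.Probit(y, np.column_stack([X, w, u_hat])).fit(disp=0)
    return first, second
```

The second-step standard errors should in principle be adjusted for the fact that û is itself an estimate.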
3. BCM WITH PANEL DATA

As in linear PD models, we distinguish static and dynamic PD BCMs.

(1) Static models (explanatory variables are strictly exogenous):
   (a) Exogenous individual effects: Avery-Hansen-Hotz quasi-MLE.
   (b) Endogenous individual effects, FE methods: (b1) Manski's MSE; (b2) Chamberlain's conditional logit.
   (c) Endogenous individual effects, RE methods: (c1) Chamberlain's correlated RE; (c2) Heckman-Singer finite mixture model.

(2) Dynamic models:
   (a) FE methods: (a1) Chamberlain's conditional logit; (a2) Honore-Kyriazidou conditional logit.
   (b) RE methods: (b1) Heckman-Singer finite mixture model; (b2) Arellano-Carrasco.

3.1. Static Binary Choice Models

Consider the panel data BCM:

   Y_it = 1{ X_it'β + η_i − u_it ≥ 0 }

where u_it is independent of η_i and of {X_i1, X_i2, ..., X_iT}, i.e., the regressors are strictly exogenous. We have panel data with N individuals and T periods, where N is large and T is small. We want to estimate β. We are concerned about the correlation of X_it with the individual effect η_i.

Avery-Hansen-Hotz pseudo-MLE

Suppose that the individual effect η_i and X_it are independently distributed. Then we can use MLE to estimate β. The conditional log-likelihood function is l(β) = Σ_{i=1}^{N} l_i(β), where

   l_i(β) = ln Pr(y_i1, y_i2, ..., y_iT | x_i1, x_i2, ..., x_iT; β)
          = ln Pr( (2y_it − 1) ε_it ≤ (2y_it − 1) x_it'β  for t = 1, 2, ..., T )

with ε_it ≡ u_it − η_i. Since the ε's are serially correlated, these probabilities involve T-dimensional integrals. This is computationally costly. Also, we would have to specify the stochastic process of u_it and estimate the parameters of this process together with β. Is there a way to avoid this multiple integration problem? Is there an "adaptive estimator" that is consistent regardless of the form of the serial correlation in ε_it?

Avery, Hansen and Hotz (IER, 1983) provide a simple estimator that is robust to serial correlation. They show that a method that estimates β using a standard probit (or logit) model that ignores the serial correlation in ε_it is root-N consistent and asymptotically normal.

Consider the pseudo-log-likelihood function:

   l(β) = Σ_{i=1}^{N} Σ_{t=1}^{T} y_it ln Φ(x_it'β) + (1 − y_it) ln[1 − Φ(x_it'β)]

And let β̂_AHH be the value of β that maximizes this function. This is the Avery-Hansen-Hotz estimator. If the distribution of ε_it is normal with zero mean and constant variance (and the stochastic process of ε_it satisfies some standard stationarity conditions), then this estimator is consistent and asymptotically normal, regardless of the serial correlation in ε_it over time (or across individuals).

Why is the Avery-Hansen-Hotz estimator consistent despite being an MLE based on a misspecified likelihood function? Because the likelihood equations that define the estimator are valid moment conditions regardless of the form of the serial correlation in ε_it. The likelihood equations are:

   (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} [ x_it f(x_it'β) / (F(x_it'β)[1 − F(x_it'β)]) ] [ y_it − F(x_it'β) ] = 0

where F and f are the CDF and the PDF of ε_it. As N goes to infinity, these equations converge to:

   Σ_{t=1}^{T} E[ z(x_it) (y_it − F(x_it'β)) ] = 0

where z(x_it) is the vector x_it f(x_it'β) / (F(x_it'β)[1 − F(x_it'β)]). If ε_it and X_it are independently distributed, we can show easily that E[ z(x_it) (y_it − F(x_it'β)) ] = 0, such that these moment conditions / likelihood equations hold. Therefore, the AHH estimator can be seen as a GMM estimator based on valid moment conditions.

Note that we can use the same approach to estimate a BCM using time series data, Y_t = 1{X_t'β − ε_t ≥ 0}, provided the variables satisfy some stationarity conditions. The likelihood equations are:

   (1/T) Σ_{t=1}^{T} [ x_t f(x_t'β) / (F(x_t'β)[1 − F(x_t'β)]) ] [ y_t − F(x_t'β) ] = 0

And under mild stationarity conditions, as T → ∞,

   (1/T) Σ_{t=1}^{T} z(x_t) [ y_t − F(x_t'β) ]  →p  E[ z(x_t) (y_t − F(x_t'β)) ]

which is equal to 0 because E(Y_t | X_t) = F(X_t'β).
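A sketch of a pooled ("Avery-Hansen-Hotz" style) probit on panel data, with standard errors clustered at the individual level to account for the ignored serial correlation (this is my illustration of the idea, not the authors' code):

```python
import numpy as np
import statsmodels.api as sm

def pooled_probit_clustered(y, X, ids):
    """Pooled probit; cluster-robust (sandwich) variance with clusters defined by the individual id."""
    res = sm.Probit(y, X).fit(disp=0)
    scores = res.model.score_obs(res.params)      # observation-level scores
    H = res.model.hessian(res.params)             # Hessian of the pooled log-likelihood
    # Sum scores within each individual, then form the cluster-robust "meat"
    S = np.vstack([scores[ids == g].sum(axis=0) for g in np.unique(ids)])
    bread = np.linalg.inv(H)
    vcov = bread @ (S.T @ S) @ bread              # the two minus signs from H cancel
    return res.params, np.sqrt(np.diag(vcov))
```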
Bias of MLEs based on FD and WG transformations

Now consider the more interesting case where η_i and X_it can be correlated. In a BCM, the transformations of the model in first differences (FD) or within groups (WG) do not eliminate the individual effect η_i:

   ΔY_it = 1{X_it'β + η_i − u_it ≥ 0} − 1{X_it−1'β + η_i − u_it−1 ≥ 0} ≠ 1{ΔX_it'β − Δu_it ≥ 0}

Therefore, an MLE based on the equation ΔY_it = 1{ΔX_it'β − Δu_it ≥ 0} provides an inconsistent estimator of β. We will show later that Manski's Maximum Score Estimator can be used to obtain a consistent estimator of β that is somehow based on a first-difference transformation of the model, but not exactly on the transformation above.

Bias of the ML Dummy-Variables estimator

In the static linear PD model, the LSDV estimator is consistent (for fixed T) and equivalent to the WG estimator. Unfortunately, that is not the case in the static (or dynamic) BCM. The estimator is defined as:

   (β̂, η̂) = arg max_{β, η} Σ_{i=1}^{N} l_i(β, η_i)

where

   l_i(β, η_i) = Σ_{t=1}^{T} y_it ln F(x_it'β + η_i) + (1 − y_it) ln[1 − F(x_it'β + η_i)]

Bias of the ML Dummy-Variables estimator (2)

The likelihood equations are: with respect to β, Σ_{i=1}^{N} ∂l_i(β̂, η̂_i)/∂β = 0; and with respect to η_i, ∂l_i(β̂, η̂_i)/∂η_i = 0 for every i, where:

   ∂l_i(β̂, η̂_i)/∂β  = Σ_{t=1}^{T} [ x_it f(x_it'β̂ + η̂_i) / (F(x_it'β̂ + η̂_i)[1 − F(x_it'β̂ + η̂_i)]) ] [ y_it − F(x_it'β̂ + η̂_i) ]
   ∂l_i(β̂, η̂_i)/∂η_i = Σ_{t=1}^{T} [ f(x_it'β̂ + η̂_i) / (F(x_it'β̂ + η̂_i)[1 − F(x_it'β̂ + η̂_i)]) ] [ y_it − F(x_it'β̂ + η̂_i) ]

Bias of the ML Dummy-Variables estimator (3)

For instance, for the logit model, f(x_it'β̂ + η̂_i) / (F(x_it'β̂ + η̂_i)[1 − F(x_it'β̂ + η̂_i)]) = 1, such that the likelihood equations become:

   Σ_{i=1}^{N} Σ_{t=1}^{T} x_it [ y_it − exp(x_it'β̂ + η̂_i)/(1 + exp(x_it'β̂ + η̂_i)) ] = 0
   Σ_{t=1}^{T} [ y_it − exp(x_it'β̂ + η̂_i)/(1 + exp(x_it'β̂ + η̂_i)) ] = 0    for every i

We can use a BHHH method to compute (β̂, η̂). Greene (Econometrics Journal, 2004) has developed a computationally efficient method to calculate this estimator [in the spirit of the within-groups transformation, but in a sequential method]. In particular, we do not need to invert any matrix with dimension N + K to compute this estimator.

Bias of the ML Dummy-Variables estimator (4)

Though the computation of a dummy-variables (fixed effects) estimator for the PD-BCM is computationally simple, the estimator of β is inconsistent as N → ∞ with T fixed. It is only consistent when T also goes to infinity. This estimator of β does not share the nice properties of the LSDV estimator in linear models. The reason is that, in this model, as N → ∞ with T fixed, cov(β̂, η̂) ≠ 0: the estimator η̂ is asymptotically correlated with β̂, such that the asymptotic estimation error in η̂ contaminates the estimator β̂.

Bias of the ML Dummy-Variables estimator: Example

Consider an example with T = 2, only one explanatory variable x_it that is the dummy variable for t = 2, and a distribution F(.) that is symmetric around the median 0:

   Y_i1 = 1{ η_i − u_i1 ≥ 0 }
   Y_i2 = 1{ β + η_i − u_i2 ≥ 0 }

The (logit) likelihood equations are:

   Σ_{i=1}^{N} [ y_i2 − exp(β̂ + η̂_i)/(1 + exp(β̂ + η̂_i)) ] = 0
   (y_i1 + y_i2) − exp(η̂_i)/(1 + exp(η̂_i)) − exp(β̂ + η̂_i)/(1 + exp(β̂ + η̂_i)) = 0    for every i

For observations with (y_i1, y_i2) = (0, 0), the second equation requires 0 − F(η̂_i) − F(β̂ + η̂_i) = 0, and this implies that: (a) η̂_i → −∞; and (b) these observations do not contribute to the estimator β̂, because l_i(β̂, η̂_i) → 0 for any β̂. For observations with (y_i1, y_i2) = (1, 1), the second equation requires [1 − F(η̂_i)] + [1 − F(β̂ + η̂_i)] = 0, and this implies that: (a) η̂_i → +∞; and (b) these observations do not contribute to the estimator β̂, because l_i(β̂, η̂_i) → 0 for any β̂.

For observations with (y_i1, y_i2) = (0, 1) or (1, 0), the second equation requires 1 = F(η̂_i) + F(β̂ + η̂_i), which implies η̂_i = −β̂/2, such that F(η̂_i) = F(−β̂/2) and F(β̂ + η̂_i) = F(β̂/2), with F(−β̂/2) + F(β̂/2) = 1. Therefore, the concentrated log-likelihood function is:

   l(β) = Σ_{i=1}^{N} 1{y_i1 = 0, y_i2 = 1} ln F(β/2) + 1{y_i1 = 1, y_i2 = 0} ln F(−β/2)

Define p ≡ F(β/2). The concentrated log-likelihood is maximized at:

   p̂ = Σ_{i=1}^{N} 1{y_i1 = 0, y_i2 = 1} / Σ_{i=1}^{N} 1{y_i1 + y_i2 = 1}

And the MLE of β is:

   β̂ = 2 F⁻¹(p̂) = 2 F⁻¹( Σ_i 1{y_i1 = 0, y_i2 = 1} / Σ_i 1{y_i1 + y_i2 = 1} )

Is this a consistent estimator of β? It is clear that:

   plim_{N→∞} β̂ = 2 F⁻¹( plim_{N→∞} p̂ ) = 2 F⁻¹( Pr(Y_i1 = 0, Y_i2 = 1) / Pr(Y_i1 + Y_i2 = 1) )

In general, 2 F⁻¹( Pr(Y_i1 = 0, Y_i2 = 1) / Pr(Y_i1 + Y_i2 = 1) ) ≠ β, and this ML-DV estimator is inconsistent. For instance, for the logit model, we can show that p* ≡ Pr(Y_i1 = 0, Y_i2 = 1) / Pr(Y_i1 + Y_i2 = 1) does not depend on η_i and:

   p* = exp(β) / (1 + exp(β)) = F(β)

Therefore, for the logit model:

   plim_{N→∞} β̂ = 2 F⁻¹( F(β) ) = 2β
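A quick Monte Carlo sketch (not from the notes) of this incidental-parameters result for the T = 2 logit: the closed-form ML dummy-variables estimator converges to 2β rather than β.

```python
import numpy as np

rng = np.random.default_rng(2)
N, beta = 200_000, 1.0
eta = rng.normal(size=N)                                  # individual effects
y1 = (eta - rng.logistic(size=N) >= 0).astype(int)
y2 = (beta + eta - rng.logistic(size=N) >= 0).astype(int)

n01 = np.sum((y1 == 0) & (y2 == 1))
n10 = np.sum((y1 == 1) & (y2 == 0))
beta_mldv = 2 * np.log(n01 / n10)                         # 2 * F^{-1}(p_hat) with logistic F
print(beta_mldv)                                          # close to 2*beta = 2.0
```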
Fixed Effects Estimators for Static Panel Data BCM

As in the case of linear panel data models, we distinguish two approaches: (a) the fixed effects approach: no assumption on the joint distribution of X_i and η_i; (b) the random effects approach: there is a parametric assumption on the joint distribution of X_i and η_i. We consider two fixed effects estimators: 1. Chamberlain's conditional logit model; 2. Manski's MSE applied to the panel data BCM.

Chamberlain Conditional Logit

Consider the BCM Y_it = 1{X_it'β + η_i − u_it ≥ 0}, where u_it has a logistic distribution. Therefore,

   Pr(Y_it = 1 | X_it, η_i) = exp(X_it'β + η_i) / [1 + exp(X_it'β + η_i)]

And if u_it is independent over time:

   Pr(Y_i1, Y_i2, ..., Y_iT | X_i, η_i) = ∏_{t=1}^{T} exp( Y_it (X_it'β + η_i) ) / [1 + exp(X_it'β + η_i)]

Define the random variable S_i = Σ_{t=1}^{T} Y_it, which represents the number of times that the binary event has occurred during the T sample periods.

Chamberlain Conditional Logit (2)

Let Y_i = {y_i1, y_i2, ..., y_iT} and X_i = {x_i1, x_i2, ..., x_iT}. The key result behind Chamberlain's conditional logit estimator is that:

   Pr(Y_i | X_i, S_i, η_i, β) = Pr(Y_i | X_i, S_i, β)

i.e., it does not depend on η_i. First, by the chain rule, it is clear that Pr(Y_i, S_i | X_i, η_i) = Pr(Y_i | X_i, S_i, η_i) Pr(S_i | X_i, η_i), and therefore:

   Pr(Y_i | X_i, S_i, η_i) = Pr(Y_i, S_i | X_i, η_i) / Pr(S_i | X_i, η_i) = Pr(Y_i | X_i, η_i) / Pr(S_i | X_i, η_i)

Given our logit model and that u_it is iid over time, we have that the probability Pr(Y_i | X_i, η_i) is:

   Pr(Y_i | X_i, η_i) = ∏_{t=1}^{T} Pr(y_it | x_it, η_i)
                      = ∏_{t=1}^{T} exp{ y_it (x_it'β + η_i) } / [1 + exp{x_it'β + η_i}]
                      = exp{ Σ_{t=1}^{T} y_it x_it'β + S_i η_i } / ∏_{t=1}^{T} [1 + exp{x_it'β + η_i}]

To derive the expression for Pr(S_i | X_i, η_i) it is useful to define the set:

   H(S_i) = { D = (d₁, d₂, ..., d_T) ∈ {0,1}^T : Σ_{t=1}^{T} d_t = S_i }

Using this definition, we can write:

   Pr(S_i | X_i, η_i) = Σ_{D ∈ H(S_i)} Pr(D | X_i, η_i)
                      = Σ_{D ∈ H(S_i)} ∏_{t=1}^{T} Pr(d_t | x_it, η_i)
                      = Σ_{D ∈ H(S_i)} exp{ Σ_{t=1}^{T} d_t x_it'β + S_i η_i } / ∏_{t=1}^{T} [1 + exp{x_it'β + η_i}]

Combining the previous expressions for Pr(Y_i | X_i, η_i) and Pr(S_i | X_i, η_i), we have that:

   Pr(Y_i | X_i, S_i, η_i) = Pr(Y_i | X_i, η_i) / Pr(S_i | X_i, η_i)
                           = exp{ Σ_{t=1}^{T} y_it x_it'β } / Σ_{D ∈ H(S_i)} exp{ Σ_{t=1}^{T} d_t x_it'β }

which does not depend on η_i. Therefore, Pr(Y_i | X_i, S_i, η_i) = Pr(Y_i | X_i, S_i).

The conditional log-likelihood function is l(β) = Σ_{i=1}^{n} log Pr(Y_i | X_i, S_i). Using the expression for Pr(Y_i | X_i, S_i) obtained before, we have that:

   l(β) = Σ_{i=1}^{n} Σ_{t=1}^{T} y_it x_it'β − Σ_{i=1}^{n} log[ Σ_{D ∈ H(S_i)} exp{ Σ_{t=1}^{T} d_t x_it'β } ]

This function is globally concave in β.
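A small sketch of this conditional log-likelihood that enumerates the set H(S_i) by brute force (fine for small T; not from the notes, and not how production implementations such as Stata's clogit compute it):

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

def conditional_logit_loglik(beta, Y, X):
    """Chamberlain conditional logit log-likelihood. Y: (N, T) array of 0/1; X: (N, T, K)."""
    N, T, K = X.shape
    ll = 0.0
    for i in range(N):
        s = int(Y[i].sum())
        if s == 0 or s == T:
            continue                                  # all-0 and all-1 histories drop out
        idx = X[i] @ beta                             # x_it' beta for t = 1..T
        num = np.exp(np.sum(Y[i] * idx))
        den = sum(np.exp(np.sum(np.array(d) * idx))
                  for d in product([0, 1], repeat=T) if sum(d) == s)
        ll += np.log(num / den)
    return ll

# Estimation (the criterion is globally concave):
# beta_hat = minimize(lambda b: -conditional_logit_loglik(b, Y, X), np.zeros(K)).x
```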
MSE for Panel Data BCM: Manski (Econometrica, 1987)

Consider the BCM Y_it = 1{X_it'β + η_i − u_it ≥ 0}. The model implies that:

   ΔY_it = 1{X_it'β + η_i − u_it ≥ 0} − 1{X_it−1'β + η_i − u_it−1 ≥ 0}

Therefore, conditional on ΔY_it ≠ 0, we have that:

   ΔY_it = +1  if  ΔX_it'β − Δu_it > 0
   ΔY_it = −1  if  ΔX_it'β − Δu_it < 0

If ΔY_it ≠ 0 then: either (a) Y_it−1 = 0 and Y_it = 1, which implies that ΔY_it = +1 and that ΔX_it'β − Δu_it > 0; or (b) Y_it−1 = 1 and Y_it = 0, which implies that ΔY_it = −1 and that ΔX_it'β − Δu_it < 0.

Assumption: Conditional on X_it, X_it−1, and η_i, the variables u_it and u_it−1 have the same probability distribution with support (−∞, +∞). It is possible to show that this assumption implies:

   median( Δu_it | ΔX_it, ΔY_it ≠ 0 ) = 0

Therefore, we can apply the MSE to the model

   ΔY_it = +1  if  ΔX_it'β − Δu_it > 0
   ΔY_it = −1  if  ΔX_it'β − Δu_it < 0

using the observations with ΔY_it ≠ 0. Given a sample {y_it, x_it}, the score function is:

   S(β) = Σ_{i=1}^{N} Σ_{t=1}^{T} 1{Δy_it = +1} 1{ΔX_it'β > 0} + 1{Δy_it = −1} 1{ΔX_it'β < 0}

That is just equal to the number of observations for which we score a correct prediction of the sign of Δy_it if we use the sign of ΔX_it'β as a predictor. The MSE is the value of β that maximizes the score function: β̂_MSE = arg max_β S(β). This estimator has the same properties as in the cross-section case: N^(1/3)-consistent, asymptotically non-normal, and possibly not uniquely defined in finite samples.

Following Horowitz, we can define a smoothed MSE for this estimator by replacing the discontinuous function 1{ΔX_it'β > 0} with a continuously differentiable function K_N(ΔX_it'β) such that K_N(ΔX_it'β) converges uniformly to 1{ΔX_it'β > 0} as N goes to infinity.

Correlated Random Effects Static Probit model

Suppose that η_i and X_i have a joint normal distribution. Then:

   η_i = X_i1'δ₁ + X_i2'δ₂ + ... + X_iT'δ_T + e_i = X_i'δ + e_i

where e_i is normally distributed and independent of X_i. Substituting this expression into the equation of the BCM, we have:

   Y_it = 1{ X_it'β + X_i'δ + e_i − u_it ≥ 0 } = 1{ X_i'π_t − u*_it ≥ 0 }

where π_t = (δ₁, ..., δ_{t−1}, β + δ_t, δ_{t+1}, ..., δ_T) and u*_it = u_it − e_i.

Correlated Random Effects Static Probit model (2)

If u_it is normally distributed (the original model is a Probit model) and independent of X_i, then u*_it is also normally distributed and independent of X_i. Then, Y_it = 1{X_i'π_t − u*_it ≥ 0} is a standard Probit model, and we can estimate the parameters π_t using MLE or the pseudo-MLE of Avery-Hansen-Hotz. Given these estimates of the π_t and of their variance matrix, we can estimate β and δ using a simple minimum distance (MD) estimator. Given that the system of equations that relates π and (β, δ) is linear, the MD estimator has a simple closed-form expression for (β̂, δ̂) in terms of π̂ and Var̂(π̂).

3.2. Dynamic Binary Choice Models

Chamberlain (1985): conditional logit model for the autoregressive PD BCM. Honoré and Kyriazidou (Econometrica, 2000): extension to include also strictly exogenous regressors.

Conditional MLE for the Dynamic PD Logit

Consider the dynamic panel data logit model

   Y_it = 1{ γ Y_i,t−1 + η_i − u_it > 0 }

where u_it has a logistic distribution. In this model, S_i = Σ_{t=1}^{T} y_it is not a sufficient statistic for η_i. That is, it is not true that Pr(Y_it | Y_it−1, S_i, η_i) = Pr(Y_it | Y_it−1, S_i). However, fortunately, there is an alternative way to construct a sufficient statistic for η_i by conditioning on (Y_i1, Y_iT, S_i).

Conditional MLE for the Dynamic PD Logit (2)

Suppose that T = 4 and let Y_i = {y_i1, y_i2, y_i3, y_i4} be the choice history of individual i. We distinguish four sets of choice histories:

   A = {y₁, 1, 0, y₄}
   B = {y₁, 0, 1, y₄}
   C = {y₁, 1, 1, y₄}
   D = {y₁, 0, 0, y₄}

Define S_i = 1(Y_i ∈ A ∪ B). We will show that:

   Pr(Y_i | 1(Y_i ∈ A ∪ B), η_i, γ) = Pr(Y_i | 1(Y_i ∈ A ∪ B), γ)

We can construct a (conditional) likelihood function based on the probabilities Pr(Y_i | 1(Y_i ∈ A ∪ B), γ), and the corresponding MLE is a consistent estimator of γ.

Conditional MLE for the Dynamic PD Logit (3)

First, we obtain Pr(Y_i | η_i, A ∪ B). By Bayes' rule we have that:

   Pr(Y_i | η_i, A ∪ B) = Pr(Y_i | η_i) / Pr(A ∪ B | η_i) = Pr(Y_i | η_i) / [ Pr(A | η_i) + Pr(B | η_i) ]

Note that:

   Pr(A | η_i) = Pr(y₁ | η_i) Pr(1 | y₁, η_i) Pr(0 | 1, η_i) Pr(y₄ | 0, η_i)
               = Pr(y₁ | η_i) · [ exp(γy₁ + η_i) / (1 + exp(γy₁ + η_i)) ] · [ 1 / (1 + exp(γ + η_i)) ] · [ exp(y₄ η_i) / (1 + exp(η_i)) ]

And:

   Pr(B | η_i) = Pr(y₁ | η_i) Pr(0 | y₁, η_i) Pr(1 | 0, η_i) Pr(y₄ | 1, η_i)
               = Pr(y₁ | η_i) · [ 1 / (1 + exp(γy₁ + η_i)) ] · [ exp(η_i) / (1 + exp(η_i)) ] · [ exp(y₄ [γ + η_i]) / (1 + exp(γ + η_i)) ]

Conditional MLE for the Dynamic PD Logit (4)

Therefore,

   Pr(A | η_i, A ∪ B) = Pr(A | η_i) / [ Pr(A | η_i) + Pr(B | η_i) ] = exp(γ [y₁ − y₄]) / [ 1 + exp(γ [y₁ − y₄]) ]

The CMLE is the value of γ that maximizes the conditional log-likelihood function:

   l_C(γ) = Σ_i 1{y_i2 = 1, y_i3 = 0} ln Λ( γ [y_1i − y_4i] ) + 1{y_i2 = 0, y_i3 = 1} ln Λ( −γ [y_1i − y_4i] )

where Λ(.) is the logistic function.

Conditional MLE for the Dynamic PD Logit (5)

In this simple model, with T = 4 and without exogenous covariates X, it is simple to show that the CMLE of γ is:

   γ̂ = log( [ #{1,1,0,0} + #{0,0,1,1} ] / [ #{0,1,0,1} + #{1,0,1,0} ] )

where #{y₁, y₂, y₃, y₄} is the number of individuals in the sample with choice history {y₁, y₂, y₃, y₄}.
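A minimal sketch of this closed-form estimator (not from the notes; it simply counts the relevant histories):

```python
import numpy as np

def gamma_hat_T4(Y):
    """Conditional ML estimator of gamma for the T=4 dynamic logit without covariates.
    Y is an (N, 4) array of 0/1 choice histories."""
    def count(h):
        return np.sum(np.all(Y == np.array(h), axis=1))
    num = count([1, 1, 0, 0]) + count([0, 0, 1, 1])
    den = count([0, 1, 0, 1]) + count([1, 0, 1, 0])
    return np.log(num / den)
```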
Interpretation (intuition)

- If time persistence in y_it is generated by individual heterogeneity, then for a given individual we should observe persistence in only one of the two states: either at 0 (if η_i is small) or at 1 (if η_i is large).
- If time persistence in y_it is generated by true state dependence (γ > 0), then we should observe persistence in both states, 0 and 1.

The choice histories {1,1,0,0} and {0,0,1,1} are the only histories that provide evidence of persistence in both states. The larger the sample frequency of these histories, the stronger the evidence of structural state dependence and the larger the estimate of γ. The choice histories {0,1,0,1} and {1,0,1,0} are the only histories that provide evidence of no persistence in either of the two states. The larger the sample frequency of these histories, the smaller the estimate of γ.

Conditional MLE for the Dynamic PD Logit (6)

It is possible to extend the previous result to panel data with any value of T ≥ 4 to obtain the following expression. Let Y_i = {Y_i1, Y_i2, ..., Y_iT} and s_i = Σ_{t=1}^{T} y_it. Then,

   Pr(Y_i | η_i, s_i, y_i1, y_iT) = exp( γ Σ_{t=2}^{T} y_it y_i,t−1 ) / Σ_{d ∈ C_i} exp( γ Σ_{t=2}^{T} d_t d_{t−1} )

where:

   C_i = { (d₁, d₂, ..., d_T) ∈ {0,1}^T : Σ_{t=1}^{T} d_t = s_i, d₁ = y_i1, d_T = y_iT }

Honoré and Kyriazidou (Econometrica, 2000)

Consider the dynamic panel data logit model

   Y_it = 1{ γ Y_i,t−1 + X_it'β + η_i − u_it > 0 }

where u_it has a logistic distribution, and X_it are strictly exogenous regressors with respect to u_it. For T = 4, they show that (s_i, y_i1, y_i4) are sufficient statistics for η_i only if we condition on x_i3 = x_i4. They propose a version of the CMLE that incorporates kernel weights that depend on the distance ||x_i3 − x_i4||.