Sample Lecture Notes for Advanced Graduate Econometrics

Tian Xie
Queen's University
Kingston, Ontario, Canada

December 1, 2012

© Tian Xie 2012

Preface

These are sample lecture notes for advanced econometrics. Students are assumed to have completed an introductory and an intermediate econometrics course, or the equivalent. The course aims to help students establish a solid background in both theoretical and empirical econometrics. The notes contain enough material for a two-semester course, or they can be condensed to fit a one-semester course, depending on the course design. The topics covered start at a moderate level of difficulty, which gradually increases as the course proceeds.

The lecture notes were developed while I was completing my PhD in the Queen's University economics department. They were created based on the following textbooks:

• "Principles of Econometrics" (3rd Edition) by R. Carter Hill, William Griffiths and Guay C. Lim
• "Introduction to Econometrics" (3rd Edition) by James H. Stock and Mark W. Watson
• "Econometric Analysis" (6th Edition) by William H. Greene
• "Econometric Theory and Methods" by Russell Davidson and James G. MacKinnon

I used these lecture notes while working as a lecturer/TA/private tutor for both undergraduate and graduate students. The notes have received positive feedback and strong reviews from my students, as documented in the enclosed "2012 Winter Term Evaluation Form".

Contents

1 The Generalized Method of Moments
  1.1 GMM Estimators for Linear Regression Models
  1.2 HAC Covariance Matrix Estimation
  1.3 Tests Based on the GMM Criterion Function
    1.3.1 Tests of Overidentifying Restrictions
    1.3.2 Tests of Linear Restrictions
  1.4 GMM Estimators for Nonlinear Models

2 The Method of Maximum Likelihood
  2.1 Basic Concepts
    2.1.1 Regression Models with Normal Errors
    2.1.2 Computing ML Estimates
  2.2 Asymptotic Properties of ML Estimators
    2.2.1 Consistency of MLE
    2.2.2 Dependent Observations
    2.2.3 The Gradient, the Information Matrix and the Hessian
    2.2.4 Asymptotic Normality of the MLE
  2.3 *The Covariance Matrix of the ML Estimator
    2.3.1 *Example: the Classical Normal Linear Model
  2.4 *Hypothesis Testing
  2.5 *Transformations of the Dependent Variable

3 Discrete and Limited Dependent Variables
  3.1 Binary Response Models: Estimation
  3.2 Binary Response Models: Inference
  3.3 Models for Multiple Discrete Responses
  3.4 Models for Count Data
  3.5 Models for Censored and Truncated Data
  3.6 Sample Selectivity
  3.7 Duration Models

4 Multivariate Models
  4.1 Seemingly Unrelated Linear Regressions (SUR)
    4.1.1 MM Estimation
    4.1.2 Maximum Likelihood Estimation
  4.2 Linear Simultaneous Equations Models
    4.2.1 GMM Estimation
    4.2.2 Structural and Reduced Forms
    4.2.3 Maximum Likelihood Estimation

5 Methods for Stationary Time-Series Data
  5.1 AR Process
  5.2 MA Process and ARMA Process
  5.3 Single-Equation Dynamic Models
  5.4 Seasonality
  5.5 ARCH, GARCH

6 Unit Roots and Cointegration
  6.1 Unit Root
  6.2 Cointegration
    6.2.1 VAR Representation
    6.2.2 Testing for Cointegration

7 Testing the Specification of Econometric Models
  7.1 Specification Tests Based on Artificial Regressions
    7.1.1 RESET Test
    7.1.2 Conditional Moment Tests
    7.1.3 Information Matrix Tests
  7.2 Nonnested Hypothesis Tests
    7.2.1 J Tests
    7.2.2 P Tests
    7.2.3 Cox Tests
  7.3 Model Selection Based on Information Criteria
  7.4 Nonparametric Estimation
    7.4.1 Kernel Estimation
    7.4.2 Nonparametric Regression

A Sample Assignment and Solution

Chapter 1 The Generalized Method of Moments

1.1 GMM Estimators for Linear Regression Models

We can consider GMM as a combination of IV and GLS, for the case in which endogenous regressors and heteroskedasticity coexist. The linear regression model is given as

    y = Xβ + u,    E(uu>) = Ω,    (1.1)

where there are n observations. We assume that there exists an n × l matrix of predetermined instrumental variables W satisfying the condition E(u_t | W_t) = 0 for t = 1, ..., n. As we have seen in IV estimation, we need to first "break" the correlation between X and u by premultiplying by a linear combination of W>, and then "purify" the error terms to make the covariance matrix of the purified error terms proportional to I.
The sample moment conditions of (1.1) then become J > W > (y − Xβ) = 0, ˆ = (J > W > X)−1 J > W > y. We choose J so as to minimize where J is l × k. The associated β ˆ − β 0 ), in which the covariance matrix of the plim of n1/2 (β J = (W > Ω0 W )−1 W > X, where Ω0 is the true value of Ω. Then, the efficient (infeasible) GMM estimator is ˆ GMM = X > W (W > Ω0 W )−1 W > X −1 X > W (W > Ω0 W )−1 W > y. β (1.2) Of course, the above efficient GMM estimator can be obtained by minimizing the (efficient) GMM criterion function Q(β, y) ≡ (y − Xβ)> W (W > Ω0 W )−1 W > (y − Xβ). (1.3) However, the above criterion is actually based on the efficient J . Without knowing J in advance, there is no way you can form up a criterion function like (1.3), but (y − Xβ)> W ΛW > (y − Xβ), which will lead us to the inefficient GMM estimator ˆ = (X > W ΛW > X)−1 X > W ΛW > y. β 1 2 1.2 CHAPTER 1. THE GENERALIZED METHOD OF MOMENTS HAC Covariance Matrix Estimation We notice that both (1.2) and (1.3) include an unknown matrix Ω0 . If we replace Ω0 with ˆ we obtain the feasible efficient GMM estimator an efficient estimate Ω, ˆ FGMM = X > W (W > ΩW ˆ )−1 W > X β −1 ˆ )−1 W > y. X > W (W > ΩW (1.4) ˆ or equivThe performance of the feasible estimator (1.4) crucially depends on the estimate Ω >ˆ >ˆ 1 alently W ΩW . In practice, we should always compute n W ΩW instead for computation efficiency. ˆ If there is only heteroskedasticity, we can compute n1 W > ΩW in a fashion similar to the ˆ HCCMEs introduced Chapter 5. A typical element of the matrix n1 W > ΩW is n 1X 2 uˆ wti wtj , n t=1 t ˆ and β ˆ is a preliminary estimator; for example, the generalized IV where uˆt ≡ yt − X t β estimator. The HCCMEs are not appropriate when there is also autocorrelation. In this case, we need to use a heteroskedasticity and autocorrelation consistent (HAC) estimator. We define a set of autocovariance matrices to mimic the serial correlation: 1 Pn E(ut ut−j W > t W t−j ) for j ≥ 0 n P t=j+1 Γ(j) ≡ n > 1 t=−j+1 E(ut+j ut W t+j W t ) for j < 0 n It is easy to show that Γ(j) = Γ> (−j). Then, we can write the sample HAC estimator for 1 ˆ W > ΩW as n p X ˆ HW = Γ(0) ˆ ˆ ˆ > (j) . Σ + Γ(j) +Γ (1.5) j=1 This estimator is usually refer as the Hansen-White estimator. The threshold parameter p, which is also called the lag truncation parameter, is somewhat arbitrary. For the purposes of asymptotic theory, we need p to go to infinity at some suitable rate as the sample size goes to infinity; for example, n1/4 . In practice, when the sample size is finite, we need to choose a specific value of p. Another issue with the HW estimator (1.5) is that it is not always positive-definite in finite samples. To solve this critical deficiency, Newey and West (1987) proposed a modified HW estimator, which is later known as the Newey-West estimator: ˆ NW Σ p X ˆ = Γ(0) + 1− j=1 j p+1 > ˆ ˆ Γ(j) + Γ (j) . (1.6) The NW estimator is a biased estimator of Σ, since it underestimates the autocovariance matrices. It is still consistent if p increases as n does, and the appropriate rate is n1/3 . Note that Newey and West (1994) proposed a procedure to automatically select p. 1.3. TESTS BASED ON THE GMM CRITERION FUNCTION 3 ˆ (HW, NW or other estimators), the feasible efficient GMM estimaWith an estimated Σ tor becomes −1 > > ˆ ˆ −1 > ˆ −1 W > y. β X WΣ FGMM = X W Σ W X The associated covariance matrix is then > ˆ ˆ −1 > −1 Var(β F GM M ) = n(X W Σ W X) . 
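To make the preceding formulas concrete, the following sketch computes the Newey–West estimate (1.6) of Σ = n⁻¹ W>ΩW from a vector of preliminary residuals and then forms the feasible efficient GMM estimator and its covariance matrix. It is only an illustration under simple assumptions; the array names (y, X, W, u_hat) and the helper functions are hypothetical, not part of the original notes.

```python
import numpy as np

def newey_west(W, u_hat, p):
    """Newey-West estimate of Sigma = (1/n) W' Omega W with lag truncation p."""
    n = W.shape[0]
    Z = W * u_hat[:, None]                 # row t is u_t * W_t
    Sigma = Z.T @ Z / n                    # Gamma(0)
    for j in range(1, p + 1):
        Gamma_j = Z[j:].T @ Z[:-j] / n     # Gamma(j)
        w_j = 1.0 - j / (p + 1.0)          # Newey-West weight
        Sigma += w_j * (Gamma_j + Gamma_j.T)
    return Sigma

def feasible_gmm(y, X, W, u_hat, p):
    """Feasible efficient GMM estimator and its covariance matrix."""
    n = X.shape[0]
    Sigma = newey_west(W, u_hat, p)
    XW, Wy = X.T @ W / n, W.T @ y / n
    A = XW @ np.linalg.solve(Sigma, XW.T)      # (1/n^2) X'W Sigma^{-1} W'X
    b = XW @ np.linalg.solve(Sigma, Wy)        # (1/n^2) X'W Sigma^{-1} W'y
    beta = np.linalg.solve(A, b)
    cov = np.linalg.inv(A) / n                 # n (X'W Sigma^{-1} W'X)^{-1}
    return beta, cov
```

In practice u_hat would come from a preliminary estimator such as generalized IV, and the lag truncation parameter p would grow slowly with the sample size (for example at rate n^(1/3)), as discussed above.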
1.3 Tests Based on the GMM Criterion Function We first rewrite the GMM criterion function here Q(β, y) ≡ (y − Xβ)> W (W > ΩW )−1 W > (y − Xβ). Define Ω−1 = ΨΨ> , which implies Ω = (Ψ> )−1 Ψ−1 . Then h i −1 > > −1 −1 > > −1 > −1 Q(β, y) = (y − Xβ) Ψ Ψ W (W (Ψ ) Ψ W ) W (Ψ ) Ψ> (y − Xβ) = (y − Xβ)> ΨPA Ψ> (y − Xβ) ˜ > PA u ˜ = u ˜ = Ψ> u ∼ iid(0, I). Therefore, where A = Ψ−1 W , u a Q(β 0 , y) ∼ χ2 Rank(A) = χ2 (l), (1.7) where β 0 is true parameter vector. All tests in this section are based on result (1.7). 1.3.1 Tests of Overidentifying Restrictions Whenever l > k, a model estimated by GMM involves l − k overidentifying restrictions. Our ˆ target is to find the distribution of test statistic Q(β GMM , y): ˆ GMM , y) = (y − X β ˆ GMM )> ΨPA Ψ> (y − X β ˆ GMM ). Q(β If the model is correctly specified, it can be shown that ˆ PA Ψ> (y − X β GMM ) = = = = PA (I − PPA Ψ> X )Ψ> y (PA − PPA Ψ> X )Ψ> (Xβ 0 + u) (PA − PPA Ψ> X )Ψ> u (PA − PPA Ψ> X )˜ u. Therefore, a 2 > 2 ˆ Q(β GMM , y) ∼ χ Rank(A) − Rank(PA Ψ X) = χ (l − k). This test statistic is often called Hansen’s overidentification statistic. 4 1.3.2 CHAPTER 1. THE GENERALIZED METHOD OF MOMENTS Tests of Linear Restrictions It is even easier to construct a test statistics for linear restrictions based on the GMM criterion. For model y = X 1 β 1 + X 2 β 2 + u, Suppose we want to test the restrictions β 2 ˜ ˜ obtain a constraint estimates β = [β FGMM 1 E(uu> ) = Ω. = 0, where β 1 is k1 × 1 and β 2 is k2 × 1. We first .. . 0]. Then, the constraint criterion function is ˜ FGMM , y) = (y − X 1 β ˜ 1 )> W (W > ΩW ˜ ) ˆ )−1 W > (y − X 1 β Q(β 1 a > > 2 = u Ψ(PA − PPA Ψ> X1 )Ψ u ∼ χ (l − k1 ) ˆ GMM , y) from Q(β ˜ FGMM , y), the difference between the constrained and If we subtract Q(β unconstrained criterion function is ˜ FGMM , y) − Q(β ˆ GMM , y) = u> Ψ(PP Ψ> X − PP Ψ> X )Ψ> u Q(β A A 1 a ∼ χ2 (k − k1 ) = χ2 (k2 ) Note that the above results hold only for efficient GMM estimation. It is not true for nonoptimal criterion functions like (y − Xβ)> W ΛW > (y − Xβ). The reason is very simple: we can (almost) never construct a projection matrix of any forms from Λ. Therefore, the asymptotic distribution can rarely be a χ2 distribution. 1.4 GMM Estimators for Nonlinear Models When dealing with nonlinear models, we can adopt an approach that employs the concept of an elementary zero function. We denote f (θ, y) as a set of such function, where θ is a k-vector of parameters. The n × 1 f (θ, y) correspond to the error terms, in which f (θ, y) ≡ y − Xθ = u in the linear regression case. The n × k matrix F (θ), which has typical element ∂ft (θ) , ∂θi correspond to the regressors X in linear models. We can simply duplicate results from above sections by replacing u with f (θ, y) and X ˆ is with F (θ). For example, the covariance matrix of θ Fti (θ) ≡ ˆ = n(F ˆ >W Σ ˆ −1 W > F ˆ )−1 d θ) Var( and the nonlinear GMM criterion function is Q(θ, y) = f > (θ, y)ΨPA Ψ> f (θ, y). Of course, the real story is much more complicated than one we just showed. Assumptions ˆ and regularity conditions are different for nonlinear models. Also, in practice, the F (θ) matrix is usually very painful to estimate. Chapter 2 The Method of Maximum Likelihood 2.1 Basic Concepts Models that are estimated by maximum likelihood must be fully specified parametric models. We denote the dependent variable by the n-vector y. For a given k-vector θ of parameters, let the joint PDF of y be written as f (y, θ). 
This joint PDF function f (y, θ) is referred to as the likelihood function of the model for the given data set; and the ˆ that maximizes f (y, θ) is called a maximum likelihood estimate, or parameter vector θ MLE, of the parameters. If observations are assumed to be independent, the joint density of the entire sample y is n Y f (yt , θ). f (y, θ) = t=1 Because the above product is often a very large or very small number, it is customary to work instead with the loglikelihood function l(y, θ) ≡ log f (y, θ) = n X lt (yt , θ), t=1 where lt (yt , θ) is equal to log ft (y, θ). For example, if yt is generated by the density f (yt , θ) = θe−θyt , yt > 0, θ > 0, we take the logarithm of the density and obtain lt (yt , θ) = log θ − θyt . Therefore, l(y, θ) = n X (log θ − θyt ) = n log θ − θ t=1 n X t=1 We take FOC with respect to θ, n n X − yt = 0, θ t=1 P n which can be solved to yield θˆ = n/ t=1 yt . 5 yt . 6 2.1.1 CHAPTER 2. THE METHOD OF MAXIMUM LIKELIHOOD Regression Models with Normal Errors For the classical normal linear models y = Xβ + u, u ∼ N(0, σ 2 I), it is very easy to show that the regressand yt is distributed, conditionally on X, as N(X t β, σ 2 ). Thus the PDF of yt is 1 (yt − X t β)2 ft (yt , β, σ) = √ exp − . 2σ 2 σ 2π Take the log to obtain lt (yt , β, σ) such that 1 1 1 lt (yt , β, σ) = − log 2π − log σ 2 − 2 (yt − X t β)2 . 2 2 2σ Thus the sum of all lt (yt , β, σ) is n 1 n l(y, β, σ) = − log 2π − log σ 2 − 2 (y − Xβ)> (y − Xβ). 2 2 2σ (2.1) Taking FOC with respect to σ gives us ∂l(y, β, σ) n 1 = − + 3 (y − Xβ)> (y − Xβ) = 0, ∂σ σ σ which yields the result that σ ˆ 2 (β) = 1 (y − Xβ)> (y − Xβ). n Substituting σ ˆ 2 (β) into (2.1) yields the concentrated loglikelihood functions n n n 1 c > (y − Xβ) (y − Xβ) − . l (y, β) = − log 2π − log 2 2 n 2 Maximizing (2.2) is equivalent to minimize (y − Xβ)> (y − Xβ), which yields ˆ = (X > X)−1 X > y. β ˆ Then, the ML estimate σ ˆ 2 = SSR(β)/n, which is biased downward. 2.1.2 Computing ML Estimates We define the gradient vector g(y, θ), which has typical element n ∂l(y, θ) X ∂lt (yt , θ) gi (y, θ) ≡ = . ∂θi ∂θ i t=1 (2.2) 2.2. ASYMPTOTIC PROPERTIES OF ML ESTIMATORS 7 The gradient vector is the vector of first derivatives of the loglikelihood function. Let H(θ) be the Hession matrix of l(θ) with typical element Hij (θ) = ∂ 2 l(θ) . ∂θi ∂θj The Hessian is the matrix of second derivatives of the loglikelihood function. Let θ (j) denote the value of the vector of estimates at step j of the algorithm, and let g (j) and H (j) denote the gradient and Hessian evaluated at θ (j) . We perform a secondorder Taylor expansion of l(y, θ) around θ (0) (the initial value of θ) in order to obtain an approximation l∗ (θ) to l(θ): 1 > l∗ (θ) = l(θ (0) ) + g > (0) (θ − θ (0) ) + (θ − θ (0) ) H (0) (θ − θ (0) ), 2 The FOC for a maximum of l∗ (θ) with respect to θ can be written as g (0) + H (0) (θ − θ (0) ) = 0, which implies θ (1) in the next step θ (1) = θ (0) − H −1 (0) g (0) . We keep doing this iteration until the estimator converges. The whole process can be summarized by the following equation: θ (j+1) = θ (j) − H −1 (j) g (j) , which is the fundamental equation for Newton’s Method. In practice, however, Newton’s Method does not always work due to the reason that the Hessian is not negative definite. 
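As a quick illustration of the Newton iteration (a sketch, not part of the original derivation), consider the exponential-density example from Section 2.1, where g(θ) = n/θ − Σ y_t and H(θ) = −n/θ². Starting from a reasonable value, the iteration converges to the closed-form MLE θ̂ = n/Σ y_t; starting too far away, the step can overshoot into the inadmissible region θ ≤ 0, which is the kind of failure alluded to above.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=500)   # true theta = 1
n = y.size

theta = 0.5                                # reasonable starting value theta_(0)
for _ in range(50):
    g = n / theta - y.sum()                # gradient of l(y, theta)
    H = -n / theta**2                      # Hessian, negative for theta > 0
    step = -g / H                          # theta_(j+1) = theta_(j) - H^{-1} g
    theta += step
    if abs(step) < 1e-10:
        break
    # A poor starting value can overshoot into theta <= 0, where l is undefined.

print(theta, n / y.sum())                  # Newton iterate vs. closed-form MLE
```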
In such cases, one popular and more appropriate way to obtain the MLE is to use the so-called quasi-Newton method, in which the fundamental equation is replaced by θ (j+1) = θ (j) + α(j) D −1 (j) g (j) , where α(j) is a scalar which is determined at each step, and D (j) is a positive definite matrix which approximates −H (j) . 2.2 2.2.1 Asymptotic Properties of ML Estimators Consistency of MLE The loglikelihood function l(θ) is a concave function. Otherwise, we are not able to find its maximum. Let L(θ) = exp l(θ) denote the likelihood function. By Jensen’s Inequality, we have Z L(θ ∗ ) L(θ ∗ ) L(θ ∗ ) < log E0 = log L(θ 0 )dy = 0, E0 log L(θ 0 ) L(θ 0 ) L(θ 0 ) which implies E0 l(θ ∗ ) − E0 l(θ 0 ) < 0. 8 CHAPTER 2. THE METHOD OF MAXIMUM LIKELIHOOD By LLN, we can assert 1 1 plim l(θ ∗ ) ≤ plim l(θ 0 ) ∀θ ∗ 6= θ 0 . n→∞ n n→∞ n (2.3) ˆ maximizes l(θ), it must be the case that Since the MLE θ 1 ˆ 1 plim l(θ) ≥ plim l(θ 0 ). n→∞ n n→∞ n (2.4) The only way that (2.3) and (2.4) can both be true is if 1 ˆ 1 plim l(θ) = plim l(θ 0 ). n→∞ n n→∞ n (2.5) ˆ is consistent, simply because the Note that the result (2.5) alone does not prove that θ inequality condition (2.3) may also imply that there are many values θ ∗ satisfies 1 1 plim l(θ ∗ ) = plim l(θ 0 ). n→∞ n n→∞ n Therefore we must also assume that 1 1 plim l(θ ∗ ) 6= plim l(θ 0 ) ∀θ ∗ 6= θ 0 , n→∞ n n→∞ n which is also known as a asymptotic identification condition. 2.2.2 Dependent Observations For a sample of size n, the joint density of n dependent observations follows n f (y ) = n Y f (yt |y t−1 ), t=1 where the vector y t is a t-vector with components y1 , y2 , ..., yt . If the model is to be estimated by MLE, the density f (y n ) depends on a k-vector of parameters θ, n f (y , θ) = n Y f (yt |y t−1 ; θ). t=1 The corresponding loglikelihood function is then l(y, θ) = n X t=1 lt (y t , θ). 2.2. ASYMPTOTIC PROPERTIES OF ML ESTIMATORS 2.2.3 9 The Gradient, the Information Matrix and the Hessian We define the n × k matrix of contributions to the gradient G(y, θ) so as to have typical element ∂lt (y t , θ) Gti (y t , θ) ≡ . ∂θi We should be able to tell the difference between G(y, θ) and g(y, θ) that has typical element n gi (y, θ) ≡ ∂l(y, θ) X ∂lt (yt , θ) = . ∂θi ∂θ i t=1 The covariance matrix of the elements of the tth row Gt (y t , θ) of G(y, θ) is the k × k matrix I t (θ), which is normally positive definite. The information matrix is the sum of I t (θ) n n X X t t . (y , θ)G (y , θ) I t (θ) = Eθ G> I(θ) ≡ t t t=1 t=1 We have already defined the Hessian matrix H(y, θ) that has a typical element Hij (y, θ) = ∂ 2 l(y, θ) . ∂θi ∂θj For asymptotic analysis, we are generally more interested in their asymptotic matrices 1 I(θ) ≡ plimθ I(θ), n→∞ n 1 H(θ) ≡ plimθ H(θ), n→∞ n where I(θ) = −H(θ). 2.2.4 Asymptotic Normality of the MLE ˆ we perform a Taylor expansion around θ 0 For a likelihood estimate g(θ), ˆ = g(θ 0 ) + H(θ)( ¯ θ ˆ − θ 0 ) = 0, g(θ) which implies ˆ − θ 0 ) = − n−1 H(θ) ¯ n1/2 (θ −1 n−1/2 g(θ 0 ) a = −H−1 (θ 0 )n−1/2 g(θ 0 ) a = I −1 (θ 0 )n−1/2 g(θ 0 ). Since gi (y, θ) = n X t=1 Gti (y t , θ) and E(Gti (y t , θ)) = 0, 10 CHAPTER 2. THE METHOD OF MAXIMUM LIKELIHOOD By CLT, we have a n−1/2 g(θ 0 ) ∼ N 0, I(θ 0 ) . Finally, we obtain a ˆ − θ0 ) ∼ n1/2 (θ N 0, I −1 (θ 0 ) a ∼ N 0, −H−1 (θ 0 ) a ∼ N 0, H−1 (θ 0 )I(θ 0 )H−1 (θ 0 ) ∗ 2.3 The Covariance Matrix of the ML Estimator Based on the asymptotic result we just derived above, four covariance matrix estimators can be formulated: (i) The empirical Hessian estimator: ˆ = −H −1 (θ). 
ˆ d H (θ) Var (ii) The information matrix estimator: ˆ = I −1 (θ). ˆ d IM (θ) Var (iii) The outer-product-of-the-gradient estimator: −1 > ˆ ˆ ˆ d VarOPG (θ) = G (θ)G(θ) , which is based on the following theoretical result: I(θ 0 ) = E G> (θ 0 )G(θ 0 ) . (iv) The sandwich estimator: ˆ = H −1 (θ)G ˆ > (θ)G( ˆ ˆ −1 (θ). ˆ d S (θ) Var θ)H 2.3.1 ∗ Example: the Classical Normal Linear Model For a classical normal linear model y = Xβ + u, u ∼ N(0, σ 2 I), where X is n × k. The contribution to the loglikelihood function made by the tth observation is 1 1 1 lt (yt , β, σ) = − log 2π − log σ 2 − 2 (yt − X t β)2 . 2 2 2σ A typical element of any of the first k columns of the matrix G is Gti (β, σ) = ∂lt 1 = 2 (yt − X t β)xti , ∂βi σ i = 1, ..., k. 2.3. ∗ THE COVARIANCE MATRIX OF THE ML ESTIMATOR 11 A typical element of the last column of G is Gt,k+1 (β, σ) = 1 1 ∂lt = − + 3 (yt − X t β)2 . ∂σ σ σ This implies the following: (i) For i, j = 1, ..., k, the ij th element of G> G is 2 X n n n X X 1 1 1 2 (yt − X t β)xti = (yt − X t β) xti xtj = xti xtj . 2 4 σ σ σ2 t=1 t=1 t=1 (ii) The (k + 1), (k + 1)th element of G> G is 2 n X 1 1 2 − + 3 (yt − X t β) σ σ t=1 n n X 2 X 1 n 2 = − (y − X β) + (y − X t β)4 t t 4 6 t σ2 σ σ t=1 t=1 2n 3n 2n n − 2 + 2 = 2. 2 σ σ σ σ = (iii) The (i, k + 1)th element of G> G is n X 1 1 1 2 − + 3 (yt − X t β) (yt − X t β)xti σ σ σ2 t=1 = − n n X X 1 1 (y − X β)x + (yt − X t β)3 xti t t ti 3 5 σ σ t=1 t=1 = 0. Therefore, 1 I(β, σ) = plim G> G = plim n→∞ n→∞ n n−1 X > X/σ 2 0 0> 2/σ 2 . If we want to estimate the covariance matrix using the IM estimator, we have 2 > −1 σ ˆ (X X) 0 ˆ d VarIM (β, σ ˆ) = . 0> σ ˆ 2 /2n √ Thus the standard error of σ ˆ is estimated as σ ˆ / 2n and the C.I. can be constructed following √ √ [ˆ σ − cα/2 σ ˆ / 2n, σ ˆ + cα/2 σ ˆ / 2n], given the fact that u ∼ N(0, σ 2 I). 12 2.4 CHAPTER 2. THE METHOD OF MAXIMUM LIKELIHOOD ∗ Hypothesis Testing Assume that the null hypothesis imposes r restrictions on θ and we can write these as r(θ) = 0. ˜ and θ ˆ denote, respectively, the restricted and unrestricted maximum likelihood estiLet θ mates of θ. There are three classical tests: (i) The likelihood ratio test a 2 ˆ − l(θ) ˜ ∼ LR = 2 l(θ) χ (r). (ii) The Wald test a ˆ R(θ) ˆ Var( ˆ > (θ) ˆ −1 r(θ) ˆ ∼ d θ)R W = r > (θ) χ2 (r), where R(θ) = ∂r(θ)/∂θ is an r × k matrix. (iii) The Lagrange multiplier test ˜ I˜ −1 g(θ), ˜ LM = g > (θ) ˜ > R(θ) ˜ I˜ −1 R> (θ) ˜ λ, ˜ = λ a ∼ χ2 (r). ˜ is estimated from the constrained maximization problem where λ max l(θ) − r > (θ)λ. θ 2.5 ∗ Transformations of the Dependent Variable We consider the following model log yt = X t β + ut , ut ∼ NID(0, σ 2 ) The loglikelihood for the log yt is n 1 n − log 2π − log σ 2 − 2 (log y − Xβ)> (log y − Xβ) . 2 2 2σ Using the Jacobian transformation of log yt , we establish the link between f (yt ) and f (log yt ), where d log yt f (log yt ) = f (yt ) = f (log yt ) , dyt yt which implies lt (yt ) = lt (log yt ) − log yt 2.5. ∗ TRANSFORMATIONS OF THE DEPENDENT VARIABLE 13 Therefore, the loglikelihood function of yt we are seeking is n X n n 1 − log 2π − log σ 2 − 2 (log y − Xβ)> (log y − Xβ) − log yt . 2 2 2σ t=1 And the loglikelihood function concentrated with respect only to σ is ! n n X n n 1X n (log yt − X t β)2 − log yt . − log 2π − − log 2 2 2 n t=1 t=1 ˆ >G ˆ is a (k + 1) × (k + 1) matrix with three distinct elements similar The associated G to those in example 3.1 ˆ >G ˆ is (i) For i, j = 1, ..., k, the ij th element of G 2 X n n X 1 1 ˆ ˆ 2 xti xtj . 
(log yt − X t β)xti = (log yt − X t β) 2 4 σ ˆ σ ˆ t=1 t=1 ˆ >G ˆ is (ii) The (k + 1), (k + 1)th element of G 2 n X 1 1 2 ˆ − + 3 (log yt − X t β) σ ˆ σ ˆ t=1 n n X 2 X 1 n 2 ˆ ˆ 4. − (log y − X β) + (log yt − X t β) = t t 4 6 σ ˆ2 σ ˆ σ ˆ t=1 t=1 ˆ >G ˆ is (iii) The (i, k + 1)th element of G n X 1 1 1 2 ˆ ˆ ti − + 3 (log yt − X t β) (log yt − X t β)x 2 σ ˆ σ ˆ σ ˆ t=1 = − n n X X 1 1 ˆ ˆ 3 xti . (log y − X β)x + (log yt − X t β) t t ti 3 5 σ ˆ σ ˆ t=1 t=1 The Hessian matrix can be obtained in a similar fashion. We need to consider three ˆ the upper left k × k block that corresponds to β, the lower right scalar elements of −H: that corresponds to σ, and the k × 1 vector that corresponds to β and σ. Of course, all ˆ must be replaced by a sample estimates (which means it must be the parameters in −H “hatted”). Chapter 3 Discrete and Limited Dependent Variables This chapter explores two “typical” limits of traditional regression models: when the dependent variable is discrete, or is continuous but is limited in a range. 3.1 Binary Response Models: Estimation A binary dependent variable yt can take on only two values 0 and 1. We encounter these variables a lot in survey data, which the economic agent usually chooses between two alternatives. A binary response model explains the probability that the agent chooses alternative 1 as a function of some observed explanatory variables. Let Pt denote the probability that yt = 1 conditional on the information set Ωt Pt ≡ Pr(yt = 1|Ωt ) = E(yt |Ωt ) = F (X t β) ∈ [0, 1], where X t β is an index function and the CDF function F (x) is also called a transformation function. There are two popular choices for F (x) are ( Rx F (x) = Φ(x) ≡ √12π −∞ exp − 12 X 2 dX probit model F (x) = γ(x) ≡ ex /(1 + ex ) logit model Since the probability that yt = 1 is F (X t β), the loglikelihood function for observation t when yt = 1 is log F (X t β). Therefore the loglikelihood function for y is simply n X l(y, β) = yt log F (X t β) + (1 − yt ) log 1 − F (X t β) , t=1 which can be easily estimated by MLE. The binary response models can deal with models/dataset that is not generally suitable to traditional regression models. For example, there is unobserved, or latent, variable yt◦ . Suppose that yt◦ = X t β + ut , ut ∼ NID(0, 1). We observe only the sign of yt◦ , which determines the value of the observed binary variable yt according to the relationship ( yt = 1 if yt◦ > 0 yt = 0 if yt◦ ≤ 0 14 3.2. BINARY RESPONSE MODELS: INFERENCE 15 However, the binary response models do have drawbacks comparing to traditional regression models. For example, the binary response models are likely to encounter the perfect classifier problem when the sample size is small, and almost all of the yt are equal to 0 or 1. In that case, it is highly possible that we can sort the dependent variables by some linear combination of the independent variables, say X t β • , such that ( yt = 0 whenever X t β • < 0, yt = 1 whenever X t β • > 0. Unfortunately, there is no finite ML estimator exists. 3.2 Binary Response Models: Inference Follow the standard results for ML estimation, we can show that −1 1 > 1/2 ˆ X Υ(β 0 )X , Var plim n (β − β 0 ) = plim n n→∞ n→∞ where f 2 (X t β) Υt (β) ≡ . F (X t β)(1 − F (X t β)) In practice, the covariance matrix estimator is −1 ˆ = X > Υ(β)X ˆ d β) Var( . We can also test restrictions on binary response models simply using LR tests. As usual, the LR test statistic is asymptotically distributed as χ2 (r), where r is the number of restrictions. 
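A minimal sketch of probit estimation by maximum likelihood, together with the covariance matrix estimator (X>Υ(β̂)X)⁻¹ derived above, is given below. It assumes data arrays y and X are available and uses a generic numerical optimizer, so it should be read as an illustration rather than the canonical implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik_probit(beta, y, X):
    """Minus the binary-response loglikelihood with F(x) = Phi(x)."""
    F = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)   # guard against log(0)
    return -np.sum(y * np.log(F) + (1 - y) * np.log(1 - F))

def probit_mle(y, X):
    res = minimize(neg_loglik_probit, np.zeros(X.shape[1]),
                   args=(y, X), method="BFGS")
    beta = res.x
    # Var(beta_hat) = (X' Upsilon(beta_hat) X)^{-1}, Upsilon_t = f^2 / (F (1 - F))
    index = X @ beta
    f, F = norm.pdf(index), norm.cdf(index)
    ups = f**2 / (F * (1 - F))
    cov = np.linalg.inv(X.T @ (X * ups[:, None]))
    return beta, cov
```

Replacing Φ and φ with the logistic CDF and its density gives the corresponding logit estimator.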
Another somewhat easier way to perform a hypothesis testing is to construct the binary response model regression (BRMR) −1/2 Vt −1/2 (β) yt − F (X t β) = Vt (β)f (X t β)X t b + resid, −1/2 where Vt (β) ≡ F (X t β) 1 − F (X t β) and f (x) = ∂F (x)/∂x. The BRMR is a modified version of the GNR. . We partition X as [X 1 , X 2 ] and β as [β 1 .. β 2 ], where β 2 is a r−vector. Suppose we want to test β 2 = 0, we can do this by running the BRMR −1/2 −1/2 ˜ −1/2 ˜ V˜t yt − F˜t = V˜t ft X t1 b1 + V˜t ft X t2 b2 + resid, ˜ = [β˜1 where V˜t , F˜t and f˜t are estimated under the restricted coefficients β testing β 2 = 0 is equivalent to test b2 = 0. The best test statistic to use ESS from above regression , which is asymptotically distributed as χ2 (r). .. . 0]. Therefore, is probably the 16 3.3 CHAPTER 3. DISCRETE AND LIMITED DEPENDENT VARIABLES Models for Multiple Discrete Responses It is quite common that discrete dependent variables can take on more than just two values. Models that deal with multiple discrete responses are referred to as qualitative response models or discrete choice models. We usually divided these models into two groups: ones designed to deal with ordered responses, for example, rate your professor in the course evaluation form; and ones designed to deal with unordered responses, for example, your choice of what-to-do in the reading-week. The most widely-used model for ordered response data is the ordered probit model. The model for the latent variable is yt◦ = X t β + ut , Consider the number of responses is 3 for observed yt and the latent yt◦ is given by y t = 0 yt = 1 yt = 2 ut ∼ NID(0, 1). an example, we assume the relation between the if yt◦ < γ1 if γ1 ≤ yt◦ ≤ γ2 if yt◦ ≥ γ2 where γ1 and γ2 are threshold parameters that must be estimated. The probability that yt = 0 is Pr(yt = 0) = Pr(yt◦ < γ1 ) = Pr(ut < γ1 − X t β) = Φ(γ1 − X t β). similarly, we have Pr(yt = 1) = Φ(γ2 − X t β) − Φ(γ1 − X t β), Pr(yt = 2) = Φ(X t β − γ2 ). Therefore, the loglikelihood function for the ordered probit model is X X log Φ(X t β − γ2 ) log Φ(γ1 − X t β) + l(β, γ1 , γ2 ) = yt =2 yt =0 + X log Φ(γ2 − X t β) − Φ(γ1 − X t β) yt =1 The most popular model to deal with unordered responses is the multinomial logit model or multiple logit model. This model is designed to handle J + 1 responses, for J ≥ 1. The probability that a response j is observed is exp(W tj β j ) Pr(yt = j) = PJ j=0 exp(W tj β j ) for j = 0, ..., J, where W tj is the explanatory variables for response j and β j is usually different for each j = 0, ..., J. The loglikelihood function is simply ! J J n X X X I(yt = j)W tj β j − log exp(W tj β j ) , t=1 j=0 J=0 3.3. MODELS FOR MULTIPLE DISCRETE RESPONSES 17 where I(·) is the indicator function. In practice, however, we often face a situation when the explanatory variables W tj are the same for each j. In that case, we can denote the familiar X t as the explanatory variables for all responses and the probabilities for j = 1, ..., J is simply Pr(yt = j) = exp(X t β j ) PJ 1 + j=1 exp(X t β j ) and for outcome 0, the probability is Pr(yt = 0) = 1 1+ PJ j=1 exp(X t β j ) . The unknown parameters in each vector βj are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of MLE. An important property of the general multinomial logit model is that exp(W tl β l ) Pr(yt = l) = . Pr(yt = j) exp(W tj β j ) for any two responses l and j. This property is called the independence of irrelevant alternatives, or IIA. 
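The following sketch (with made-up coefficient vectors, purely for illustration) computes multinomial logit probabilities for the common-regressor case and verifies numerically that the odds ratio between two responses is unchanged when another alternative is removed from the choice set, which is exactly the IIA property just described.

```python
import numpy as np

def mnl_probs(X, betas):
    """Multinomial logit probabilities with common regressors X (n x k).

    betas: one coefficient vector per response j = 1, ..., J; response 0 is
    the base outcome with coefficients normalized to zero.
    """
    n = X.shape[0]
    scores = np.column_stack([np.zeros(n)] + [X @ b for b in betas])
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

# Illustration of IIA: odds ratios do not depend on the other alternatives.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
betas = [np.array([0.5, -1.0]), np.array([1.0, 0.3]), np.array([-0.2, 0.8])]
P_full = mnl_probs(X, betas)        # outcomes 0, 1, 2, 3
P_sub = mnl_probs(X, betas[:2])     # outcome 3 removed from the choice set
print(np.allclose(P_full[:, 1] / P_full[:, 2],
                  P_sub[:, 1] / P_sub[:, 2]))   # True
```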
Unfortunately, the IIA property is usually incompatible with the data. The logit model that do not possess the IIA property is often denoted as the nested logit model. We can derive the probability of a nested logit model based on the multinomial logit model: • We partition the set of outcomes {0, 1, ..., J} into m disjoint subsets Ai , i = 1, ..., m. • Suppose that the choice among the members of Ai is governed by a standard multinomial logit model exp(W tj β j /θi ) Pr(yt = j|yt ∈ Ai ) = P l l∈Ai exp(W tl β /θi ) where θi is a scale parameter for the parameter vectors β j , j ∈ Ai . • Specifically, we assume that exp(θi hti ) Pr(yt ∈ Ai ) = Pm k=1 exp(θk htk ) where ! hti = log X exp(W tj β j /θi ) . j∈Ai • Finally, by Bayes’ rule, we have Pr(yt = j) = Pr(yt = j|yt ∈ Ai(j) )Pr(yt ∈ Ai(j) ) ! X exp(W tj β j /θi ) = P log exp(W tj β j /θi ) . l l∈Ai exp(W tl β /θi ) j∈Ai 18 3.4 CHAPTER 3. DISCRETE AND LIMITED DEPENDENT VARIABLES Models for Count Data Many economic data are nonnegative integers, for example, the total number of population. Data of this type are called event count data or count data. In probability theory and statistics, the Poisson distribution (or Poisson law of small numbers) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. Therefore, the Poisson distribution model is one of the most popular model to deal with count data: y exp − λt (β) λ − t(β) , y = 0, 1, 2, ... Pr(Yt = y) = y! where λt (β) ≡ exp(X t β). The loglikelihood function is simply l(y, β) = n X (− exp(X t β) + yt X t β − log yt !) t=1 and the associated Hessian matrix is H(β) = − n X > exp(X t β)X > t X t = −X Υ(β)X, t=1 where Υ(β) is an n × n diagonal matrix with typical diagonal element equal to Υt (β) ≡ exp(X t β). In practice, however, the Poisson regression model tends to under predict the variance of the actual data. Such failure is also called overdispersion. If the variance of yt is indeed equal to exp(X t β), the quantity zt (β) ≡ yt − exp(X t β) 2 − yt has expectation 0. To test the overdispersion in the Poisson regression model, we make use the artificial OPG regression: ˆ + cˆ ι = Gb z + resid, ˆ X t and z ˆ = yt − exp(X t β) ˆ 2 − yt . We test c = 0 using t-stat ˆ = yt − exp(X t β) where G following the asymptotic distribution N(0, 1) or we can examine the ESS by χ2 (1). 3.5 Models for Censored and Truncated Data A data sample is said to be truncated if some observations have been systematically excluded from the sample. And a data sample has been censored if some of the information contained in them has been suppressed. 3.6. SAMPLE SELECTIVITY 19 Any dependent variable that has been either censored or truncated is said to be a limited dependent variable. Consider the regression model yt◦ = β1 + β2 xt + ut , ut ∼ NID(0, σ 2 ). What we actually observe is yt , which differs from yt◦ because it is either truncated or censored. Suppose yt is truncated such that all the negative values are omitted. The probability that yt◦ is included in the sample is Pr(yt◦ ≥ 0) = Pr(X t β + ut ≥ 0) = 1 − Pr(ut /σ < −X t β/σ) = Φ(X t β/σ). The density of yt is then σ −1 φ (yt − X t β)/σ . Φ(X t β/σ) This implies that the log likelihood function of yt conditional on yt◦ ≥ 0 is n n X n 1 X 2 l(y, β, σ) = − log(2π) − n log(σ) − 2 (yt − X t β) − log Φ(X t β/σ), 2 2σ t=1 t=1 which can be estimated by MLE. 
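A hedged sketch of this truncated-regression MLE is given below. It parameterizes σ through its logarithm to keep it positive and relies on a generic optimizer, so it is an illustration of the loglikelihood above rather than a production routine; the function names and starting values are my own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik_truncated(params, y, X):
    """Minus the loglikelihood of y_t conditional on y_t* >= 0 (truncation at zero)."""
    beta, sigma = params[:-1], np.exp(params[-1])     # sigma parameterized via its log
    e = y - X @ beta
    cdf = np.clip(norm.cdf(X @ beta / sigma), 1e-300, None)
    ll = (-0.5 * np.log(2 * np.pi) - np.log(sigma)
          - e**2 / (2 * sigma**2) - np.log(cdf))
    return -ll.sum()

def truncated_mle(y, X):
    start = np.append(np.linalg.lstsq(X, y, rcond=None)[0], 0.0)  # OLS start, log(sigma) = 0
    res = minimize(neg_loglik_truncated, start, args=(y, X), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])                          # (beta_hat, sigma_hat)
```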
The most popular model for censored data is the tobit model. Assume we impose the following restriction ( yt = yt◦ if yt◦ > 0, yt = 0 otherwise The loglikelihood function for yt is simply X X 1 log l(y, β, σ) = φ (yt − X t β)/σ + log Φ(−X t β/σ), σ y >0 y =0 t t which is the sum of the logs of probability for the censored observations and the uncensored observations. Fortunately, we can still use MLE, even though the distribution of the dependent variable in a tobit model is a mixture of discrete and continuous random variables. 3.6 Sample Selectivity Many samples are truncated on the basis of another variable that is correlated with the dependent variable. For example, if a student shows up to a tutorial depends on his own desire of knowledge and the TA’s ability, when these two parameters do not meet certain standard, the student will not show up. If we want to evaluate the performance of students in the tutorial, our evaluation will be upward biased because our sample only includes those geniuses who actually go to the tutorial. The consequences of this type of sample selection are often said to be due to sample selectivity. Therefore, to have a fair evaluation of students’ performance, we need to include the parameters of student’s desire and TA’s ability in our model. 20 CHAPTER 3. DISCRETE AND LIMITED DEPENDENT VARIABLES Suppose that yt◦ and zt◦ are two latent variables, generated by the bivariate process ◦ 2 yt X tβ ut ut σ ρσ = + , ∼ NID 0, ρσ 1 zt◦ W tγ vt vt The data is determined following ( yt = yt◦ if zt◦ > 0 yt unobserved otherwise and ( zt = 1 if zt◦ > 0 zt = 0 otherwise The loglikelihood function is X X log Pr(zt = 0) + log Pr(zt = 1)f (yt◦ |zt = 1) , zt =0 zt =1 which can be estimated by ML. Another simpler technique to compute consistent estimates is to use the Heckman’s TwoStep Method, which is based on the following model yt = X t β + ρσvt + et , where the error term ut is divided into two parts: one perfectly correlated with vt an one independent of vt . The idea is to replace vt with its conditional mean E(vt |zt = 1, W t ) = E(vt |vt > −W t γ, W t ) = φ(W t γ) . Φ(W t γ) Therefore, we have yt = X t β + ρσ φ(W t γ) + et , Φ(W t γ) ˆ from The first step is to obtain consistent estimates γ zt◦ = W t γ + vt , using ordinary probit model. Then replace the γ with the estimate yt = X t β + ρσ ˆ) φ(W t γ + et ˆ) Φ(W t γ and obtain the consistent estimates of β using OLS. Of course, the above regression can also be used to test if selectivity is a problem by examining if ρ = 0 or not following ordinary t statistic. 3.7 Duration Models Economists are sometimes interested in how much time elapses before some event occurs. From now on, we use i to index observations and denote ti to measure duration. Suppose 3.7. DURATION MODELS 21 that how long a state endures is measured by T , a nonnegative, continuous random variable with PDF f (t) and CDF F (t). The survivor function is defined as S(t) ≡ 1 − F (t). The probability that a state ends in the period from time t to time t + ∆t is Pr(t < T ≤ t + ∆t) = F (t + ∆t) − F (t). Therefore, the conditional probability that the state survived time t is Pr(t < T ≤ t + ∆t|T ≥ t) = F (t + ∆t) − F (t) . S(t) We divide the RHS term by ∆t. F (t + ∆t) − F (t) . S(t). ∆t As ∆t → 0, the numerator is simply the PDF f (t). We denote such function as the hazard function f (t) f (t) h(t) ≡ = . S(t) 1 − F (t) The function F (t) may take various forms due to different assumptions. 
For example, F (t) can follow the Weibull distribution F (t, θ, α) = 1 − exp − (θt)α . The associated PDF, survivor function and hazard function are simply f (t) = αθα tα−1 exp − (θt)α S(t) = exp − (θt)α h(t) = αθα tα−1 . The loglikelihood function for t, the vector of observations with typical element ti , is just l(t, β, α) = = n X i=1 n X log f (ti |X i , β, α) log h(ti |X i , β, α) + i=1 = n log α + n X log S(ti |X i , β, α) i=1 n X i=1 X i β + (α − 1) n X i=1 We then obtain the estimates by MLE in the usual way. log ti − n X i=1 tαi exp(X i β). Chapter 4 Multivariate Models In this chapter, we discuss models which jointly determine the values of two or more dependent variables using two or more equations. 4.1 Seemingly Unrelated Linear Regressions (SUR) We suppose that there are g dependent variables indexed by i. Each dependent variable is associated with n observations. The ith equation of a multivariate linear regression model can be written as y i = X i β i + ui , E(ui u> i ) = σii I n . To allow correlation between error term in the same period, we further impose the following assumption E(uti utj ) = σij ∀t, E(uti usj ) = 0 ∀t 6= s, where σij is the ij th element of the g × g PSD matrix Σ. We construct an n × g matrix U that contains the error terms ui from each equation, of which a typical row is the 1 × g vector U t . Then, 1 E(U > U ) = Σ. E(U > t U t) = n 4.1.1 MM Estimation We can write the entire SUR system as y • = X • β • + u• , where y• = y1 y2 .. . yg X• = X1 O · · · O X2 · · · .. .. .. . . . O O ··· O O .. . β• = Xg and the OLS estimator for the entire system is simply ˆ OLS = (X > X • )−1 X > y , β • • • • 22 β1 β2 .. . βg u• = u1 u2 .. . ug 4.1. SEEMINGLY UNRELATED LINEAR REGRESSIONS (SUR) if we assume homoskedasticity. The covariance matrix of the vector u• is E(u1 u> E(u1 u> σ11 I n · · · 1 ) ··· g) . . . . .. > .. .. .. E(u• u• ) = = .. . > E(ug u> ) · · · E(u u ) σ I · · · g g g1 n 1 23 σ1g I n .. ≡ Σ . • . σgg I n The matrix Σ• is a symmetric gn × gn matrix, which can be written more compactly as Σ• ≡ Σ ⊗ I n . For the Kronecker product Σ⊗I n , we let each element in Σ times the matrix I n and arrange each block into Σ• . If there is heteroskedasticity, the feasible GLS estimator is −1 F GLS > ˆ −1 ˆ ˆ −1 ⊗ I n )y • , β• = X • (Σ ⊗ I n )X • X> • (Σ where ˆ ≡ 1U ˆ >U ˆ Σ n The covariance matrix is then ˆ = [ˆ ˆ g ]. with U u1 , ..., u −1 ˆ F GLS ) = X > (Σ ˆ −1 ⊗ I n )X • d β . Var( • • 4.1.2 Maximum Likelihood Estimation Let z be a random m-vector with known density fz (z), and let x be another random m−vecor such that z = h(x), then ∂h(x) . fx (x) = fz h(x) det ∂x We rewrite the SUR system y • = X • β • + u• , u• ∼ N (0, Σ ⊗ I n ). Since u• = y • − X • β • , given the pdf of u• is 1 −1 −gn/2 −1/2 > f (u• ) = (2π) |Σ ⊗ I n | exp − (y • − X • β • ) (Σ ⊗ I n )(y • − X • β • ) , 2 we have −gn/2 f (y • ) = (2π) −1/2 |Σ⊗I n | 1 ∂h(y) −1 > . exp − (y • − X • β • ) (Σ ⊗ I n )(y • − X • β • ) det 2 ∂y Note that h(y) = y • − X • β • , therefore, the determinant above equals to 1, if there are no lagged dependent variables in the matrix X • . 24 CHAPTER 4. MULTIVARIATE MODELS Taking the log of the above equation, we have the loglikelihood function l(Σ, β • ) equals to gn n 1 log 2π − log |Σ| − (y • − X • β • )> (Σ−1 ⊗ I n )(y • − X • β • ). 2 2 2 You probably won’t be able to obtain a result, if you only using the above loglikelihood function for the maximum likelihood estimation. 
In fact, many modern software are taking advantage of the following equation, with or without telling the user: − let σij be the ij th element in Σ, we have σij = 1 (y − X i β i )> (y j − X j β j ). n i If we define the n × g matrix U (β • ) to have ith column y i − X i β i , then Σ= 1 > U (β • )U (β • ) n We can replace the Σ in the loglikelihood function with n1 U > (β • )U (β • ), then taking the ˆ M L , we can use it to compute Σ ˆ M L . Consequently, FOC w.r.t β • . Once we obtained the β • ML ˆ we can estimate the covariance matrix of β • ˆ M L ) = X > (Σ ˆ −1 ⊗ I n )X • d β Var( • ML • 4.2 −1 Linear Simultaneous Equations Models The linear simultaneous equations models in this section are indeed multivariate IV models. The ith equation of a linear simultaneous system can be written as y i = X i β i + ui = Z i β 1i + Y i β 2i + ui , where the regressors X i can be partitioned into two parts: the predetermined Z i (n × k1i ) and the endogenous Y i (n × k2i ). Although we can still use a single equation to represent the whole system: y • = X • β • + u• , E(u• u> • ) = Σ ⊗ I n. But we need to keep in mind that X • is correlated with u• . 4.2.1 GMM Estimation We can adopt the GMM estimation to solve the above model. The theoretical moment conditions that lead to the efficient GMM estimator is > −1 ¯ E X Ω (y − Xβ) = 0. (4.1) Let W denote an n × l matrix of exogenous and predetermined variables. This instruments matrix contains all the Z i plus instruments for each Y i . Similar to IV estimation, we 4.2. LINEAR SIMULTANEOUS EQUATIONS MODELS 25 ˆ i = PW X i . We construct a blockpremultiply PW on the regressors X i and define X ˆ ˆ ¯ in (4.1). This allows us diagonal matrix X • , with diagonal blocks the X i , to replace the X to write the estimating equations for efficient GMM estimation ˆ > (Σ−1 ⊗ I n )(y • − X • β • ) = 0. X • (4.2) The above equation looks like equations for GMM estimation. In fact, it is more like a group equations for IV estimation. It becomes straightforward if we rewrite the above equation in the form σ 11 X > σ 1g X > y1 − X 1β1 1 PW · · · 1 PW .. .. .. .. = 0, . . . . yg − X g βg σ g1 X > σ gg X > g PW · · · g PW For a particular element i, we have g X σ ij X > i PW (y j − X j β j ) = 0. j=1 or g X ij X> i σ · PW (y j − X j β j ) = 0. j=1 Therefore, we can rewrite the estimating equation (4.2) as −1 X> ⊗ PW )(y • − X • β • ) = 0. • (Σ The above equation is not feasible unless Σ is known. We need to first obtain a consistent estimate of Σ, then compute the estimate of β • . The procedure can be simplified to the following three steps: (1) We first estimate the individual equations of the system by 2SLS. ˆ 2SLS (2) Then, we use the 2SLS residuals to compute the matrix Σ ˆ >U ˆ, ˆ 2SLS = 1 U Σ n ˆ is an n × g matrix with ith column u ˆ 2SLS where U . i (3) Finally, we compute the β • −1 ˆ 3SLS = X > (Σ ˆ −1 ⊗ PW )X • ˆ −1 β X> • • 2SLS • (Σ2SLS ⊗ PW )y • . This estimator is called the three-stage least squares, or 3SLS, estimator. We can estimate the covariance matrix of the classical 3SLS estimator by −1 ˆ 3SLS ) = X > (Σ ˆ −1 ⊗ PW )X • d β Var( . • • 2SLS 26 CHAPTER 4. MULTIVARIATE MODELS 4.2.2 Structural and Reduced Forms When the system of models are written in the form of y i = X i β i + ui = Z i β 1i + Y i β 2i + ui , (4.3) it is normally the case that each equation has a direct economic interpretation. It is for this reason that these are called structural equations. The full system of equations constitutes what is called the structural form of the model. 
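Before developing the structural and reduced-form notation further, here is a compact sketch of the three-step 3SLS procedure described in the previous subsection. The list-of-arrays interface (y_list, X_list, W) is a hypothetical choice made only to mirror the block structure of the estimating equations; it is not taken from the original notes.

```python
import numpy as np

def three_sls(y_list, X_list, W):
    """Three-step 3SLS for a linear simultaneous system (a sketch).

    y_list: g dependent-variable vectors; X_list: g regressor matrices
    (predetermined plus endogenous columns); W: common instrument matrix.
    """
    n = len(y_list[0])
    PW = W @ np.linalg.solve(W.T @ W, W.T)           # projection on the instruments

    # Step 1: equation-by-equation 2SLS.
    betas = [np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
             for y, X in zip(y_list, X_list)]

    # Step 2: Sigma_hat from the 2SLS residuals.
    U = np.column_stack([y - X @ b for y, X, b in zip(y_list, X_list, betas)])
    Sinv = np.linalg.inv(U.T @ U / n)

    # Step 3: system estimator built from the blocks sigma^{ij} X_i' P_W X_j.
    g = len(y_list)
    A = np.block([[Sinv[i, j] * X_list[i].T @ PW @ X_list[j] for j in range(g)]
                  for i in range(g)])
    b = np.concatenate([sum(Sinv[i, j] * X_list[i].T @ PW @ y_list[j]
                            for j in range(g)) for i in range(g)])
    beta_3sls = np.linalg.solve(A, b)
    cov = np.linalg.inv(A)                           # estimate of Var(beta_3SLS)
    return beta_3sls, cov
```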
Instead of stacking the equations vertically like we did before, we can also stack the equations horizontally. Define n × g matrix Y as [y 1 , y 2 , ..., y g ]. Similarly, we define an n × g matrix U of which each column corresponds to a vector ui of error term. In this notation, the structural form can be represented as Y Γ = W B + U. (4.4) According to the textbook, (4.3) is embodied in (4.4) through the following equation. 1 [y i Y i ] = Z i β 1i + ui . −β 2i The above equation is a bit tricky. Let me explain it in details: (a) First, the endogenous variable Y i in equation i is the dependent variables in some other equation, say j. Therefore, Y i = y j . Therefore, the ith column of the LHS of (4.4) is 0 .. . 1 . 1 Y Γi = [y 1 ... y i ... y j ... y g ] .. = [y i Y i ] −β 2i −β 2i . . . 0 (b) For the ith column of the RHS of (4.4). It is more intuitive if we rewrite W as W = [Z 1 ... Z i ... Z g C], where the instruments W contains all the exogenous variables Z i and some instruments variables C. Therefore, the ith column of the RHS of (4.4) is 0 .. . β 1i W B i + ui = [Z 1 ... Z i ... Z g C] ... + ui = Z i β 1i + ui . 0 0 4.2. LINEAR SIMULTANEOUS EQUATIONS MODELS 27 We can postmultiply both sides of equation (4.4) by Γ−1 to obtain Y = W BΓ−1 + V , where V ≡ U Γ−1 . This representation is called the restricted reduced form or RRF. This is in contrast to the unrestricted reduced form or URF Y = WΠ + V , where we simply consider Π as the RHS coefficients without imposing the restrictions: Π = BΓ−1 . 4.2.3 Maximum Likelihood Estimation For simplicity, we assume the error terms are normally distributed y • = X • β • + u• , u• ∼ N (0, Σ ⊗ I n ). The maximum likelihood estimator of a linear simultaneous system is called the full information maximum likelihood, or FIML, estimator. The loglikelihood function is − n 1 gn log 2π − log |Σ| + n log | det Γ| − (y • − X • β • )> (Σ−1 ⊗ I n )(y • − X • β • ) 2 2 2 As usual, we need the following equation to set up the link between Σ and β • , or, equivalently, B and Γ 1 Σ = (Y Γ − W B)> (Y Γ − W B). n There are many numerical methods for obtaining FIML estimates. One of them is to make use of the artificial regression (Φ> ⊗ I n )(y • − X • β • ) = (Φ> ⊗ I n )X • (B, Γ)b + resid, where ΦΦ> = Σ−1 . We start from initial consistent estimates, then use the above equation to update the estimates of B and Γ. Another approach is to concentrate the loglikelihood function with respect to Σ. The concentrated loglikelihood function can be written as 1 n gn > − (log 2π + 1) + n log | det Γ| − log (Y Γ − XB) (Y Γ − XB) . 2 2 n ˆ M L and Γ ˆ M L directly by minimizing the above equation w.r.t B and Γ. We can obtain B It is also important to test any overidentifying restrictions, in which it is natural to use a LR test. The restricted value of the loglikelihood function is the maximized value of 1 gn n > ˆ − log (Y Γ ˆ − X B) ˆ (Y Γ ˆ − X B) ˆ . LRr = − (log 2π + 1) + n log | det Γ| 2 2 n The unrestricted value is 1 gn n > ˆ ˆ LRu = − (log 2π + 1) − log (Y − W Π) (Y − W Π) , 2 2 n 28 CHAPTER 4. MULTIVARIATE MODELS ˆ denotes the matrix of OLS estimates of the parameters of the URF. The test is where Π simply a 2(LRu − LRr ) ∼ χ2 (gl − k). When a system of equations consists of just one structural equation, we can write it as y = Xβ = Zβ 1 + Y β 2 + u, where β 1 is k1 ×1, β 2 is k2 ×1 and k = k1 +k2 . Since the above equation includes endogenous variables Y , we can compose a complete simultaneous system by using the URF equations: Y = W Π + V = ZΠ1 + W 1 Π2 + V . 
Maximum likelihood estimation of the above equation is called limited-information maximum likelihood or LIML. We can treat LIML as a FIML applied to a system in which only one equation is overidentified. In fact, LIML is a single-equation estimation method. Although its name includes the term maximum likelihood, the most popular way to calculate the coefficients is more like a least squares estimation. Such calculation was investigated by statisticians as early as 1949 by Anderson and Rubin. Anderson and Rubin proposed calculating the coefficients β 2 by minimizing the ratio (y − Y β 2 )> MZ (y − Y β 2 ) . (y − Y β 2 )> MW (y − Y β 2 ) This is also why the LIML estimator is also referred as the least variance ratio estimator. ˆ LIM L following Once we obtain the estimate of κ ˆ , we compute the LIML coefficients β −1 > ˆ LIM L = X > (I − κ X (I − κ ˆ MW )y. β ˆ MW )X κ= A suitable estimate of the covariance matrix of the LIML estimator is −1 ˆ LIM L = σ d β ˆ 2 X > (I − κ , Var ˆ MW )X where 1 ˆ LIM L > y − X β ˆ LIM L . y − Xβ n The one significant virtue of LIML (also FIML) is its invariance to the normalization of the equation. This is a very useful/important property, since simultaneous equations systems can be parameterize in many different ways. Consider the following demand-supply model σ ˆ2 = qt = γd pt + X dt β d + udt , qt = γs pt + X st β s + ust . We can certainly reparameterize the whole system as pt = γd0 qt + X dt β 0d + resid, pt = γs0 qt + X st β 0s + resid. where γd0 = 1/γd , β 0d = −β d /γd , γs0 = 1/γs , β 0s = −β s /γs . (4.5) The invariance property implies that the FIML or LIML estimates bear precisely the same relationship as the true parameters in (4.5). This nice property is not shared by 2SLS, 3SLS or other GMM estimators. Chapter 5 Methods for Stationary Time-Series Data 5.1 AR Process The pth order autoregressive, or AR(p), process can be written as yt = γ + ρ1 yt−1 + ρ2 yt−2 + ... + ρp yt−p + t , t ∼ IID(0, σ2 ), where the process forPt is often referred to as white noise. If we define ut ≡ yt − E(yt ), with E(yt ) = γ/(1 − pi=1 ρi ), the above equation can be then rewritten as ut = p X ρi ut−i + i , i=1 or with the lag operator notation ut = ρ(L)ut + , or as 1 − ρ(L) ut = t . For ut an AR(1) process, the autocovariance matrix 1 ρ ··· 2 1 ··· σ ρ Ω(ρ) = .. .. 2 1−ρ . . ρn−1 ρn−2 · · · is ρn−1 ρn−2 .. . . 1 Applications that involve autoregressions of order greater than two are relatively unusual. Nonetheless, higher-order models can be handled in the same fashion. . Let vi denote the covariance of ut and ut−i , for i = 0, 2, ..., p. The elements of the autocovariance matrix, autocovariances, will obey the Yule-Walker equations v0 = ρ1 v1 + ρ2 v2 + ... + ρp vp + σ2 , v1 = ρ1 v0 + ρ2 v1 + ... + ρp vp−1 , .. . vp = ρ1 vp−1 + ρ2 vp−2 + ... + ρp v0 . 29 30 CHAPTER 5. METHODS FOR STATIONARY TIME-SERIES DATA We solve the above system of linear equations for vi , i = 0, ..., p. We compute the rest of vk with k > p following vk = ρ1 vk−1 + ρ2 vk−2 + ... + ρp vk−p . If there are n observations, the autocovariance matrix is v0 v1 v2 · · · v1 v0 v1 · · · v2 v v ··· 1 0 Ω= .. .. .. . . . vn−1 vn−2 vn−3 · · · simply vn−1 vn−2 vn−3 . .. . v0 If we normalize the above matrix by multiplying a scalar chosen to make the diagonal elements equal to unity, the result is the autocorrelation matrix, and we call its elements the autocorrelations. 
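A small sketch of this calculation follows: it solves the Yule-Walker equations for the autocovariances v_0, ..., v_p, extends them by the recursion for k > p, and assembles the Toeplitz autocovariance matrix Ω. For an AR(1) it reproduces the matrix displayed above, with v_0 = σ²_ε/(1 − ρ²). The function name and interface are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

def ar_autocovariances(rho, sigma2_eps, n):
    """Autocovariance matrix of an AR(p) process via the Yule-Walker equations.

    rho = [rho_1, ..., rho_p]; sigma2_eps is the innovation variance; returns
    the n x n Toeplitz matrix Omega with (i, j) element v_|i-j|.
    """
    p = len(rho)
    M = np.zeros((p + 1, p + 1))           # linear system in v_0, ..., v_p
    rhs = np.zeros(p + 1)
    rhs[0] = sigma2_eps                    # v_0 - sum_i rho_i v_i = sigma2_eps
    for k in range(p + 1):
        M[k, k] += 1.0
        for i, r in enumerate(rho, start=1):
            M[k, abs(k - i)] -= r          # v_k - sum_i rho_i v_|k-i| = 0 for k >= 1
    v = list(np.linalg.solve(M, rhs))
    for k in range(p + 1, n):              # v_k = sum_i rho_i v_{k-i} for k > p
        v.append(sum(r * v[k - i] for i, r in enumerate(rho, start=1)))
    return toeplitz(v[:n])

# AR(1) check: the diagonal elements equal sigma2_eps / (1 - rho^2), as displayed above.
Omega = ar_autocovariances([0.5], 1.0, 5)
```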
There are three popular methods to estimate the AR(p) model: (a) Drop the first p observations and estimate the nonlinear regression model yt = X t β + p X ρi (yt−i − X t−i β) + t i=1 by NLS. (b) Estimate by feasible GLS, possibly iterated. (c) Estimate by the GNR that corresponds to the nonlinear regression with an extra artificial observation corresponding to the first observation. 5.2 MA Process and ARMA Process A q th order moving-average, or MA(q), process can be written as yt = µ + α0 t + α1 t−1 + ... + αq t−q , where E(yt ) = µ. By defining ut ≡ yt − µ, we can write ut = q X αj t−j = 1 + α(L) t . j=0 The autocovariances of an MA process are much easier to calculate. Since the t are white noise, and hence uncorrelated, the variance of the ut is ! q X Var(ut ) = E(u2t ) = σ2 1 + αj2 . j=1 5.2. MA PROCESS AND ARMA PROCESS 31 Similarly, the j th order autocovariance is, for j > 0, Pq−j σ2 (αj + i=1 αj+i αi ) for j < q, E(ut ut−j ) = σ2α for j = q, and j 0 for j > q. The autoregressive moving-average process, or ARMA process is the combination of MA process with AR process. In general, we can write an ARMA(p, q) process with nonzero mean as 1 − ρ(L) yt = γ + 1 − α(L) t . For any known stationary ARMA process, the autocorrelation between ut and ut−j can be easily calculated. For an ARMA process of possibly unknown order, we define the autocorrelation function or ACF expresses the autocorrelation as a function of the lag j for j = 1, 2, ... If we have a sample yt , t = 1, ..., n, the j th order autocorrelation ρ(j) can be estimated using the formula d t , yt−j ) Cov(y . ρˆ(j) = d t) Var(y The autocorrelation function ACF(j) gives the gross correlation between yt and yt−j . But in this setting, we observe, for example, that a correlation between yt and yt−2 could arise primarily because both variables are correlated with yt−1 . Therefore, we might ask what is the correlation between yt and yt−2 net of the intervening effect of yt−1 . We use the partial autocorrelation function, or PACF to characterize such relationship. The partial autocorrelation coefficient of order j is defined as the plim of the least squares estimator of (j) the coefficient ρj in the linear regression (j) (j) yt = γ (j) + ρ1 yt−1 + ... + ρj yt−j + t for j = 1, ..., J. We can calculate the empirical PACF up to order J by running the above (j) regression and retaining only the estimate ρˆj for each j. In practice, it is necessary to allow yt to depend on exogenous explanatory variables. Such models are sometimes referred to as ARMAX models. An ARMAX(p, q) model takes the form yt = X t β + ut , ut ∼ ARMA (p, q), E(ut ) = 0. Let Ω denote the autocovariance matrix of the vector y. Note that Ω is composed by the vi that includes parameters ρi and αj of the ARMA(p, q) process. The log of the joint density of the observed sample is n 1 1 − log 2π − log |Ω| − (y − Xβ)> Ω−1 (y − Xβ). 2 2 2 We can certainly build up our likelihood function based on the above equation and solve it with MLE. However, this seemingly trivial process can be computationally difficult due to the fact that Ω is a n × n matrix and each of its element is a nonlinear combination of two unknown variables. 32 CHAPTER 5. METHODS FOR STATIONARY TIME-SERIES DATA 5.3 Single-Equation Dynamic Models When a dependent variable depends on current and lagged values of xt , but not on lagged values of itself, we have what is called a distributed lag model: q X βj xt−j + ut , ut ∼ IID(0, σ 2 ). yt = δ + j=0 The OLS estimates of the βj may be quite imprecise. 
However, this is not a problem if we are merely interested in the long-run impact of changes in the independent variable q q X X ∂yt βj = γ≡ . ∂x t−j j=0 j=0 One popular alternative to distributed lag models is the partial adjustment model. Suppose that the desired level of an economic variable yt is yt◦ . This desired level is assumed to depend on a vector of exogenous variables X t according to yt◦ = X t β ◦ + et , et ∼ IID(0, σe2 ). The term yt is assumed to adjust toward yt◦ according to the equation yt − yt−1 = (1 − δ)(yt◦ − yt−1 ) + vt , vt ∼ IID(0, σv2 ), where δ is an adjustment parameter that is assumed to be positive and strictly less than 1. Then, we have yt = yt−1 − (1 − δ)yt−1 + (1 − δ)X t β ◦ + (1 − δ)et + vt = X t β + δyt−1 + ut , where β ≡ (1 − δ)β ◦ and ut ≡ (1 − δ)et + vt . Under the assumption that |δ| < 1, we find that ∞ ∞ X X j yt = δ X t−j β + δ j ut−j , j=0 j=0 which implies a particular form of distributed lag. An autoregressive distributed lag or ADL model can be written as p q X X yt = β0 + βi yt−i + γj xt−j + ut , ut ∼ IID(0, σ 2 ). i=1 j=0 This is sometimes called an ADL(p, q) model. A widely encountered case is the ADL(1, 1) model yt = β0 + β1 yt−1 + γ0 xt + γ1 xt−1 + ut . It is straightforward to check that the ADL(1, 1) model can be rewritten as ∆yt = β0 + (β1 − 1)(yt−1 − λxt−1 ) + γ0 ∆xt + ut , γ0 +γ1 . 1−β1 The above equation is called an error-correction model. It expresses the where λ = ADL(1, 1) model in terms of an error-correction mechanism. Due to the reason that ECM is a nonlinear model, we need to adopt NLS to solve this model. ECM is not very popular to deal with stationary time series. However, it is very popular for estimating models with unit root or cointegration. 5.4. SEASONALITY 5.4 33 Seasonality Many economic time series display a regular pattern of seasonal variation over the course of every year. There are two different ways to deal with seasonality in economic data: (a) Try to model it explicitly. For example, seasonal ARMA process: AR(4), ARMA(12, 4)...; seasonal ADL models: ADL(4, q)..., etc. (b) Use seasonally adjusted data, which have been filtered to remove the seasonal variation. Of course, there is severe consequence for doing that. So, don’t do it. 5.5 ARCH, GARCH The concept of autoregressive conditional heteroskedasticity, or ARCH, was introduced by Engle (1982). The basic idea of ARCH models is that the variance of the error term at time t depends on the realized values of the squared error terms in previous time periods. Let ut denotes the error term and Ωt−1 denotes an information set that consists of data observed through period t − 1, an ARCH(q) process can be written as ut = σt t ; σt2 ≡ E(u2t |Ωt−1 ) = α0 + q X αi u2t−i , i=1 where αi > 0 for i = 0, 1, ..., q and t is white noise with variance 1. The above function is clearly autoregressive. Since this function depends on t, the model is also heteroskedastic. Also, the variance of ut is a function of ut−1 through σt , which means the variance of ut is conditional on the past of the process. That is where the term conditional came from. The error term ut and ut−1 are clearly dependent. They are, however, uncorrelated. Thus, ARCH process involves only heteroskedasticity, but not serial correlation. The original ARCH process has not proven to be very satisfactory in applied work. In fact, the ARCH model became famous because of its descendent: the generalized ARCH model, which was proposed by Bollerslev (1986). 
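Before turning to the GARCH generalization, a short simulation (an illustrative sketch with arbitrarily chosen parameter values) shows the two properties of the ARCH(1) process noted above: the levels u_t are essentially uncorrelated, while their squares are clearly autocorrelated.

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha0, alpha1 = 5000, 0.2, 0.5        # illustrative ARCH(1) parameter values

u = np.zeros(n)
sigma2 = np.full(n, alpha0 / (1 - alpha1))     # start at the unconditional variance
for t in range(1, n):
    sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

print(acf1(u))        # approximately zero: the levels are serially uncorrelated
print(acf1(u ** 2))   # clearly positive: the variance depends on the past
```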
We may write a GARCH(p, q) process as
\[
u_t = \sigma_t\varepsilon_t, \qquad
\sigma_t^2 \equiv E(u_t^2 \mid \Omega_{t-1}) = \alpha_0 + \sum_{i=1}^{q}\alpha_i u_{t-i}^2 + \sum_{j=1}^{p}\delta_j\sigma_{t-j}^2 .
\]
The conditional variance can be written more compactly as
\[
\sigma_t^2 = \alpha_0 + \alpha(L)u_t^2 + \delta(L)\sigma_t^2 .
\]
The simplest and by far the most popular GARCH model is the GARCH(1, 1) process, for which the conditional variance is
\[
\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2 .
\]
Unlike the original ARCH model, the GARCH(1, 1) process generally seems to work quite well in practice. More precisely, in many cases GARCH(1, 1) cannot be rejected against more general GARCH(p, q) processes.

There are two possible methods for estimating ARCH and GARCH models:

(1) Feasible GLS: since ARCH and GARCH processes induce heteroskedasticity, it might seem natural to use feasible GLS. However, this approach is very rarely used, because it is not asymptotically efficient. In the GARCH(1, 1) case, \hat{\sigma}_t^2 depends on \hat{u}_{t-1}^2, which in turn depends on the estimates of the regression function. Because of this, estimating the regression function and the skedastic function jointly,
\[
y_t = X_t\beta + u_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2,
\]
yields more efficient estimates.

(2) MLE: the most popular way to estimate GARCH models is to assume that the error terms are normally distributed and use the ML method. To do that, we first write a linear regression model with GARCH errors, defined in terms of a normal innovation process, as
\[
\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)} = \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1).
\]
The density of y_t conditional on \Omega_{t-1} is then
\[
\frac{1}{\sigma_t(\beta, \theta)}\,\phi\!\Bigl(\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)}\Bigr),
\]
where \phi(\cdot) denotes the standard normal density. Therefore, the contribution to the loglikelihood function made by the tth observation is
\[
\ell_t(\beta, \theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma_t^2(\beta, \theta) - \frac{1}{2}\,\frac{(y_t - X_t\beta)^2}{\sigma_t^2(\beta, \theta)} .
\]
This function is not easy to evaluate because of the skedastic function \sigma_t^2(\beta, \theta), which is defined implicitly by the recursion
\[
\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2,
\]
and there are no good starting values for \sigma_{t-1}^2. An ARCH(q) model does not have the lagged \sigma_t^2 term and therefore does not have this problem: we can simply use the first q observations to compute the squared residuals needed to form the skedastic function \sigma_t^2(\beta, \theta). For the starting values of the lagged \sigma_t^2 in a GARCH model, there are several popular ad hoc procedures:

(a) Set all unknown pre-sample values of \hat{u}_t^2 and \sigma_t^2 to zero.

(b) Replace them by an estimate of their common unconditional expectation, either an appropriate function of the \theta parameters or the SSR/n from OLS estimation.

(c) Treat the unknown starting values as extra parameters.

In any case, different procedures can produce very different results. Users of STATA or any other black-box package should therefore know what the package is actually doing. A minimal sketch of the loglikelihood recursion follows.
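The sketch below (my own, not from the notes) evaluates the Gaussian loglikelihood of a linear regression with GARCH(1, 1) errors. The pre-sample conditional variance is set to the sample variance of the residuals, one version of the ad hoc procedure (b) above; all names are illustrative.

```python
import numpy as np

def garch11_loglik(params, y, X):
    """Gaussian loglikelihood of y = X beta + u with GARCH(1,1) errors.

    params = (beta_1, ..., beta_k, alpha0, alpha1, delta1).
    """
    k = X.shape[1]
    beta = np.asarray(params[:k])
    alpha0, alpha1, delta1 = params[k:]
    u = y - X @ beta                      # regression residuals u_t
    n = len(u)
    sigma2 = np.empty(n)
    sigma2[0] = u.var()                   # ad hoc starting value for sigma_0^2
    for t in range(1, n):                 # the recursion defining sigma_t^2
        sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2 + delta1 * sigma2[t - 1]
    # sum of the per-observation contributions l_t(beta, theta)
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + u ** 2 / sigma2)
```

In practice one would maximize this function numerically, for example by passing its negative to scipy.optimize.minimize with sensible starting values and positivity constraints on the variance parameters.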
Chapter 6 Unit Roots and Cointegration

Before we get started, there are several definitions we need to be familiar with:

(1) I(0): as t \to \infty, the first and second moments tend to fixed stationary values, and the covariances of the elements y_t and y_s tend to stationary values that depend only on |t - s|. Such a series is said to be integrated to order zero, or I(0). This is a necessary, but not sufficient, condition for a stationary process. Therefore, all stationary processes are I(0), but not all I(0) processes are stationary.

(2) I(1), or unit root: a nonstationary time series is said to be integrated to order one, or I(1), if the series of its first differences, \Delta y_t \equiv y_t - y_{t-1} = (1 - L)y_t, is I(0). We also say that the I(1) series has a unit root.

(3) I(d): a series is said to be integrated to order d, or I(d), if it must be differenced d times to obtain an I(0) series. Note that if y_t = \varepsilon_t \sim I(0), then \Delta y_t = \varepsilon_t - \varepsilon_{t-1} \sim I(-1). This is what we call an overdifferencing problem.

(4) Cointegration: if two or more series are individually integrated, but some linear combination of them has a lower order of integration, then the series are said to be cointegrated.

(5) Brownian motion: also referred to as the standardized Wiener process. This process, denoted W(r) for 0 \le r \le 1, can be interpreted as the limit of the standardized random walk w_t as the length of each interval becomes infinitesimally small. It is defined as
\[
W(r) \equiv \operatorname*{plim}_{n\to\infty} n^{-1/2}w_{[rn]} = \operatorname*{plim}_{n\to\infty} n^{-1/2}\sum_{t=1}^{[rn]}\varepsilon_t,
\qquad \varepsilon_t \sim \text{IID}(0, 1),
\]
where [rn] means the integer part of the quantity rn.

6.1 Unit Root

A random walk process is defined as
\[
y_t = y_{t-1} + e_t, \qquad y_0 = 0, \qquad e_t \sim \text{IID}(0, \sigma^2).
\]
This process is obviously nonstationary. An obvious generalization is to add a constant term, which gives us the random walk with drift model
\[
y_t = \gamma_1 + y_{t-1} + e_t, \qquad y_0 = 0, \qquad e_t \sim \text{IID}(0, \sigma^2).
\]
The first difference of y_t is \Delta y_t = \gamma_1 + e_t, which is I(0). Therefore y_t is I(1), or, in other words, has a unit root.

There are many methods to test for unit roots. The simplest and most widely used tests are variants of the Dickey-Fuller tests, or DF tests. Consider the model
\[
y_t = \beta y_{t-1} + e_t, \qquad e_t \sim \text{IID}(0, \sigma^2).
\]
When \beta = 1, this model has a unit root. If we subtract y_{t-1} from both sides, we obtain
\[
\Delta y_t = (\beta - 1)y_{t-1} + e_t .
\]
The obvious way to test the unit root hypothesis is to use the t statistic for the hypothesis \beta - 1 = 0 against the alternative that this quantity is negative. This statistic is usually referred to as a \tau statistic. Another possible test statistic is n times the OLS estimate of \beta - 1; this statistic is called a z statistic. If we wish to test for a unit root in a model where the random walk has a drift, the appropriate test regression is
\[
\Delta y_t = \gamma_0 + \gamma_1 t + (\beta - 1)y_{t-1} + e_t,
\]
and if we wish to test for a unit root where the random walk has both a drift and a trend, the appropriate test regression is
\[
\Delta y_t = \gamma_0 + \gamma_1 t + \gamma_2 t^2 + (\beta - 1)y_{t-1} + e_t .
\]
The asymptotic distributions of the Dickey-Fuller test statistics are referred to as nonstandard distributions or as Dickey-Fuller distributions. We take one of the simplest random walk models as an example. For the standardized random walk
\[
w_t = w_{t-1} + \varepsilon_t, \qquad w_0 = 0, \qquad \varepsilon_t \sim \text{IID}(0, 1),
\]
the associated z statistic is
\[
z_{nc} = \frac{n^{-1}\sum_{t=2}^{n} w_{t-1}\varepsilon_t}{n^{-2}\sum_{t=1}^{n-1} w_t^2}
       = n\,\frac{\sum_{t=2}^{n} w_{t-1}\varepsilon_t}{\sum_{t=1}^{n-1} w_t^2}. \tag{6.1}
\]

(a) For the numerator of (6.1): since
\[
\sum_{t=1}^{n} w_t^2 = \sum_{t=0}^{n-1}\bigl(w_t + (w_{t+1} - w_t)\bigr)^2
= \sum_{t=0}^{n-1} w_t^2 + 2\sum_{t=0}^{n-1} w_t\varepsilon_{t+1} + \sum_{t=0}^{n-1}\varepsilon_{t+1}^2,
\]
it follows that
\[
\sum_{t=0}^{n-1} w_t\varepsilon_{t+1} = \frac{1}{2}\Bigl(w_n^2 - \sum_{t=0}^{n-1}\varepsilon_{t+1}^2\Bigr).
\]
Therefore,
\[
\operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{t=2}^{n} w_{t-1}\varepsilon_t
= \operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1} w_t\varepsilon_{t+1}
= \frac{1}{2}\bigl(W^2(1) - 1\bigr).
\]

(b) For the denominator of (6.1): since
\[
n^{-2}\sum_{t=1}^{n-1} w_t^2 \overset{a}{=} \frac{1}{n}\sum_{t=1}^{n-1} W^2\!\Bigl(\frac{t}{n}\Bigr),
\qquad\text{and}\qquad
\int_0^1 f(x)\,dx \equiv \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} f\!\Bigl(\frac{t}{n}\Bigr),
\]
we have
\[
\operatorname*{plim}_{n\to\infty} n^{-2}\sum_{t=1}^{n-1} w_t^2 = \int_0^1 W^2(r)\,dr .
\]
These results imply that
\[
\operatorname*{plim}_{n\to\infty} z_{nc} = \frac{\tfrac{1}{2}\bigl(W^2(1) - 1\bigr)}{\int_0^1 W^2(r)\,dr}.
\]
In practice, we need to simulate this distribution, since there is no simple analytical expression for it; a simulation sketch is given below.
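A rough simulation sketch (my own, not from the notes) of the Dickey-Fuller distribution of z_{nc} under a driftless random walk. The sample size and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20_000
z_nc = np.empty(reps)
for r in range(reps):
    eps = rng.normal(size=n)
    w = np.cumsum(eps)                 # w_t = w_{t-1} + eps_t, with w_0 = 0
    num = np.sum(w[:-1] * eps[1:])     # sum over t of w_{t-1} * eps_t
    den = np.sum(w[:-1] ** 2)          # sum over t of w_{t-1}^2
    z_nc[r] = n * num / den

# Simulated left-tail quantiles; they should be close to the tabulated
# Dickey-Fuller z critical values (roughly -13.8, -8.1 and -5.7 at 1%, 5%, 10%).
print(np.percentile(z_nc, [1, 5, 10]).round(2))
```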
The models we have seen so far do not include any economic variables beyond y_{t-1}. If the error terms are serially correlated, the DF tests are no longer asymptotically valid. The most popular approach in that case is to use the so-called augmented Dickey-Fuller, or ADF, tests. For the unit root test regression
\[
\Delta y_t = X_t\gamma^\circ + (\beta - 1)y_{t-1} + u_t,
\]
suppose that the error term u_t follows an AR(1) process, u_t = \rho_1 u_{t-1} + e_t, with e_t white noise. Then the test regression becomes
\[
\Delta y_t = X_t\gamma^\circ - \rho_1 X_{t-1}\gamma^\circ + (\rho_1 + \beta - 1)y_{t-1} - \beta\rho_1 y_{t-2} + e_t
= X_t\gamma + (\beta - 1)(1 - \rho_1)y_{t-1} + \delta_1\Delta y_{t-1} + e_t
= X_t\gamma + \pi y_{t-1} + \delta_1\Delta y_{t-1} + e_t,
\]
where we define \pi \equiv (\beta - 1)(1 - \rho_1) and \delta_1 \equiv \beta\rho_1. The \tau statistic is simply the t statistic for \pi = 0. The z statistic is a little trickier: since \rho_1 \in (-1, 1), to test for a unit root we only need to test whether \beta - 1 = \pi/(1 - \rho_1) = 0. Therefore, a valid ADF z statistic is n\hat{\pi}/(1 - \hat{\rho}_1), where \hat{\rho}_1 can be taken to be the estimated coefficient on \Delta y_{t-1}, since \delta_1 = \beta\rho_1 equals \rho_1 under the null.

6.2 Cointegration

We begin by considering the simplest case, namely a VAR(1) model with just two variables. The model can be written as
\[
\begin{aligned}
y_{t1} &= \phi_{11}y_{t-1,1} + \phi_{12}y_{t-1,2} + u_{t1},\\
y_{t2} &= \phi_{21}y_{t-1,1} + \phi_{22}y_{t-1,2} + u_{t2},
\end{aligned}
\qquad u_t = (u_{t1}, u_{t2}) \sim \text{IID}(0, \Omega).
\]
Let z_t and u_t be 2-vectors and \Phi a 2 \times 2 matrix. With the obvious definitions of z_t, u_t, and \Phi, the above equations can be written as
\[
z_t = \Phi z_{t-1} + u_t, \qquad u_t \sim \text{IID}(0, \Omega).
\]
Both y_{t1} and y_{t2} are I(1) if at least one of the eigenvalues of the matrix \Phi is equal to 1. The series y_{t1} and y_{t2} are cointegrated if there exists a 2-vector \eta, with elements \eta_1 and \eta_2, such that
\[
\eta^{\top}z_t = \eta_1 y_{t1} - \eta_2 y_{t2}
\]
is I(0). The vector \eta is called a cointegrating vector. The whole system is said to be CI(1, 1), which is short for cointegration in which each individual series is I(1). In practice, we usually allow the relationship between y_{t1} and y_{t2} to change over time by adding a constant term and trend terms,
\[
\eta^{\top}z_t = X_t\gamma + v_t, \tag{6.2}
\]
where X_t denotes a deterministic row vector that may or may not have any elements. As we can see, \eta is not unique. Usually, we normalize \eta by setting its first element to 1. In that case, equation (6.2) can be written as
\[
\begin{bmatrix}1 & -\eta_2\end{bmatrix}
\begin{bmatrix} y_{t1}\\ y_{t2}\end{bmatrix} = X_t\gamma + v_t,
\qquad\text{or}\qquad
y_{t1} = X_t\gamma + \eta_2 y_{t2} + v_t .
\]
The OLS estimator \hat{\eta}_2^{L} from this last regression is known as the levels estimator. It might seem odd to use OLS here, since y_{t2} is correlated with v_t, and v_t is probably serially correlated. In fact, not only is OLS valid here, but the estimator \hat{\eta}_2^{L} is super-consistent. A typical OLS estimator is root-n consistent, in the sense that \hat{\beta}_{OLS} - \beta_0 goes to zero like n^{-1/2} as n \to \infty. The estimator \hat{\eta}_2^{L} is n-consistent, or super-consistent: \hat{\eta}_2^{L} - \eta_2 goes to zero like n^{-1}, a much faster rate.

Assume that the first eigenvalue of \Phi is \lambda_1 = 1 and that the second eigenvalue satisfies |\lambda_2| < 1. Then the whole system can also be represented by an error-correction model,
\[
\Delta y_{t1} = X_t\gamma + \eta_2\Delta y_{t2} + (\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2}) + \text{resid}, \tag{6.3}
\]
where \lambda_2 is the second eigenvalue of \Phi. (Of course, if we have strong reason to believe that \lambda_2 = 1, it is no longer appropriate to use the ECM representation.) Equation (6.3) is restrictive in that the parameter \eta_2 appears twice. For the purpose of estimation and testing, we usually use the unrestricted equation
\[
\Delta y_{t1} = X_t\gamma + \alpha\Delta y_{t2} + \delta_1 y_{t-1,1} + \delta_2 y_{t-1,2} + \text{resid}.
\]
The ECM estimator is \hat{\eta}_2^{E} \equiv -\hat{\delta}_2/\hat{\delta}_1, computed from the OLS estimates of \delta_1 and \delta_2. A small simulation contrasting the levels and ECM estimators follows.
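A small illustration (my own construction, with a hypothetical CI(1, 1) data-generating process) of the levels estimator and the ECM estimator of \eta_2: y_{t2} is a random walk and y_{t1} equals \eta_2 y_{t2} plus a stationary, serially correlated error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eta2 = 2000, 1.5
y2 = np.cumsum(rng.normal(size=n))           # I(1) common trend
v = np.zeros(n)
for t in range(1, n):
    v[t] = 0.5 * v[t - 1] + rng.normal()     # stationary, serially correlated error
y1 = eta2 * y2 + v

# Levels estimator: OLS of y1 on a constant and y2
Xl = np.column_stack([np.ones(n), y2])
eta2_levels = np.linalg.lstsq(Xl, y1, rcond=None)[0][1]

# ECM estimator: OLS of dy1 on constant, dy2, y1_{t-1}, y2_{t-1}; eta2_hat = -d2/d1
dy1, dy2 = np.diff(y1), np.diff(y2)
Xe = np.column_stack([np.ones(n - 1), dy2, y1[:-1], y2[:-1]])
c0, a, d1, d2 = np.linalg.lstsq(Xe, dy1, rcond=None)[0]
eta2_ecm = -d2 / d1

print("levels estimate:", round(eta2_levels, 4), " ECM estimate:", round(eta2_ecm, 4))
```

Rerunning this with larger n shows both estimates converging to the true value very quickly, which is the super-consistency discussed above.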
6.2.1 VAR Representation

The methods we have discussed so far for estimating cointegrating vectors are all single-equation methods. When there is more than one cointegrating relation among a set of more than two variables, we need to adopt the VAR representation. Consider the VAR
\[
Y_t = X_tB + \sum_{i=1}^{p+1}Y_{t-i}\Phi_i + U_t,
\]
where Y_t is a 1 \times g vector of observations on the levels of a set of variables, each of which is assumed to be I(1), X_t is a row vector of deterministic terms, B is a matrix of coefficients on those deterministic terms, U_t is a 1 \times g vector of error terms, and the \Phi_i are g \times g matrices of coefficients. We can reparameterize this equation as
\[
\Delta Y_t = X_tB + Y_{t-1}\Pi + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t, \tag{6.4}
\]
where
\[
\Gamma_p = -\Phi_{p+1}, \qquad \Gamma_i = \Gamma_{i+1} - \Phi_{i+1}, \qquad \Pi = \sum_{i=1}^{p+1}\Phi_i - I_g .
\]
Suppose that the matrix \Pi has rank r, with 0 \le r < g. In this case, we can always write \Pi = \eta\alpha^{\top}, where \eta and \alpha are both g \times r matrices. Therefore, (6.4) can be rewritten as
\[
\Delta Y_t = X_tB + Y_{t-1}\eta\alpha^{\top} + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t .
\]
It is easy to see that not all the elements of \eta and \alpha can be identified. One convenient normalization is to set the first element of \eta to 1. To estimate this nonlinear regression, we can use the GNR, or MLE under the assumption that the error terms are normal.

6.2.2 Testing for Cointegration

There are three popular tests we can use to test for cointegration. To make our lives easier, we consider a CI(1, 1) model.

(a) Engle-Granger tests: the idea is to estimate the cointegrating regression
\[
y_{t1} = X_t\gamma + y_{t2}\eta_2 + v_t, \tag{6.5}
\]
where Y_t \equiv [y_{t1}\;\; y_{t2}] and \eta = [1 \;\vdots\; -\eta_2], and then to test the resulting estimates of v_t with a Dickey-Fuller test. If there is no cointegration, v_t is I(1); if there is cointegration, v_t is I(0). The augmented EG test is performed in almost exactly the same way as an augmented DF test, by running the regression
\[
\Delta\hat{v}_t = X_t\gamma + \tilde{\beta}\hat{v}_{t-1} + \sum_{j=1}^{p}\delta_j\Delta\hat{v}_{t-j} + e_t,
\]
where p is chosen to remove any evidence of serial correlation in the residuals, and the series \hat{v}_t are the estimated residuals from the EG cointegrating regression (6.5).

(b) ECM tests: a second way to test for cointegration involves the estimation of an error-correction model. For the same CI(1, 1) model, we have the restricted ECM
\[
\Delta y_{t1} = X_t\gamma + \eta_2\Delta y_{t2} + (\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2}) + \text{resid}.
\]
If there is no cointegration, the second eigenvalue \lambda_2 must equal 1 (there is cointegration if |\lambda_2| < 1). This means that, in the unrestricted version
\[
\Delta y_{t1} = X_t\gamma + \alpha\Delta y_{t2} + \delta_1 y_{t-1,1} + \delta_2 y_{t-1,2} + \text{resid},
\]
the coefficients \delta_1 and \delta_2 should both be zero. In practice, however, we do not need to test both restrictions at the same time; we can simply test whether \delta_1 = 0 with a t statistic. The distribution of this statistic is certainly non-normal, because of the unit root. If Y_t is a 1 \times g vector, Ericsson and MacKinnon (2002) call the relevant distribution the \kappa_d(g) distribution; see that paper for further details.

(c) VAR test: as its name indicates, this test is based on a vector autoregression representation. The idea is to estimate the VAR
\[
\Delta Y_t = X_tB + Y_{t-1}\Pi + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t,
\]
subject to the constraint \Pi = \eta\alpha^{\top}, for various values of the rank r of the impact matrix \Pi, using ML estimation. Because of this design, we can test a null hypothesis of any number of cointegrating relations from 0 to g - 1 against alternatives with a greater number of relations, up to a maximum of g. The most convenient test statistics are LR statistics.
In practice, we can perform the VAR LR test according to the following steps:

(1) Regress both \Delta Y_t and Y_{t-1} on the deterministic variables X_t and on the lags \Delta Y_{t-1}, \Delta Y_{t-2}, ..., \Delta Y_{t-p}. This yields two sets of residuals,
\[
\hat{V}_{t1} = \Delta Y_t - \Delta\hat{Y}_t, \qquad \hat{V}_{t2} = Y_{t-1} - \hat{Y}_{t-1},
\]
where \hat{V}_{t1} and \hat{V}_{t2} are both 1 \times g vectors and \Delta\hat{Y}_t and \hat{Y}_{t-1} denote the fitted values from the two regressions.

(2) Compute the g \times g sample covariance matrices
\[
\hat{\Sigma}_{jl} = \frac{1}{n}\sum_{t=1}^{n}\hat{V}_{tj}^{\top}\hat{V}_{tl}, \qquad j = 1, 2, \quad l = 1, 2.
\]

(3) Then find the solutions \lambda_i and z_i, for i = 1, ..., g, of the equations
\[
\bigl(\lambda_i\hat{\Sigma}_{22} - \hat{\Sigma}_{21}\hat{\Sigma}_{11}^{-1}\hat{\Sigma}_{12}\bigr)z_i = 0,
\]
where \lambda_i is a scalar and z_i is a g \times 1 vector. This is similar to solving for eigenvalues and eigenvectors. We first compute \hat{\Psi}_{22} such that
\[
\hat{\Psi}_{22}\hat{\Psi}_{22}^{\top} = \hat{\Sigma}_{22}^{-1},
\]
and then find the eigenvalues of the PSD matrix A, where
\[
A \equiv \hat{\Psi}_{22}^{\top}\hat{\Sigma}_{21}\hat{\Sigma}_{11}^{-1}\hat{\Sigma}_{12}\hat{\Psi}_{22}.
\]
The eigenvalues of A are the \lambda_i we are looking for. We sort the eigenvalues \lambda_i from largest to smallest.

(4) It can be shown that the maximized loglikelihood function for the restricted model with rank r is
\[
-\frac{gn}{2}(\log 2\pi + 1) - \frac{n}{2}\sum_{i=1}^{r}\log(1 - \lambda_i).
\]

(5) Therefore, to test the null hypothesis that r = r_1 against the alternative that r = r_2, for r_1 < r_2 \le g, we compute the LR statistic
\[
-n\sum_{i=r_1+1}^{r_2}\log(1 - \lambda_i),
\]
which, as usual in this context, follows a nonstandard distribution.

A sketch of the eigenvalue computation in steps (2) and (3) is given below.
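The compact sketch below (my own, illustrative only) carries out steps (2) through (5) above, taking as given the two n x g residual matrices V1 and V2 produced in step (1); the function name and arguments are hypothetical.

```python
import numpy as np

def var_lr_statistic(V1, V2, r1, r2):
    """Eigenvalues and the LR statistic for H0: rank = r1 vs H1: rank = r2."""
    n = V1.shape[0]
    S11 = V1.T @ V1 / n                    # Sigma_11
    S22 = V2.T @ V2 / n                    # Sigma_22
    S12 = V1.T @ V2 / n                    # Sigma_12, with Sigma_21 = S12.T
    # Solve (lambda * S22 - S21 S11^{-1} S12) z = 0 via the symmetric matrix
    # A = Psi' S21 S11^{-1} S12 Psi, where Psi Psi' = S22^{-1}.
    Psi = np.linalg.cholesky(np.linalg.inv(S22))
    A = Psi.T @ S12.T @ np.linalg.solve(S11, S12) @ Psi
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]     # largest first
    lr = -n * np.sum(np.log(1.0 - lam[r1:r2]))     # -n * sum_{i=r1+1}^{r2} log(1-lambda_i)
    return lam, lr
```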
Chapter 7 Testing the Specification of Econometric Models

Estimating a misspecified regression model generally yields biased and inconsistent parameter estimates. In this chapter, we discuss a number of procedures that are designed for testing the specification of econometric models.

7.1 Specification Tests Based on Artificial Regressions

Let M denote a model, \mu denote a DGP that belongs to that model, and plim_{\mu} denote a probability limit taken under the DGP \mu.

7.1.1 RESET Test

One of the oldest specification tests for linear regression models, but one that is still widely used, is the regression specification error test, or RESET test. The idea is to test the null hypothesis that
\[
y_t = X_t\beta + u_t, \qquad u_t \sim \text{IID}(0, \sigma^2),
\]
where the explanatory variables X_t are predetermined with respect to the error terms u_t, against the alternative that E(y_t \mid X_t) is a nonlinear function of the elements of X_t. The simplest version of RESET involves regressing y_t on X_t to obtain fitted values X_t\hat{\beta}, and then running the regression
\[
y_t = X_t\beta + \gamma(X_t\hat{\beta})^2 + u_t .
\]
The test statistic is the ordinary t statistic for \gamma = 0.

7.1.2 Conditional Moment Tests

If a model M is correctly specified, many random quantities that are functions of the dependent variables should have expectation zero. Often, these expectations are taken conditional on some information set. Let m_t(y_t, \theta) be a moment function, where \theta is the parameter vector and y_t is the dependent variable. Then m_t(y_t, \hat{\theta}) is referred to as an empirical moment. The idea of a conditional moment test is to test whether the quantity
\[
\bar{m}(y, \hat{\theta}) \equiv \frac{1}{n}\sum_{t=1}^{n} m_t(y_t, \hat{\theta})
\]
is significantly different from zero.

7.1.3 Information Matrix Tests

For a model estimated by ML with parameter vector \theta, the asymptotic information matrix is equal to minus the asymptotic Hessian. Therefore, we should expect that, in general, the information matrix equality does not hold when the model we are estimating is misspecified. The null hypothesis for the IM test is that
\[
\operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\Bigl(\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j} + \frac{\partial\ell_t(\theta)}{\partial\theta_i}\,\frac{\partial\ell_t(\theta)}{\partial\theta_j}\Bigr) = 0,
\]
for i = 1, ..., k and j = 1, ..., i. We can calculate IM test statistics by means of the OPG regression
\[
\iota = G(\theta)b + M(\theta)c + \text{resid}.
\]
The matrix M(\theta) is constructed as an n \times \tfrac{1}{2}k(k+1) matrix with typical element
\[
\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j} + \frac{\partial\ell_t(\theta)}{\partial\theta_i}\,\frac{\partial\ell_t(\theta)}{\partial\theta_j}.
\]
The test statistic is then the ESS from the above regression, evaluated at \hat{\theta}. If the matrix [G(\hat{\theta}), M(\hat{\theta})] has full rank, this test statistic is asymptotically distributed as \chi^2\bigl(\tfrac{1}{2}k(k+1)\bigr). If it does not have full rank, one or more columns of M(\hat{\theta}) have to be dropped, and the number of degrees of freedom is reduced accordingly.

7.2 Nonnested Hypothesis Tests

Two models are said to be nonnested if neither model can be written as a special case of the other without imposing restrictions on both models. We can write the two models as
\[
H_1: \; y = X\beta + u_1, \qquad H_2: \; y = Z\gamma + u_2 .
\]
The simplest and most widely used nonnested hypothesis tests start from the artificial comprehensive model
\[
y = (1 - \alpha)X\beta + \alpha Z\gamma + u, \tag{7.1}
\]
where \alpha is a scalar parameter. Ideally, we would test whether \alpha = 1 or 0 to determine whether H_1 or H_2 represents the true DGP. However, this is not possible, because at least one of the parameters cannot be identified.

7.2.1 J Tests

One popular way to make (7.1) identified is to replace the unknown vector \gamma by its estimate. This idea was first suggested by Davidson and MacKinnon (1981). Equation (7.1) thus becomes
\[
y = X\beta + \alpha Z\hat{\gamma} + u = X\beta + \alpha P_Z y + u .
\]
The ordinary t statistic for \alpha = 0 is called the J statistic; it is asymptotically distributed as N(0, 1) under the null hypothesis H_1. The J test tends to overreject, often quite severely, when at least one of the following conditions holds:
• the sample size is small;
• the model under test does not fit very well;
• the number of regressors in H_2 that do not appear in H_1 is large.
There are several things we can do to improve its finite-sample performance. We can use a fully parametric or semiparametric bootstrap method, or we can adopt the JA test of Fisher and McAleer (1981). The JA test regression is
\[
y = X\beta + \alpha Z\tilde{\gamma} + u = X\beta + \alpha P_Z P_X y + u,
\]
and the JA statistic is the t statistic for \alpha = 0.

7.2.2 P Tests

The J test can be extended to nonlinear regression models. Suppose the two models are
\[
H_1: \; y = x(\beta) + u_1, \qquad H_2: \; y = z(\gamma) + u_2 .
\]
The J statistic is the t statistic for \alpha = 0 in the nonlinear regression
\[
y = (1 - \alpha)x(\beta) + \alpha\hat{z} + \text{resid},
\]
where \hat{z} \equiv z(\hat{\gamma}), with \hat{\gamma} the vector of NLS estimates of the regression model H_2. We can, as usual, run the GNR that corresponds to the above equation, evaluated at \alpha = 0 and \beta = \hat{\beta}. This GNR is
\[
y - \hat{x} = \hat{X}b + a(\hat{z} - \hat{x}) + \text{resid},
\]
where \hat{x} = x(\hat{\beta}) and \hat{X} = X(\hat{\beta}) is the matrix of derivatives of x(\beta) with respect to \beta, evaluated at the NLS estimates \hat{\beta}. The ordinary t statistic for a = 0 is called the P statistic.

7.2.3 Cox Tests

If the two nonnested models are estimated by ML, with loglikelihood functions
\[
\ell_1(\theta_1) = \sum_{t=1}^{n}\ell_{1t}(\theta_1) \qquad\text{and}\qquad \ell_2(\theta_2) = \sum_{t=1}^{n}\ell_{2t}(\theta_2)
\]
for models H_1 and H_2, respectively, we can use another approach, called Cox tests. Cox's idea was to extend the idea of a likelihood ratio test. Cox's test statistic is
\[
T_1 \equiv 2n^{-1/2}\bigl(\ell_2(\hat{\theta}_2) - \ell_1(\hat{\theta}_1)\bigr) - 2n^{-1/2}\hat{E}_{\theta_1}\bigl(\ell_2(\hat{\theta}_2) - \ell_1(\hat{\theta}_1)\bigr),
\]
which follows a normal distribution asymptotically. A sketch of the J test in the linear case is given below.
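Returning to the linear case, the following is a minimal sketch (assuming the two nonnested regressor matrices X and Z have already been constructed; the helper name is illustrative) of the J test of H_1 against H_2.

```python
import numpy as np

def j_test(y, X, Z):
    """t statistic for alpha = 0 in y = X beta + alpha * P_Z y + u (the J statistic)."""
    n = len(y)
    Pz_y = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # fitted values from H2
    W = np.column_stack([X, Pz_y])                    # augmented (artificial) regression
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    resid = y - W @ coef
    s2 = resid @ resid / (n - W.shape[1])             # OLS error variance estimate
    cov = s2 * np.linalg.inv(W.T @ W)
    return coef[-1] / np.sqrt(cov[-1, -1])
```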
7.3 Model Selection Based on Information Criteria

The true model is rarely available to economists, owing to the complexity of economic processes. Economists therefore formulate approximating models to capture the effects or factors supported by the empirical data. However, different approximating models usually lead to different empirical results, which gives rise to the so-called model uncertainty. One of the two popular ways to deal with model uncertainty is model selection. Model selection is a procedure for selecting the best model from a set of approximating models. Such a procedure generally involves calculating a criterion function for every approximating model and ranking the models accordingly.

One of the most widely used criterion functions is the Akaike information criterion (AIC), proposed by Akaike (1973). The simplest version of AIC is
\[
\text{AIC}_i = \ell_i(\hat{\beta}_i) - k_i .
\]
We then choose the model that maximizes AIC_i. A popular alternative to AIC is the Bayesian information criterion (BIC) of Schwarz (1978), which takes the form
\[
\text{BIC}_i = \ell_i(\hat{\beta}_i) - \tfrac{1}{2}k_i\log n .
\]
BIC has a stronger penalty for complexity. There are other methods based on various criteria, including the Mallows criterion (Mallows' C_p) of Mallows (1973), the prediction criterion of Amemiya (1980), and the focused information criterion (FIC) of Claeskens and Hjort (2003).

7.4 Nonparametric Estimation

(For this part, you can also read "Nonparametric Econometrics: Theory and Practice" by Qi Li and Jeffrey Racine.)

Nonparametric regression is a form of regression analysis in which the predictor does not take a predetermined functional form but is constructed from information derived from the data. Nonparametric regression requires larger sample sizes than regression based on parametric models, because the data must supply the model structure as well as the parameter estimates.

7.4.1 Kernel Estimation

One traditional way of estimating a PDF is to form a histogram. Given a sample x_t, t = 1, ..., n, of independent realizations of a random variable X, the empirical distribution function (EDF) at any argument x is
\[
\hat{F}(x) = \frac{1}{n}\sum_{t=1}^{n}\mathbb{I}(x_t \le x).
\]
The indicator function \mathbb{I} is clearly discontinuous, which makes the EDF discontinuous. In both theory and practice, we generally prefer a smooth estimate, for various reasons, for example so that it can be differentiated. For this reason, we replace \mathbb{I} with a continuous CDF, K(z), with mean 0. This function is called a cumulative kernel. It is also convenient to be able to control the degree of smoothness of the estimate. Accordingly, we introduce the bandwidth parameter h as a scaling parameter for the smoothing distribution. This gives the kernel CDF estimator
\[
\hat{F}_h(x) = \frac{1}{n}\sum_{t=1}^{n}K\Bigl(\frac{x - x_t}{h}\Bigr). \tag{7.2}
\]
There are many kernels to choose from; a popular one is the standard normal distribution, the so-called Gaussian kernel. If we differentiate equation (7.2) with respect to x, we obtain the kernel density estimator
\[
\hat{f}_h(x) = \frac{1}{nh}\sum_{t=1}^{n}k\Bigl(\frac{x - x_t}{h}\Bigr).
\]
This estimator is very sensitive to the value of the bandwidth h. Two popular choices for h are
\[
h = 1.059\,s\,n^{-1/5} \qquad\text{and}\qquad h = 0.785\,(\hat{q}_{.75} - \hat{q}_{.25})\,n^{-1/5},
\]
where s is the standard deviation of the x_t and \hat{q}_{.75} - \hat{q}_{.25} is the difference between the estimated .75 and .25 quantiles of the data. A short sketch of the estimator follows.
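A short sketch (my own) of the Gaussian kernel density estimator using the standard-deviation-based rule-of-thumb bandwidth quoted above; the function name is illustrative.

```python
import numpy as np

def kernel_density(x_grid, data, h=None):
    """Gaussian kernel density estimate of f(x) at each point of x_grid."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    if h is None:
        h = 1.059 * data.std(ddof=1) * n ** (-0.2)       # rule-of-thumb bandwidth
    z = (np.asarray(x_grid, dtype=float)[:, None] - data[None, :]) / h
    gauss = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel k(z)
    return gauss.sum(axis=1) / (n * h)                   # f_hat_h(x) at each grid point

# Example: estimate the density of a simulated N(0,1) sample at a few points
sample = np.random.default_rng(5).normal(size=500)
print(kernel_density([-2.0, 0.0, 2.0], sample).round(3))
```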
7.4.2 Nonparametric Regression

Nonparametric regression estimates E(y_t \mid x_t) directly, without making any assumption about functional form. Suppose that two random variables Y and X are jointly distributed, and that we wish to estimate the conditional expectation \mu(x) \equiv E(Y \mid x) as a function of x, using a sample of paired observations (y_t, x_t), t = 1, ..., n. For given x, define
\[
g(x) \equiv \int_{-\infty}^{\infty} y f(y, x)\,dy = f(x)\int_{-\infty}^{\infty} y f(y \mid x)\,dy = f(x)\,E(Y \mid x),
\]
where f(x) is the marginal density of X and f(y \mid x) is the density of Y conditional on X = x. Then
\[
\hat{\mu}(x) = \frac{\hat{g}(x)}{\hat{f}(x)} = \int_{-\infty}^{\infty} y\,\frac{\hat{f}(x, y)}{\hat{f}(x)}\,dy .
\]
We use kernel density estimation for the joint density f(x, y) and the marginal density f(x), with a kernel k:
\[
\hat{f}(x, y) = \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)k\Bigl(\frac{y - y_i}{h_y}\Bigr),
\qquad
\hat{f}(x) = \frac{1}{n h}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr).
\]
Therefore,
\[
\int_{-\infty}^{\infty} y\hat{f}(x, y)\,dy
= \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)\int_{-\infty}^{\infty} y\,k\Bigl(\frac{y - y_i}{h_y}\Bigr)dy
= \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)\int_{-\infty}^{\infty}(y_i + h_y v)\,k(v)\,h_y\,dv
= \frac{1}{n h}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)y_i .
\]
Finally, we obtain the so-called Nadaraya-Watson estimator,
\[
\hat{\mu}(x) = \frac{\sum_{t=1}^{n} y_t k_t}{\sum_{t=1}^{n} k_t}, \qquad k_t \equiv k\Bigl(\frac{x - x_t}{h}\Bigr).
\]

Appendix A Sample Assignment and Solution

1. Show that the difference between the matrix (J'W'X)^{-1}J'W'\Omega WJ(X'WJ)^{-1} and the matrix (X'W(W'\Omega W)^{-1}W'X)^{-1} is a positive semidefinite matrix.

Solution: Our goal is to show that
\[
(J^{\top}W^{\top}X)^{-1}J^{\top}W^{\top}\Omega WJ(X^{\top}WJ)^{-1} - \bigl(X^{\top}W(W^{\top}\Omega W)^{-1}W^{\top}X\bigr)^{-1} \quad\text{is psd.} \tag{A.1}
\]
Note that
\[
\bigl((J^{\top}W^{\top}X)^{-1}J^{\top}W^{\top}\Omega WJ(X^{\top}WJ)^{-1}\bigr)^{-1} = (X^{\top}WJ)(J^{\top}W^{\top}\Omega WJ)^{-1}(J^{\top}W^{\top}X).
\]
Hence, by ETM Exercise 3.8, proving that (A.1) is psd is equivalent to proving that
\[
X^{\top}W(W^{\top}\Omega W)^{-1}W^{\top}X - (X^{\top}WJ)(J^{\top}W^{\top}\Omega WJ)^{-1}(J^{\top}W^{\top}X) \quad\text{is psd.} \tag{A.2}
\]
Define P_{\Omega^{1/2}W} = \Omega^{1/2}W(W^{\top}\Omega W)^{-1}W^{\top}(\Omega^{1/2})^{\top}, so that
\[
X^{\top}W(W^{\top}\Omega W)^{-1}W^{\top}X = X^{\top}\Omega^{-1/2}P_{\Omega^{1/2}W}(\Omega^{-1/2})^{\top}X .
\]
Similarly, define P_{\Omega^{1/2}WJ} = \Omega^{1/2}WJ(J^{\top}W^{\top}\Omega WJ)^{-1}J^{\top}W^{\top}(\Omega^{1/2})^{\top}, so that
\[
(X^{\top}WJ)(J^{\top}W^{\top}\Omega WJ)^{-1}(J^{\top}W^{\top}X) = X^{\top}\Omega^{-1/2}P_{\Omega^{1/2}WJ}(\Omega^{-1/2})^{\top}X .
\]
Hence
\[
(A.2) = X^{\top}\Omega^{-1/2}\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)(\Omega^{-1/2})^{\top}X . \tag{A.3}
\]
Note that S(\Omega^{1/2}WJ) \subseteq S(\Omega^{1/2}W), so P_{\Omega^{1/2}W}P_{\Omega^{1/2}WJ} = P_{\Omega^{1/2}WJ}, which implies that
\[
\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)^{\top}\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr) = P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ} .
\]
Finally, letting A \equiv (P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ})(\Omega^{-1/2})^{\top}X, we have (A.3) = A^{\top}A, which is psd. Hence (A.2) = (A.3) is psd, which implies that (A.1) is also psd.

2. Suppose that u_t follows an AR(1) process,
\[
u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{iid}(0, 1), \qquad |\rho| < 1 .
\]
(a) Compute E u_t^2, E u_t u_{t-1}, and E u_t u_{t-2}. Confirm that they do not depend on t.
(b) Define \gamma(h) = E u_t u_{t+h}. Find \sum_{h=-\infty}^{\infty}\gamma(h).

Solution:
(a) We have
\[
u_t = \rho u_{t-1} + \varepsilon_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots .
\]
Since \varepsilon_t \sim \text{iid}(0, 1), the \varepsilon_t are uncorrelated. Hence
\[
E u_t^2 = E(\varepsilon_t^2 + \rho^2\varepsilon_{t-1}^2 + \rho^4\varepsilon_{t-2}^2 + \cdots) = 1 + \rho^2 + \rho^4 + \cdots = \frac{1}{1 - \rho^2}.
\]
Similarly,
\[
E u_t u_{t-1} = E(\rho\varepsilon_{t-1}^2 + \rho^3\varepsilon_{t-2}^2 + \cdots) = \rho + \rho^3 + \rho^5 + \cdots = \frac{\rho}{1 - \rho^2},
\]
and
\[
E u_t u_{t-2} = E(\rho^2\varepsilon_{t-2}^2 + \rho^4\varepsilon_{t-3}^2 + \cdots) = \rho^2 + \rho^4 + \rho^6 + \cdots = \frac{\rho^2}{1 - \rho^2}.
\]
Hence we conclude that none of them depends on t.

(b) Note that E u_t u_{t+1} = E u_{t+1} u_t, so the value of \gamma(h) is symmetric around h = 0, where \gamma(0) = E u_t^2 = 1/(1 - \rho^2). Hence
\[
\sum_{h=-\infty}^{\infty}\gamma(h) = 2\sum_{h=1}^{\infty}\gamma(h) + \gamma(0).
\]
We only need to consider the case h > 0, for which
\[
u_{t+h} = \varepsilon_{t+h} + \rho\varepsilon_{t+h-1} + \cdots + \rho^h\varepsilon_t + \cdots .
\]
Then
\[
\gamma(h) = E u_{t+h}u_t = E(\rho^h\varepsilon_t^2 + \rho^{h+2}\varepsilon_{t-1}^2 + \cdots) = \rho^h + \rho^{h+2} + \cdots = \frac{\rho^h}{1 - \rho^2}.
\]
Hence
\[
\sum_{h=1}^{\infty}\gamma(h) = \frac{1}{1-\rho^2}(\rho + \rho^2 + \rho^3 + \cdots) = \frac{1}{1-\rho^2}\cdot\frac{\rho}{1-\rho} = \frac{\rho}{(1-\rho^2)(1-\rho)}.
\]
Finally,
\[
\sum_{h=-\infty}^{\infty}\gamma(h) = \frac{2\rho}{(1-\rho^2)(1-\rho)} + \frac{1}{1-\rho^2} = \frac{1+\rho}{(1-\rho^2)(1-\rho)} = \frac{1}{(1-\rho)^2}.
\]
The small simulation below checks this result numerically.
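A quick numerical check (my own, with an arbitrary value of \rho) of the result in question 2(b): the sum of all autocovariances of the AR(1) process also equals the limit of Var(n^{-1/2}\sum_t u_t), which is what the simulation estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n, reps = 0.7, 5000, 2000
scaled_sums = np.empty(reps)
for r in range(reps):
    eps = rng.normal(size=n)
    u = np.zeros(n)
    for t in range(1, n):
        u[t] = rho * u[t - 1] + eps[t]     # AR(1) with unit innovation variance
    scaled_sums[r] = u.sum() / np.sqrt(n)  # n^{-1/2} * sum of u_t

print("simulated sum of autocovariances:", round(float(scaled_sums.var()), 3))
print("theoretical 1/(1-rho)^2:         ", round(1.0 / (1 - rho) ** 2, 3))
```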
3. Consider the linear regression model
\[
y_t = \beta x_t + u_t, \qquad t = 1, \ldots, n,
\]
where u_t is generated by u_t = \rho u_{t-1} + \varepsilon_t, \varepsilon_t \sim \text{iid}(0, 1), |\rho| < 1, \beta is a scalar, and x_t = 1 for all t. Let \hat{\beta}_{GMM} be the efficient GMM estimator using the instrument W = (1, 1, \cdots, 1)'. Show that \hat{\beta}_{GMM} is equal to \hat{\beta}_{OLS}, the OLS estimator of \beta.

Solution: We know that the efficient GMM estimator is
\[
\hat{\beta}_{GMM} = \bigl(X^{\top}W(W^{\top}\Omega_0W)^{-1}W^{\top}X\bigr)^{-1}X^{\top}W(W^{\top}\Omega_0W)^{-1}W^{\top}y .
\]
Since W = [1, 1, \cdots, 1]^{\top} = \iota, the quantity
\[
W^{\top}\Omega_0W = \iota^{\top}\Omega_0\,\iota \tag{A.4}
\]
is a scalar (the sum of all the elements of \Omega_0); call it a. Because x_t = 1 for all t, we also have X = \iota, so that X^{\top}W = \iota^{\top}\iota = n and W^{\top}y = \iota^{\top}y. Therefore
\[
\hat{\beta}_{GMM} = (n\,a^{-1}\,n)^{-1}\,n\,a^{-1}\,\iota^{\top}y = \frac{1}{n}\iota^{\top}y = (X^{\top}X)^{-1}X^{\top}y = \hat{\beta}_{OLS}. \tag{A.5}
\]
In other words, both estimators reduce to the sample mean of the y_t.

4. Show that
\[
E\Bigl(\frac{\partial^2\log f(y_t,\theta)}{\partial\theta\,\partial\theta'}\Bigr)
= -E\Bigl(\frac{\partial\log f(y_t,\theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t,\theta)}{\partial\theta'}\Bigr).
\]

Solution: First,
\[
\frac{\partial\log f(y_t,\theta)}{\partial\theta} = \frac{1}{f(y_t,\theta)}\cdot\frac{\partial f(y_t,\theta)}{\partial\theta}.
\]
Then we have
\[
\frac{\partial^2\log f(y_t,\theta)}{\partial\theta\,\partial\theta'}
= \frac{1}{f(y_t,\theta)}\,\frac{\partial^2 f(y_t,\theta)}{\partial\theta\,\partial\theta'}
- \frac{1}{f^2(y_t,\theta)}\,\frac{\partial f(y_t,\theta)}{\partial\theta}\,\frac{\partial f(y_t,\theta)}{\partial\theta'},
\]
which can be rearranged as
\[
\frac{\partial^2\log f(y_t,\theta)}{\partial\theta\,\partial\theta'}
+ \frac{\partial\log f(y_t,\theta)}{\partial\theta}\,\frac{\partial\log f(y_t,\theta)}{\partial\theta'}
= \frac{1}{f(y_t,\theta)}\,\frac{\partial^2 f(y_t,\theta)}{\partial\theta\,\partial\theta'} .
\]
Hence, taking expectations and interchanging differentiation and integration,
\[
E\Bigl(\frac{\partial^2\log f}{\partial\theta\,\partial\theta'} + \frac{\partial\log f}{\partial\theta}\,\frac{\partial\log f}{\partial\theta'}\Bigr)
= \int\frac{1}{f(y_t,\theta)}\,\frac{\partial^2 f(y_t,\theta)}{\partial\theta\,\partial\theta'}\,f(y_t,\theta)\,dy_t
= \frac{\partial^2}{\partial\theta\,\partial\theta'}\int f(y_t,\theta)\,dy_t
= \frac{\partial^2\,1}{\partial\theta\,\partial\theta'} = 0.
\]
Finally,
\[
E\Bigl(\frac{\partial^2\log f(y_t,\theta)}{\partial\theta\,\partial\theta'}\Bigr)
= -E\Bigl(\frac{\partial\log f(y_t,\theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t,\theta)}{\partial\theta'}\Bigr).
\]

5. Show that
\[
\operatorname{Var}\bigl(W(X)\bigr) \ge \frac{\Bigl(\frac{d}{d\theta}E\,W(X)\Bigr)^2}{E\Bigl(\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)^2\Bigr)}.
\]

Solution: First,
\[
E\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)
= \int\frac{1}{f(x\mid\theta)}\,\frac{\partial f(x\mid\theta)}{\partial\theta}\,f(x\mid\theta)\,dx
= \frac{d}{d\theta}\int f(x\mid\theta)\,dx = 0.
\]
Hence we obtain
\[
E\Bigl(\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)^2\Bigr)
= \operatorname{Var}\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr).
\]
By the Cauchy-Schwarz inequality,
\[
\operatorname{Var}\bigl(W(X)\bigr)\cdot\operatorname{Var}\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)
\ge \Bigl(\operatorname{Cov}\Bigl(W(X),\,\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)\Bigr)^2 .
\]
Note that
\[
\operatorname{Cov}\Bigl(W(X),\,\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)
= \int\bigl(W(X) - E\,W(X)\bigr)\,\frac{\partial}{\partial\theta}\log f(x\mid\theta)\,f(x\mid\theta)\,dx
= \int\bigl(W(X) - E\,W(X)\bigr)\,\frac{\partial f(x\mid\theta)}{\partial\theta}\,dx
\]
\[
= \int W(X)\,\frac{\partial f(x\mid\theta)}{\partial\theta}\,dx - E\,W(X)\int\frac{\partial f(x\mid\theta)}{\partial\theta}\,dx
= \frac{d}{d\theta}\int W(X)\,f(x\mid\theta)\,dx - 0
= \frac{d}{d\theta}E\,W(X).
\]
Hence
\[
\operatorname{Var}\bigl(W(X)\bigr)
\ge \frac{\Bigl(\frac{d}{d\theta}E\,W(X)\Bigr)^2}{\operatorname{Var}\bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\bigr)}
= \frac{\Bigl(\frac{d}{d\theta}E\,W(X)\Bigr)^2}{E\Bigl(\Bigl(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\Bigr)^2\Bigr)}.
\]
The information matrix equality from question 4 can also be verified numerically, as sketched below.
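A small Monte Carlo check (my own) of the information matrix equality from question 4, for the simple case f(y | theta) = N(0, theta) with theta the variance, where the score and second derivative have the closed forms used in the code comments.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 2.0
y = rng.normal(0.0, np.sqrt(theta), size=1_000_000)

# d log f / d theta      = -1/(2 theta) + y^2/(2 theta^2)
# d^2 log f / d theta^2  =  1/(2 theta^2) - y^2/theta^3
score = -1 / (2 * theta) + y ** 2 / (2 * theta ** 2)
hess = 1 / (2 * theta ** 2) - y ** 2 / theta ** 3

print("E[second derivative]     :", round(float(hess.mean()), 4))
print("-E[score squared]        :", round(float(-(score ** 2).mean()), 4))
print("theoretical -1/(2 th^2)  :", round(-1 / (2 * theta ** 2), 4))
```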
6. Consider the linear regression models
\[
y = X\beta + \text{resid}, \qquad
y = X\beta + Z\gamma + \text{resid}, \qquad
y = X\beta + M_XZ\gamma + \text{resid},
\]
where y and u are n \times 1, X is n \times k_1, Z is n \times k_2, \beta is k_1 \times 1, and \gamma is k_2 \times 1. Let \hat{\beta}^{(0)} be the OLS estimate of \beta from the first regression, let \hat{\beta}^{(1)} and \hat{\beta}^{(2)} be the OLS estimates of \beta from the second and third regressions, respectively, and let \hat{\gamma}^{(1)} and \hat{\gamma}^{(2)} be the OLS estimates of \gamma from the second and third regressions, respectively.
(a) Show that \hat{\beta}^{(0)} = \hat{\beta}^{(2)} and \hat{\gamma}^{(1)} = \hat{\gamma}^{(2)}.
(b) Show that P_{X, M_XZ}\,y = P_X y + P_{M_XZ}\,y.
(c) Show that S([X, Z]) = S([X, M_XZ]) and conclude that P_{X,Z}\,y = P_X y + P_{M_XZ}\,y for any y \in \mathbb{R}^n.

Solution:
(a) Define H \equiv M_XZ, so that the third regression is y = X\beta^{(2)} + H\gamma^{(2)} + \text{resid}. Using FWL, we obtain
\[
\hat{\beta}^{(2)} = (X^{\top}M_HX)^{-1}X^{\top}M_Hy = (X^{\top}X - X^{\top}P_HX)^{-1}(X^{\top}y - X^{\top}P_Hy).
\]
Note that H = M_XZ implies X^{\top}H = X^{\top}M_XZ = 0, and hence X^{\top}P_H = 0. Therefore
\[
\hat{\beta}^{(2)} = (X^{\top}X)^{-1}X^{\top}y = \hat{\beta}^{(0)}.
\]
To prove \hat{\gamma}^{(1)} = \hat{\gamma}^{(2)}, apply FWL to the second regression:
\[
\hat{\gamma}^{(1)} = (Z^{\top}M_XZ)^{-1}Z^{\top}M_Xy .
\]
Applying FWL to the third regression (partialling out X, and noting that M_XM_XZ = M_XZ), we obtain
\[
\hat{\gamma}^{(2)} = (Z^{\top}M_XZ)^{-1}Z^{\top}M_Xy = \hat{\gamma}^{(1)}.
\]

(b) To prove P_{X,M_XZ}\,y = P_Xy + P_{M_XZ}\,y, it is equivalent to prove P_{X,M_XZ} = P_X + P_{M_XZ}. That means we need to show that
i. S([X, M_XZ]) is invariant under the action of P_X + P_{M_XZ};
ii. S^{\perp}([X, M_XZ]) is annihilated by P_X + P_{M_XZ}.
To prove (i), note that any vector in S([X, M_XZ]) can be expressed as [X, M_XZ]\gamma for some (k_1 + k_2)-vector \gamma. Then
\[
(P_X + P_{M_XZ})[X, M_XZ]\gamma = [(P_X + P_{M_XZ})X,\;(P_X + P_{M_XZ})M_XZ]\gamma = [X + 0,\;0 + M_XZ]\gamma = [X, M_XZ]\gamma,
\]
where we use P_{M_XZ}X = 0 (since X and M_XZ are orthogonal) and P_XM_XZ = 0. Hence S([X, M_XZ]) is invariant under the action of P_X + P_{M_XZ}.
To prove (ii), let w be any element of S^{\perp}([X, M_XZ]). Then [X, M_XZ]^{\top}w = 0, which implies X^{\top}w = 0 and Z^{\top}M_Xw = 0. Premultiplying w by P_X + P_{M_XZ}, we obtain
\[
(P_X + P_{M_XZ})w = X(X^{\top}X)^{-1}X^{\top}w + M_XZ(Z^{\top}M_XZ)^{-1}Z^{\top}M_Xw = 0 + 0 = 0.
\]
Hence S^{\perp}([X, M_XZ]) is annihilated by P_X + P_{M_XZ}. Together, these results give P_{X,M_XZ} = P_X + P_{M_XZ}, and therefore P_{X,M_XZ}\,y = P_Xy + P_{M_XZ}\,y.

(c) To prove P_{X,Z}\,y = P_Xy + P_{M_XZ}\,y, we can use the result from part (b); we only need to show that S([X, Z]) is equivalent to S([X, M_XZ]). Let a be any element of S([X, Z]); then a can always be expressed as a = [X, Z]\gamma = X\gamma_1 + Z\gamma_2 for some k_1 \times 1 vector \gamma_1 and some k_2 \times 1 vector \gamma_2. Rewrite Z\gamma_2 as M_XZ\gamma_2 + P_XZ\gamma_2, and note that S(P_XZ) \subseteq S(X), so the element P_XZ\gamma_2 can always be expressed as X\tilde{\gamma} for some k_1 \times 1 vector \tilde{\gamma}. Then we obtain
\[
a = X\gamma_1 + M_XZ\gamma_2 + X\tilde{\gamma} = X(\gamma_1 + \tilde{\gamma}) + M_XZ\gamma_2
= [X\;\; M_XZ]\begin{bmatrix}\gamma_1 + \tilde{\gamma}\\ \gamma_2\end{bmatrix}.
\]
Hence a \in S([X, M_XZ]), which implies S([X, Z]) \subseteq S([X, M_XZ]). Conversely, let b be any element of S([X, M_XZ]); then b = [X, M_XZ]\delta = X\delta_1 + M_XZ\delta_2 for some k_1 \times 1 vector \delta_1 and some k_2 \times 1 vector \delta_2. Rewrite M_XZ\delta_2 as Z\delta_2 - P_XZ\delta_2, with P_XZ\delta_2 = X\tilde{\delta} for some k_1 \times 1 vector \tilde{\delta}. Then we obtain
\[
b = X\delta_1 + Z\delta_2 - X\tilde{\delta} = X(\delta_1 - \tilde{\delta}) + Z\delta_2
= [X\;\; Z]\begin{bmatrix}\delta_1 - \tilde{\delta}\\ \delta_2\end{bmatrix}.
\]
Hence b \in S([X, Z]), which implies S([X, M_XZ]) \subseteq S([X, Z]). Together with S([X, Z]) \subseteq S([X, M_XZ]), this yields S([X, Z]) = S([X, M_XZ]). Recall that in part (b) we already proved that
i. S([X, M_XZ]) is invariant under the action of P_X + P_{M_XZ};
ii. S^{\perp}([X, M_XZ]) is annihilated by P_X + P_{M_XZ}.
By S([X, Z]) = S([X, M_XZ]), we therefore have
i. S([X, Z]) is invariant under the action of P_X + P_{M_XZ};
ii. S^{\perp}([X, Z]) is annihilated by P_X + P_{M_XZ}.
This implies P_{X,Z} = P_X + P_{M_XZ}, and hence P_{X,Z}\,y = P_Xy + P_{M_XZ}\,y.

7. The file tbrate.data contains data for 1950:1 to 1996:4 for three series: r_t, the interest rate on 90-day Treasury bills; \pi_t, the rate of inflation; and y_t, the logarithm of real GDP. For the period 1950:4 to 1996:4, run the regression
\[
\Delta r_t = \beta_1 + \beta_2\pi_{t-1} + \beta_3\Delta y_{t-1} + \beta_4\Delta r_{t-1} + \beta_5\Delta r_{t-2} + u_t,
\]
where \Delta is the first-difference operator, defined so that \Delta r_t = r_t - r_{t-1}. Plot the residuals and fitted values against time. Then regress the residuals on the fitted values and on a constant. What do you learn from this second regression? Now regress the fitted values on the residuals and on a constant. What do you learn from this third regression? (A Python sketch of these steps is given below.)
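The following is a sketch, not a solution key, of how question 7 could be carried out in Python. It assumes tbrate.data is a whitespace-separated file whose three columns are, in that order, r_t, \pi_t and y_t for 1950:1 to 1996:4; the actual file layout may differ, in which case the loading step needs adjusting.

```python
import numpy as np

r, pi, y = np.loadtxt("tbrate.data", unpack=True)   # assumed column order: r, pi, y

dr = np.diff(r)                                      # Delta r_t, starting at 1950:2
dy = np.diff(y)                                      # Delta y_t, starting at 1950:2
# Align everything so that the estimation sample starts at 1950:4
Y = dr[2:]                                           # Delta r_t
X = np.column_stack([np.ones(len(Y)),
                     pi[2:-1],                       # pi_{t-1}
                     dy[1:-1],                       # Delta y_{t-1}
                     dr[1:-1],                       # Delta r_{t-1}
                     dr[:-2]])                       # Delta r_{t-2}
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
fitted = X @ beta
resid = Y - fitted

# The OLS residuals are orthogonal to the fitted values and sum to zero, so the
# slope coefficients in both auxiliary regressions below should be essentially zero.
print(np.linalg.lstsq(np.column_stack([np.ones(len(Y)), fitted]), resid, rcond=None)[0])
print(np.linalg.lstsq(np.column_stack([np.ones(len(Y)), resid]), fitted, rcond=None)[0])
```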