Sample Lecture Notes for Advanced Graduate Econometrics
by
Tian Xie
Queen’s University
Kingston, Ontario, Canada
December 1, 2012
© Tian Xie 2012
Preface
These are sample lecture notes for advanced econometrics. Students are assumed to have
completed an introductory econometrics course and an intermediate econometrics course, or the
equivalent. The course aims to help students establish a solid background in both theoretical
and empirical econometrics. The notes contain enough material for a two-semester course, or
they can be condensed to fit a one-semester course, depending on the course design. The topics
start at a moderate level of difficulty, which gradually increases as the course proceeds.
I developed these notes while completing my PhD in the economics department at Queen's
University. They are based on the following textbooks:
• “Principles of Econometrics” (3rd Edition) by R. Carter Hill, William Griffiths and
Guay C. Lim
• “Introduction to Econometrics” (3rd Edition) by James H. Stock and Mark W. Watson
• “Econometric Analysis” (6th Edition) by William H. Greene
• “Econometric Theory and Methods” by Russell Davidson and James G. MacKinnon
I used these lecture notes while working as a lecturer, TA, and private tutor for both
undergraduate and graduate students. The notes have received positive feedback and strong
reviews from my students, as documented in the enclosed "2012 Winter Term Evaluation Form".
Contents

1 The Generalized Method of Moments
  1.1 GMM Estimators for Linear Regression Models
  1.2 HAC Covariance Matrix Estimation
  1.3 Tests Based on the GMM Criterion Function
      1.3.1 Tests of Overidentifying Restrictions
      1.3.2 Tests of Linear Restrictions
  1.4 GMM Estimators for Nonlinear Models

2 The Method of Maximum Likelihood
  2.1 Basic Concepts
      2.1.1 Regression Models with Normal Errors
      2.1.2 Computing ML Estimates
  2.2 Asymptotic Properties of ML Estimators
      2.2.1 Consistency of MLE
      2.2.2 Dependent Observations
      2.2.3 The Gradient, the Information Matrix and the Hessian
      2.2.4 Asymptotic Normality of the MLE
  2.3 *The Covariance Matrix of the ML Estimator
      2.3.1 *Example: the Classical Normal Linear Model
  2.4 *Hypothesis Testing
  2.5 *Transformations of the Dependent Variable

3 Discrete and Limited Dependent Variables
  3.1 Binary Response Models: Estimation
  3.2 Binary Response Models: Inference
  3.3 Models for Multiple Discrete Responses
  3.4 Models for Count Data
  3.5 Models for Censored and Truncated Data
  3.6 Sample Selectivity
  3.7 Duration Models

4 Multivariate Models
  4.1 Seemingly Unrelated Linear Regressions (SUR)
      4.1.1 MM Estimation
      4.1.2 Maximum Likelihood Estimation
  4.2 Linear Simultaneous Equations Models
      4.2.1 GMM Estimation
      4.2.2 Structural and Reduced Forms
      4.2.3 Maximum Likelihood Estimation

5 Methods for Stationary Time-Series Data
  5.1 AR Process
  5.2 MA Process and ARMA Process
  5.3 Single-Equation Dynamic Models
  5.4 Seasonality
  5.5 ARCH, GARCH

6 Unit Roots and Cointegration
  6.1 Unit Root
  6.2 Cointegration
      6.2.1 VAR Representation
      6.2.2 Testing for Cointegration

7 Testing the Specification of Econometric Models
  7.1 Specification Tests Based on Artificial Regressions
      7.1.1 RESET Test
      7.1.2 Conditional Moment Tests
      7.1.3 Information Matrix Tests
  7.2 Nonnested Hypothesis Tests
      7.2.1 J Tests
      7.2.2 P Tests
      7.2.3 Cox Tests
  7.3 Model Selection Based on Information Criteria
  7.4 Nonparametric Estimation
      7.4.1 Kernel Estimation
      7.4.2 Nonparametric Regression

A Sample Assignment and Solution
Chapter 1
The Generalized Method of Moments
1.1 GMM Estimators for Linear Regression Models
We can consider GMM as a combination of IV and GLS, in which endogenous regressors
and heteroskedasticity coexist. The linear regression model is given as

    y = Xβ + u,    E(uu⊤) = Ω,                                                (1.1)

where there are n observations. We assume that there exists an n × l matrix of predetermined
instrumental variables W satisfying the condition

    E(u_t | W_t) = 0    for t = 1, ..., n.

As we have seen with IV estimation, we first need to "break" the correlation between X and u
by premultiplying by a linear combination of W⊤, and then "purify" the error terms so that the
covariance matrix of the purified error terms is proportional to I. The sample moment conditions
for (1.1) then become

    J⊤ W⊤ (y − Xβ) = 0,

where J is l × k. The associated estimator is β̂ = (J⊤ W⊤ X)^{-1} J⊤ W⊤ y. We choose J so as
to minimize the covariance matrix of the plim of n^{1/2}(β̂ − β₀), which gives

    J = (W⊤ Ω₀ W)^{-1} W⊤ X,

where Ω₀ is the true value of Ω. Then, the efficient (infeasible) GMM estimator is

    β̂_GMM = ( X⊤ W (W⊤ Ω₀ W)^{-1} W⊤ X )^{-1} X⊤ W (W⊤ Ω₀ W)^{-1} W⊤ y.       (1.2)

Of course, the efficient GMM estimator can also be obtained by minimizing the (efficient) GMM
criterion function

    Q(β, y) ≡ (y − Xβ)⊤ W (W⊤ Ω₀ W)^{-1} W⊤ (y − Xβ).                          (1.3)

However, this criterion is based on the efficient J. Without knowing J in advance, there is no
way to form a criterion function like (1.3); the best we can do is

    (y − Xβ)⊤ W Λ W⊤ (y − Xβ),

which leads to the inefficient GMM estimator

    β̂ = (X⊤ W Λ W⊤ X)^{-1} X⊤ W Λ W⊤ y.
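To make (1.2) and the inefficient estimator above concrete, here is a minimal NumPy sketch. It is not part of the original notes: the function name linear_gmm, the simulated y, X, W and the diagonal Ω₀ are illustrative assumptions only.

import numpy as np

def linear_gmm(y, X, W, middle):
    """Linear GMM estimator beta = (X'W M W'X)^{-1} X'W M W'y, with M = middle^{-1}.
    Passing middle = W' Omega0 W gives the efficient estimator (1.2); any other
    positive definite matrix gives an inefficient but consistent estimator."""
    M = np.linalg.inv(middle)
    XtW = X.T @ W
    A = XtW @ M @ XtW.T          # X' W M W' X
    b = XtW @ M @ (W.T @ y)      # X' W M W' y
    return np.linalg.solve(A, b)

# Simulated example with heteroskedastic errors
rng = np.random.default_rng(0)
n, k, l = 500, 2, 4
W = rng.normal(size=(n, l))
X = W[:, :k] + 0.5 * rng.normal(size=(n, k))
u = rng.normal(size=n) * np.exp(0.5 * W[:, 0])     # heteroskedastic errors
y = X @ np.array([1.0, -0.5]) + u

Omega0 = np.diag(np.exp(W[:, 0]))                  # true (diagonal) error covariance
beta_eff = linear_gmm(y, X, W, W.T @ Omega0 @ W)   # efficient (infeasible) GMM
beta_ineff = linear_gmm(y, X, W, W.T @ W)          # Lambda = (W'W)^{-1}: generalized IV
print(beta_eff, beta_ineff)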
1.2 HAC Covariance Matrix Estimation
We notice that both (1.2) and (1.3) involve an unknown matrix Ω₀. If we replace Ω₀ with
an efficient estimate Ω̂, we obtain the feasible efficient GMM estimator

    β̂_FGMM = ( X⊤ W (W⊤ Ω̂ W)^{-1} W⊤ X )^{-1} X⊤ W (W⊤ Ω̂ W)^{-1} W⊤ y.       (1.4)

The performance of the feasible estimator (1.4) crucially depends on the estimate Ω̂, or
equivalently on W⊤ Ω̂ W. In practice, we should always compute (1/n) W⊤ Ω̂ W instead, for
computational efficiency.
If there is only heteroskedasticity, we can compute (1/n) W⊤ Ω̂ W in a fashion similar to the
HCCMEs introduced in Chapter 5. A typical element of the matrix (1/n) W⊤ Ω̂ W is

    (1/n) Σ_{t=1}^{n} û_t² w_{ti} w_{tj},

where û_t ≡ y_t − X_t β̂ and β̂ is a preliminary estimator, for example the generalized IV
estimator.
The HCCMEs are not appropriate when there is also autocorrelation. In this case, we
need to use a heteroskedasticity and autocorrelation consistent (HAC) estimator.
We define a set of autocovariance matrices to mimic the serial correlation:

    Γ(j) ≡ (1/n) Σ_{t=j+1}^{n} E( u_t u_{t−j} W_t⊤ W_{t−j} )      for j ≥ 0,
    Γ(j) ≡ (1/n) Σ_{t=−j+1}^{n} E( u_{t+j} u_t W_{t+j}⊤ W_t )     for j < 0.

It is easy to show that Γ(j) = Γ⊤(−j). Then, we can write the sample HAC estimator of
(1/n) W⊤ Ω̂ W as

    Σ̂_HW = Γ̂(0) + Σ_{j=1}^{p} ( Γ̂(j) + Γ̂⊤(j) ).                              (1.5)

This estimator is usually referred to as the Hansen-White estimator. The threshold parameter
p, which is also called the lag truncation parameter, is somewhat arbitrary. For the
purposes of asymptotic theory, we need p to go to infinity at some suitable rate as the
sample size goes to infinity, for example n^{1/4}. In practice, when the sample size is finite, we
need to choose a specific value of p.
Another issue with the HW estimator (1.5) is that it is not always positive definite in
finite samples. To remedy this critical deficiency, Newey and West (1987) proposed a modified
HW estimator, later known as the Newey-West estimator:

    Σ̂_NW = Γ̂(0) + Σ_{j=1}^{p} ( 1 − j/(p+1) ) ( Γ̂(j) + Γ̂⊤(j) ).              (1.6)

The NW estimator is a biased estimator of Σ, since it underestimates the autocovariance
matrices. It is still consistent if p increases as n does, and the appropriate rate is n^{1/3}. Note
that Newey and West (1994) proposed a procedure to automatically select p.
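A hedged Python sketch of (1.5)-(1.6) is given below; it assumes that residuals û_t and an instrument matrix W are already available, and the function name newey_west is mine. Setting the Bartlett weight to one would recover the Hansen-White estimator (1.5).

import numpy as np

def newey_west(uhat, W, p):
    """Newey-West estimate of Sigma = (1/n) W' Omega W as in (1.6).
    uhat: n-vector of residuals; W: n x l instrument matrix;
    p: lag truncation parameter (e.g. of order n**(1/3))."""
    n = W.shape[0]
    Z = W * uhat[:, None]                  # row t is u_t * W_t
    def gamma(j):                          # Gamma_hat(j) = (1/n) sum_t Z_t' Z_{t-j}
        return Z[j:].T @ Z[:n - j] / n
    S = gamma(0)
    for j in range(1, p + 1):
        w = 1.0 - j / (p + 1.0)            # Bartlett weight; w = 1 gives the HW estimator
        G = gamma(j)
        S += w * (G + G.T)
    return S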
With an estimated Σ̂ (HW, NW or some other estimator), the feasible efficient GMM estimator
becomes

    β̂_FGMM = ( X⊤ W Σ̂^{-1} W⊤ X )^{-1} X⊤ W Σ̂^{-1} W⊤ y.

The associated covariance matrix estimator is then

    Var̂(β̂_FGMM) = n ( X⊤ W Σ̂^{-1} W⊤ X )^{-1}.
1.3 Tests Based on the GMM Criterion Function

We first rewrite the GMM criterion function here:
    Q(β, y) ≡ (y − Xβ)⊤ W (W⊤ Ω W)^{-1} W⊤ (y − Xβ).

Define Ω^{-1} = Ψ Ψ⊤, which implies Ω = (Ψ⊤)^{-1} Ψ^{-1}. Then

    Q(β, y) = (y − Xβ)⊤ Ψ [ Ψ^{-1} W ( W⊤ (Ψ⊤)^{-1} Ψ^{-1} W )^{-1} W⊤ (Ψ⊤)^{-1} ] Ψ⊤ (y − Xβ)
            = (y − Xβ)⊤ Ψ P_A Ψ⊤ (y − Xβ)
            = ũ⊤ P_A ũ,

where A = Ψ^{-1} W and ũ = Ψ⊤ u ∼ iid(0, I). Therefore,

    Q(β₀, y) ∼ χ²( Rank(A) ) = χ²(l)    asymptotically,                        (1.7)

where β₀ is the true parameter vector. All tests in this section are based on the result (1.7).
1.3.1 Tests of Overidentifying Restrictions
Whenever l > k, a model estimated by GMM involves l − k overidentifying restrictions. Our
target is to find the distribution of the test statistic Q(β̂_GMM, y):

    Q(β̂_GMM, y) = (y − X β̂_GMM)⊤ Ψ P_A Ψ⊤ (y − X β̂_GMM).

If the model is correctly specified, it can be shown that

    P_A Ψ⊤ (y − X β̂_GMM) = P_A (I − P_{P_A Ψ⊤X}) Ψ⊤ y
                          = (P_A − P_{P_A Ψ⊤X}) Ψ⊤ (Xβ₀ + u)
                          = (P_A − P_{P_A Ψ⊤X}) Ψ⊤ u
                          = (P_A − P_{P_A Ψ⊤X}) ũ.

Therefore,

    Q(β̂_GMM, y) ∼ χ²( Rank(A) − Rank(P_A Ψ⊤ X) ) = χ²(l − k)    asymptotically.

This test statistic is often called Hansen's overidentification statistic.
1.3.2 Tests of Linear Restrictions
It is even easier to construct a test statistic for linear restrictions based on the GMM
criterion. Consider the model

    y = X₁ β₁ + X₂ β₂ + u,    E(uu⊤) = Ω.

Suppose we want to test the restrictions β₂ = 0, where β₁ is k₁ × 1 and β₂ is k₂ × 1. We first
obtain the constrained estimates β̃_FGMM = [β̃₁ ⋮ 0]. Then, the constrained criterion function is

    Q(β̃_FGMM, y) = (y − X₁ β̃₁)⊤ W (W⊤ Ω̂ W)^{-1} W⊤ (y − X₁ β̃₁)
                  = u⊤ Ψ (P_A − P_{P_A Ψ⊤X₁}) Ψ⊤ u ∼ χ²(l − k₁)    asymptotically.

If we subtract Q(β̂_GMM, y) from Q(β̃_FGMM, y), the difference between the constrained and
unconstrained criterion functions is

    Q(β̃_FGMM, y) − Q(β̂_GMM, y) = u⊤ Ψ (P_{P_A Ψ⊤X} − P_{P_A Ψ⊤X₁}) Ψ⊤ u
                                 ∼ χ²(k − k₁) = χ²(k₂)    asymptotically.

Note that the above results hold only for efficient GMM estimation. They do not hold for
nonoptimal criterion functions such as

    (y − Xβ)⊤ W Λ W⊤ (y − Xβ).

The reason is very simple: we can (almost) never construct a projection matrix of any form
from Λ. Therefore, the asymptotic distribution can rarely be a χ² distribution.
1.4 GMM Estimators for Nonlinear Models
When dealing with nonlinear models, we can adopt an approach that employs the concept
of an elementary zero function. We denote by f(θ, y) a set of such functions, where θ is
a k-vector of parameters. The n × 1 vector f(θ, y) corresponds to the error terms; in the
linear regression case,

    f(θ, y) ≡ y − Xθ = u.

The n × k matrix F(θ), which has typical element

    F_{ti}(θ) ≡ ∂f_t(θ)/∂θ_i,

corresponds to the regressors X in linear models.
We can simply duplicate the results of the preceding sections by replacing u with f(θ, y) and
X with F(θ). For example, the covariance matrix of θ̂ is

    Var̂(θ̂) = n ( F̂⊤ W Σ̂^{-1} W⊤ F̂ )^{-1},

and the nonlinear GMM criterion function is

    Q(θ, y) = f⊤(θ, y) Ψ P_A Ψ⊤ f(θ, y).

Of course, the real story is more complicated than what we have just shown. The assumptions
and regularity conditions are different for nonlinear models. Also, in practice, the matrix F(θ̂)
is usually very painful to estimate.
Chapter 2
The Method of Maximum Likelihood
2.1 Basic Concepts
Models that are estimated by maximum likelihood must be fully specified parametric
models. We denote the dependent variable by the n-vector y. For a given k-vector θ of
parameters, let the joint PDF of y be written as f(y, θ). This joint PDF f(y, θ)
is referred to as the likelihood function of the model for the given data set, and the
parameter vector θ̂ that maximizes f(y, θ) is called a maximum likelihood estimate, or
MLE, of the parameters.
If the observations are assumed to be independent, the joint density of the entire sample y
is

    f(y, θ) = ∏_{t=1}^{n} f(y_t, θ).

Because this product is often a very large or very small number, it is customary to
work instead with the loglikelihood function

    l(y, θ) ≡ log f(y, θ) = Σ_{t=1}^{n} l_t(y_t, θ),

where l_t(y_t, θ) is equal to log f_t(y_t, θ). For example, if y_t is generated by the density

    f(y_t, θ) = θ e^{−θ y_t},    y_t > 0,  θ > 0,

we take the logarithm of the density and obtain

    l_t(y_t, θ) = log θ − θ y_t.

Therefore,

    l(y, θ) = Σ_{t=1}^{n} (log θ − θ y_t) = n log θ − θ Σ_{t=1}^{n} y_t.

Taking the FOC with respect to θ,

    n/θ − Σ_{t=1}^{n} y_t = 0,

which can be solved to yield θ̂ = n / Σ_{t=1}^{n} y_t.
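As a quick numerical check of this example (not from the notes; the data below are simulated), the closed-form MLE θ̂ = n/Σ y_t should agree with a direct numerical maximization of l(y, θ):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta_true = 2.0
y = rng.exponential(scale=1.0 / theta_true, size=1000)   # density theta * exp(-theta * y)

# Closed-form MLE derived above: theta_hat = n / sum(y)
theta_hat = len(y) / y.sum()

# Numerical maximization of l(theta) = n*log(theta) - theta*sum(y) as a check
negloglik = lambda th: -(len(y) * np.log(th) - th * y.sum())
res = minimize_scalar(negloglik, bounds=(1e-6, 50.0), method="bounded")
print(theta_hat, res.x)   # the two estimates should agree closely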
2.1.1 Regression Models with Normal Errors
For the classical normal linear model

    y = Xβ + u,    u ∼ N(0, σ² I),

it is very easy to show that the regressand y_t is distributed, conditionally on X, as N(X_t β, σ²).
Thus the PDF of y_t is

    f_t(y_t, β, σ) = ( 1/(σ √(2π)) ) exp( −(y_t − X_t β)² / (2σ²) ).

Taking the log gives l_t(y_t, β, σ):

    l_t(y_t, β, σ) = −(1/2) log 2π − (1/2) log σ² − (1/(2σ²)) (y_t − X_t β)².

Thus the sum of all the l_t(y_t, β, σ) is

    l(y, β, σ) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) (y − Xβ)⊤(y − Xβ).      (2.1)

Taking the FOC with respect to σ gives us

    ∂l(y, β, σ)/∂σ = −n/σ + (1/σ³) (y − Xβ)⊤(y − Xβ) = 0,

which yields the result that

    σ̂²(β) = (1/n) (y − Xβ)⊤(y − Xβ).

Substituting σ̂²(β) into (2.1) yields the concentrated loglikelihood function

    lᶜ(y, β) = −(n/2) log 2π − (n/2) log( (1/n)(y − Xβ)⊤(y − Xβ) ) − n/2.          (2.2)

Maximizing (2.2) is equivalent to minimizing (y − Xβ)⊤(y − Xβ), which yields

    β̂ = (X⊤X)^{-1} X⊤ y.

Then the ML estimate is σ̂² = SSR(β̂)/n, which is biased downward.
2.1.2 Computing ML Estimates

We define the gradient vector g(y, θ), which has typical element

    g_i(y, θ) ≡ ∂l(y, θ)/∂θ_i = Σ_{t=1}^{n} ∂l_t(y_t, θ)/∂θ_i.
The gradient vector is the vector of first derivatives of the loglikelihood function. Let H(θ)
be the Hessian matrix of l(θ), with typical element

    H_{ij}(θ) = ∂² l(θ) / (∂θ_i ∂θ_j).

The Hessian is the matrix of second derivatives of the loglikelihood function.
Let θ_{(j)} denote the value of the vector of estimates at step j of the algorithm, and let
g_{(j)} and H_{(j)} denote the gradient and Hessian evaluated at θ_{(j)}. We perform a second-order
Taylor expansion of l(y, θ) around θ_{(0)} (the initial value of θ) in order to obtain an
approximation l*(θ) to l(θ):

    l*(θ) = l(θ_{(0)}) + g_{(0)}⊤ (θ − θ_{(0)}) + (1/2) (θ − θ_{(0)})⊤ H_{(0)} (θ − θ_{(0)}).

The FOC for a maximum of l*(θ) with respect to θ can be written as

    g_{(0)} + H_{(0)} (θ − θ_{(0)}) = 0,

which implies that the next step is

    θ_{(1)} = θ_{(0)} − H_{(0)}^{-1} g_{(0)}.

We keep iterating until the estimates converge. The whole process can be summarized by the
following equation:

    θ_{(j+1)} = θ_{(j)} − H_{(j)}^{-1} g_{(j)},

which is the fundamental equation of Newton's Method.
In practice, however, Newton's Method does not always work, because the Hessian may fail
to be negative definite. In such cases, one popular and more appropriate way to obtain the
MLE is the so-called quasi-Newton method, in which the fundamental equation is replaced by

    θ_{(j+1)} = θ_{(j)} + α_{(j)} D_{(j)}^{-1} g_{(j)},

where α_{(j)} is a scalar determined at each step, and D_{(j)} is a positive definite matrix that
approximates −H_{(j)}.
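The sketch below illustrates the Newton iteration θ_{(j+1)} = θ_{(j)} − H_{(j)}^{-1} g_{(j)} in Python, using the exponential loglikelihood of Section 2.1 as a simple test case; the function and the simulated data are illustrative assumptions, not part of the notes.

import numpy as np

def newton_mle(grad, hess, theta0, tol=1e-8, max_iter=100):
    """Newton's method: theta_{j+1} = theta_j - H^{-1} g."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        g = np.atleast_1d(grad(theta))
        H = np.atleast_2d(hess(theta))
        step = np.linalg.solve(H, g)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Exponential example from Section 2.1: l(theta) = n log(theta) - theta * sum(y)
rng = np.random.default_rng(2)
y = rng.exponential(scale=0.5, size=200)           # true theta = 2
n, s = len(y), y.sum()
grad = lambda th: np.array([n / th[0] - s])        # dl/dtheta
hess = lambda th: np.array([[-n / th[0] ** 2]])    # d2l/dtheta2 (negative definite here)
print(newton_mle(grad, hess, theta0=[1.0]), n / s)   # Newton result vs closed form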
2.2 Asymptotic Properties of ML Estimators

2.2.1 Consistency of MLE
The loglikelihood function l(θ) is assumed to be concave; otherwise we are not able to find its
maximum. Let L(θ) = exp l(θ) denote the likelihood function. By Jensen's Inequality,
we have

    E₀ log( L(θ*)/L(θ₀) ) < log E₀( L(θ*)/L(θ₀) ) = log ∫ ( L(θ*)/L(θ₀) ) L(θ₀) dy = 0,

which implies

    E₀ l(θ*) − E₀ l(θ₀) < 0.

By the LLN, we can assert that

    plim_{n→∞} (1/n) l(θ*) ≤ plim_{n→∞} (1/n) l(θ₀)    for all θ* ≠ θ₀.          (2.3)

Since the MLE θ̂ maximizes l(θ), it must be the case that

    plim_{n→∞} (1/n) l(θ̂) ≥ plim_{n→∞} (1/n) l(θ₀).                              (2.4)

The only way that (2.3) and (2.4) can both be true is if

    plim_{n→∞} (1/n) l(θ̂) = plim_{n→∞} (1/n) l(θ₀).                              (2.5)

Note that the result (2.5) alone does not prove that θ̂ is consistent, simply because the
inequality condition (2.3) may allow many values θ* to satisfy

    plim_{n→∞} (1/n) l(θ*) = plim_{n→∞} (1/n) l(θ₀).

Therefore we must also assume that

    plim_{n→∞} (1/n) l(θ*) ≠ plim_{n→∞} (1/n) l(θ₀)    for all θ* ≠ θ₀,

which is also known as an asymptotic identification condition.
2.2.2 Dependent Observations

For a sample of size n, the joint density of n dependent observations can be factorized as

    f(yⁿ) = ∏_{t=1}^{n} f(y_t | y^{t−1}),

where the vector y^t is a t-vector with components y₁, y₂, ..., y_t. If the model is to be estimated
by MLE, the density f(yⁿ) depends on a k-vector of parameters θ:

    f(yⁿ, θ) = ∏_{t=1}^{n} f(y_t | y^{t−1}; θ).

The corresponding loglikelihood function is then

    l(y, θ) = Σ_{t=1}^{n} l_t(y^t, θ).
2.2.3 The Gradient, the Information Matrix and the Hessian
We define the n × k matrix of contributions to the gradient, G(y, θ), so as to have
typical element

    G_{ti}(y^t, θ) ≡ ∂l_t(y^t, θ)/∂θ_i.

We should be careful to distinguish G(y, θ) from g(y, θ), which has typical element

    g_i(y, θ) ≡ ∂l(y, θ)/∂θ_i = Σ_{t=1}^{n} ∂l_t(y_t, θ)/∂θ_i.

The covariance matrix of the elements of the t-th row G_t(y^t, θ) of G(y, θ) is the k × k
matrix I_t(θ), which is normally positive definite. The information matrix is the sum of
the I_t(θ):

    I(θ) ≡ Σ_{t=1}^{n} I_t(θ) = Σ_{t=1}^{n} E_θ( G_t⊤(y^t, θ) G_t(y^t, θ) ).

We have already defined the Hessian matrix H(y, θ), which has typical element

    H_{ij}(y, θ) = ∂² l(y, θ) / (∂θ_i ∂θ_j).

For asymptotic analysis, we are generally more interested in their asymptotic counterparts

    ℐ(θ) ≡ plim_{n→∞} (1/n) I(θ),    ℋ(θ) ≡ plim_{n→∞} (1/n) H(θ),

where

    ℐ(θ) = −ℋ(θ).
2.2.4 Asymptotic Normality of the MLE

For the gradient evaluated at the MLE, g(θ̂), we perform a Taylor expansion around θ₀:

    g(θ̂) = g(θ₀) + H(θ̄)(θ̂ − θ₀) = 0,

which implies

    n^{1/2}(θ̂ − θ₀) = −( n^{-1} H(θ̄) )^{-1} n^{-1/2} g(θ₀)
                     = −ℋ^{-1}(θ₀) n^{-1/2} g(θ₀)    (asymptotically)
                     = ℐ^{-1}(θ₀) n^{-1/2} g(θ₀)     (asymptotically).

Since

    g_i(y, θ) = Σ_{t=1}^{n} G_{ti}(y^t, θ)    and    E( G_{ti}(y^t, θ) ) = 0,

the CLT gives

    n^{-1/2} g(θ₀) ∼ N( 0, ℐ(θ₀) )    asymptotically.

Finally, we obtain

    n^{1/2}(θ̂ − θ₀) ∼ N( 0, ℐ^{-1}(θ₀) )
                    ∼ N( 0, −ℋ^{-1}(θ₀) )
                    ∼ N( 0, ℋ^{-1}(θ₀) ℐ(θ₀) ℋ^{-1}(θ₀) )    asymptotically.
2.3 *The Covariance Matrix of the ML Estimator

Based on the asymptotic result we just derived above, four covariance matrix estimators can
be formulated:
(i) The empirical Hessian estimator:

    Var̂_H(θ̂) = −H^{-1}(θ̂).

(ii) The information matrix estimator:

    Var̂_IM(θ̂) = I^{-1}(θ̂).

(iii) The outer-product-of-the-gradient estimator:

    Var̂_OPG(θ̂) = ( G⊤(θ̂) G(θ̂) )^{-1},

which is based on the following theoretical result:

    I(θ₀) = E( G⊤(θ₀) G(θ₀) ).

(iv) The sandwich estimator:

    Var̂_S(θ̂) = H^{-1}(θ̂) G⊤(θ̂) G(θ̂) H^{-1}(θ̂).
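For illustration only (not from the notes), the following sketch computes the empirical Hessian, OPG and sandwich estimators from a matrix G of gradient contributions and a Hessian H, both assumed to be evaluated at θ̂; the information matrix estimator is omitted because I(θ̂) is model-specific.

import numpy as np

def ml_cov_estimators(G, H):
    """Covariance matrix estimators for an MLE.
    G : n x k matrix of contributions to the gradient, evaluated at theta_hat
    H : k x k Hessian of the loglikelihood, evaluated at theta_hat"""
    Hinv = np.linalg.inv(H)
    GtG = G.T @ G
    return {
        "hessian":  -Hinv,                   # -H^{-1}
        "opg":      np.linalg.inv(GtG),      # (G'G)^{-1}
        "sandwich": Hinv @ GtG @ Hinv,       # H^{-1} G'G H^{-1}
    }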
2.3.1 *Example: the Classical Normal Linear Model
For the classical normal linear model

    y = Xβ + u,    u ∼ N(0, σ² I),

where X is n × k, the contribution to the loglikelihood function made by the t-th observation
is

    l_t(y_t, β, σ) = −(1/2) log 2π − (1/2) log σ² − (1/(2σ²)) (y_t − X_t β)².

A typical element of any of the first k columns of the matrix G is

    G_{ti}(β, σ) = ∂l_t/∂β_i = (1/σ²) (y_t − X_t β) x_{ti},    i = 1, ..., k.

A typical element of the last column of G is

    G_{t,k+1}(β, σ) = ∂l_t/∂σ = −1/σ + (1/σ³) (y_t − X_t β)².

This implies the following:

(i) For i, j = 1, ..., k, the ij-th element of G⊤G is

    Σ_{t=1}^{n} (1/σ⁴) (y_t − X_t β)² x_{ti} x_{tj},

which, replacing (y_t − X_t β)² by its expectation σ², becomes Σ_{t=1}^{n} x_{ti} x_{tj} / σ².

(ii) The (k+1),(k+1)-th element of G⊤G is

    Σ_{t=1}^{n} ( −1/σ + (1/σ³)(y_t − X_t β)² )²
      = n/σ² − (2/σ⁴) Σ_{t=1}^{n} (y_t − X_t β)² + (1/σ⁶) Σ_{t=1}^{n} (y_t − X_t β)⁴
      = n/σ² − 2n/σ² + 3n/σ² = 2n/σ²,

using E(y_t − X_t β)² = σ² and E(y_t − X_t β)⁴ = 3σ⁴.

(iii) The (i, k+1)-th element of G⊤G is

    Σ_{t=1}^{n} (1/σ²)(y_t − X_t β) x_{ti} ( −1/σ + (1/σ³)(y_t − X_t β)² )
      = −(1/σ³) Σ_{t=1}^{n} (y_t − X_t β) x_{ti} + (1/σ⁵) Σ_{t=1}^{n} (y_t − X_t β)³ x_{ti}
      = 0,

since the odd moments of the normal errors are zero.

Therefore,

    ℐ(β, σ) = plim_{n→∞} (1/n) G⊤G = plim_{n→∞} [ n^{-1} X⊤X/σ²    0
                                                    0⊤               2/σ² ].

If we want to estimate the covariance matrix using the IM estimator, we have

    Var̂_IM(β̂, σ̂) = [ σ̂² (X⊤X)^{-1}    0
                       0⊤               σ̂²/(2n) ].

Thus the standard error of σ̂ is estimated as σ̂/√(2n), and the C.I. can be constructed as

    [ σ̂ − c_{α/2} σ̂/√(2n),  σ̂ + c_{α/2} σ̂/√(2n) ],

given the fact that u ∼ N(0, σ² I).
2.4 *Hypothesis Testing
Assume that the null hypothesis imposes r restrictions on θ, which we can write as

    r(θ) = 0.

Let θ̃ and θ̂ denote, respectively, the restricted and unrestricted maximum likelihood
estimates of θ. There are three classical tests:

(i) The likelihood ratio test:

    LR = 2( l(θ̂) − l(θ̃) ) ∼ χ²(r)    asymptotically.

(ii) The Wald test:

    W = r⊤(θ̂) ( R(θ̂) Var̂(θ̂) R⊤(θ̂) )^{-1} r(θ̂) ∼ χ²(r)    asymptotically,

where R(θ) = ∂r(θ)/∂θ is an r × k matrix.

(iii) The Lagrange multiplier test:

    LM = g⊤(θ̃) Ĩ^{-1} g(θ̃)
       = λ̃⊤ R(θ̃) Ĩ^{-1} R⊤(θ̃) λ̃
       ∼ χ²(r)    asymptotically,

where λ̃ is estimated from the constrained maximization problem

    max_θ  l(θ) − r⊤(θ) λ.
2.5 *Transformations of the Dependent Variable
We consider the following model:

    log y_t = X_t β + u_t,    u_t ∼ NID(0, σ²).

The loglikelihood for the log y_t is

    −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) (log y − Xβ)⊤(log y − Xβ).

Using the Jacobian of the transformation from log y_t to y_t, we establish the link between
f(y_t) and f(log y_t):

    f(y_t) = f(log y_t) |d log y_t / d y_t| = f(log y_t) / y_t,

which implies

    l_t(y_t) = l_t(log y_t) − log y_t.

Therefore, the loglikelihood function of y_t that we are seeking is

    −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) (log y − Xβ)⊤(log y − Xβ) − Σ_{t=1}^{n} log y_t,

and the loglikelihood function concentrated with respect only to σ is

    −(n/2) log 2π − n/2 − (n/2) log( (1/n) Σ_{t=1}^{n} (log y_t − X_t β)² ) − Σ_{t=1}^{n} log y_t.

The associated Ĝ⊤Ĝ is a (k + 1) × (k + 1) matrix with three distinct types of elements, similar
to those in the example of Section 2.3.1:

(i) For i, j = 1, ..., k, the ij-th element of Ĝ⊤Ĝ is

    Σ_{t=1}^{n} (1/σ̂⁴) (log y_t − X_t β̂)² x_{ti} x_{tj}.

(ii) The (k+1),(k+1)-th element of Ĝ⊤Ĝ is

    Σ_{t=1}^{n} ( −1/σ̂ + (1/σ̂³)(log y_t − X_t β̂)² )²
      = n/σ̂² − (2/σ̂⁴) Σ_{t=1}^{n} (log y_t − X_t β̂)² + (1/σ̂⁶) Σ_{t=1}^{n} (log y_t − X_t β̂)⁴.

(iii) The (i, k+1)-th element of Ĝ⊤Ĝ is

    Σ_{t=1}^{n} (1/σ̂²)(log y_t − X_t β̂) x_{ti} ( −1/σ̂ + (1/σ̂³)(log y_t − X_t β̂)² )
      = −(1/σ̂³) Σ_{t=1}^{n} (log y_t − X_t β̂) x_{ti} + (1/σ̂⁵) Σ_{t=1}^{n} (log y_t − X_t β̂)³ x_{ti}.

The Hessian matrix can be obtained in a similar fashion. We need to consider three blocks
of −Ĥ: the upper-left k × k block that corresponds to β, the lower-right scalar that
corresponds to σ, and the k × 1 vector that corresponds to β and σ. Of course, all
the parameters in −Ĥ must be replaced by sample estimates (which means they must be
"hatted").
Chapter 3
Discrete and Limited Dependent Variables

This chapter explores two "typical" limits of traditional regression models: when the dependent
variable is discrete, and when it is continuous but limited to a range.
3.1 Binary Response Models: Estimation
A binary dependent variable y_t can take on only two values, 0 and 1. We encounter
such variables frequently in survey data, in which the economic agent usually chooses between
two alternatives. A binary response model explains the probability that the agent chooses
alternative 1 as a function of some observed explanatory variables.
Let P_t denote the probability that y_t = 1 conditional on the information set Ω_t:

    P_t ≡ Pr(y_t = 1 | Ω_t) = E(y_t | Ω_t) = F(X_t β) ∈ [0, 1],

where X_t β is an index function and the CDF F(x) is also called a transformation
function. Two popular choices for F(x) are

    F(x) = Φ(x) ≡ (1/√(2π)) ∫_{−∞}^{x} exp(−X²/2) dX    (probit model),
    F(x) = γ(x) ≡ eˣ / (1 + eˣ)                          (logit model).

Since the probability that y_t = 1 is F(X_t β), the loglikelihood contribution of observation t
when y_t = 1 is log F(X_t β). Therefore the loglikelihood function for y is simply

    l(y, β) = Σ_{t=1}^{n} [ y_t log F(X_t β) + (1 − y_t) log( 1 − F(X_t β) ) ],

which can be easily estimated by MLE.
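A minimal Python sketch of this estimation is shown below; it is not part of the notes, and the simulated latent-variable data and the function name binary_mle are assumptions for illustration.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def binary_mle(y, X, model="probit"):
    """Maximize l(y, b) = sum_t [ y_t log F(X_t b) + (1 - y_t) log(1 - F(X_t b)) ]."""
    F = norm.cdf if model == "probit" else (lambda z: 1.0 / (1.0 + np.exp(-z)))
    def negloglik(b):
        p = np.clip(F(X @ b), 1e-10, 1 - 1e-10)   # guard the logs
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    res = minimize(negloglik, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x

# Simulated latent-variable example: y = 1{X beta + u > 0}, u ~ N(0, 1)
rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)
print(binary_mle(y, X, "probit"))   # should be close to beta_true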
Binary response models can deal with data that are not generally suitable for traditional
regression models. For example, suppose there is an unobserved, or latent, variable y_t°:

    y_t° = X_t β + u_t,    u_t ∼ NID(0, 1).

We observe only the sign of y_t°, which determines the value of the observed binary variable
y_t according to the relationship

    y_t = 1  if y_t° > 0,
    y_t = 0  if y_t° ≤ 0.

However, binary response models do have drawbacks compared to traditional regression
models. For example, they are likely to encounter the perfect classifier problem when the
sample size is small and almost all of the y_t are equal to 0 or to 1. In that case, it may be
possible to sort the observations by some linear combination of the independent variables,
say X_t β•, such that

    y_t = 0  whenever X_t β• < 0,
    y_t = 1  whenever X_t β• > 0.

Unfortunately, in that case no finite ML estimate exists.
3.2 Binary Response Models: Inference
Following the standard results for ML estimation, we can show that

    Var( plim_{n→∞} n^{1/2} (β̂ − β₀) ) = plim_{n→∞} ( (1/n) X⊤ Υ(β₀) X )^{-1},

where

    Υ_t(β) ≡ f²(X_t β) / [ F(X_t β) ( 1 − F(X_t β) ) ].

In practice, the covariance matrix estimator is

    Var̂(β̂) = ( X⊤ Υ(β̂) X )^{-1}.

We can also test restrictions on binary response models simply by using LR tests. As usual, the
LR test statistic is asymptotically distributed as χ²(r), where r is the number of restrictions.
Another, somewhat easier, way to perform hypothesis testing is to construct the binary
response model regression (BRMR)

    V_t^{-1/2}(β) ( y_t − F(X_t β) ) = V_t^{-1/2}(β) f(X_t β) X_t b + residual,

where V_t(β) ≡ F(X_t β)( 1 − F(X_t β) ) and f(x) = ∂F(x)/∂x. The BRMR is a modified
version of the GNR.
We partition X as [X₁, X₂] and β as [β₁ ⋮ β₂], where β₂ is an r-vector. Suppose we
want to test β₂ = 0. We can do this by running the BRMR

    Ṽ_t^{-1/2} ( y_t − F̃_t ) = Ṽ_t^{-1/2} f̃_t X_{t1} b₁ + Ṽ_t^{-1/2} f̃_t X_{t2} b₂ + residual,

where Ṽ_t, F̃_t and f̃_t are evaluated at the restricted coefficients β̃ = [β̃₁ ⋮ 0]. Therefore,
testing β₂ = 0 is equivalent to testing b₂ = 0. The best test statistic to use is probably the
ESS from the above regression, which is asymptotically distributed as χ²(r).
3.3 Models for Multiple Discrete Responses
It is quite common for discrete dependent variables to take on more than just two values.
Models that deal with multiple discrete responses are referred to as qualitative response
models or discrete choice models. We usually divide these models into two groups: those
designed to deal with ordered responses (for example, rating your professor on the course
evaluation form), and those designed to deal with unordered responses (for example, your
choice of what to do during reading week).
The most widely used model for ordered response data is the ordered probit model.
The model for the latent variable is

    y_t° = X_t β + u_t,    u_t ∼ NID(0, 1).

Taking the case of 3 responses as an example, we assume the relation between the observed y_t
and the latent y_t° is given by

    y_t = 0  if y_t° < γ₁,
    y_t = 1  if γ₁ ≤ y_t° ≤ γ₂,
    y_t = 2  if y_t° ≥ γ₂,

where γ₁ and γ₂ are threshold parameters that must be estimated. The probability that
y_t = 0 is

    Pr(y_t = 0) = Pr(y_t° < γ₁) = Pr(u_t < γ₁ − X_t β) = Φ(γ₁ − X_t β).

Similarly, we have

    Pr(y_t = 1) = Φ(γ₂ − X_t β) − Φ(γ₁ − X_t β),
    Pr(y_t = 2) = Φ(X_t β − γ₂).

Therefore, the loglikelihood function for the ordered probit model is

    l(β, γ₁, γ₂) = Σ_{y_t=0} log Φ(γ₁ − X_t β) + Σ_{y_t=2} log Φ(X_t β − γ₂)
                   + Σ_{y_t=1} log( Φ(γ₂ − X_t β) − Φ(γ₁ − X_t β) ).

The most popular model for dealing with unordered responses is the multinomial logit
model or multiple logit model. This model is designed to handle J + 1 responses, for
J ≥ 1. The probability that response j is observed is

    Pr(y_t = j) = exp(W_{tj} β^j) / Σ_{j'=0}^{J} exp(W_{tj'} β^{j'})    for j = 0, ..., J,

where W_{tj} contains the explanatory variables for response j and β^j is usually different for
each j = 0, ..., J. The loglikelihood function is simply

    Σ_{t=1}^{n} ( Σ_{j=0}^{J} I(y_t = j) W_{tj} β^j − log Σ_{j=0}^{J} exp(W_{tj} β^j) ),

where I(·) is the indicator function.
In practice, however, we often face a situation in which the explanatory variables W_{tj} are
the same for each j. In that case, we can denote by the familiar X_t the explanatory variables
for all responses, and the probabilities for j = 1, ..., J are simply

    Pr(y_t = j) = exp(X_t β^j) / ( 1 + Σ_{j'=1}^{J} exp(X_t β^{j'}) ),

and for outcome 0 the probability is

    Pr(y_t = 0) = 1 / ( 1 + Σ_{j'=1}^{J} exp(X_t β^{j'}) ).

The unknown parameters in each vector β^j are typically jointly estimated by maximum a
posteriori (MAP) estimation, which is an extension of MLE.
An important property of the general multinomial logit model is that

    Pr(y_t = l) / Pr(y_t = j) = exp(W_{tl} β^l) / exp(W_{tj} β^j)

for any two responses l and j. This property is called the independence of irrelevant
alternatives, or IIA. Unfortunately, the IIA property is usually incompatible with the data.
A logit model that does not impose the IIA property is the nested logit model. We can
derive the probabilities of a nested logit model from the multinomial logit model:

• We partition the set of outcomes {0, 1, ..., J} into m disjoint subsets A_i, i = 1, ..., m.

• Suppose that the choice among the members of A_i is governed by a standard multinomial
  logit model

      Pr(y_t = j | y_t ∈ A_i) = exp(W_{tj} β^j / θ_i) / Σ_{l∈A_i} exp(W_{tl} β^l / θ_i),

  where θ_i is a scale parameter for the parameter vectors β^j, j ∈ A_i.

• Specifically, we assume that

      Pr(y_t ∈ A_i) = exp(θ_i h_{ti}) / Σ_{k=1}^{m} exp(θ_k h_{tk}),

  where

      h_{ti} = log( Σ_{j∈A_i} exp(W_{tj} β^j / θ_i) ).

• Finally, by Bayes' rule, we have

      Pr(y_t = j) = Pr(y_t = j | y_t ∈ A_{i(j)}) Pr(y_t ∈ A_{i(j)})
                  = [ exp(W_{tj} β^j / θ_{i(j)}) / Σ_{l∈A_{i(j)}} exp(W_{tl} β^l / θ_{i(j)}) ]
                    × [ exp(θ_{i(j)} h_{t,i(j)}) / Σ_{k=1}^{m} exp(θ_k h_{tk}) ].
3.4 Models for Count Data
Many economic data are nonnegative integers, for example the size of a population. Data
of this type are called event count data or count data. In probability theory
and statistics, the Poisson distribution (or Poisson law of small numbers) is a discrete
probability distribution that expresses the probability of a given number of events occurring
in a fixed interval of time and/or space if these events occur with a known average rate and
independently of the time since the last event. Therefore, the Poisson distribution model
is one of the most popular models for dealing with count data:

    Pr(Y_t = y) = exp( −λ_t(β) ) λ_t(β)^y / y!,    y = 0, 1, 2, ...,

where

    λ_t(β) ≡ exp(X_t β).

The loglikelihood function is simply

    l(y, β) = Σ_{t=1}^{n} ( −exp(X_t β) + y_t X_t β − log y_t! )

and the associated Hessian matrix is

    H(β) = − Σ_{t=1}^{n} exp(X_t β) X_t⊤ X_t = −X⊤ Υ(β) X,

where Υ(β) is an n × n diagonal matrix with typical diagonal element Υ_t(β) ≡ exp(X_t β).
In practice, however, the Poisson regression model tends to underpredict the variance of
the actual data. Such failure is also called overdispersion. If the variance of y_t is indeed
equal to exp(X_t β), the quantity

    z_t(β) ≡ ( y_t − exp(X_t β) )² − y_t

has expectation 0. To test for overdispersion in the Poisson regression model, we make use of
the artificial OPG regression

    ι = Ĝ b + c ẑ + residual,

where Ĝ has typical row ( y_t − exp(X_t β̂) ) X_t and ẑ_t = ( y_t − exp(X_t β̂) )² − y_t. We test
c = 0 using the t statistic, which follows the asymptotic distribution N(0, 1), or we can compare
the ESS with the χ²(1) distribution.
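The following Python sketch (not from the notes; the data are simulated for illustration) maximizes the Poisson loglikelihood numerically and computes the sample mean of z_t(β̂) as an informal overdispersion check.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def poisson_mle(y, X):
    """Maximize l(y, b) = sum_t [ -exp(X_t b) + y_t X_t b - log(y_t!) ]."""
    def negloglik(b):
        xb = X @ b
        return -np.sum(-np.exp(xb) + y * xb - gammaln(y + 1.0))
    res = minimize(negloglik, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.2, 0.7])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = poisson_mle(y, X)

# Informal overdispersion check: z_t should have mean near zero if Var(y_t) = exp(X_t beta)
z = (y - np.exp(X @ beta_hat)) ** 2 - y
print(beta_hat, z.mean())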
3.5 Models for Censored and Truncated Data
A data sample is said to be truncated if some observations have been systematically excluded
from the sample. A data sample has been censored if some of the information contained in it
has been suppressed.
Any dependent variable that has been either censored or truncated is said to be a limited
dependent variable. Consider the regression model

    y_t° = β₁ + β₂ x_t + u_t,    u_t ∼ NID(0, σ²).

What we actually observe is y_t, which differs from y_t° because it is either truncated or
censored.
Suppose y_t is truncated such that all the negative values are omitted. The probability
that y_t° is included in the sample is

    Pr(y_t° ≥ 0) = Pr(X_t β + u_t ≥ 0) = 1 − Pr(u_t/σ < −X_t β/σ) = Φ(X_t β/σ).

The density of y_t is then

    σ^{-1} φ( (y_t − X_t β)/σ ) / Φ(X_t β/σ).

This implies that the loglikelihood function of y_t conditional on y_t° ≥ 0 is

    l(y, β, σ) = −(n/2) log(2π) − n log σ − (1/(2σ²)) Σ_{t=1}^{n} (y_t − X_t β)²
                 − Σ_{t=1}^{n} log Φ(X_t β/σ),

which can be estimated by MLE.
The most popular model for censored data is the tobit model. Assume we impose the
following restriction:

    y_t = y_t°  if y_t° > 0,
    y_t = 0     otherwise.

The loglikelihood function for y_t is simply

    l(y, β, σ) = Σ_{y_t>0} log( (1/σ) φ( (y_t − X_t β)/σ ) ) + Σ_{y_t=0} log Φ(−X_t β/σ),

which is the sum of the log densities of the uncensored observations and the log probabilities
of the censored observations. Fortunately, we can still use MLE, even though the distribution
of the dependent variable in a tobit model is a mixture of discrete and continuous random
variables.
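Below is a hedged Python sketch of tobit estimation by MLE, assuming censoring at zero as above; the parameterization in terms of log σ and the simulated data are my own choices for illustration.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X):
    """Negative tobit loglikelihood with censoring at zero.
    params = (beta, log_sigma); the log-parameterization keeps sigma > 0."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    pos = y > 0
    ll = np.sum(norm.logpdf((y[pos] - xb[pos]) / sigma) - np.log(sigma))   # uncensored part
    ll += np.sum(norm.logcdf(-xb[~pos] / sigma))                           # censored part
    return -ll

rng = np.random.default_rng(5)
n = 1500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
ystar = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
y = np.maximum(ystar, 0.0)                                   # censoring rule
res = minimize(tobit_negloglik, x0=np.zeros(X.shape[1] + 1), args=(y, X), method="BFGS")
print(res.x[:-1], np.exp(res.x[-1]))                         # beta_hat, sigma_hat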
3.6 Sample Selectivity
Many samples are truncated on the basis of another variable that is correlated with the
dependent variable. For example, suppose that whether a student shows up to a tutorial
depends on the student's desire for knowledge and on the TA's ability; when these two factors
do not meet a certain standard, the student will not show up. If we then evaluate the
performance of students in the tutorial, our evaluation will be biased upward, because our
sample includes only those students who actually attend the tutorial. The consequences of this
type of sample selection are often said to be due to sample selectivity. Therefore, to evaluate
students' performance fairly, we need to include measures of the student's desire and the TA's
ability in our model.
Suppose that y_t° and z_t° are two latent variables, generated by the bivariate process

    y_t° = X_t β + u_t,
    z_t° = W_t γ + v_t,
    (u_t, v_t)⊤ ∼ NID( 0, [ σ²  ρσ
                            ρσ  1  ] ).

The data are determined according to

    y_t = y_t°        if z_t° > 0,
    y_t unobserved    otherwise,

and

    z_t = 1  if z_t° > 0,
    z_t = 0  otherwise.

The loglikelihood function is

    Σ_{z_t=0} log Pr(z_t = 0) + Σ_{z_t=1} log( Pr(z_t = 1) f(y_t° | z_t = 1) ),

which can be estimated by ML.
Another, simpler, technique for computing consistent estimates is Heckman's two-step
method, which is based on the following model:

    y_t = X_t β + ρσ v_t + e_t,

where the error term u_t is divided into two parts: one perfectly correlated with v_t and one
independent of v_t. The idea is to replace v_t with its conditional mean

    E(v_t | z_t = 1, W_t) = E(v_t | v_t > −W_t γ, W_t) = φ(W_t γ) / Φ(W_t γ).

Therefore, we have

    y_t = X_t β + ρσ φ(W_t γ)/Φ(W_t γ) + e_t.

The first step is to obtain consistent estimates γ̂ from

    z_t° = W_t γ + v_t,

using an ordinary probit model. Then we replace γ with its estimate in

    y_t = X_t β + ρσ φ(W_t γ̂)/Φ(W_t γ̂) + e_t

and obtain consistent estimates of β by OLS. Of course, the above regression can also be
used to test whether selectivity is a problem, by testing ρ = 0 with an ordinary t
statistic.
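The two steps can be sketched in Python as follows (not from the notes); the function heckman_two_step and its argument names are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def heckman_two_step(y, X, z, W):
    """Heckman's two-step estimator.
    z : n-vector of 0/1 selection indicators; y is used only where z == 1.
    Step 1: probit of z on W gives gamma_hat.
    Step 2: OLS of y on [X, inverse Mills ratio] for the selected sample."""
    # Step 1: probit by MLE
    def negloglik(g):
        p = np.clip(norm.cdf(W @ g), 1e-10, 1 - 1e-10)
        return -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))
    gamma = minimize(negloglik, np.zeros(W.shape[1]), method="BFGS").x
    # Step 2: augment X with the inverse Mills ratio phi/Phi and run OLS
    sel = z == 1
    mills = norm.pdf(W[sel] @ gamma) / norm.cdf(W[sel] @ gamma)
    Xa = np.column_stack([X[sel], mills])
    coef, *_ = np.linalg.lstsq(Xa, y[sel], rcond=None)
    return coef[:-1], coef[-1], gamma   # beta_hat, estimate of rho*sigma, gamma_hat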
3.7 Duration Models
Economists are sometimes interested in how much time elapses before some event occurs.
From now on, we use i to index observations and denote by t_i the measured duration. Suppose
that how long a state endures is measured by T, a nonnegative, continuous random variable
with PDF f(t) and CDF F(t). The survivor function is defined as

    S(t) ≡ 1 − F(t).

The probability that a state ends in the period from time t to time t + ∆t is

    Pr(t < T ≤ t + ∆t) = F(t + ∆t) − F(t).

Therefore, the probability that the state ends in this period, conditional on having survived to
time t, is

    Pr(t < T ≤ t + ∆t | T ≥ t) = ( F(t + ∆t) − F(t) ) / S(t).

We divide the right-hand side by ∆t:

    ( F(t + ∆t) − F(t) ) / ( ∆t S(t) ).

As ∆t → 0, the numerator divided by ∆t is simply the PDF f(t). The resulting function is
called the hazard function:

    h(t) ≡ f(t)/S(t) = f(t) / ( 1 − F(t) ).

The function F(t) may take various forms, depending on the assumptions made. For example,
F(t) can follow the Weibull distribution

    F(t, θ, α) = 1 − exp( −(θt)^α ).

The associated PDF, survivor function and hazard function are simply

    f(t) = α θ^α t^{α−1} exp( −(θt)^α ),
    S(t) = exp( −(θt)^α ),
    h(t) = α θ^α t^{α−1}.

The loglikelihood function for t, the vector of observations with typical element t_i, is just

    l(t, β, α) = Σ_{i=1}^{n} log f(t_i | X_i, β, α)
               = Σ_{i=1}^{n} log h(t_i | X_i, β, α) + Σ_{i=1}^{n} log S(t_i | X_i, β, α)
               = n log α + Σ_{i=1}^{n} X_i β + (α − 1) Σ_{i=1}^{n} log t_i − Σ_{i=1}^{n} t_i^α exp(X_i β).

We then obtain the estimates by MLE in the usual way.
Chapter 4
Multivariate Models
In this chapter, we discuss models which jointly determine the values of two or more
dependent variables using two or more equations.
4.1 Seemingly Unrelated Linear Regressions (SUR)
We suppose that there are g dependent variables, indexed by i, each associated with n
observations. The i-th equation of a multivariate linear regression model can be written as

    y_i = X_i β_i + u_i,    E(u_i u_i⊤) = σ_{ii} I_n.

To allow correlation between error terms in the same period, we further impose the
assumption

    E(u_{ti} u_{tj}) = σ_{ij}  for all t,    E(u_{ti} u_{sj}) = 0  for all t ≠ s,

where σ_{ij} is the ij-th element of the g × g PSD matrix Σ. We construct an n × g matrix
U that contains the error terms u_i from each equation, a typical row of which is the 1 × g
vector U_t. Then,

    E(U_t⊤ U_t) = (1/n) E(U⊤ U) = Σ.
4.1.1 MM Estimation
We can write the entire SUR system as

    y• = X• β• + u•,

where

    y• = (y₁⊤, y₂⊤, ..., y_g⊤)⊤,    β• = (β₁⊤, β₂⊤, ..., β_g⊤)⊤,    u• = (u₁⊤, u₂⊤, ..., u_g⊤)⊤,

and X• is the block-diagonal matrix

    X• = [ X₁   O   ⋯   O
           O    X₂  ⋯   O
           ⋮    ⋮   ⋱   ⋮
           O    O   ⋯   X_g ].

If we assume homoskedasticity, the OLS estimator for the entire system is simply

    β̂•_OLS = (X•⊤ X•)^{-1} X•⊤ y•.
The covariance matrix of the vector u• is

    E(u• u•⊤) = [ E(u₁u₁⊤)    ⋯   E(u₁u_g⊤)       [ σ₁₁ I_n    ⋯   σ₁g I_n
                  ⋮            ⋱   ⋮           =     ⋮           ⋱   ⋮
                  E(u_g u₁⊤)  ⋯   E(u_g u_g⊤) ]      σ_g1 I_n   ⋯   σ_gg I_n ]  ≡  Σ•.

The matrix Σ• is a symmetric gn × gn matrix, which can be written more compactly as

    Σ• ≡ Σ ⊗ I_n.

In the Kronecker product Σ ⊗ I_n, each element of Σ multiplies the matrix I_n, and the resulting
blocks are arranged to form Σ•.
If there is heteroskedasticity, the feasible GLS estimator is

    β̂•_FGLS = ( X•⊤ (Σ̂^{-1} ⊗ I_n) X• )^{-1} X•⊤ (Σ̂^{-1} ⊗ I_n) y•,

where

    Σ̂ ≡ (1/n) Û⊤ Û    with  Û = [û₁, ..., û_g].

The covariance matrix estimator is then

    Var̂(β̂•_FGLS) = ( X•⊤ (Σ̂^{-1} ⊗ I_n) X• )^{-1}.
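A minimal Python sketch of the SUR feasible GLS estimator is given below (not from the notes); it forms Σ̂ from equation-by-equation OLS residuals and uses an explicit Kronecker product for clarity rather than efficiency.

import numpy as np
from scipy.linalg import block_diag

def sur_fgls(ys, Xs):
    """Feasible GLS for a SUR system.
    ys : list of g response vectors (each of length n)
    Xs : list of g regressor matrices (each n x k_i)"""
    g, n = len(ys), len(ys[0])
    y_all = np.concatenate(ys)
    X_all = block_diag(*Xs)
    # First step: equation-by-equation OLS residuals
    U = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                         for y, X in zip(ys, Xs)])
    Sigma = U.T @ U / n                                 # Sigma_hat = U'U / n
    Om_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))   # Sigma^{-1} kron I_n
    A = X_all.T @ Om_inv @ X_all
    b = X_all.T @ Om_inv @ y_all
    beta = np.linalg.solve(A, b)
    cov = np.linalg.inv(A)                              # estimated covariance matrix
    return beta, cov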
4.1.2 Maximum Likelihood Estimation
Let z be a random m-vector with known density f_z(z), and let x be another random m-vector
such that z = h(x); then

    f_x(x) = f_z( h(x) ) |det( ∂h(x)/∂x )|.

We rewrite the SUR system as

    y• = X• β• + u•,    u• ∼ N(0, Σ ⊗ I_n).

Since u• = y• − X• β•, and given that the pdf of u• is

    f(u•) = (2π)^{−gn/2} |Σ ⊗ I_n|^{−1/2} exp( −(1/2)(y• − X•β•)⊤ (Σ^{-1} ⊗ I_n)(y• − X•β•) ),

we have

    f(y•) = (2π)^{−gn/2} |Σ ⊗ I_n|^{−1/2} exp( −(1/2)(y• − X•β•)⊤ (Σ^{-1} ⊗ I_n)(y• − X•β•) ) |det( ∂h(y)/∂y )|.

Note that h(y) = y• − X• β•; therefore, the determinant above equals 1 if there are no
lagged dependent variables in the matrix X•.
Taking the log of the above equation, we find that the loglikelihood function l(Σ, β•) equals

    −(gn/2) log 2π − (n/2) log |Σ| − (1/2)(y• − X•β•)⊤ (Σ^{-1} ⊗ I_n)(y• − X•β•).

You are unlikely to obtain a result if you use only the above loglikelihood function for
maximum likelihood estimation. In fact, many modern software packages take advantage of the
following relation, with or without telling the user: letting σ_{ij} be the ij-th element of Σ,
we have

    σ_{ij} = (1/n) (y_i − X_i β_i)⊤ (y_j − X_j β_j).

If we define the n × g matrix U(β•) to have i-th column y_i − X_i β_i, then

    Σ = (1/n) U⊤(β•) U(β•).

We can replace the Σ in the loglikelihood function with (1/n) U⊤(β•) U(β•) and then take the
FOC with respect to β•. Once we have obtained β̂•_ML, we can use it to compute Σ̂_ML.
Consequently, we can estimate the covariance matrix of β̂•_ML as

    Var̂(β̂•_ML) = ( X•⊤ (Σ̂_ML^{-1} ⊗ I_n) X• )^{-1}.
•
4.2
−1
Linear Simultaneous Equations Models
The linear simultaneous equations models in this section are indeed multivariate IV
models. The ith equation of a linear simultaneous system can be written as
y i = X i β i + ui = Z i β 1i + Y i β 2i + ui ,
where the regressors X i can be partitioned into two parts: the predetermined Z i (n × k1i )
and the endogenous Y i (n × k2i ). Although we can still use a single equation to represent
the whole system:
y • = X • β • + u• ,
E(u• u>
• ) = Σ ⊗ I n.
But we need to keep in mind that X • is correlated with u• .
4.2.1 GMM Estimation
We can adopt GMM estimation to solve the above model. The theoretical moment conditions
that lead to the efficient GMM estimator are

    E( X̄⊤ Ω^{-1} (y − Xβ) ) = 0.                                              (4.1)

Let W denote an n × l matrix of exogenous and predetermined variables. This instrument
matrix contains all the Z_i plus instruments for each Y_i. As in IV estimation, we premultiply
the regressors X_i by P_W and define X̂_i = P_W X_i. We construct a block-diagonal matrix X̂•,
with diagonal blocks the X̂_i, to replace the X̄ in (4.1). This allows us to write the estimating
equations for efficient GMM estimation as

    X̂•⊤ (Σ^{-1} ⊗ I_n)(y• − X• β•) = 0.                                        (4.2)

The above equation looks like a set of equations for GMM estimation. In fact, it is more like a
group of equations for IV estimation. This becomes clear if we rewrite the equation in the form

    [ σ^{11} X₁⊤ P_W    ⋯   σ^{1g} X₁⊤ P_W      [ y₁ − X₁β₁
      ⋮                 ⋱   ⋮                     ⋮             = 0.
      σ^{g1} X_g⊤ P_W   ⋯   σ^{gg} X_g⊤ P_W ]     y_g − X_gβ_g ]

For a particular element i, we have

    Σ_{j=1}^{g} σ^{ij} X_i⊤ P_W (y_j − X_j β_j) = 0,

or

    Σ_{j=1}^{g} X_i⊤ σ^{ij} P_W (y_j − X_j β_j) = 0.

Therefore, we can rewrite the estimating equations (4.2) as

    X•⊤ (Σ^{-1} ⊗ P_W)(y• − X• β•) = 0.

The above equations are not feasible unless Σ is known. We need to first obtain a consistent
estimate of Σ and then compute the estimate of β•. The procedure can be summarized in the
following three steps:

(1) First, estimate the individual equations of the system by 2SLS.

(2) Then, use the 2SLS residuals to compute the matrix Σ̂_2SLS:

    Σ̂_2SLS = (1/n) Û⊤ Û,

where Û is an n × g matrix with i-th column û_i^{2SLS}.

(3) Finally, compute

    β̂•_3SLS = ( X•⊤ (Σ̂_2SLS^{-1} ⊗ P_W) X• )^{-1} X•⊤ (Σ̂_2SLS^{-1} ⊗ P_W) y•.

This estimator is called the three-stage least squares, or 3SLS, estimator.
We can estimate the covariance matrix of the classical 3SLS estimator by

    Var̂(β̂•_3SLS) = ( X•⊤ (Σ̂_2SLS^{-1} ⊗ P_W) X• )^{-1}.
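These three steps can be sketched in Python as follows (not from the notes); the explicit Kronecker product is used for clarity and would be avoided in a serious implementation.

import numpy as np
from scipy.linalg import block_diag

def three_sls(ys, Xs, W):
    """Three-stage least squares for a linear simultaneous system.
    ys : list of g left-hand-side vectors; Xs : list of g regressor matrices
    W  : common n x l instrument matrix (all exogenous and predetermined variables)."""
    g, n = len(ys), len(ys[0])
    PW = W @ np.linalg.solve(W.T @ W, W.T)            # projection onto S(W)
    # Stages 1-2: 2SLS equation by equation, then residual covariance
    betas_2sls = [np.linalg.lstsq(PW @ X, PW @ y, rcond=None)[0]
                  for y, X in zip(ys, Xs)]
    U = np.column_stack([y - X @ b for y, X, b in zip(ys, Xs, betas_2sls)])
    Sigma = U.T @ U / n
    # Stage 3: beta = [X'(Sigma^{-1} kron P_W)X]^{-1} X'(Sigma^{-1} kron P_W)y
    X_all, y_all = block_diag(*Xs), np.concatenate(ys)
    M = np.kron(np.linalg.inv(Sigma), PW)
    A = X_all.T @ M @ X_all
    beta = np.linalg.solve(A, X_all.T @ M @ y_all)
    return beta, np.linalg.inv(A)                     # 3SLS estimates and covariance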
4.2.2 Structural and Reduced Forms
When the equations of the system are written in the form

    y_i = X_i β_i + u_i = Z_i β_{1i} + Y_i β_{2i} + u_i,                        (4.3)

it is normally the case that each equation has a direct economic interpretation. It is for this
reason that they are called structural equations. The full system of equations constitutes
what is called the structural form of the model.
Instead of stacking the equations vertically as we did before, we can also stack them
horizontally. Define the n × g matrix Y as [y₁, y₂, ..., y_g]. Similarly, define an n × g matrix U
of which each column corresponds to a vector u_i of error terms. In this notation, the
structural form can be represented as

    Y Γ = W B + U.                                                              (4.4)

According to the textbook, (4.3) is embodied in (4.4) through the equation

    [ y_i  Y_i ] ( 1, −β_{2i}⊤ )⊤ = Z_i β_{1i} + u_i.

The above equation is a bit tricky. Let me explain it in detail:

(a) First, the endogenous variables Y_i in equation i are dependent variables in some other
equations; say Y_i = y_j for one such equation j. Therefore, the i-th column of the LHS of
(4.4) is

    Y Γ_i = [ y₁ ... y_i ... y_j ... y_g ] ( 0, ..., 1, ..., −β_{2i}, ..., 0 )⊤ = [ y_i  Y_i ] ( 1, −β_{2i}⊤ )⊤,

where the 1 appears in position i and the elements of −β_{2i} appear in the positions of the
corresponding y_j.

(b) For the i-th column of the RHS of (4.4), it is more intuitive to rewrite W as

    W = [ Z₁ ... Z_i ... Z_g  C ],

where the instrument matrix W contains all the exogenous variables Z_i and some additional
instrumental variables C. Therefore, the i-th column of the RHS of (4.4) is

    W B_i + u_i = [ Z₁ ... Z_i ... Z_g  C ] ( 0, ..., β_{1i}⊤, ..., 0, ..., 0 )⊤ + u_i = Z_i β_{1i} + u_i.

We can postmultiply both sides of equation (4.4) by Γ^{-1} to obtain

    Y = W B Γ^{-1} + V,

where V ≡ U Γ^{-1}. This representation is called the restricted reduced form, or RRF.
It stands in contrast to the unrestricted reduced form, or URF,

    Y = W Π + V,

in which we simply treat Π as the RHS coefficients without imposing the restriction Π = B Γ^{-1}.
4.2.3 Maximum Likelihood Estimation
For simplicity, we assume that the error terms are normally distributed:

    y• = X• β• + u•,    u• ∼ N(0, Σ ⊗ I_n).

The maximum likelihood estimator of a linear simultaneous system is called the full-information
maximum likelihood, or FIML, estimator. The loglikelihood function is

    −(gn/2) log 2π − (n/2) log |Σ| + n log |det Γ| − (1/2)(y• − X•β•)⊤ (Σ^{-1} ⊗ I_n)(y• − X•β•).

As usual, we need the following equation to link Σ and β•, or, equivalently, B and Γ:

    Σ = (1/n) (Y Γ − W B)⊤ (Y Γ − W B).

There are many numerical methods for obtaining FIML estimates. One of them is to
make use of the artificial regression

    (Φ⊤ ⊗ I_n)(y• − X• β•) = (Φ⊤ ⊗ I_n) X•(B, Γ) b + residual,

where Φ Φ⊤ = Σ^{-1}. We start from initial consistent estimates and then use the above equation
to update the estimates of B and Γ. Another approach is to concentrate the loglikelihood
function with respect to Σ. The concentrated loglikelihood function can be written as

    −(gn/2)(log 2π + 1) + n log |det Γ| − (n/2) log | (1/n)(Y Γ − W B)⊤ (Y Γ − W B) |.

We can obtain B̂_ML and Γ̂_ML directly by maximizing the above expression with respect to
B and Γ.
It is also important to test any overidentifying restrictions, for which it is natural to use
a LR test. The restricted value of the loglikelihood function is the maximized value of

    LR_r = −(gn/2)(log 2π + 1) + n log |det Γ̂| − (n/2) log | (1/n)(Y Γ̂ − W B̂)⊤ (Y Γ̂ − W B̂) |.

The unrestricted value is

    LR_u = −(gn/2)(log 2π + 1) − (n/2) log | (1/n)(Y − W Π̂)⊤ (Y − W Π̂) |,

where Π̂ denotes the matrix of OLS estimates of the parameters of the URF. The test statistic
is simply

    2(LR_u − LR_r) ∼ χ²(gl − k)    asymptotically.

When a system of equations consists of just one structural equation, we can write it as

    y = Xβ + u = Zβ₁ + Y β₂ + u,

where β₁ is k₁ × 1, β₂ is k₂ × 1 and k = k₁ + k₂. Since the above equation includes the
endogenous variables Y, we can complete the simultaneous system by adding the URF equations

    Y = W Π + V = Z Π₁ + W₁ Π₂ + V.

Maximum likelihood estimation of this system is called limited-information maximum
likelihood, or LIML. We can treat LIML as FIML applied to a system in which only one
equation is overidentified. In fact, LIML is a single-equation estimation method. Although its
name includes the term maximum likelihood, the most popular way to calculate the coefficients
is more like a least squares estimation. Such calculations were investigated by statisticians as
early as 1949, by Anderson and Rubin.
Anderson and Rubin proposed calculating the coefficients β₂ by minimizing the ratio

    κ = ( (y − Y β₂)⊤ M_Z (y − Y β₂) ) / ( (y − Y β₂)⊤ M_W (y − Y β₂) ).

This is why the LIML estimator is also referred to as the least variance ratio estimator.
Once we obtain the estimate κ̂, we compute the LIML coefficients β̂_LIML following

    β̂_LIML = ( X⊤ (I − κ̂ M_W) X )^{-1} X⊤ (I − κ̂ M_W) y.

A suitable estimate of the covariance matrix of the LIML estimator is

    Var̂(β̂_LIML) = σ̂² ( X⊤ (I − κ̂ M_W) X )^{-1},

where

    σ̂² = (1/n) (y − X β̂_LIML)⊤ (y − X β̂_LIML).

One significant virtue of LIML (and also of FIML) is its invariance to the normalization of
the equation. This is a very useful and important property, since simultaneous equations
systems can be parameterized in many different ways. Consider the following demand-supply
model:

    q_t = γ_d p_t + X_{dt} β_d + u_{dt},
    q_t = γ_s p_t + X_{st} β_s + u_{st}.

We can certainly reparameterize the whole system as

    p_t = γ'_d q_t + X_{dt} β'_d + residual,
    p_t = γ'_s q_t + X_{st} β'_s + residual,

where

    γ'_d = 1/γ_d,    β'_d = −β_d/γ_d,    γ'_s = 1/γ_s,    β'_s = −β_s/γ_s.      (4.5)

The invariance property implies that the FIML or LIML estimates bear precisely the same
relationships as the true parameters in (4.5). This nice property is not shared by 2SLS, 3SLS
or other GMM estimators.
Chapter 5
Methods for Stationary Time-Series Data

5.1 AR Process
The p-th order autoregressive, or AR(p), process can be written as

    y_t = γ + ρ₁ y_{t−1} + ρ₂ y_{t−2} + ... + ρ_p y_{t−p} + ε_t,    ε_t ∼ IID(0, σ_ε²),

where the process ε_t is often referred to as white noise. If we define u_t ≡ y_t − E(y_t),
with E(y_t) = γ/(1 − Σ_{i=1}^{p} ρ_i), the above equation can then be rewritten as

    u_t = Σ_{i=1}^{p} ρ_i u_{t−i} + ε_t,

or, with the lag operator notation,

    u_t = ρ(L) u_t + ε_t,

or as

    ( 1 − ρ(L) ) u_t = ε_t.

For u_t an AR(1) process, the autocovariance matrix is

    Ω(ρ) = ( σ_ε² / (1 − ρ²) ) [ 1         ρ         ⋯   ρ^{n−1}
                                 ρ         1         ⋯   ρ^{n−2}
                                 ⋮         ⋮         ⋱   ⋮
                                 ρ^{n−1}   ρ^{n−2}   ⋯   1       ].

Applications that involve autoregressions of order greater than two are relatively unusual.
Nonetheless, higher-order models can be handled in the same fashion. Let v_i denote the
covariance of u_t and u_{t−i}, for i = 0, 1, ..., p. The elements of the autocovariance matrix,
the autocovariances, obey the Yule-Walker equations

    v₀ = ρ₁ v₁ + ρ₂ v₂ + ... + ρ_p v_p + σ_ε²,
    v₁ = ρ₁ v₀ + ρ₂ v₁ + ... + ρ_p v_{p−1},
    ⋮
    v_p = ρ₁ v_{p−1} + ρ₂ v_{p−2} + ... + ρ_p v₀.

We solve this system of linear equations for v_i, i = 0, ..., p. We compute the remaining v_k
with k > p from the recursion

    v_k = ρ₁ v_{k−1} + ρ₂ v_{k−2} + ... + ρ_p v_{k−p}.
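A small Python sketch of this computation is given below (not from the notes): it solves the Yule-Walker system for v₀, ..., v_p and then applies the recursion, with the AR(1) case as a check.

import numpy as np

def ar_autocovariances(rho, sigma2_eps, n_lags):
    """Solve the Yule-Walker equations for v_0, ..., v_p of an AR(p) process,
    then extend recursively to v_k for k > p.
    rho : AR coefficients (rho_1, ..., rho_p); sigma2_eps : innovation variance."""
    rho = np.asarray(rho, dtype=float)
    p = len(rho)
    # Linear system A v = c in the unknowns v_0, ..., v_p
    A = np.zeros((p + 1, p + 1))
    c = np.zeros(p + 1)
    A[0, 0] = 1.0                 # equation for v_0: v_0 - sum_i rho_i v_i = sigma2_eps
    A[0, 1:] = -rho
    c[0] = sigma2_eps
    for j in range(1, p + 1):     # equations for v_j: v_j - sum_i rho_i v_{|j-i|} = 0
        A[j, j] += 1.0
        for i in range(1, p + 1):
            A[j, abs(j - i)] -= rho[i - 1]
    v = list(np.linalg.solve(A, c))
    for k in range(p + 1, n_lags + 1):    # recursion v_k = sum_i rho_i v_{k-i}
        v.append(sum(rho[i] * v[k - i - 1] for i in range(p)))
    return np.array(v)

# AR(1) check: v_0 should equal sigma2 / (1 - rho^2) and v_k = rho^k v_0
print(ar_autocovariances([0.6], 1.0, 5), 1.0 / (1 - 0.36))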
If there are n observations, the autocovariance matrix is simply

    Ω = [ v₀        v₁        v₂        ⋯   v_{n−1}
          v₁        v₀        v₁        ⋯   v_{n−2}
          v₂        v₁        v₀        ⋯   v_{n−3}
          ⋮         ⋮         ⋮         ⋱   ⋮
          v_{n−1}   v_{n−2}   v_{n−3}   ⋯   v₀      ].

If we normalize this matrix by multiplying it by a scalar chosen to make the diagonal
elements equal to unity, the result is the autocorrelation matrix, and we call its elements
the autocorrelations.
There are three popular methods for estimating the AR(p) model:

(a) Drop the first p observations and estimate the nonlinear regression model

    y_t = X_t β + Σ_{i=1}^{p} ρ_i ( y_{t−i} − X_{t−i} β ) + ε_t

by NLS.

(b) Estimate by feasible GLS, possibly iterated.

(c) Estimate by the GNR that corresponds to the nonlinear regression, with an extra
artificial observation corresponding to the first observation.
5.2 MA Process and ARMA Process

A q-th order moving-average, or MA(q), process can be written as
    y_t = µ + α₀ ε_t + α₁ ε_{t−1} + ... + α_q ε_{t−q},

where E(y_t) = µ. By defining u_t ≡ y_t − µ, we can write

    u_t = Σ_{j=0}^{q} α_j ε_{t−j} = ( 1 + α(L) ) ε_t.

The autocovariances of an MA process are much easier to calculate. Since the ε_t are
white noise, and hence uncorrelated, the variance of the u_t is

    Var(u_t) = E(u_t²) = σ_ε² ( 1 + Σ_{j=1}^{q} α_j² ).

Similarly, the j-th order autocovariance is, for j > 0,

    E(u_t u_{t−j}) = σ_ε² ( α_j + Σ_{i=1}^{q−j} α_{j+i} α_i )    for j < q,
    E(u_t u_{t−j}) = σ_ε² α_j                                    for j = q,
    E(u_t u_{t−j}) = 0                                           for j > q.
The autoregressive moving-average process, or ARMA process, is the combination
of an MA process with an AR process. In general, we can write an ARMA(p, q) process with
nonzero mean as

    ( 1 − ρ(L) ) y_t = γ + ( 1 + α(L) ) ε_t.

For any known stationary ARMA process, the autocorrelation between u_t and u_{t−j} can be
easily calculated. For an ARMA process of possibly unknown order, we define the
autocorrelation function, or ACF, which expresses the autocorrelation as a function of the lag
j for j = 1, 2, ... If we have a sample y_t, t = 1, ..., n, the j-th order autocorrelation ρ(j) can be
estimated using the formula

    ρ̂(j) = Cov̂(y_t, y_{t−j}) / Var̂(y_t).

The autocorrelation function ACF(j) gives the gross correlation between y_t and y_{t−j}.
But in this setting we may observe, for example, that a correlation between y_t and y_{t−2} could
arise primarily because both variables are correlated with y_{t−1}. Therefore, we might ask
what is the correlation between y_t and y_{t−2} net of the intervening effect of y_{t−1}. We use
the partial autocorrelation function, or PACF, to characterize this relationship. The partial
autocorrelation coefficient of order j is defined as the plim of the least squares estimator of
the coefficient ρ_j^{(j)} in the linear regression

    y_t = γ^{(j)} + ρ₁^{(j)} y_{t−1} + ... + ρ_j^{(j)} y_{t−j} + ε_t

for j = 1, ..., J. We can calculate the empirical PACF up to order J by running the above
regression and retaining only the estimate ρ̂_j^{(j)} for each j.
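The empirical ACF and PACF described above can be computed with a short Python sketch such as the following (not from the notes; the function names are mine).

import numpy as np

def acf(y, max_lag):
    """Empirical autocorrelations rho_hat(j) = Cov_hat(y_t, y_{t-j}) / Var_hat(y_t)."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    denom = y @ y / n
    return np.array([(y[j:] @ y[:n - j] / n) / denom for j in range(1, max_lag + 1)])

def pacf(y, max_lag):
    """Empirical partial autocorrelations: for each j, regress y_t on a constant and
    y_{t-1}, ..., y_{t-j} and keep the coefficient on y_{t-j}."""
    y = np.asarray(y, dtype=float)
    out = []
    for j in range(1, max_lag + 1):
        Y = y[j:]
        X = np.column_stack([np.ones(len(Y))]
                            + [y[j - i:len(y) - i] for i in range(1, j + 1)])
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        out.append(coef[-1])            # coefficient on y_{t-j}
    return np.array(out)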
In practice, it is necessary to allow y_t to depend on exogenous explanatory variables. Such
models are sometimes referred to as ARMAX models. An ARMAX(p, q) model takes the
form

    y_t = X_t β + u_t,    u_t ∼ ARMA(p, q),  E(u_t) = 0.

Let Ω denote the autocovariance matrix of the vector y. Note that Ω is composed of the v_i,
which involve the parameters ρ_i and α_j of the ARMA(p, q) process. The log of the joint
density of the observed sample is

    −(n/2) log 2π − (1/2) log |Ω| − (1/2)(y − Xβ)⊤ Ω^{-1} (y − Xβ).

We can certainly build our likelihood function from the above equation and solve it by
MLE. However, this seemingly trivial procedure can be computationally difficult, because
Ω is an n × n matrix and each of its elements is a nonlinear combination of the unknown
ARMA parameters.
5.3 Single-Equation Dynamic Models
When a dependent variable depends on current and lagged values of xt , but not on lagged
values of itself, we have what is called a distributed lag model:
q
X
βj xt−j + ut , ut ∼ IID(0, σ 2 ).
yt = δ +
j=0
The OLS estimates of the βj may be quite imprecise. However, this is not a problem if we
are merely interested in the long-run impact of changes in the independent variable
q
q
X
X
∂yt
βj =
γ≡
.
∂x
t−j
j=0
j=0
One popular alternative to distributed lag models is the partial adjustment model.
Suppose that the desired level of an economic variable yt is yt◦ . This desired level is assumed
to depend on a vector of exogenous variables X t according to
yt◦ = X t β ◦ + et ,
et ∼ IID(0, σe2 ).
The term yt is assumed to adjust toward yt◦ according to the equation
yt − yt−1 = (1 − δ)(yt◦ − yt−1 ) + vt ,
vt ∼ IID(0, σv2 ),
where δ is an adjustment parameter that is assumed to be positive and strictly less than 1.
Then, we have
yt = yt−1 − (1 − δ)yt−1 + (1 − δ)X t β ◦ + (1 − δ)et + vt
= X t β + δyt−1 + ut ,
where β ≡ (1 − δ)β° and u_t ≡ (1 − δ)e_t + v_t. Under the assumption that |δ| < 1, we find
that
\[
y_t = \sum_{j=0}^{\infty}\delta^j X_{t-j}\beta + \sum_{j=0}^{\infty}\delta^j u_{t-j},
\]
which implies a particular form of distributed lag.
An autoregressive distributed lag, or ADL, model can be written as
\[
y_t = \beta_0 + \sum_{i=1}^{p}\beta_i y_{t-i} + \sum_{j=0}^{q}\gamma_j x_{t-j} + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2).
\]
This is sometimes called an ADL(p, q) model. A widely encountered case is the ADL(1, 1)
model
yt = β0 + β1 yt−1 + γ0 xt + γ1 xt−1 + ut .
It is straightforward to check that the ADL(1, 1) model can be rewritten as
∆yt = β0 + (β1 − 1)(yt−1 − λxt−1 ) + γ0 ∆xt + ut ,
where λ = (γ_0 + γ_1)/(1 − β_1). The above equation is called an error-correction model. It expresses the
ADL(1, 1) model in terms of an error-correction mechanism. Because the ECM
is nonlinear in the parameters, we need to use NLS to estimate it. The ECM is not very popular
for dealing with stationary time series. However, it is very popular for estimating models with
unit roots or cointegration.
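As a quick sketch (assuming y and x are NumPy arrays of equal length holding stationary series; the function name is ours), the ADL(1, 1) model can be estimated by OLS and the long-run coefficient λ recovered as follows:

    import numpy as np

    def adl11_longrun(y, x):
        """OLS on y_t = b0 + b1*y_{t-1} + g0*x_t + g1*x_{t-1} + u_t,
           then lambda_hat = (g0 + g1) / (1 - b1)."""
        Y = y[1:]
        X = np.column_stack([np.ones(len(Y)), y[:-1], x[1:], x[:-1]])
        b0, b1, g0, g1 = np.linalg.lstsq(X, Y, rcond=None)[0]
        return (g0 + g1) / (1.0 - b1)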
5.4 Seasonality
Many economic time series display a regular pattern of seasonal variation over the course of
every year. There are two different ways to deal with seasonality in economic data:
(a) Try to model it explicitly. For example, seasonal ARMA process: AR(4), ARMA(12,
4)...; seasonal ADL models: ADL(4, q)..., etc.
(b) Use seasonally adjusted data, which have been filtered to remove the seasonal
variation. Of course, there are serious consequences to doing that. So, don't do it.
5.5 ARCH, GARCH
The concept of autoregressive conditional heteroskedasticity, or ARCH, was introduced by Engle (1982). The basic idea of ARCH models is that the variance of the error
term at time t depends on the realized values of the squared error terms in previous time
periods. Let u_t denote the error term and Ω_{t-1} denote an information set that consists of
data observed through period t − 1. An ARCH(q) process can be written as
\[
u_t = \sigma_t\varepsilon_t; \qquad \sigma_t^2 \equiv E(u_t^2\,|\,\Omega_{t-1}) = \alpha_0 + \sum_{i=1}^{q}\alpha_i u_{t-i}^2,
\]
where α_i > 0 for i = 0, 1, ..., q and ε_t is white noise with variance 1. The above function is
clearly autoregressive. Since it depends on t, the model is also heteroskedastic.
Also, the variance of u_t is a function of u_{t-1} through σ_t, which means that the variance of u_t
is conditional on the past of the process; that is where the term conditional comes from.
The error terms u_t and u_{t-1} are clearly dependent. They are, however, uncorrelated. Thus,
an ARCH process involves only heteroskedasticity, not serial correlation. The original
ARCH process has not proven to be very satisfactory in applied work.
In fact, the ARCH model became famous because of its descendant: the generalized
ARCH model, which was proposed by Bollerslev (1986). We may write a GARCH(p, q)
process as
\[
u_t = \sigma_t\varepsilon_t; \qquad \sigma_t^2 \equiv E(u_t^2\,|\,\Omega_{t-1}) = \alpha_0 + \sum_{i=1}^{q}\alpha_i u_{t-i}^2 + \sum_{j=1}^{p}\delta_j\sigma_{t-j}^2.
\]
The conditional variance here can be written more compactly as
\[
\sigma_t^2 = \alpha_0 + \alpha(L)u_t^2 + \delta(L)\sigma_t^2.
\]
The simplest and by far the most popular GARCH model is the GARCH(1, 1) process, for
which the conditional variance can be written as
\[
\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2.
\]
Unlike the original ARCH model, the GARCH(1, 1) process generally seems to work quite
well in practice. More precisely, GARCH(1, 1) cannot be rejected against any more general
GARCH(p, q) process in many cases.
There are two possible methods to estimate the ARCH and GARCH models:
(1) Feasible GLS: since ARCH and GARCH processes induce heteroskedasticity, it might
seem natural to use feasible GLS. However, this approach is very rarely used,
because it is not asymptotically efficient. In the case of a GARCH(1, 1) model, \hat\sigma_t^2 depends on
\hat u_{t-1}^2, which in turn depends on the estimates of the regression function. Because of
this, estimating the following equations jointly yields more efficient estimates:
\[
y_t = X_t\beta + u_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2.
\]
(2) MLE: the most popular way to estimate GARCH models is to assume that the error
terms are normally distributed and use the ML method. To do that, we first write a linear
regression model with GARCH errors defined in terms of a normal innovation process
as
\[
\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)} = \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1).
\]
The density of y_t conditional on Ω_{t-1} is then
\[
\frac{1}{\sigma_t(\beta, \theta)}\,\phi\Bigl(\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)}\Bigr),
\]
where φ(·) denotes the standard normal density. Therefore, the contribution to the
loglikelihood function made by the tth observation is
\[
\ell_t(\beta, \theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma_t^2(\beta, \theta) - \frac{1}{2}\,\frac{(y_t - X_t\beta)^2}{\sigma_t^2(\beta, \theta)}.
\]
This function is not easy to calculate, because of the skedastic function σ_t²(β, θ). It is
defined implicitly by the recursion
\[
\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2,
\]
and there are no good starting values for σ_{t-1}². An ARCH(q) model does not have the
lagged σ_t² term and therefore does not have this problem: we can simply use the first
q observations to compute the squared residuals so as to form the skedastic function
σ_t²(β, θ). For the starting values of the lagged σ_t², there are some popular ad hoc procedures:
(a) Set all unknown pre-sample values of uˆ2t and σt2 to zero.
(b) Replace them by an estimate of their common unconditional expectation: an
appropriate function of the θ parameters, or use the SSR/n from OLS estimation.
(c) Treat the unknown starting values as extra parameters.
In any case, different procedures can produce very different results. When using Stata or any
other black-box program, users should know what the package is actually doing.
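As an illustration of the ML approach (a sketch, not a full estimation routine; it assumes a regression with only a constant term and initializes the recursion at the unconditional variance of the residuals, procedure (b) above), the GARCH(1, 1) loglikelihood can be evaluated and maximized as follows:

    import numpy as np
    from scipy.optimize import minimize

    def garch11_negloglik(params, y):
        """Negative loglikelihood of y_t = mu + u_t with GARCH(1,1) errors."""
        mu, a0, a1, d1 = params
        u = y - mu
        n = len(u)
        sig2 = np.empty(n)
        sig2[0] = u.var()                  # ad hoc starting value (procedure (b))
        for t in range(1, n):
            sig2[t] = a0 + a1 * u[t - 1] ** 2 + d1 * sig2[t - 1]
        ll = -0.5 * (np.log(2 * np.pi) + np.log(sig2) + u ** 2 / sig2)
        return -ll.sum()

    def fit_garch11(y):
        start = np.array([y.mean(), 0.1 * y.var(), 0.1, 0.8])
        bounds = [(None, None), (1e-8, None), (0.0, 1.0), (0.0, 1.0)]
        return minimize(garch11_negloglik, start, args=(y,),
                        method="L-BFGS-B", bounds=bounds)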
Chapter 6
Unit Roots and Cointegration
Before we get started, there are several definitions we need to be familiar with:
(1) I(0): as t → ∞, the first and second moments tend to fixed stationary values, and
the covariances of the elements yt and ys tend to stationary values that depend only
on |t − s|. Such a series is said to be integrated to order zero, or I(0).
This is a necessary, but not sufficient condition for a stationary process. Therefore, all
stationary processes are I(0), but not all I(0) processes are stationary.
(2) I(1) or unit root: A nonstationary time series is said to be integrated to order one,
or I(1), if the series of its first differences
∆yt ≡ yt − yt−1 = (1 − L)yt
is I(0). We also say the I(1) series has a unit root.
(3) I(d): a series is said to be integrated to order d, or I(d), if it must be differenced d
times to obtain an I(0) series. Note that if y_t = ε_t ∼ I(0), then ∆y_t = ε_t − ε_{t-1} ∼ I(−1).
This is what we call an overdifferencing problem.
(4) Cointegration: if two or more series are individually integrated, but some linear
combination of them has a lower order of integration, then the series are said to be
cointegrated.
(5) Brownian motion: also referred to as a standardized Wiener process. This process,
denoted W(r) for 0 ≤ r ≤ 1, can be interpreted as the limit of the standardized random
walk w_t as the length of each interval becomes infinitesimally small. It is defined as
\[
W(r) \equiv \plim_{n\to\infty} n^{-1/2} w_{[rn]} = \plim_{n\to\infty} n^{-1/2}\sum_{t=1}^{[rn]}\varepsilon_t,
\qquad \varepsilon_t \sim \mathrm{IID}(0, 1),
\]
where [rn] means the integer part of the quantity rn.
6.1 Unit Root
A random walk process is defined as
yt = yt−1 + et ,
y0 = 0,
et ∼ IID(0, σ 2 ).
This process is obviously nonstationary. An obvious generalization is to add a constant term,
which gives us the random walk with drift model
\[
y_t = \gamma_1 + y_{t-1} + e_t, \qquad y_0 = 0, \qquad e_t \sim \mathrm{IID}(0, \sigma^2).
\]
The first difference of y_t is
\[
\Delta y_t = \gamma_1 + e_t,
\]
which is I(0). Therefore, y_t is I(1), that is, it has a unit root.
There are many methods to test for unit roots. The simplest and most widely-used tests
are variants of the Dickey-Fuller tests, or DF tests. Consider the model
yt = βyt−1 + et ,
et ∼ IID(0, σ 2 ).
When β = 1, this model has a unit root. If we subtract yt−1 from both sides, we obtain
∆y_t = (β − 1)y_{t-1} + e_t.
The obvious way to test the unit root hypothesis is to use the t statistic for the hypothesis
β − 1 = 0 against the alternative that this quantity is negative. This statistic is usually
referred to as a τ statistic. Another possible test statistic is n times the OLS estimate of
β − 1. This statistic is called a z statistic. If we wish to test the unit root in a model where
the random walk has a drift, the appropriate test regression is
∆yt = γ0 + γ1 t + (β − 1)yt−1 + et ,
and if we wish to test for a unit root when the random walk has both a drift and a trend, the
appropriate test regression is
∆yt = γ0 + γ1 t + γ2 t2 + (β − 1)yt−1 + et .
The asymptotic distributions of the Dickey-Fuller test statistics are referred to as nonstandard distributions or as Dickey-Fuller distributions. We adopt one of the simplest
random walk models as an example. For the model
\[
w_t = w_{t-1} + \sigma\varepsilon_t, \qquad w_0 = 0, \qquad \varepsilon_t \sim \mathrm{IID}(0, 1),
\]
the associated z statistic is
\[
z_{nc} = n\,\frac{\sum_{t=2}^{n} w_{t-1}\varepsilon_t}{\sum_{t=1}^{n-1} w_t^2}
       = \frac{n^{-1}\sum_{t=2}^{n} w_{t-1}\varepsilon_t}{n^{-2}\sum_{t=1}^{n-1} w_t^2}. \tag{6.1}
\]
(a) For the numerator of (6.1): since
\[
\sum_{t=1}^{n} w_t^2 = \sum_{t=0}^{n-1}\bigl(w_t + (w_{t+1} - w_t)\bigr)^2
= \sum_{t=0}^{n-1} w_t^2 + 2\sum_{t=0}^{n-1} w_t\varepsilon_{t+1} + \sum_{t=0}^{n-1}\varepsilon_{t+1}^2,
\]
it follows that
\[
\sum_{t=0}^{n-1} w_t\varepsilon_{t+1} = \frac{1}{2}\Bigl(w_n^2 - \sum_{t=0}^{n-1}\varepsilon_{t+1}^2\Bigr).
\]
Therefore,
\[
\plim_{n\to\infty}\frac{1}{n}\sum_{t=2}^{n} w_{t-1}\varepsilon_t
= \plim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1} w_t\varepsilon_{t+1}
= \frac{1}{2}\bigl(W^2(1) - 1\bigr).
\]
(b) For the denominator of (6.1): since
\[
n^{-2}\sum_{t=1}^{n-1} w_t^2 \stackrel{a}{=} \frac{1}{n}\sum_{t=1}^{n-1} W^2\Bigl(\frac{t}{n}\Bigr),
\]
and
\[
\int_0^1 f(x)\,dx \equiv \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} f\Bigl(\frac{t}{n}\Bigr),
\]
we have
\[
\plim_{n\to\infty} n^{-2}\sum_{t=1}^{n-1} w_t^2 = \int_0^1 W^2(r)\,dr.
\]
These results imply that
\[
\plim_{n\to\infty} z_{nc} = \frac{\frac{1}{2}\bigl(W^2(1) - 1\bigr)}{\int_0^1 W^2(r)\,dr}.
\]
In practice, we need to simulate this distribution, since there is no simple, analytical expression for it.
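A minimal sketch of such a simulation (the number of observations and replications are arbitrary choices): generate many standardized random walks, compute (6.1) for each, and use the empirical quantiles as approximate critical values.

    import numpy as np

    def simulate_znc(n=500, reps=10000, seed=0):
        """Simulate the distribution of the Dickey-Fuller z_nc statistic."""
        rng = np.random.default_rng(seed)
        stats = np.empty(reps)
        for r in range(reps):
            eps = rng.standard_normal(n)
            w = np.cumsum(eps)                       # standardized random walk
            num = np.sum(w[:-1] * eps[1:])           # sum_{t=2}^{n} w_{t-1} eps_t
            den = np.sum(w[:-1] ** 2)                # sum_{t=1}^{n-1} w_t^2
            stats[r] = n * num / den
        return stats

    z = simulate_znc()
    print(np.percentile(z, [1, 5, 10]))              # approximate critical values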
The models we have seen so far do not include any economic variables beyond y_{t-1}. If
the error terms are serially correlated, the DF tests are no longer asymptotically valid. The
most popular approach is to use the so-called augmented Dickey-Fuller, or ADF, tests.
For a unit root test regression model
∆yt = X t γ ◦ + (β − 1)yt−1 + ut ,
we assume that the error term u_t follows an AR(1) process,
\[
u_t = \rho_1 u_{t-1} + e_t, \qquad e_t \sim \text{white noise}.
\]
Then the test regression becomes
\begin{align*}
\Delta y_t &= X_t\gamma^\circ - \rho_1 X_{t-1}\gamma^\circ + (\rho_1 + \beta - 1)y_{t-1} - \beta\rho_1 y_{t-2} + e_t \\
&= X_t\gamma + (\beta - 1)(1 - \rho_1)y_{t-1} + \delta_1\,\Delta y_{t-1} + e_t \\
&= X_t\gamma + \pi y_{t-1} + \delta_1\,\Delta y_{t-1} + e_t.
\end{align*}
Here, we let π = (β − 1)(1 − ρ_1). The τ statistic is simply the t test of π = 0. The z statistic
is a little trickier. Since ρ_1 ∈ (−1, 1), to test for a unit root we only need to consider whether
β − 1 = π/(1 − ρ_1) = 0. Therefore, a valid ADF z statistic is n\hat\pi/(1 − \hat\rho_1).
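A minimal sketch of the ADF test regression with a constant and one lagged difference, estimated by OLS with NumPy (the function name and the single-lag choice are ours; critical values for τ must come from Dickey-Fuller tables or simulation):

    import numpy as np

    def adf_tau(y):
        """ADF regression: dy_t = c + pi*y_{t-1} + d1*dy_{t-1} + e_t."""
        y = np.asarray(y, dtype=float)
        dy = np.diff(y)
        Y = dy[1:]
        X = np.column_stack([np.ones(len(Y)), y[1:-1], dy[:-1]])
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        resid = Y - X @ beta
        s2 = resid @ resid / (len(Y) - X.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
        tau = beta[1] / se[1]                        # t statistic on y_{t-1}
        return tau, beta[1], beta[2]                 # tau, pi_hat, delta1_hat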
6.2 Cointegration
We begin by considering the simplest case, namely, a VAR(1) model with just two variables.
The model can be written as
\[
y_{t1} = \phi_{11} y_{t-1,1} + \phi_{12} y_{t-1,2} + u_{t1}, \qquad
y_{t2} = \phi_{21} y_{t-1,1} + \phi_{22} y_{t-1,2} + u_{t2}, \qquad
(u_{t1}, u_{t2})^\top \sim \mathrm{IID}(0, \Omega).
\]
Let z_t and u_t be 2-vectors and let Φ be a 2 × 2 matrix. With proper definitions of z_t, u_t
and Φ, the above equations can be represented by
z t = Φz t−1 + ut ,
ut ∼ IID(0, Ω).
Both yt1 and yt2 are I(1) if the coefficients on the lagged dependent variable are equal to
unity, or equivalently, at least one of the eigenvalues of the matrix Φ is equal to 1. The series
y_{t1} and y_{t2} are cointegrated if there exists a 2-vector η with elements η_1 and η_2 such that
\[
\eta^\top z_t = \eta_1 y_{t1} - \eta_2 y_{t2}
\]
is I(0). The vector η is called a cointegrating vector. We call the whole system
CI(1, 1), which is short for cointegration where each process is I(1).
In practice, we usually allow the relationship between y_{t1} and y_{t2} to change over time
by adding a constant term and trend terms,
\[
\eta^\top z_t = X_t\gamma + v_t, \tag{6.2}
\]
where X t denotes a deterministic row vector that may or may not have any elements.
As we can see, η is not unique. Usually, we normalize η by setting the first element
to 1. In that case, equation (6.2) can be written as
\[
\eta^\top z_t = \begin{bmatrix} 1 & -\eta_2 \end{bmatrix}\begin{bmatrix} y_{t1} \\ y_{t2} \end{bmatrix} = X_t\gamma + v_t,
\]
that is,
\[
y_{t1} = X_t\gamma + \eta_2 y_{t2} + v_t.
\]
The OLS estimator \hat\eta_2^{L} is known as the levels estimator. It might seem odd to use OLS here,
since y_{t2} is correlated with v_t and v_t is probably serially correlated. In fact, we can
show not only that OLS is valid, but also that the OLS estimator \hat\eta_2^{L} is super-consistent.
Normally, the OLS estimator is root-n consistent, in the sense that \hat\beta_{OLS} - \beta_0 goes to zero like
n^{-1/2} as n → ∞. The estimator \hat\eta_2^{L} is n-consistent, or super-consistent, since \hat\eta_2^{L} - \eta_2 goes to
zero like n^{-1}, a much faster rate.
We assume that the first eigenvalue of Φ is λ_1 = 1 and that the second eigenvalue satisfies |λ_2| < 1.
Then the whole system can also be represented by an error-correction model,
\[
\Delta y_{t1} = X_t\gamma + \eta_2\,\Delta y_{t2} + (\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2}) + \text{resid}, \tag{6.3}
\]
where λ_2 is the second eigenvalue of Φ. (Of course, if we have a strong belief that the second
eigenvalue λ_2 = 1, it is no longer appropriate to use the ECM representation.)
Equation (6.3) is restrictive in that the parameter η_2 appears twice. For the
purpose of estimation and testing, we usually use the following unrestricted equation:
\[
\Delta y_{t1} = X_t\gamma + \alpha\,\Delta y_{t2} + \delta_1 y_{t-1,1} + \delta_2 y_{t-1,2} + \text{resid}.
\]
The ECM estimator is \hat\eta_2^{E} \equiv -\hat\delta_2/\hat\delta_1, using the OLS estimates of δ_1 and δ_2.
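As a sketch (with a constant as the only deterministic term; the function names are ours), the levels estimator and the ECM estimator can both be obtained from simple OLS regressions:

    import numpy as np

    def levels_estimator(y1, y2):
        """eta2_L from OLS of y1 on a constant and y2."""
        X = np.column_stack([np.ones(len(y1)), y2])
        return np.linalg.lstsq(X, y1, rcond=None)[0][1]

    def ecm_estimator(y1, y2):
        """eta2_E = -delta2_hat/delta1_hat from
           dy1_t = c + a*dy2_t + d1*y1_{t-1} + d2*y2_{t-1} + resid."""
        dy1, dy2 = np.diff(y1), np.diff(y2)
        X = np.column_stack([np.ones(len(dy1)), dy2, y1[:-1], y2[:-1]])
        c, a, d1, d2 = np.linalg.lstsq(X, dy1, rcond=None)[0]
        return -d2 / d1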
6.2.1 VAR Representation
The methods we have discussed so far for estimating cointegrating vectors are all single-equation
methods. When there is more than one cointegrating relation among a set of more than
two variables, we need to adopt the VAR representation. Consider the VAR
\[
Y_t = X_t B + \sum_{i=1}^{p+1} Y_{t-i}\Phi_i + U_t,
\]
where Y_t is a 1 × g vector of observations on the levels of a set of variables, each of which
is assumed to be I(1), X_t is a row vector of deterministics, B is a matrix of coefficients of
those deterministics, U_t is a 1 × g vector of error terms, and the Φ_i are g × g matrices of
coefficients. We can reparameterize the above equation as
\[
\Delta Y_t = X_t B + Y_{t-1}\Pi + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t, \tag{6.4}
\]
where
\[
\Gamma_p = -\Phi_{p+1}, \qquad \Gamma_i = \Gamma_{i+1} - \Phi_{i+1}, \qquad
\Pi = \sum_{i=1}^{p+1}\Phi_i - I_g.
\]
Suppose that the matrix Π has rank r, with 0 ≤ r < g. In this case, we can always write
\[
\Pi = \eta\alpha^\top,
\]
where η and α are both g × r matrices. Therefore, (6.4) can be rewritten as
\[
\Delta Y_t = X_t B + Y_{t-1}\eta\alpha^\top + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t.
\]
We can easily tell that not all the elements of η and α can be identified. One convenient
normalization is to set the first element of η to 1. To estimate the above nonlinear regression,
we can use the GNR, or MLE under the assumption that the error terms are normal.
6.2.2 Testing for Cointegration
There are three popular tests we can use to test for cointegration. To make our lives easier,
we consider a CI(1, 1) model:
(a) Engle-Granger Tests: the idea is to estimate the cointegrating regression
\[
y_{t1} = X_t\gamma + y_{t2}\eta_2 + v_t, \tag{6.5}
\]
where Y_t ≡ [y_{t1}  y_{t2}] and η = [1  −η_2]^\top, and then to test the resulting estimates of v_t with a
Dickey-Fuller test. If there is no cointegration, v_t ∼ I(1); v_t is I(0) if there is
cointegration.
The augmented EG test is performed in almost exactly the same way as an augmented
DF test, by running the regression
\[
\Delta\hat v_t = X_t\gamma + \beta\hat v_{t-1} + \sum_{j=1}^{p}\delta_j\,\Delta\hat v_{t-j} + e_t,
\]
where p is chosen to remove any evidence of serial correlation in the residuals. The
series \hat v_t are the estimated residuals from the EG regression (6.5).
(b) ECM Tests: a second way to test for cointegration involves the estimation of an error-correction model. For the same CI(1, 1) model, we have the restricted ECM
\[
\Delta y_{t1} = X_t\gamma + \eta_2\,\Delta y_{t2} + (\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2}) + \text{resid}.
\]
If there is no cointegration, the second eigenvalue λ_2 must be 1 (there is cointegration if |λ_2| < 1). This means that, in the unrestricted version
\[
\Delta y_{t1} = X_t\gamma + \alpha\,\Delta y_{t2} + \delta_1 y_{t-1,1} + \delta_2 y_{t-1,2} + \text{resid},
\]
the coefficients δ_1 and δ_2 should be zero. In practice, however, we do not need to test
that both are 0 at the same time; we can simply test whether δ_1 = 0 with a t statistic.
The distribution of this statistic is certainly nonnormal because of the unit root. If Y_t is a
1 × g vector, Ericsson and MacKinnon (2002) call it the κ_d(g) distribution; see the
paper for further details.
(c) VAR Test: as its name indicates, this test is based on a vector autoregression representation. The idea is to estimate the VAR model
\[
\Delta Y_t = X_t B + Y_{t-1}\Pi + \sum_{i=1}^{p}\Delta Y_{t-i}\Gamma_i + U_t,
\]
subject to the constraint
\[
\Pi = \eta\alpha^\top
\]
for various values of the rank r of the impact matrix Π, using ML estimation. Due to
this design, we can test a null under which there are any number of cointegrating
relations from 0 to g − 1 against alternatives with a greater number of relations, up to
a maximum of g. The most convenient test statistics are LR statistics.
In practice, we can perform the VAR LR test according to the following steps (a code sketch follows the list):
(1) Regress both ∆Y t and Y t−1 on the deterministic variables X t and the lags
∆Y t−1 , ∆Y t−2 through ∆Y t−p , which yields two sets of residuals,
\[
\hat V_{t1} = \Delta Y_t - \Delta\hat Y_t, \qquad \hat V_{t2} = Y_{t-1} - \hat Y_{t-1},
\]
where Vˆ t1 and Vˆ t2 are both 1 × g vectors. ∆Yˆ t and Yˆ t−1 denote the fitted values
from the regressions.
(2) Compute the g × g sample covariance matrices
\[
\hat\Sigma_{jl} = \frac{1}{n}\sum_{t=1}^{n}\hat V_{tj}^\top\hat V_{tl}, \qquad j = 1, 2, \quad l = 1, 2.
\]
(3) Then, we find the solutions λ_i and z_i, for i = 1, ..., g, to the equations
\[
\bigl(\lambda_i\hat\Sigma_{22} - \hat\Sigma_{21}\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\bigr)z_i = 0,
\]
where λ_i is a scalar and z_i is a g × 1 vector. This is similar to solving for
eigenvalues and eigenvectors. We first compute \hat\Psi_{22}, where
\[
\hat\Psi_{22}\hat\Psi_{22}^\top = \hat\Sigma_{22}^{-1}.
\]
We then find the eigenvalues of the PSD matrix A, in which
\[
A \equiv \hat\Psi_{22}^\top\hat\Sigma_{21}\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat\Psi_{22}.
\]
The eigenvalues of A are the same λ_i we are looking for. We sort the eigenvalues
λ_i from the largest to the smallest.
(4) It can be shown that the maximized loglikelihood function for the restricted model
is
\[
-\frac{gn}{2}\bigl(\log 2\pi + 1\bigr) - \frac{n}{2}\sum_{i=1}^{r}\log(1 - \lambda_i).
\]
(5) Therefore, to test the null hypothesis that r = r_1 against the alternative that
r = r_2, for r_1 < r_2 ≤ g, we compute the LR statistic
\[
-n\sum_{i=r_1+1}^{r_2}\log(1 - \lambda_i),
\]
which follows a nonstandard distribution, as usual.
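As a rough illustration of steps (1)–(5) (a sketch under simplifying assumptions: no lags of ΔY, i.e. p = 0, and a constant as the only deterministic term; the function name is ours), the eigenvalues λ_i and the LR statistics for the null r = r_1 against r = g can be computed as follows:

    import numpy as np

    def johansen_lr(Y):
        """Y is n x g (levels). Case p = 0, deterministics = constant only."""
        dY = np.diff(Y, axis=0)                      # Delta Y_t
        Ylag = Y[:-1, :]                             # Y_{t-1}
        n, g = dY.shape
        X = np.ones((n, 1))                          # deterministic term
        # Step (1): residuals from regressing dY and Ylag on the constant
        V1 = dY - X @ np.linalg.lstsq(X, dY, rcond=None)[0]
        V2 = Ylag - X @ np.linalg.lstsq(X, Ylag, rcond=None)[0]
        # Step (2): sample covariance matrices
        S11, S12 = V1.T @ V1 / n, V1.T @ V2 / n
        S21, S22 = V2.T @ V1 / n, V2.T @ V2 / n
        # Step (3): Psi22 Psi22' = S22^{-1} via Cholesky, then eigenvalues of A
        Psi22 = np.linalg.cholesky(np.linalg.inv(S22))
        A = Psi22.T @ S21 @ np.linalg.inv(S11) @ S12 @ Psi22
        lam = np.sort(np.linalg.eigvalsh(A))[::-1]   # largest to smallest
        # Step (5): LR statistics for H0: r = r1 against r = g
        return np.array([-n * np.sum(np.log(1 - lam[r1:])) for r1 in range(g)])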
Chapter 7
Testing the Specification of
Econometric Models
Estimating a misspecified regression model generally yields biased and inconsistent parameter estimates. In this chapter, we discuss a number of procedures that are designed for
testing the specification of econometric models.
7.1 Specification Tests Based on Artificial Regressions
Let M denote a model, µ denote a DGP that belongs to that model, and plim_µ denote a
probability limit taken under the DGP µ.
7.1.1 RESET Test
One of the oldest specification tests for linear regression models, but one that is still widely
used, is the regression specification error test, or RESET test. The idea is to test the
null hypothesis that
\[
y_t = X_t\beta + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2),
\]
where the explanatory variables X_t are predetermined with respect to the error terms u_t,
against the alternative that E(y_t|X_t) is a nonlinear function of the elements of X_t. The
simplest version of RESET involves regressing y_t on X_t to obtain fitted values X_t\hat\beta and
then running the regression
\[
y_t = X_t\beta + \gamma(X_t\hat\beta)^2 + u_t.
\]
The test statistic is the ordinary t statistic for γ = 0.
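A minimal sketch of this procedure (it assumes X already contains a constant column; the function name is ours):

    import numpy as np

    def reset_test(y, X):
        """t statistic for gamma = 0 in y = X b + gamma*(X b_hat)^2 + u."""
        fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        Z = np.column_stack([X, fitted ** 2])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ beta
        s2 = resid @ resid / (len(y) - Z.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z).diagonal())
        return beta[-1] / se[-1]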
7.1.2 Conditional Moment Tests
If a model M is correctly specified, many random quantities that are functions of the dependent variables should have expectation zero. Often, these expectations are taken
conditional on some information set. Let m_t(y_t, θ) be a moment function, where θ is the parameter vector and y_t is the dependent variable. Then m_t(y_t, \hat\theta) is referred to as an empirical
moment. The idea of the conditional moment test is to test whether the quantity
\[
m(y, \hat\theta) \equiv \frac{1}{n}\sum_{t=1}^{n} m_t(y_t, \hat\theta)
\]
is significantly different from zero.
7.1.3 Information Matrix Tests
For a model estimated by ML with parameter θ, the asymptotic information matrix is
equal to minus the asymptotic Hessian. Therefore, we should expect that, in general, the
information matrix equality does not hold when the model we are estimating is misspecified.
The null hypothesis for the IM test is that
\[
\plim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\Bigl(\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j}
+ \frac{\partial\ell_t(\theta)}{\partial\theta_i}\frac{\partial\ell_t(\theta)}{\partial\theta_j}\Bigr) = 0,
\]
for i = 1, ..., k and j = 1, ..., i. We can calculate IM test statistics by means of the OPG
regression
\[
\iota = G(\theta)b + M(\theta)c + \text{resid}.
\]
The matrix M(θ) is constructed as an n × ½k(k + 1) matrix with typical element
\[
\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j}
+ \frac{\partial\ell_t(\theta)}{\partial\theta_i}\frac{\partial\ell_t(\theta)}{\partial\theta_j}.
\]
The test statistic is then the ESS from the above regression. If the matrix [G(\hat\theta), M(\hat\theta)]
has full rank, this test statistic is asymptotically distributed as χ²(½k(k + 1)). If it does
not have full rank, one or more columns of M(\hat\theta) have to be dropped, and the number of
degrees of freedom reduced accordingly.
7.2 Nonnested Hypothesis Tests
Two models are said to be nonnested if neither model can be written as a special case of
the other without imposing restrictions on both models. We can write the two models as
\[
H_1:\ y = X\beta + u_1, \qquad H_2:\ y = Z\gamma + u_2.
\]
The simplest and most widely-used nonnested hypothesis tests start from the artificial
comprehensive model
\[
y = (1 - \alpha)X\beta + \alpha Z\gamma + u, \tag{7.1}
\]
where α is a scalar parameter. Ideally, we could test whether α = 1 or 0 to determine whether H_1 or
H_2 represents the true DGP. However, this is not possible, since at least one of the parameters
cannot be identified.
7.2.1 J Tests
One popular way to make (7.1) identified is to replace the unknown vector γ by its estimate.
This idea was first suggested by Davidson and MacKinnon (1981). Thus, equation (7.1)
becomes
\[
y = X\beta + \alpha Z\hat\gamma + u = X\beta + \alpha P_Z y + u.
\]
The ordinary t statistic for α = 0 is called the J statistic, which is asymptotically distributed as N(0, 1) under the null hypothesis H_1.
The J test tends to overreject, often quite severely, when at least one of the following conditions
holds:
• The sample size is small;
• The model under test does not fit very well;
• The number of regressors in H2 that do not appear in H1 is large.
There are several ways to improve its finite-sample performance. We can
use a fully parametric or semiparametric bootstrap method, or we can adopt the JA test
of Fisher and McAleer (1981). The test regression is
\[
y = X\beta + \alpha Z\tilde\gamma + u = X\beta + \alpha P_Z P_X y + u,
\]
and the JA statistic is the t statistic for α = 0.
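A minimal sketch of the J test (the function name is ours; it assumes X and Z each contain whatever constant term is wanted):

    import numpy as np

    def j_test(y, X, Z):
        """J statistic: t statistic for alpha = 0 in y = X beta + alpha*(P_Z y) + u."""
        pz_y = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # fitted values from H2
        W = np.column_stack([X, pz_y])
        beta = np.linalg.lstsq(W, y, rcond=None)[0]
        resid = y - W @ beta
        s2 = resid @ resid / (len(y) - W.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(W.T @ W).diagonal())
        return beta[-1] / se[-1]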
7.2.2 P Tests
The J test can be extended to nonlinear regression models. Suppose the two models are
\[
H_1:\ y = x(\beta) + u_1, \qquad H_2:\ y = z(\gamma) + u_2.
\]
The J statistic is the t statistic for α = 0 in the nonlinear regression
\[
y = (1 - \alpha)x(\beta) + \alpha\hat z + \text{resid},
\]
where \hat z \equiv z(\hat\gamma), with \hat\gamma the vector of NLS estimates of the regression model H_2. We can,
as usual, run the GNR that corresponds to the above equation, evaluated at α = 0 and
β = \hat\beta. This GNR is
\[
y - \hat x = \hat X b + a(\hat z - \hat x) + \text{resid},
\]
where \hat x = x(\hat\beta) and \hat X = X(\hat\beta) is the matrix of derivatives of x(β) with respect to β,
evaluated at the NLS estimates \hat\beta. The ordinary t statistic for a = 0 is called the P statistic.
7.2.3 Cox Tests
If the two nonnested models are estimated by ML, with loglikelihood functions
\[
\ell_1(\theta_1) = \sum_{t=1}^{n}\ell_{1t}(\theta_1) \quad\text{and}\quad \ell_2(\theta_2) = \sum_{t=1}^{n}\ell_{2t}(\theta_2)
\]
for models H_1 and H_2, respectively, we can use another approach called Cox tests. Cox's
idea was to extend the idea of a likelihood ratio test. Cox's test statistic is
\[
T_1 \equiv 2n^{-1/2}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr)
- 2n^{-1/2} E_{\hat\theta_1}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr),
\]
which follows a normal distribution asymptotically.
7.3 Model Selection Based on Information Criteria
The true model is rarely available to economists, due to the complexity of economic processes.
Thus, economists formulate approximation models to capture the effects or factors supported
by the empirical data. However, using different approximation models usually ends up with
different empirical results, which gives rise to so-called model uncertainty. One of the two
popular ways to deal with model uncertainty is model selection.
Model selection is a procedure for selecting the best model from a set of approximation models. Such a procedure generally involves calculating a criterion function for all
approximation models and ranking them accordingly. One of the most widely used criterion
functions is the Akaike information criterion (AIC) proposed by Akaike (1973). The simplest
version of the AIC is
\[
\mathrm{AIC}_i = \ell_i(\hat\beta_i) - k_i.
\]
We then choose the model that maximizes AIC_i.
A popular alternative to the AIC is the Bayesian information criterion (BIC) of Schwarz
(1978), which takes the following form:
\[
\mathrm{BIC}_i = \ell_i(\hat\beta_i) - \tfrac{1}{2}k_i\log n.
\]
The BIC has a stronger penalty for complexity. There are other methods based on various criteria,
including the Mallows criterion (Mallows' C_p) by Mallows (1973), the prediction criterion
by Amemiya (1980), the focused information criterion (FIC) by Claeskens and Hjort (2003),
etc.
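As a small sketch for linear regression models with normal errors, where the maximized loglikelihood has a closed form (here k_i is taken to be the number of regression coefficients, one possible way of counting parameters; the function name is ours):

    import numpy as np

    def gaussian_aic_bic(y, X):
        """AIC_i = l_i - k_i and BIC_i = l_i - 0.5*k_i*log(n) for a linear model."""
        n, k = X.shape
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        return loglik - k, loglik - 0.5 * k * np.log(n)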
7.4 Nonparametric Estimation
(For this part, you can also read “Nonparametric Econometrics: Theory and Practice” by
Qi Li and Jeffrey Racine. )
Nonparametric regression is a form of regression analysis in which the predictor does
not take a predetermined form but is constructed according to information derived from
the data. Nonparametric regression requires larger sample sizes than regression based on
parametric models because the data must supply the model structure as well as the model
estimates.
7.4.1 Kernel Estimation
One traditional way of estimating a PDF is to form a histogram. Given a sample xt ,
t = 1, ..., n, of independent realizations of a random variable X, for any arbitrary argument
x, the empirical distribution function (EDF) is
\[
\hat F(x) = \frac{1}{n}\sum_{t=1}^{n} I(x_t \le x).
\]
The indicator function I is clearly discontinuous, which makes the above EDF discontinuous.
In practice and theory, we always prefer to have a smooth EDF for various reasons, for
example, differentiation. For these reasons, we replace I with a continuous CDF, K(z),
with mean 0. This function is called a cumulative kernel. It is convenient to be able to
control the degree of smoothness of the estimate. Accordingly, we introduce the bandwidth
parameter h as a scaling parameter for the actual smoothing distribution. This gives the
kernel CDF estimator
\[
\hat F_h(x) = \frac{1}{n}\sum_{t=1}^{n} K\Bigl(\frac{x - x_t}{h}\Bigr). \tag{7.2}
\]
There are many kernels we can choose from, and a popular one is the standard normal
distribution, also called the Gaussian kernel. If we differentiate equation (7.2) with respect to
x, we obtain the kernel density estimator
\[
\hat f_h(x) = \frac{1}{nh}\sum_{t=1}^{n} k\Bigl(\frac{x - x_t}{h}\Bigr).
\]
This estimator is very sensitive to the value of the bandwidth h. Two popular choices for h
are h = 1.059 s n^{-1/5} and h = 0.785(\hat q_{.75} - \hat q_{.25})n^{-1/5}, where s is the standard deviation of the x_t
and \hat q_{.75} - \hat q_{.25} is the difference between the estimated .75 and .25 quantiles of the data.
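A minimal sketch of the Gaussian kernel density estimator with the first rule-of-thumb bandwidth (the function name is ours):

    import numpy as np

    def kde_gaussian(x_grid, data, h=None):
        """Kernel density estimate f_hat_h(x) with a Gaussian kernel."""
        data = np.asarray(data, dtype=float)
        n = len(data)
        if h is None:
            h = 1.059 * data.std() * n ** (-0.2)     # rule-of-thumb bandwidth
        u = (np.asarray(x_grid)[:, None] - data[None, :]) / h
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
        return k.sum(axis=1) / (n * h)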
7.4.2 Nonparametric Regression
Nonparametric regression estimates E(y_t|x_t) directly, without making any assumptions
about functional form. We suppose that two random variables Y and X are jointly distributed, and we wish to estimate the conditional expectation µ(x) ≡ E(Y |x) as a function
of x, using a sample of paired observations (yt , xt ) for t = 1, ..., n. For given x, we define
\[
g(x) \equiv \int_{-\infty}^{\infty} y f(y, x)\,dy = f(x)\int_{-\infty}^{\infty} y f(y|x)\,dy = f(x)E(Y|x),
\]
where f (x) is the marginal density of X, and f (y|x) is the density of Y conditional on X = x.
Then,
\[
\hat\mu(x) = \frac{\hat g(x)}{\hat f(x)} = \int_{-\infty}^{\infty} y\,\frac{\hat f(x, y)}{\hat f(x)}\,dy.
\]
We use kernel density estimation for the joint density f(x, y) and the marginal density f(x) with a kernel k,
\[
\hat f(x, y) = \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)k\Bigl(\frac{y - y_i}{h_y}\Bigr), \qquad
\hat f(x) = \frac{1}{n h}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr).
\]
Therefore,
\[
\int_{-\infty}^{\infty} y\hat f(x, y)\,dy
= \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)\int_{-\infty}^{\infty} y\,k\Bigl(\frac{y - y_i}{h_y}\Bigr)dy
= \frac{1}{n h h_y}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)\int_{-\infty}^{\infty}(y_i + h_y v)\,k(v)\,h_y\,dv
= \frac{1}{n h}\sum_{i=1}^{n} k\Bigl(\frac{x - x_i}{h}\Bigr)y_i.
\]
Finally, we obtain the so-called Nadaraya-Watson estimator,
\[
\hat\mu(x) = \frac{\sum_{t=1}^{n} y_t k_t}{\sum_{t=1}^{n} k_t}, \qquad k_t \equiv k\Bigl(\frac{x - x_t}{h}\Bigr).
\]
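A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the function name is ours; the bandwidth h is supplied by the user):

    import numpy as np

    def nadaraya_watson(x_grid, x, y, h):
        """mu_hat(x) = sum_t y_t k_t / sum_t k_t with k_t = k((x - x_t)/h)."""
        u = (np.asarray(x_grid)[:, None] - np.asarray(x)[None, :]) / h
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
        return (k @ np.asarray(y)) / k.sum(axis=1)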
Appendix A
Sample Assignment and Solution
1. Show that the difference between the matrix
\[
(J^\top W^\top X)^{-1} J^\top W^\top\Omega W J(X^\top W J)^{-1}
\]
and the matrix
\[
\bigl(X^\top W(W^\top\Omega W)^{-1}W^\top X\bigr)^{-1}
\]
is a positive semidefinite matrix.
Solution: Our goal is to show that
\[
(J^\top W^\top X)^{-1} J^\top W^\top\Omega W J(X^\top W J)^{-1}
- \bigl(X^\top W(W^\top\Omega W)^{-1}W^\top X\bigr)^{-1} \text{ is psd.} \tag{A.1}
\]
Note that
\[
(J^\top W^\top X)^{-1} J^\top W^\top\Omega W J(X^\top W J)^{-1}
= \bigl((X^\top W J)(J^\top W^\top\Omega W J)^{-1}(J^\top W^\top X)\bigr)^{-1}.
\]
Hence, proving that (A.1) is psd is equivalent, by ETM Exercise 3.8, to proving that
\[
X^\top W(W^\top\Omega W)^{-1}W^\top X - (X^\top W J)(J^\top W^\top\Omega W J)^{-1}(J^\top W^\top X) \text{ is psd.} \tag{A.2}
\]
Define P_{\Omega^{1/2}W} = \Omega^{1/2}W(W^\top\Omega W)^{-1}W^\top(\Omega^{1/2})^\top; we have
\[
X^\top W(W^\top\Omega W)^{-1}W^\top X = X^\top\Omega^{-1/2}P_{\Omega^{1/2}W}(\Omega^{-1/2})^\top X.
\]
Similarly, define P_{\Omega^{1/2}WJ} = \Omega^{1/2}WJ(J^\top W^\top\Omega W J)^{-1}J^\top W^\top(\Omega^{1/2})^\top; we have
\[
(X^\top W J)(J^\top W^\top\Omega W J)^{-1}(J^\top W^\top X) = X^\top\Omega^{-1/2}P_{\Omega^{1/2}WJ}(\Omega^{-1/2})^\top X.
\]
Hence,
\[
(A.2) = (X^\top\Omega^{-1/2})\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)(X^\top\Omega^{-1/2})^\top. \tag{A.3}
\]
Note that S(\Omega^{1/2}W) \supseteq S(\Omega^{1/2}WJ); hence P_{\Omega^{1/2}W}P_{\Omega^{1/2}WJ} = P_{\Omega^{1/2}WJ}, which implies
\[
\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)^\top\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)
= P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}.
\]
Finally,
\[
(A.3) = (X^\top\Omega^{-1/2})\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)^\top\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)(X^\top\Omega^{-1/2})^\top
= \bigl\|\bigl(P_{\Omega^{1/2}W} - P_{\Omega^{1/2}WJ}\bigr)(X^\top\Omega^{-1/2})^\top\bigr\|^2 \ge 0.
\]
Hence (A.2) = (A.3) is psd, which implies that (A.1) is also psd.
2. Suppose u_t follows an AR(1) process
\[
u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{iid}(0, 1), \qquad |\rho| < 1.
\]
(a) Compute Eu_t^2, Eu_t u_{t-1}, and Eu_t u_{t-2}. Confirm that they do not depend on t.
(b) Define γ(h) = Eu_t u_{t+h}. Find \sum_{h=-\infty}^{\infty}\gamma(h).
Solution:
(a) We have
\[
u_t = \rho u_{t-1} + \varepsilon_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots.
\]
Since ε_t ∼ iid(0, 1), the ε_t are uncorrelated. Hence
\[
Eu_t^2 = E(\varepsilon_t^2 + \rho^2\varepsilon_{t-1}^2 + \rho^4\varepsilon_{t-2}^2 + \cdots) + 0
= 1 + \rho^2 + \rho^4 + \cdots = \frac{1}{1 - \rho^2}.
\]
Similarly,
\[
Eu_t u_{t-1} = E(\rho\varepsilon_{t-1}^2 + \rho^3\varepsilon_{t-2}^2 + \cdots) + 0
= \rho + \rho^3 + \rho^5 + \cdots = \frac{\rho}{1 - \rho^2},
\]
and
\[
Eu_t u_{t-2} = E(\rho^2\varepsilon_{t-2}^2 + \rho^4\varepsilon_{t-3}^2 + \cdots) + 0
= \rho^2 + \rho^4 + \rho^6 + \cdots = \frac{\rho^2}{1 - \rho^2}.
\]
Hence we conclude that they do not depend on time t.
(b) Note that Eu_t u_{t+1} = Eu_{t+1}u_t, which means that the value of γ(h) is symmetric around
h = 0, where we know γ(0) = Eu_t^2 = 1/(1 − ρ²). Hence
\[
\sum_{h=-\infty}^{\infty}\gamma(h) = 2\sum_{h=1}^{\infty}\gamma(h) + \gamma(0).
\]
We only need to consider the case h > 0, for which
\[
u_{t+h} = \varepsilon_{t+h} + \rho\varepsilon_{t+h-1} + \cdots + \rho^h\varepsilon_t + \cdots.
\]
Then we have
\[
Eu_{t+h}u_t = E(\rho^h\varepsilon_t^2 + \rho^{h+2}\varepsilon_{t-1}^2 + \cdots)
= \rho^h + \rho^{h+2} + \cdots = \frac{\rho^h}{1 - \rho^2} = \gamma(h).
\]
Hence,
\[
\sum_{h=1}^{\infty}\gamma(h) = \frac{1}{1 - \rho^2}(\rho + \rho^2 + \rho^3 + \cdots)
= \frac{1}{1 - \rho^2}\cdot\frac{\rho}{1 - \rho} = \frac{\rho}{(1 - \rho^2)(1 - \rho)}.
\]
Finally,
\[
\sum_{h=-\infty}^{\infty}\gamma(h) = \frac{2\rho}{(1 - \rho^2)(1 - \rho)} + \frac{1}{1 - \rho^2}
= \frac{1 + \rho}{(1 - \rho^2)(1 - \rho)} = \frac{1}{(1 - \rho)^2}.
\]
3. Consider the linear regression model
\[
y_t = \beta x_t + u_t, \qquad t = 1, \cdots, n,
\]
where u_t is generated by u_t = ρu_{t-1} + ε_t, ε_t ∼ iid(0, 1), |ρ| < 1, β is a scalar, and
x_t = 1 for all t. Let \hat\beta_{GMM} be the efficient GMM estimator using the instrument
W = (1, 1, \cdots, 1)^\top. Show that \hat\beta_{GMM} is equal to \hat\beta, the OLS estimator of β.
Solution: We know that the efficient GMM estimator is
\[
\hat\beta_{GMM} = \bigl(X^\top W(W^\top\Omega_0 W)^{-1}W^\top X\bigr)^{-1}X^\top W(W^\top\Omega_0 W)^{-1}W^\top y.
\]
Since W = [1, 1, \cdots, 1]^\top, we have
\[
W^\top\Omega_0 W = \sum_{i=1}^{n}\sigma_i^2. \tag{A.4}
\]
Define the value of (A.4) as a, which is a scalar. Moreover, since x_t = 1 for all t, we have
X = W, so that X^\top W = W^\top X = n. Hence
\[
\hat\beta_{GMM} = (n\,a^{-1}\,n)^{-1}\,n\,a^{-1}\,W^\top y = \frac{1}{n}W^\top y
= (X^\top X)^{-1}X^\top y = \hat\beta_{OLS}. \tag{A.5}
\]
4. Show that
\[
E\Bigl[\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}\Bigr]
= -E\Bigl[\frac{\partial\log f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t, \theta)}{\partial\theta'}\Bigr].
\]
Solution: First,
\[
\frac{\partial\log f(y_t, \theta)}{\partial\theta} = \frac{1}{f(y_t, \theta)}\cdot\frac{\partial f(y_t, \theta)}{\partial\theta}.
\]
Then, we have
\[
\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}
= \frac{1}{f(y_t, \theta)}\cdot\frac{\partial^2 f(y_t, \theta)}{\partial\theta\,\partial\theta'}
- \frac{1}{f^2(y_t, \theta)}\cdot\frac{\partial f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial f(y_t, \theta)}{\partial\theta'},
\]
so that
\[
\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}
+ \frac{1}{f^2(y_t, \theta)}\cdot\frac{\partial f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial f(y_t, \theta)}{\partial\theta'}
= \frac{1}{f(y_t, \theta)}\cdot\frac{\partial^2 f(y_t, \theta)}{\partial\theta\,\partial\theta'},
\]
that is,
\[
\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}
+ \frac{\partial\log f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t, \theta)}{\partial\theta'}
= \frac{1}{f(y_t, \theta)}\cdot\frac{\partial^2 f(y_t, \theta)}{\partial\theta\,\partial\theta'}.
\]
Hence,
\[
E\Bigl[\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}
+ \frac{\partial\log f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t, \theta)}{\partial\theta'}\Bigr]
= E\Bigl[\frac{1}{f(y_t, \theta)}\cdot\frac{\partial^2 f(y_t, \theta)}{\partial\theta\,\partial\theta'}\Bigr]
= \int\frac{1}{f(y_t, \theta)}\cdot\frac{\partial^2 f(y_t, \theta)}{\partial\theta\,\partial\theta'}\,f(y_t, \theta)\,dy_t
= \int\frac{\partial^2}{\partial\theta\,\partial\theta'}f(y_t, \theta)\,dy_t
= \frac{\partial^2\,1}{\partial\theta\,\partial\theta'} = 0.
\]
Finally,
\[
E\Bigl[\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}
+ \frac{\partial\log f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t, \theta)}{\partial\theta'}\Bigr] = 0
\quad\Longrightarrow\quad
E\Bigl[\frac{\partial^2\log f(y_t, \theta)}{\partial\theta\,\partial\theta'}\Bigr]
= -E\Bigl[\frac{\partial\log f(y_t, \theta)}{\partial\theta}\cdot\frac{\partial\log f(y_t, \theta)}{\partial\theta'}\Bigr].
\]
5. Show that
\[
\mathrm{Var}(W(X)) \ge \frac{\bigl(\frac{d}{d\theta}EW(X)\bigr)^2}{E\Bigl[\bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\bigr)^2\Bigr]}.
\]
Solution: First,
\[
E\Bigl[\frac{\partial}{\partial\theta}\log f(x|\theta)\Bigr]
= \int\frac{1}{f(x|\theta)}\frac{\partial f(x|\theta)}{\partial\theta}\,f(x|\theta)\,dx
= \frac{d}{d\theta}\int f(x|\theta)\,dx = 0.
\]
Hence, we obtain
\[
E\Bigl[\bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\bigr)^2\Bigr]
= \mathrm{Var}\Bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\Bigr).
\]
By the Cauchy-Schwarz inequality,
\[
\mathrm{Var}(W(X))\cdot\mathrm{Var}\Bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\Bigr)
\ge \Bigl(\mathrm{Cov}\bigl(W(X), \tfrac{\partial}{\partial\theta}\log f(x|\theta)\bigr)\Bigr)^2.
\]
Note that
\[
\mathrm{Cov}\Bigl(W(X), \frac{\partial}{\partial\theta}\log f(x|\theta)\Bigr)
= \int\bigl(W(X) - EW(X)\bigr)\cdot\frac{\partial}{\partial\theta}\log f(x|\theta)\cdot f(x|\theta)\,dx
= \int\bigl(W(X) - EW(X)\bigr)\cdot\frac{\partial}{\partial\theta}f(x|\theta)\,dx
= \int W(X)\frac{\partial}{\partial\theta}f(x|\theta)\,dx - EW(X)\int\frac{\partial}{\partial\theta}f(x|\theta)\,dx
= \frac{d}{d\theta}\int W(X)f(x|\theta)\,dx - 0
= \frac{d}{d\theta}EW(X).
\]
Hence,
\[
\mathrm{Var}(W(X)) \ge \frac{\bigl(\frac{d}{d\theta}EW(X)\bigr)^2}{\mathrm{Var}\bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\bigr)}
= \frac{\bigl(\frac{d}{d\theta}EW(X)\bigr)^2}{E\Bigl[\bigl(\frac{\partial}{\partial\theta}\log f(x|\theta)\bigr)^2\Bigr]}.
\]
6. Consider the linear regression models
y = Xβ + resid
y = Xβ + Zγ + resid
y = Xβ + MX Zγ + resid
where y and u are n × 1, X is n × k1 , Z is n × k2 , β is k1 × 1, and γ is k2 × 1. Let
βˆ(0) be the OLS estimate of β from the first regression. Let βˆ(1) and βˆ(2) be the OLS
estimate of β from the second and the third regression, respectively. Let γˆ (1) and γˆ (2)
be the OLS estimate of γ from the second and third regression, respectively.
(a) show \hat\beta^{(0)} = \hat\beta^{(2)} and \hat\gamma^{(1)} = \hat\gamma^{(2)}
(b) show PX,MX Z y = PX y + PMX Z y
(c) show S([X, Z]) = S([X, MX Z]) and conclude PX,Z y = PX y + PMX Z y for any
y ∈ Rn .
Solution:
(a) Define M_X Z = H; then
\[
y = X\beta^{(2)} + H\gamma^{(2)} + \text{resid}.
\]
Using the FWL theorem, we obtain
\[
\hat\beta^{(2)} = (X'M_H X)^{-1}X'M_H y = (X'X - X'P_H X)^{-1}(X'y - X'P_H y).
\]
Note that, from M_X Z = H, we have X'H = X'M_X Z = 0, and hence X'P_H = 0. Then
\[
\hat\beta^{(2)} = (X'X)^{-1}X'y = \hat\beta^{(0)}.
\]
To prove \hat\gamma^{(1)} = \hat\gamma^{(2)}, we apply the FWL theorem to the second regression and obtain
\[
\hat\gamma^{(1)} = (Z'M_X Z)^{-1}Z'M_X y.
\]
Premultiplying the third regression by M_X, we obtain
\[
M_X y = M_X Z\gamma^{(2)} + \text{resid}.
\]
Hence,
\[
\hat\gamma^{(2)} = (Z'M_X Z)^{-1}Z'M_X y = \hat\gamma^{(1)}.
\]
(b) To prove P_{X,M_X Z}\,y = P_X y + P_{M_X Z}\,y, it is equivalent to prove P_{X,M_X Z} = P_X + P_{M_X Z}.
That means we need to show that
i. S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z};
ii. S^\perp([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.
To prove (i), note that any vector in S([X, M_X Z]) can be expressed as [X, M_X Z]γ
for some (k_1 + k_2)-vector γ. Then
\[
(P_X + P_{M_X Z})[X, M_X Z]\gamma = [(P_X + P_{M_X Z})X,\ (P_X + P_{M_X Z})M_X Z]\gamma
= [X + 0,\ 0 + M_X Z]\gamma = [X, M_X Z]\gamma.
\]
Note that P_{M_X Z}X = 0, since X and M_X Z are orthogonal, and P_X M_X Z = 0.
Hence S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z}.
To prove (ii), let w be any element of S^\perp([X, M_X Z]); then
\[
[X, M_X Z]'w = 0
\]
implies X'w = 0 and Z'M_X w = 0. To finish the proof, we premultiply w by P_X + P_{M_X Z}
and obtain
\[
(P_X + P_{M_X Z})w = X(X'X)^{-1}X'w + M_X Z(Z'M_X Z)^{-1}Z'M_X w
= X(X'X)^{-1}\cdot 0 + M_X Z(Z'M_X Z)^{-1}\cdot 0 = 0.
\]
Hence S^\perp([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.
Finally, we obtain P_{X,M_X Z} = P_X + P_{M_X Z}, and thus
\[
P_{X,M_X Z}\,y = P_X y + P_{M_X Z}\,y.
\]
(c) To prove P_{X,Z}\,y = P_X y + P_{M_X Z}\,y, we can use the result from part (b). We only need
to show that S([X, Z]) is equivalent to S([X, M_X Z]).
Let a be any element of S([X, Z]); then a can always be expressed as
\[
a = [X, Z]\gamma = X\gamma_1 + Z\gamma_2
\]
for some k_1 × 1 vector γ_1 and some k_2 × 1 vector γ_2. Then we rewrite Zγ_2
as M_X Zγ_2 + P_X Zγ_2. Note that S(P_X Z) ⊆ S(X). Hence, the element P_X Zγ_2 of
S(P_X Z) can always be expressed as X\tilde\gamma for some k_1 × 1 vector \tilde\gamma. Then we obtain
\[
a = X\gamma_1 + M_X Z\gamma_2 + P_X Z\gamma_2 = X(\gamma_1 + \tilde\gamma) + M_X Z\gamma_2
= [X \;\; M_X Z]\begin{bmatrix}\gamma_1 + \tilde\gamma\\ \gamma_2\end{bmatrix}.
\]
Hence a ∈ S([X, M_X Z]), which implies S([X, Z]) ⊆ S([X, M_X Z]).
Let b be any element of S([X, M_X Z]); then b can always be expressed as
\[
b = [X, M_X Z]\delta = X\delta_1 + M_X Z\delta_2
\]
for some k_1 × 1 vector δ_1 and some k_2 × 1 vector δ_2. Then we rewrite M_X Zδ_2 as
Zδ_2 − P_X Zδ_2. Note that S(P_X Z) ⊆ S(X). Hence, the element P_X Zδ_2 of S(P_X Z)
can always be expressed as X\tilde\delta for some k_1 × 1 vector \tilde\delta. Then we obtain
\[
b = X\delta_1 + Z\delta_2 - P_X Z\delta_2 = X(\delta_1 - \tilde\delta) + Z\delta_2
= [X \;\; Z]\begin{bmatrix}\delta_1 - \tilde\delta\\ \delta_2\end{bmatrix}.
\]
Hence b ∈ S([X, Z]), which implies S([X, M_X Z]) ⊆ S([X, Z]). Together with S([X, Z]) ⊆
S([X, M_X Z]), this yields
\[
S([X, Z]) = S([X, M_X Z]).
\]
Recall that in part (b) we already proved that
i. S([X, M_X Z]) is invariant under the action of P_X + P_{M_X Z};
ii. S^\perp([X, M_X Z]) is annihilated by P_X + P_{M_X Z}.
By S([X, Z]) = S([X, M_X Z]), we have
i. S([X, Z]) is invariant under the action of P_X + P_{M_X Z};
ii. S^\perp([X, Z]) is annihilated by P_X + P_{M_X Z}.
This implies
\[
P_{X,Z} = P_X + P_{M_X Z}.
\]
Hence, P_{X,Z}\,y = P_X y + P_{M_X Z}\,y.
7. The file tbrate.data contains data for 1950:1 to 1996:4 for three series: rt , the interest
rate on 90-day treasury bills, πt , the rate of inflation, and yt , the logarithm of real GDP.
For the period 1950:4 to 1996:4, run the regression
∆rt = β1 + β2 πt−1 + β3 ∆yt−1 + β4 ∆rt−1 + β5 ∆rt−2 + ut ,
where ∆ is the first-difference operator, defined so that ∆rt = rt − rt−1 . Plot the
residuals and fitted values against time. Then regress the residuals on the fitted values
and on a constant. What do you learn from this second regression? Now regress the
fitted values on the residuals and on a constant. What do you learn from this third
regression?
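A possible starting point in Python (a sketch only: it assumes tbrate.data is a whitespace-delimited text file whose three columns are r_t, π_t, and y_t in that order; plotting is omitted, and the two auxiliary regressions of the exercise can be run the same way with np.linalg.lstsq):

    import numpy as np

    data = np.loadtxt("tbrate.data")                 # assumed column order: r, pi, y
    r, pi, y = data[:, 0], data[:, 1], data[:, 2]
    dr, dy = np.diff(r), np.diff(y)
    # dr_t = b1 + b2*pi_{t-1} + b3*dy_{t-1} + b4*dr_{t-1} + b5*dr_{t-2} + u_t
    Y = dr[2:]
    X = np.column_stack([np.ones(len(Y)), pi[2:-1], dy[1:-1], dr[1:-1], dr[:-2]])
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    fitted, resid = X @ beta, Y - X @ beta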