
1. Introduction
1.1 Why empirical likelihood
• nonparametric method: without having to assume the form of
the underlying distribution
• likelihood-based inference: retaining the advantages of likelihood
methods
• alternative method when other (more conventional) methods
are not applicable
Example 1. Somites of earthworms.
Earthworms have segmented bodies. The segments are known as
somites. As a worm grows, both the number and the length of its
somites increase.
The dataset contains the number of somites on each of 487 worms
gathered near Ann Arbor in 1902.
The histogram shows that the distribution is skewed to the left,
with a markedly heavier left tail.
Skewness: γ = E{(X − EX)^3} / {Var(X)}^{3/2}, a measure of asymmetry.
Kurtosis: κ = E{(X − EX)^4} / {Var(X)}^2 − 3, a measure of tail-heaviness.
Estimation for γ and κ
Let X̄ = n^{−1} Σ_{1≤i≤n} Xi and σ̂^2 = (n − 1)^{−1} Σ_{1≤i≤n} (Xi − X̄)^2. Then
γ̂ = (n σ̂^3)^{−1} Σ_{i=1}^n (Xi − X̄)^3,    κ̂ = (n σ̂^4)^{−1} Σ_{i=1}^n (Xi − X̄)^4 − 3.
Remark. (i) For N(µ, σ^2), γ = 0 and κ = 0.
(ii) For symmetric distributions, γ = 0.
(iii) When κ > 0, the distribution has heavier tails than those of N(µ, σ^2).
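For concreteness, γ̂ and κ̂ can be computed in a few lines of R. This is a minimal sketch; the function name and the simulated data are ours, not the somite data.

    # sample skewness and kurtosis as defined above
    skew.kurt <- function(x) {
      n <- length(x)
      xbar <- mean(x)
      sigma2 <- var(x)              # divisor (n - 1), as in the definition of sigma-hat^2
      gamma.hat <- sum((x - xbar)^3) / (n * sigma2^(3/2))
      kappa.hat <- sum((x - xbar)^4) / (n * sigma2^2) - 3
      c(skewness = gamma.hat, kurtosis = kappa.hat)
    }
    # example: for normal data both values should be close to 0
    skew.kurt(rnorm(487))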
How to find confidence sets for (γ, κ)?
Answer: empirical likelihood contours.
Let l(γ, κ) be the (log-) empirical likelihood function of (γ, κ). The
confidence region for (γ, κ) is defined as
{(γ, κ) : l(γ, κ) > C},
where C is a constant determined by the confidence level, i.e.
P{l(γ, κ) > C} = 1 − α.
Why do conventional methods not apply?
Parametric likelihood. The data do not follow a normal distribution, and likelihood inference for higher moments is typically not robust w.r.t. a misspecified distribution.
Bootstrap. It is difficult to pick out a confidence region from a point cloud consisting of a large number of bootstrap estimates of (γ, κ). For example, given 1000 bootstrap estimates of (γ, κ), a 95% confidence region should ideally contain the 950 most central points. In practice we restrict to rectangular or elliptical regions in order to make this feasible.
In the second panel, the empirical likelihood confidence regions (i.e. contours) correspond to the confidence levels 50%, 90%, 95%, 99%, 99.9% and 99.99%.
Note. (γ, κ) = (0, 0) is not contained in the confidence regions.
1.2 Introducing empirical likelihood
Let X = (X1, · · · , Xn)τ be a random sample from an unknown
distribution F (·). We know nothing about F (·).
In practice we observe Xi = xi (i = 1, · · · , n); x1, · · · , xn are n known numbers.
Basic idea. Assume F is a discrete distribution on {x1, · · · , xn} with
pi = F(xi),   i = 1, · · · , n,
where
pi ≥ 0,   Σ_{i=1}^n pi = 1.
Remark. The number of parameters is the same as the number of observations.
What is the likelihood function of {pi}? What is the MLE?
Since
P{X1 = x1, · · · , Xn = xn} = p1 · · · pn,
the likelihood is
L(p1, · · · , pn) ≡ L(p1, · · · , pn; X) = ∏_{i=1}^n pi,
which is called an empirical likelihood.
Note
(∏_{i=1}^n pi)^{1/n} ≤ n^{−1} Σ_{i=1}^n pi = 1/n,
and the equality holds iff p1 = · · · = pn = 1/n. Putting p̂i = 1/n, we have
L(p1, · · · , pn; X) ≤ L(p̂1, · · · , p̂n; X)
for any pi ≥ 0 with Σ_i pi = 1.
Hence the MLE based on the empirical likelihood, which is called the
maximum empirical likelihood estimator (MELE), puts equal
probability mass 1/n on each of the n observed values x1, · · · , xn.
Namely, the MELE for F is the uniform distribution on the observed
data points. The corresponding distribution function
Fn(x) = n^{−1} Σ_{i=1}^n I(Xi ≤ x)
is called the empirical distribution of the sample X = (X1, · · · , Xn)^τ.

Example 2. Find the MELE for µ ≡ EX1.
Corresponding to the EL,
µ = Σ_{i=1}^n pi xi = µ(p1, · · · , pn).
Therefore the MELE for µ is
µ̂ = µ(p̂1, · · · , p̂n) = n^{−1} Σ_{i=1}^n xi = n^{−1} Σ_{i=1}^n Xi = X̄.
Similarly, the MELE for µk ≡ E(X1^k) is simply the k-th sample moment:
µ̂k = n^{−1} Σ_{i=1}^n Xi^k.
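A small R sketch of these MELEs (the variable names and the simulated data are ours): the empirical distribution Fn and the sample moments.

    x <- rnorm(100)          # any observed sample
    Fn <- ecdf(x)            # empirical distribution function Fn(.)
    Fn(0)                    # proportion of observations <= 0
    mean(x)                  # MELE of mu = E(X1)
    mean(x^2)                # MELE of mu_2 = E(X1^2), the second sample moment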
Remarks. (i) MELEs, without further constraints, are simply the method-of-moments estimators; this is nothing new.
(ii) Empirical likelihood is a powerful tool for testing hypotheses and for interval estimation in a nonparametric manner based on the likelihood tradition; this also involves evaluating MELEs under some further constraints.

2. Empirical likelihood for means
Let X1, · · · , Xn be a random sample from an unknown distribution.
Goal: test hypotheses on µ ≡ EX1, or find confidence intervals for µ.
Tool: empirical likelihood ratios (ELR).
2.1 Tests
Consider the hypotheses
H0 : µ = µ0   vs   H1 : µ ≠ µ0.
Let L(p1, · · · , pn) = ∏_i pi. We reject H0 for large values of the ELR
T = max L(p1, · · · , pn) / max_{H0} L(p1, · · · , pn) = L(n^{−1}, · · · , n^{−1}) / L(p̃1, · · · , p̃n),
where {p̃i} are the constrained MELEs for {pi} under H0.
Two problems:
(i) p̃i = ?
(ii) What is the distribution of T under H0?

(i) The constrained MELEs are p̃i = pi(µ0), where {pi(µ)} are the solution of the maximisation problem
max_{pi} Σ_{i=1}^n log pi
subject to the conditions
pi ≥ 0,   Σ_{i=1}^n pi = 1,   Σ_{i=1}^n pi xi = µ.
Note
x(1) ≡ min_i xi ≤ Σ_i pi xi ≤ max_i xi ≡ x(n),
so it is natural to require x(1) ≤ µ ≤ x(n). The solution of the above problem is given in the theorem below.

Theorem 1. For µ ∈ (x(1), x(n)),
pi(µ) = 1 / {n − λ(xi − µ)} > 0,   1 ≤ i ≤ n,   (1)
where λ is the unique solution of the equation
Σ_{j=1}^n (xj − µ) / {n − λ(xj − µ)} = 0   (2)
in the interval (n/(x(1) − µ), n/(x(n) − µ)).

Proof. We use the Lagrange multiplier technique to solve this optimisation problem. Put
Q = Σ_i log pi + ψ (Σ_i pi − 1) + λ (Σ_i pi xi − µ).
Setting the partial derivatives of Q w.r.t. pi, ψ and λ equal to 0, we have
pi^{−1} + ψ + λ xi = 0,   (3)
Σ_i pi = 1,   (4)
Σ_i pi xi = µ.   (5)
By (3),
pi = −1/(ψ + λ xi).   (6)
Hence 1 + ψ pi + λ xi pi = 0; summing over i and using (4) and (5) gives ψ = −(n + λµ). This together with (6) implies (1). By (1) and (5),
Σ_i xi / {n − λ(xi − µ)} = µ.   (7)
It follows from (1) and (4) that Σ_i 1/{n − λ(xi − µ)} = 1, and hence
Σ_i µ / {n − λ(xi − µ)} = µ.
This together with (7) implies (2).
Now let g(λ) be the function on the LHS of (2). Then
ġ(λ) = Σ_i (xi − µ)^2 / {n − λ(xi − µ)}^2 > 0.
Hence g(λ) is a strictly increasing function. Note
lim_{λ ↓ n/(x(1)−µ)} g(λ) = −∞,   lim_{λ ↑ n/(x(n)−µ)} g(λ) = ∞.
Hence g(λ) = 0 has a unique solution in the interval (n/(x(1) − µ), n/(x(n) − µ)).
Note that for any λ in this interval,
1/{n − λ(x(1) − µ)} > 0,   1/{n − λ(x(n) − µ)} > 0,
and 1/{n − λ(x − µ)} is a monotonic function of x. It follows that
pi(µ) > 0 for all 1 ≤ i ≤ n.
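A minimal numerical sketch of Theorem 1 in R: for a given µ strictly inside the range of the data, solve (2) for λ by uniroot and form the weights pi(µ). The function name and the example data are ours.

    el.weights <- function(x, mu) {
      n <- length(x)
      g <- function(lambda) sum((x - mu) / (n - lambda * (x - mu)))
      # lambda lies in ( n/(x_(1) - mu), n/(x_(n) - mu) ); stay slightly inside the end points
      lo <- n / (min(x) - mu); hi <- n / (max(x) - mu)
      eps <- 1e-8 * (hi - lo)
      lambda <- uniroot(g, c(lo + eps, hi - eps))$root
      1 / (n - lambda * (x - mu))            # p_i(mu)
    }
    x <- rexp(50)                            # example data
    p <- el.weights(x, mu = 0.8)             # 0.8 must lie strictly between min(x) and max(x)
    sum(p)                                   # should be 1 (up to numerical error)
    sum(p * x)                               # should equal 0.8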
Remarks. (a) When µ = X̄, λ = 0 and pi(µ) = 1/n for i = 1, · · · , n. It may be shown that, for µ close to E(Xi) and n large,
pi(µ) ≈ (1/n) · 1 / {1 + (x̄ − µ)(xi − µ)/S(µ)},
where S(µ) = n^{−1} Σ_i (xi − µ)^2.
(b) We may view
L(µ) = L{p1(µ), · · · , pn(µ)}
as a profile empirical likelihood for µ. Hypothetically, consider a 1-1 parameter transformation from {p1, · · · , pn} to {µ, θ1, · · · , θn−1}. Then
L(µ) = max_{θi} L(µ, θ1, · · · , θn−1) = L{µ, θ̂1(µ), · · · , θ̂n−1(µ)}.
(c) The likelihood function L(µ) may be calculated using the R-code and Splus-code downloadable at
http://www-stat.stanford.edu/∼owen/empirical/
(ii) As for the distribution of T under H0: the asymptotic theorem for the classic likelihood ratio tests (i.e. Wilks' Theorem) still holds for the ELR tests.
Let X1, · · · , Xn be i.i.d. with µ = E(X1). To test
H0 : µ = µ0   vs   H1 : µ ≠ µ0,
the ELR statistic is
T = max L(p1, · · · , pn) / max_{H0} L(p1, · · · , pn) = (1/n)^n / L(µ0)
  = ∏_{i=1}^n 1/{n pi(µ0)} = ∏_{i=1}^n {1 − (λ/n)(Xi − µ0)},
where λ is the unique solution of
Σ_{j=1}^n (Xj − µ0) / {n − λ(Xj − µ0)} = 0.
Theorem 2. Let E(X1^2) < ∞. Then under H0,
2 log T = 2 Σ_{i=1}^n log{1 − (λ/n)(Xi − µ0)} → χ^2_1
in distribution as n → ∞.

A sketch proof. Under H0, EXi = µ0. Therefore µ0 is close to X̄ for large n. Hence λ, or more precisely λn ≡ λ/n, is small; λn is the solution of f(λn) = 0, where
f(λn) = n^{−1} Σ_{j=1}^n (Xj − µ0) / {1 − λn(Xj − µ0)}.
By a simple Taylor expansion, 0 = f(λn) ≈ f(0) + ḟ(0) λn, so
λn ≈ −f(0)/ḟ(0) = −(X̄ − µ0) / {n^{−1} Σ_j (Xj − µ0)^2}.
Now
2 log T ≈ 2 Σ_i {−λn(Xi − µ0) − (λn^2/2)(Xi − µ0)^2}
  = −2 λn n (X̄ − µ0) − λn^2 Σ_i (Xi − µ0)^2
  ≈ n (X̄ − µ0)^2 / {n^{−1} Σ_i (Xi − µ0)^2}.
By the LLN, n^{−1} Σ_i (Xi − µ0)^2 → Var(X1). By the CLT, √n (X̄ − µ0) → N(0, Var(X1)) in distribution. Hence 2 log T → χ^2_1 in distribution.
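A sketch in R of the statistic of Theorem 2, together with the quadratic approximation used in the proof. The function name and the simulated data are ours.

    el.2logT <- function(x, mu0) {
      n <- length(x)
      g <- function(lambda) sum((x - mu0) / (n - lambda * (x - mu0)))
      lo <- n / (min(x) - mu0); hi <- n / (max(x) - mu0)
      eps <- 1e-8 * (hi - lo)
      lambda <- uniroot(g, c(lo + eps, hi - eps))$root
      2 * sum(log(1 - (lambda / n) * (x - mu0)))
    }
    mu0 <- 1
    x <- rnorm(40, mean = mu0)                   # data generated under H0 (example only)
    t2 <- el.2logT(x, mu0)                       # exact 2 log T
    quad <- length(x) * (mean(x) - mu0)^2 / mean((x - mu0)^2)   # n (Xbar - mu0)^2 / {n^{-1} sum (Xi - mu0)^2}
    c(t2, quad)                                  # the two agree closely for moderate n
    1 - pchisq(t2, df = 1)                       # asymptotic p-value from chi^2_1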
2.2 Confidence intervals for µ
For a given α ∈ (0, 1), we will not reject the null hypothesis H0 : µ = µ0 iff 2 log T < χ^2_{1,1−α}, where P{χ^2_1 ≤ χ^2_{1,1−α}} = 1 − α. For α = 0.05, χ^2_{1,1−α} = 3.84.
Hence a 100(1 − α)% confidence interval for µ is
{µ : −2 log{L(µ) n^n} < χ^2_{1,1−α}}
= {µ : Σ_{i=1}^n log pi(µ) > −0.5 χ^2_{1,1−α} − n log n}
= {µ : Σ_{i=1}^n log{n pi(µ)} > −0.5 χ^2_{1,1−α}}.
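In practice the interval can be found by evaluating Σ_i log{n pi(µ)} on a grid of candidate values of µ and keeping those exceeding −0.5 χ^2_{1,1−α}; this simply inverts the test of §2.1. A sketch in R with names of our choosing:

    el.loglik.ratio <- function(x, mu) {       # returns sum_i log{ n p_i(mu) }
      n <- length(x)
      g <- function(lambda) sum((x - mu) / (n - lambda * (x - mu)))
      lo <- n / (min(x) - mu); hi <- n / (max(x) - mu)
      eps <- 1e-8 * (hi - lo)
      lambda <- uniroot(g, c(lo + eps, hi - eps))$root
      sum(log(n / (n - lambda * (x - mu))))
    }
    el.ci <- function(x, alpha = 0.05, grid.n = 400) {
      mus  <- seq(min(x), max(x), length.out = grid.n + 2)[-c(1, grid.n + 2)]  # avoid the end points
      llr  <- sapply(mus, function(m) el.loglik.ratio(x, m))
      keep <- mus[llr > -0.5 * qchisq(1 - alpha, df = 1)]
      range(keep)                              # end points of the confidence interval
    }
    # e.g. with Darwin's data from Example 3 below:
    # darwin <- c(6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6, 7.0, 3.0, 9.3, 7.5, -6.0)
    # el.ci(darwin)     # compare with the interval reported in part (iii) of Example 3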
Example 3. Darwin's data: gains in height of plants from cross-fertilisation.
X = height(Cross-F) - height(Self-F)
15 observations:
6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6, 7.0, 3.0, 9.3, 7.5, -6.0
The sample mean is X̄ = 2.61 and the sample standard deviation is s = 4.71.
Is the gain significant? Intuitively: YES, were it not for the two negative observations -8.4 and -6.0.
Let µ = EXi. We test
H0 : µ = 0   vs   H1 : µ > 0.
(i) Standard approach: assume {X1, · · · , X15} is a random sample from N(µ, σ^2).
MLE: µ̂ = X̄ = 2.61. The t-test statistic is
T = √n X̄ / s = 2.14.
Since T ∼ t(14) under H0, the p-value is 0.06: significant but not overwhelming.
Is N(µ, σ^2) an appropriate assumption? The data do not appear to be normal (they have a heavy left tail); see Fig. (a).
[Fig. (a) Normal plot: QQ-plot of the quantiles of N(0, 1) against the quantiles of the empirical distribution (normal quantiles vs observed quantiles).]
(ii) Consider the generalised normal family
fk(x|µ, σ) = 2^{−1−1/k} / {Γ(1 + 1/k) σ} · exp{ −(1/2) |(x − µ)/σ|^k },
which has mean µ. When k = 2, it is N(µ, σ^2). To find the profile likelihood of µ, the 'MLE' for σ is given by
σ̂^k ≡ σ̂(µ)^k = (k/(2n)) Σ_{i=1}^n |Xi − µ|^k.
Hence
lk(µ) = lk(µ, σ̂) = −n log Γ(1 + 1/k) − n(1 + 1/k) log 2 − n log σ̂ − n/k.
[Fig. (b) Lk likelihood, k = 1, 2, 4, 8: the profile likelihood lk(µ) plotted against µ for k = 1 (solid), 2 (dashed), 4 (dotted) and 8 (dot-dashed).]
Fig. (b) shows that the MLE µ̂ = µ̂(k) varies with respect to k; in fact µ̂(k) increases as k decreases.
If we use the distribution with k = 1 to fit the data, the p-value for the test is 0.03, much more significant than that under the assumption of a normal distribution.
(iii) The empirical likelihood ratio test statistic is 2 log T = 3.56, which rejects H0 with the p-value 0.04. The 95% confidence interval is
{µ : Σ_{i=1}^{15} log pi(µ) > −1.92 − 15 log(15)} = [0.17, 4.27].
The DE (double exponential) density, which corresponds to k = 1 up to a rescaling of σ, is of the form (2σ)^{−1} e^{−|x−µ|/σ}. With µ fixed, the MLE for σ is n^{−1} Σ_i |Xi − µ|. Hence the parametric log (profile) likelihood is, up to a constant not depending on µ,
−n log Σ_i |Xi − µ|.
[Figure: parametric log-likelihood (solid curve) based on the DE distribution, and the empirical log-likelihood (dashed curve), plotted against µ; both curves were shifted vertically by their own maximum values.]
3. Empirical likelihood for random vectors
Let X1, · · · , Xn be i.i.d. random vectors from a distribution F. Similar to the univariate case, we assume
pi = F(Xi),   i = 1, · · · , n,
where pi ≥ 0 and Σ_i pi = 1. The empirical likelihood is
L(p1, · · · , pn) = ∏_{i=1}^n pi.
Without any further constraints, the MELEs are
p̂i = 1/n,   i = 1, · · · , n.
3.1 EL for multivariate means
The profile empirical likelihood for µ = EX1 is
L(µ) = max{ ∏_{i=1}^n pi : pi ≥ 0, Σ_{i=1}^n pi = 1, Σ_{i=1}^n pi Xi = µ } ≡ ∏_{i=1}^n pi(µ),
where pi(µ) is the MELE of pi under the additional constraint Σ_i pi Xi = µ. Define the ELR
T ≡ T(µ) = L(1/n, · · · , 1/n) / L(µ) = 1 / ∏_{i=1}^n {n pi(µ)}.

Theorem 3. Let X1, · · · , Xn be d × 1 i.i.d. random vectors with mean µ and finite covariance matrix Σ, |Σ| ≠ 0. Then as n → ∞,
2 log{T(µ)} = −2 Σ_{i=1}^n log{n pi(µ)} → χ^2_d
in distribution.

Remarks. (i) In the case |Σ| = 0, there exists an integer q < d for which Xi = A Yi, where Yi is a q × 1 random vector with |Var(Yi)| ≠ 0 and A is a d × q constant matrix. The above theorem still holds with the limit distribution replaced by χ^2_q.
(ii) The null hypothesis H0 : µ = µ0 will be rejected at the significance level α iff
Σ_{i=1}^n log{n pi(µ0)} ≤ −0.5 χ^2_{d,1−α},
where P{χ^2_d ≤ χ^2_{d,1−α}} = 1 − α.
(iii) A 100(1 − α)% confidence region for µ is
{ µ : Σ_{i=1}^n log{n pi(µ)} ≥ −0.5 χ^2_{d,1−α} }.
(iv) Bootstrap calibration. (ii) and (iii) are based on an asymptotic result; when n is small and d is large, χ^2_{d,1−α} may be replaced by the [Bα]-th largest value among 2 log T1∗, · · · , 2 log TB∗, which are computed as follows.
(a) Draw an i.i.d. sample X1∗, · · · , Xn∗ from the uniform distribution on {X1, · · · , Xn}. Let
T∗ = 1 / ∏_{i=1}^n {n pi∗(X̄)},
where X̄ = n^{−1} Σ_{i=1}^n Xi, and pi∗(µ) is obtained in the same manner as pi(µ) with {X1, · · · , Xn} replaced by {X1∗, · · · , Xn∗}.
(b) Repeat (a) B times, and denote the B values of T∗ by T1∗, · · · , TB∗.
We may draw Xi∗ from the uniform distribution on {X1, · · · , Xn} as follows: draw Z ∼ U(0, 1) and define X∗ = Xi if Z ∈ [(i − 1)/n, i/n).
Since the limiting distribution is free of the original distribution of {Xi}, we may draw Xi∗ from any distribution {π1, · · · , πn} instead of the uniform distribution used above. Of course, pi∗(X̄) should then be replaced by pi∗(µ̃), where µ̃ = Σ_i πi Xi.
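A sketch of the bootstrap calibration in R for the univariate case d = 1, re-using the uniroot computation of Σ_i log{n pi(µ)} from §2; in the multivariate case the same scheme applies with the Newton-Raphson computation of pi(µ) described in (v) below. All names are ours.

    el.loglik.ratio <- function(x, mu) {          # sum_i log{ n p_i(mu) }, d = 1
      n <- length(x)
      g <- function(lambda) sum((x - mu) / (n - lambda * (x - mu)))
      lo <- n / (min(x) - mu); hi <- n / (max(x) - mu)
      eps <- 1e-8 * (hi - lo)
      lambda <- uniroot(g, c(lo + eps, hi - eps))$root
      sum(log(n / (n - lambda * (x - mu))))
    }
    el.boot.critical <- function(x, alpha = 0.05, B = 999) {
      xbar <- mean(x)
      stat <- replicate(B, {
        xs <- sample(x, replace = TRUE)           # draw X*_1, ..., X*_n uniformly from the data
        -2 * el.loglik.ratio(xs, xbar)            # 2 log T*, centred at the observed mean
        # (assumes xbar lies strictly inside the range of each resample,
        #  which holds with high probability for moderate n)
      })
      quantile(stat, probs = 1 - alpha)           # the [B*alpha]-th largest value
    }
    # x <- rexp(30); el.boot.critical(x)          # compare with qchisq(0.95, df = 1)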
(v) Computing pi(µ).
Assumptions: |Var(Xi)| ≠ 0, and µ is an inner point of the convex hull spanned by the observations, i.e.
µ ∈ { Σ_{i=1}^n pi Xi : pi > 0, Σ_{i=1}^n pi = 1 }.
This ensures that solutions pi(µ) > 0 exist. We solve the problem in 3 steps (a sketch in R follows Step 3 below):
1. Transform the constrained n-dimensional problem into a constrained d-dimensional problem.
2. Transform the constrained problem into an unconstrained problem.
3. Apply a Newton-Raphson algorithm.

Step 1: Similar to Theorem 1, the Lagrange multiplier method entails
pi(µ) = 1 / {n − λ^τ (Xi − µ)},   i = 1, · · · , n,
where λ is the solution of
Σ_{j=1}^n (Xj − µ) / {n − λ^τ (Xj − µ)} = 0.   (8)
Hence
l(µ) ≡ log L(µ) = max{ Σ_{i=1}^n log pi : pi ≥ 0, Σ_{i=1}^n pi = 1, Σ_{i=1}^n pi Xi = µ } = Σ_{i=1}^n log pi(µ),
so that
l(µ) = − Σ_{i=1}^n log{n − λ^τ (Xi − µ)} ≡ M(λ),   (9)
with λ given by (8). Note that ∂M(λ)/∂λ = 0 leads to (8), and
∂^2 M(λ) / ∂λ ∂λ^τ = Σ_{i=1}^n (Xi − µ)(Xi − µ)^τ / {n − λ^τ (Xi − µ)}^2 > 0.
Thus M(·) is a convex function on any connected set satisfying
n − λ^τ (Xi − µ) > 0,   i = 1, · · · , n.   (10)
The original n-dimensional optimisation problem is therefore equivalent to the d-dimensional problem of minimising M(λ) subject to the constraints (10).
Note. (10) and (8) together imply Σ_i pi(µ) = 1.
Let Hλ be the set of all values of λ satisfying
n − λ^τ (Xi − µ) > 1,   i = 1, · · · , n.
Then Hλ is a convex set in R^d which contains the minimiser of the convex function M(λ). (See the 'Note' above.)
Step 2: We extend M(λ) outside the set Hλ so that it remains a convex function on the whole of R^d. (This is needed because M(λ) is not defined on the sets {λ : n − λ^τ (Xi − µ) = 0}, i = 1, · · · , n.) Define
log⋆(z) = log z for z ≥ 1,   log⋆(z) = −1.5 + 2z − 0.5 z^2 for z < 1.
It is easy to see that log⋆(z) has two continuous derivatives on R. Put
M⋆(λ) = − Σ_{i=1}^n log⋆{n − λ^τ (Xi − µ)}.
Then
• M⋆(λ) = M(λ) on Hλ;
• M⋆(λ) is a convex function on the whole of R^d.
Hence M⋆(λ) and M(λ) share the same minimiser, which is the solution of (8).

Step 3: We apply a Newton-Raphson algorithm to compute λ iteratively:
λ_{k+1} = λ_k − {M̈⋆(λ_k)}^{−1} Ṁ⋆(λ_k),
where Ṁ⋆ and M̈⋆ denote the gradient and the Hessian of M⋆. A convenient initial value is λ_0 = 0, corresponding to pi = 1/n.

Remark. The S-code "el.S", available from
www-stat.stanford.edu/∼owen/empirical
calculates the empirical likelihood ratio Σ_{i=1}^n log{n pi(µ)} and other related quantities.
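A sketch of Steps 2-3 in R for d-dimensional data. The function names are ours and the code is only meant to make the algorithm concrete; Owen's "el.S", mentioned above, is a full implementation.

    # log* and its first two derivatives (Step 2)
    logstar   <- function(z) ifelse(z >= 1, log(z), -1.5 + 2*z - 0.5*z^2)
    dlogstar  <- function(z) ifelse(z >= 1, 1/z, 2 - z)
    d2logstar <- function(z) ifelse(z >= 1, -1/z^2, -1)

    # Newton-Raphson for lambda (Step 3); X is an n x d data matrix, mu a d-vector
    el.weights.mv <- function(X, mu, maxit = 100, tol = 1e-12) {
      X <- as.matrix(X); n <- nrow(X)
      Z <- sweep(X, 2, mu)                      # rows are (X_i - mu)^T
      lambda <- rep(0, ncol(X))                 # lambda_0 = 0  <=>  p_i = 1/n
      for (k in 1:maxit) {
        z    <- as.vector(n - Z %*% lambda)     # z_i = n - lambda^T (X_i - mu)
        grad <- colSums(dlogstar(z) * Z)        # gradient of M*(lambda)
        hess <- -t(Z) %*% (d2logstar(z) * Z)    # Hessian of M*(lambda), positive definite
        step <- solve(hess, grad)
        lambda <- lambda - step
        if (sum(step^2) < tol) break
      }
      p <- 1 / as.vector(n - Z %*% lambda)      # p_i(mu)
      list(p = p, lambda = lambda, logelr = sum(log(n * p)))
    }
    # example: bivariate data with (true) mean (0, 1); mu must lie inside the convex hull
    # X <- cbind(rnorm(60), rexp(60)); out <- el.weights.mv(X, mu = c(0, 1))
    # sum(out$p); colSums(out$p * X)            # approximately 1 and approximately mu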
3.2 EL for smooth functions of means
Basic idea. Let Y1, · · · , Yn be i.i.d. random variables with variance σ^2. Note that
σ^2 = EYi^2 − (EYi)^2 = h(µ),
where µ = EXi and Xi = (Yi, Yi^2)^τ. We may deduce a confidence interval for σ^2 from one for µ.
Theorem 4. Let X1, · · · , Xn be d × 1 i.i.d. random vectors with mean µ0 and |Var(X1)| ≠ 0. Let θ = h(µ) be a smooth function from R^d to R^q (q ≤ d), and θ0 = h(µ0). We assume
|G G^τ| ≠ 0,   G = ∂θ/∂µ^τ.
For any r > 0, let
C1,r = { µ : Σ_{i=1}^n log{n pi(µ)} ≥ −0.5 r },
C2,r = { h(µ) : µ ∈ C1,r },
C3,r = { θ0 + G(µ − µ0) : µ ∈ C1,r }.
Then as n → ∞,
P{θ0 ∈ C3,r} → P(χ^2_q ≤ r).

Remarks. (i) The idea of bootstrap calibration may be applied here too.
(ii) Under more conditions, P{θ0 ∈ C2,r} → P(χ^2_q ≤ r). C2,r is a practically feasible confidence set, while C3,r is not, since µ0 and θ0 are unknown in practice. Note that for µ close to µ0, θ0 + G(µ − µ0) ≈ h(µ).
(iii) In general, P{µ0 ∈ C1,r} ≤ P{θ0 ∈ C2,r}. (By Theorem 3, P{µ0 ∈ C1,r} → P(χ^2_d ≤ r).)
(iv) The profile empirical likelihood function of θ is
L(θ) = max{ ∏_{i=1}^n pi(µ) : h(µ) = θ }
     = max{ ∏_{i=1}^n pi : h(Σ_{i=1}^n pi Xi) = θ, pi ≥ 0, Σ_{i=1}^n pi = 1 },
which may be calculated directly using the Lagrange multiplier method. The computation is more involved for nonlinear h(·).

Example 4. S&P500 stock index, 17.8.1999 to 17.8.2000 (256 trading days).
Let Yi be the price on the i-th day, and
Xi = log(Yi/Yi−1) ≈ (Yi − Yi−1)/Yi−1,
which is the return, i.e. the percentage change on the i-th day. By treating the Xi as i.i.d., we construct confidence intervals for the annual volatility
σ = {255 Var(Xi)}^{1/2}.
The simple point estimator is
σ̂ = { 255 · (1/255) Σ_{i=1}^{255} (Xi − X̄)^2 }^{1/2} = 0.2116.
[Figures: time series plot of the S&P500 index over the 256 trading days, and a QQ plot of the S&P500 return quantiles against normal quantiles.]
The 95% confidence intervals are:
Method    C.I.
EL        [0.1895, 0.2422]
Normal    [0.1950, 0.2322]
The EL confidence interval is 41.67% wider than the interval based on the normal distribution, which reflects the fact that the returns have heavier tails.
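A sketch of the plug-in computation in R. The price path below is simulated for illustration only (the actual S&P500 series is not reproduced in the notes); the EL interval itself would be obtained by applying §3.2 with Xi replaced by (Xi, Xi^2)^τ.

    set.seed(1)
    price <- 1400 * exp(cumsum(rnorm(256, 0, 0.013)))   # hypothetical price path, 256 days
    x <- diff(log(price))                                # 255 daily log-returns X_i
    sigma.hat <- sqrt(255 * mean((x - mean(x))^2))       # plug-in annual volatility (cf. 0.2116 above)
    # An EL interval for sigma follows from Section 3.2 with X_i = (X_i, X_i^2)^T and
    # h(m1, m2) = sqrt{255 (m2 - m1^2)}.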
4. Estimating equations
4.1 Estimation via estimating equations
Let X1, · · · , Xn be i.i.d. from a distribution F. We are interested in some characteristic θ ≡ θ(F) which is determined by the equation
E{m(X1, θ)} = 0,
where θ is a q × 1 vector and m is an s × 1 vector-valued function. For example,
θ = EX1 if m(x, θ) = x − θ,
θ = E(X1^k) if m(x, θ) = x^k − θ,
θ = P(X1 ∈ A) if m(x, θ) = I(x ∈ A) − θ,
θ is the α-quantile if m(x, θ) = I(x ≤ θ) − α.
A natural estimator for θ is determined by the estimating equation
n^{−1} Σ_{i=1}^n m(Xi, θ̂) = 0.   (11)
Obviously, in case F is in a parametric family and m is the score function, θ̂ is the ordinary MLE.
Determined case q = s: θ̂ may be uniquely determined by (11).
Underdetermined case q > s: the solutions of (11) may form a (q − s)-dimensional set.
Overdetermined case q < s: (11) may not have an exact solution, and approximate solutions are sought. One such example is the so-called generalised method of moments estimation, which is very popular in econometrics.

Example 5. Let {(Xi, Yi), i = 1, · · · , n} be a random sample. Find a set of estimating equations for estimating γ ≡ Var(X1)/Var(Y1).
In order to estimate γ, we need to estimate µx = E(X1), µy = E(Y1) and σy^2 = Var(Y1). Put θ^τ = (µx, µy, σy^2, γ) and
m1(X, Y, θ) = X − µx,
m2(X, Y, θ) = Y − µy,
m3(X, Y, θ) = (Y − µy)^2 − σy^2,
m4(X, Y, θ) = (X − µx)^2 − σy^2 γ,
and m = (m1, m2, m3, m4)^τ. Then E{m(Xi, Yi, θ)} = 0, leading to the estimating equation
n^{−1} Σ_{i=1}^n m(Xi, Yi, θ) = 0,
whose solution is an estimator θ̂ for θ (see the sketch below).
Remark. The estimating equation method does not by itself facilitate hypothesis tests or interval estimation for θ.
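The solution of the four equations in Example 5 is available in closed form; a small R sketch (the function name and the simulated data are ours):

    # solving the estimating equation of Example 5 (divisor n, as the equation dictates)
    gamma.ee <- function(x, y) {
      mu.x <- mean(x); mu.y <- mean(y)
      sigma2.y <- mean((y - mu.y)^2)
      gamma <- mean((x - mu.x)^2) / sigma2.y
      c(mu.x = mu.x, mu.y = mu.y, sigma2.y = sigma2.y, gamma = gamma)
    }
    # x <- rnorm(200, sd = 2); y <- rnorm(200); gamma.ee(x, y)   # gamma should be near 4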
4.2 EL for estimating equations
Aim: construct statistical tests and confidence intervals for θ.
The profile empirical likelihood function of θ is
L(θ) = max{ ∏_{i=1}^n pi : Σ_{i=1}^n pi m(Xi, θ) = 0, pi ≥ 0, Σ_{i=1}^n pi = 1 }.
The following theorem follows immediately from Theorem 3.

Theorem 5. Let X1, · · · , Xn be i.i.d. and m(x, θ) be an s × 1 vector-valued function. Suppose
E{m(X1, θ0)} = 0,   Var{m(X1, θ0)} ≠ 0.
Then as n → ∞,
−2 log{L(θ0)} − 2n log n → χ^2_s
in distribution.
The theorem above applies in all determined, underdetermined and overdetermined cases.

Remarks. (i) In general L(θ) can be calculated using the method for EL for multivariate means in §3.1, treating m(Xi, θ) as a random vector.
(ii) For θ = θ̂, the solution of
n^{−1} Σ_{i=1}^n m(Xi, θ̂) = 0,
we have L(θ̂) = (1/n)^n.
(iii) For θ determined by E{m(X1, θ)} = 0, we reject the null hypothesis H0 : θ = θ0 iff
log{L(θ0)} + n log n ≤ −0.5 χ^2_{s,1−α}.
(iv) A (1 − α) confidence set for θ determined by E{m(X1, θ)} = 0 is
{ θ : log{L(θ)} + n log n > −0.5 χ^2_{s,1−α} }.

Example 6. (Confidence intervals for quantiles)
Let X1, · · · , Xn be i.i.d. For a given α ∈ (0, 1), let
m(x, θα) = I(x ≤ θα) − α.
Then E{m(Xi, θα)} = 0 implies that θα is the α-quantile of the distribution of Xi. We assume the true value of θα lies between X(1) and X(n).
The estimating equation
Σ_{i=1}^n m(Xi, θ̂α) = Σ_{i=1}^n I(Xi ≤ θ̂α) − nα = 0
entails
θ̂α = X(nα),
where X(i) denotes the i-th smallest value among X1, · · · , Xn. We assume nα is an integer, to avoid insignificant technical details (for large n, e.g. n = 100).
In fact L(θα) can be computed explicitly, as follows. Let
L(θα) = max{ ∏_{i=1}^n pi : pi ≥ 0, Σ_{i=1}^n pi I(Xi ≤ θα) = α, Σ_{i=1}^n pi = 1 }.
A (1 − β) confidence interval for the α-quantile is
Θα = { θα : log L(θα) > −n log n − 0.5 χ^2_{1,1−β} }.
Note that L(θ̂α) = (1/n)^n ≥ L(θα) for any θα, so it is always true that θ̂α ∈ Θα.
Let r = r(θα) be the integer for which
X(i) ≤ θα for i = 1, · · · , r, and X(i) > θα for i = r + 1, · · · , n.
Thus
L(θα) = max{ ∏_{i=1}^n pi : pi ≥ 0, Σ_{i=1}^r pi = α, Σ_{i=r+1}^n pi = 1 − α }
      = (α/r)^r {(1 − α)/(n − r)}^{n−r}.
Hence
Θα = { θα : log L(θα) > −n log n − 0.5 χ^2_{1,1−β} }
   = { θα : r log(nα/r) + (n − r) log{n(1 − α)/(n − r)} > −0.5 χ^2_{1,1−β} },
which can also be derived directly from a likelihood ratio test for a binomial distribution.
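The explicit form of Θα makes the computation trivial. A sketch in R, assuming no ties in the data (the function name is ours):

    # EL confidence interval for the alpha-quantile (Example 6)
    el.quantile.ci <- function(x, alpha, beta = 0.05) {
      n  <- length(x); xs <- sort(x)
      r  <- 1:(n - 1)
      crit <- r * log(n * alpha / r) + (n - r) * log(n * (1 - alpha) / (n - r))
      ok <- r[crit > -0.5 * qchisq(1 - beta, df = 1)]
      # a value theta with r(theta) = r corresponds to the interval [X_(r), X_(r+1))
      c(lower = xs[min(ok)], upper = xs[max(ok) + 1])
    }
    # e.g. x <- rexp(100); el.quantile.ci(x, alpha = 0.5)   # 95% interval for the median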
5. Empirical likelihood for estimating conditional distributions
References on kernel regression:
• Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York.
• Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
References on nonparametric estimation of distribution functions:
• Hall, P., Wolff, R. C. L. and Yao, Q. (1999). Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94, 154-163.
• Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York. Section 10.3 (also Section 6.5).
5.1 From global fitting to local fitting
Consider the linear regression model
Y = X1 β1 + · · · + Xd βd + ε,   (12)
where ε ∼ (0, σ^2). This model is linear w.r.t. the unknown coefficients β1, · · · , βd, as the variables X1, · · · , Xd may be
• quantitative inputs
• transformations of quantitative inputs, such as log, square-root etc.
• interactions between variables, e.g. X3 = X1 X2
• basis expansions, such as X2 = X1^2, X3 = X1^3, · · ·
• numeric or "dummy" coding of the levels of qualitative inputs
Put β = (β1, · · · , βd)^τ. With observations {(Yi, Xi), 1 ≤ i ≤ n}, where Xi = (Xi1, · · · , Xid)^τ, the LSE minimises
Σ_{i=1}^n (Yi − Xi^τ β)^2,   (13)
resulting in
β̂ = (X^τ X)^{−1} X^τ Y,
where Y = (Y1, · · · , Yn)^τ and X = (X1, · · · , Xn)^τ is an n × d matrix. The fitted model is
Ŷ = X β̂.
This is a global fitting, since the model is assumed to be true everywhere in the sample space and the estimator β̂ is obtained using all the available data.
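A minimal R sketch of the global least-squares fit; the simulated design and coefficients are ours.

    # global LS fit (13): beta-hat = (X^T X)^{-1} X^T Y
    set.seed(2)
    n <- 100
    X <- cbind(1, rnorm(n), runif(n))            # n x d design matrix (first column: intercept)
    Y <- X %*% c(1, 2, -1) + rnorm(n)            # responses from a linear model
    beta.hat <- solve(t(X) %*% X, t(X) %*% Y)    # the LSE
    Y.fit <- X %*% beta.hat                      # fitted values Y-hat = X beta-hat
    # coef(lm(as.vector(Y) ~ X - 1))             # the same estimate via lm()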