Lecture Notes 9 Asymptotic (Large Sample) Theory 1 Review of o, O, etc. 1. an = o(1) mean an → 0 as n → ∞. P 2. A random sequence An is op (1) if An −→ 0 as n → ∞. P 3. A random sequence An is op (bn ) if An /bn −→ 0 as n → ∞. √ √ P 4. nb op (1) = op (nb ), so n op (1/ n) = op (1)−→ 0. 5. op (1) × op (1) = op (1). 1. an = O(1) if |an | is bounded by a constant as n → ∞. 2. A random sequence Yn is Op (1) if for every > 0 there exists a constant M such that limn→∞ P (|Yn | > M ) < as n → ∞. 3. A random sequence Yn is Op (bn ) if Yn /bn is Op (1). 4. If Yn Y , then Yn is Op (1). (potential test qustion: prove this.) √ √ 5. If n(Yn − c) Y then Yn = OP (1/ n). (potential test qustion: prove this.) 6. Op (1) × Op (1) = Op (1). 7. op (1) × Op (1) = op (1). 2 Distances Between Probability Distributions Let P and Q be distributions with densities p and q. We will use the following distances between P and Q. 1. Total variation distance TV(P, Q) = supA |P (A) − Q(A)|. R 2. L1 distance d1 (P, Q) = |p − q|. qR √ √ ( p − q)2 . 3. Hellinger distance h(P, Q) = 4. Kullback-Leibler distance K(P, Q) = R 5. L2 distance d2 (P, Q) = (p − q)2 . R p log(p/q). Here are some properties of these distances: 1. TV(P, Q) = 12 d1 (P, Q). (prove this!) 1 2. h2 (P, Q) = 2(1 − R√ pq). p 3. TV(P, Q) ≤ h(P, Q) ≤ 2TV(P, Q). 4. h2 (P, Q) ≤ K(P, Q). p 5. TV(P, Q) ≤ h(P, Q) ≤ K(P, Q). p 6. TV(P, Q) ≤ K(P, Q)/2. 3 Consistency θbn = T (X n ) is consistent for θ if P θbn −→ θ as n → ∞. In other words, θbn − θ = op (1). Here are two common ways to prove that θbn consistent. Method 1: Show that, for all ε > 0, P(|θbn − θ| ≥ ε) −→ 0. Method 2. Prove convergence in quadratic mean: MSE(θbn ) = Bias2 (θbn ) + Var(θbn ) −→ 0. qm p If bias → 0 and var → 0 then θbn → θ which implies that θbn → θ. P Example 1 Bernoulli(p). The mle pb has bias 0 and variance p(1 − p)/n → 0. So pb−→ p and is consistent. Now let ψ = log(p/(1−p)). Then ψb = log(b p/(1− pb)). Now ψb = g(b p) where P b g(p) = log(p/(1 − p)). By the continuous mapping theorem, ψ −→ ψ so this is consistent. Now consider X +1 pb = . n+1 Then p−1 Bias = E(b p) − p = − →0 n(1 + n) and Var = p(1 − p) → 0. n So this is consistent. 2 Example 2 X1 , . . . , Xn ∼ Uniform(0, θ). Let θbn = X(n) . By direct proof (we did it earlier) P we have θbn −→ θ. Method of moments estimators typically consistent. Consider one parameter. Recall Pare n −1 −1 b exists and is continuous. So that µ(θ) = m where m = n i=1 Xi . Assume that µ P −1 b θ = µ (m). By the WLLN m−→ µ(θ). So, by the continuous mapping Theorem, P θbn = µ−1 (m)−→ µ−1 (µ(θ)) = θ. 4 Consistency of the MLE Under regularity conditions, the mle is consistent. Let us prove this in a special case. This will also reveal a connection between the mle and Hellinger distance. Suppose that the model consists of finitely many distinct densities {p0 , p1 , . . . , pN }. The likelihood function is L(pj ) = n Y pj (Xi ). i=1 The mle pb is the density pj that maximizes L(pj ). Without loss of generality, assume that the true density is p0 . Theorem 3 P(b p 6= p0 ) → 0 as n → ∞. Proof. Let us begin by first proving an inequality. Let j = h(p0 , pj ). Then, for j 6= 0, s ! ! n n Y Y L(pj ) p (X ) p (X ) 2 4 2 j i j i P > e−nj /2 = P > e−nj /2 = P > e−nj /2 L(p0 ) p (X ) p (X ) i 0 i i=1 0 i=1 s s ! ! n n Y Y p (X ) p (X ) 2 2 j i j i ≤ enj /4 E = enj /4 E p (X ) p (X 0 i 0 i) i=1 i=1 Z n n n 2j h2 (p0 , pj ) √ n2j /4 n2j /4 n2j /4 = e pj p0 =e 1− =e 1− 2 2 2 j 2 2 2 2 ≤ enj /4 e−nj /2 = e−nj /2 . = enj /4 exp n log 1 − 2 3 R√ We used the fact that h2 (p0 , pj ) = 2 − 2 p0 pj and also that log(1 − x) ≤ −x for x > 0. Let = min{1 , . . . , N }. Then L(pj ) −n2j /2 >e for some j P(b p 6= p0 ) ≤ P L(p0 ) N X L(pj ) −n2j /2 ≤ >e P L(p0 ) j=1 ≤ N X 2 e−nj /2 ≤ N e−n 2 /2 → 0. j=1 We can prove a similar result using Kullback-Leibler distance as follows. Let X1 , X2 , . . . be iid Fθ . Let θ0 be the true value of θ and let θ be some other value. We will show that L(θ0 )/L(θ) > 1 with probability tending to 1. We assume that the model is identifiable; this means that θ1 6= θ2 implies that K(θ1 , θ2 ) > 0 where K is the Kullback-Leibler distance. Theorem 4 Suppose the model is identifiable. Let θ0 be the true value of the parameter. For any θ 6= θ0 L(θ0 ) >1 →1 P L(θ) as n → ∞. Proof. We have 1 (`(θ0 ) − `(θ)) = n n n 1X 1X log p(Xi ; θ0 ) − log p(Xi ; θ) n i=1 n i=1 p → E(log p(X; θ0 )) − E(log p(X; θ)) Z Z = (log p(x; θ0 ))p(x; θ0 )dx − (log p(x; θ))p(x; θ0 )dx Z p(x; θ0 ) = p(x; θ0 )dx log p(x; θ) = K(θ0 , θ) > 0. So P L(θ0 ) >1 = P (`(θ0 ) − `(θ) > 0) L(θ) 1 = P (`(θ0 ) − `(θ)) > 0 → 1. n This is not quite enough to show that θbn → θ0 . 4 Example 5 Inconsistency of an mle. In all examples so far n → ∞, but the number of parameters is fixed. What if the number of parameters also goes to ∞? Let Y11 , Y12 Y21 , Y22 .. . Yn1 , Yn2 Some calculations show that ∼ N (µ1 , σ 2 ) ∼ N (µ2 , σ 2 ) . ∼ .. ∼ N (µn , σ 2 ). n X 2 X (Yij − Y i )2 σ b = . 2n i=1 j=1 2 It is easy to show (good test question) that p σ b2 −→ σ2 . 2 Note that the modified estimator 2b σ 2 is consistent. The reason why consistency fails is because the dimension of the parameter space is increasing with n. Theorem 6 Under regularity conditions on the model {p(x; θ) : θ ∈ Θ}, the mle is consistent. 5 Score and Fisher Information The score and Fisher information are the key quantities in many aspects of statistical inference. (See Section 7.3.2 of CB.) Suppose for now that θ ∈ R. • L(θ) = p(xn ; θ) • `(θ) = log L(θ) • S(θ) = ∂ ∂θ `(θ) ← score function. Recall that the value θb that maximizes L(θ) is the maximum likelihood estimator (mle). Equivalently, θb maximizes `(θ). Note that θb = T (X1 , . . . , Xn ) is a function of the data. Often, we get θb by differentiation. In that case θb solves b = 0. S(θ) We’ll discuss the mle in detail later. 5 Some Notation: Recall that Z Eθ (g(X)) ≡ g(x)p(x; θ)dx. Theorem 7 Under regularity conditions, Eθ [S(θ)] = 0. In other words, Z Z ··· ∂ log p(x1 , . . . , xn ; θ) ∂θ p(x1 , . . . , xn ; θ)dx1 . . . dxn = 0. That is, if the expected value is taken at the same θ as we evaluate S, then the expectation is 0. This does not hold when the θ’s mismatch: Eθ0 [S(θ1 )] 6= 0. Proof. ∂ log p(xn ; θ) p(xn ; θ) dx1 · · · dxn ∂θ Z Z ∂ p(xn ; θ) ∂θ = ··· p(xn ; θ) dx1 · · · dxn n p(x ; θ) Z Z ∂ · · · p(xn ; θ) dx1 · · · dxn = ∂θ | {z } Z Eθ [S(θ)] = Z ··· 1 = 0. Example 8 Let X1 , . . . , Xn ∼ N (θ, 1). Then S(θ) = n X (Xi − θ). i=1 Warning: If the support of f depends on θ, then R ··· R and ∂ ∂θ cannot be switched. The next quantity of interest is the Fisher Information or Expected Information. The information is used to calculate the variance of quantities that arise in inference problems b It is called information because it tells how much information is in the such as the mle θ. 6 likelihood about θ. The definition is: I(θ) = Eθ [S(θ)2 ] 2 = Eθ [S(θ)2 ] − Eθ [S(θ)] = Varθ (S(θ)) since Eθ [S(θ)] = 0 2 ∂ = Eθ − 2 `(θ) ← easiest way to calculate ∂θ We will prove the final equality under regularity conditions shortly. I(θ) grows linearly in n, so for an iid sample, a more careful notation would be In (θ) " n # X ∂2 ∂2 In (θ) = E − 2 l(θ) = E − log p(Xi ; θ) 2 ∂θ ∂θ i=1 2 ∂ log p(X1 ; θ) = nI1 (θ). = −nE ∂θ2 Note that the Fisher information is a function of θ in two places: • The derivate is w.r.t. θ and the information is evaluated at a particular value of θ. • The expectation is w.r.t. θ also. The notation only allows for a single value of θ because the two quantities should match. A related quantity of interest is the observed information, defined as n X ∂2 ∂2 Ibn (θ) = − 2 `(θ) = − log p(Xi ; θ). 2 ∂θ ∂θ i=1 P By the LLN n1 Ibn (θ)−→ I1 (θ). So observed information can be used as a good approximation to the Fisher information. h 2 i −∂ 2 Let us prove the identity: Eθ [S(θ) ] = Eθ ∂θ2 `(θ) . For simplicity take n = 1. First note that 00 Z Z Z Z 00 p p 0 00 p=1 ⇒ p =0 ⇒ p =0 ⇒ p=0 ⇒ E = 0. p p Let ` = log p and S = `0 = p0 /p. Then `00 = (p00 /p) − (p0 /p)2 and 0 2 p 2 2 2 V (S) = E(S ) − (E(S)) = E(S ) = E p 0 2 00 p p = E −E p p 00 0 2 ! p p − = −E p p = −E(`00 ). 7 b ≈ 1/In (θ). Why is I(θ) called “Information”? Later we will see that Var(θ) The Vector Case. Let θ = (θ1 , · · · , θK ). L(θ) and `(θ) are defined as before. h i ∂`(θ) • S(θ) = ∂θi a vector of dimension K i=1,··· ,K • Information I(θ) = Var[S(θ)] is the variance-covariance matrix of S(θ) = [Iij ]ij=1,··· ,k where Iij = −Eθ ∂ 2 `(θ) . ∂θi ∂θj b (This is the inverse of the matrix, evaluated at • I(θ)−1 is the asymptotic variance of θ. the proper component of the matrix.) Example 9 X1 , · · · , Xn ∼ N (µ, γ) n Y −n −1 −1 1 2 2 √ (xi − µ) ∝ γ 2 exp Σ(xi − µ) exp L(µ, γ) = 2γ 2γ 2πγ i=1 n 1 `(µ, γ) = K − log γ − Σ(xi − µ)2 2 2γ 1 Σ(xi − µ) γ S(µ, γ) = n + 2γ12 Σ(xi − µ)2 − 2γ −n −1 Σ(xi − µ) γ γ2 I(µ, γ) = −E −1 Σ(xi − µ) 2γn2 − γ13 Σ(xi − µ)2 γ2 n 0 γ = 0 2γn2 You can check that Eθ (S) = (0, 0)T . 6 Efficiency and Asymptotic Normality √ If n(θbn − θ) N (0, v 2 ) then we call v 2 the asymptotic variance of θbn . This is not the same as the limit of the variance which is limn→∞ nVar(θbn ). Consider X n . In this case, the asymptotic variance is σ 2 . We also have that limn→∞ nVar(X n ) = 2 σ . In this case, they are the same. In general, the latter may be larger (or even infinite). 8 Example 10 (Example 10.1.10) Suppose we observe Yn ∼ N (0, 1) with probability pn and Yn ∼ N (0, σn2 ) with probability 1 − pn . We can write this as a hierachical model: Wn ∼ Bernoulli(pn ) Yn |Wn ∼ N (0, Wn + (1 − Wn )σn2 ). Now, Var(Yn ) = VarE(Yn |Wn ) + EVar(Yn |Wn ) = Var(0) + E(Wn + (1 − Wn )σn2 ) = pn + (1 − pn )σn2 . Suppose that pn → 1, σn → ∞ and that (1 − pn )σn2 → ∞. Then Var(Yn ) → ∞. Then P(Yn ≤ a) = pn P(Z ≤ a) + (1 − pn )P(Z ≤ a/σn ) → P(Z ≤ a) and so Yn N (0, 1). So the asymptotic variance is 1. Suppose we want to estimate τ (θ). Let v(θ) = where I(θ) = Var ∂ log p(X; θ) ∂θ |τ 0 (θ)|2 I(θ) = −Eθ ∂2 log p(X; θ) . ∂θ2 We call v(θ) the Cramer-Rao lower bound. Generally, any well-behaved estimator will √ have a limiting variance bigger than or equal to v(θ). We say that Wn is efficient if n(Wn − τ (θ)) N (0, v(θ)). Theorem 11 Let X1 , X2 , . . . , be iid. Assume that the model satisfies the regularity conditions in 10.6.2. Let θb be the mle. Then √ b − τ (θ)) n(τ (θ) N (0, v(θ)). b is consistent and efficient. So τ (θ) We will now prove the asymptotic normality of the mle. Theorem 12 √ 1 N 0, . I(θ) n(θbn − θ) Hence, θbn = θ + OP 9 1 √ n . Proof. By Taylor’s theorem b = `0 (θ) + (θb − θ)`00 (θ) + · · · . 0 = `0 (θ) Hence √ n(θb − θ) ≈ Now √1 `0 (θ) n − n1 `00 (θ) = A . B n √ √ 1 1X A = √ `0 (θ) = n × S(θ, Xi ) = n(S − 0) n i=1 n where S(θ, Xi ) is the score function based on Xi . Recall that p E(S(θ, Xi )) = 0 and Var(S(θ, Xi )) = I(θ). By the central limit theorem, A N (0, I(θ)) = I(θ)Z where Z ∼ N (0, 1). By the WLLN, P B −→ − E(`00 ) = I(θ). By Slutsky’s theorem p I(θ)Z 1 Z = N 0, =p . I(θ) I(θ) I(θ) A B So √ n(θb − θ) 1 N 0, . I(θ) Theorem 11 follows by the delta method: √ n(θbn − θ) N (0, 1/I(θ)) implies that √ N (0, (τ 0 (θ))2 /I(θ)). n(τ (θbn ) − τ (θ)) The standard error of θb is s se = s 1 = nI(θ) The estimated standard error is s se b = 1 b In (θ) 1 . In (θ) . b is The standard error of τb = τ (θ) s se = |τ 0 (θ)| = nI(θ) 10 s |τ 0 (θ)| . In (θ) The estimated standard error is s se b = b |τ 0 (θ)| . b In (θ) Example 13 X1 , · · · , Xn iid Exponential (θ). Let t = x. So: p(z; θ) = θe−θx , L(θ) = , I(θ) = e−nθt+n ln θ , l(θ) = −nθt + n ln θ, S(θ) = nθ − nt ⇒ θb = 1t = X1 , l00 (θ) = −n θ2 2 E[−l00 (θ)] = n2 , θb ≈ N θ, θ . θ n Example 14 X1 , . . . , Xn ∼ Bernoulli(p). The mle is pb = X. The Fisher information for n = 1 is 1 . I(p) = p(1 − p) So √ n(b p − p) N (0, p(1 − p)). Informally, p(1 − p) pb ≈ N p, . n The asymptotic variance is p(1 − p)/n. This can be estimated by pb(1 − pb)/n. That is, the estimated standard error of the mle is r pb(1 − pb) se b = . n Now suppose we want to estimate τ = p/(1 − p). The mle is τb = pb/(1 − pb). Now 1 ∂ p = ∂p 1 − p (1 − p)2 The estimated standard error is r se(b b τ) = 7 1 pb(1 − pb) × = n (1 − pb)2 s pb . n(1 − pb)3 Relative Efficiency If √ n(Wn − τ (θ)) √ n(Vn − τ (θ)) 2 N (0, σW ) 2 N (0, σV ) then the asymptotic relative efficiency (ARE) is ARE(Vn , Wn ) = 11 2 σW . σV2 Example 15 (10.1.17). Let X1 , . . . , Xn ∼ Poisson(λ). The mle of λ is X. Let τ = P(Xi = 0). So τ = e−λ . Define Yi = I(Xi = 0). This suggests the estimator n Wn = 1X Yi . n i=1 Another estimator is the mle Vn = e−λ . b The delta method gives Var(Vn ) ≈ λe−2λ . n We have √ n(Wn − τ ) √ n(Vn − τ ) N (0, e−λ (1 − e−λ )) N (0, λe−2λ ). So ARE(Wn , Vn ) = eλ λ ≤ 1. −1 Since the mle is efficient, we know that, in general, ARE(Wn , mle) ≤ 1. 8 Robustness The mle is efficient only if the model is right. The mle can be bad if the model is wrong. That is why we should consider using nonparametric methods. One can also replace the mle with estimators that are more robust. Suppose we assume that X1 , . . . , Xn ∼ N (θ, σ 2 ). The mle is θbn = X n . Suppose, however that we have a perturbed model Xi is N (θ, σ 2 ) with probability 1 − δ and Xi is Cauchy with probability δ. Then, Var(X n ) = ∞. Consider the median Mn . We will show that ARE(median, mle) = .64. But, under the perturbed model the median still performs well while the mle is terrible. In other words, we can trade efficiency for robustness. Let us now find the limiting distribution of Mn . √ Let Yi = I(Xi ≤ µ + a/ n). Then Yi ∼ Bernoulli(pn ) where √ a 1 a pn = P (µ + a/ n) = P (µ) + √ p(µ) + o(n−1/2 ) = + √ p(µ) + o(n−1/2 ). 2 n n 12 Also, P i Yi has mean npn and standard deviation p σn = npn (1 − pn ). Note that, a Mn ≤ µ + √ n if and only if X Yi ≥ i n+1 . 2 Then, ! X √ n+1 a =P P( n(Mn − µ) ≤ a) = P Mn ≤ µ + √ Yi ≥ 2 n i P n+1 − npn i Yi − npn ≥ 2 . = P σn σn Now, n+1 2 − npn → −2ap(µ) σn and hence √ Z Z P( n(Mn − µ) ≤ a) → P(Z ≥ −2ap(µ)) = P − ≤a =P ≤a 2p(µ) 2p(µ) so that √ n(Mn − µ) N For a standard Normal, (2p(0))2 = .64. 13 1 0, (2p(µ))2 .