Economics 520, Fall 2005
Lecture Note 15: Large Sample Properties of Maximum Likelihood Estimators, CB 10.1.1-10.1.3

Previously, we showed that if there is a Minimum Variance Unbiased Estimator (MVUE) with variance equal to the Cramér-Rao bound, then the MVUE is equal to the Maximum Likelihood Estimator (MLE). However, the conditions were fairly restrictive. In models which do not satisfy those conditions, the MLE is not unbiased in general, so it cannot be the MVUE. Nevertheless, it turns out that in a certain approximate sense, the maximum likelihood estimator is unbiased and minimum variance. The approximations hold when the sample size is large, and we refer to these as asymptotic or large sample approximations.

Example

Let $X_1, \ldots, X_N$ be a random sample from an exponential distribution with arrival rate $\lambda^*$:
$$ f_X(x; \lambda^*) = \lambda^* \exp(-x \lambda^*). $$
The Cramér-Rao bound for the variance is $\lambda^{*2}/N$. The log likelihood function is
$$ L(\lambda) = \sum_{i=1}^N \left( \ln \lambda - x_i \lambda \right), $$
and the maximum likelihood estimator is $\hat{\lambda} = 1/\bar{x}$. What can we say about the large sample properties of this estimator? Using the law of large numbers we have
$$ \bar{x} \stackrel{p}{\longrightarrow} E[X] = 1/\lambda^*, $$
so
$$ \hat{\lambda} = 1/\bar{x} \stackrel{p}{\longrightarrow} 1/E[X] = \lambda^*. $$
Using the central limit theorem we also have
$$ \sqrt{N} \cdot (\bar{x} - 1/\lambda^*) \stackrel{d}{\longrightarrow} N(0, 1/\lambda^{*2}). $$
Then we can use the delta method to establish that
$$ \sqrt{N} \cdot \left( g(\bar{x}) - g(1/\lambda^*) \right) \stackrel{d}{\longrightarrow} N\left( 0, g'(1/\lambda^*)^2 / \lambda^{*2} \right). $$
Applying this with $g(a) = 1/a$, and thus $g'(a) = -1/a^2$, we get
$$ \sqrt{N} \cdot (1/\bar{x} - \lambda^*) \stackrel{d}{\longrightarrow} N(0, \lambda^{*4}/\lambda^{*2}) = N(0, \lambda^{*2}). $$
Hence, approximately,
$$ \hat{\lambda} \sim N(\lambda^*, \lambda^{*2}/N). $$
So, approximately, in large samples, this maximum likelihood estimator is unbiased and has variance approximately equal to the Cramér-Rao bound. This is true in general for maximum likelihood estimators.

Result 1

Let $X_1, \ldots, X_N$ be a random sample from $f_X(x; \theta^*)$. Assume that the regularity conditions in CB 10.6.2 hold, and let $\hat{\theta}$ be the maximum likelihood estimator:
$$ \hat{\theta} = \arg\max_\theta \sum_{i=1}^N \ln f_X(x_i; \theta). $$
Then $\hat{\theta}$ is consistent for $\theta^*$:
$$ \hat{\theta} \stackrel{p}{\longrightarrow} \theta^*, $$
and $\hat{\theta}$ is asymptotically normally distributed:
$$ \sqrt{N}(\hat{\theta} - \theta^*) \stackrel{d}{\longrightarrow} N\left( 0, I(\theta^*)^{-1} \right), $$
where $I(\theta^*)$ is the single observation information matrix:
$$ I(\theta^*) = E\left[ \left( \frac{\partial \ln f_X}{\partial \theta}(X; \theta^*) \right)^2 \right] = -E\left[ \frac{\partial^2 \ln f_X}{\partial \theta^2}(X; \theta^*) \right]. $$

First let us interpret this result using the Cramér-Rao bound. The CR bound implies that no unbiased estimator has a variance smaller than $I(\theta^*)^{-1}/N$. The maximum likelihood estimator has a limiting normal distribution
$$ \sqrt{N}(\hat{\theta} - \theta^*) \stackrel{d}{\longrightarrow} N\left( 0, I(\theta^*)^{-1} \right), $$
implying that for fixed, large $N$,
$$ \sqrt{N}(\hat{\theta} - \theta^*) \approx N\left( 0, I(\theta^*)^{-1} \right). $$
This in turn implies that
$$ \hat{\theta}_{mle} \approx N\left( \theta^*, I(\theta^*)^{-1}/N \right). $$
Now, if this were the exact distribution of the MLE, it would be the minimum variance unbiased estimator. Although this is only the approximate distribution in large samples, it seems reasonable to think of the MLE as "approximately optimal."[1]

[1] This reasoning can be made more formal. One such result states that any other estimator that is asymptotically unbiased has higher asymptotic variance at almost all points in the parameter space.

Example

To illustrate what this means, consider an example we have looked at before, where the maximum likelihood estimator differs from the minimum variance unbiased estimator. Suppose $X_1, \ldots, X_N$ are a random sample from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. We are interested in the variance $\sigma^2$. The minimum variance unbiased estimator is
$$ W_1 = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})^2. $$
The maximum likelihood estimator is
$$ W_2 = \frac{1}{N} \sum_{i=1}^N (X_i - \bar{X})^2 = \frac{N-1}{N} \cdot W_1. $$
As the sample gets large, the two estimators get close to each other. They are both consistent and have the same large sample distribution:
$$ \sqrt{N} \cdot (W_1 - \sigma^2) \stackrel{d}{\longrightarrow} N(0, 2\sigma^4), $$
and
$$ \sqrt{N} \cdot (W_2 - \sigma^2) \stackrel{d}{\longrightarrow} N(0, 2\sigma^4). $$
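The following small simulation sketch is not part of the original note; it assumes NumPy is available, and the values of $\mu$, $\sigma^2$, $N$, and the number of replications are arbitrary illustrative choices. It draws many normal samples, computes $W_1$ and $W_2$ for each, and compares the spread of $\sqrt{N}(W - \sigma^2)$ with the limiting variance $2\sigma^4$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 2.0, 3.0            # true mean and variance (illustrative values)
N, reps = 500, 10_000            # sample size and number of Monte Carlo replications

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, N))
xbar = x.mean(axis=1, keepdims=True)
ssq = ((x - xbar) ** 2).sum(axis=1)

w1 = ssq / (N - 1)               # minimum variance unbiased estimator W1
w2 = ssq / N                     # maximum likelihood estimator W2

# Both estimators should be close to sigma^2, and sqrt(N)*(W - sigma^2)
# should have variance close to the limiting value 2*sigma^4.
print("mean of W1:", w1.mean(), " mean of W2:", w2.mean())
print("var of sqrt(N)*(W1 - sigma^2):", (np.sqrt(N) * (w1 - sigma2)).var())
print("var of sqrt(N)*(W2 - sigma^2):", (np.sqrt(N) * (w2 - sigma2)).var())
print("limiting variance 2*sigma^4:  ", 2 * sigma2 ** 2)
```

With these sample sizes, both Monte Carlo variances should come out close to $2\sigma^4 = 18$, illustrating that the two estimators share the same large sample distribution.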
Sketch of Proof of Result 1:

For each value of $\theta$, we can apply a law of large numbers, so that
$$ \frac{1}{N} L(\theta) = \frac{1}{N} \sum_{i=1}^N \ln f_X(X_i; \theta) \stackrel{p}{\longrightarrow} E[\ln f_X(X; \theta)]. $$
In addition, we know from Jensen's inequality that
$$ \theta^* = \arg\max_\theta E[\ln f_X(X; \theta)]. $$
To get the result that
$$ \hat{\theta} = \arg\max_\theta \frac{1}{N} L(\theta) \stackrel{p}{\longrightarrow} \arg\max_\theta E[\ln f_X(X; \theta)] = \theta^*, $$
we need the convergence to be not just pointwise, but uniform in $\theta$, that is,
$$ \sup_\theta \left| \frac{1}{N} L(\theta) - E\left[ \frac{1}{N} L(\theta) \right] \right| \stackrel{p}{\longrightarrow} 0. $$
This means that the convergence to the limit is not much slower for some values of $\theta$ than for others. It requires stronger regularity conditions than pointwise convergence. (Sufficient, but not necessary, is that $|\ln f_X(x; \theta)| \le k(x)$ for all $\theta$, with $E[k(X)] < \infty$.)

In large samples, the derivative of the log likelihood function must be equal to zero at the maximum likelihood estimator:
$$ \frac{\partial L}{\partial \theta}(\hat{\theta}) = 0. $$
Now expand the derivative of the log likelihood function around the true value of $\theta$:
$$ 0 = \frac{\partial L}{\partial \theta}(\hat{\theta}) = \frac{\partial L}{\partial \theta}(\theta^*) + \frac{\partial^2 L}{\partial \theta^2}(\tilde{\theta}) \cdot (\hat{\theta} - \theta^*), $$
for some $\tilde{\theta}$ between $\theta^*$ and $\hat{\theta}$. In large samples $\hat{\theta} \to \theta^*$, and therefore $\tilde{\theta} \to \theta^*$. Rearranging this gives
$$ \hat{\theta} - \theta^* = -\left( \frac{\partial^2 L}{\partial \theta^2}(\tilde{\theta}) \right)^{-1} \cdot \frac{\partial L}{\partial \theta}(\theta^*), $$
or
$$ \sqrt{N} \cdot (\hat{\theta} - \theta^*) = \left( -\frac{1}{N} \frac{\partial^2 L}{\partial \theta^2}(\tilde{\theta}) \right)^{-1} \cdot \frac{1}{\sqrt{N}} \frac{\partial L}{\partial \theta}(\theta^*). $$
In large samples, the first factor satisfies
$$ -\frac{1}{N} \frac{\partial^2 L}{\partial \theta^2}(\tilde{\theta}) = -\frac{1}{N} \sum_{i=1}^N \frac{\partial^2 \ln f_X}{\partial \theta^2}(x_i; \tilde{\theta}) \stackrel{p}{\longrightarrow} I(\theta^*), $$
that is, it converges in probability to the information matrix $I(\theta^*)$. The second part converges in distribution,
$$ \frac{1}{\sqrt{N}} \frac{\partial L}{\partial \theta}(\theta^*) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\partial \ln f_X}{\partial \theta}(x_i; \theta^*) \stackrel{d}{\longrightarrow} N(0, I(\theta^*)), $$
because it satisfies a central limit theorem: the score has mean zero at $\theta^*$ and variance equal to the information matrix. Combining the two pieces using Slutsky's theorem gives
$$ \sqrt{N} \cdot (\hat{\theta} - \theta^*) \stackrel{d}{\longrightarrow} I(\theta^*)^{-1} \cdot N(0, I(\theta^*)) = N(0, I(\theta^*)^{-1}). $$
This completes the argument.
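The two ingredients of this argument can be checked numerically. The sketch below is not from the original note; it assumes NumPy is available and uses the exponential model from the first example, where $\partial \ln f_X / \partial \lambda = 1/\lambda - x$ and $\partial^2 \ln f_X / \partial \lambda^2 = -1/\lambda^2$, so that $I(\lambda) = 1/\lambda^2$ (in this model the second derivative does not depend on $x$). It verifies by simulation that the single observation score has mean zero and variance $I(\lambda^*)$, and that minus the second derivative equals $I(\lambda^*)$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_star = 2.0                        # true arrival rate (illustrative value)
n = 200_000                           # draws from the single-observation distribution

x = rng.exponential(scale=1.0 / lam_star, size=n)

score = 1.0 / lam_star - x            # d/d(lambda) of ln f(x; lambda), evaluated at lambda*
neg_second_deriv = 1.0 / lam_star**2  # -d^2/d(lambda)^2 of ln f(x; lambda); constant in x here

# The score should average to zero and have variance I(lambda*) = 1/lambda*^2,
# and minus the second derivative equals I(lambda*) as well.
print("mean of score:              ", score.mean())
print("variance of score:          ", score.var())
print("minus second derivative:    ", neg_second_deriv)
print("information I = 1/lambda*^2:", 1.0 / lam_star**2)
```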
Random Vectors and Multiple Parameters

In many models, the parameter $\theta$ may be a vector. For example, in the normal model with mean $\mu$ and variance $\sigma^2$, we can think of the parameter as a 2-vector $\theta = (\mu, \sigma)'$. It turns out that everything extends very easily to the case with a vector parameter, but we need to introduce a bit of additional notation to state it clearly.

First, let us take a step back and consider random vectors. Suppose $X$ is a $k \times 1$ random vector $X = (X_1, \ldots, X_k)'$. Here, the $X_1, \ldots, X_k$ are (scalar) random variables, not necessarily independent or identically distributed. Define the mean of $X$ as
$$ E(X) = \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} := \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_k) \end{pmatrix}. $$
The variance matrix (or variance-covariance matrix) is
$$ V(X) = E[(X - \mu)(X - \mu)'] = \begin{pmatrix} E[(X_1 - \mu_1)^2] & E[(X_1 - \mu_1)(X_2 - \mu_2)] & \cdots \\ E[(X_2 - \mu_2)(X_1 - \mu_1)] & E[(X_2 - \mu_2)^2] & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}. $$
So, the $(i, j)$ element of $V(X)$ is the covariance between $X_i$ and $X_j$.

The CDF of the random vector $X$ can be defined as before. For $x \in \mathbb{R}^k$,
$$ F_X(x) := P(X \le x), $$
where now $X \le x$ means that the inequality holds for every element: $X_i \le x_i$ for each $i$.

Now consider a sequence of random vectors $X_1, X_2, \ldots$ (Be careful with the notation: now each $X_n$ is a $k$-dimensional random vector.) Then $X_n$ converges in distribution to a random vector $X$ if our previous definition holds: $F_{X_n}(x) \to F_X(x)$ at each continuity point of $F_X$.

For convergence in probability, we need to modify our previous definition only slightly. For a vector $x \in \mathbb{R}^k$, its length is defined as
$$ ||x|| := \left( \sum_{i=1}^k x_i^2 \right)^{1/2}. $$
This is just the usual "Euclidean length" of a vector. Now, a sequence of random vectors $X_n$ converges in probability to a constant vector $c \in \mathbb{R}^k$ if, for every $\epsilon > 0$,
$$ P(||X_n - c|| > \epsilon) \to 0. $$

The standard Law of Large Numbers and the Central Limit Theorem extend to the vector case. For example, if $X_1, X_2, \ldots$ are IID random vectors with mean $\mu$ and variance matrix $\Sigma$ (note: this allows the elements within a vector to be dependent and to have different distributions), then
$$ \sqrt{n}(\bar{X} - \mu) \stackrel{d}{\longrightarrow} N(0, \Sigma), $$
where $N(0, \Sigma)$ is the multivariate normal distribution with mean $(0, 0, \ldots, 0)'$ and variance matrix $\Sigma$.

Finally, having developed this extra notation, we can consider the large sample properties of the MLE. Suppose that our model $f(x; \theta)$ now depends on a vector parameter $\theta = (\theta_1, \ldots, \theta_k)'$. Result 1 extends to this case as follows:

Result 2

Let $X_1, \ldots, X_N$ be a random sample from $f_X(x; \theta^*)$, where $\theta^*$ is $k \times 1$. Let $\hat{\theta}$ be the maximum likelihood estimator:
$$ \hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)' = \arg\max_\theta \sum_{i=1}^N \ln f_X(x_i; \theta). $$
Then $\hat{\theta}$ is consistent for $\theta^*$:
$$ \hat{\theta} \stackrel{p}{\longrightarrow} \theta^*, $$
and $\hat{\theta}$ is asymptotically normally distributed:
$$ \sqrt{N}(\hat{\theta} - \theta^*) \stackrel{d}{\longrightarrow} N\left( 0, I(\theta^*)^{-1} \right), $$
where $I(\theta^*)$ is the single observation information matrix:
$$ I(\theta^*) = E\left[ \frac{\partial \ln f(X; \theta^*)}{\partial \theta} \cdot \frac{\partial \ln f(X; \theta^*)}{\partial \theta'} \right] = -E\left[ \frac{\partial^2 \ln f(X; \theta^*)}{\partial \theta \, \partial \theta'} \right]. $$
(Note: the score $\partial \ln f(X; \theta^*)/\partial \theta$ is $k \times 1$ and the information matrix $I(\theta^*)$ is $k \times k$.)
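As a concrete illustration of Result 2, here is a Monte Carlo sketch that is not part of the original note; it assumes NumPy is available and uses illustrative parameter values. For the normal model with $\theta = (\mu, \sigma)'$, the MLE has the closed form $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = W_2$, and the single observation information matrix works out to $I(\theta^*) = \mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, so $I(\theta^*)^{-1} = \mathrm{diag}(\sigma^2, \sigma^2/2)$. The code compares the Monte Carlo variance matrix of $\sqrt{N}(\hat{\theta} - \theta^*)$ with this inverse information matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_star, sigma_star = 1.0, 2.0     # true parameter vector theta* = (mu, sigma)'
N, reps = 400, 10_000              # sample size and number of Monte Carlo replications

x = rng.normal(mu_star, sigma_star, size=(reps, N))

# Closed-form MLE of theta = (mu, sigma)' for each replication.
mu_hat = x.mean(axis=1)
sigma_hat = np.sqrt(((x - mu_hat[:, None]) ** 2).mean(axis=1))

# Scaled estimation errors sqrt(N) * (theta_hat - theta*), one row per replication.
z = np.sqrt(N) * np.column_stack([mu_hat - mu_star, sigma_hat - sigma_star])

# Result 2 suggests the variance matrix of these errors is close to I(theta*)^{-1},
# which for this model is diag(sigma^2, sigma^2 / 2).
print("Monte Carlo variance matrix of sqrt(N)*(theta_hat - theta*):")
print(np.cov(z, rowvar=False))
print("Asymptotic approximation I(theta*)^{-1}:")
print(np.diag([sigma_star ** 2, sigma_star ** 2 / 2]))
```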