EMPIRICAL PROCESSES: THEORY AND APPLICATIONS

From the lectures of the "Corso Estivo di Statistica e Calcolo delle Probabilità" (Summer Course in Statistics and Probability), Torgnon (Aosta), July 2003

Jon A. Wellner, University of Washington
Moulinath Banerjee, University of Michigan

Edited by Sergio Venturini with the collaboration of D. Ait Aoudio, S. Antignani, R. Argiento, A. Barla, S. Bianconcini, G. Cappelletti, B. Casella, M. Copetti, P. De Blasi, V. Edefonti, G. Esposito, A. Farcomeni, B. Martinucci, E. Masiello, C. May, P. Nastro, L. Sangalli, C. Valerio

Contents

I Empirical Processes: Theory
1 Introduction
  1.1 Some History
  1.2 Examples
2 Weak convergence: the fundamental theorems
  2.1 Exercises
3 Maximal Inequalities and Chaining
  3.1 Orlicz norms and the Pisier inequality
  3.2 Gaussian and sub-Gaussian processes via Hoeffding's Inequality
  3.3 Bernstein's inequality and $\psi_1$-Orlicz norms for maxima
  3.4 Exercises
4 Inequalities for sums of independent processes
  4.1 Symmetrization inequalities
  4.2 The Ottaviani Inequality
  4.3 Lévy's Inequalities
  4.4 Hoffmann-Jørgensen Inequalities
5 Glivenko-Cantelli Theorems
  5.1 Glivenko-Cantelli classes $\mathcal{F}$
  5.2 Universal and Uniform Glivenko-Cantelli classes
  5.3 Preservation of the GC property
  5.4 Exercises
6 Donsker Theorems: Uniform CLT's
  6.1 Uniform Entropy Donsker Theorem
  6.2 Bracketing Entropy Donsker Theorems
  6.3 Donsker Theorem for Classes Changing with Sample Size
  6.4 Universal and Uniform Donsker Classes
  6.5 Exercises
7 VC-theory: bounding uniform covering numbers
  7.1 Introduction
  7.2 Convex Hulls
8 Bracketing Numbers
  8.1 Smooth Functions
  8.2 Monotone Functions
  8.3 Convex Functions and Convex Sets
  8.4 Lower layers
  8.5 Exercises
9 Multiplier Inequalities and CLT
  9.1 The unconditional multiplier CLT
  9.2 Conditional multiplier CLT's

II Empirical Processes: Applications
10 Consistency of Maximum Likelihood Estimators
  10.1 Exercises
11 M-Estimators: the Argmax Continuous Mapping Theorem
12 Rates of convergence
13 M-Estimators and Z-Estimators
  13.1 M-Estimators, continued
  13.2 Z-Estimators: Huber's Z-Theorem
  13.3 Z-Estimators: van der Vaart's Z-Theorem
14 Bootstrap Empirical Processes
  14.1 Introduction
    14.1.1 The general idea
    14.1.2 Consistency of the Bootstrap Estimator
  14.2 The Empirical Bootstrap
    14.2.1 Basic definitions and results
    14.2.2 The Delta Method for the Empirical Bootstrap
  14.3 The Exchangeable Bootstrap
15 Semiparametric Models
  15.1 Tangent spaces and Information
  15.2 Lower Bounds
  15.3 Efficient Score Functions
  15.4 Semiparametric models and Empirical Processes
  15.5 Efficient MLE in Semiparametric Mixture Models
  15.6 Example: Errors in variables

III Special topics
16 Cube root asymptotics
  16.1 Introduction
  16.2 Limiting processes and relevant functionals
17 Asymptotic Theory for Monotone Functions
18 Split Point Estimation in Decision Trees
  18.1 Split Point Estimation in Non Parametric Regression
  18.2 Split Point Estimation for a Hazard Function
Bibliography

List of Figures

16.1 The Greatest Convex Minorant $G_{1,1}$ of $W(t) + t^2$
16.2 The unconstrained one-sided convex minorants $G^L_{1,1}$ and $G^R_{1,1}$ of $W(t) + t^2$
16.3 The constrained one-sided convex minorants $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$ of $W(t) + t^2$
16.4 The minorants $G_{1,1}$, $G^0_{1,1}$, $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$
16.5 Close-up view of $G_{1,1}$, $G^0_{1,1}$, $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$
17.1 Cusum diagram and greatest convex minorant
17.2 The universality of $D$

Part I
Empirical Processes: Theory

Chapter 1
Introduction

1.1 Some History

Empirical process theory began in the 1930's and 1940's with the study of the empirical distribution function $F_n$ and the corresponding empirical process. If $X_1, \ldots, X_n$ are i.i.d. real-valued random variables with distribution function $F$ (and corresponding probability measure $P$ on $\mathbb{R}$), then the empirical distribution function is
\[
F_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{(-\infty, x]}(X_i), \qquad x \in \mathbb{R},
\]
and the corresponding empirical process is
\[
Z_n(x) = \sqrt{n}\,(F_n(x) - F(x)).
\]
Two of the basic results concerning $F_n$ and $Z_n$ are the Glivenko-Cantelli theorem and the Donsker theorem:

Theorem 1.1 (Glivenko-Cantelli (1933))
\[
\|F_n - F\|_\infty = \sup_{-\infty < x < \infty} |F_n(x) - F(x)| \to_{a.s.} 0.
\]

Theorem 1.2 (Donsker (1952))
\[
Z_n \Rightarrow Z \equiv \mathbb{U}(F) \quad \text{in } D(\mathbb{R}, \|\cdot\|_\infty),
\]
where $\mathbb{U}$ is a standard Brownian bridge process on $[0, 1]$.

Recall that a standard Brownian bridge process on $[0, 1]$ is a zero-mean Gaussian process $\mathbb{U}$ with covariance function
\[
E(\mathbb{U}(s)\mathbb{U}(t)) = s \wedge t - st, \qquad s, t \in [0, 1].
\]
The symbol $\Rightarrow$ denotes weak convergence of stochastic processes in the sense that will be specified in Chapter 2.

Remark 1.1 In the statement of Donsker's theorem we have ignored measurability difficulties related to the fact that $D(\mathbb{R}, \|\cdot\|_\infty)$ is a nonseparable Banach space. For most of this text (the exception is in Chapters 2 and 3) we will continue to ignore these difficulties.
For a complete treatment of the necessary weak convergence theory, see van der Vaart and Wellner (1996), Part I - Stochastic Convergence. The occasional stars as superscripts on $P$'s and functions refer to outer measures in the first case, and minimal measurable envelopes in the second case. We recommend ignoring the $*$'s on a first reading.

The need for generalizations of Theorems 1.1 and 1.2 became apparent in the 1950's and 1960's. In particular, it became apparent that when the observations are in a more general sample space $\mathcal{X}$ (such as $\mathbb{R}^d$, or a Riemannian manifold, or some space of functions, etc.), then the empirical distribution function is not as natural. It becomes much more natural to consider the empirical measure $\mathbb{P}_n$ indexed by some class of subsets $\mathcal{C}$ of the sample space $\mathcal{X}$, or, more generally yet, $\mathbb{P}_n$ indexed by some class of real-valued functions $\mathcal{F}$ defined on $\mathcal{X}$.

Suppose now that $X_1, \ldots, X_n$ are i.i.d. $P$ on $\mathcal{X}$. Then the empirical measure $\mathbb{P}_n$ is defined by
\[
\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i};
\]
thus for any Borel set $A \subset \mathcal{X}$
\[
\mathbb{P}_n(A) = \frac{1}{n} \sum_{i=1}^n 1_A(X_i) = \frac{\#\{i \le n : X_i \in A\}}{n}.
\]
For a real-valued function $f$ on $\mathcal{X}$, we write
\[
\mathbb{P}_n(f) = \int f \, d\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n f(X_i).
\]
If $\mathcal{C}$ is a collection of subsets of $\mathcal{X}$, then $\{\mathbb{P}_n(C) : C \in \mathcal{C}\}$ is the empirical measure indexed by $\mathcal{C}$. If $\mathcal{F}$ is a collection of real-valued functions defined on $\mathcal{X}$, then $\{\mathbb{P}_n(f) : f \in \mathcal{F}\}$ is the empirical measure indexed by $\mathcal{F}$.

The empirical process $\mathbb{G}_n$ is defined by
\[
\mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P);
\]
thus $\{\mathbb{G}_n(C) : C \in \mathcal{C}\}$ is the empirical process indexed by $\mathcal{C}$, while $\{\mathbb{G}_n(f) : f \in \mathcal{F}\}$ is the empirical process indexed by $\mathcal{F}$. (Of course the case of sets is a special case of indexing by functions, obtained by taking $\mathcal{F} = \{1_C : C \in \mathcal{C}\}$.) Note that the classical empirical distribution function for real-valued random variables can be viewed as the special case of the general theory for which $\mathcal{X} = \mathbb{R}$, $\mathcal{C} = \{(-\infty, x] : x \in \mathbb{R}\}$.
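As a small numerical illustration of these definitions (a sketch only, not part of the original notes), the following Python code computes $\mathbb{P}_n(f)$ for a generic function $f$, recovers $F_n$ as the special case of indicators of half-lines, and evaluates $\sup_x |F_n(x) - F(x)|$ under the assumption that $P$ is the Uniform(0,1) distribution. The function names, the seed and the sample sizes are illustrative choices.

```python
import random
from typing import Callable, Sequence


def empirical_measure(sample: Sequence[float], f: Callable[[float], float]) -> float:
    """P_n(f) = (1/n) * sum_i f(X_i)."""
    return sum(f(x) for x in sample) / len(sample)


def edf(sample: Sequence[float], x: float) -> float:
    """F_n(x) = P_n(1_{(-inf, x]}): the classical empirical distribution function."""
    return empirical_measure(sample, lambda y: 1.0 if y <= x else 0.0)


def sup_distance_uniform(sample: Sequence[float]) -> float:
    """sup_x |F_n(x) - F(x)| when F(x) = x on [0, 1] (Uniform(0,1), an assumption).

    F_n is a step function, so the supremum is reached at an order statistic:
    compare X_(i) with F_n just after the jump (i/n) and just before it ((i-1)/n).
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_distance_uniform(sample))
```

In line with Theorem 1.1, the printed distance should be smaller for the larger sample; Theorem 1.2 suggests that it shrinks at roughly the rate $n^{-1/2}$.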
Two central questions for the general theory are:

(i) For what classes of sets $\mathcal{C}$ or functions $\mathcal{F}$ does a natural generalization of the Glivenko-Cantelli Theorem 1.1 hold?

(ii) For what classes of sets $\mathcal{C}$ or functions $\mathcal{F}$ does a natural generalization of the Donsker Theorem 1.2 hold?

If $\mathcal{F}$ is a class of functions for which
\[
\|\mathbb{P}_n - P\|^*_{\mathcal{F}} = \Big( \sup_{f \in \mathcal{F}} |\mathbb{P}_n(f) - P(f)| \Big)^* \to_{a.s.} 0,
\]
then we say that $\mathcal{F}$ is a $P$-Glivenko-Cantelli class of functions. If $\mathcal{F}$ is a class of functions for which
\[
\mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P) \Rightarrow \mathbb{G} \quad \text{in } \ell^\infty(\mathcal{F}),
\]
where $\mathbb{G}$ is a mean-zero $P$-Brownian bridge process with (uniformly) continuous sample paths with respect to the semi-metric $\rho_P(f, g)$ defined by
\[
\rho_P^2(f, g) = \mathrm{Var}_P(f(X) - g(X)),
\]
then we say that $\mathcal{F}$ is a $P$-Donsker class of functions. Here
\[
\ell^\infty(\mathcal{F}) = \Big\{ x : \mathcal{F} \to \mathbb{R} \ \text{such that} \ \|x\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |x(f)| < \infty \Big\},
\]
and $\mathbb{G}$ is a $P$-Brownian bridge process on $\mathcal{F}$ if it is a mean-zero Gaussian process with covariance function
\[
E\{\mathbb{G}(f)\mathbb{G}(g)\} = P(fg) - P(f)P(g).
\]

Answers to these questions began to emerge during the 1970's, especially in the work of Vapnik and Chervonenkis (1971) and Dudley (1978), with notable contributions in the 1970's and 1980's by David Pollard, Evarist Giné, Joel Zinn, Michel Talagrand, Peter Gaenssler, and many others. We will state some of the generalizations of Theorems 1.1 and 1.2 in later chapters. As will become apparent, however, the methods developed apply beyond the specific context of empirical processes of i.i.d. random variables. Many of the maximal inequalities and inequalities for processes apply much more generally. The tools developed will apply to maxima and suprema of large families of random variables in considerable generality.

The main focus in the second half of these notes will be on applications of these results to problems in statistics. Thus we briefly consider several examples in which the usefulness of the general theory becomes apparent.
The third part is instead dedicated to an overview of some recent research topics that involve the theory of empirical processes.

1.2 Examples

A commonly recurring theme in statistics is that we want to prove consistency or asymptotic normality of some statistic which is not a sum of independent random variables, but can be related to some natural sum of random functions indexed by a parameter in a suitable (metric) space. The following examples illustrate the basic idea.

Example 1.1 Suppose that $X, X_1, \ldots, X_n, \ldots$ are i.i.d. random variables with $E|X_1| < \infty$, and let $\mu = E(X)$. Consider the absolute deviations about the sample mean
\[
D_n = \mathbb{P}_n |X - \bar{X}_n| = \frac{1}{n} \sum_{i=1}^n |X_i - \bar{X}_n|,
\]
as an estimator of scale. This is an average of the dependent random variables $|X_i - \bar{X}_n|$. There are several routes available for showing that
\[
D_n \to_{a.s.} d \equiv E|X - E(X)|, \tag{1.1}
\]
but the methods we will develop in these notes lead to the study of the random functions
\[
D_n(t) = \mathbb{P}_n |X - t| \quad \text{for } |t - \mu| \le \delta,
\]
for $\delta > 0$. Note that this is just the empirical measure indexed by the collection of functions
\[
\mathcal{F} = \{ x \mapsto |x - t| : |t - \mu| \le \delta \},
\]
and $D_n(\bar{X}_n) = D_n$. As we will see, this collection of functions is a VC-subgraph class of functions with an integrable envelope function $F$, and hence empirical process theory can be used to establish the desired convergence. We might try showing (1.1) directly, but the corresponding central limit theorem is trickier. (This example was one of the illustrative examples considered by Pollard (1989).)

Example 1.2 Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n), \ldots$ are i.i.d. $F_0$ on $\mathbb{R}^2$, and let $F_n$ denote their (classical!) empirical distribution function
\[
F_n(x, y) = \frac{1}{n} \sum_{i=1}^n 1_{(-\infty, x] \times (-\infty, y]}(X_i, Y_i).
\]
Consider the empirical distribution function of the random variables $F_n(X_i, Y_i)$, $i = 1, \ldots, n$,
\[
K_n(t) = \frac{1}{n} \sum_{i=1}^n 1_{[F_n(X_i, Y_i) \le t]}, \qquad t \in [0, 1].
\]
Once again the random variables $\{F_n(X_i, Y_i)\}_{i=1}^n$ are dependent.
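The quantities $F_n$ and $K_n$ just defined can be computed directly. The sketch below does so in plain Python for a simulated sample; the independent uniform coordinates, the seed, and the function names are assumptions made only for illustration and do not come from the notes.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bivariate_edf(sample: Sequence[Point], x: float, y: float) -> float:
    """F_n(x, y) = (1/n) * #{i : X_i <= x and Y_i <= y}."""
    return sum(1 for (xi, yi) in sample if xi <= x and yi <= y) / len(sample)


def pseudo_observations(sample: Sequence[Point]) -> List[float]:
    """The variables F_n(X_i, Y_i), i = 1, ..., n (each uses the whole sample)."""
    return [bivariate_edf(sample, xi, yi) for (xi, yi) in sample]


def k_n(sample: Sequence[Point], t: float) -> float:
    """K_n(t) = (1/n) * #{i : F_n(X_i, Y_i) <= t}."""
    values = pseudo_observations(sample)
    return sum(1 for v in values if v <= t) / len(values)


rng = random.Random(1)  # arbitrary seed for reproducibility
sample = [(rng.random(), rng.random()) for _ in range(200)]
for t in (0.1, 0.5, 0.9):
    print(t, k_n(sample, t))
```

Each value $F_n(X_i, Y_i)$ is computed from all $n$ observations, which is the source of the dependence noted above.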
In this case we are already studying a stochastic process indexed by $t \in [0, 1]$. The empirical process method leads to the study of the process $K_n$ indexed by $t \in [0, 1]$ and $F \in \mathcal{F}_2$, the class of all distribution functions on $\mathbb{R}^2$:
\[
K_n(t, F) = \frac{1}{n} \sum_{i=1}^n 1_{[F(X_i, Y_i) \le t]} = \mathbb{P}_n 1_{[F(X, Y) \le t]}, \qquad t \in [0, 1], \ F \in \mathcal{F}_2,
\]
or perhaps with $\mathcal{F}_2$ replaced by the smaller class of functions $\mathcal{F}_{2,\delta} = \{F \in \mathcal{F}_2 : \|F - F_0\|_\infty \le \delta\}$. Note that this is the empirical measure indexed by the collection of functions
\[
\mathcal{F} = \{ (x, y) \mapsto 1_{[F(x, y) \le t]} : t \in [0, 1], \ F \in \mathcal{F}_2 \},
\]
or the subset thereof obtained by replacing $\mathcal{F}_2$ by $\mathcal{F}_{2,\delta}$, and $K_n(t, F_n) = K_n(t)$. Can we prove that $K_n(t) \to_{a.s.} K(t) = P(F_0(X, Y) \le t)$ uniformly in $t \in [0, 1]$?

Chapter 2
Weak convergence: the fundamental theorems

In this chapter we give a characterization of convergence in law of sample bounded processes. Let $T$ be a set and let $\{X_n(t), t \in T\}_{n \in \mathbb{N}}$ be a sequence of stochastic processes indexed by the set $T$, with $X_n$ defined on the probability space $(\Omega, \mathcal{A}, P)$. Assume that the processes have versions with almost all their trajectories bounded, and let us continue denoting by $X_n$ their sample bounded versions. Then $X_n(\cdot) \in \ell^\infty(T)$ almost surely, where $\ell^\infty(T)$ is the space of all bounded functions on $T$. The space $\ell^\infty(T)$, equipped with the sup norm $\|\cdot\|_T$, is a Banach space that is separable only if $T$ is finite. We do not assume that the finite-dimensional laws of the processes $X_n$ correspond to the finite-dimensional laws of (individually) tight Borel measures on $\ell^\infty(T)$. (Recall that a Borel probability measure $\mu$ is called tight if for every $\epsilon > 0$ there exists a compact set $K$ with $\mu(K) \ge 1 - \epsilon$; a random variable $X$ is called tight if its law $P \circ X^{-1}$ is tight.) Now let $X(t)$, $t \in T$, with $X$ defined on the probability space $(\Omega, \mathcal{A}, P)$, be a sample bounded process whose finite-dimensional laws do correspond to the finite-dimensional laws of a tight Borel probability measure on $\ell^\infty(T)$.
Then, we say that $X_n$ converges in law (or, weakly) to $X$ uniformly in $t \in T$, and write
\[
X_n \Rightarrow X \quad \text{in } \ell^\infty(T), \tag{2.1}
\]
if $E^* H(X_n) \to E H(X)$ for all bounded continuous functions $H : \ell^\infty(T) \to \mathbb{R}$. As with usual convergence in law, if $F$ is a continuous function on $\ell^\infty(T)$ with values in another metric space and $F(X_n)$ is measurable, then (2.1) implies $F(X_n) \Rightarrow F(X)$ in the usual way.

The following theorem, known as the Portmanteau theorem, gives equivalent ways of describing weak convergence in a metric space $(D, d)$.

Theorem 2.1 (Portmanteau theorem) Let $X_n$, $n \in \mathbb{N}$, and $X$ be random variables that take values in a metric space $(D, d)$. Then the following are equivalent:

(i) $X_n \Rightarrow X$ in $D$;

(ii) $E^* f(X_n) \to E f(X)$ for all real bounded, uniformly continuous $f$ on $D$;

(iii) $\limsup_n P^*(X_n \in F) \le P(X \in F)$ for all closed sets $F \subset D$;

(iv) $\liminf_n P^*(X_n \in G) \ge P(X \in G)$ for all open sets $G \subset D$;

(v) $\lim_n P^*(X_n \in A) = P(X \in A)$ for all Borel sets $A$ with $P(X \in \partial A) = 0$.

Proof. The implication (i) $\Rightarrow$ (ii) is trivial.

(ii) $\Rightarrow$ (iii). Consider $F$ closed and let $f(x) := \big(1 - d(x, F)/\epsilon\big)^+$, where $d(x, F) := \inf_{y \in F} d(x, y)$ and $\epsilon > 0$. The function $f$ so defined is bounded and continuous, even uniformly continuous, because $|f(x) - f(y)| \le d(x, y)/\epsilon$. Moreover $x \in F$ implies $f(x) = 1$, while $x \notin F^\epsilon := \{z : d(z, F) < \epsilon\}$ implies $d(x, F) \ge \epsilon$, and hence $f(x) = 0$. Therefore we have $1_F(x) \le f(x) \le 1_{F^\epsilon}(x)$ and hence
\[
\limsup_n P^*(X_n \in F) \le \limsup_n E^* f(X_n) = E f(X) \le P(X \in F^\epsilon).
\]
Since $F$ is closed, $F^\epsilon \downarrow F$ as $\epsilon \downarrow 0$, and letting $\epsilon \downarrow 0$ we get (iii).

(iii) $\Leftrightarrow$ (iv). This equivalence follows easily by complementation.

(iii) & (iv) $\Rightarrow$ (v). Let $A$ be a Borel set with $P(X \in \partial A) = 0$, and denote by $A^\circ$ its interior and by $\bar{A}$ its closure; then conditions (iii) and (iv) together imply
\[
\begin{aligned}
P(X \in \bar{A}) &\ge \limsup_n P^*(X_n \in \bar{A}) \ge \limsup_n P^*(X_n \in A) \\
&\ge \liminf_n P^*(X_n \in A) \ge \liminf_n P^*(X_n \in A^\circ) \ge P(X \in A^\circ).
\end{aligned}
\]
Since $P(X \in \partial A) = 0$, the extreme terms here coincide with $P(X \in A)$ and (v) follows.

(v) $\Rightarrow$ (i).
Without loss of generality we may assume that the bounded continuous function $f$ satisfies $0 < f < 1$. Then
\[
E f(X) = \int_0^\infty P\{f(X) > t\}\, dt = \int_0^1 P\{f(X) > t\}\, dt,
\]
and similarly for $E^* f(X_n)$. If $f$ is continuous, then $\partial\{f > t\} \subset \{f = t\}$, and hence $P(X \in \partial\{f > t\}) = 0$ except for countably many $t$. By condition (v) and the bounded convergence theorem, we get
\[
E^* f(X_n) = \int_0^1 P^*\{f(X_n) > t\}\, dt \to \int_0^1 P\{f(X) > t\}\, dt = E f(X). \qquad \Box
\]

The following proposition gives a description of the sample bounded processes $X$ that do induce a tight Borel measure on $\ell^\infty(T)$.

Proposition 2.1 (de la Peña and Giné (1999), van der Vaart and Wellner (1996)) Let $X(t)$, $t \in T$, be a sample bounded stochastic process. Then the finite-dimensional distributions of $X$ are those of a tight Borel probability measure on $\ell^\infty(T)$ if and only if there exists a pseudometric $\rho$ on $T$ for which $(T, \rho)$ is totally bounded and such that $X$ has a version with almost all its sample paths uniformly continuous for $\rho$.

Proof. Let us assume the probability law of $X$ is a tight Borel measure on $\ell^\infty(T)$. Then there exists a sequence $K_m$, $m \in \mathbb{N}$, of compact sets in $\ell^\infty(T)$ such that
\[
P\Big( X \in \bigcup_{m=1}^\infty K_m \Big) = 1,
\]
and we let $K = \bigcup_{m=1}^\infty K_m$. We will show that the pseudometric $\rho$ defined on $T$ by
\[
\rho(s, t) = \sum_{m=1}^\infty 2^{-m} \big(1 \wedge \rho_m(s, t)\big), \qquad \text{with } \rho_m(s, t) = \sup\{ |x(s) - x(t)| : x \in K_m \},
\]
makes $(T, \rho)$ totally bounded. To show this, given $\epsilon > 0$, let $k$ be such that
\[
\sum_{m=k+1}^\infty 2^{-m} < \frac{\epsilon}{4}
\]
and let $\{x_1, \ldots, x_r\}$ be a finite subset of $\bigcup_{m=1}^k K_m$ that is $\epsilon/4$-dense in $\bigcup_{m=1}^k K_m$ for the sup norm, that is, for each $x \in \bigcup_{m=1}^k K_m$ there is an integer $i \le r$ such that $\|x - x_i\|_T \le \epsilon/4$. Such a finite set of functions exists by the compactness of $\bigcup_{m=1}^k K_m$. The subset $A$ of $\mathbb{R}^r$ defined by $\{(x_1(t), \ldots, x_r(t)) : t \in T\}$ is bounded since $x_1, \ldots, x_r$ are bounded functions. Therefore $A$ is totally bounded (in $\mathbb{R}^r$ bounded is the same as totally bounded). Hence there exists a finite set $T_\epsilon = \{t_j : 1 \le j \le N\} \subset T$ such that, for each $t \in T$, there is a $j \le N$ for which $\max_{1 \le i \le r} |x_i(t) - x_i(t_j)| \le \epsilon/4$.
Then, $T_\epsilon$ is $\epsilon$-dense in $T$ for the pseudo-metric $\rho$: if $t$ and $t_j$ are as above, then, for $x \in K_m$, $m \le k$, it follows that
\[
\begin{aligned}
|x(t) - x(t_j)| &\le |x(t) - x_i(t)| + |x_i(t) - x_i(t_j)| + |x_i(t_j) - x(t_j)| \\
&\le 2\|x - x_i\|_T + |x_i(t) - x_i(t_j)|,
\end{aligned}
\]
and choosing $i$ such that $\|x - x_i\|_T \le \epsilon/4$ we get
\[
|x(t) - x(t_j)| \le \frac{3\epsilon}{4},
\]
thus
\[
\rho_m(t, t_j) = \sup_{x \in K_m} |x(t) - x(t_j)| \le \frac{3\epsilon}{4}
\]
and hence
\[
\rho(t, t_j) \le \sum_{m=1}^k 2^{-m} \rho_m(t, t_j) + \sum_{m=k+1}^\infty 2^{-m} \le \frac{3\epsilon}{4} \sum_{m=1}^k 2^{-m} + \frac{\epsilon}{4} < \epsilon.
\]
This proves that $(T, \rho)$ is totally bounded. Moreover, the functions $x \in K_m$ are uniformly $\rho$-continuous, since, if $x \in K_m$, then $|x(s) - x(t)| \le \rho_m(s, t) \le 2^m \rho(s, t)$ for all $s, t \in T$ with $\rho(s, t) < 2^{-m}$. Since $P(X \in K) = 1$, the identity map of $(\ell^\infty(T), \mathcal{B}, P \circ X^{-1})$ (where $\mathcal{B}$ denotes the Borel $\sigma$-algebra on $\ell^\infty(T)$) yields a version of $X$ with almost all its sample paths in $K$, hence in $C_u(T, \rho)$, the space of bounded uniformly $\rho$-continuous functions on $T$. This proves the direct half of the proposition.

Conversely, let $X(t)$, $t \in T$, be a process with a version whose sample paths are almost all in $C_u(T, \rho)$ for a metric or pseudo-metric $\rho$ on $T$ for which $(T, \rho)$ is totally bounded, and let us continue denoting such a version by $X$. We can assume that all the trajectories of $X$ are uniformly continuous. The map $X : \Omega \to C_u(T, \rho)$ is Borel measurable because the random vectors $(X(t_1), \ldots, X(t_k))$, $t_i \in T$, $k \in \mathbb{N}$, are Borel measurable and the Borel $\sigma$-algebra of $C_u(T, \rho)$ is generated by the "finite-dimensional sets" $\{x \in C_u(T, \rho) : (x(t_1), \ldots, x(t_k)) \in A\}$ for all Borel sets $A$ of $\mathbb{R}^k$, $t_i \in T$, $k \in \mathbb{N}$. Hence, the probability law of $X$ is a Borel measure on $C_u(T, \rho)$. This space is complete, since $\ell^\infty(T)$ is complete and $C_u(T, \rho)$ is closed in $\ell^\infty(T)$ (the uniform limit of uniformly continuous functions is still uniformly continuous), and it is separable, since $(T, \rho)$ is totally bounded and thus separable.
Ulam's theorem says that if $S$ is complete and separable, then each probability measure on $(S, \mathcal{S})$ is tight (see for example Billingsley (1968), Theorem 1.4, page 10). Thus the induced probability law of $X$ is tight on $C_u(T, \rho)$. But a tight Borel measure on $C_u(T, \rho)$ is also tight on $\ell^\infty(T)$, since a compact set in $C_u(T, \rho)$ is also compact in $\ell^\infty(T)$. $\Box$

The following theorem characterizes weak convergence in $\ell^\infty(T)$ in terms of asymptotic equicontinuity and convergence of finite-dimensional distributions.

Definition 2.1 A sequence $\{X_n\}$ in $\ell^\infty(T)$ is said to be asymptotically uniformly equicontinuous in probability with respect to the pseudometric $\rho$ if for every $\epsilon, \eta > 0$ there exists a $\delta > 0$ such that
\[
\limsup_n P^*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta. \tag{2.2}
\]

Theorem 2.2 The following are equivalent:

(i) All the finite-dimensional distributions of the sample bounded processes $X_n$ converge in law, and there exists a pseudometric $\rho$ on $T$ such that both $(T, \rho)$ is totally bounded and the processes $X_n$ are asymptotically uniformly equicontinuous in probability with respect to $\rho$;

(ii) There exists a process $X$ whose law is a tight Borel probability measure on $\ell^\infty(T)$ and such that
\[
X_n \Rightarrow X \quad \text{in } \ell^\infty(T).
\]

If (i) holds, then the process $X$ in (ii), which is completely determined by the limiting finite-dimensional distributions of $\{X_n\}$, has a version with sample paths in $C_u(T, \rho)$. If $X$ in (ii) has a version with almost all its trajectories in $C_u(T, \gamma)$ for some pseudometric $\gamma$ for which $(T, \gamma)$ is totally bounded, then (i) holds with the pseudometric $\rho$ taken to be $\gamma$.

Proof. Suppose (i) holds. Let $T_\infty$ be a countable $\rho$-dense subset of $T$, and let $T_k$, $k \in \mathbb{N}$, be finite subsets of $T$ satisfying $T_k \uparrow T_\infty$. Such sets exist since any totally bounded set is separable. The limit laws of the finite-dimensional distributions of the processes $X_n$ are consistent and thus define a stochastic process $X$ on $T$.
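Moreover, by the Portmanteau Theorem for finite-dimensional convergence in law, for every $\epsilon > 0$,
\[
\begin{aligned}
P\Big\{ \sup_{s,t \in T_k:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big\}
&\le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_k:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \epsilon \Big\} \\
&\le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \epsilon \Big\}.
\end{aligned}
\]
Taking the limit as $k \to \infty$ on the left-hand side of this inequality we have
\[
\begin{aligned}
\lim_{k \to \infty} P\Big\{ \sup_{s,t \in T_k:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big\}
&= P\Big\{ \bigcup_k \Big( \sup_{s,t \in T_k:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big) \Big\} \\
&= P\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big\}.
\end{aligned}
\]
Thus we get
\[
P\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big\}
\le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \epsilon \Big\}.
\]
Taking the limit as $\delta \to 0$ and using the asymptotic equicontinuity condition we have that
\[
\begin{aligned}
\lim_{\delta \to 0} P\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X(s) - X(t)| > \epsilon \Big\}
&\le \lim_{\delta \to 0} \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \epsilon \Big\} \\
&\le \lim_{\delta \to 0} \limsup_{n \to \infty} P^*\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \epsilon \Big\} = 0.
\end{aligned}
\]
Thus we can find a sequence $\delta_m \downarrow 0$ such that
\[
P\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta_m} |X(s) - X(t)| > 2^{-m} \Big\} \le 2^{-m}.
\]
Hence it follows by Borel-Cantelli that
\[
P\Big\{ \sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta_m} |X(s) - X(t)| > 2^{-m} \ \text{infinitely often} \Big\} = 0,
\]
that is, there exists $m(\omega) < \infty$ almost surely such that
\[
\sup_{s,t \in T_\infty:\, \rho(s,t) \le \delta_m} |X(s, \omega) - X(t, \omega)| \le 2^{-m} \qquad \forall\, m > m(\omega).
\]
Therefore $X(t, \omega)$ is a $\rho$-uniformly continuous function of $t \in T_\infty$ for almost every $\omega$; since $(T, \rho)$ is totally bounded, the restriction to $T_\infty$ of $X(t, \omega)$ is also bounded. The extension to $T$ by uniform continuity of the restriction of $X$ to $T_\infty$ (only the set of $\omega$ where $X$ is uniformly continuous need be considered) yields a version of $X$ with sample paths all in $C_u(T, \rho)$. Then, it follows from Proposition 2.1 that the law of $X$ exists as a tight Borel measure on $\ell^\infty(T)$.

To prove weak convergence note that, since $(T, \rho)$ is totally bounded, for every $\delta > 0$ there exists a finite set of points $\{t_1, \ldots, t_{N(\delta)}\}$ that is $\delta$-dense in $(T, \rho)$, i.e. $T \subset \bigcup_{i=1}^{N(\delta)} B(t_i, \delta)$, where $B(t_i, \delta)$ is the open ball with center $t_i$ and radius $\delta$.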
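Thus, for each $t \in T$ we can choose $\pi_\delta(t) \in \{t_1, \ldots, t_{N(\delta)}\}$ so that $\rho(\pi_\delta(t), t) < \delta$. Then we define processes $X_{n,\delta}$, $n \in \mathbb{N}$, and $X_\delta$ by
\[
X_{n,\delta}(t) = X_n(\pi_\delta(t)), \qquad X_\delta(t) = X(\pi_\delta(t)), \qquad t \in T.
\]
These are approximations of $X_n$ and $X$ that take at most a finite number $N(\delta)$ of values. Convergence of the finite-dimensional distributions of $X_n$ to those of $X$ implies that
\[
X_{n,\delta} \Rightarrow X_\delta \quad \text{in } \ell^\infty(T). \tag{2.3}
\]
Furthermore, uniform continuity of the sample paths of $X$ yields
\[
\lim_{\delta \to 0} \|X - X_\delta\|_T = 0 \quad \text{a.s.} \tag{2.4}
\]
Indeed, since $\rho(t, \pi_\delta(t)) < \delta$ for every $t$, we get
\[
\|X - X_\delta\|_T = \sup_{t \in T} |X(t) - X(\pi_\delta(t))| \le \sup_{\rho(s,t) < \delta} |X(s) - X(t)| \quad \text{a.s.},
\]
and the right-hand side tends to $0$ as $\delta \to 0$ because the sample paths of $X$ are uniformly $\rho$-continuous.

Now let $H : \ell^\infty(T) \to \mathbb{R}$ be bounded and continuous. Then, using the triangle inequality we have that
\[
\begin{aligned}
|E^* H(X_n) - E H(X)| &\le |E^* H(X_n) - E H(X_{n,\delta})| + |E H(X_{n,\delta}) - E H(X_\delta)| + |E H(X_\delta) - E H(X)| \\
&\equiv I_{n,\delta} + II_{n,\delta} + III_\delta.
\end{aligned}
\]
To prove the convergence part of (ii) it suffices to show that $\lim_{\delta \to 0} \limsup_{n \to \infty}$ of each of these quantities is zero. This is true for $II_{n,\delta}$ by (2.3). Next we show it for $III_\delta$. Given $\epsilon > 0$, let $K \subset \ell^\infty(T)$ be a compact set such that $P(X \in K^c) < \epsilon/(6\|H\|_\infty)$. By Exercise 2.1, there exists a $\tau > 0$ such that, if $x \in K$ and $y \in \ell^\infty(T)$ with $\|x - y\|_T < \tau$, then $|H(x) - H(y)| < \epsilon/6$. Let $\delta_1 > 0$ be such that $P(\|X_\delta - X\|_T \ge \tau) < \epsilon/(6\|H\|_\infty)$ for all $\delta < \delta_1$; this can be done by virtue of (2.4). Then it follows that
\[
\begin{aligned}
|E H(X_\delta) - E H(X)| &\le 2\|H\|_\infty\, P\big( [X \in K^c] \cup [\|X_\delta - X\|_T \ge \tau] \big) + \sup\{ |H(x) - H(y)| : x \in K, \ \|x - y\|_T < \tau \} \\
&\le 2\|H\|_\infty \Big( \frac{\epsilon}{6\|H\|_\infty} + \frac{\epsilon}{6\|H\|_\infty} \Big) + \frac{\epsilon}{6} < \epsilon,
\end{aligned}
\]
so that $\lim_{\delta \to 0} III_\delta = 0$ holds.

To show that $\lim_{\delta \to 0} \limsup_{n \to \infty} I_{n,\delta} = 0$, choose $\epsilon$, $\tau$, and $K$ as above. Then we have
\[
\begin{aligned}
|E^* H(X_n) - E H(X_{n,\delta})| \le{} & 2\|H\|_\infty \big\{ P^*(\|X_n - X_{n,\delta}\|_T \ge \tau/2) + P(X_{n,\delta} \in (K_{\tau/2})^c) \big\} \\
& + \sup\{ |H(x) - H(y)| : x \in K, \ \|x - y\|_T < \tau \},
\end{aligned} \tag{2.5}
\]
where $K_{\tau/2}$ is the open $\tau/2$-neighborhood of the set $K$ for the sup norm.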
The inequality in the previous display can be checked as follows: if $X_{n,\delta} \in K_{\tau/2}$ and $\|X_n - X_{n,\delta}\|_T < \tau/2$, then there exists $x \in K$ such that $\|x - X_{n,\delta}\|_T < \tau/2$ and $\|x - X_n\|_T < \tau$. Now, the asymptotic equicontinuity hypothesis implies that there is a $\delta_2$ such that
\[
\limsup_{n \to \infty} P^*\{ \|X_n - X_{n,\delta}\|_T \ge \tau/2 \} < \frac{\epsilon}{6\|H\|_\infty} \qquad \forall\, \delta < \delta_2,
\]
and finite-dimensional convergence yields
\[
\limsup_{n \to \infty} P^*\{ X_{n,\delta} \in (K_{\tau/2})^c \} \le P\{ X_\delta \in (K_{\tau/2})^c \} \le \frac{\epsilon}{6\|H\|_\infty}.
\]
Hence we conclude from (2.5) that, for $\delta < \delta_1 \wedge \delta_2$,
\[
\limsup_{n \to \infty} |E^* H(X_n) - E H(X_{n,\delta})| < \epsilon,
\]
and this completes the proof that (i) implies (ii).

Let us now prove the converse implication. If (ii) holds, then by Proposition 2.1 there is a pseudometric $\rho$ on $T$ which makes $(T, \rho)$ totally bounded and such that $X$ has a version (that we still denote by $X$) with all its sample paths in $C_u(T, \rho)$. Now consider the closed set $F_{\delta,\epsilon}$ defined by
\[
F_{\delta,\epsilon} = \Big\{ x \in \ell^\infty(T) : \sup_{s,t \in T:\, \rho(s,t) \le \delta} |x(s) - x(t)| \ge \epsilon \Big\}.
\]
Applying the Portmanteau theorem we have that
\[
\limsup_{n \to \infty} P^*\Big\{ \sup_{s,t \in T:\, \rho(s,t) \le \delta} |X_n(s) - X_n(t)| \ge \epsilon \Big\}
\le P\Big\{ \sup_{s,t \in T:\, \rho(s,t) \le \delta} |X(s) - X(t)| \ge \epsilon \Big\}.
\]
Taking limits as $\delta \to 0$ yields asymptotic equicontinuity in view of the $\rho$-uniform continuity of the sample paths of $X$. Thus (ii) implies (i). $\Box$

The following is an obvious corollary of Theorem 2.2 for the empirical process $\mathbb{G}_n$ indexed by a class of measurable real-valued functions $\mathcal{F}$ on the probability space $(\mathcal{X}, \mathcal{A}, P)$, with the pseudo-metric $\rho_P$ defined by
\[
\rho_P^2(f, g) = \mathrm{Var}_P(f(X) - g(X)) = P(f - g)^2 - [P(f - g)]^2.
\]

Corollary 2.1 Let $\mathcal{F}$ be a class of measurable functions on $(\mathcal{X}, \mathcal{A})$. Then the following are equivalent:

(i) $(\mathcal{F}, \rho_P)$ is totally bounded and $\mathbb{G}_n$ is asymptotically uniformly equicontinuous in probability with respect to $\rho_P$;

(ii) $\mathcal{F}$ is $P$-Donsker, i.e. $\mathbb{G}_n \Rightarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}$ is a mean-zero $P$-Brownian bridge with uniformly continuous sample paths with respect to $\rho_P$.

Proof. (i) $\Rightarrow$ (ii).
From Theorem 2.2, all we need to show is that the finite-dimensional distributions of $\mathbb{G}_n$ converge to those of $\mathbb{G}$ (recall that $\mathbb{G}$ is a mean-zero Gaussian process with covariance function $E\{\mathbb{G}(f)\mathbb{G}(g)\} = P(fg) - P(f)P(g)$). But this follows directly from the Multivariate Central Limit Theorem:
\[
\begin{pmatrix} \mathbb{G}_n f_1 \\ \mathbb{G}_n f_2 \\ \vdots \\ \mathbb{G}_n f_k \end{pmatrix}
= \sqrt{n}
\begin{bmatrix}
(1/n) \sum_{i=1}^n f_1(X_i) - P(f_1) \\
(1/n) \sum_{i=1}^n f_2(X_i) - P(f_2) \\
\vdots \\
(1/n) \sum_{i=1}^n f_k(X_i) - P(f_k)
\end{bmatrix}
\Rightarrow N[0, C],
\]
where $C = [c_{i,j}]_{i=1,\ldots,k;\, j=1,\ldots,k}$ with $c_{i,j} = \mathrm{Cov}(f_i(X_1), f_j(X_1))$; but
\[
\mathrm{Cov}(f_i(X_1), f_j(X_1)) = E[f_i(X_1) f_j(X_1)] - E[f_i(X_1)]E[f_j(X_1)] = P(f_i f_j) - P(f_i)P(f_j).
\]

(ii) $\Rightarrow$ (i). From Theorem 2.2, all we need to show is that $(\mathcal{F}, \rho_P)$ is totally bounded. Since $\mathbb{G}$ is tight, from Proposition 2.1 it follows that there exists a pseudometric $\rho$ on $\mathcal{F}$ for which $(\mathcal{F}, \rho)$ is totally bounded and such that $\mathbb{G}$ has a version with almost all its sample paths uniformly continuous for $\rho$: for every pair $f, g \in \mathcal{F}$,
\[
|\mathbb{G}(f) - \mathbb{G}(g)| \le \alpha\, \rho(f, g) \quad \text{a.s.}
\]
for some $\alpha > 0$. Thus $|\mathbb{G}(f) - \mathbb{G}(g)|^2 \le (\alpha\, \rho(f, g))^2$ a.s. and
\[
\big( E|\mathbb{G}(f) - \mathbb{G}(g)|^2 \big)^{1/2} \le \alpha\, \rho(f, g). \tag{2.6}
\]
But
\[
\begin{aligned}
E|\mathbb{G}(f) - \mathbb{G}(g)|^2 &= \mathrm{Var}(\mathbb{G}(f)) + \mathrm{Var}(\mathbb{G}(g)) - 2\, \mathrm{Cov}(\mathbb{G}(f), \mathbb{G}(g)) \\
&= \mathrm{Var}(f(X_1)) + \mathrm{Var}(g(X_1)) - 2\, \mathrm{Cov}(f(X_1), g(X_1)) \\
&= \mathrm{Var}(f(X_1) - g(X_1)) = \rho_P^2(f, g).
\end{aligned}
\]
Hence, equation (2.6) implies that if $f \in B_\rho(f_i, \epsilon)$ then $f \in B_{\rho_P}(f_i, \alpha\epsilon)$, where $B_\rho(f_i, \epsilon)$ and $B_{\rho_P}(f_i, \epsilon)$ denote the open balls of center $f_i$ and radius $\epsilon$ in $(\mathcal{F}, \rho)$ and $(\mathcal{F}, \rho_P)$ respectively. This shows that $(\mathcal{F}, \rho_P)$ is also totally bounded. $\Box$

We close this chapter by defining asymptotic tightness and showing two characterizations of this property.

Definition 2.2 A sequence $\{X_n\}$ in $\ell^\infty(T)$ is said to be asymptotically tight if for every $\epsilon > 0$ there exists a compact set $K \subset \ell^\infty(T)$ such that
\[
\liminf_{n \to \infty} P_*(X_n \in K^\delta) \ge 1 - \epsilon, \qquad \forall\, \delta > 0.
\]
Here $K^\delta = \{y \in \ell^\infty(T) : d(y, K) < \delta\}$ is the "$\delta$-enlargement" of $K$.

Asymptotic tightness can be given a more concrete form, either through finite approximation or by connecting tightness to (asymptotic, uniform, equi-) continuity of the sample paths. The idea of finite approximation is that for any $\epsilon > 0$ the index set $T$ can be partitioned into finitely many subsets $T_i$ such that (asymptotically) the variation of the sample paths $t \mapsto X_n(t)$ is less than $\epsilon$ on every one of the sets $T_i$. More precisely, it is assumed that for every $\epsilon, \eta > 0$, there exists a partition $T = \bigcup_{i=1}^k T_i$ such that
\[
\limsup_n P^*\Big( \sup_{1 \le i \le k} \sup_{s,t \in T_i} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta. \tag{2.7}
\]
Under this condition the asymptotic behaviour of the process can be described within error margin $\epsilon, \eta$ by the behaviour of the marginal $(X_n(t_1), \ldots, X_n(t_k))$ for arbitrary fixed points $t_i \in T_i$. If the process can thus be reduced to a finite set of coordinates for any $\epsilon, \eta > 0$ and the sequences of marginal distributions are tight, then the sequence $X_n$ is asymptotically tight. We are now ready to give the two characterizations.

Theorem 2.3 The following are equivalent:

(i) The sequence $\{X_n\}$ is asymptotically tight;

(ii) $\{X_n(t)\}$ is asymptotically tight in $\mathbb{R}$ for every $t$ in $T$ and, for all $\epsilon, \eta > 0$, there exists a finite partition $T = \bigcup_{i=1}^k T_i$ such that (2.7) holds;

(iii) $\{X_n(t)\}$ is asymptotically tight in $\mathbb{R}$ for every $t$ in $T$ and there exists a pseudometric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $\{X_n\}$ is asymptotically uniformly $\rho$-equicontinuous in probability.

Proof. (ii) $\Rightarrow$ (i). For any partition $T = \bigcup_{i=1}^k T_i$ and points $t_i \in T_i$, the norm $\|X_n\|_T$ is bounded by $\sup_i \sup_{t \in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)|$. Indeed,
\[
\begin{aligned}
\|X_n\|_T = \sup_{t \in T} |X_n(t)| &= \sup_i \sup_{t \in T_i} |X_n(t)| \\
&\le \sup_i \Big\{ \sup_{t \in T_i} |X_n(t) - X_n(t_i)| + |X_n(t_i)| \Big\} \\
&\le \sup_i \sup_{t \in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)|.
\end{aligned} \tag{2.8}
\]
Let us choose a partition such that (2.7) holds.
Note that $\sup_i |X_n(t_i)|$ is asymptotically tight, being the maximum of finitely many asymptotically tight sequences of real variables; that is, for every $\xi > 0$ there exists a constant $M$ such that
\[
\liminf_{n\to\infty} P_*\Big(\sup_i |X_n(t_i)| \le M + \delta\Big) \ge 1 - \xi \qquad \forall \delta > 0. \tag{2.9}
\]
From (2.8) we have
\[
\begin{aligned}
&\liminf_{n\to\infty} P_*\big(\|X_n\|_T \le (M + \epsilon) + \delta\big) \\
&\quad\ge \liminf_{n\to\infty} P_*\Big(\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)| \le M + \epsilon + \delta\Big) \\
&\quad\ge \liminf_{n\to\infty} P_*\Big(\Big\{\sup_i |X_n(t_i)| \le M + \delta\Big\} \cap \Big\{\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| \le \epsilon\Big\}\Big) \\
&\quad= 1 - \limsup_{n\to\infty} P^*\Big(\Big\{\sup_i |X_n(t_i)| > M + \delta\Big\} \cup \Big\{\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \epsilon\Big\}\Big) \\
&\quad\ge 1 - \limsup_{n\to\infty} \Big\{ P^*\Big(\sup_i |X_n(t_i)| > M + \delta\Big) + P^*\Big(\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \epsilon\Big) \Big\} \\
&\quad\ge 1 - \limsup_{n\to\infty} P^*\Big(\sup_i |X_n(t_i)| > M + \delta\Big) - \limsup_{n\to\infty} P^*\Big(\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \epsilon\Big) \\
&\quad= \liminf_{n\to\infty} P_*\Big(\sup_i |X_n(t_i)| \le M + \delta\Big) - \limsup_{n\to\infty} P^*\Big(\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \epsilon\Big),
\end{aligned}
\]
and, from (2.9) and (2.7), we get
\[
\liminf_{n\to\infty} P_*\big(\|X_n\|_T \le (M + \epsilon) + \delta\big) \ge 1 - \xi - \eta, \qquad \forall \delta > 0,
\]
that is, the sequence $\|X_n\|_T$ is asymptotically tight in $\mathbb{R}$.

Fix $\zeta > 0$ and a sequence $\epsilon_m \downarrow 0$. Take a constant $M$ such that $\limsup P^*(\|X_n\|_T > M) < \zeta$, and for each $\epsilon = \epsilon_m$ and $\eta = 2^{-m}\zeta$, take a partition $T = \bigcup_{i=1}^k T_i$ as in (2.7). For the moment $m$ is fixed and we do not let it appear in the notation. Let $\{z_1, \dots, z_p\}$ be the set of all functions in $\ell^\infty(T)$ that are constant on each $T_i$ and take only the values $0, \pm\epsilon_m, \dots, \pm\lfloor M/\epsilon_m \rfloor \epsilon_m$ (there is a finite number of such functions). Let $K_m$ be the union of the $p$ closed balls of radius $\epsilon_m$ around the $z_i$. Then, by construction, the two conditions
\[
\|X_n\|_T \le M, \qquad \sup_{1\le i\le k}\, \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \epsilon_m
\]
imply that $X_n \in K_m$. This is true for each fixed $m$.

Let $K = \bigcap_{m=1}^\infty K_m$. Then $K$ is closed and totally bounded (by construction of the $K_m$ and because $\epsilon_m \downarrow 0$) and hence compact. Furthermore, for every $\delta > 0$, there is a $j$ with $K^\delta \supset \bigcap_{m=1}^j K_m$. If not, then there would be a sequence $y_j$ not in $K^\delta$, but with $y_j \in \bigcap_{m=1}^j K_m$ for every $j$.
This would have a subsequence contained in one of the balls making up $K_1$, a further subsequence eventually contained in one of the balls making up $K_2$, and so on. The "diagonal" sequence, formed by taking the first of the first subsequence, the second of the second subsequence and so on, would eventually be contained in a ball of radius $\epsilon_j$ for every $j$; hence it is Cauchy. Its limit would be in $K$, contradicting the fact that $d(y_j, K) \ge \delta$ for every $j$.

Conclude that if $X_n$ is not in $K^\delta$, then it is not in $\bigcap_{m=1}^j K_m$ for some fixed $j$. Then
\[
\begin{aligned}
\limsup_{n\to\infty} P^*\big(X_n \notin K^\delta\big) &\le \limsup_{n\to\infty} P^*\Big(X_n \notin \bigcap_{m=1}^j K_m\Big) \\
&\le \limsup_{n\to\infty} P^*\Big(\{\|X_n\|_T > M\} \cup \bigcup_{m=1}^j \Big\{\sup_{1\le i\le k}\, \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \epsilon_m\Big\}\Big) \\
&\le \limsup_{n\to\infty} P^*\big(\|X_n\|_T > M\big) + \sum_{m=1}^j \limsup_{n\to\infty} P^*\Big(\sup_{1\le i\le k}\, \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \epsilon_m\Big) \\
&\le \zeta + \sum_{m=1}^j 2^{-m}\zeta < 2\zeta.
\end{aligned} \tag{2.10}
\]
Thus, for every $\epsilon > 0$ there exists a compact set $K \subset \ell^\infty(T)$ such that
\[
\liminf_{n\to\infty} P_*\big(X_n \in K^\delta\big) \ge 1 - \epsilon \qquad \forall \delta > 0,
\]
that is, $X_n$ is asymptotically tight.

(iii) $\Rightarrow$ (ii). For every $\epsilon, \eta > 0$ take $\delta > 0$ such that (2.2) holds. Since $T$ is totally bounded, it can be covered with finitely many balls of radius $\delta$. Construct a partition by disjointifying these balls.

(i) $\Rightarrow$ (iii). Let $K_1 \subset K_2 \subset \cdots$ be compacts in $\ell^\infty(T)$ with $\liminf P_*(X_n \in K_m^\epsilon) \ge 1 - 1/m$ for every $\epsilon > 0$. For every fixed $m$, define a semimetric $\rho_m$ on $T$ by
\[
\rho_m(s,t) = \sup_{z\in K_m} |z(s) - z(t)|, \qquad s, t \in T.
\]
Then $(T, \rho_m)$ is totally bounded. Indeed, cover $K_m$ by finitely many balls of (arbitrarily small) radius $\eta$, centered at $z_1, \dots, z_k$. Partition $\mathbb{R}^k$ into cubes of edge $\eta$, and for every cube pick at most one $t \in T$ such that $(z_1(t), \dots, z_k(t))$ is in the cube. Since $z_1, \dots, z_k$ are uniformly bounded this gives finitely many points $t_1, \dots, t_p$. The quantity $\rho_m(t, t_i)$ can be bounded by
\[
2 \sup_{z\in K_m} \inf_j \|z - z_j\|_T + \sup_j |z_j(t_i) - z_j(t)|.
\]
Indeed,
\[
|z(t) - z(t_i)| \le |z(t) - z_j(t)| + |z_j(t) - z_j(t_i)| + |z_j(t_i) - z(t_i)| \le 2\|z - z_j\|_T + |z_j(t) - z_j(t_i)|;
\]
since this holds for every $j$,
\[
\begin{aligned}
|z(t) - z(t_i)| &\le \inf_j \big\{ 2\|z - z_j\|_T + |z_j(t) - z_j(t_i)| \big\} \\
&\le \inf_j \big\{ 2\|z - z_j\|_T + \sup_j |z_j(t) - z_j(t_i)| \big\} \\
&\le 2 \inf_j \|z - z_j\|_T + \sup_j |z_j(t) - z_j(t_i)|,
\end{aligned}
\]
and, taking the sup over $z \in K_m$,
\[
\rho_m(t, t_i) \le 2 \sup_{z\in K_m} \inf_j \|z - z_j\|_T + \sup_j |z_j(t_i) - z_j(t)|.
\]
Since for every $t$ there is a $t_i$ for which $(z_1(t), \dots, z_k(t))$ and $(z_1(t_i), \dots, z_k(t_i))$ fall in the same cube, the balls $\{t : \rho_m(t, t_i) < 3\eta\}$ cover $T$.

Next set
\[
\rho(s,t) = \sum_{m=1}^\infty 2^{-m} \big(\rho_m(s,t) \wedge 1\big).
\]
Fix $\eta > 0$. Take a natural number $k$ with $2^{-k} < \eta$. Cover $T$ with finitely many $\rho_k$-balls of radius $\eta$. Let $t_1, \dots, t_p$ be their centers. Since $\rho_1 \le \rho_2 \le \dots$, for every $t$ there is a $t_i$ with $\rho(t, t_i) < 2\eta$. Indeed,
\[
\begin{aligned}
\rho(t, t_i) &\le \sum_{m=1}^k 2^{-m} \rho_m(t, t_i) + \sum_{m=k+1}^\infty 2^{-m} \\
&\le \rho_k(t, t_i) \sum_{m=1}^k 2^{-m} + 2^{-k} \sum_{m=1}^\infty 2^{-m} \\
&\le \rho_k(t, t_i) + 2^{-k} < \rho_k(t, t_i) + \eta.
\end{aligned}
\]
Thus $T$ is totally bounded for $\rho$, too.

It is clear from the definitions that $|z(s) - z(t)| \le \rho_m(s,t)$ for every $z \in K_m$ and that $\rho_m(s,t) \wedge 1 \le 2^m \rho(s,t)$. Thus, for any $\epsilon \le 1$, if $\rho(s,t) < 2^{-m}\epsilon$ then $|z(s) - z(t)| \le \epsilon$ for every $z \in K_m$. Moreover, by the triangle inequality, for any pair $s, t \in T$,
\[
|z(s) - z(t)| \le |z(s) - z_0(s)| + |z_0(s) - z_0(t)| + |z_0(t) - z(t)| \le 2\|z - z_0\|_T + |z_0(s) - z_0(t)|.
\]
Finally, if $z \in K_m^\epsilon$ then there exists a $z_0 \in K_m$ such that $\|z - z_0\|_T < \epsilon$. Thus
\[
K_m^\epsilon \subset \Big\{ z : \sup_{\rho(s,t) < 2^{-m}\epsilon} |z(s) - z(t)| \le 3\epsilon \Big\}.
\]
Hence, for given $\epsilon$ and $m$, and for $\delta < 2^{-m}\epsilon$,
\[
\begin{aligned}
\liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\epsilon \Big)
&\ge \liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<2^{-m}\epsilon} |X_n(s) - X_n(t)| \le 3\epsilon \Big) \\
&\ge \liminf_{n\to\infty} P_*\big(X_n \in K_m^\epsilon\big) \ge 1 - 1/m.
\end{aligned}
\]
Thus $\{X_n\}$ is asymptotically uniformly $\rho$-equicontinuous in probability. □

2.1 Exercises

Exercise 2.1 Show that if $H : \ell^\infty(T) \to \mathbb{R}$ is bounded and continuous, and $K \subset \ell^\infty(T)$ is compact, then for every $\epsilon > 0$ there is a $\delta > 0$ such that, if $x \in K$ and $y \in \ell^\infty(T)$ with $\|y - x\|_T < \delta$, then $|H(x) - H(y)| < \epsilon$.

Solution. Since $H$ is continuous, for any fixed $z \in K$ and for every $\epsilon > 0$ there exists a $\delta(z) > 0$ such that: if $v \in \ell^\infty(T)$ with $\|v - z\|_T < \delta(z)$, then $|H(v) - H(z)| < \epsilon/2$. Denoting by $B(z; \delta(z)/2)$ the open ball of center $z$ and radius $\delta(z)/2$, we have that
\[
K \subset \bigcup_{z\in K} B(z; \delta(z)/2).
\]
Since $K$ is compact, there exist $z_1, \dots, z_n$ such that
\[
K \subset \bigcup_{i=1}^n B(z_i; \delta(z_i)/2).
\]
Let $\delta := \min\{\delta(z_1)/2, \delta(z_2)/2, \dots, \delta(z_n)/2\}$. Then $\delta$ does the job. Indeed, take $x \in K$ and $y \in \ell^\infty(T)$ with $\|y - x\|_T < \delta$. Since $x \in K$ there exists $z_k$ such that $x \in B(z_k; \delta(z_k)/2)$. Thus, by the triangle inequality,
\[
\|y - z_k\|_T \le \|y - x\|_T + \|x - z_k\|_T < \delta + \delta(z_k)/2 \le \delta(z_k).
\]
Finally, using the triangle inequality and the continuity of $H$, we get that
\[
|H(x) - H(y)| \le |H(x) - H(z_k)| + |H(y) - H(z_k)| < \epsilon/2 + \epsilon/2 = \epsilon.
\]
□

Exercise 2.2 Prove that if $X_n \Rightarrow X$ in $\ell^\infty(T)$ and $g : \ell^\infty(T) \to D$ for a metric space $(D, d)$ is continuous, then $g(X_n) \Rightarrow g(X)$ in $(D, d)$.

Solution. Since $X_n \Rightarrow X$ in $\ell^\infty(T)$, we have that, for every bounded and continuous function $f : \ell^\infty(T) \to \mathbb{R}$,
\[
E[f(X_n)] \to E[f(X)]. \tag{2.11}
\]
We want to prove that, given $g : \ell^\infty(T) \to D$ continuous, for every bounded and continuous $h : D \to \mathbb{R}$,
\[
E[h(g(X_n))] \to E[h(g(X))]. \tag{2.12}
\]
But, thanks to the continuity of $g$ and $h$ and the boundedness of $h$, we have that $h \circ g$ is bounded and continuous. Hence (2.11) implies (2.12). □

Chapter 3

Maximal Inequalities and Chaining

3.1 Orlicz norms and the Pisier inequality

Let $\psi$ be a Young modulus, that is, a convex increasing unbounded function $\psi : [0, \infty) \to [0, \infty)$ satisfying $\psi(0) = 0$. For any random variable $X$, the $L_\psi$-Orlicz norm of $X$ is defined as
\[
\|X\|_\psi = \inf\Big\{ c > 0 : E\,\psi\Big(\frac{|X|}{c}\Big) \le 1 \Big\}.
\]
The function
\[
\psi_p(x) = e^{x^p} - 1 \tag{3.1}
\]
is a Young modulus for each $p \ge 1$.
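As a numerical illustration of the definition, the following sketch computes $\|X\|_{\psi_p}$ by bisection on $c$ for a random variable taking finitely many values. The function names `psi_p` and `orlicz_norm`, and the restriction to discrete distributions, are assumptions made for this illustration only; they are not part of the notes.

```python
import math


def psi_p(x, p):
    """Young modulus psi_p(x) = exp(x**p) - 1 for x >= 0 (overflow guarded)."""
    t = x ** p
    return math.inf if t > 700.0 else math.expm1(t)


def orlicz_norm(values, probs, p=1.0, tol=1e-12):
    """Approximate inf{c > 0 : E psi_p(|X|/c) <= 1} for a discrete X.

    values, probs: support points and their probabilities (assumed to sum to 1).
    """
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0

    def expected_psi(c):
        # E psi_p(|X|/c); zero-probability atoms are skipped.
        return sum(q * psi_p(abs(v) / c, p)
                   for v, q in zip(values, probs) if q > 0.0)

    # At c = m / (log 2)^(1/p) every term psi_p(|v|/c) is at most 1,
    # so this c is an upper bound for the norm.
    lo, hi = 0.0, m / math.log(2.0) ** (1.0 / p)
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_psi(mid) <= 1.0:
            hi = mid   # mid is admissible: the norm is at most mid
        else:
            lo = mid   # mid is too small
    return hi
```

For instance, `orlicz_norm([1.0], [1.0], p=2.0)` should agree with $(\log 2)^{-1/2}$, the value obtained in Exercise 3.1 for the constant random variable $X = 1$.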
Moreover, it is easy to see that for every $p \ge 1$ there exists $c_p < \infty$ such that the inequality $\|X\|_p \le c_p \|X\|_{\psi_1}$ holds for any random variable $X$. In detail, we can show that the previous relationship holds for $c_p = (\Gamma(p+1))^{1/p}$.

Proof. Without loss of generality, we can assume that $X \ge 0$. In this case, by definition, we get that
\[
\|X\|_{\psi_1} = \inf\big\{ c > 0 : E\big(e^{X/c} - 1\big) \le 1 \big\},
\]
while, on the other side, we have that $\|X\|_p = (E X^p)^{1/p}$. Thus, it remains to show that
\[
(E X^p)^{1/p} \le (\Gamma(p+1))^{1/p} \inf\big\{ c > 0 : E\big(e^{X/c} - 1\big) \le 1 \big\}.
\]
For this, the following inequality is crucial:
\[
x^p \le \Gamma(p+1)\,(e^x - 1), \qquad \text{for } x \ge 0,\ p \ge 1.
\]
From it we get, for any $c > 0$,
\[
\Big(\frac{X}{c}\Big)^p \le \Gamma(p+1)\big(e^{X/c} - 1\big),
\]
while, taking the expectation,
\[
E\Big[\Big(\frac{X}{c}\Big)^p\Big] \le \Gamma(p+1)\, E\big(e^{X/c} - 1\big)
\]
and also
\[
\frac{E(X^p)}{\Gamma(p+1)\, c^p} \le E\big(e^{X/c} - 1\big).
\]
Now, we can consider the set $D = \{ c > 0 : E(e^{X/c} - 1) \le 1 \}$. If $c \in D$, we get
\[
\frac{E(X^p)}{\Gamma(p+1)\, c^p} \le 1
\]
and $E(X^p) \le c^p\, \Gamma(p+1)$. Thus, taking the infimum over $c \in D$,
\[
E(X^p) \le \inf_{c\in D} c^p\, \Gamma(p+1) = \|X\|_{\psi_1}^p\, \Gamma(p+1),
\]
and finally
\[
\|X\|_p \le \|X\|_{\psi_1} (\Gamma(p+1))^{1/p}.
\]
Obviously, if $p$ is an integer, the relationship becomes simply $\|X\|_p \le \|X\|_{\psi_1} (p!)^{1/p}$.

In order to conclude the proof, it remains only to prove the crucial inequality used in it. First of all, we can note that, for $m \ge 1$, $m$ integer, it holds
\[
e^x - 1 \ge \frac{x^m}{m!} = \frac{x^m}{\Gamma(m+1)},
\]
from which it follows that $x^m \le \Gamma(m+1)(e^x - 1)$. Then, for any $p \ge 1$, if $x \le 1$, we have
\[
x^p \le x \le e^x - 1 \le \Gamma(p+1)(e^x - 1),
\]
since $\Gamma(p+1) \ge 1$ for $p \ge 1$. If $x \ge 1$, note that $t \mapsto t^p e^{-t}$ is maximized at $t = p$, so that, by Stirling's lower bound $\Gamma(p+1) \ge \sqrt{2\pi p}\; p^p e^{-p}$,
\[
x^p \le p^p e^{-p} e^x \le \frac{\Gamma(p+1)}{\sqrt{2\pi p}}\, e^x \le \frac{\Gamma(p+1)}{\sqrt{2\pi}}\, e^x \le \Gamma(p+1)(e^x - 1),
\]
where the last step uses $e^x (1 - 1/\sqrt{2\pi}) \ge 1$ for $x \ge 1$. □

We say that a Young modulus is of exponential type if the following two conditions are satisfied:
\[
\limsup_{\min\{x,y\}\to\infty} \frac{\psi^{-1}(xy)}{\psi^{-1}(x)\,\psi^{-1}(y)} < \infty
\qquad\text{and}\qquad
\limsup_{x\to\infty} \frac{\psi^{-1}(x^2)}{\psi^{-1}(x)} < \infty.
\]
(It is actually the second of these two conditions which forces the exponential type; the first condition is satisfied by Young functions of the form $\psi(x) = x^p$, $p \ge 1$.)
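The two conditions can be explored numerically. The sketch below evaluates both ratios for $\psi_p$ of (3.1), whose inverse is $(\log(1+y))^{1/p}$, and the second ratio for the power function $\psi(x) = x^p$, whose inverse is $y^{1/p}$. The helper names are hypothetical and chosen only for this illustration; evaluating at a few large arguments suggests the limiting behaviour but does not prove it.

```python
import math


def psi_p_inv(y, p):
    """Inverse of psi_p(x) = exp(x**p) - 1, i.e. (log(1 + y))**(1/p)."""
    return math.log1p(y) ** (1.0 / p)


def power_inv(y, p):
    """Inverse of the power Young function psi(x) = x**p."""
    return y ** (1.0 / p)


def product_ratio(inv, x, y, p):
    """psi^{-1}(xy) / (psi^{-1}(x) psi^{-1}(y)): first condition."""
    return inv(x * y, p) / (inv(x, p) * inv(y, p))


def square_ratio(inv, x, p):
    """psi^{-1}(x^2) / psi^{-1}(x): second condition."""
    return inv(x * x, p) / inv(x, p)


if __name__ == "__main__":
    p = 2.0
    for x in (1e2, 1e4, 1e6):
        print(x,
              product_ratio(psi_p_inv, x, x, p),   # stays bounded
              square_ratio(psi_p_inv, x, p),       # approaches 2**(1/p)
              square_ratio(power_inv, x, p))       # grows like x**(1/p)
```

For $\psi_p$ the second ratio stabilizes near $2^{1/p}$, whereas for $\psi(x) = x^p$ it equals $x^{1/p}$ and is unbounded, in line with the parenthetical remark above.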
Note that $\psi_p$ defined in (3.1) satisfies these conditions (since $\psi_p^{-1}(x) = (\log(x+1))^{1/p}$). In what follows, if a variable $X$ is not necessarily measurable, we write $\|X\|_\psi^*$ for $\| |X|^* \|_\psi$, where $|X|^*$ is the measurable envelope of $|X|$.

The following lemma gives a simple way of bounding $\|X\|_{\psi_p}$.

Lemma 3.1 Suppose that $X$ is a random variable with
\[
P(|X| > x) \le K \exp(-C x^p)
\]
for all $x > 0$ and some positive constants $K$ and $C$, with $p \ge 1$. Then the $\psi_p$ Orlicz norm satisfies $\|X\|_{\psi_p} \le ((1+K)/C)^{1/p}$.

Proof. Without loss of generality, we can assume that $X \ge 0$. By definition of the $\psi_p$ Orlicz norm, we have
\[
\|X\|_{\psi_p} = \inf\big\{ \lambda > 0 : E\big(e^{X^p/\lambda^p} - 1\big) \le 1 \big\}.
\]
At the same time, we have
\[
\|X^p\|_{\psi_1} = \inf\big\{ \xi > 0 : E\big(e^{X^p/\xi} - 1\big) \le 1 \big\},
\]
from which it follows that $\|X^p\|_{\psi_1} = \|X\|_{\psi_p}^p$. Thus, it suffices to show that
\[
\|X^p\|_{\psi_1} \le \frac{1+K}{C}.
\]
Now, let $Z = X^p$. It remains to prove that $\|Z\|_{\psi_1} \le (1+K)/C$, and, for this purpose, it is useful to note that
\[
P(Z > z) = P(X > z^{1/p}) \le K \exp\{-Cz\}.
\]
Moreover, by definition,
\[
\|Z\|_{\psi_1} = \inf\big\{ \alpha > 0 : E\big(e^{Z/\alpha} - 1\big) \le 1 \big\},
\]
so it suffices to show that $E(e^{Z/\alpha_0} - 1) \le 1$ with $\alpha_0 = (1+K)/C$. We have
\[
\begin{aligned}
E\, e^{Z/\alpha_0} &= \int_0^1 P\big(e^{Z/\alpha_0} \ge y\big)\,dy + \int_1^\infty P\big(e^{Z/\alpha_0} \ge y\big)\,dy \\
&\le 1 + \int_1^\infty P\big(e^{Z/\alpha_0} \ge y\big)\,dy
= 1 + \int_1^\infty P(Z > \alpha_0 \log y)\,dy \\
&\le 1 + \int_1^\infty K \exp\Big\{-C \cdot \frac{1+K}{C} \log y\Big\}\,dy \\
&= 1 + \int_1^\infty \frac{K}{y^{1+K}}\,dy = 1 + \big[-y^{-K}\big]_1^\infty = 2.
\end{aligned}
\]
Then, it follows that
\[
E\big(e^{Z/\alpha_0} - 1\big) = E\, e^{Z/\alpha_0} - 1 \le 2 - 1 = 1.
\]
(For more details, it is also possible to refer to van der Vaart and Wellner (1996), page 96.) □

Once we have knowledge of (or bounds for) the individual Orlicz norms of some family of random variables $\{X_k\}$, then we can also control the Orlicz norm of a particular weighted supremum of the family. This is the content of the following proposition.

Proposition 3.1 (de la Peña and Giné) Let $\psi$ be a Young modulus of exponential type. Then there exists a finite constant $C_\psi$ such that for every sequence of random variables $\{X_k\}$
\[
\Big\| \sup_k \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \le C_\psi \sup_k \|X_k\|_\psi. \tag{3.2}
\]

Proof. We can delete a finite number of terms from the supremum on the left side as long as the number of terms deleted depends only on $\psi$. Furthermore, by homogeneity it suffices to prove that the inequality holds in the case that $\sup_k \|X_k\|_\psi = 1$. Let $M \ge 1/2$ and let $a > 0$, $b > 0$ be constants such that

(a) $\psi^{-1}(xy) \le a\,\psi^{-1}(x)\,\psi^{-1}(y)$ and $\psi^{-1}(x^2) \le b\,\psi^{-1}(x)$ for all $x, y \ge M$.

Define
\[
k_0 = \max\Big\{ 5,\ \psi\Big(\frac{1}{b}\psi^{-1}(M)\Big),\ M \Big\}, \qquad
c = \max\Big\{ \frac{\psi^{-1}(M^2)}{\psi^{-1}(1/2)},\ \frac{\psi^{-1}(M)}{\psi^{-1}(1/2)},\ b \Big\}, \qquad
\gamma = abc.
\]
For this choice of $c$ we have, by the properties of $\psi$, that
\[
\psi\big(c\,\psi^{-1}(t)\big) \ge t^2 \qquad \text{for } t \ge 1/2;
\]
this is easy for $t \ge M$ since $c \ge b$ and hence $x^2 \le \psi(b\,\psi^{-1}(x)) \le \psi(c\,\psi^{-1}(x))$, while, for $1/2 \le t < M$,
\[
\psi\big(c\,\psi^{-1}(t)\big) \ge \psi\big(c\,\psi^{-1}(1/2)\big) \ge M^2 > t^2.
\]
Thus for $t \ge 1/2$ we have
\[
\begin{aligned}
\Pr\Big\{ \psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big) \ge t \Big\}
&= \Pr\Big\{ \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)\,\psi^{-1}(t)} \ge 1 \Big\} \\
&\le \sum_{k=k_0}^\infty \Pr\big\{ |X_k| \ge \gamma\,\psi^{-1}(k)\,\psi^{-1}(t) \big\} \\
&= \sum_{k=k_0}^\infty \Pr\big\{ \psi(|X_k|) \ge \psi\big(\gamma\,\psi^{-1}(k)\,\psi^{-1}(t)\big) \big\} \\
&\le \sum_{k=k_0}^\infty \frac{1}{\psi\big(\gamma\,\psi^{-1}(k)\,\psi^{-1}(t)\big)} \\
&\le \sum_{k=k_0}^\infty \frac{1}{\psi\big(b\,\psi^{-1}(k)\big)\,\psi\big(c\,\psi^{-1}(t)\big)} \\
&\le \sum_{k=k_0}^\infty \frac{1}{k^2 t^2} \le \frac{1}{4t^2},
\end{aligned}
\]
using $k_0 \ge 5$ at the last step and taking $x = \psi(b\,\psi^{-1}(k))$, $y = \psi(c\,\psi^{-1}(t))$ in (a) to get the inequality in which the product $\psi(b\,\psi^{-1}(k))\,\psi(c\,\psi^{-1}(t))$ first appears. Hence it follows that
\[
\begin{aligned}
E\,\psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big)
&\le \frac{1}{2} + \int_{1/2}^\infty \Pr\Big\{ \psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big) \ge t \Big\}\,dt \\
&\le \frac{1}{2} + \frac{1}{4} \int_{1/2}^\infty t^{-2}\,dt = \frac{1}{2} + \frac{1}{2} = 1.
\end{aligned}
\]
Thus we have proved that
\[
\Big\| \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \le \gamma = \gamma_\psi.
\]
To complete the proof, note that
\[
\begin{aligned}
\Big\| \sup_{k\ge 1} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi
&= \Big\| \sup_{k<k_0} \frac{|X_k|}{\psi^{-1}(k)} \vee \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \\
&\le \Big\| \sup_{k<k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi + \Big\| \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \\
&\le \sum_{k<k_0} \frac{1}{\psi^{-1}(k)} + \gamma_\psi \equiv C_\psi.
\end{aligned}
\]
□

The following corollary of the proposition is a result similar to van der Vaart and Wellner (1996), Lemma 2.2.2, page 96.
Corollary 3.1 If $\psi$ is a Young function of exponential type and $\{X_k\}_{k=1}^m$ is any finite collection of random variables, then
\[
\Big\| \sup_{1\le k\le m} |X_k| \Big\|_\psi \le C_\psi\, \psi^{-1}(m) \sup_{1\le k\le m} \|X_k\|_\psi \tag{3.3}
\]
where $C_\psi$ is a finite constant depending only on $\psi$.

To apply these basic inequalities to processes $\{X(t) : t \in T\}$, we need to introduce several notions concerning the size of the index set $T$. For any $\epsilon > 0$, the covering number $N(\epsilon, T, d)$ of the metric or pseudo-metric space $(T, d)$ is the smallest number of open balls of radius at most $\epsilon$ and centers in $T$ needed to cover $T$; that is,
\[
N(\epsilon, T, d) = \min\Big\{ k : \text{there exist } t_1, \dots, t_k \in T \text{ such that } T \subset \bigcup_{i=1}^k B(t_i, \epsilon) \Big\}.
\]
The packing number $D(\epsilon, T, d)$ is the largest $k$ for which there exist $k$ points $t_1, \dots, t_k$ in $T$ at least $\epsilon$ apart for the metric $d$; i.e. $d(t_i, t_j) \ge \epsilon$ if $i \ne j$. The metric entropy or $\epsilon$-entropy of $(T, d)$ is $\log N(\epsilon, T, d)$, and the $\epsilon$-capacity is $\log D(\epsilon, T, d)$.

Covering numbers and packing numbers are equivalent in the following sense:
\[
D(2\epsilon, T, d) \le N(\epsilon, T, d) \le D(\epsilon, T, d) \tag{3.4}
\]
as can be easily checked.

Proof. Let $k = N(\epsilon, T, d)$. To prove the second inequality, we can fix a point $t_1 \in T$ and consider the ball of radius $\epsilon$ around it. Since $k > 1$, there exists a point $t_2 \in T$ which is not in this ball (i.e. $d(t_1, t_2) \ge \epsilon$). Consider now balls of radius $\epsilon$ centered at $t_1$ and $t_2$. Since $k > 2$, there exists $t_3 \in T$ such that $t_3 \notin B_\epsilon(t_1)$ and $t_3 \notin B_\epsilon(t_2)$. Proceeding in this way, we get $t_1, t_2, \dots, t_{k-1}$ which are all $\epsilon$-separated (by construction) and finally get $t_k$ that is $\epsilon$-separated from the rest. So, we have found $k$ points $t_1, t_2, \dots, t_k$ in $T$ such that $d(t_i, t_j) \ge \epsilon$ for $i \ne j$, from which it follows that
\[
N(\epsilon, T, d) \le D(\epsilon, T, d).
\]
To prove the first inequality, we can take $k+1$ points in $T$, say $s_1, s_2, \dots, s_{k+1}$. By definition of $N(\epsilon, T, d)$, we can find $k$ points in $T$, say $t_1, t_2, \dots, t_k$, such that
\[
T \subseteq \bigcup_{i=1}^k B_\epsilon(t_i).
\]
Then there exist two points $s_i, s_j$ both lying in the same ball $B_\epsilon(t_l)$, that is, $d(s_i, s_j) < 2\epsilon$. This shows that we cannot find $k+1$ points in $T$ which are $2\epsilon$-separated, and this proves the first inequality. □

As is well-known, if $T \subset \mathbb{R}^m$ is totally bounded and $d$ is equivalent to the Euclidean metric, then
\[
N(\epsilon, T, d) \le \frac{K}{\epsilon^m}
\]
for some constant $K$. For example, if $T$ is the ball $B(0, R)$ in $\mathbb{R}^m$ with radius $R$, then the bound in the last display holds with $K = (6R)^m$. As we will see in the next chapters, there are a variety of interesting cases in which the set $T$ is a space of functions and a bound of the same form as the Euclidean case holds (and hence such classes are called "Euclidean classes" by some authors). On the other hand, for many spaces of functions $T$, the covering numbers grow exponentially fast as $\epsilon \downarrow 0$; for these classes we will typically have a bound of the form
\[
\log N(\epsilon, T, d) \le \frac{K}{\epsilon^r}
\]
for some finite constant $K$ and $r > 0$; in these cases the value of $r$ will turn out to be crucial, as we will show later.

The following theorem is our first result involving a chaining argument. Its proof is simpler than the corresponding result in van der Vaart and Wellner (1996) (Theorem 2.2.4, page 98), but it holds only for Young functions of exponential type.

Theorem 3.1 (de la Peña and Giné) Let $(T, d)$ be a pseudo-metric space, let $\{X(t) : t \in T\}$ be a stochastic process indexed by $T$, and let $\psi$ be a Young modulus of exponential type such that
\[
\|X(t) - X(s)\|_\psi \le d(s,t), \qquad s, t \in T. \tag{3.5}
\]
Then there exists a constant $K$ dependent only on $\psi$ such that, for all finite subsets $S \subset T$, $t_0 \in T$, and $\delta > 0$, the following inequalities hold:
\[
\Big\| \max_{t\in S} |X(t)| \Big\|_\psi \le \|X(t_0)\|_\psi + K \int_0^D \psi^{-1}(N(\epsilon, T, d))\,d\epsilon, \tag{3.6}
\]
where $D$ is the diameter of $(T, d)$, and
\[
\Big\| \max_{s,t\in S,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi \le K \int_0^\delta \psi^{-1}(N(\epsilon, T, d))\,d\epsilon. \tag{3.7}
\]

Proof. If $(T, d)$ is not totally bounded, then the right hand sides of (3.6) and (3.7) are infinite.
Hence we can assume that $(T, d)$ is totally bounded and has diameter less than 1. For a finite set $S \subset T$ and $t_0 \in T$, the set $S \cup \{t_0\}$ is also finite, so we may assume that $t_0 \in S$. We can also assume that $X(t_0) = 0$ (if not, consider the process $Y(t) = X(t) - X(t_0)$). For each non-negative integer $k$ let $\{s_1^k, \dots, s_{N_k}^k\} \equiv S_k \subset S$ be the centers of $N_k \equiv N(2^{-k}, S, d)$ open balls of radius at most $2^{-k}$ and centers in $S$ that cover $S$. Note that $S_0$ consists of just one point, which we may take to be $t_0$. For each $k$, let $\pi_k : S \to S_k$ be a function satisfying $d(s, \pi_k(s)) < 2^{-k}$ for all $s \in S$; such a function clearly exists by definition of the set $S_k$. Furthermore, since $S$ is finite, there is an integer $k_S$ such that for $k \ge k_S$ and $s \in S$ we have $d(\pi_k(s), s) = 0$. Then by (3.5) it follows that $X(s) = X(\pi_k(s))$ a.s. Therefore, for $s \in S$,
\[
X(s) = \sum_{k=1}^{k_S} \big( X(\pi_k(s)) - X(\pi_{k-1}(s)) \big)
\]
almost surely. Now by the triangle inequality for the metric $d$ we have
\[
d(\pi_k(s), \pi_{k-1}(s)) \le d(\pi_k(s), s) + d(s, \pi_{k-1}(s)) < 2^{-k} + 2^{-(k-1)} = 3 \cdot 2^{-k}.
\]
It therefore follows from Proposition 3.1 that
\[
\begin{aligned}
\Big\| \max_{s\in S} |X(s)| \Big\|_\psi
&\le \sum_{k=1}^{k_S} \Big\| \max_{t\in S_k,\ s\in S_{k-1}} |X(t) - X(s)| \Big\|_\psi \\
&\le 3 C_\psi \sum_{k=1}^{k_S} 2^{-k}\, \psi^{-1}(N_k N_{k-1}) \\
&\le K \sum_{k=1}^{k_S} 2^{-k}\, \psi^{-1}(N_k),
\end{aligned}
\]
where we used the second condition defining a Young modulus of exponential type in the last step. This implies (3.6) since $N(2\epsilon, S, d) \le N(\epsilon, T, d)$ for every $\epsilon > 0$ (to see this, note that, if an $\epsilon$-ball with center in $T$ intersects $S$, it is contained in a $2\epsilon$-ball with center in $S$), and then by bounding the sum in the last display by the integral in (3.6).

To prove (3.7), for $\delta > 0$ set $V = \{(s,t) : s, t \in T,\ d(s,t) \le \delta\}$, and for $v \in V$ define the process
\[
Y(v) = X(t_v) - X(s_v),
\]
where $v = (s_v, t_v)$. For $u, v \in V$ define the pseudo-metric $\rho(u,v) = \|Y(u) - Y(v)\|_\psi$.
We can assume that $\delta \le \mathrm{diam}(T)$; also note that
\[
\mathrm{diam}_\rho(V) = \sup_{u,v\in V} \rho(u,v) \le 2 \max_{v\in V} \|Y(v)\|_\psi \le 2\delta,
\]
and furthermore
\[
\rho(u,v) \le \|X(t_v) - X(t_u)\|_\psi + \|X(s_v) - X(s_u)\|_\psi \le d(t_v, t_u) + d(s_v, s_u).
\]
It follows that, if $t_1, \dots, t_N$ are the centers of a covering of $T$ by $N = N(\epsilon, T, d)$ open balls of radius at most $\epsilon$, then the set of open balls with centers in $\{(t_i, t_j) : 1 \le i, j \le N\}$ and $\rho$-radius $2\epsilon$ cover $V$. Not all the $(t_i, t_j)$ need to be in $V$, but if the $2\epsilon$ ball about $(t_i, t_j)$ has a non-empty intersection with $V$, then it is contained in a ball of radius $4\epsilon$ centered at a point in $V$. Thus we have
\[
N(4\epsilon, V, \rho) \le N^2(\epsilon, T, d).
\]
Moreover, the process $\{Y(v) : v \in V\}$ satisfies (3.5) for the metric $\rho$. We can therefore apply (3.6) to the process $Y$ with the choice $v_0 = (s, s)$ for any $s \in S$, for which $Y(v_0) = 0$. We find that
\[
\begin{aligned}
\Big\| \max_{s,t\in S,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi
&\le K \int_0^{2\delta} \psi^{-1}(N(r, V, \rho))\,dr \\
&\le K \int_0^{2\delta} \psi^{-1}(N^2(r/4, T, d))\,dr \\
&\le K \int_0^{\delta/2} \psi^{-1}(N(\epsilon, T, d))\,d\epsilon,
\end{aligned}
\]
where we used the second property of a Young modulus of exponential type in the last step (the constant $K$ may change from line to line). □

A process $\{X(t) : t \in T\}$, where $(T, d)$ is a metric space (or a pseudo-metric space), is separable if there exists a countable set $T_0 \subset T$ and a subset $\Omega_0 \subset \Omega$ with $P(\Omega_0) = 1$ such that for all $\omega \in \Omega_0$, $t \in T$, and $\epsilon > 0$, $X(t, \omega)$ is in the closure of $\{X(s, \omega) : s \in T_0 \cap B(t, \epsilon)\}$. If $X$ is separable, then it is easily seen that
\[
\Big\| \sup_{t\in T} |X(t)| \Big\|_\psi = \sup_{S\subset T,\ S \text{ finite}} \Big\| \max_{t\in S} |X(t)| \Big\|_\psi
\]
and similarly for
\[
\Big\| \sup_{d(s,t)\le\delta,\ s,t\in T} |X(s) - X(t)| \Big\|_\psi.
\]
As is well known, if $(T, d)$ is a separable metric or pseudo-metric space and $X$ is uniformly continuous in probability for $d$, then $X$ has a separable version. Since $N(\epsilon, T, d) < \infty$ for all $\epsilon > 0$ implies that $(T, d)$ is totally bounded and the condition (3.5) implies that $X$ is uniformly continuous in probability, the following corollary is an easy consequence of the preceding theorem.
Corollary 3.2 Suppose that $(T, d)$ is a pseudo-metric space of diameter $D$, and let $\psi$ be a Young modulus of exponential type such that
\[
\int_0^D \psi^{-1}(N(\epsilon, T, d))\,d\epsilon < \infty. \tag{3.8}
\]
If $\{X(t) : t \in T\}$ is a stochastic process satisfying (3.5), then, for a version of $X$ with all sample paths in $C_u(T, d)$, which we continue to denote by $X$,
\[
\Big\| \sup_{t\in T} |X(t)| \Big\|_\psi \le \|X(t_0)\|_\psi + K \int_0^D \psi^{-1}(N(\epsilon, T, d))\,d\epsilon \tag{3.9}
\]
and
\[
\Big\| \sup_{s,t\in T,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi \le K \int_0^\delta \psi^{-1}(N(\epsilon, T, d))\,d\epsilon. \tag{3.10}
\]

Corollary 3.3 (Giné, Mason and Zaitsev (2003)) Let $\psi$ be a Young modulus of exponential type, let $(T, d)$ be a totally bounded pseudometric space, and let $\{X_t : t \in T\}$ be a stochastic process indexed by $T$, with the property that there exist $C < \infty$ and $0 < \gamma < \mathrm{diam}(T)$ such that
\[
\|X_s - X_t\|_\psi \le C\, d(s,t) \tag{3.11}
\]
whenever $\gamma \le d(s,t) < \mathrm{diam}(T)$. Then there exists a constant $L$ depending only on $\psi$ such that, for any $\gamma < \delta \le \mathrm{diam}(T)$,
\[
\Big\| \sup_{d(s,t)\le\delta} |X_s - X_t| \Big\|_\psi^* \le 2 \Big\| \sup_{d(s,t)\le\gamma} |X_s - X_t| \Big\|_\psi^* + CL \int_{\gamma/2}^\delta \psi^{-1}(D(\epsilon, T, d))\,d\epsilon. \tag{3.12}
\]

Proof. Let $T_\gamma$ be a maximal subset of $T$ satisfying $d(s,t) \ge \gamma$ for $s \ne t \in T_\gamma$. Then $\mathrm{card}(T_\gamma) = D(\gamma, T, d)$. If $s, t \in T$ and $d(s,t) \le \delta$, let $s_\gamma$ and $t_\gamma$ be points in $T_\gamma$ such that $d(s, s_\gamma) < \gamma$ and $d(t, t_\gamma) < \gamma$, which exist by the maximality property of $T_\gamma$. Then $d(s_\gamma, t_\gamma) < \delta + 2\gamma < 3\delta$. Since
\[
|X_s - X_t| \le |X_s - X_{s_\gamma}| + |X_t - X_{t_\gamma}| + |X_{s_\gamma} - X_{t_\gamma}|,
\]
we obtain

(a)
\[
\Big\| \sup_{d(s,t)\le\delta} |X_s - X_t| \Big\|_\psi^* \le 2 \Big\| \sup_{d(s,t)<\gamma} |X_s - X_t| \Big\|_\psi^* + \Big\| \max_{d(s,t)<3\delta;\ s,t\in T_\gamma} |X_s - X_t| \Big\|_\psi.
\]
Now, the process $X_s$ restricted to the finite set $T_\gamma$ satisfies inequality (3.11) for all $s, t \in T_\gamma$, and therefore we can apply Theorem 3.1 to the restriction to $T_\gamma$ of $X_s/C$ to conclude that

(b)
\[
\Big\| \max_{d(s,t)<3\delta;\ s,t\in T_\gamma} |X_s - X_t|/C \Big\|_\psi \le L \int_0^{3\delta} \psi^{-1}(D(\epsilon, T_\gamma, d))\,d\epsilon \le 3L \int_0^\delta \psi^{-1}(D(\epsilon, T_\gamma, d))\,d\epsilon,
\]
where $L$ is a constant that depends only on $\psi$. Now we note that $D(\epsilon, T_\gamma, d) \le D(\epsilon, T, d)$ for all $\epsilon > 0$ and that, moreover, $D(\epsilon, T_\gamma, d) = \mathrm{card}(T_\gamma) = D(\gamma, T, d)$ for all $\epsilon \le \gamma$. Hence,
\[
\begin{aligned}
\int_0^\delta \psi^{-1}(D(\epsilon, T_\gamma, d))\,d\epsilon
&\le \gamma\, \psi^{-1}(D(\gamma, T, d)) + \int_\gamma^\delta \psi^{-1}(D(\epsilon, T, d))\,d\epsilon \\
&\le 3 \int_{\gamma/2}^\delta \psi^{-1}(D(\epsilon, T, d))\,d\epsilon,
\end{aligned}
\]
and this, in combination with the previous inequalities (a) and (b), gives the corollary. □

Corollary 3.3 gives an example of "restricted" or "stopped" chaining. Giné and Zinn (1984) use restricted chaining with $\gamma = n^{-1/4}$ at stage $n$, but other choices are of interest in the applications of Giné, Mason, and Zaitsev (2003): they take $\gamma = \rho\, n^{-1/2}$, $\rho$ arbitrary.

3.2 Gaussian and sub-Gaussian processes via Hoeffding's Inequality

Recall that a process $X(t)$, $t \in T$, is called a Gaussian process if all the finite-dimensional distributions are multivariate normal. As indicated previously, the natural pseudometric $\rho_X$ defined by
\[
\rho_X^2(s,t) = E[(X(s) - X(t))^2], \qquad s, t \in T,
\]
is very convenient and useful in this setting.

Here is a further corollary of Corollary 3.2 due to Dudley (1967).

Corollary 3.4 Suppose that $X(t)$, $t \in T$, is a Gaussian process with
\[
\int_0^D \sqrt{\log N(\epsilon, T, \rho_X)}\,d\epsilon < \infty.
\]
Then there exists a version of $X$ (which we continue to denote by $X$) with almost all of its sample paths in $C_u(T, \rho_X)$ which satisfies
\[
\Big\| \sup_{t\in T} |X(t)| \Big\|_{\psi_2} \le \|X(t_0)\|_{\psi_2} + K \int_0^D \sqrt{\log N(\epsilon, T, \rho_X)}\,d\epsilon \tag{3.13}
\]
for any fixed $t_0 \in T$, and
\[
\Big\| \sup_{s,t\in T,\ \rho_X(s,t)\le\delta} |X(t) - X(s)| \Big\|_{\psi_2} \le K \int_0^\delta \sqrt{\log N(\epsilon, T, \rho_X)}\,d\epsilon \tag{3.14}
\]
for all $0 < \delta \le D = \mathrm{diam}(T)$.

Proof. By direct computation, if $Z \sim N(0,1)$ then
\[
E \exp(Z^2/c^2) = \frac{1}{\sqrt{1 - 2/c^2}} < \infty
\]
for $c^2 > 2$.
Choosing $c^2 = 8/3$ yields $E \exp(Z^2/c^2) = 2$. Hence $\|Z\|_{\psi_2} = \sqrt{8/3}$. By homogeneity this yields $\|\sigma Z\|_{\psi_2} = \sigma\sqrt{8/3}$. Thus it follows that
\[
\|X(t) - X(s)\|_{\psi_2} = \sqrt{\tfrac{8}{3}}\, \big\{ E[(X(t) - X(s))^2] \big\}^{1/2} = \sqrt{\tfrac{8}{3}}\, \rho_X(s,t),
\]
so we can choose $\psi = \psi_2$ and $\rho = \sqrt{8/3}\,\rho_X$ in Corollary 3.2. The inequalities (3.9) and (3.10) yield (3.13) and (3.14) for different constants $K$ after noting two easy facts. First,
\[
\psi_2^{-1}(x) = \sqrt{\log(1+x)} \le C \sqrt{\log x}, \qquad x \ge 2,
\]
for an absolute constant $C = \sqrt{\log(3)/\log(2)} < 1.26$; and $N(\cdot, T, \rho)$ is monotone decreasing with $N(D/2, T, \rho) \ge 2$, $N(D, T, \rho) = 1$. It follows that for $0 < \delta \le D/2$ we have
\[
\int_0^\delta \sqrt{\log(1 + N(\epsilon, T, \rho))}\,d\epsilon \le C \int_0^\delta \sqrt{\log N(\epsilon, T, \rho)}\,d\epsilon,
\]
and, for $D/2 < \delta \le D$,
\[
\begin{aligned}
\int_0^\delta \sqrt{\log(1 + N(\epsilon, T, \rho))}\,d\epsilon
&\le 2 \int_0^{D/2} \sqrt{\log(1 + N(\epsilon, T, \rho))}\,d\epsilon \\
&\le 2C \int_0^{D/2} \sqrt{\log N(\epsilon, T, \rho)}\,d\epsilon \\
&\le 2C \int_0^\delta \sqrt{\log N(\epsilon, T, \rho)}\,d\epsilon.
\end{aligned}
\]
Second, for any positive constant $b > 0$,
\[
\int_0^\delta \sqrt{\log N(\epsilon, T, b\rho)}\,d\epsilon = b \int_0^{\delta/b} \sqrt{\log N(\epsilon, T, \rho)}\,d\epsilon
\]
by an easy change of variables. Combining these facts with (3.9) and (3.10) yields the claimed inequalities. □

The previous proof applies, virtually without change, to sub-Gaussian processes. First recall that a process $X(t)$, $t \in T$, is sub-Gaussian with respect to the pseudo-metric $d$ on $T$ if
\[
\Pr(|X(s) - X(t)| > x) \le 2 \exp\Big( -\frac{x^2}{2 d^2(s,t)} \Big), \qquad s, t \in T,\ x > 0.
\]
Moreover the process $X$ is sub-Gaussian in this sense with $d$ taken to be a constant multiple of $\rho_X$ if and only if
\[
\|X(s) - X(t)\|_{\psi_2} \le C \big\{ E[(X(t) - X(s))^2] \big\}^{1/2} = C \rho_X(s,t) \tag{3.15}
\]
for some $C < \infty$ and all $s, t \in T$.

Example 3.1 Suppose that $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher random variables (that is, $\Pr(\varepsilon_j = \pm 1) = 1/2$ for $j = 1, \dots, n$), and let
\[
X(t) = \sum_{i=1}^n t_i \varepsilon_i, \qquad t = (t_1, \dots, t_n) \in \mathbb{R}^n.
\]
Then it follows from Hoeffding's inequality that
\[
\Pr(|X(s) - X(t)| > x) \le 2 \exp\Big( -\frac{x^2}{2 \|s - t\|^2} \Big),
\]
where $\|\cdot\|$ denotes the Euclidean norm.
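The Hoeffding bound just stated can be checked exactly for small $n$, since $X(s) - X(t) = \sum_i (s_i - t_i)\varepsilon_i$ takes at most $2^n$ values. The sketch below enumerates all sign vectors; the function names and the test vector `u` (playing the role of $s - t$) are hypothetical choices made for this illustration.

```python
import itertools
import math


def rademacher_tail(u, x):
    """Exact Pr(|sum_i u_i * eps_i| > x) by enumerating all 2**n sign vectors."""
    n = len(u)
    hits = sum(
        1
        for signs in itertools.product((-1, 1), repeat=n)
        if abs(sum(s * ui for s, ui in zip(signs, u))) > x
    )
    return hits / 2 ** n


def hoeffding_bound(u, x):
    """Right-hand side 2 * exp(-x^2 / (2 * ||u||^2)) with the Euclidean norm."""
    norm_sq = sum(ui * ui for ui in u)
    return 2.0 * math.exp(-x * x / (2.0 * norm_sq))


if __name__ == "__main__":
    u = [1.0, 0.5, 2.0, 1.5, 0.25, 1.0, 0.75, 1.25]  # plays the role of s - t
    for x in (0.5, 1.0, 2.0, 3.0, 5.0, 8.0):
        print(x, rademacher_tail(u, x), hoeffding_bound(u, x))
```

The enumeration costs $2^n$ evaluations, so this is only practical for small $n$; it is meant as a sanity check of the inequality, not as a method of computation.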
Hence for any subset $T \subset \mathbb{R}^n$ the process $\{X(t) : t \in T\}$ is sub-Gaussian with respect to the Euclidean norm and we have
\[
\|X(t) - X(s)\|_{\psi_2} \le \sqrt{6}\, \|s - t\|
\]
by Lemma 3.1. If $T$ also satisfies
\[
\int_0^D \sqrt{\log(1 + N(\epsilon, T, \|\cdot\|))}\,d\epsilon < \infty, \tag{3.16}
\]
then $\{X(t) : t \in T\}$ has bounded continuous sample paths on $T$.

This example will play a key role in the development for empirical processes in the next chapters, where we will proceed by first symmetrizing the empirical process with Rademacher random variables and by conditioning on the $X_i$'s generating the empirical process.

Here is a statement of the resulting bounds for sub-Gaussian processes.

Corollary 3.5 Suppose that $X(t)$, $t \in T$, is a sub-Gaussian process with respect to the pseudo-metric $d$ on $T$ satisfying
\[
\int_0^D \sqrt{\log N(\epsilon, T, d)}\,d\epsilon < \infty.
\]
Then there exists a version of $X$ (which we continue to denote by $X$) with almost all of its sample paths in $C_u(T, d)$ which satisfies
\[
\Big\| \sup_{t\in T} |X(t)| \Big\|_{\psi_2} \le \|X(t_0)\|_{\psi_2} + K \int_0^D \sqrt{\log N(\epsilon, T, d)}\,d\epsilon \tag{3.17}
\]
for any fixed $t_0 \in T$, and
\[
\Big\| \sup_{s,t\in T,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_{\psi_2} \le K \int_0^\delta \sqrt{\log N(\epsilon, T, d)}\,d\epsilon \tag{3.18}
\]
for all $0 < \delta \le D = \mathrm{diam}(T)$.

3.3 Bernstein's inequality and $\psi_1$-Orlicz norms for maxima

Suppose that $Y_1, \dots, Y_n$ are independent random variables with $E Y_i = 0$ and $P(|Y_i| \le M) = 1$ for $i = 1, \dots, n$. Bernstein's inequality gives a bound on the tail of the absolute value of the sum $\sum_{i=1}^n Y_i$. We will derive it from Bennett's inequality.

Lemma 3.2 (Bennett's inequality) Suppose that $Y_1, \dots, Y_n$ are independent random variables with $Y_i \le M$ almost surely for all $i = 1, \dots, n$ and zero means. Then
\[
P\Big( \sum_{i=1}^n Y_i > x \Big) \le \exp\Big( -\frac{x^2}{2V}\, \psi\Big(\frac{Mx}{V}\Big) \Big), \tag{3.19}
\]
where $V \ge \mathrm{Var}\big(\sum_{i=1}^n Y_i\big) = \sum_{i=1}^n \mathrm{Var}(Y_i)$ and $\psi$ is the function given by
\[
\psi(x) = 2h(1+x)/x^2 \quad\text{with}\quad h(x) = x(\log x - 1) + 1, \qquad x > 0.
\]

Lemma 3.3 (Bernstein's inequality) If $Y_1, \dots, Y_n$ are independent random variables with $|Y_i| \le M$ almost surely for all $i = 1, \dots, n$ and zero means, then
\[
P\Big( \Big| \sum_{i=1}^n Y_i \Big| > x \Big) \le 2 \exp\Big( -\frac{x^2}{2(V + Mx/3)} \Big), \tag{3.20}
\]
where $V \ge \mathrm{Var}\big(\sum_{i=1}^n Y_i\big) = \sum_{i=1}^n \mathrm{Var}(Y_i)$.

Proof. Set $\sigma_i^2 = \mathrm{Var}(Y_i)$, $i = 1, \dots, n$. For each $r > 0$

(a)
\[
P\Big( \sum_{i=1}^n Y_i > x \Big) \le e^{-rx} \prod_{i=1}^n E e^{rY_i} = e^{-rx} \prod_{i=1}^n E\big\{ 1 + rY_i + (1/2) r^2 Y_i^2\, g(rY_i) \big\},
\]
where $g(x) = 2(e^x - 1 - x)/x^2$ is non-negative, increasing and convex for $x \in \mathbb{R}$. Thus
\[
E\big\{ 1 + rY_i + (1/2) r^2 Y_i^2\, g(rY_i) \big\} = 1 + (1/2) r^2 E\big\{ Y_i^2\, g(rY_i) \big\} \le 1 + (1/2) r^2 \sigma_i^2\, g(rM)
\]
for $i = 1, \dots, n$. Substituting this bound into (a) and then using $1 + u \le e^u$ shows that the right side of (a) is bounded by
\[
\begin{aligned}
e^{-rx} \prod_{i=1}^n \exp\big( r^2 \sigma_i^2\, g(rM)/2 \big)
&= \exp\Big( -rx + \frac{r^2 g(rM)}{2} \sum_{i=1}^n \sigma_i^2 \Big) \\
&= \exp\Big( -rx + \frac{e^{rM} - 1 - rM}{M^2} \sum_{i=1}^n \sigma_i^2 \Big) \\
&\le \exp\Big( -rx + \frac{e^{rM} - 1 - rM}{M^2}\, V \Big).
\end{aligned}
\]
Minimizing this upper bound with respect to $r$ shows that it is minimized by the choice $r = M^{-1} \log(1 + Mx/V)$. Plugging this in and using the definition of $\psi$ yields the claimed inequality. Lemma 3.3 follows by noting that $\psi(x) \ge (1 + x/3)^{-1}$. □

Note that for large $x$ the upper bound in Bernstein's inequality is of the form $\exp(-3x/2M)$ while for $x$ close to zero the bound is of the form $\exp(-x^2/2V)$. This suggests that it might be possible to bound the maximum of random variables satisfying a Bernstein type inequality by a combination of the $\psi_1$ and $\psi_2$ Orlicz norms. The following proposition makes this explicit.

Proposition 3.2 Suppose that $X_1, \dots, X_m$ are arbitrary random variables satisfying the probability tail bound
\[
P(|X_i| > x) \le 2 \exp\Big( -\frac{x^2}{2(d + cx)} \Big)
\]
for all $x > 0$ and $i = 1, \dots, m$, for fixed positive numbers $c$ and $d$. Then there is a universal constant $K < \infty$ so that
\[
\Big\| \max_{1\le i\le m} |X_i| \Big\|_{\psi_1} \le K \Big\{ c \log(1+m) + \sqrt{d}\, \sqrt{\log(1+m)} \Big\}.
\]

Proof. Note that the hypothesis implies that
\[
P(|X_i| > x) \le
\begin{cases}
2 \exp(-x^2/4d) & \text{if } x \le d/c, \\
2 \exp(-x/4c) & \text{if } x > d/c.
\end{cases}
\]
Hence it follows that the random variables $|X_i| 1_{[|X_i| \le d/c]}$ and $|X_i| 1_{[|X_i| > d/c]}$ satisfy, respectively,
\[
P\big(|X_i| 1_{[|X_i| \le d/c]} > x\big) \le 2 \exp(-x^2/4d), \qquad x > 0,
\]
and
\[
P\big(|X_i| 1_{[|X_i| > d/c]} > x\big) \le 2 \exp(-x/4c), \qquad x > 0.
\]
Then Lemma 3.1 implies that
\[
\big\| |X_i| 1_{[|X_i| \le d/c]} \big\|_{\psi_2} \le \sqrt{12 d}
\qquad\text{and}\qquad
\big\| |X_i| 1_{[|X_i| > d/c]} \big\|_{\psi_1} \le 12 c
\]
for $i = 1, \dots, m$. This yields
\[
\begin{aligned}
\Big\| \max_{1\le i\le m} |X_i| \Big\|_{\psi_1}
&\le \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| \le d/c]} \Big\|_{\psi_1} + \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| > d/c]} \Big\|_{\psi_1} \\
&\le C \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| \le d/c]} \Big\|_{\psi_2} + \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| > d/c]} \Big\|_{\psi_1} \\
&\le K \Big\{ \sqrt{d}\, \sqrt{\log(1+m)} + c \log(1+m) \Big\},
\end{aligned}
\]
where the second inequality follows from the fact that for any random variable $V$ we have $\|V\|_{\psi_1} \le C \|V\|_{\psi_2}$ for some constant $C$, and the third inequality follows from Corollary 3.1 applied with $\psi = \psi_2$ and with $\psi = \psi_1$. □

3.4 Exercises

Exercise 3.1 Show that the constant random variable $X = 1$ has $\|X\|_{\psi_p} = (\log 2)^{-1/p}$ for $\psi_p(x) = \exp(x^p) - 1$.

Solution. By definition of the $L_\psi$-Orlicz norm of $X$ in the case proposed, we have
\[
\begin{aligned}
\|X\|_{\psi_p} &= \inf\big\{ c > 0 : E\big(e^{X^p/c^p} - 1\big) \le 1 \big\} \\
&= \inf\big\{ c > 0 : e^{c^{-p}} - 1 \le 1 \big\} \\
&= \inf\big\{ c > 0 : e^{c^{-p}} \le 2 \big\}.
\end{aligned}
\]
It follows immediately that
\[
e^{c^{-p}} \le 2 \iff c^{-p} \le \log 2 \iff \frac{1}{c} \le (\log 2)^{1/p} \iff c \ge (\log 2)^{-1/p},
\]
and this means exactly that $\|X\|_{\psi_p} = (\log 2)^{-1/p}$. □

Exercise 3.2 Let $\psi$ be a Young modulus. Show that if $0 \le X_n \uparrow X$ almost surely, then $\|X_n\|_\psi \uparrow \|X\|_\psi$. (Hint: use the monotone convergence theorem to show that $\lim E\psi(X_n / r\|X\|_\psi) > 1$ for any $r < 1$.)

Solution. Since $X_1 \le X_2 \le \cdots \le X$, we have that $\|X_n\|_\psi$ is a non-decreasing sequence and, moreover, that it is bounded above by $\|X\|_\psi$. Thus $\|X_n\|_\psi$ converges. Let $L$ be its limit and suppose that $L < \|X\|_\psi$ (obviously, $L \le \|X\|_\psi$). Under this hypothesis, we can write $L = r\|X\|_\psi$ for some $0 < r < 1$. Now, we can note that
\[
\frac{X_n}{r\|X\|_\psi} \uparrow \frac{X}{r\|X\|_\psi} \quad\text{a.s.}
\qquad\text{and}\qquad
\psi\Big(\frac{X_n}{r\|X\|_\psi}\Big) \uparrow \psi\Big(\frac{X}{r\|X\|_\psi}\Big) \quad\text{a.s.}
\]
By the monotone convergence theorem, we get
$$E\left[\psi\left(\frac{X_n}{r\|X\|_\psi}\right)\right] \uparrow E\left[\psi\left(\frac{X}{r\|X\|_\psi}\right)\right],$$
and, by the definition of the Orlicz norm and the fact that $r < 1$, also
$$E\left[\psi\left(\frac{X}{r\|X\|_\psi}\right)\right] > 1.$$
On the other hand, as $r\|X\|_\psi \ge \|X_n\|_\psi$ for all $n$, it follows that, for all $n$,
$$E\left[\psi\left(\frac{X_n}{r\|X\|_\psi}\right)\right] \le 1,$$
and, passing to the limit by monotone convergence as before, also
$$E\left[\psi\left(\frac{X}{r\|X\|_\psi}\right)\right] \le 1.$$
This is a contradiction. It follows that $L = \lim_n \|X_n\|_\psi = \|X\|_\psi$ and the proof is complete. □

Exercise 3.3 Show that the infimum in the definition of an Orlicz norm is attained (at $\|X\|_\psi$).

Solution. Without loss of generality, we can consider the case in which $X \ge 0$. By definition of the Orlicz norm, we have
$$\|X\|_\psi = \tilde c = \inf\left\{c > 0 : E\psi\left(\frac{X}{c}\right) \le 1\right\},$$
where $\psi$ is a Young modulus. To solve the exercise, we have to show that $\tilde c$ itself belongs to the set appearing in the definition. For this purpose, take any sequence $c_n$ strictly decreasing to $\tilde c$. By definition of $\tilde c$,
$$E\psi\left(\frac{X}{c_n}\right) \le 1 \quad \text{for each } n.$$
Moreover, we have that
$$\frac{X}{c_n} \uparrow \frac{X}{\tilde c} \quad \text{a.s.},$$
which leads, by continuity of $\psi$ (due to its convexity) and monotonicity, to
$$\psi\left(\frac{X}{c_n}\right) \uparrow \psi\left(\frac{X}{\tilde c}\right)$$
and to
$$E\psi\left(\frac{X}{c_n}\right) \uparrow E\psi\left(\frac{X}{\tilde c}\right)$$
by the monotone convergence theorem. Finally, this shows exactly that
$$E\psi\left(\frac{X}{\tilde c}\right) \le 1,$$
and hence $\tilde c$ indeed belongs to the set. □

Exercise 3.4 Let $\psi$ be a Young modulus.
1. Show that its conjugate function $\psi^*$ defined by
$$\psi^*(y) = \sup_{x > 0}\{xy - \psi(x)\}, \quad y \ge 0,$$
is a Young modulus.
2. Moreover, show that $\|X\|_1 \le \sqrt{2}\,\max\{\|X\|_\psi, \|X\|_{\psi^*}\}$.

Solution.
1. We check that $\psi^*$ satisfies the properties of a Young modulus.
i. $\psi^*(0) = 0$.
ii. $\psi^*$ is an increasing function. Given any two real numbers $y_1$ and $y_2$ such that $0 \le y_1 \le y_2$, for each $x > 0$,
$$xy_1 - \psi(x) \le xy_2 - \psi(x),$$
which implies
$$\sup_{x>0}\{xy_1 - \psi(x)\} \le \sup_{x>0}\{xy_2 - \psi(x)\}.$$
iii. $\psi^*$ is a convex function.
By the definition of $\psi^*$, for $\alpha \in [0, 1]$,
$$\psi^*((1-\alpha)y_1 + \alpha y_2) = \sup_{x>0}\{(1-\alpha)xy_1 + \alpha xy_2 - \psi(x)\}.$$
Now, for any fixed $x > 0$,
$$(1-\alpha)xy_1 + \alpha xy_2 - \psi(x) = (1-\alpha)[xy_1 - \psi(x)] + \alpha[xy_2 - \psi(x)] \le (1-\alpha)\psi^*(y_1) + \alpha\psi^*(y_2).$$
Taking the supremum over $x > 0$ gives the convexity:
$$\sup_{x>0}\{x((1-\alpha)y_1 + \alpha y_2) - \psi(x)\} \le (1-\alpha)\psi^*(y_1) + \alpha\psi^*(y_2).$$

2. To prove the inequality, consider two sets $\mathcal{C}$ and $\mathcal{D}$ defined as follows:
$$\mathcal{C} = [\|X\|_\psi, \infty), \qquad \mathcal{D} = [\|X\|_{\psi^*}, \infty).$$
The two sets satisfy the following properties:
$$C \in \mathcal{C} \iff E\psi\left(\frac{X}{C}\right) \le 1, \qquad D \in \mathcal{D} \iff E\psi^*\left(\frac{Y}{D}\right) \le 1,$$
for any $Y$ with the same distribution as $X$. Now assume without loss of generality that $X \ge 0$ and let $Y$ be an independent copy of $X$. Then for any $C \in \mathcal{C}$ and $D \in \mathcal{D}$, the definition of $\psi^*$ gives
$$\frac{X}{C}\cdot\frac{Y}{D} \le \psi\left(\frac{X}{C}\right) + \psi^*\left(\frac{Y}{D}\right) \quad \Rightarrow \quad E\left(\frac{XY}{CD}\right) \le E\psi\left(\frac{X}{C}\right) + E\psi^*\left(\frac{Y}{D}\right) \le 2.$$
Since $E(XY) = (EX)^2$ by independence, this implies
$$\frac{(EX)^2}{CD} \le 2, \quad \text{or} \quad \|X\|_1^2 \le 2CD.$$
Thus, choosing $C = \|X\|_\psi$ and $D = \|X\|_{\psi^*}$ we obtain
$$(EX)^2 = \|X\|_1^2 \le 2\|X\|_\psi\|X\|_{\psi^*} \le 2\max\{\|X\|_\psi, \|X\|_{\psi^*}\}^2,$$
so that
$$\|X\|_1 \le \sqrt{2}\,\max\{\|X\|_\psi, \|X\|_{\psi^*}\}.$$ □

Exercise 3.5 Suppose that $Z$ is a standard Normal random variable. Show that for all $z \ge 0$,
$$P(|Z| > z) \le \exp\left(-\frac{z^2}{2}\right).$$

Solution. If $Z$ is a standard Normal random variable then $Z^2$ is a Chi-square random variable with one degree of freedom. Notice that
$$\{|Z| > z\} \iff \{Z^2 > z^2\},$$
so that $P(|Z| > z) = P(Z^2 > z^2)$. Then, in order to prove the required inequality, it suffices to show that, for any $\lambda > 0$,
$$P(\chi_1^2 > \lambda) \le \exp(-\lambda/2).$$
But a $\chi_2^2$ variable is the sum of a $\chi_1^2$ variable and an independent non-negative variable, so
$$P(\chi_1^2 > \lambda) \le P(\chi_2^2 > \lambda) = P(\mathrm{Exp}(1/2) > \lambda) = e^{-\lambda/2},$$
where $\mathrm{Exp}(1/2)$ refers to a random variable with Exponential distribution with parameter $1/2$. □

Exercise 3.6 Suppose that
$$\Pr(|X(s) - X(t)| > x) \le K\exp\left(-Cx^2/d^2(s,t)\right)$$
for a given stochastic process $X$ and certain positive constants $K$ and $C$. Then the process $X$ is sub-Gaussian for a multiple of the distance $d$.

Solution. We look for a function $D = f(K, C)$ such that
$$\min\left\{1, Ke^{-Cx^2}\right\} \le 2e^{-Dx^2} \quad \text{for every } x \ge 0.$$
If such a function exists the conclusion follows immediately, since we would have
$$\Pr(|X(s) - X(t)| > x) \le 2\exp\left(-\frac{Dx^2}{d^2(s,t)}\right) = 2\exp\left(-\frac{x^2}{2d^{*2}(s,t)}\right)$$
with $d^*$ a multiple of $d$:
$$2d^{*2}(s,t) = d^2(s,t)/D \ \Rightarrow\ d^{*2}(s,t) = \frac{d^2(s,t)}{2D} \ \Rightarrow\ d^*(s,t) = \left(\frac{1}{2D}\right)^{1/2} d(s,t).$$
Now we prove the crucial claim:
$$\exists\, D = f(K, C) : \ \min\left\{1, Ke^{-Cx^2}\right\} \le 2e^{-Dx^2} \quad \text{for every } x \ge 0.$$
Let us restrict for the moment to the case $K > 1$. We have
$$Ke^{-Cx^2} \le 1 \iff e^{-Cx^2} \le 1/K \iff Cx^2 \ge \log K \iff x \ge \left(\frac{\log K}{C}\right)^{1/2}.$$
Thus
$$\min\left\{1, Ke^{-Cx^2}\right\} = \begin{cases} Ke^{-Cx^2} & \text{if } x \ge \left(\frac{\log K}{C}\right)^{1/2}, \\ 1 & \text{if } x < \left(\frac{\log K}{C}\right)^{1/2}. \end{cases}$$
In other words, in order to show the inequality we need to ensure the existence of a $D$ satisfying the two following conditions:
i. $2e^{-Dx^2} \ge 1$ for $x < \left(\frac{\log K}{C}\right)^{1/2}$;
ii. $2e^{-Dx^2} \ge Ke^{-Cx^2}$ for $x \ge \left(\frac{\log K}{C}\right)^{1/2}$.
Condition i holds as soon as $D\log K/C \le \log 2$, and condition ii is equivalent to $(C - D)x^2 \ge \log(K/2)$ on the stated range. Both conditions are therefore jointly satisfied when
$$D \le C\,\min\left\{1, \frac{\log 2}{\log K}\right\},$$
so that we can choose $D = C\min\{1, \log 2/\log K\}$; in particular $D = C$ when $1 < K \le 2$. It remains to treat the case $K \le 1$, which is a trivial verification. If $K \le 1$ then
$$\min\left\{1, Ke^{-Cx^2}\right\} = Ke^{-Cx^2}.$$
We need to find $D$ such that for any $x$
$$Ke^{-Cx^2} \le 2e^{-Dx^2}, \quad \text{that is,} \quad \frac{K}{2} \le e^{(C-D)x^2}.$$
Since $K/2 \le 1/2$, it is sufficient to choose $D = C$. □

Chapter 4 Inequalities for sums of independent processes

In this chapter several inequalities for sums of independent stochastic processes are presented, in particular
1. symmetrization inequalities,
2. the Ottaviani inequality,
3. Lévy's inequalities,
4. Hoffmann-Jørgensen inequalities.

4.1 Symmetrization inequalities

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with probability distribution $P$ on the measurable space $(\mathcal{X}, \mathcal{A})$. For some class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$, consider the process
$$(\mathbb{P}_n - P)f = \frac{1}{n}\sum_{i=1}^n (f(X_i) - Pf), \quad f \in \mathcal{F}.$$
Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher random variables, independent of $(X_1, \ldots, X_n)$.
It will be more useful to consider the symmetrized process
$$\mathbb{P}_n^0 f = \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i), \quad f \in \mathcal{F},$$
or
$$\mathbb{P}_n^\dagger f = \frac{1}{n}\sum_{i=1}^n \varepsilon_i (f(X_i) - Pf) = \mathbb{P}_n^0 f - \bar\varepsilon_n Pf, \quad f \in \mathcal{F},$$
where $\bar\varepsilon_n = n^{-1}\sum_{i=1}^n \varepsilon_i$.

It will be convenient in the following to generalize the treatment beyond the empirical process setting. Consider sums of independent stochastic processes $\{Z_i(f) : f \in \mathcal{F}\}$. The processes $Z_i$ need not possess any measurability beyond the measurability of all marginals $Z_i(f)$, but for computing outer expectations it will be understood that the underlying probability space is the product space $\prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i) \times (\mathcal{Z}, \mathcal{C}, Q)$ and each $Z_i$ is a function of the $i$-th coordinate of $(x, z) = (x_1, \ldots, x_n, z)$ only. The additional Rademacher or other random variables are understood to be functions of the $(n+1)$-st coordinate $z$ only. The empirical process corresponds to taking $Z_i(f) = f(X_i) - Pf$.

Lemma 4.1 Let $Z_1, \ldots, Z_n$ be independent stochastic processes with mean zero. Then for any nondecreasing, convex $\Phi : \mathbb{R} \to \mathbb{R}$ and arbitrary functions $\mu_i : \mathcal{F} \to \mathbb{R}$,
$$E^*\Phi\left(\frac12\left\|\sum_{i=1}^n \varepsilon_i Z_i\right\|_{\mathcal{F}}\right) \le E^*\Phi\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}}\right) \le E^*\Phi\left(2\left\|\sum_{i=1}^n \varepsilon_i (Z_i - \mu_i)\right\|_{\mathcal{F}}\right).$$

Proof. Let $Y_1, \ldots, Y_n$ be an independent copy of $Z_1, \ldots, Z_n$, defined formally as the coordinate projections on the last $n$ coordinates in the product space $\prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i) \times (\mathcal{Z}, \mathcal{C}, Q) \times \prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i)$. Since $EY_i(f) = 0$, the left side of the lemma is an average of expressions of the type
$$E_Z^*\Phi\left(\left\|\frac12\sum_{i=1}^n e_i (Z_i(f) - EY_i(f))\right\|_{\mathcal{F}}\right),$$
where $(e_1, \ldots, e_n) \in \{-1, 1\}^n$. By convexity of $\Phi$ and of the norm $\|\cdot\|_{\mathcal{F}}$, it follows from Jensen's inequality that this expression is bounded above by
$$E_{Z,Y}^*\Phi\left(\left\|\frac12\sum_{i=1}^n e_i (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}\right) = E_{Z,Y}^*\Phi\left(\left\|\frac12\sum_{i=1}^n (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}\right).$$
Finally, apply the triangle inequality and convexity of $\Phi$.
To prove the inequality on the right, note that for fixed values of the $Z_i$'s we have
$$\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n (Z_i(f) - EY_i(f))\right| \le E_Y^*\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n (Z_i(f) - Y_i(f))\right|,$$
where $E_Y^*$ is the outer expectation with respect to $Y_1, \ldots, Y_n$ computed for $P^n$ for given fixed values of $Z_1, \ldots, Z_n$. This in combination with Jensen's inequality yields
$$\Phi\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}}\right) \le E_Y\Phi\left(\left\|\sum_{i=1}^n (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}^{*Y}\right),$$
where $*Y$ denotes the minimal measurable majorant of the supremum with respect to $Y_1, \ldots, Y_n$, still with $Z_1, \ldots, Z_n$ fixed. Because $\Phi$ is continuous and nondecreasing, the $*Y$ inside $\Phi$ can be moved to $E_Y^*$. Now take the expectation with respect to $Z_1, \ldots, Z_n$ to get
$$E^*\Phi\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}}\right) \le E_Z^* E_Y^*\Phi\left(\left\|\sum_{i=1}^n (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}\right).$$
Here the repeated outer expectation can be bounded above by the joint outer expectation $E^*$ by Lemma 1.2.6 of van der Vaart and Wellner (1996). Note that adding a minus sign in front of a term $[Z_i(f) - Y_i(f)]$ has the effect of exchanging $Z_i$ and $Y_i$. By construction of the underlying probability space as a product space, the outer expectation of any function $f(Z_1, \ldots, Z_n, Y_1, \ldots, Y_n)$ remains unchanged under permutations of its $2n$ arguments. The resulting expression
$$E^*\Phi\left(\left\|\sum_{i=1}^n e_i (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}\right)$$
is the same for any $n$-tuple $(e_1, \ldots, e_n) \in \{-1, 1\}^n$. Thus
$$E^*\Phi\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}}\right) \le E_\varepsilon E_{Z,Y}^*\Phi\left(\left\|\sum_{i=1}^n \varepsilon_i (Z_i(f) - Y_i(f))\right\|_{\mathcal{F}}\right).$$
Now add and subtract $\mu_i$ inside the right side and use the triangle inequality and convexity of $\Phi$ to show that the right side of the preceding display is bounded above by
$$\frac12 E_\varepsilon E_{Z,Y}^*\Phi\left(2\left\|\sum_{i=1}^n \varepsilon_i (Z_i(f) - \mu_i(f))\right\|_{\mathcal{F}}\right) + \frac12 E_\varepsilon E_{Z,Y}^*\Phi\left(2\left\|\sum_{i=1}^n \varepsilon_i (Y_i(f) - \mu_i(f))\right\|_{\mathcal{F}}\right).$$
Perfectness of coordinate projections implies that the expectation $E_{Z,Y}^*$ is the same as $E_Z^*$ and $E_Y^*$ in the two terms, respectively. Finally, replace the repeated outer expectations by a joint outer expectation and note that the two resulting terms are equal. □

Corollary 4.1 For every nondecreasing, convex $\Phi : \mathbb{R} \to \mathbb{R}$ and class of measurable functions $\mathcal{F}$,
$$E^*\Phi\left(\frac12\left\|\mathbb{P}_n^\dagger\right\|_{\mathcal{F}}\right) \le E^*\Phi\left(\|\mathbb{P}_n - P\|_{\mathcal{F}}\right) \le E^*\Phi\left(2\left\|\mathbb{P}_n^0\right\|_{\mathcal{F}}\right) \wedge E^*\Phi\left(2\left\|\mathbb{P}_n^\dagger\right\|_{\mathcal{F}}\right).$$

We will frequently use these symmetrization inequalities with the choice $\Phi(x) = x$. Although the hypothesis that $\Phi$ is a convex function rules out the choice $\Phi(x) = 1\{x > a\}$, there is a corresponding symmetrization inequality for probabilities which is also useful.

Lemma 4.2 For arbitrary stochastic processes $Z_1, \ldots, Z_n$ and arbitrary functions $\mu_1, \ldots, \mu_n : \mathcal{F} \to \mathbb{R}$,
$$\beta_n(x)\, P^*\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}} > x\right) \le 2P^*\left(4\left\|\sum_{i=1}^n \varepsilon_i (Z_i - \mu_i)\right\|_{\mathcal{F}} > x\right)$$
for every $x > 0$ and $\beta_n(x) \le \inf_{f\in\mathcal{F}} P\left(\left|\sum_{i=1}^n Z_i(f)\right| < x/2\right)$. In particular this is true for i.i.d. mean-zero processes, and $\beta_n(x) = 1 - (4n/x^2)\sup_f \mathrm{Var}[Z_1(f)]$.

Proof. Let $Y_1, \ldots, Y_n$ be an independent copy of $Z_1, \ldots, Z_n$, suitably defined on a product space as previously. If $\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}} > x$, then there is certainly some $f$ for which $\left|\sum_{i=1}^n Z_i(f)\right| > x$. Fix a realization $Z_1, \ldots, Z_n$ and $f$ for which both are the case. For this fixed realization
$$\beta \le P_Y^*\left(\left|\sum_{i=1}^n Y_i(f)\right| < \frac{x}{2}\right) \le P_Y^*\left(\left|\sum_{i=1}^n Y_i(f) - \sum_{i=1}^n Z_i(f)\right| > \frac{x}{2}\right) \le P_Y^*\left(\left\|\sum_{i=1}^n (Y_i - Z_i)\right\|_{\mathcal{F}} > \frac{x}{2}\right).$$
The far left and far right sides do not depend on the particular $f$. Integrate the two sides out with respect to $Z_1, \ldots, Z_n$ over this set to obtain
$$\beta\, P^*\left(\left\|\sum_{i=1}^n Z_i\right\|_{\mathcal{F}} > x\right) \le P_Z^* P_Y^*\left(\left\|\sum_{i=1}^n (Y_i - Z_i)\right\|_{\mathcal{F}} > \frac{x}{2}\right).$$
By symmetry, the right side equals
$$E_\varepsilon P_Z^* P_Y^*\left(\left\|\sum_{i=1}^n \varepsilon_i (Y_i - Z_i)\right\|_{\mathcal{F}} > \frac{x}{2}\right).$$
In view of the triangle inequality, this expression is not bigger than
$$2P^*\left(\left\|\sum_{i=1}^n \varepsilon_i (Y_i - \mu_i)\right\|_{\mathcal{F}} > \frac{x}{4}\right).$$
Processes $Z_1, \ldots, Z_n$ with mean zero satisfy the condition for the given $\beta$ in view of Chebyshev's inequality. □

Lemma 4.3 (Second symmetrization lemma for probabilities) Suppose that $\{Z(f) : f \in \mathcal{F}\}$ and $\{Y(f) : f \in \mathcal{F}\}$ are independent stochastic processes indexed by $\mathcal{F}$. Suppose that $x > \varepsilon > 0$. Then
$$\beta_n(\varepsilon)\, P^*\left(\sup_{f\in\mathcal{F}} |Z(f)| > x\right) \le P^*\left(\sup_{f\in\mathcal{F}} |Z(f) - Y(f)| > x - \varepsilon\right),$$
where $\beta_n(\varepsilon) \le \inf_{f\in\mathcal{F}} P(|Y(f)| \le \varepsilon)$.

Proof. We suppose that $Z$ and $Y$ are defined on a product space $(\Omega \times \Omega', \mathcal{B} \times \mathcal{B}')$. If $\|Z\|_{\mathcal{F}} > x$, then there is some $f \in \mathcal{F}$ for which $|Z(f)| > x$. Fix an outcome $\omega \in \Omega$ and $f \in \mathcal{F}$ so that $|Z(f, \omega)| > x$. Then we have
$$\beta_n(\varepsilon) \le P_Y^*(|Y(f)| \le \varepsilon) \le P_Y^*(|Z(f, \omega) - Y(f)| > x - \varepsilon) \le P_Y^*\left(\|Z(\cdot, \omega) - Y\|_{\mathcal{F}} > x - \varepsilon\right).$$
The far left and far right sides do not depend on the particular $f$, and the inequality holds on the set $\{\|Z\|_{\mathcal{F}} > x\}$. Integration of the two sides with respect to $Z$ over this set yields the stated conclusion. □

4.2 The Ottaviani Inequality

Throughout this section $S_n$ equals the partial sum $X_1 + \cdots + X_n$ of independent stochastic processes $X_1, \ldots, X_n$. The process $X_i$ is called symmetric if $X_i$ and $-X_i$ have the same distribution. Independence of the stochastic processes $X_1, \ldots, X_n$ is understood in the sense that each of the processes is defined on a product probability space
$$\prod_{i=1}^\infty (\Omega_i, \mathcal{A}_i, P_i) \qquad (4.1)$$
with $X_i$ depending on the $i$-th coordinate of $(\omega_1, \omega_2, \ldots)$ only.

Proposition 4.1 (Ottaviani inequality) Let X1 , . . .
, Xn be independent stochastic processes indexed by an arbitrary set. Then for $\lambda, \eta > 0$,
$$P^*\left(\max_{k\le n}\|S_k\| > \lambda + \eta\right) \le \frac{P^*(\|S_n\| > \lambda)}{1 - \max_{k\le n} P^*(\|S_n - S_k\| > \eta)}.$$

Proof. Let $A_k$ be the event that $\|S_k\|^*$ is the first $\|S_j\|^*$ that is strictly greater than $\lambda + \eta$:
$$A_k = \{\|S_1\|^* \le \lambda + \eta, \ldots, \|S_{k-1}\|^* \le \lambda + \eta, \|S_k\|^* > \lambda + \eta\}.$$
The event on the left is the disjoint union of $A_1, \ldots, A_n$. Since $\|S_n - S_k\|^*$ is independent of $\|S_1\|^*, \ldots, \|S_k\|^*$,
$$P(A_k)\min_{j\le n} P(\|S_n - S_j\|^* \le \eta) \le P(A_k, \|S_n - S_k\|^* \le \eta) \le P(A_k, \|S_n\|^* > \lambda),$$
since $\|S_k\|^* > \lambda + \eta$ on $A_k$. Summing up over $k$ yields the result. □

4.3 Lévy's Inequalities

Proposition 4.2 (Lévy's inequalities) Let $X_1, \ldots, X_n$ be independent, symmetric stochastic processes indexed by an arbitrary set. Then for every $\lambda > 0$,
$$P^*\left(\max_{k\le n}\|S_k\| > \lambda\right) \le 2P^*(\|S_n\| > \lambda),$$
$$P^*\left(\max_{k\le n}\|X_k\| > \lambda\right) \le 2P^*(\|S_n\| > \lambda).$$

Proof. Let $A_k$ be the event that $\|S_k\|^*$ is the first $\|S_j\|^*$ that is strictly greater than $\lambda$:
$$A_k = \{\|S_1\|^* \le \lambda, \ldots, \|S_{k-1}\|^* \le \lambda, \|S_k\|^* > \lambda\}.$$
The event on the left in the first inequality is the disjoint union of $A_1, \ldots, A_n$. Write $T_n$ for the sum of the sequence $X_1, \ldots, X_k, -X_{k+1}, \ldots, -X_n$. By the triangle inequality, $2\|S_k\|^* \le \|S_n\|^* + \|T_n\|^*$. It follows that
$$P(A_k) \le P(A_k, \|S_n\|^* > \lambda) + P(A_k, \|T_n\|^* > \lambda) = 2P(A_k, \|S_n\|^* > \lambda),$$
since $X_1, \ldots, X_n$ are symmetric. Summing up over $k$ yields the first inequality.
To prove the second inequality, let $A_k$ be the event that $\|X_k\|^*$ is the first $\|X_j\|^*$ that is strictly greater than $\lambda$. Write $T_n$ for the sum of the variables $-X_1, \ldots, -X_{k-1}, X_k, -X_{k+1}, \ldots, -X_n$. By the triangle inequality, $2\|X_k\|^* \le \|S_n\|^* + \|T_n\|^*$. The rest of the proof goes exactly as before. □

4.4 Hoffmann-Jørgensen Inequalities

Proposition 4.3 (Hoffmann-Jørgensen inequalities) Let $X_1, \ldots, X_n$ be independent stochastic processes indexed by an arbitrary set. Then for any $\lambda, \eta > 0$,
$$P^*\left(\max_{k\le n}\|S_k\| > 3\lambda + \eta\right) \le P^*\left(\max_{k\le n}\|S_k\| > \lambda\right)^2 + P^*\left(\max_{k\le n}\|X_k\| > \eta\right).$$
If X1 , . . .
, Xn are independent and symmetric, then also
$$P^*(\|S_n\| > 2\lambda + \eta) \le 4P^*(\|S_n\| > \lambda)^2 + P^*\left(\max_{k\le n}\|X_k\| > \eta\right).$$

Proof. Let $A_k = \{\|S_1\|^* \le \lambda, \ldots, \|S_{k-1}\|^* \le \lambda, \|S_k\|^* > \lambda\}$. Then the $A_k$'s are disjoint and $\bigcup_{k=1}^n A_k = \{\max_{k\le n}\|S_k\|^* > \lambda\}$. By the triangle inequality
$$\|S_j\|^* \le \|S_{k-1}\|^* + \|X_k\|^* + \|S_j - S_k\|^*, \quad \forall j \ge k.$$
By construction of $A_k$, conclude that on $A_k$
$$\max_{j\ge k}\|S_j\|^* \le \lambda + \max_{k\le n}\|X_k\|^* + \max_{j>k}\|S_j - S_k\|^*.$$
Since the processes $X_j$ are independent, we obtain for every $k$
$$P\left(A_k, \max_{j\le n}\|S_j\|^* > 3\lambda + \eta\right) \le P\left(A_k, \max_{j\le n}\|X_j\|^* > \eta\right) + P(A_k)\, P\left(\max_{m>k}\|S_m - S_k\|^* > 2\lambda\right)$$
$$\le P\left(A_k, \max_{j\le n}\|X_j\|^* > \eta\right) + P(A_k)\, P\left(\max_{j\le n}\|S_j\|^* > \lambda\right),$$
since in the probability on the far right the variable $\max_{m>k}\|S_m - S_k\|^*$ can be bounded by $2\max_{j\le n}\|S_j\|^*$. Next sum over $k$ to obtain the first inequality of the proposition.
To prove the second inequality, first use the same method as above to show that
$$P(A_k, \|S_n\|^* > 2\lambda + \eta) \le P\left(A_k, \max_{j\le n}\|X_j\|^* > \eta\right) + P(A_k)\, P(\|S_n - S_k\|^* > \lambda),$$
and note that $\|S_n - S_k\|^* \le \max_{j\le n}\|S_n - S_j\|^*$. Then summation over $k$ yields
$$P(\|S_n\|^* > 2\lambda + \eta) \le P\left(\max_{j\le n}\|X_j\|^* > \eta\right) + P\left(\max_{j\le n}\|S_j\|^* > \lambda\right) P\left(\max_{j\le n}\|S_n - S_j\|^* > \lambda\right).$$
The processes $S_k$ and $S_n - S_k$ are the partial sums of the symmetric processes $X_1, \ldots, X_n$ and $X_n, \ldots, X_2$ respectively. Application of Lévy's inequality to both probabilities on the far right concludes the proof. □

Proposition 4.4 (Hoffmann-Jørgensen's inequality for moments) Let $0 < p < \infty$ and suppose that $X_1, \ldots, X_n$ are independent stochastic processes indexed by an arbitrary index set $T$. Then there exist constants $C_p$ and $0 < u_p < 1$ such that
$$E^*\max_{k\le n}\|S_k\|^p \le C_p\left(E^*\max_{k\le n}\|X_k\|^p + F^{-1}(u_p)^p\right),$$
where $F^{-1}$ is the quantile function of the random variable $\max_{k\le n}\|S_k\|^*$. Furthermore, if $X_1, \ldots, X_n$ are symmetric, then there exist constants $K_p$ and $0 < \upsilon_p < 1$ such that
$$E^*\|S_n\|^p \le K_p\left(E^*\max_{k\le n}\|X_k\|^p + G^{-1}(\upsilon_p)^p\right),$$
where $G^{-1}$ is the quantile function of the random variable $\|S_n\|^*$.
For $p \ge 1$, the last inequality is also valid for mean-zero processes (with different constants).

Proof. Take $\lambda = \eta = t$ in the first inequality of the preceding proposition to find that, for any $x > 0$,
$$E^*\max_{k\le n}\|S_k\|^p = 4^p\int_0^\infty P\left(\max_{k\le n}\|S_k\|^* > 4t\right) d(t^p)$$
$$\le (4x)^p + 4^p\int_x^\infty P\left(\max_{k\le n}\|S_k\|^* > t\right)^2 d(t^p) + 4^p\int_x^\infty P\left(\max_{k\le n}\|X_k\|^* > t\right) d(t^p)$$
$$\le (4x)^p + 4^p P\left(\max_{k\le n}\|S_k\|^* > x\right) E^*\max_{k\le n}\|S_k\|^p + 4^p E^*\max_{k\le n}\|X_k\|^p.$$
Now choose $x$ such that $4^p P(\max_{k\le n}\|S_k\|^* > x)$ is bounded by $1/2$. By rearranging terms the first inequality follows. The second inequality can be proved in a similar way, this time using the second inequality of the preceding proposition. The inequality for mean-zero processes follows from the inequality for symmetric processes by symmetrization and desymmetrization: it follows from Jensen's inequality that $E^*\|S_n\|^p$ is bounded by $E^*\|S_n - T_n\|^p$, where $T_n$ is the sum of $n$ independent copies of $X_1, \ldots, X_n$. □

Chapter 5 Glivenko-Cantelli Theorems

5.1 Glivenko-Cantelli classes $\mathcal{F}$

In this chapter we will prove two types of Glivenko-Cantelli theorems via symmetrization and the maximal inequalities developed in Chapter 3. To begin, we need to first define entropy with bracketing. Let $(\mathcal{F}, \|\cdot\|)$ be a subset of a normed space of real functions $f : \mathcal{X} \to \mathbb{R}$; usually we will take $\|\cdot\|$ to be the supremum norm or the $L_r(Q)$ norm for some $r \ge 1$ and a probability measure $Q$ on the measurable space $(\mathcal{X}, \mathcal{A})$. Given two functions $l$ and $u$ on $\mathcal{X}$, the bracket $[l, u]$ is the set of all functions $f \in \mathcal{F}$ with $l \le f \le u$. The functions $l$ and $u$ need not belong to $\mathcal{F}$, but are assumed to have finite norms. An $\varepsilon$-bracket is a bracket $[l, u]$ with $\|u - l\| \le \varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{F}$. The entropy with bracketing is the logarithm of the bracketing number.

Theorem 5.1 Let $\mathcal{F}$ be a class of measurable functions such that $N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\varepsilon > 0$.
Then $\mathcal{F}$ is $P$-Glivenko-Cantelli, that is,
$$\|\mathbb{P}_n - P\|_{\mathcal{F}}^* = \left(\sup_{f\in\mathcal{F}}|\mathbb{P}_n f - Pf|\right)^* \to_{a.s.} 0.$$

Proof. Fix $\varepsilon > 0$. Choose finitely many $\varepsilon$-brackets $[l_i, u_i]$, $i = 1, \ldots, m$, with $m = N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P))$, whose union contains $\mathcal{F}$ and such that $P(u_i - l_i) < \varepsilon$ for all $1 \le i \le m$. Thus, for every $f \in \mathcal{F}$ there is a bracket $[l_i, u_i]$ such that
$$(\mathbb{P}_n - P)f \le (\mathbb{P}_n - P)u_i + P(u_i - f) \le (\mathbb{P}_n - P)u_i + \varepsilon.$$
Similarly
$$(P - \mathbb{P}_n)f \le (P - \mathbb{P}_n)l_i + P(f - l_i) \le (P - \mathbb{P}_n)l_i + \varepsilon.$$
It follows that
$$\sup_{f\in\mathcal{F}}|(\mathbb{P}_n - P)f| \le \max_{1\le i\le m}(\mathbb{P}_n - P)u_i \vee \max_{1\le i\le m}(P - \mathbb{P}_n)l_i + \varepsilon,$$
where the right side converges almost surely to $\varepsilon$ by the strong law of large numbers for real random variables (applied $2m$ times). Thus $\limsup_n \|\mathbb{P}_n - P\|_{\mathcal{F}}^* \le \varepsilon$ almost surely for every $\varepsilon > 0$. □

We define an envelope function for a class of real functions $\mathcal{F}$ on a measurable space $(\mathcal{X}, \mathcal{A})$ to be any function $F$ on $\mathcal{X}$ such that $|f(x)| \le F(x)$ for all $x \in \mathcal{X}$ and all $f \in \mathcal{F}$. The minimal envelope function is $x \mapsto \sup_{f\in\mathcal{F}}|f(x)|$. Any class $\mathcal{F}$ satisfying the bracketing hypothesis of the theorem just proved automatically has a measurable envelope function (for instance $\max_{1\le i\le m}(|l_i| \vee |u_i|)$).

One of the simplest settings to which this theorem applies involves a collection of functions $f = f(\cdot, t)$ indexed or parametrized by $t \in T$, a compact subset of a metric space $(D, d)$. Here is the basic lemma; it goes back to Wald (1949) and Le Cam (1953).

Lemma 5.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where the functions $f : \mathcal{X} \times T \to \mathbb{R}$ are continuous in $t$ for $P$-almost all $x \in \mathcal{X}$. Suppose that $T$ is compact and that the envelope function $F$ defined by $F(x) = \sup_{t\in T}|f(x, t)|$ satisfies $P^*F < \infty$. Then
$$N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$$
for every $\varepsilon > 0$, and hence $\mathcal{F}$ is $P$-Glivenko-Cantelli.

Proof. Define, for $x \in \mathcal{X}$, $t \in T$, and $\rho > 0$,
$$\psi(x; t, \rho) := \sup_{s\in T,\ d(s,t)<\rho}|f(x, s) - f(x, t)|.$$
Since $f$ is continuous in $t$, for any countable set $D$ dense in $\{s \in T : d(s, t) < \rho\}$,
$$\psi(x; t, \rho) = \sup_{s\in D,\ d(s,t)<\rho}|f(x, s) - f(x, t)|,$$
and hence $\psi(\cdot; t, \rho)$ is a measurable function for each $t \in T$ and $\rho > 0$. Note that $\psi(x; t, \rho) \to 0$ as $\rho \to 0$ for $P$-almost every $x$ and $\psi(x; t, \rho) \le 2F^*(x)$ with $PF^* < \infty$, so the dominated convergence theorem yields
$$P\psi(X; t, \rho) = \int \psi(x; t, \rho)\, dP(x) \to 0$$
as $\rho \to 0$. Fix $\delta > 0$. For each $t \in T$ choose $\rho_t$ so small that $P\psi(X; t, \rho_t) \le \delta$. This yields an open cover of $T$: the balls $B_t := \{s \in T : d(s, t) < \rho_t\}$ work. By compactness of $T$ there is a finite sub-cover $B_{t_1}, \ldots, B_{t_k}$ of $T$. In terms of this finite sub-cover, define brackets for $\mathcal{F}$ by
$$l_j(x) = f(x, t_j) - \psi(x; t_j, \rho_{t_j}), \qquad u_j(x) = f(x, t_j) + \psi(x; t_j, \rho_{t_j}), \qquad j = 1, \ldots, k.$$
Then $P(u_j - l_j) = 2P\psi(X; t_j, \rho_{t_j}) \le 2\delta$, and for $t \in B_{t_j}$ we have $l_j(x) \le f(x, t) \le u_j(x)$. Hence $N_{[\,]}(2\delta, \mathcal{F}, L_1(P)) \le k$; since $\delta > 0$ is arbitrary, the claim follows. □

The next lemma further quantifies the finiteness given by Lemma 5.1 by imposing a Lipschitz type condition rather than just continuity.

Lemma 5.2 Suppose that $\{f(\cdot, t) : t \in T\}$ is a class of functions satisfying
$$|f(x, t) - f(x, s)| \le d(s, t)F(x) \quad \forall s, t \in T,\ x \in \mathcal{X},$$
for some metric $d$ on the index set and a function $F$ on the sample space $\mathcal{X}$. Then, for any norm $\|\cdot\|$,
$$N_{[\,]}(2\varepsilon\|F\|, \mathcal{F}, \|\cdot\|) \le N(\varepsilon, T, d).$$

Proof. Let $t_1, \ldots, t_k$ be an $\varepsilon$-net for $T$ with respect to $d$. This can be done with $k = N(\varepsilon, T, d)$ points. Then the brackets $[f(\cdot, t_j) - \varepsilon F, f(\cdot, t_j) + \varepsilon F]$ cover $\mathcal{F}$, and are of size at most $2\varepsilon\|F\|$. □

Lemma 5.3 Suppose that for every $\theta$ in a compact subset $U$ of $\mathbb{R}^d$ the class $\mathcal{F}_\theta = \{f_{\theta,\gamma} : \gamma \in \Gamma\}$ satisfies
$$\log N_{[\,]}(\varepsilon, \mathcal{F}_\theta, L_2(P)) \le K\left(\frac{1}{\varepsilon}\right)^W$$
for a constant $W < 2$ and $K$ not depending on $\theta$. Suppose in addition that for every $\theta_1$, $\theta_2$, and $\gamma \in \Gamma$,
$$|f_{\theta_1,\gamma} - f_{\theta_2,\gamma}| \le F|\theta_1 - \theta_2|$$
for a function $F$ with $PF^2 < \infty$. Then $\mathcal{F} = \cup_{\theta\in U}\mathcal{F}_\theta$ satisfies
$$\log N_{[\,]}(\varepsilon, \mathcal{F}, L_2(P)) \le d\log(1/\varepsilon) + K\left(\frac{1}{\varepsilon}\right)^W.$$

Proof.
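Fix $d = 2$. Take a square $Q$ with side $L$ such that $U \subset Q$; since $U \subset \mathbb{R}^2$ is compact, this is possible. Find the smallest integer $M$ such that $L/M < \varepsilon$ and partition $Q$ into $M^2$ subsquares. Let $S_i$ denote the $i$-th square. Now $L/(M-1) \ge \varepsilon$ implies $M^2 \le 4L^2/\varepsilon^2$. Consider $S_i \cap U$ for each $i$ and suppose that it is non-empty. Fix $\theta_i \in U \cap S_i$ and consider $\varepsilon$-brackets $[l_1, u_1], \ldots, [l_{N_i}, u_{N_i}]$, where $N_i \le \exp\left(K(1/\varepsilon)^W\right)$ is the number of $\varepsilon$-brackets needed to cover $\mathcal{F}_{\theta_i}$. Consider any other $\theta \in U \cap S_i$. Now
$$|f(\theta, \gamma) - f(\theta_i, \gamma)| \le F\|\theta - \theta_i\| \quad \forall\gamma,$$
and
$$f(\theta_i, \gamma) - F\|\theta - \theta_i\| \le f(\theta, \gamma) \le f(\theta_i, \gamma) + F\|\theta - \theta_i\|.$$
Fix $\gamma$. Then there exists an $\varepsilon$-bracket $[l_j, u_j]$ such that $l_j \le f(\theta_i, \gamma) \le u_j$ ($j \le N_i$). Thus
$$l_j - F\|\theta - \theta_i\| \le f(\theta, \gamma) \le u_j + F\|\theta - \theta_i\|.$$
Now, since $\theta$ and $\theta_i$ lie in a square of side length $L/M < \varepsilon$, $\|\theta - \theta_i\| \le \sqrt{2L^2/M^2} \le \sqrt{2}\,\varepsilon$, and hence
$$l_j - \sqrt{2}\,\varepsilon F \le f(\theta, \gamma) \le u_j + \sqrt{2}\,\varepsilon F.$$
It follows that the brackets $[l_j - \sqrt{2}\,\varepsilon F, u_j + \sqrt{2}\,\varepsilon F]$, $j = 1, \ldots, N_i$, cover the class $\bigcup_{\theta\in U\cap S_i}\mathcal{F}_\theta$. Furthermore note that
$$\left\|(u_j + \sqrt{2}\,\varepsilon F) - (l_j - \sqrt{2}\,\varepsilon F)\right\|_{L_2(P)} \le \|u_j - l_j\|_{L_2(P)} + 2\sqrt{2}\,\varepsilon\|F\|_{L_2(P)} \le \varepsilon\left(1 + 2\sqrt{2}\,\|F\|_{L_2(P)}\right).$$
Since there are at most $4L^2/\varepsilon^2$ subsquares, it follows that at most $(4L^2/\varepsilon^2)\exp\left(K(1/\varepsilon)^W\right)$ brackets of size $\varepsilon(1 + 2\sqrt{2}\,\|F\|_{L_2(P)})$ are needed to cover $\mathcal{F} = \cup_{\theta\in U}\mathcal{F}_\theta$. Conclude that, after rescaling $\varepsilon$ (which changes $K$ only by a constant factor), the number of $\varepsilon$-brackets needed is dominated by a constant times $\varepsilon^{-2}\exp\left(K(1/\varepsilon)^W\right)$. □

Theorem 5.2 (Vapnik-Chervonenkis (1981), Pollard (1981), Giné-Zinn (1984)) Let $\mathcal{F}$ be a $P$-measurable class of measurable functions that is $L_1(P)$-bounded. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli if and only if
(i) $P^*F < \infty$, and
(ii) $\displaystyle\lim_{n\to\infty}\frac{E^*\log N(\varepsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))}{n} = 0$ for all $M < \infty$ and $\varepsilon > 0$,
where $\mathcal{F}_M$ is the class $\{f\,1\{F \le M\} : f \in \mathcal{F}\}$.

Proof. By the symmetrization inequality given by Corollary 4.1, measurability of the class $\mathcal{F}$, and Fubini's theorem,
$$E^*\|\mathbb{P}_n - P\|_{\mathcal{F}} \le 2E_X E_\varepsilon\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F}} \le 2E_X E_\varepsilon\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F}_M} + 2P^*F1\{F > M\},$$
by the triangle inequality, for every $M > 0$. For sufficiently large $M$ the last term is arbitrarily small.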
To prove convergence in mean, it suffices to show that the first term converges to zero for fixed $M$. To do this, fix $X_1, \ldots, X_n$. If $\mathcal{G}$ is an $\varepsilon$-net over $\mathcal{F}_M$ in $L_2(\mathbb{P}_n)$, then it is also an $\varepsilon$-net in $L_1(\mathbb{P}_n)$ (since $L_2(\mathbb{P}_n)$ norms are larger than $L_1(\mathbb{P}_n)$ norms via Cauchy-Schwarz). Hence it follows that
$$E_\varepsilon\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F}_M} \le E_\varepsilon\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{G}} + \varepsilon. \qquad (5.1)$$
The cardinality of $\mathcal{G}$ can be chosen equal to $N(\varepsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))$. We now use the maximal inequality of Corollary 3.1 with $\psi_2(x) = \exp(x^2) - 1$ to conclude that the right side of the last display is bounded by a constant multiple of
$$\sqrt{1 + \log N(\varepsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))}\ \sup_{f\in\mathcal{G}}\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\psi_2|X} + \varepsilon,$$
where the Orlicz norms $\|\cdot\|_{\psi_2|X}$ are taken over $\varepsilon_1, \ldots, \varepsilon_n$ with $X_1, \ldots, X_n$ fixed. By Example 3.1, these $\psi_2$-norms can be bounded by $\sqrt{6/n}\,(\mathbb{P}_n f^2)^{1/2} \le \sqrt{6/n}\,M$, since $f \in \mathcal{G} \subset \mathcal{F}_M$. Hence the right side of the last display is bounded above by
$$\sqrt{1 + \log N(\varepsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))}\ \sqrt{6/n}\,M + \varepsilon \to_p \varepsilon$$
in outer probability. Since $\varepsilon > 0$ is arbitrary, this shows that the left side of (5.1) converges to zero in probability. Since it is bounded by $M$, its expectation with respect to $X_1, \ldots, X_n$ converges to zero by the dominated convergence theorem. This concludes the proof that $E^*\|\mathbb{P}_n - P\|_{\mathcal{F}} \to 0$. To see that $\|\mathbb{P}_n - P\|_{\mathcal{F}}^*$ also converges to zero almost surely, note that it is a reverse sub-martingale with respect to a suitable filtration, and hence almost sure convergence follows from the reverse sub-martingale convergence theorem. □

Before treating examples, it is useful to specialize Theorem 5.2 to the class of indicator functions of some class of sets $\mathcal{C}$. In this setting the random entropy condition can be restated in terms of a quantity which will arise naturally in Chapter 8 in the context of VC theory: for $n$ points $x_1, \ldots, x_n$ in $\mathcal{X}$ and a class $\mathcal{C}$ of subsets of $\mathcal{X}$, set
$$\Delta_n^{\mathcal{C}}(x_1, \ldots, x_n) \overset{\mathrm{def}}{=} \#\{C \cap \{x_1, \ldots, x_n\} : C \in \mathcal{C}\}.$$
Then the sufficiency part of the following theorem follows from Theorem 5.2.
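As a small illustration of the counting quantity $\Delta_n^{\mathcal{C}}$, the following sketch enumerates the traces of two simple classes of subsets of the real line on a finite sample: half-lines $(-\infty, t]$ and intervals $(s, t]$. The function names are hypothetical and chosen only for this example; the enumeration is brute force and assumes a one-dimensional sample.

```python
import math


def delta_half_lines(points):
    """Count the distinct subsets of `points` cut out by half-lines (-inf, t]."""
    xs = sorted(set(points))
    # A half-line keeps every point <= t, i.e. a (possibly empty) prefix of xs.
    traces = {frozenset(xs[:k]) for k in range(len(xs) + 1)}
    return len(traces)


def delta_intervals(points):
    """Count the distinct subsets of `points` cut out by intervals (s, t]."""
    xs = sorted(set(points))
    traces = {frozenset()}  # the empty trace, e.g. from s = t
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            # An interval keeps a contiguous run xs[i..j] of the sorted sample.
            traces.add(frozenset(xs[i:j + 1]))
    return len(traces)


def log_delta_rate(delta, n):
    """Return log(delta) / n, the quantity that must vanish as n grows."""
    return math.log(delta) / n


sample = [0.3, 0.1, 0.7, 0.5]
n = len(sample)
print(delta_half_lines(sample), delta_intervals(sample))
print(log_delta_rate(delta_half_lines(sample), n))
```

For $n$ distinct points the two counts are $n + 1$ and $n(n+1)/2 + 1$, both polynomial in $n$, so $n^{-1}\log\Delta_n^{\mathcal{C}}$ tends to zero for these classes.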
Theorem 5.3 (Vapnik-Chervonenkis-Steele GC theorem) If $\mathcal{C}$ is a $P$-measurable class of sets, then the following are equivalent:
(i) $\|\mathbb{P}_n - P\|_{\mathcal{C}}^* \to_{a.s.} 0$,
(ii) $n^{-1}E\log\Delta_n^{\mathcal{C}}(X_1, \ldots, X_n) \to 0$.

Proof. We first show that (ii) implies (i). Since $\mathcal{F} = \{1_C : C \in \mathcal{C}\}$ has constant envelope function 1, the first condition of Theorem 5.2 holds trivially and we need only show that (ii) implies the random entropy condition in this case. To see this, note that for any $r > 0$
$$N(\varepsilon, \mathcal{F}, L_r(\mathbb{P}_n)) \le N\left(\varepsilon^{r^{-1}\vee 1}, \mathcal{F}, L_\infty(\mathbb{P}_n)\right) \le \left(2/\varepsilon^{r^{-1}\vee 1}\right)^n, \qquad \text{(a)}$$
where
$$\|f - g\|_{L_r(\mathbb{P}_n)} = \{\mathbb{P}_n|f - g|^r\}^{1/(r\vee 1)}, \qquad \|f - g\|_{L_\infty(\mathbb{P}_n)} = \max_{1\le i\le n}|f(X_i) - g(X_i)|.$$
Now if $C_1, \ldots, C_k$, with $k = N(\varepsilon, \mathcal{C}, L_\infty(\mathbb{P}_n))$, form an $\varepsilon$-net for $\mathcal{C}$ for the $L_\infty(\mathbb{P}_n)$ metric, and $\varepsilon < 1$, then if $C \in \mathcal{C}$ satisfies
$$\max_{1\le i\le n}\left(1_{C - C_j}(X_i) + 1_{C_j - C}(X_i)\right) = \max_{1\le i\le n}|1_C(X_i) - 1_{C_j}(X_i)| < \varepsilon$$
for some $j \in \{1, \ldots, k\}$, then the left side must be zero, and hence no $X_i$ is in $C - C_j$ or $C_j - C$. Thus it follows that
$$k = \#\{\{X_1, \ldots, X_n\} \cap C_j : j = 1, \ldots, k\} = \#\{\{X_1, \ldots, X_n\} \cap C : C \in \mathcal{C}\};$$
in other words, for all $\varepsilon < 1$,
$$\Delta_n^{\mathcal{C}}(X_1, \ldots, X_n) = N(\varepsilon, \mathcal{C}, L_\infty(\mathbb{P}_n)). \qquad \text{(b)}$$
Combining (a) and (b), we see that condition (ii) of Theorem 5.3 implies the random entropy condition of Theorem 5.2, and sufficiency of (ii) follows. □

Example 5.1 Suppose that $\mathcal{X} = \mathbb{R}^d$ and
$$\mathcal{F} = \{x \mapsto 1_{(-\infty,t]}(x) : t \in \mathbb{R}^d\} = \{1_C : C \in \mathcal{C}\},$$
where $\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}^d\}$. Then, as will be proved in Chapter 7, for all probability measures $Q$ on $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^d, \mathcal{B}_d)$,
$$N(\varepsilon, \mathcal{F}, L_1(Q)) \le M(K/\varepsilon)^d$$
for constants $M = M_d$ and $K$ and every $\varepsilon > 0$. Therefore
$$\log N(\varepsilon, \mathcal{F}, L_1(Q)) \le \log M + d\log(K/\varepsilon),$$
and the conditions of Theorem 5.2 hold easily with the constant envelope function $F \equiv 1$. Thus $\mathcal{F}$ is $P$-GC for all $P$ on $(\mathbb{R}^d, \mathcal{B}_d)$. Note that for $f_t = 1_{(-\infty,t]} \in \mathcal{F}$, the corresponding
functions $t \mapsto P(f_t) = P(X \le t)$ and $t \mapsto \mathbb{P}_n(f_t) = \mathbb{P}_n(X \le t)$ are the classical distribution function of $X \sim P$ and the corresponding classical empirical distribution function. Thus the conclusion may be restated as
$$\|\mathbb{P}_n(X \le \cdot) - P(X \le \cdot)\|_\infty = \sup_{t\in\mathbb{R}^d}|\mathbb{P}_n(X \le t) - P(X \le t)| \to_{a.s.} 0.$$

Example 5.2 Suppose that $\mathcal{X} = \mathbb{R}^d$ and
$$\mathcal{F} = \{x \mapsto 1_{(s,t]}(x) : s, t \in \mathbb{R}^d,\ s \le t\} = \{1_C : C \in \mathcal{C}\},$$
where $\mathcal{C} = \{(s, t] : s, t \in \mathbb{R}^d,\ s \le t\}$. Then, as will be proved in Chapter 7, for all probability measures $Q$ on $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^d, \mathcal{B}_d)$,
$$N(\varepsilon, \mathcal{F}, L_1(Q)) \le M(K/\varepsilon)^{2d}$$
for constants $M = M_d$ and $K$ and every $\varepsilon > 0$. Therefore
$$\log N(\varepsilon, \mathcal{F}, L_1(Q)) \le \log M + 2d\log(K/\varepsilon),$$
and the conditions of Theorem 5.2 again hold with the constant envelope function $F \equiv 1$. Thus $\mathcal{F}$ is $P$-GC for all $P$ on $(\mathbb{R}^d, \mathcal{B}_d)$. Since $\mathcal{F}$ is in one-to-one correspondence with the class of sets $\mathcal{C}$, the class of all (upper closed) rectangles in this case, we also say that $\mathcal{C}$ is $P$-Glivenko-Cantelli for all $P$.

5.2 Universal and Uniform Glivenko-Cantelli classes

If $\mathcal{F}$ is $P$-Glivenko-Cantelli for all $P$ on $(\mathcal{X}, \mathcal{A})$, then we say that $\mathcal{F}$ is a universal Glivenko-Cantelli class. A still stronger Glivenko-Cantelli property is formulated in terms of the uniformity of the convergence in the probability measure $P$ on $(\mathcal{X}, \mathcal{A})$. We let $\mathcal{P} = \mathcal{P}(\mathcal{X}, \mathcal{A})$ be the set of all probability measures on the measurable space $(\mathcal{X}, \mathcal{A})$. We say that $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class if for all $\varepsilon > 0$
$$\sup_{P\in\mathcal{P}(\mathcal{X},\mathcal{A})} \Pr\nolimits_P^*\left(\sup_{m\ge n}\|\mathbb{P}_m - P\|_{\mathcal{F}} > \varepsilon\right) \to 0 \quad \text{as } n \to \infty.$$
For $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$, $n = 1, 2, \ldots$, and $r \in (0, \infty)$, we define on $\mathcal{F}$ the pseudo-distances
$$e_{x,r}(f, g) = \left\{n^{-1}\sum_{i=1}^n |f(x_i) - g(x_i)|^r\right\}^{r^{-1}\wedge 1}, \qquad e_{x,\infty}(f, g) = \max_{1\le i\le n}|f(x_i) - g(x_i)|, \qquad f, g \in \mathcal{F}.$$
Let $N(\varepsilon, \mathcal{F}, e_{x,r})$ denote the $\varepsilon$-covering number of $(\mathcal{F}, e_{x,r})$, $\varepsilon > 0$. Then define, for n = 1, 2, . . .
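, $\varepsilon > 0$, and $r \in (0, \infty]$, the quantities
$$N_{n,r}(\varepsilon, \mathcal{F}) \overset{\mathrm{def}}{=} \sup_{x\in\mathcal{X}^n} N(\varepsilon, \mathcal{F}, e_{x,r}).$$

Theorem 5.4 (Dudley, Giné and Zinn (1991)) Suppose that $\mathcal{F}$ is a class of uniformly bounded functions such that $\mathcal{F}$ is image admissible Suslin. Then the following are equivalent:
(a) $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class.
(b) $\dfrac{\log N_{n,r}(\varepsilon, \mathcal{F})}{n} \to 0$ for all $\varepsilon > 0$ for some (all) $r \in (0, \infty]$.

Proof. We first show that (b) with $r = 1$ implies (a). Let $\varepsilon_i$ be a sequence of Rademacher random variables independent of the $X_i$. By uniform boundedness of $\mathcal{F}$, $M = \|F\|_\infty < \infty$. By Lemma 4.2 with $x = n\varepsilon$ and boundedness of $\mathcal{F}$ it follows that for all $\varepsilon > 0$ and for all $n$ sufficiently large we have
$$\Pr\{\|\mathbb{P}_n - P\|_{\mathcal{F}} > \varepsilon\} \le 4\Pr\left\{\left\|\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F}} > \frac{n\varepsilon}{4}\right\}.$$
For $n = 1, 2, \ldots$, let $x_n(\omega) = (X_1(\omega), \ldots, X_n(\omega)) \in \mathcal{X}^n$. By definition of $N(\varepsilon, \mathcal{F}, e_{x,1})$, for each $\omega$ there is a function $\pi_n = \pi_n^\omega : \mathcal{F} \to \mathcal{F}$ with $\mathrm{card}\{\pi_n f : f \in \mathcal{F}\} = N(\varepsilon/8, \mathcal{F}, e_{x_n(\omega),1})$ and $e_{x_n(\omega),1}(f, \pi_n f) \le \varepsilon/8$, $f \in \mathcal{F}$. By Hoeffding's inequality,
$$\Pr\left\{\left\|\sum_{i=1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F}} > \frac{n\varepsilon}{4}\right\} \le E_P \Pr\nolimits_\varepsilon\left\{\left\|\sum_{i=1}^n \varepsilon_i \pi_n f(X_i)\right\|_{\mathcal{F}} > \frac{n\varepsilon}{8}\right\} \le 2E\{N(\varepsilon/8, \mathcal{F}, e_{x_n(\omega),1})\}\exp\left(-\frac{n\varepsilon^2}{128M^2}\right),$$
where the interchange of $E_P$ and $E_\varepsilon$ is justified by the image admissible Suslin condition. By the hypothesis (b) with $r = 1$, for all $n$ sufficiently large we have
$$N(\varepsilon/8, \mathcal{F}, e_{x,1}) \le \exp\left(\frac{n\varepsilon^2}{256M^2}\right)$$
for all $x \in \mathcal{X}^n$. Therefore we can conclude that
$$\Pr\{\|\mathbb{P}_n - P\|_{\mathcal{F}} > \varepsilon\} \le 8\exp\left(-\frac{n\varepsilon^2}{256M^2}\right)$$
for sufficiently large $n$. Summing up over $n$, it follows that there is an $N$ so that for $n \ge N$ we have
$$\sup_{P\in\mathcal{P}}\sum_{k\ge n}\Pr\{\|\mathbb{P}_k - P\|_{\mathcal{F}} > \varepsilon\} \le 8\sum_{k=n}^\infty \exp\left(-\frac{k\varepsilon^2}{256M^2}\right) \le \frac{8\exp(-n\varepsilon^2/(256M^2))}{1 - \exp(-\varepsilon^2/(256M^2))},$$
where the right term goes to 0 as $n \to \infty$. This completes the proof of (a). The proof that (a) implies (b) uses Gaussian symmetrization techniques, so it will be treated in Chapter 9. □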
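The last display in the proof above sums a geometric series. The following sketch, with hypothetical function names and arbitrarily chosen values of $\varepsilon$ and $M$, evaluates the closed-form bound and compares it with a long partial sum of the series. It illustrates only the arithmetic of that final step; the bound itself is valid only for $n$ beyond the threshold $N$ supplied by hypothesis (b), which the code does not compute.

```python
import math


def tail_bound(n, eps, M):
    """Closed form 8 * exp(-n*a) / (1 - exp(-a)) with a = eps^2 / (256 M^2)."""
    a = eps ** 2 / (256.0 * M ** 2)
    return 8.0 * math.exp(-n * a) / (1.0 - math.exp(-a))


def partial_sum(n, eps, M, terms):
    """Sum of 8 * exp(-k*a) for k = n, ..., n + terms - 1 (same a as above)."""
    a = eps ** 2 / (256.0 * M ** 2)
    return sum(8.0 * math.exp(-k * a) for k in range(n, n + terms))


# Illustrative values only: eps = 1 and M = 1/2 give a = 1/64.
for n in (500, 1000, 2000):
    print(n, tail_bound(n, 1.0, 0.5), partial_sum(n, 1.0, 0.5, 5000))
```

The printed pairs agree to many digits and decrease geometrically in $n$, which is the convergence to zero used to conclude (a).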
5.3 Preservation of the GC property

Now our goal is to present several results concerning the stability of the Glivenko-Cantelli property of one or more classes of functions under composition with functions $\varphi$.

Theorem 5.5 Suppose that $\mathcal{F}_1, \ldots, \mathcal{F}_k$ are $P$-Glivenko-Cantelli classes of functions, and that $\varphi : \mathbb{R}^k \to \mathbb{R}$ is continuous. Then $\mathcal{H} \equiv \varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is $P$-Glivenko-Cantelli provided that it has an integrable envelope function.

Proof. We will prove the thesis for classes of functions $\mathcal{F}_i$ which are appropriately measurable. Let $F_1, \ldots, F_k$ and $H$ be integrable envelopes for $\mathcal{F}_1, \ldots, \mathcal{F}_k$ and $\mathcal{H}$ respectively, and set $F = F_1 \vee \ldots \vee F_k$. For $M \in (0, \infty)$, define

$$\mathcal{H}_M \equiv \{\varphi(f) 1_{[F \le M]} : f = (f_1, \ldots, f_k) \in \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \equiv \mathcal{F}\}.$$

Now

$$\|(\mathbb{P}_n - P)\varphi(f)\|_{\mathcal{F}} \le (\mathbb{P}_n + P) H 1_{[F > M]} + \|(\mathbb{P}_n - P) h\|_{\mathcal{H}_M}.$$

The expectation of the first term on the right converges to 0 as $M \to \infty$. Hence it suffices to show that $\mathcal{H}_M$ is $P$-Glivenko-Cantelli for every fixed $M$. Let $\delta = \delta(\epsilon)$ be the $\delta$ of Lemma 5.5 below for $\varphi : [-M, M]^k \to \mathbb{R}$, $\epsilon > 0$, and $\|\cdot\|$ the $L_1(\mathbb{P}_n)$-norm $\|\cdot\|_1$. Then for any $(f_j, g_j) \in \mathcal{F}_j$, $j = 1, \ldots, k$,

$$\mathbb{P}_n |f_j - g_j| 1_{[F_j \le M]} \le \frac{\delta}{k}, \qquad j = 1, \ldots, k,$$

implies that

$$\mathbb{P}_n |\varphi(f_1, \ldots, f_k) - \varphi(g_1, \ldots, g_k)| 1_{[F \le M]} \le \epsilon.$$

It follows that

$$N(\epsilon, \mathcal{H}_M, L_1(\mathbb{P}_n)) \le \prod_{j=1}^k N\Big(\frac{\delta}{k}, \mathcal{F}_j 1_{[F_j \le M]}, L_1(\mathbb{P}_n)\Big).$$

Thus $E^* \log N(\epsilon, \mathcal{H}_M, L_1(\mathbb{P}_n)) = o(n)$ for every $\epsilon > 0$, $M < \infty$. This implies that $E^* \log N(\epsilon, (\mathcal{H}_M)_N, L_1(\mathbb{P}_n)) = o(n)$, where $(\mathcal{H}_M)_N$ denotes the class of functions $h 1\{H \le N\}$ for $h \in \mathcal{H}_M$. Thus $\mathcal{H}_M$ is strong Glivenko-Cantelli for $P$ by Theorem 5.1. This concludes the proof that $\mathcal{H} = \varphi(\mathcal{F})$ is weak Glivenko-Cantelli. Because it has an integrable envelope, it is strong Glivenko-Cantelli. □

Theorem 5.6 (Dudley (1998a)) Suppose that $\mathcal{F}$ is a $P$-Glivenko-Cantelli class for $P$ with $\|P\|_{\mathcal{F}} < \infty$, $J$ is a possibly unbounded interval including the ranges of all $f \in \mathcal{F}$, $\varphi$ is continuous and monotone on $J$, and for some finite constants $c, d$, $|\varphi(y)| \le c|y| + d$ for all $y \in J$.
Then $\varphi(\mathcal{F})$ is also a strong Glivenko-Cantelli class for $P$.

Proof. Given classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$ of functions $f_i : \mathcal{X} \to \mathbb{R}$ and a function $\varphi : \mathbb{R}^k \to \mathbb{R}$, let $\varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ be the class of functions $x \mapsto \varphi(f_1(x), \ldots, f_k(x))$, where $f_i \in \mathcal{F}_i$, $i = 1, \ldots, k$. With this notation the thesis follows from Theorem 5.5. □

Proposition 5.1 (Dudley (1998b)) Suppose that $\mathcal{F}$ is a strong Glivenko-Cantelli class for $P$ with $\|P\|_{\mathcal{F}} < \infty$, and $g$ is a fixed bounded function ($\|g\|_\infty = K < \infty$, $K > 0$). Then the class of functions $g \cdot \mathcal{F} \equiv \{g \cdot f : f \in \mathcal{F}\}$ is a strong Glivenko-Cantelli class for $P$.

Proof. Take $\mathcal{F}_1 = \{g\}$, $\mathcal{F}_2 = \mathcal{F}$, and $\varphi : \mathbb{R}^2 \to \mathbb{R}$ given by $\varphi(u, v) = uv$ (which is continuous) in Theorem 5.5. Now $\mathcal{F}_1$ and $\mathcal{F}_2$ are GC, and $|g \cdot f| = |\varphi(f_1, f_2)| \le K F$ for all $f_1 \in \mathcal{F}_1$, $f_2 \in \mathcal{F}_2$, where $F$ is an envelope of $\mathcal{F}$ and $P(KF) < \infty$. By Theorem 5.5 it follows that $g \cdot \mathcal{F}$ is a strong Glivenko-Cantelli class for $P$. □

Proposition 5.2 (Giné and Zinn (1984)) Suppose that $\mathcal{F}$ is a uniformly bounded strong Glivenko-Cantelli class for $P$, and $g \in L_1(P)$ is a fixed function. Then the class of functions $g \cdot \mathcal{F} \equiv \{g \cdot f : f \in \mathcal{F}\}$ is a strong Glivenko-Cantelli class for $P$.

Proof. Take $\mathcal{F}_1 = \{g\}$, $\mathcal{F}_2 = \mathcal{F}$, and $\varphi : \mathbb{R}^2 \to \mathbb{R}$ given by $\varphi(u, v) = uv$ (which is continuous) in Theorem 5.5. Now $\mathcal{F}_1$ and $\mathcal{F}_2$ are GC and, in view of Theorem 5.5, it suffices to check that $\varphi(\mathcal{F}_1, \mathcal{F}_2)$ has an integrable envelope. Since $\mathcal{F}$ is uniformly bounded, $\|f\|_\infty \le K$ for all $f \in \mathcal{F}$ and some $K > 0$, and it is easily checked that $K \cdot |g|$ is an integrable envelope for $g \cdot \mathcal{F}$. □

Lemma 5.4 Any strong $P$-Glivenko-Cantelli class $\mathcal{F}$ is totally bounded in $L_1(P)$ if and only if $\|P\|_{\mathcal{F}} < \infty$. Furthermore, for any $r \in (1, \infty)$, if $\mathcal{F}$ has an envelope that is contained in $L_r(P)$, then $\mathcal{F}$ is also totally bounded in $L_r(P)$.

Proof. A class that is totally bounded is also bounded. Thus for the first statement we only need to prove that a strong Glivenko-Cantelli class $\mathcal{F}$ with $\|P\|_{\mathcal{F}} < \infty$ is totally bounded in $L_1(P)$. It is well known that such a class has an integrable envelope (e.g.
see Giné and Zinn (1983) to conclude first that $P^* \|f - Pf\|_{\mathcal{F}} < \infty$; next the claim follows from the triangle inequality $\|f\|_{\mathcal{F}} \le \|f - Pf\|_{\mathcal{F}} + \|P\|_{\mathcal{F}}$). There is no loss of generality in assuming that the class $\mathcal{F}$ possesses an envelope that is finite everywhere. Now, suppose that there exists a sequence of finitely discrete probability measures $P_n$ such that

$$L_n = \sup\{|(P_n - P)|f - g|| : f, g \in \mathcal{F}\} \to 0.$$

Then for every $\epsilon > 0$, there exists $n_0$ such that $L_{n_0} < \epsilon$. For this $n_0$ there exists a finite $\epsilon$-net $f_1, \ldots, f_N$ over $\mathcal{F}$ relative to the $L_1(P_{n_0})$-norm, because, restricted to the support of $P_{n_0}$, the functions $f$ are uniformly bounded by the finite envelope, and hence covering $\mathcal{F}$ in $L_1(P_{n_0})$ is like covering a compact set in $\mathbb{R}^{n_0}$. Now, for any $f \in \mathcal{F}$ there is an $f_i$ such that

$$P|f - f_i| \le L_{n_0} + P_{n_0}|f - f_i| < 2\epsilon.$$

It follows that $\mathcal{F}$ is totally bounded in $L_1(P)$. To conclude the proof it suffices to select a sequence $P_n$. This can be constructed as a sequence of realizations of the empirical measure if we know that the class $|\mathcal{F} - \mathcal{F}|$ is $P$-GC. It is immediate from the definition of a Glivenko-Cantelli class that $\mathcal{F} - \mathcal{F}$ is $P$-GC. Next, by Dudley's theorem and by the previous propositions, the classes $(\mathcal{F} - \mathcal{F})^+$ and $(\mathcal{F} - \mathcal{F})^-$ are $P$-Glivenko-Cantelli. Then the sum of these two classes, which contains $|\mathcal{F} - \mathcal{F}|$, is $P$-GC and hence the proof of the first statement is complete.

If $\mathcal{F}$ has an envelope in $L_r(P)$, then $\mathcal{F}$ is totally bounded in $L_r(P)$ if the class $\mathcal{F}_M$ of functions $f \cdot 1\{F \le M\}$ is totally bounded in $L_r(P)$ for every fixed $M$. We have proved that the class $\mathcal{F}_M$ is $P$-GC, and hence this class is totally bounded in $L_1(P)$. But then it is also totally bounded in $L_r(P)$, because $P|f|^r \le P|f| M^{r-1}$ for any $f$ that is bounded by $M$, and we can construct the $\epsilon$-net over $\mathcal{F}_M$ in $L_1(P)$ to consist of functions that are bounded by $M$. □

Lemma 5.5 Suppose that $\varphi : K \to \mathbb{R}$ is continuous and $K \subset \mathbb{R}^k$ is compact. Then for every $\epsilon > 0$ there exists $\delta > 0$ such that for all $n$ and for all $a_1, \ldots, a_n, b_1, \ldots$
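, $b_n \in K \subset \mathbb{R}^k$,

$$\frac{1}{n} \sum_{i=1}^n \|a_i - b_i\| < \delta \quad \Rightarrow \quad \frac{1}{n} \sum_{i=1}^n |\varphi(a_i) - \varphi(b_i)| < \epsilon.$$

Here $\|\cdot\|$ can be any norm on $\mathbb{R}^k$; in particular it can be $\|x\|_r = \big(\sum_{i=1}^k |x_i|^r\big)^{1/r}$, $r \in [1, +\infty)$, or $\|x\|_\infty \equiv \max_{1 \le i \le k} |x_i|$ for $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$.

Proof. Let $U_n$ be uniform on $\{1, \ldots, n\}$, and set $X_n = a_{U_n}$, $Y_n = b_{U_n}$. Then we can write

$$\frac{1}{n} \sum_{i=1}^n \|a_i - b_i\| = E\|X_n - Y_n\| \qquad \text{and} \qquad \frac{1}{n} \sum_{i=1}^n |\varphi(a_i) - \varphi(b_i)| = E|\varphi(X_n) - \varphi(Y_n)|.$$

Hence, it suffices to show that for every $\epsilon > 0$ there exists $\delta > 0$ such that for all random vectors $(X, Y)$ in $K \subset \mathbb{R}^k$,

$$E\|X - Y\| < \delta \quad \Rightarrow \quad E|\varphi(X) - \varphi(Y)| < \epsilon.$$

Suppose not. Then for some $\epsilon > 0$ and for all $m = 1, 2, \ldots$ there exists $(X_m, Y_m)$ such that

$$E\|X_m - Y_m\| < \frac{1}{m}, \qquad E|\varphi(X_m) - \varphi(Y_m)| \ge \epsilon.$$

But, since $\{(X_m, Y_m)\}$ is tight, there exists a subsequence $(X_{m'}, Y_{m'}) \to_d (X, Y)$. Then it follows that

$$E\|X - Y\| = \lim_{m' \to \infty} E\|X_{m'} - Y_{m'}\| = 0,$$

so that $X = Y$ a.s., while, on the other hand,

$$0 = E|\varphi(X) - \varphi(Y)| = \lim_{m' \to \infty} E|\varphi(X_{m'}) - \varphi(Y_{m'})| \ge \epsilon > 0,$$

a contradiction. □

Another potentially useful preservation theorem is one based on building up Glivenko-Cantelli classes from the restriction of a class of functions to elements of a partition of the sample space. The following theorem is related to the result of van der Vaart (1996) for Donsker classes.

Theorem 5.7 Suppose that $\mathcal{F}$ is a class of functions on $(\mathcal{X}, \mathcal{A}, P)$, and $\{\mathcal{X}_i\}$ is a partition of $\mathcal{X}$: $\bigcup_{i=1}^\infty \mathcal{X}_i = \mathcal{X}$, $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \ne j$. Suppose that $\mathcal{F}_j \equiv \{f \cdot 1_{\mathcal{X}_j} : f \in \mathcal{F}\}$ is $P$-Glivenko-Cantelli for each $j$, and $\mathcal{F}$ has an integrable envelope function $F$. Then $\mathcal{F}$ is itself $P$-Glivenko-Cantelli.

Proof. Since

$$f = f \sum_{j=1}^\infty 1_{\mathcal{X}_j} = \sum_{j=1}^\infty f \cdot 1_{\mathcal{X}_j},$$

it follows that

$$E^* \|\mathbb{P}_n - P\|_{\mathcal{F}} \le \sum_{j=1}^\infty E^* \|\mathbb{P}_n - P\|_{\mathcal{F}_j} \to 0$$

by the dominated convergence theorem, since each term in the sum converges to zero by the hypothesis that each $\mathcal{F}_j$ is $P$-Glivenko-Cantelli, and we have

$$E^* \|\mathbb{P}_n - P\|_{\mathcal{F}_j} \le E^* \mathbb{P}_n(F \cdot 1_{\mathcal{X}_j}) + P(F \cdot 1_{\mathcal{X}_j}) \le 2 P(F \cdot 1_{\mathcal{X}_j}),$$

where $\sum_{j=1}^\infty P(F \cdot 1_{\mathcal{X}_j}) = P(F) < \infty$. □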
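5.4 Exercises

Exercise 5.1 Show that, if $\mathcal{F}$ is a class of functions satisfying the bracketing entropy hypothesis of Theorem 5.1, then $\mathcal{F}$ has a measurable envelope $F$ satisfying $PF < \infty$.

Solution. By the entropy hypothesis, $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$. Fix $\epsilon > 0$, let $[f_j, g_j]$, $j = 1, \ldots, n$, be $\epsilon$-brackets covering $\mathcal{F}$, and set

$$F := \max_{1 \le j \le n} \max\{|f_j|, |g_j|\}.$$

Then $F$ is a measurable envelope for $\mathcal{F}$, and we claim that $\int |F| \, dP < \infty$. By contradiction, if

$$\infty = \int F \, dP = \int \max_{1 \le j \le n} \max\{|f_j|, |g_j|\} \, dP,$$

then there exists $j$ with $\int \max\{|f_j|, |g_j|\} \, dP = \infty$; but in the definition of the brackets $[f_j, g_j]$ we have $\|f_j\|_{P,1}, \|g_j\|_{P,1} < \infty$. □

Exercise 5.2 Suppose that $\mathcal{X} = \mathbb{R}$ and that $X \sim P$.

(i) For $0 < M < \infty$ and $a \in \mathbb{R}$, let $f(x, t) = |x - t|$, and $\mathcal{F} = \mathcal{F}_{a,M} = \{f(x, t) : |t - a| \le M\}$.

(ii) For $a \in \mathbb{R}$, let $f(x, t) = |x - t| - |x - a|$, and $\mathcal{F} = \mathcal{F}_a = \{f(x, t) : |t - a| \le M\}$.

Show that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\epsilon > 0$ for the classes $\mathcal{F}$ in (i) if $E|X| < \infty$, and in (ii) without the hypothesis $E|X| < \infty$. Compute the envelope function for the two classes.

Solution. (i) We use Lemma 6.2: suppose that $\{f(\cdot, t) : t \in T\}$ is a class of functions satisfying

$$|f(x, t) - f(x, s)| \le d(s, t) F(x) \qquad \forall s, t \in T, \ x \in \mathcal{X},$$

for some metric $d$ on the index set. Then for any norm $\|\cdot\|$ (we choose the $L_1(P)$-norm),

$$N_{[\,]}(2\epsilon \|F\|, \mathcal{F}, \|\cdot\|) \le N(\epsilon, T, d) \qquad \text{or} \qquad N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|) \le N\Big(\frac{\epsilon}{2\|F\|}, T, d\Big).$$

In the present case, $T = [a - M, a + M]$ and

$$|f(x, t) - f(x, s)| = \big||x - t| - |x - s|\big| \le |t - s| = d(t, s) \cdot 1$$

for $s, t \in T$. Also, $N\big(\frac{\epsilon}{2\|F\|}, T, d\big) < \infty$ (observe that $F \equiv 1$, so $\|F\| = 1$), since $T$ is compact with respect to the Euclidean metric, and the brackets are of the form

$$\Big[ f(\cdot, t_j) - \frac{\epsilon}{2\|F\|}, \ f(\cdot, t_j) + \frac{\epsilon}{2\|F\|} \Big],$$

where $t_1, t_2, \ldots, t_k$ is an $\frac{\epsilon}{2\|F\|}$-net over $T$. This can be done with $k = N\big(\frac{\epsilon}{2\|F\|}, T, d\big)$ points. We check that $P(|u_j|), P(|l_j|) < \infty$:

$$P(|l_j|) = P\Big| f(x, t_j) - \frac{\epsilon}{2\|F\|} \Big| \le \frac{\epsilon}{2\|F\|} + E|X - t_j| \le \frac{\epsilon}{2\|F\|} + |t_j| + E|X| < \infty,$$

and similarly for $u_j = f(x, t_j) + \frac{\epsilon}{2\|F\|}$.

(ii) In this case $a$ is fixed, $T = [a - M, a + M]$,

$$|f(x, t) - f(x, s)| = \big||x - t| - |x - s|\big| \le |t - s| = d(t, s) \cdot 1,$$

and as in (i) we can get brackets $\{[l_j, u_j]\}_{j=1}^k$ with $k = N\big(\frac{\epsilon}{2\|F\|}, T, d\big) < \infty$, such that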
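for every $f \in \mathcal{F}_a$ there exists $j$ with $l_j(x) \le f(x) \le u_j(x)$, where

$$l_j(x) = f(x, t_j) - \frac{\epsilon}{2} \qquad \text{and} \qquad u_j(x) = f(x, t_j) + \frac{\epsilon}{2}.$$

Here $t_1, t_2, \ldots, t_k$ is an $\frac{\epsilon}{2\|F\|}$-net over $T$ and

$$P(|l_j|) \le \frac{\epsilon}{2} + E\big||X - t_j| - |X - a|\big| < \infty.$$

Note that $|x - t_j| - |x - a|$ is bounded in absolute value by $|t_j - a|$. So the hypothesis $E|X| < \infty$ is not needed to ensure that the edges of the brackets have finite $L_1(P)$-norms. □

Exercise 5.3 Suppose that $\mathcal{F}$ is a $P$-Glivenko-Cantelli class of measurable functions; that is, $\|\mathbb{P}_n - P\|^*_{\mathcal{F}} \to_{a.s.} 0$ as $n \to \infty$. Show that this implies $P^* \|f - Pf\|_{\mathcal{F}} < \infty$. Thus if $\|P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |Pf| < \infty$, then $P^* F < \infty$ for an envelope function $F$.

Solution. For any fixed $j$,

$$E^*\big[\|f(X_j) - Pf\|_{\mathcal{F}}\big] = \int_0^\infty P^*\big(\|f(X_j) - Pf\|_{\mathcal{F}} > t\big) \, dt,$$

and this is finite if and only if

$$\sum_{n=1}^\infty P^*\big(\|f(X_j) - Pf\|_{\mathcal{F}} > n\big) < \infty.$$

But, since the $X_i$ are i.i.d.,

$$\sum_{n=1}^\infty P^*\big(\|f(X_j) - Pf\|_{\mathcal{F}} > n\big) = \sum_{n=1}^\infty P^*\big(\|f(X_n) - Pf\|_{\mathcal{F}} > n\big),$$

and $\sum_{n=1}^\infty P^*(\|f(X_n) - Pf\|_{\mathcal{F}} > n) < \infty$ if

$$P(\limsup_n A_n) = 0,$$

where, in the present problem, $A_n = \{\|f(X_n) - Pf\|_{\mathcal{F}} > n\}$. We want to use the second Borel-Cantelli lemma: if $\{A_n\}$ is a sequence of independent events, then

$$\sum_n P(A_n) = \infty \ \Rightarrow \ P(\limsup_{n \to \infty} A_n) = 1; \qquad \text{consequently,} \qquad P(\limsup_{n \to \infty} A_n) = 0 \ \Rightarrow \ \sum_n P(A_n) < \infty.$$

Since the $X_n$'s are i.i.d., the $\{A_n\}$ are indeed independent. We need to show that $P\big(\big\|\frac{1}{n}(f(X_n) - Pf)\big\|^*_{\mathcal{F}} > 1 \text{ i.o.}\big) = 0$. This holds if we prove that $\big\|\frac{1}{n}(f(X_n) - Pf)\big\|^*_{\mathcal{F}} \to_{a.s.} 0$. But

$$\frac{1}{n}(f(X_n) - Pf) = \frac{1}{n} \sum_{i=1}^n (f(X_i) - Pf) - \frac{n-1}{n} \Big[ \frac{1}{n-1} \sum_{i=1}^{n-1} (f(X_i) - Pf) \Big].$$

Hence

$$\Big\| \frac{1}{n}(f(X_n) - Pf) \Big\|^*_{\mathcal{F}} \le \|\mathbb{P}_n - P\|^*_{\mathcal{F}} + \frac{n-1}{n} \|\mathbb{P}_{n-1} - P\|^*_{\mathcal{F}} \to_{a.s.} 0$$

by hypothesis, as $n \to \infty$. □

Exercise 5.4 For a class of functions $\mathcal{F}$ with envelope $F$ and $0 < M < \infty$, let $\mathcal{F}_M = \{f 1\{F \le M\} : f \in \mathcal{F}\}$. Show that the $L_r(Q)$-entropy numbers $N(\epsilon, \mathcal{F}_M, L_r(Q))$ are smaller than those of $\mathcal{F}$ for any probability measure $Q$ and for numbers $M > 0$ and $r \ge 1$.

Solution. Consider now $\Theta \in U \subset \mathbb{R}^d$, $U$ compact. Let $\mathcal{F}_\Theta = \{f_{\Theta,\nu} : \nu \in \Gamma\}$, given

$$\log N_{[\,]}(\epsilon, \mathcal{F}_\Theta, L_2(P)) \le k \Big(\frac{1}{\epsilon}\Big)^w, \qquad 0 < w < 2,$$

where $k$ does not depend on $\Theta$.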
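Also: there exists $F$ with $PF^2 < \infty$ such that

$$|f_{\Theta_1,\nu} - f_{\Theta_2,\nu}| \le F |\Theta_1 - \Theta_2| \qquad \forall \ \Theta_1, \Theta_2, \nu.$$

Then, with $\mathcal{F} = \bigcup_{\Theta \in U} \mathcal{F}_\Theta$,

$$\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)) \lesssim d \cdot \log \frac{1}{\epsilon} + k \Big(\frac{1}{\epsilon}\Big)^w, \qquad \text{hence} \qquad \int_0^\infty \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))} \, d\epsilon < \infty.$$

(Do not worry about the "$\infty$" in the upper limit of the integral: the bracketing numbers are decreasing in $\epsilon$, so the crucial point is zero.) □

Exercise 5.5 Suppose that $\mathcal{F}, \mathcal{F}_1, \mathcal{F}_2$ are $P$-Glivenko-Cantelli classes of functions. Show that the following classes are also $P$-Glivenko-Cantelli:

(i) $\{a_1 f_1 + a_2 f_2 : f_i \in \mathcal{F}_i, |a_i| \le 1\}$;

(ii) $\mathcal{F}_1 \cup \mathcal{F}_2$;

(iii) the class of functions that are both the pointwise limit and the $L_1(P)$-limit of a sequence in $\mathcal{F}$.

Solution. (i) Denote the class in (i) by $\mathcal{F}_0$. Note that

$$|(\mathbb{P}_n - P)(a_1 f_1 + a_2 f_2)| \le |a_1||(\mathbb{P}_n - P) f_1| + |a_2||(\mathbb{P}_n - P) f_2| \le |(\mathbb{P}_n - P) f_1| + |(\mathbb{P}_n - P) f_2|$$

(for any $f_i \in \mathcal{F}_i$ and $|a_i| \le 1$, $i = 1, 2$), and

$$\sup_{a_1 f_1 + a_2 f_2 : f_i \in \mathcal{F}_i, |a_i| \le 1} |(\mathbb{P}_n - P)(a_1 f_1 + a_2 f_2)| \le \sum_{i=1,2} \sup_{f_i \in \mathcal{F}_i} |(\mathbb{P}_n - P) f_i|,$$

that is,

$$\|\mathbb{P}_n - P\|_{\mathcal{F}_0} \le \|\mathbb{P}_n - P\|_{\mathcal{F}_1} + \|\mathbb{P}_n - P\|_{\mathcal{F}_2}, \qquad \text{thus} \qquad \|\mathbb{P}_n - P\|^*_{\mathcal{F}_0} \le \|\mathbb{P}_n - P\|^*_{\mathcal{F}_1} + \|\mathbb{P}_n - P\|^*_{\mathcal{F}_2},$$

with $\|\mathbb{P}_n - P\|^*_{\mathcal{F}_i} \to_{a.s.} 0$, and hence $\|\mathbb{P}_n - P\|^*_{\mathcal{F}_0} \to_{a.s.} 0$, showing that $\mathcal{F}_0$ is a GC class.

(ii) Since $\|\mathbb{P}_n - P\|^*_{\mathcal{F}_1 \cup \mathcal{F}_2} \le \|\mathbb{P}_n - P\|^*_{\mathcal{F}_1} + \|\mathbb{P}_n - P\|^*_{\mathcal{F}_2}$ (as before), the rest of the argument follows from the previous case.

(iii) Let $\hat{\mathcal{F}}$ denote the class in (iii) and take $g \in \hat{\mathcal{F}}$. Then there exist $g_m \in \mathcal{F}$ with $g_m(t) \to g(t)$ for all $t$ and $\int |g_m - g| \, dP \to 0$, so that $P g_m \to P g$. Note that, for fixed $n$,

$$\frac{1}{n} \sum_{i=1}^n g_m(x_i) - P g_m \to \frac{1}{n} \sum_{i=1}^n g(x_i) - P g \qquad \text{as } m \to \infty,$$

hence

$$\Big| \frac{1}{n} \sum_{i=1}^n g_m(x_i) - P g_m \Big| \to \Big| \frac{1}{n} \sum_{i=1}^n g(x_i) - P g \Big| \quad \Rightarrow \quad \sup_{m \ge 1} \Big| \frac{1}{n} \sum_{i=1}^n g_m(x_i) - P g_m \Big| \ge \Big| \frac{1}{n} \sum_{i=1}^n g(x_i) - P g \Big|,$$

so that $\|\mathbb{P}_n - P\|_{\mathcal{F}} \ge |(\mathbb{P}_n - P) g|$ for every $g \in \hat{\mathcal{F}}$, i.e. $\|\mathbb{P}_n - P\|_{\hat{\mathcal{F}}} \le \|\mathbb{P}_n - P\|_{\mathcal{F}}$, and therefore $\|\mathbb{P}_n - P\|^*_{\hat{\mathcal{F}}} \to_{a.s.} 0$. □

Chapter 6

Donsker Theorems: Uniform CLT's

In this chapter we will develop Donsker theorems, or equivalently, uniform Central Limit Theorems, for classes of functions and sets.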
The proofs of these theorems will rely heavily on the techniques developed in Chapter 3 and Chapter 4. An important by-product of these proofs will be some new bounds on the expectations of suprema of the empirical process indexed by functions or sets.

6.1 Uniform Entropy Donsker Theorem

Suppose that $\mathcal{F}$ is a class of functions on a probability space $(\mathcal{X}, \mathcal{A}, P)$, and suppose that $X_1, \ldots, X_n$ are i.i.d. $\sim P$. As in Chapter 1 we let $\{\mathbb{G}_n(f) : f \in \mathcal{F}\}$ denote the empirical process indexed by $\mathcal{F}$:

$$\mathbb{G}_n(f) = \sqrt{n}(\mathbb{P}_n - P)(f), \qquad f \in \mathcal{F}.$$

To have convergence in law of all the finite-dimensional distributions, it suffices that $\mathcal{F} \subset L_2(P)$. If also

$$\mathbb{G}_n \Rightarrow \mathbb{G} \quad \text{in } \ell^\infty(\mathcal{F}),$$

where, necessarily, $\mathbb{G}$ is a $P$-Brownian bridge process with almost all sample paths in $C_u(\mathcal{F}, \rho_P)$, then we say that $\mathcal{F}$ is $P$-Donsker.

Our first theorem giving sufficient conditions for a class $\mathcal{F}$ to be a $P$-Donsker class will be formulated in terms of uniform entropy as follows: suppose that $F$ is an envelope function for the class $\mathcal{F}$ and that

$$\int_0^\infty \sup_Q \sqrt{\log N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\epsilon < \infty, \tag{6.1}$$

where the supremum is taken over all finitely discrete measures $Q$ on $(\mathcal{X}, \mathcal{A})$ with $\|F\|^2_{Q,2} = \int F^2 \, dQ > 0$. Then we say that $\mathcal{F}$ satisfies the uniform entropy condition. Here is the resulting theorem:

Theorem 6.1 Suppose that $\mathcal{F}$ is a class of measurable functions with envelope function $F$ satisfying:

(a) the uniform entropy condition (6.1) holds;

(b) $P^* F^2 < \infty$;

(c) the classes $\mathcal{F}_\delta = \{f - g : f, g \in \mathcal{F}, \|f - g\|_{P,2} < \delta\}$ and $\mathcal{F}^2_\infty = \{(f - g)^2 : f, g \in \mathcal{F}\}$ are $P$-measurable for every $\delta > 0$.

Then $\mathcal{F}$ is $P$-Donsker.

Proof. Let $\delta > 0$. By Markov's inequality and the symmetrization Corollary 4.1,

$$P^*(\|\mathbb{G}_n\|_{\mathcal{F}_\delta} > x) \le \frac{2}{x} E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{F}_\delta}$$

(recall that a random variable $X$ is symmetric if $\mathcal{L}(-X) = \mathcal{L}(X)$, and that Lévy's inequality applies to sums of independent symmetric random variables. The random variable $\varepsilon X$ is symmetric since $-\varepsilon =_d \varepsilon$ when $\varepsilon$ is a Rademacher random variable independent of $X$).
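Now, the supremum on the right side is measurable by assumption (c), so Fubini's theorem applies and the outer expectation can be calculated as $E_X E_\varepsilon$. Thus we fix the random variables $X_1, \ldots, X_n$, and bound the inner expectation over the Rademacher variables $\varepsilon_i$, $i = 1, \ldots, n$. By Hoeffding's inequality, the process $f \mapsto \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(X_i)$ is sub-Gaussian for the $L_2(\mathbb{P}_n)$-seminorm $\|f\|_n$ given by

$$\|f\|_n^2 = \mathbb{P}_n f^2 = \frac{1}{n} \sum_{i=1}^n f^2(X_i).$$

Thus the maximal inequality for sub-Gaussian processes, Corollary 3.5, yields

$$\text{(a)} \qquad E_\varepsilon \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{F}_\delta} \lesssim \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}_\delta, L_2(\mathbb{P}_n))} \, d\epsilon.$$

The set $\mathcal{F}_\delta$ fits in a single ball of radius $\epsilon$ once $\epsilon$ is larger than $\theta_n$ given by

$$\theta_n^2 = \sup_{f \in \mathcal{F}_\delta} \|f\|_n^2 = \Big\| \frac{1}{n} \sum_{i=1}^n f^2(X_i) \Big\|_{\mathcal{F}_\delta}.$$

Also, note that the covering numbers of the class $\mathcal{F}_\delta$ are bounded by the covering numbers of $\mathcal{F}_\infty = \{f - g : f, g \in \mathcal{F}\}$, and the latter satisfy $N(\epsilon, \mathcal{F}_\infty, L_2(Q)) \le N^2(\epsilon/2, \mathcal{F}, L_2(Q))$ for every measure $Q$. Thus we can limit the integral in (a) to the interval $(0, \theta_n)$, change variables, and bound the resulting integral above by a supremum over measures $Q$: we find that, up to absolute constants, the right side of (a) is bounded by

$$\int_0^{\theta_n} \sqrt{\log N(\epsilon, \mathcal{F}_\delta, L_2(\mathbb{P}_n))} \, d\epsilon \lesssim \int_0^{\theta_n/\|F\|_n} \sqrt{\log N(\epsilon \|F\|_n, \mathcal{F}, L_2(\mathbb{P}_n))} \, d\epsilon \cdot \|F\|_n \le \int_0^{\theta_n/\|F\|_n} \sup_Q \sqrt{\log N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\epsilon \cdot \|F\|_n.$$

The integrand is integrable by assumption (a) of the theorem. Furthermore, $\|F\|_n^2$ is bounded below by $\|F_*\|_n^2$, which converges almost surely to its expectation, which may be assumed positive. Now apply the Cauchy-Schwarz inequality to conclude that (up to an absolute constant) the expected value of the bound in the last display is bounded by

$$\text{(b)} \qquad \Big\{ E_X \Big[ \int_0^{\theta_n/\|F\|_n} \sup_Q \sqrt{\log N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\epsilon \Big]^2 \Big\}^{1/2} \big\{ E_X \|F\|_n^2 \big\}^{1/2}.$$

This bound converges to something bounded above by

$$\text{(c)} \qquad \int_0^{\delta/\|F_*\|_{P,2}} \sup_Q \sqrt{\log N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\epsilon \cdot \|F^*\|_{P,2}$$

if we can show that

$$\text{(d)} \qquad \theta_n^* \le \delta + o_p(1).$$

To show that this holds, note first that $\sup\{P f^2 : f \in \mathcal{F}_\delta\} \le \delta^2$.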
Since $\mathcal{F}_\delta \subset \mathcal{F}_\infty$, (d) holds if

$$\|\mathbb{P}_n f^2 - P f^2\|^*_{\mathcal{F}_\infty} \to_p 0;$$

i.e. if $\mathcal{F}^2_\infty$ is a weak $P$-Glivenko-Cantelli class. But $\mathcal{F}^2_\infty$ has an integrable envelope $(2F)^2$, and is measurable by assumption. Furthermore, the covering number $N(\epsilon \|2F\|_n^2, \mathcal{F}^2_\infty, L_1(\mathbb{P}_n))$ is bounded by the covering number $N(\epsilon \|F\|_n, \mathcal{F}_\infty, L_2(\mathbb{P}_n))$ since, for any pair $f, g \in \mathcal{F}_\infty$ with $\|f - g\|_n \le \epsilon \|F\|_n$,

$$\mathbb{P}_n |f^2 - g^2| = \mathbb{P}_n |f - g||f + g| \le \mathbb{P}_n(|f - g|(4F)) \le \|f - g\|_n \|4F\|_n \le 4\epsilon \|F\|_n^2.$$

By the uniform entropy assumption (a), $N(\epsilon \|F\|_n, \mathcal{F}_\infty, L_2(\mathbb{P}_n))$ is bounded by a fixed number, so its logarithm is certainly $o_p(n)$, as required by the Glivenko-Cantelli Theorem 5.2. Letting $\delta \downarrow 0$ we see that the asymptotic equicontinuity holds.

It remains only to prove that $\mathcal{F}$ is totally bounded in $L_2(P)$. By the result of the previous paragraph, there exists a sequence of discrete measures $P_n$ with $\|P_n f^2 - P f^2\|_{\mathcal{F}_\infty}$ converging to zero. Choose $n$ sufficiently large so that the supremum is bounded by $\epsilon^2$. By assumption $N(\epsilon, \mathcal{F}, L_2(P_n))$ is finite. But an $\epsilon$-net for $\mathcal{F}$ in $L_2(P_n)$ is a $\sqrt{2}\epsilon$-net in $L_2(P)$. Thus $\mathcal{F}$ is $P$-Donsker by Theorem 2.2. □

It will be useful to record the result of the method of proof in terms of a general inequality. For a class of functions $\mathcal{F}$ with envelope function $F$ and $\delta > 0$, let

$$J(\delta, \mathcal{F}) = \sup_Q \int_0^\delta \sqrt{1 + \log N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\epsilon, \tag{6.2}$$

where the supremum is over all discrete probability measures $Q$ with $\|F\|_{Q,2} > 0$. It is clearly true that $J(1, \mathcal{F}) < \infty$ if $\mathcal{F}$ satisfies the uniform entropy condition (6.1).

Theorem 6.2 Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with measurable envelope function $F$. Then, for $p \ge 1$,

$$\big\| \|\mathbb{G}_n\|^*_{\mathcal{F}} \big\|_{P,p} \lesssim \big\| J(\theta_n, \mathcal{F}) \|F\|_n \big\|_{P,p} \lesssim J(1, \mathcal{F}) \|F\|_{P, 2 \vee p}. \tag{6.3}$$

Here $\theta_n = (\sup_{f \in \mathcal{F}} \|f\|_n)^* / \|F\|_n$, where $\|\cdot\|_n$ is the $L_2(\mathbb{P}_n)$-seminorm, and the inequalities are valid up to constants depending only on $p$. In particular, when $p = 1$,

$$E\|\mathbb{G}_n\|^*_{\mathcal{F}} \lesssim E\{J(\theta_n, \mathcal{F}) \|F\|_n\} \lesssim J(1, \mathcal{F}) \|F\|_{P,2}. \tag{6.4}$$

Proof. See van der Vaart and Wellner (1996), page 240. □
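To make the finiteness of $J(1, \mathcal{F})$ concrete, here is a small numerical sketch. It assumes a polynomial (VC-type) bound $\sup_Q N(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \le (A/\epsilon)^V$ for $0 < \epsilon < 1$; this bound, the constants `A` and `V`, and the function names are assumptions made for illustration only. Under that assumption $J(\delta, \mathcal{F}) \le \int_0^\delta \sqrt{1 + V \log(A/\epsilon)} \, d\epsilon$, and the code approximates the right-hand side with a midpoint rule.

```python
import math


def entropy_integrand(eps: float, A: float, V: float) -> float:
    """sqrt(1 + log N) under the assumed bound N <= (A / eps)^V."""
    return math.sqrt(1.0 + V * math.log(A / eps))


def entropy_integral_bound(delta: float, A: float, V: float,
                           steps: int = 100_000) -> float:
    """Midpoint-rule approximation of int_0^delta sqrt(1 + V log(A/eps)) d eps.

    The integrand grows only like sqrt(log(1/eps)) near 0, so the integral
    is finite; the midpoint rule never evaluates the integrand at eps = 0.
    """
    if not (0.0 < delta <= 1.0) or A < math.e or V < 1.0:
        raise ValueError("expected 0 < delta <= 1, A >= e, V >= 1")
    h = delta / steps
    return h * math.fsum(
        entropy_integrand((i + 0.5) * h, A, V) for i in range(steps)
    )


if __name__ == "__main__":
    for delta in (0.1, 0.5, 1.0):
        print(delta, entropy_integral_bound(delta, A=math.e, V=1.0))
```

The logarithmic singularity at zero is integrable, which is why any class with polynomial uniform covering numbers satisfies (6.1), and the bound shrinks to zero with $\delta$.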
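Proposition 6.1 (Le Cam (1981), Giné and Zinn (1984), (1986)) Suppose that $\mathcal{F} \subset L_2(\mathcal{X}, \mathcal{A}, P)$. Suppose that the functions $f$ in $\mathcal{F}$ take values in $[-1, 1]$ and are centered: $Pf = 0$ for all $f \in \mathcal{F}$.

(i) Let $M_n \equiv \sqrt{n} \sup_{f \in \mathcal{F}} P f^2 \equiv \sqrt{n} \sigma^2$ and suppose that $t, \rho$ are positive numbers such that $\lambda \equiv t^{1/2} - 2^{1/2} M_n^{1/2} - 2\rho > 0$. Then

$$\Pr{}^* \Big\{ \Big\| \sum_{i=1}^n f^2(X_i) \Big\|_{\mathcal{F}} > t n^{1/2} \Big\} \le E^* \Big[ 1 \wedge 8 N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \exp(-\lambda^2 n^{1/2}/4) \Big].$$

This implies that for all $v \ge \sqrt{47} \sigma > 2(2 + 2^{1/2}) \sigma$,

$$\Pr{}^* \Big\{ \big\| (\mathbb{P}_n f^2)^{1/2} \big\|_{\mathcal{F}} > v \Big\} \le E^* \Big[ 1 \wedge 8 N(\sigma, \mathcal{F}, L_2(\mathbb{P}_n)) \exp(-v^2 n/16) \Big].$$

(ii) In particular, if $\sigma^2$ is any number satisfying $\sup_{f \in \mathcal{F}} P f^2 \le \sigma^2 \le P F^2$ and $\mathcal{F}$ satisfies

$$N(\epsilon, \mathcal{F}, L_2(Q)) \le \Big( \frac{A \|F\|_{Q,2}}{\epsilon} \Big)^V, \qquad 0 < \epsilon < \|F\|_{Q,2},$$

for some $A \ge e$ and $V \ge 1$, then, for all $t \ge 47 n \sigma^2 > 4(2 + 2^{1/2})^2 n \sigma^2$,

$$\Pr{}^* \Big\{ \Big\| \sum_{i=1}^n f^2(X_i) \Big\|_{\mathcal{F}} > t \Big\} \le E^* \Big[ 1 \wedge 8 \Big( \frac{A \|F\|_{Q,2}}{\sigma} \Big)^V \exp(-t/16) \Big].$$

Proof. Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher random variables that are independent of the $X_i$'s. Set

$$S_+(f) = \sum_{\{i \le n : \varepsilon_i = 1\}} f^2(X_i), \qquad S_-(f) = \sum_{\{i \le n : \varepsilon_i = -1\}} f^2(X_i).$$

From the above definition, it follows that

$$S_+(f) = \sum_{i=1}^n \frac{\varepsilon_i + 1}{2} f^2(X_i), \qquad S_-(f) = \sum_{i=1}^n \frac{1 - \varepsilon_i}{2} f^2(X_i).$$

Then $S_+(f)$ and $S_-(f)$ have the same distribution and are conditionally independent given $\{\varepsilon_i\}_{i=1}^n$. Moreover,

$$S_+(f) - S_-(f) = \sum_{i=1}^n \varepsilon_i f^2(X_i), \qquad S_+(f) + S_-(f) = \sum_{i=1}^n f^2(X_i),$$

and, recalling the definition of $M_n$,

$$E\big[S_-^{1/2}(f)\big]^2 = E\{S_-(f)\} = \frac{1}{2} n P f^2 \le \frac{1}{2} n^{1/2} M_n.$$

By the triangle inequality for the Euclidean distance in $\mathbb{R}^n$ and observing that $\sqrt{a} + \sqrt{b} \le \sqrt{2} \sqrt{a + b}$, we have

$$\Big| S_+^{1/2}(f) - S_-^{1/2}(f) - \big( S_+^{1/2}(g) - S_-^{1/2}(g) \big) \Big| \le \Big\{ \sum_{i=1}^n \frac{\varepsilon_i + 1}{2} \big( f(X_i) - g(X_i) \big)^2 \Big\}^{1/2} + \Big\{ \sum_{i=1}^n \frac{1 - \varepsilon_i}{2} \big( f(X_i) - g(X_i) \big)^2 \Big\}^{1/2}$$

$$\le \sqrt{2} \Big\{ \sum_{i=1}^n \big( f(X_i) - g(X_i) \big)^2 \Big\}^{1/2} = \sqrt{2} \sqrt{n} \big\{ \mathbb{P}_n (f - g)^2 \big\}^{1/2}.$$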
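Hence it follows that

$$\Pr\Big\{ \Big\| \sum_{i=1}^n f^2(X_i) \Big\|_{\mathcal{F}} > t n^{1/2} \Big\} = \Pr\big\{ \|S_+(f) + S_-(f)\|_{\mathcal{F}} > t n^{1/2} \big\} \le 2 \Pr\big\{ \|S_+^{1/2}(f)\|_{\mathcal{F}} > t^{1/2} n^{1/4} / 2^{1/2} \big\}$$

$$\le 4 E_\varepsilon P_X \big\{ \|S_+^{1/2}(f) - S_-^{1/2}(f)\|_{\mathcal{F}} > t^{1/2} n^{1/4} / 2^{1/2} - M_n^{1/2} n^{1/4} \big\} = 4 E_X P_\varepsilon \big\{ \|S_+^{1/2}(f) - S_-^{1/2}(f)\|_{\mathcal{F}} > (t^{1/2} - 2^{1/2} M_n^{1/2}) n^{1/4} / 2^{1/2} \big\}. \tag{6.5}$$

To get the second inequality we have used the symmetrization lemma for probabilities, applying Chebyshev's inequality to calculate $\beta_n$.

Suppose that $\mathcal{F}_{\rho/n^{1/4}}$ is a finite subset of $\mathcal{F}$ that is $\rho/n^{1/4}$-dense with respect to $L_2(\mathbb{P}_n)$. It follows that $\mathcal{F}_{\rho/n^{1/4}}$ can be chosen to be of cardinality $N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n))$. So we have

$$P_\varepsilon \big\{ \|S_+^{1/2}(f) - S_-^{1/2}(f)\|_{\mathcal{F}} > (t^{1/2} - 2^{1/2} M_n^{1/2}) n^{1/4} / 2^{1/2} \big\}$$

$$\le N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \big\{ |S_+^{1/2}(f) - S_-^{1/2}(f)| > (t^{1/2} - 2^{1/2} M_n^{1/2} - 2\rho) n^{1/4} / 2^{1/2} \big\}$$

$$= N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \Big\{ \frac{|S_+(f) - S_-(f)|}{S_+^{1/2}(f) + S_-^{1/2}(f)} > (t^{1/2} - 2^{1/2} M_n^{1/2} - 2\rho) n^{1/4} / 2^{1/2} \Big\}. \tag{6.6}$$

Making use of the inequality $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ in Eq. (6.6), and setting $\lambda \equiv t^{1/2} - 2^{1/2} M_n^{1/2} - 2\rho$, we obtain

$$N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \Big\{ \frac{|S_+(f) - S_-(f)|}{S_+^{1/2}(f) + S_-^{1/2}(f)} > \lambda n^{1/4} / 2^{1/2} \Big\} \le N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \Big\{ \frac{|\sum_{i=1}^n \varepsilon_i f^2(X_i)|}{\{\sum_{i=1}^n f^2(X_i)\}^{1/2}} > \lambda n^{1/4} / 2^{1/2} \Big\}. \tag{6.7}$$

From Eq. (6.7) it follows that

$$N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \Big\{ \frac{|\sum_{i=1}^n \varepsilon_i f^2(X_i)|}{\{\sum_{i=1}^n f^2(X_i)\}^{1/2}} > \lambda n^{1/4} / 2^{1/2} \Big\} \le N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \sup_{f \in \mathcal{F}_{\rho/n^{1/4}}} P_\varepsilon \Big\{ \frac{|\sum_{i=1}^n \varepsilon_i f^2(X_i)|}{\{\sum_{i=1}^n f^4(X_i)\}^{1/2}} > \lambda n^{1/4} / 2^{1/2} \Big\}$$

$$\le N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \cdot 2 \exp\Big( -\frac{\lambda^2 n^{1/2}}{4} \Big), \tag{6.8}$$

where we have used the hypothesis $|f| \le 1$ to get the first inequality, and Hoeffding's inequality for Eq. (6.8). From Eqs. (6.6), (6.7), (6.8) we have

$$P_\varepsilon \big\{ \|S_+^{1/2}(f) - S_-^{1/2}(f)\|_{\mathcal{F}} > (t^{1/2} - 2^{1/2} M_n^{1/2}) n^{1/4} / 2^{1/2} \big\} \le N(\rho/n^{1/4}, \mathcal{F}, L_2(\mathbb{P}_n)) \cdot 2 \exp\Big( -\frac{\lambda^2 n^{1/2}}{4} \Big).$$

Combining this result with Eq. (6.5), we finally obtain the first conclusion of point (i).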
To obtain the second conclusion, take $\rho/n^{1/4} = \sigma$ and $t = v^2 \sqrt{n}$, so that $\lambda = n^{1/4}\{v - (2 + \sqrt{2})\sigma\}$. So, for $v \ge 2(2 + \sqrt{2})\sigma$, it follows that

$$\frac{\lambda^2 n^{1/2}}{4} = \frac{n}{4}\{v - (2 + \sqrt{2})\sigma\}^2 \ge \frac{v^2 n}{16}.$$

Part (ii) of the proposition follows easily from the first. □

6.2 Bracketing Entropy Donsker Theorems

The second main empirical central limit theorem uses bracketing entropy rather than uniform entropy. The simplest version of this theorem is due to Ossiander (1987).

Theorem 6.3 Suppose that $\mathcal{F}$ is a class of measurable functions satisfying

$$\int_0^\infty \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))} \, d\epsilon < \infty. \tag{6.9}$$

Then $\mathcal{F}$ is $P$-Donsker.

We will actually prove a more general result from van der Vaart and Wellner (1996). The finiteness of the $L_2(P)$-bracketing integral implies that $P^* F^2 < \infty$ for an envelope function $F$. This condition is not necessary for a class $\mathcal{F}$ to be Donsker. On the contrary, we know that every $P$-Donsker class $\mathcal{F}$ satisfies

$$P(\|f - Pf\|^*_{\mathcal{F}} > x) = o(x^{-2})$$

as $x$ tends to infinity. Consequently, if $\|Pf\|_{\mathcal{F}} < \infty$, then $\mathcal{F}$ possesses an envelope function $F$ with a weak second moment (meaning that $P(F > x) = o(x^{-2})$ as $x \to \infty$). Similarly, the $L_2(P)$-norm used in the brackets can be replaced by a weaker norm that makes the bracketing numbers smaller and the convergence of the integral easier. So, we define the $L_{2,\infty}(P)$-norm of a function $f$ as

$$\|f\|_{P,2,\infty} = \sup_{x > 0} \{x^2 P(|f(X)| > x)\}^{1/2}.$$

Actually this is not a norm because it does not satisfy the triangle inequality. However, it can be shown that there exists a norm equivalent to $\|\cdot\|_{2,\infty}$ up to a constant multiple. Note that $\|f\|_{P,2,\infty} \le \|f\|_{P,2}$, so that the bracketing numbers relative to $L_{2,\infty}(P)$ are smaller.

Theorem 6.4 Let $\mathcal{F}$ be a class of measurable functions such that

$$\int_0^\infty \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, L_{2,\infty}(P))} \, d\epsilon + \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P))} \, d\epsilon < \infty. \tag{6.10}$$

Suppose also that the envelope function $F$ of $\mathcal{F}$ has a weak second moment:

$$\lim_{x \to \infty} x^2 P(F(X) > x) = 0.$$

Then $\mathcal{F}$ is $P$-Donsker.

Proof.
The following proof is from van der Vaart and Wellner (1996). For each natural number $q$, there exists a partition $\{\mathcal{F}_{qi}\}_{i=1}^{N_q}$ of $\mathcal{F}$ into $N_q$ disjoint subsets such that

$$\text{(a)} \quad \sum_{q \ge 1} 2^{-q} \sqrt{\log N_q} < \infty, \qquad \text{(b)} \quad \Big\| \Big( \sup_{f, g \in \mathcal{F}_{qi}} |f - g| \Big)^* \Big\|_{P,2,\infty} < 2^{-q}, \qquad \text{(c)} \quad \sup_{f, g \in \mathcal{F}_{qi}} \|f - g\|_{P,2} < 2^{-q}.$$

To see this, first cover $\mathcal{F}$ separately with the minimal numbers of $L_2(P)$-balls and $L_{2,\infty}(P)$-brackets of size $2^{-q}$, disjointify, and then take the intersection of the two partitions. If $N_q^1$ and $N_q^2$ are the numbers of sets in the two partitions, the total number of sets in the new partition will be $N_q = N_q^1 N_q^2$. Noting that

$$\sqrt{\log N_q} = \sqrt{\log N_q^1 + \log N_q^2} \le \sqrt{\log N_q^1} + \sqrt{\log N_q^2},$$

condition (a) holds if it is satisfied for both $N_q^1$ and $N_q^2$. Conditions (b) and (c) follow from how the partition is constructed. The sequence of partitions can, without loss of generality, be chosen as successive refinements. Indeed, first construct a sequence of partitions $\{\bar{\mathcal{F}}_{qi}\}_{i=1}^{\bar{N}_q}$ ($q = 1, 2, \ldots$), $\mathcal{F} = \bigcup_{i=1}^{\bar{N}_q} \bar{\mathcal{F}}_{qi}$, possibly without this property. Then take the partition at stage $q$ consisting of the intersections of the form $\bigcap_{p=1}^q \bar{\mathcal{F}}_{p i_p}$. So, we obtain partitions into $N_q = \bar{N}_1 \cdots \bar{N}_q$ sets. Noting that (see Exercise 6.1)

$$\sum_{q=1}^\infty 2^{-q} \sqrt{\log N_q} \le 2 \sum_{p=1}^\infty 2^{-p} \sqrt{\log \bar{N}_p},$$

we conclude that condition (a) continues to hold.

Now for each $q$, we choose a fixed element $f_{qi}$ from each partitioning set $\mathcal{F}_{qi}$, and define

$$\text{(d)} \qquad \pi_q f = f_{qi} \quad \text{if } f \in \mathcal{F}_{qi}, \qquad \Delta_q f = \Big( \sup_{g, h \in \mathcal{F}_{qi}} |h - g| \Big)^* \quad \text{if } f \in \mathcal{F}_{qi}.$$

By this definition, if $f$ runs through $\mathcal{F}$, $\pi_q f$ and $\Delta_q f$ run through a set of just $N_q$ functions. Recalling Theorem 2.3, to conclude that $\mathcal{F}$ is a $P$-Donsker class, it suffices to show that $\|\mathbb{G}_n(f - \pi_{q_0} f)\|^*_{\mathcal{F}} \to_P 0$ as $n \to \infty$ followed by $q_0 \to \infty$.

For each fixed $n$ and $q \ge q_0$ define truncation levels $a_q$ and indicator functions $A_q f$, $B_q f$:

$$a_q = 2^{-q} / \sqrt{\log N_{q+1}},$$

$$A_{q-1} f = 1\{\Delta_{q_0} f \le \sqrt{n} a_{q_0}, \ldots, \Delta_{q-1} f \le \sqrt{n} a_{q-1}\},$$

$$B_q f = A_{q-1} f \, 1\{\Delta_q f > \sqrt{n} a_q\},$$

$$B_{q_0} f = 1\{\Delta_{q_0} f > \sqrt{n} a_{q_0}\}.$$
Since the partitions are nested, $A_q f$ and $B_q f$ are constant in $f$ on each set $\mathcal{F}_{qi}$ of the partition at level $q$. Now, consider the following decomposition (pointwise in $x$):

$$\text{(e)} \qquad f - \pi_{q_0} f = (f - \pi_{q_0} f) B_{q_0} f + \sum_{q=q_0+1}^\infty (f - \pi_q f) B_q f + \sum_{q=q_0+1}^\infty (\pi_q f - \pi_{q-1} f) A_{q-1} f,$$

based on the idea of writing

$$f - \pi_{q_0} f = (f - \pi_{q_1} f) + \sum_{q=q_0+1}^{q_1} (\pi_q f - \pi_{q-1} f)$$

for the largest $q_1 = q_1(f, x)$ such that each link $|\pi_q f - \pi_{q-1} f|$ is bounded by $\sqrt{n} a_q$ (note that $|\pi_q f - \pi_{q-1} f| \le \Delta_{q-1} f$).

To obtain decomposition (e) rigorously, note that for the indicator functions $B_q f$ there are only two possible cases:

(i) $B_q f = 0$ for all $q$;

(ii) there is a unique $q = q_1$ such that $B_{q_1} f = 1$.

In case (i), since $B_q f = 0$ for all $q$, we have $A_q f = 1$ for all $q$. So, on the right side of decomposition (e), the first two terms vanish and the third is an infinite series, $\sum_{q=q_0+1}^\infty (\pi_q f - \pi_{q-1} f)$, whose $q$-th partial sum telescopes to $\pi_q f - \pi_{q_0} f$ and converges, by definition of $A_q f$, to $f - \pi_{q_0} f$, i.e. the left side. In case (ii), the condition $B_{q_1} f = 1$ yields $A_{q-1} f = 1$ if and only if $q \le q_1$. So, decomposition (e) immediately follows.

Now apply the empirical process $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ to each of the three terms of decomposition (e) separately, and take the suprema over $f \in \mathcal{F}$ for each term. It will be shown that the resulting three terms converge to zero in probability as $n \to \infty$ followed by $q_0 \to \infty$.

For the first term, we have

$$|f - \pi_{q_0} f| B_{q_0} f \le 2F \, 1\{2F > \sqrt{n} a_{q_0}\},$$

so that

$$\|\mathbb{G}_n (f - \pi_{q_0} f) B_{q_0} f\|_{\mathcal{F}} \le \sqrt{n} (\mathbb{P}_n + P) \, 2F \, 1\{2F > \sqrt{n} a_{q_0}\}. \tag{6.11}$$

From Eq. (6.11), taking expected values, we finally obtain

$$E^* \|\mathbb{G}_n (f - \pi_{q_0} f) B_{q_0} f\|_{\mathcal{F}} \le 4 \sqrt{n} P^* F \, 1\{2F > \sqrt{n} a_{q_0}\}. \tag{6.12}$$

Recalling that each random variable $X$ with a weak second moment satisfies $E|X| 1\{|X| > t\} = o(t^{-1})$ as $t \to \infty$ (see Exercise 6.2), and that, by hypothesis, the envelope function $F$ has a weak second moment, we conclude that the right side of Eq. (6.12) converges to zero as $n \to \infty$, for each fixed $q_0$.
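To study the second and third terms, note that for a fixed bounded function $f$, Bernstein's inequality yields

$$P(|\mathbb{G}_n(f)| > x) \le 2 \exp\Big( -\frac{1}{2} \frac{x^2}{P f^2 + (1/3) \|f\|_\infty x / \sqrt{n}} \Big).$$

So, by virtue of Proposition 3.2, for any finite set $\mathcal{F}$ with cardinality at least 2, we have

$$\text{(f)} \qquad E\|\mathbb{G}_n(f)\|_{\mathcal{F}} \lesssim \max_f \frac{\|f\|_\infty}{\sqrt{n}} \log |\mathcal{F}| + \max_f \|f\|_{P,2} \sqrt{\log |\mathcal{F}|},$$

i.e., the left side $E\|\mathbb{G}_n(f)\|_{\mathcal{F}}$ is bounded by a constant times the right side of (f).

We begin by studying the second term. Since the partitions are nested, it follows that $\Delta_q f B_q f \le \Delta_{q-1} f B_q f$. Moreover, for any non-negative random variable $X$ we have the following inequalities (see Exercise 6.3):

$$\|X\|^2_{2,\infty} \le \sup_{t > 0} t \, E X 1\{X > t\} \le 2 \|X\|^2_{2,\infty}. \tag{6.13}$$

So, making use of Eq. (6.13) and condition (b), we obtain

$$\sqrt{n} a_q P(\Delta_q f B_q f) = \sqrt{n} a_q P(\Delta_q f A_{q-1} f 1\{\Delta_q f > \sqrt{n} a_q\}) \le \sqrt{n} a_q P(\Delta_q f 1\{\Delta_q f > \sqrt{n} a_q\}) \le 2 \|\Delta_q f\|^2_{2,\infty} \le 2 (2^{-q})^2 = 2 \cdot 2^{-2q}. \tag{6.14}$$

Moreover, for $q > q_0$, we have $\Delta_q f B_q f \le \Delta_{q-1} f B_q f \le \sqrt{n} a_{q-1}$, so that, by Eq. (6.14),

$$P(\Delta_q f B_q f)^2 \le \sqrt{n} a_{q-1} P(\Delta_q f 1\{\Delta_q f > \sqrt{n} a_q\}) \le 2 \frac{a_{q-1}}{a_q} 2^{-2q}. \tag{6.15}$$

Note that

$$\|\mathbb{G}_n (f - \pi_q f) B_q f\|_{\mathcal{F}} = \|\sqrt{n} (\mathbb{P}_n - P)(f - \pi_q f) B_q f\|_{\mathcal{F}} \le \|\sqrt{n} (\mathbb{P}_n + P) \Delta_q f B_q f\|_{\mathcal{F}} = \|\sqrt{n} (\mathbb{P}_n + P + P - P) \Delta_q f B_q f\|_{\mathcal{F}}$$

$$\le \|\sqrt{n} (\mathbb{P}_n - P) \Delta_q f B_q f\|_{\mathcal{F}} + 2 \sqrt{n} \|P \Delta_q f B_q f\|_{\mathcal{F}} = \|\mathbb{G}_n \Delta_q f B_q f\|_{\mathcal{F}} + 2 \sqrt{n} \|P \Delta_q f B_q f\|_{\mathcal{F}}. \tag{6.16}$$

Taking the expectation of each side of Eq. (6.16) and making use of Eqs. (6.14), (6.15) and condition (f), we finally obtain

$$E^* \Big\| \sum_{q_0+1}^\infty \mathbb{G}_n (f - \pi_q f) B_q f \Big\|_{\mathcal{F}} \le \sum_{q_0+1}^\infty \Big[ E^* \|\mathbb{G}_n \Delta_q f B_q f\|_{\mathcal{F}} + 2 \sqrt{n} \|P \Delta_q f B_q f\|_{\mathcal{F}} \Big] \lesssim \sum_{q_0+1}^\infty \Big[ a_{q-1} \log N_q + \sqrt{\frac{a_{q-1}}{a_q}} \, 2^{-q} \sqrt{\log N_q} + 2 \sqrt{n} \frac{2 \cdot 2^{-2q}}{\sqrt{n} a_q} \Big].$$

Since $a_q$ is decreasing, the square root of the ratio $a_{q-1}/a_q$ in the second term of the previous display can be replaced by the ratio itself, so that

$$E^* \Big\| \sum_{q_0+1}^\infty \mathbb{G}_n (f - \pi_q f) B_q f \Big\|_{\mathcal{F}} \lesssim \sum_{q_0+1}^\infty \Big[ a_{q-1} \log N_q + \frac{a_{q-1}}{a_q} 2^{-q} \sqrt{\log N_q} + \frac{4}{a_q} 2^{-2q} \Big] = \sum_{q_0+1}^\infty \Big[ 2^{-(q-1)} \sqrt{\log N_q} + 2^{-(q-1)} \sqrt{\log N_{q+1}} + 4 \cdot 2^{-q} \sqrt{\log N_{q+1}} \Big]. \tag{6.17}$$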
The series (6.17) can be bounded by a multiple of $\sum_{q_0+1}^\infty 2^{-q} \sqrt{\log N_q}$; this upper bound is independent of $n$ and converges to zero as $q_0 \to \infty$. We conclude that the series (6.17) converges to zero as $q_0 \to \infty$.

Finally, we have to analyse the third term of decomposition (e). Since the partitions are nested, it follows that

$$|\pi_q f - \pi_{q-1} f| A_{q-1} f \le \Delta_{q-1} f A_{q-1} f \le \sqrt{n} a_{q-1}. \tag{6.18}$$

Moreover, since $\pi_q f$ and $\pi_{q-1} f$ belong to the same partitioning set at level $q - 1$, condition (c) gives $\|\pi_q f - \pi_{q-1} f\|_{P,2} \le 2^{-(q-1)}$, and we obtain

$$P |\pi_q f - \pi_{q-1} f|^2 \le \big( 2^{-(q-1)} \big)^2 = 4 \cdot 2^{-2q}. \tag{6.19}$$

Noting that there are at most $N_q$ functions $\pi_q f - \pi_{q-1} f$ and at most $N_{q-1}$ functions $A_{q-1} f$, and making use of condition (f) and Eqs. (6.18), (6.19), we thus have

$$E^* \Big\| \sum_{q_0+1}^\infty \mathbb{G}_n (\pi_q f - \pi_{q-1} f) A_{q-1} f \Big\|_{\mathcal{F}} \lesssim \sum_{q_0+1}^\infty \Big[ a_{q-1} \log N_q + 2^{-q} \sqrt{\log N_q} \Big]. \tag{6.20}$$

Again this upper bound (6.20) is independent of $n$ and converges to zero as $q_0 \to \infty$. This completes the proof. □

In the following we derive bounds on the expected value of $\|\mathbb{G}_n f\|_{\mathcal{F}}$ for classes $\mathcal{F}$ that possess a finite bracketing entropy integral (6.9). More generally, for a given norm $\|\cdot\|$, we can define the bracketing integral of a class of functions $\mathcal{F}$ by

$$J_{[\,]}(\delta, \mathcal{F}, \|\cdot\|) = \int_0^\delta \sqrt{1 + \log N_{[\,]}(\epsilon \|F\|, \mathcal{F}, \|\cdot\|)} \, d\epsilon.$$

The basic bracketing maximal inequality uses the $L_2(P)$-norm.

Theorem 6.5 Let $\mathcal{F}$ be a class of measurable functions with measurable envelope function $F$. For a given $\eta > 0$, set

$$a(\eta) = \frac{\eta \|F\|_{P,2}}{\sqrt{1 + \log N_{[\,]}(\eta \|F\|_{P,2}, \mathcal{F}, L_2(P))}}.$$

Then, for every $\eta > 0$,

$$E^* \|\mathbb{G}_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(\eta, \mathcal{F}, L_2(P)) \|F\|_{P,2} + \sqrt{n} P F 1\{F > \sqrt{n} a(\eta)\} + \big\| \|f\|_{P,2} \big\|_{\mathcal{F}} \sqrt{1 + \log N_{[\,]}(\eta \|F\|_{P,2}, \mathcal{F}, L_2(P))}.$$

If $\|f\|_{P,2} < \delta \|F\|_{P,2}$ for every $f \in \mathcal{F}$, then taking $\eta = \delta$ in the last display yields

$$E^* \|\mathbb{G}_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(\delta, \mathcal{F}, L_2(P)) \|F\|_{P,2} + \sqrt{n} P F 1\{F > \sqrt{n} a(\delta)\}.$$

Hence, for any class $\mathcal{F}$,

$$E^* \|\mathbb{G}_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(1, \mathcal{F}, L_2(P)) \|F\|_{P,2}.$$

6.3 Donsker Theorem for Classes Changing with Sample Size

The Glivenko-Cantelli and Donsker theorems concern the empirical process for different $n$, but each time with the same indexing class $\mathcal{F}$.
This is sufficient for a large number of applications, but in other cases it may be necessary to allow the class $\mathcal{F}$ to change with $n$. Suppose that $\mathcal{F}_n$ is a sequence of classes of measurable functions $f_{n,t} : \mathcal{X} \to \mathbb{R}$ indexed by a parameter $t$ which belongs to a common index set $T$, i.e. $\mathcal{F}_n = \{f_{n,t} : t \in T\}$. We want to study the weak convergence of the stochastic processes
\[
Z_n(t) = \mathbb{G}_n f_{n,t} \tag{6.21}
\]
as elements of $\ell^\infty(T)$. We know that weak convergence is equivalent to marginal convergence and asymptotic tightness. The marginal convergence to a Gaussian process follows under the conditions of the Lindeberg theorem; sufficient conditions for tightness can be given in terms of the entropies of the classes $\mathcal{F}_n$.

We shall assume that there is a semimetric $\rho$ on the index set $T$ for which $(T, \rho)$ is totally bounded and which relates to the $L_2$-metric in that
\[
\sup_{\rho(s,t) < \delta_n} P(f_{n,s} - f_{n,t})^2 \to 0 \quad \text{for every } \delta_n \downarrow 0. \tag{6.22}
\]
Furthermore, we suppose that the classes $\mathcal{F}_n$ possess envelope functions $F_n$ that satisfy the Lindeberg condition
\[
P F_n^2 = O(1), \qquad P F_n^2 1\{F_n > \varepsilon\sqrt{n}\} \to 0 \quad \text{for every } \varepsilon > 0. \tag{6.23}
\]
The other hypothesis needed is the control on entropy: we can use either bracketing or uniform entropy. However, it will be convenient to formulate the hypothesis in terms of the modified bracketing entropy integral:
\[
\tilde{J}_{[\,]}(\delta, \mathcal{F}, \|\cdot\|) = \int_0^{\delta} \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)}\;d\varepsilon.
\]

Theorem 6.6 Let $\mathcal{F}_n = \{f_{n,t} : t \in T\}$ be a class of measurable functions indexed by $(T, \rho)$, which is totally bounded. Suppose that conditions (6.22) and (6.23) hold. If either $\tilde{J}_{[\,]}(\delta_n, \mathcal{F}_n, L_2(P)) \to 0$ for every $\delta_n \downarrow 0$, or $J(\delta_n, \mathcal{F}_n, L_2) \to 0$ for every $\delta_n \downarrow 0$ and all the classes $\mathcal{F}_n$ are $P$-measurable, then the processes $\{Z_n(t) : t \in T\}$ defined by (6.21) converge weakly to a tight Gaussian process $Z$, provided that the sequence of covariance functions $K_n(s,t) = P(f_{n,s} f_{n,t}) - P(f_{n,s})\,P(f_{n,t})$ converges pointwise on $T \times T$.
If $K(s,t)$, $s, t \in T$, denotes the limit of the covariance functions, then it is a covariance function and the limit process $Z$ is a mean zero Gaussian process with covariance function $K$.

Proof. We give the proof under the bracketing entropy condition. For every $\delta > 0$, we can use the semimetric $\rho$ and condition (6.22) to partition $T$ into finitely many sets $T_1, \dots, T_k$ such that, for every sufficiently large $n$,
\[
\max_{1 \le i \le k}\ \sup_{s,t \in T_i} P(f_{n,s} - f_{n,t})^2 < \delta^2.
\]
Next we apply Theorem 6.5 to obtain
\[
E \max_{1 \le i \le k}\ \sup_{s,t \in T_i} |\mathbb{G}_n(f_{n,s} - f_{n,t})| \lesssim \tilde{J}_{[\,]}(\delta, \tilde{\mathcal{F}}_n, L_2(P)) + \frac{P\tilde{F}_n^2 1\{\tilde{F}_n > \tilde{a}_n(\delta)\sqrt{n}\}}{\tilde{a}_n(\delta)}, \tag{6.24}
\]
where $\tilde{a}_n(\delta)$ is the number $a(\delta/\|\tilde{F}_n\|_{P,2})$ of Theorem 6.5 evaluated for the class of functions $\tilde{\mathcal{F}}_n = \mathcal{F}_n - \mathcal{F}_n$ with envelope $\tilde{F}_n$:
\[
\tilde{a}_n(\delta) = \frac{\delta}{\sqrt{1 + \log N_{[\,]}(\delta, \tilde{\mathcal{F}}_n, L_2(P))}}.
\]
The number $\tilde{a}_n(\delta)$ can be bounded below, up to constants, by the corresponding number $a_n(\delta)$ for the class $\mathcal{F}_n$ and its envelope, i.e.
\[
a_n(\delta) = \frac{\delta}{\sqrt{1 + 2\log N_{[\,]}(\delta/2, \mathcal{F}_n, L_2(P))}}.
\]
Because $\tilde{J}_{[\,]}(\delta_n, \mathcal{F}_n, L_2(P)) \to 0$ for every $\delta_n \downarrow 0$, we must have $\tilde{J}_{[\,]}(\delta, \mathcal{F}_n, L_2(P)) = O(1)$ for every $\delta > 0$, and hence $a_n(\delta)$ is bounded away from zero. Consequently, the number $\tilde{a}_n(\delta)$ is also bounded away from zero and so, by the Lindeberg condition (6.23), the second term on the right side of Eq. (6.24) converges to zero as $n \to \infty$, for every fixed $\delta > 0$. The first term on the right side of Eq. (6.24) can be made arbitrarily small as $n \to \infty$ by choosing $\delta$ sufficiently small. This shows that asymptotic equicontinuity holds. Convergence of the finite dimensional distributions follows from the Lindeberg condition (6.23) and from the hypothesized convergence of the covariance functions. □

6.4 Universal and Uniform Donsker Classes

If $\mathcal{F}$ is $P$-Donsker for all probability measures $P$ on $(\mathcal{X}, \mathcal{A})$, then we say that $\mathcal{F}$ is a universal Donsker class.
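Moreover, denoting by $\mathcal{P} = \mathcal{P}(\mathcal{X}, \mathcal{A})$ the set of all probability measures on the measurable space $(\mathcal{X}, \mathcal{A})$, we call $\mathcal{F}$ a uniform Donsker class if
\[
\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} d^*_{BL}(\mathbb{G}_{n,P}, \mathbb{G}_P) \to 0 \quad \text{as } n \to \infty;
\]
here $d^*_{BL}$ is the dual bounded Lipschitz metric
\[
d^*_{BL}(\mathbb{G}_{n,P}, \mathbb{G}_P) = \sup_{H \in BL_1} |E^* H(\mathbb{G}_{n,P}) - E H(\mathbb{G}_P)|,
\]
and $BL_1$ is the collection of all functions $H : \ell^\infty(\mathcal{F}) \to \mathbb{R}$ which are uniformly bounded by 1 and satisfy $|H(z_1) - H(z_2)| \le \|z_1 - z_2\|_{\mathcal{F}}$.

We call $\mathcal{F}$ a bounded Donsker class if
\[
\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} = O_P(1). \tag{6.25}
\]
If the envelope function $F$ of the class $\mathcal{F}$ is such that $\sup_x F(x) < \infty$, then, using the Hoffmann-Jørgensen inequality, it can be shown (see Exercise 6.4) that condition (6.25) is equivalent to
\[
\limsup_{n \to \infty} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} < \infty. \tag{6.26}
\]
If condition (6.25) holds for every $P \in \mathcal{P}$, we say that $\mathcal{F}$ is a universal bounded Donsker class. Similarly, if
\[
\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} < \infty, \tag{6.27}
\]
we say that $\mathcal{F}$ is a uniform bounded Donsker class.

Theorem 6.7 Let $\mathcal{C}$ be a countable class of sets in $\mathcal{X}$ satisfying the universal bounded Donsker class property:
\[
\lim_{M \to \infty}\ \limsup_{n \to \infty} P^*\{\|\mathbb{G}_{n,P}\|_{\mathcal{C}} > M\} = 0 \quad \text{for all } P \in \mathcal{P}.
\]
Then $\mathcal{C}$ is a VC-class.

Proof. Applying the Hoffmann-Jørgensen and symmetrization inequalities we obtain, respectively, the following two relations:
\[
\sup_n \sqrt{n}\,E\|\mathbb{P}_n - P\|_{\mathcal{C}} < \infty, \tag{6.28}
\]
\[
E\Bigl\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(1_C(X_i) - PC\bigr)\Bigr\|_{\mathcal{C}} \le 2E\|\mathbb{P}_n - P\|_{\mathcal{C}}. \tag{6.29}
\]
From Eqs. (6.28) and (6.29) it follows that
\[
\sqrt{n}\,E\Bigl\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(1_C(X_i) - PC\bigr)\Bigr\|_{\mathcal{C}} \le 2\sqrt{n}\,E\|\mathbb{P}_n - P\|_{\mathcal{C}} < \infty. \tag{6.30}
\]
From Eq. (6.30), we have
\[
\frac{1}{\sqrt{n}}\,E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \le \frac{1}{\sqrt{n}}\,E\Bigl|\sum_{i=1}^n \varepsilon_i\Bigr| + 2\sqrt{n}\,E\|\mathbb{P}_n - P\|_{\mathcal{C}}, \tag{6.31}
\]
so that, for the Rademacher complexity of $\mathcal{C}$ at $P$,
\[
R(P) \equiv \sup_n \frac{1}{\sqrt{n}}\,E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}},
\]
we obtain the bound
\[
R(P) \le \sup_n \frac{1}{\sqrt{n}}\,E\Bigl|\sum_{i=1}^n \varepsilon_i\Bigr| + \sup_n 2\sqrt{n}\,E\|\mathbb{P}_n - P\|_{\mathcal{C}} \le \sup_n 2\sqrt{n}\,E\|\mathbb{P}_n - P\|_{\mathcal{C}} + \sqrt{2\pi} < \infty,
\]
where we used Hoeffding's inequality at the last step. Thus,
\[
R(P) < \infty \quad \forall P. \tag{6.32}
\]
We now show that there exists a constant $M < \infty$ such that
\[
R(P) \le M \quad \forall P. \tag{6.33}
\]
To this end, we consider two measures $P^0$ and $P^1$ on $(\mathcal{X}, \mathcal{A})$ and define $P = \alpha P^0 + (1 - \alpha) P^1$. We want to show that $R(P) \ge \alpha R(P^0)$. Let $X_i^0$ and $X_i^1$ be i.i.d. $P^0$ and $P^1$ respectively, and let $\lambda_i$ be i.i.d. Bernoulli random variables with parameter $1 - \alpha$, independent of the $X_i^0$'s and $X_i^1$'s. With these definitions we have $X_i =_d X_i^{\lambda_i}$ and, by the contraction principle,
\[
E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i^0) 1_{[\lambda_i = 0]}\Bigr\|_{\mathcal{C}}.
\]
From this inequality, using Jensen's inequality, we obtain
\[
E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge \alpha\,E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i^0)\Bigr\|_{\mathcal{C}},
\]
and hence $R(P) \ge \alpha R(P^0)$.

Now suppose that Eq. (6.33) is false. Then there exists a sequence of measures $P_k$ on $(\mathcal{X}, \mathcal{A})$ such that $R(P_k) \ge 4^k$ for every $k$. Defining $P$ as
\[
P = \sum_{j=1}^{\infty} 2^{-j} P_j = 2^{-k} P_k + (1 - 2^{-k}) \sum_{j \ne k} \frac{2^{-j} P_j}{1 - 2^{-k}},
\]
we find $R(P) \ge 2^{-k} R(P_k) \ge 2^{-k} 2^{2k} = 2^k$ for every $k$, and this yields $R(P) = \infty$, contradicting (6.32). Thus condition (6.33) holds.

Now suppose that $\mathcal{C}$ is not VC. Then, for every $k$ there is a set $A = A_k = \{x_1, \dots, x_k\} \subset \mathcal{X}$ such that $\mathcal{C}$ shatters $A$, i.e. $\#\{C \cap A : C \in \mathcal{C}\} = 2^k$. Then for each $\alpha \in \mathbb{R}^k$ we have
\[
\sum_{i=1}^k |\alpha_i| = \sum_{i=1}^k \alpha_i^+ + \sum_{i=1}^k \alpha_i^- \le 2\max\Bigl\{\sum_{i=1}^k \alpha_i^+,\ \sum_{i=1}^k \alpha_i^-\Bigr\} \le 2\Bigl\|\sum_{i=1}^k \alpha_i 1_C(x_i)\Bigr\|_{\mathcal{C}}, \tag{6.34}
\]
where the last inequality holds because $\mathcal{C}$ shatters $A$: choosing $C$ to pick out exactly the $x_i$'s with $\alpha_i > 0$ (respectively $\alpha_i < 0$) gives $|\sum_i \alpha_i 1_C(x_i)| = \sum_i \alpha_i^+$ (respectively $\sum_i \alpha_i^-$).

Now take $P = k^{-1}\sum_{i=1}^k \delta_{x_i}$ and choose $n$ so large that $n > (4M)^2$. Then choose $k > 2n^2$ and let $\Omega_0 \equiv \cap_{i \ne j}[X_i \ne X_j]$. We have
\[
P(\Omega_0^c) = P\bigl(\cup_{i \ne j \le n}[X_i = X_j]\bigr) \le \sum_{i \ne j \le n} P(X_i = X_j) \le n^2 k^{-1} < 1/2,
\]
so that $P(\Omega_0) \ge 1/2$. Thus, recalling that $R(P) \le M$ and the inequality (6.34), we obtain
\[
M\sqrt{n} \ge E\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge E\Bigl\{\Bigl\|\sum_{i=1}^n \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} 1_{\Omega_0}\Bigr\} \ge \frac{n}{2}\,P(\Omega_0) \ge \frac{n}{4}.
\]
This inequality yields $n \le 4M\sqrt{n}$, which contradicts our choice of $n > (4M)^2$.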
It follows that $\mathcal{C}$ is VC. □

6.5 Exercises

Exercise 6.1 Suppose that $\{\bar{N}_q\}_{q=1}^{\infty}$ satisfy $\sum_q 2^{-q}(\log \bar{N}_q)^{1/2} < \infty$. Show that $N_q = \bar{N}_1 \cdots \bar{N}_q$ also satisfies $\sum_q 2^{-q}(\log N_q)^{1/2} < \infty$.

Solution.
\[
\begin{aligned}
\sum_{q \ge 1} 2^{-q}\sqrt{\log N_q} &= \sum_{q \ge 1} 2^{-q}\sqrt{\log \prod_{p=1}^{q} \bar{N}_p} = \sum_{q \ge 1} 2^{-q}\sqrt{\sum_{p=1}^{q} \log \bar{N}_p} \\
&\le \sum_{q \ge 1} 2^{-q}\sum_{p=1}^{q} \sqrt{\log \bar{N}_p} = \sum_{p=1}^{\infty} \sqrt{\log \bar{N}_p}\,\sum_{q \ge p} 2^{-q} = 2\sum_{p=1}^{\infty} 2^{-p}\sqrt{\log \bar{N}_p}.
\end{aligned}
\]
Since $\sum_{p=1}^{\infty} 2^{-p}\sqrt{\log \bar{N}_p} < \infty$ by hypothesis, it follows that $\sum_{q \ge 1} 2^{-q}\sqrt{\log N_q} < \infty$. □

Exercise 6.2 Suppose that $X$ is a random variable satisfying the weak second moment condition $t^2 P(|X| > t) \to 0$ as $t \to \infty$. Show that $t\,E\{|X| 1\{|X| > t\}\} \to 0$ as $t \to \infty$.

Solution. Without loss of generality we can assume $X \ge 0$, because for a general random variable it suffices to replace $X$ by $|X|$. We have
\[
t\,E(X 1\{X > t\}) = t\int_0^{\infty} P(X 1\{X > t\} > x)\,dx = t\int_0^{t} P(X 1\{X > t\} > x)\,dx + t\int_t^{\infty} P(X 1\{X > t\} > x)\,dx. \tag{6.35}
\]
Since
\[
P(X 1\{X > t\} > x) = \begin{cases} P(X > t) & \text{if } 0 < x \le t, \\ P(X > x) & \text{if } x > t, \end{cases}
\]
Eq. (6.35) becomes
\[
t\,E(X 1\{X > t\}) = t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx.
\]
By hypothesis, given $\varepsilon > 0$, we can find $T$ such that for $t \ge T$ we have $t^2 P(X > t) < \varepsilon/2$, i.e. $P(X > t) < \varepsilon/(2t^2)$. Then, for $t \ge T$,
\[
t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx \le \frac{\varepsilon}{2} + t\int_t^{\infty} \frac{\varepsilon}{2x^2}\,dx \le \varepsilon,
\]
so that $t\,E(X 1\{X > t\}) \le \varepsilon$. □

Exercise 6.3 Show that for any non-negative random variable $X$ we have the inequalities
\[
\|X\|_{2,\infty}^2 \le \sup_{t>0} t\,E\{X 1\{X > t\}\} \le 2\|X\|_{2,\infty}^2,
\]
where $\|X\|_{2,\infty}^2 = \sup_{t>0} t^2 P(X > t)$.

Solution. Recalling Exercise 6.2, we have
\[
\begin{aligned}
t\,E(X 1\{X > t\}) &= t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx = t^2 P(X > t) + t\int_t^{\infty} \frac{1}{x^2}\,x^2 P(X > x)\,dx \\
&\le t^2 P(X > t) + t\,\|X\|_{2,\infty}^2 \int_t^{\infty} \frac{1}{x^2}\,dx = t^2 P(X > t) + \|X\|_{2,\infty}^2 \le 2\|X\|_{2,\infty}^2.
\end{aligned}
\]
It follows that $t\,E(X 1\{X > t\}) \le 2\|X\|_{2,\infty}^2$ and hence $\sup_{t>0} t\,E(X 1\{X > t\}) \le 2\|X\|_{2,\infty}^2$. On the other hand, $t\,E(X 1\{X > t\}) \ge t^2 P(X > t)$, so that
\[
\sup_{t>0} t\,E(X 1\{X > t\}) \ge \sup_{t>0} t^2 P(X > t) = \|X\|_{2,\infty}^2. \qquad \Box
\]
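As a check on the two constants in Exercise 6.3, one can work a concrete case by hand. The tail below is an assumption chosen for illustration (a Pareto-type law, not taken from the notes); under it the upper bound appears to be attained, which suggests the factor 2 cannot be improved in general.

```latex
% Illustrative assumption: P(X > x) = \min(1, x^{-2}), so that X \ge 1 almost surely.
\[
\|X\|_{2,\infty}^2 = \sup_{t>0} t^2 P(X>t) = \sup_{t>0} \min(t^2, 1) = 1 .
\]
% For t \ge 1, use the identity t E(X 1\{X>t\}) = t^2 P(X>t) + t \int_t^\infty P(X>x)\,dx:
\[
t\,E\bigl(X 1\{X>t\}\bigr) = t^2\, t^{-2} + t \int_t^\infty x^{-2}\,dx = 1 + 1 = 2 .
\]
% For 0 < t < 1 the indicator is identically 1 and E X = \int_0^\infty P(X>x)\,dx = 2:
\[
t\,E\bigl(X 1\{X>t\}\bigr) = t\,E X = 2t < 2 .
\]
% Hence the chain of Exercise 6.3 reads 1 \le 2 \le 2 for this law:
\[
\|X\|_{2,\infty}^2 = 1 \;\le\; \sup_{t>0} t\,E\bigl(X 1\{X>t\}\bigr) = 2 \;\le\; 2\,\|X\|_{2,\infty}^2 = 2 .
\]
```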
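Exercise 6.4 If the envelope function $F$ of the class $\mathcal{F}$ is such that $\sup_x F(x) < \infty$, then
\[
\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} = O_P(1) \iff \limsup_{n \to \infty} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} < \infty.
\]

Solution. Suppose first that $\limsup_{n \to \infty} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} < \infty$. Using Markov's inequality, we have
\[
P\bigl(\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} > k\bigr) \le \frac{E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}}}{k},
\]
so that
\[
\limsup_{n \to \infty} P\bigl(\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} > k\bigr) \le \frac{1}{k}\,\limsup_{n \to \infty} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}}.
\]
Since
\[
\lim_{k \to \infty} \frac{1}{k}\,\limsup_{n \to \infty} E_P^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} = 0,
\]
we obtain
\[
\lim_{k \to \infty}\ \limsup_{n \to \infty} P\bigl(\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} > k\bigr) = 0,
\]
and so $\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} = O_P(1)$.

Conversely, suppose that $\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}} = O_P(1)$. By the Hoffmann-Jørgensen inequality, if $X_1, \dots, X_n$ are independent mean zero stochastic processes indexed by an arbitrary set $T$ and $S_n = X_1 + \dots + X_n$, then there exist constants $K_p$ and $0 < v_p < 1$ such that
\[
E^*\|S_n\|^p \le K_p\Bigl\{E^* \max_{k \le n} \|X_k\|^p + G_n^{-1}(v_p)^p\Bigr\}, \tag{6.36}
\]
where $G_n^{-1}$ is the quantile function of $\|S_n\|^*$. Take $p = 1$, $T = \mathcal{F}$, $X_i(f) = \frac{1}{\sqrt{n}}\bigl(f(X_i) - Pf\bigr)$. Then
\[
S_n(f) = \sum_{i=1}^n \frac{1}{\sqrt{n}}\bigl(f(X_i) - Pf\bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^n f(X_i) - \sqrt{n}\,Pf \equiv \mathbb{G}_{n,P} f,
\]
and so $\|S_n\|^* = \|\mathbb{G}_{n,P}\|^*_{\mathcal{F}}$ and $G_n^{-1}$ is the quantile function of $\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}}$. Hence Eq. (6.36) becomes
\[
E^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} \le K_1\Bigl\{E^* \frac{1}{\sqrt{n}} \max_{i \le n} \|f(X_i) - Pf\|_{\mathcal{F}} + G_n^{-1}(v_1)\Bigr\}. \tag{6.37}
\]
Since $\|\mathbb{G}_{n,P}\|^*_{\mathcal{F}}$ is $O_P(1)$, the quantile $G_n^{-1}(v_1)$, for fixed $v_1$, is $O(1)$. Moreover, by the hypothesis of a bounded envelope function, there exists a constant $M < \infty$ such that $\sup_{f \in \mathcal{F}} |f(x)| \le M$. It follows that
\[
\max_{i \le n} \|f(X_i) - Pf\|_{\mathcal{F}} \le 2M. \tag{6.38}
\]
From Eq. (6.38), recalling Eq. (6.37), we conclude that $E^*\|\mathbb{G}_{n,P}\|_{\mathcal{F}} = O(1)$. □

Chapter 7

VC-theory: bounding uniform covering numbers

In this chapter we treat some classes of sets which are defined through combinatorial properties. These classes satisfy the entropy conditions for the Donsker theorem and the Glivenko-Cantelli theorem. Thus they are $P$-Glivenko-Cantelli and $P$-Donsker under suitable moment conditions on their envelope function, provided they are suitably measurable.

7.1 Introduction

Let $\mathcal{X}$ be a set and $\mathcal{C}$ a collection of subsets of $\mathcal{X}$. Consider an arbitrary $n$-point set $\{x_1, \dots, x_n\}$.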
Then we can give the following definitions.

Definition 7.1 We say that the collection $\mathcal{C}$ picks out a certain subset from $\{x_1, \dots, x_n\}$ if this subset can be written as $C \cap \{x_1, \dots, x_n\}$ for some set $C \in \mathcal{C}$.

Definition 7.2 The collection $\mathcal{C}$ is said to shatter $\{x_1, \dots, x_n\}$ if $\mathcal{C}$ picks out each of its $2^n$ subsets.

For every finite point set $\{x_1, \dots, x_n\}$ in $\mathcal{X}$, we set
\[
\Delta_n(\mathcal{C}, x_1, \dots, x_n) \equiv \#\{C \cap \{x_1, \dots, x_n\} : C \in \mathcal{C}\},
\]
so that $\Delta_n(\mathcal{C}, x_1, \dots, x_n)$ denotes the number of subsets of $\{x_1, \dots, x_n\}$ picked out by the collection $\mathcal{C}$. Moreover, if we set
\[
m^{\mathcal{C}}(n) \equiv \max_{x_1, \dots, x_n} \Delta_n(\mathcal{C}, x_1, \dots, x_n),
\]
we can define the following numbers:
\[
V(\mathcal{C}) \equiv \inf\{n : m^{\mathcal{C}}(n) < 2^n\}, \qquad S(\mathcal{C}) \equiv \sup\{n : m^{\mathcal{C}}(n) = 2^n\}.
\]

Remark 7.1 In words, the VC-index $V(\mathcal{C})$ of the class $\mathcal{C}$ is the smallest $n$ such that $\mathcal{C}$ shatters no set of size $n$. By definition, a VC-class of sets picks out strictly fewer than $2^n$ subsets from any set of $n \ge V(\mathcal{C})$ elements.

Remark 7.2 Note that the infimum over the empty set is taken to be infinity and the supremum over the empty set is taken to be $-1$. So we can conclude that $V(\mathcal{C}) = \infty$ if and only if $\mathcal{C}$ shatters sets of arbitrarily large size.

Remark 7.3 It is easy to show that $S(\mathcal{C}) = V(\mathcal{C}) - 1$.

Now we are able to give the following definition.

Definition 7.3 A collection $\mathcal{C}$ is called a VC-class if $V(\mathcal{C}) < \infty$, or equivalently $S(\mathcal{C}) < \infty$.

The next property follows directly from the definitions.

Proposition 7.1 A class $\mathcal{C}$ of subsets of a set $\mathcal{X}$ has $V(\mathcal{C}) = 0$, or equivalently $S(\mathcal{C}) = -1$, if and only if $\mathcal{C} = \emptyset$. Also, $V(\mathcal{C}) = 1$, or equivalently $S(\mathcal{C}) = 0$, if and only if $\mathcal{C}$ contains exactly one set. Thus $S(\mathcal{C}) \ge 1$ if and only if $\mathcal{C}$ contains at least two sets.

Proof. The first two statements are consequences of the definition. A class $\mathcal{C}$ shatters the empty set if and only if $\mathcal{C}$ contains at least one set.
If $\mathcal{C}$ contains at least two sets, then for some $A, B \in \mathcal{C}$ and $x \in \mathcal{X}$ we have $x \in A \setminus B$. Then $\mathcal{C}$ shatters $\{x\}$, so $S(\mathcal{C}) \ge 1$. Conversely, if $S(\mathcal{C}) \ge 1$, then $\mathcal{C}$ contains at least two sets. □

Before introducing the main result of this chapter, which concerns the covering numbers of any VC-class, it is useful to give some examples and preliminary results.

Example 7.1 Let $\mathcal{X} = \mathbb{R}$ and $\mathcal{C} = \{(-\infty, b] : b \in \mathbb{R}\}$. $\mathcal{C}$ shatters no two-point set $\{x_1, x_2\}$, because it cannot pick out $\{x_1 \vee x_2\}$. Hence its VC-index is 2 and $\mathcal{C}$ is a VC-class.

Example 7.2 Let $\mathcal{X} = \mathbb{R}$ and $\mathcal{C} = \{(a, b] : a, b \in \mathbb{R},\ a < b\}$. $\mathcal{C}$ shatters no three-point set $\{x_1, x_2, x_3\}$ with $x_1 < x_2 < x_3$, because it cannot pick out $\{x_1, x_3\}$. Hence its VC-index is 3 and $\mathcal{C}$ is a VC-class.

Remark 7.4 Let $\mathcal{X} = \mathbb{R}^d$. If $\mathcal{C} = \{(-\infty, b] : b \in \mathbb{R}^d\}$, it can be shown that $\mathcal{C}$ is a VC-class and its VC-index is $d + 1$. Similarly, if $\mathcal{C} = \{(a, b] : a, b \in \mathbb{R}^d,\ a < b\}$, it can be shown that $\mathcal{C}$ is VC with VC-index $2d + 1$.

For a VC-class, the following lemma implies that $m^{\mathcal{C}}(n)$ grows at most polynomially in $n$.

Lemma 7.1 (VC, Sauer and Shelah) For a VC-class of sets with VC-index $V(\mathcal{C})$, setting $S = S(\mathcal{C})$, we have
\[
m^{\mathcal{C}}(n) \le \sum_{j=0}^{S} \binom{n}{j} \le \left(\frac{ne}{S}\right)^{S}, \quad \text{for } n \ge S. \tag{7.1}
\]

Proof. Begin with the first inequality. The number $\Delta_n(\mathcal{C}, x_1, \dots, x_n)$ of subsets of $\{x_1, \dots, x_n\}$ picked out by $\mathcal{C}$ is bounded above by the number of subsets of $\{x_1, \dots, x_n\}$ that are shattered by $\mathcal{C}$ (a combinatorial fact that we do not prove here). By definition, for a VC-class all shattered sets have size at most $V(\mathcal{C}) - 1 = S$, so their number is at most the sum in (7.1). It follows that $m^{\mathcal{C}}(n)$ is bounded above by the same sum.

For the second inequality, let $Y \sim \text{Binomial}(n, 1/2)$. By Markov's inequality we get
\[
\sum_{j=0}^{S} \binom{n}{j} = 2^n \sum_{j=0}^{S} \binom{n}{j}\left(\frac{1}{2}\right)^{n} = 2^n P(Y \le S) \le 2^n E\,r^{Y - S}, \quad \text{for any } 0 < r \le 1.
\]
Recalling that $Y$ is a binomial random variable, we can compute this expectation and obtain
\[
2^n E\,r^{Y - S} = 2^n r^{-S}\left(\frac{1}{2} + \frac{r}{2}\right)^{n} = r^{-S}(1 + r)^n;
\]
choosing $r = S/n$ and recalling the definition of $e$, the previous quantity becomes
\[
\left(\frac{n}{S}\right)^{S}\left(1 + \frac{S}{n}\right)^{n} \le \left(\frac{n}{S}\right)^{S} e^{S}. \qquad \Box
\]

The following theorem gives two sufficient conditions for $S(\mathcal{C}) = 1$.

Theorem 7.1 Suppose that $\mathcal{C}$ is a collection of at least two subsets of a set $\mathcal{X}$. Then $S(\mathcal{C}) = 1$ if either of the following statements holds:
(i) the collection $\mathcal{C}$ is linearly ordered by inclusion,
(ii) any two sets in $\mathcal{C}$ are disjoint.

Proof. By Proposition 7.1, in either case $S(\mathcal{C}) \ge 1$. For (i), suppose that $\mathcal{C}$ shatters $\{x, y\}$. Let $A, B \in \mathcal{C}$ with $A \cap \{x, y\} = \{x\}$ and $B \cap \{x, y\} = \{y\}$. Because $\mathcal{C}$ is linearly ordered by inclusion, we must have $A \subset B$ or $B \subset A$, which is a contradiction. For (ii), we can argue as in part (i), taking in addition $C \in \mathcal{C}$ with $\{x, y\} \subset C$; then $C$ differs from $A$ and $B$ but is disjoint from neither, which is again a contradiction. □

We now give an example of a class of sets for which the VC-property fails.

Example 7.3 Let $\mathcal{X} = [0, 1]$ and let $\mathcal{C}$ be the class of all finite subsets of $\mathcal{X}$. Let $P$ be the uniform (Lebesgue) law on $[0, 1]$. Then $S(\mathcal{C}) = \infty$ and so $\mathcal{C}$ is not a VC-class. Moreover, for any possible value of $\mathbb{P}_n$, we have $\mathbb{P}_n(A) = 1$ for some $A = \{X_1, \dots, X_n\} \in \mathcal{C}$ while $P(A) = 0$. Thus $\|\mathbb{P}_n - P\|_{\mathcal{C}} = \sup_{A \in \mathcal{C}} |(\mathbb{P}_n - P)(A)| = 1$ for all $n$. It follows that $\mathcal{C}$ is neither a Glivenko-Cantelli class nor a Donsker class for $P$.

Now we find an upper bound for the covering numbers of VC-classes.

Theorem 7.2 There exists a universal constant $K$ such that for any VC-class $\mathcal{C}$ of sets, any probability measure $Q$, any $r \ge 1$, and $0 < \varepsilon \le 1$,
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) \le \left(\frac{K \log(3e/\varepsilon^r)}{\varepsilon^r}\right)^{S(\mathcal{C})} \le K'\left(\frac{1}{\varepsilon}\right)^{r S(\mathcal{C}) + \delta}, \quad \delta > 0. \tag{7.2}
\]
Moreover,
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) \le \tilde{K}\,V(\mathcal{C})\,(4e)^{V(\mathcal{C})}\left(\frac{1}{\varepsilon}\right)^{r S(\mathcal{C})}, \tag{7.3}
\]
where $\tilde{K}$ is universal.

Proof. The proof of (7.3) is very long, so it is omitted.
Instead we prove inequality (7.2), which is a weaker result. The upper bound for a general $r$ is an easy consequence of the bound for $r = 1$. Thus let $r = 1$, fix $0 < \varepsilon \le 1$ and let $m$ be the packing number for the collection $\mathcal{C}$, i.e. $m = D(\varepsilon, \mathcal{C}, L_1(Q))$. We know that $N(\varepsilon, \mathcal{C}, L_1(Q)) \le D(\varepsilon, \mathcal{C}, L_1(Q))$, so if $m \le (K \log 3e)^{S(\mathcal{C})}$ the claim holds trivially. Thus, assuming $m > (K \log 3e)^{S(\mathcal{C})}$, it suffices to show the claimed bound when $\log m > S(\mathcal{C}) \ge 1$, so that $m > e > 2$. By definition of the packing number, there exist $m$ sets $C_1, \dots, C_m \in \mathcal{C}$ such that for any pair $i \ne j$
\[
Q(C_i \triangle C_j) = E_Q|1_{C_i} - 1_{C_j}| > \varepsilon.
\]
Let $X_1, \dots, X_n$ be i.i.d. $Q$. Observe that $C_i$ and $C_j$ pick out the same subset of $\{X_1, \dots, X_n\}$ if and only if $X_k \notin C_i \triangle C_j$ for all $k \le n$. If each $C_i \triangle C_j$ contains some $X_k$, then all $C_i$'s pick out different subsets, and $\mathcal{C}$ picks out at least $m$ subsets from $\{X_1, \dots, X_n\}$. We bound the probability that this event does not occur:
\[
\begin{aligned}
Q\bigl([\text{for all } i \ne j,\ X_k \in C_i \triangle C_j \text{ for some } k \le n]^c\bigr)
&= Q\bigl([\text{for some } i \ne j,\ X_k \notin C_i \triangle C_j \text{ for all } k \le n]\bigr) \\
&\le \sum_{i<j} Q\bigl([X_k \notin C_i \triangle C_j \text{ for all } k \le n]\bigr) \\
&\le \binom{m}{2} \max_{i<j}\,[1 - Q(C_i \triangle C_j)]^n \le \binom{m}{2}(1 - \varepsilon)^n \le \binom{m}{2} e^{-n\varepsilon}.
\end{aligned} \tag{7.4}
\]
The right side of (7.4) is strictly less than 1 for $n$ large enough. In particular this holds if
\[
n > \frac{1}{\varepsilon}\log\binom{m}{2} = \frac{1}{\varepsilon}\log\bigl(m(m-1)/2\bigr).
\]
For all $m \ge 1$ we have $m(m-1)/2 \le m^2$, so this holds whenever $n \ge 2\log m/\varepsilon$. Since $\log m/\varepsilon > 1$, we can choose an integer $n$ with $2\log m/\varepsilon \le n \le 3\log m/\varepsilon$; for this value of $n$,
\[
Q\bigl([\text{for all } i \ne j,\ X_k \in C_i \triangle C_j \text{ for some } k \le n]\bigr) > 0.
\]
Thus we can find $n$ points $X_1(\omega), \dots, X_n(\omega)$ such that
\[
m \le \Delta_n(\mathcal{C}, X_1(\omega), \dots, X_n(\omega)) \le \max_{x_1, \dots, x_n} \Delta_n(\mathcal{C}, x_1, \dots, x_n) \le \left(\frac{en}{S}\right)^{S}, \tag{7.5}
\]
where (7.5) holds by Lemma 7.1 with $S = S(\mathcal{C}) = V(\mathcal{C}) - 1$. Since $n \le 3\log m/\varepsilon$, inequality (7.5) implies that
\[
m \le \left(\frac{3e \log m}{S\varepsilon}\right)^{S} \iff \frac{m^{1/S}}{\log m} \le \frac{3e}{S\varepsilon} \iff g(m^{1/S}) \le \frac{3e}{\varepsilon}, \tag{7.6}
\]
where the function $g$ is defined as $g(x) = x/\log x$. Inequality (7.6) yields
\[
m^{1/S} \le \frac{e}{e - 1}\,\frac{3e}{\varepsilon}\,\log\frac{3e}{\varepsilon} \tag{7.7}
\]
or
\[
D(\varepsilon, \mathcal{C}, L_1(Q)) = m \le \left\{\frac{e}{e - 1}\,\frac{3e}{\varepsilon}\,\log\frac{3e}{\varepsilon}\right\}^{S}. \tag{7.8}
\]
Recalling that $N(\varepsilon, \mathcal{C}, L_1(Q)) \le D(\varepsilon, \mathcal{C}, L_1(Q))$, inequality (7.2) holds for $r = 1$ with $K = 3e^2/(e - 1)$. If $r > 1$, then
\[
\|1_C - 1_D\|_{L_1(Q)} = Q(C \triangle D) = \|1_C - 1_D\|^r_{L_r(Q)},
\]
so that
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) = N(\varepsilon^r, \mathcal{C}, L_1(Q)) \le \left(K \varepsilon^{-r} \log\frac{3e}{\varepsilon^r}\right)^{S}.
\]
This completes the proof of (7.2). □

Definition 7.4 Let $f : \mathcal{X} \to \mathbb{R}$ be a function. The subgraph of $f$ is the set
\[
\{(x, t) \in \mathcal{X} \times \mathbb{R} : t < f(x)\}.
\]

Definition 7.5 Let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. If the collection of all the subgraphs of the functions in $\mathcal{F}$ forms a VC-class of sets in $\mathcal{X} \times \mathbb{R}$, we say that $\mathcal{F}$ is a VC-subgraph class, or just a VC-class. We denote by $V(\mathcal{F})$ the VC-index of the set of subgraphs of functions in $\mathcal{F}$.

The following theorem shows that the covering numbers of VC-classes of functions are bounded by a polynomial in $1/\varepsilon$.

Theorem 7.3 There exists a universal constant $K$ such that, for a VC-subgraph class $\mathcal{F}$ with envelope function $F$, any $r \ge 1$, any probability measure $Q$ with $\|F\|_{Q,r} > 0$, and $0 < \varepsilon < 1$,
\[
N(\varepsilon\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le K\,V(\mathcal{F})\,(16e)^{V(\mathcal{F})}\left(\frac{1}{\varepsilon}\right)^{r(V(\mathcal{F}) - 1)}. \tag{7.9}
\]

Proof. For each $f \in \mathcal{F}$, denote by $C_f$ its subgraph and by $\mathcal{C}$ the collection of all $C_f$'s. Let $\lambda$ be the Lebesgue measure on $\mathbb{R}$; by Fubini's theorem we obtain
\[
Q|f - g| = (Q \times \lambda)(C_f \triangle C_g).
\]
Renormalize $Q \times \lambda$ to a probability measure on $\{(x, t) : |t| \le F(x)\}$ by defining $P = (Q \times \lambda)/(2QF)$. By Theorem 7.2 we can find a universal constant $K$ such that
\[
N(\varepsilon\, 2QF, \mathcal{F}, L_1(Q)) = N(\varepsilon, \mathcal{C}, L_1(P)) \le K\,V(\mathcal{F})\left(\frac{4e}{\varepsilon}\right)^{V(\mathcal{F}) - 1}.
\]
For $r > 1$, denote by $R$ the probability measure with density $F^{r-1}/Q(F^{r-1})$ with respect to $Q$. Then
\[
Q|f - g|^r \le Q|f - g|(2F)^{r-1} = 2^{r-1} R|f - g|\,Q(F^{r-1}).
\]
Thus the $L_r(Q)$-distance is bounded by $2\,(Q(F^{r-1}))^{1/r}\,\|f - g\|_{R,1}^{1/r}$. By (7.3) we conclude that
\[
N(\varepsilon\, 2\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le N(\varepsilon^r RF, \mathcal{F}, L_1(R)) \le K\,V(\mathcal{F})\left(\frac{8e}{\varepsilon^r}\right)^{V(\mathcal{F}) - 1}. \qquad \Box
\]
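The Fubini step in the proof of Theorem 7.3 can be spelled out as follows. This is a short derivation that uses only the definition of the subgraph in Definition 7.4; the notation $(C_f \triangle C_g)_x$ for the section at $x$ is introduced here for illustration.

```latex
% Section of the symmetric difference at a fixed x (notation assumed for this sketch):
\[
(C_f \triangle C_g)_x = \{\, t \in \mathbb{R} : f(x) \wedge g(x) \le t < f(x) \vee g(x) \,\},
\qquad
\lambda\bigl((C_f \triangle C_g)_x\bigr) = |f(x) - g(x)| .
\]
% Integrating the section lengths against Q (Fubini):
\[
(Q \times \lambda)(C_f \triangle C_g)
= \int_{\mathcal{X}} \lambda\bigl((C_f \triangle C_g)_x\bigr)\, dQ(x)
= \int_{\mathcal{X}} |f - g| \, dQ = Q|f - g| .
\]
% After renormalizing by 2QF, distances between functions and between subgraphs match:
\[
\|1_{C_f} - 1_{C_g}\|_{L_1(P)} = P(C_f \triangle C_g) = \frac{Q|f - g|}{2\,QF},
\qquad\text{so}\qquad
\|f - g\|_{Q,1} \le \varepsilon \, 2QF \iff P(C_f \triangle C_g) \le \varepsilon .
\]
```

The last equivalence is what identifies $N(\varepsilon\,2QF, \mathcal{F}, L_1(Q))$ with $N(\varepsilon, \mathcal{C}, L_1(P))$.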
The following propositions give basic methods for generating VC-classes of sets and functions. We first introduce the following definition.

Definition 7.6 Let $\mathcal{F}$ be a collection of real-valued functions on a set $\mathcal{X}$. We define the following sets:
\[
\text{pos}(f) = \{x : f(x) > 0\}, \qquad \text{pos}(\mathcal{F}) = \{\text{pos}(f) : f \in \mathcal{F}\};
\]
\[
\text{nn}(f) = \{x : f(x) \ge 0\}, \qquad \text{nn}(\mathcal{F}) = \{\text{nn}(f) : f \in \mathcal{F}\}.
\]

Proposition 7.2 Let $\mathcal{F}$ be an $r$-dimensional real vector space of functions on $\mathcal{X}$, let $g$ be any real function on $\mathcal{X}$, and let $g + \mathcal{F} \equiv \{g + f : f \in \mathcal{F}\}$. Then
(i) $S(\text{pos}(g + \mathcal{F})) = S(\text{nn}(g + \mathcal{F})) = r$,
(ii) $S(\text{pos}(\mathcal{F})) = S(\text{nn}(\mathcal{F})) = r$,
(iii) $S(\mathcal{F}) \le r + 1$.

Proof. We prove (ii). Let $v = \dim(\mathcal{F}) + 1 = r + 1$ and let $x_1, \dots, x_v$ be $v$ distinct points of $\mathcal{X}$. Define the map $A : \mathcal{F} \to \mathbb{R}^v$ by $A(f) = (f(x_1), \dots, f(x_v))$. Since $\dim(\mathcal{F}) = r = v - 1$, we also have $\dim(A(\mathcal{F})) \le v - 1$. Thus we can find a nonzero vector $b = (b_1, \dots, b_v) \in \mathbb{R}^v$ which is orthogonal to $A(\mathcal{F})$, i.e.
\[
0 = \sum_{i=1}^{v} b_i f(x_i) \quad \text{for all } f \in \mathcal{F},
\]
and thus
\[
\sum_{i:\, b_i \ge 0} b_i f(x_i) = -\sum_{i:\, b_i < 0} b_i f(x_i).
\]
Assume, without loss of generality, that $\{i \le v : b_i < 0\}$ is not empty. If there were a function $f \in \mathcal{F}$ such that $\{f \ge 0\} \cap \{x_1, \dots, x_v\} = \{x_i : b_i \ge 0\}$, the left side of the last equality would be non-negative, while the right side would be strictly negative. This is a contradiction. Thus there exists a subset of $\{x_1, \dots, x_v\}$ which cannot be obtained as the intersection of $\{x_1, \dots, x_v\}$ and a set $\{f \ge 0\}$. Hence $\text{nn}(\mathcal{F})$ is VC and $S(\text{nn}(\mathcal{F})) \le r$. But since $\dim(\mathcal{F}) = r$, there is some subset $\{x_1, \dots, x_r\}$ with $A(\mathcal{F}) = \mathbb{R}^r$, thus all subsets of $\{x_1, \dots, x_r\}$ are of the form $B \cap \{x_1, \dots, x_r\}$ for $B \in \text{nn}(\mathcal{F})$. This implies that $S(\text{nn}(\mathcal{F})) \ge r$. Hence we have shown that $S(\text{nn}(\mathcal{F})) = r$. Finally, for any set $\mathcal{X}$ and any $\mathcal{C} \subset 2^{\mathcal{X}}$, the class $\mathcal{D}$ of complements in $\mathcal{X}$ of the sets in $\mathcal{C}$ satisfies $S(\mathcal{D}) = S(\mathcal{C})$. Since $\text{pos}(f)^c = \text{nn}(-f)$ and $-f \in \mathcal{F}$, it follows that $S(\text{pos}(\mathcal{F})) = r$ by taking complements. □
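As an illustration of Proposition 7.2 and of Examples 7.1 and 7.2, the following Python sketch brute-forces shattering on a small finite ground set. The helper names (`picked_out`, `shatters`, `largest_shattered`), the integer ground set and the integer coefficient grid are assumptions made for this sketch only. Because only finitely many sets and points are tried, the computed value is in general a lower bound for $S(\mathcal{C})$; in these three cases it agrees with the values given in the text. The last lines also check the Sauer-type bound of Lemma 7.1 on the ground set.

```python
from itertools import combinations, product
from math import comb


def picked_out(sets, points):
    """Subsets of `points` picked out by the class; each set is a predicate x -> bool."""
    return {frozenset(x for x in points if member(x)) for member in sets}


def shatters(sets, points):
    """True if the class picks out all 2^n subsets of the n given points."""
    return len(picked_out(sets, points)) == 2 ** len(points)


def largest_shattered(sets, ground, max_size):
    """Largest n <= max_size such that some n-point subset of `ground` is shattered.

    Returns -1 for the empty class (supremum over the empty set, as in Remark 7.2).
    """
    best = -1
    for n in range(max_size + 1):
        if any(shatters(sets, pts) for pts in combinations(ground, n)):
            best = n
    return best


ground = [0, 1, 2, 3]      # illustrative ground set
coeffs = range(-3, 4)      # illustrative integer parameter grid

# nn(F) for F = span{1, x} on the real line (dimension r = 2): sets {x : a + b*x >= 0}.
affine_nn = [lambda x, a=a, b=b: a + b * x >= 0 for a, b in product(coeffs, repeat=2)]
# Half-lines (-inf, b] as in Example 7.1 and intervals (a, b] as in Example 7.2.
half_lines = [lambda x, b=b: x <= b for b in coeffs]
intervals = [lambda x, a=a, b=b: a < x <= b for a, b in product(coeffs, repeat=2) if a < b]

s_affine = largest_shattered(affine_nn, ground, 3)
s_half = largest_shattered(half_lines, ground, 3)
s_int = largest_shattered(intervals, ground, 3)

# Lemma 7.1 on the ground set: number of picked-out subsets <= sum_{j <= S} C(n, j).
n = len(ground)
sauer_ok = all(
    len(picked_out(cls, ground)) <= sum(comb(n, j) for j in range(s + 1))
    for cls, s in [(affine_nn, s_affine), (half_lines, s_half), (intervals, s_int)]
)
```

Representing each set by a membership predicate keeps the sketch independent of how the class is parametrized; any finite family of predicates can be passed to the same helpers.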
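Example 7.4 Suppose that $\mathcal{X} = \mathbb{R}^d$ and set
\[
H(u, t) = \{y \in \mathbb{R}^d : \langle y, u \rangle \le t\}.
\]
Consider the set $\mathcal{C} = \mathcal{H}_d = \{H(u, t) : u \in \mathbb{R}^{d},\ t > 0\}$. Let $\mathcal{F}$ be the space spanned by $1$ and $x_1, \dots, x_d$; then $\dim(\mathcal{F}) = d + 1$. Moreover,
\[
H(u, t) = \{x \in \mathbb{R}^d : \langle x, u \rangle \le t\} = \{x \in \mathbb{R}^d : t - \langle x, u \rangle \ge 0\} = \{x : f_{t,u}(x) \ge 0\},
\]
where $f_{t,u}(x) = t - \langle x, u \rangle$ ranges in $\mathcal{F}$. By Proposition 7.2, $S(\mathcal{H}_d) = d + 1$.

Example 7.5 Suppose that $\mathcal{X} = \mathbb{R}^d$ and consider
\[
B(x, t) = \{y \in \mathbb{R}^d : |y - x| \le t\}.
\]
Set $\mathcal{C} = \mathcal{B}_d = \{B(x, t) : x \in \mathbb{R}^d,\ t > 0\}$. Let $\mathcal{F}$ be the space spanned by the functions $f_j(x) = x_j$ $(j = 1, \dots, d)$ and the constant function 1, and let $g$ be the function defined as $g(x) = -|x|^2$. Thus $\dim(\mathcal{F}) = d + 1$. Moreover,
\[
\begin{aligned}
B(x, t) = \{y : |y - x| \le t\} &= \{y : |y|^2 - 2\langle y, x \rangle + |x|^2 \le t^2\} \\
&= \{y : 2\langle y, x \rangle - |y|^2 - |x|^2 + t^2 \ge 0\} = \{y : g(y) + f_{t,x}(y) \ge 0\},
\end{aligned}
\]
where $f_{t,x}(y) = 2\langle y, x \rangle - |x|^2 + t^2$ ranges in $\mathcal{F}$. Since $\mathcal{B}_d = \text{nn}(g + \mathcal{F})$ and $S(\text{nn}(g + \mathcal{F})) = d + 1$ by Proposition 7.2, it follows that $S(\mathcal{B}_d) = d + 1$.

The next propositions may be regarded as stability properties.

Proposition 7.3 Assume that $\mathcal{C}$ and $\mathcal{D}$ are VC-classes of subsets of a set $\mathcal{X}$, and suppose that $\phi : \mathcal{X} \to \mathcal{Y}$ and $\psi : \mathcal{Z} \to \mathcal{X}$ are fixed functions. Then each of the following statements holds:
(i) $\mathcal{C}^c = \{C^c : C \in \mathcal{C}\}$ is VC and $S(\mathcal{C}^c) = S(\mathcal{C})$,
(ii) $\mathcal{C} \sqcap \mathcal{D} = \{C \cap D : C \in \mathcal{C},\ D \in \mathcal{D}\}$ is VC,
(iii) $\mathcal{C} \sqcup \mathcal{D} = \{C \cup D : C \in \mathcal{C},\ D \in \mathcal{D}\}$ is VC,
(iv) $\phi(\mathcal{C})$ is VC if $\phi$ is one-to-one,
(v) $\psi^{-1}(\mathcal{C})$ is VC and $S(\psi^{-1}(\mathcal{C})) \le S(\mathcal{C})$, with equality if $\psi$ is onto $\mathcal{X}$,
(vi) the sequential closure of $\mathcal{C}$ for pointwise convergence of indicator functions is VC,
(vii) for VC-classes $\mathcal{C}$ and $\mathcal{D}$ in sets $\mathcal{X}$ and $\mathcal{Y}$, $\mathcal{C} \times \mathcal{D} = \{C \times D : C \in \mathcal{C},\ D \in \mathcal{D}\}$ is VC.

Proof. By definition $\mathcal{C}^c$ picks out the points of a given set $\{x_1, \dots, x_m\}$ that $\mathcal{C}$ does not pick out. Hence if $\mathcal{C}$ shatters a given set of points, so does $\mathcal{C}^c$. Thus $\mathcal{C}$ is VC if and only if $\mathcal{C}^c$ is VC, and the VC-indices are equal.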
The proof that the collection of all intersections is VC is easy using Lemma 7.1, according to which a VC-class can pick out only a polynomial number of subsets. From $n$ points $\mathcal{C}$ can pick out at most $O(n^{S(\mathcal{C})})$ subsets; from each of these subsets $\mathcal{D}$ can pick out at most $O(n^{S(\mathcal{D})})$ further subsets. Thus $\mathcal{C} \sqcap \mathcal{D}$ can pick out at most $O(n^{S(\mathcal{C}) + S(\mathcal{D})})$ subsets. For large $n$ this is well below $2^n$. The result for unions follows by combining (i) and (ii), because $C \cup D = (C^c \cap D^c)^c$. To prove (iv), observe that if $\phi(\mathcal{C})$ shatters $\{y_1, \dots, y_n\}$, then each $y_i$ must be in the range of $\phi$ and there are $x_1, \dots, x_n$ such that $\phi$ is a bijection between $x_1, \dots, x_n$ and $y_1, \dots, y_n$. Hence $\mathcal{C}$ must shatter $\{x_1, \dots, x_n\}$. To prove (v), note that if $\psi^{-1}(\mathcal{C})$ shatters $\{z_1, \dots, z_n\}$, then all $\psi(z_i)$ must be different, and the restriction of $\psi$ to $z_1, \dots, z_n$ is a bijection onto its range, so $\psi^{-1}(\mathcal{C})$ is VC by the previous result. Statement (vii) is an immediate consequence of (ii): $\mathcal{C} \times \mathcal{Y}$ and $\mathcal{X} \times \mathcal{D}$ are VC-classes, and hence their intersection $\mathcal{C} \times \mathcal{D}$ is VC by (ii). Finally we prove (vi). Consider any set of points $x_1, \dots, x_n$ and any set $\bar{C}$ in the sequential closure. If $\bar{C}$ is the pointwise limit of a net $C_\alpha$, then for large $\alpha$ we get $1_{\bar{C}}(x_i) = 1_{C_\alpha}(x_i)$ for each $i$. For such $\alpha$ the set $C_\alpha$ picks out the same subset as $\bar{C}$. □

Proposition 7.4 Let $\odot$ denote $\sqcap$, $\sqcup$, or $\times$ and set
\[
S_\odot(j, k) = \max\{S(\mathcal{C} \odot \mathcal{D}) : S(\mathcal{C}) = j,\ S(\mathcal{D}) = k\}.
\]
Then for each $j, k \in \mathbb{N}$, $S_\sqcap(j, k) = S_\sqcup(j, k) = S_\times(j, k) = S(j, k)$ and
\[
S(j, k) \le \sup\{r \in \mathbb{N} : {}_rC_{\le j}\; {}_rC_{\le k} \ge 2^r\} = T(j, k),
\]
where ${}_rC_{\le j} = \sum_{l=0}^{j} \binom{r}{l}$.

Proof. The first equality follows by taking complements. For the second, for given $j$ and $k$, we can consider large enough finite sets in place of $\mathcal{X}$ and $\mathcal{Y}$, so we can assume $\mathcal{X} = \mathcal{Y}$. We have $S_\sqcap(j, k) \le S_\times(j, k)$ by restricting to the diagonal in $\mathcal{X} \times \mathcal{X}$. On the other hand, let $\Pi_{\mathcal{X}}$ and $\Pi_{\mathcal{Y}}$ be the projections of $\mathcal{X} \times \mathcal{Y}$ onto $\mathcal{X}$ and $\mathcal{Y}$.
Let $\mathcal{C} \subset 2^{\mathcal{X}}$ and $\mathcal{A} \subset 2^{\mathcal{Y}}$. Set $\mathcal{F} = \{\Pi_{\mathcal{X}}^{-1}(C) : C \in \mathcal{C}\}$ and $\mathcal{B} = \{\Pi_{\mathcal{Y}}^{-1}(A) : A \in \mathcal{A}\}$. Then $S(\mathcal{F}) = S(\mathcal{C})$ and $S(\mathcal{B}) = S(\mathcal{A})$. Since $\Pi_{\mathcal{X}}^{-1}(C) \cap \Pi_{\mathcal{Y}}^{-1}(A) \equiv C \times A$, it follows that $S(\mathcal{F} \sqcap \mathcal{B}) \ge S(\mathcal{C} \times \mathcal{A})$, and thus the proof is complete. □

The next proposition gives the stability properties for VC-classes of functions.

Proposition 7.5 Assume that $\mathcal{F}$ and $\mathcal{G}$ are VC-subgraph classes of functions on a set $\mathcal{X}$, and suppose that $g : \mathcal{X} \to \mathbb{R}$, $\phi : \mathbb{R} \to \mathbb{R}$ and $\psi : \mathcal{Z} \to \mathcal{X}$ are fixed functions. Then each of the following statements holds:
(i) $\mathcal{F} \vee \mathcal{G} = \{f \vee g : f \in \mathcal{F},\ g \in \mathcal{G}\}$ is VC-subgraph,
(ii) $\mathcal{F} \wedge \mathcal{G} = \{f \wedge g : f \in \mathcal{F},\ g \in \mathcal{G}\}$ is VC-subgraph,
(iii) $\{\mathcal{F} > 0\} = \{\{f > 0\} : f \in \mathcal{F}\}$ is VC,
(iv) $-\mathcal{F}$ is VC-subgraph,
(v) $g + \mathcal{F} = \{g + f : f \in \mathcal{F}\}$ is VC-subgraph,
(vi) $g \cdot \mathcal{F} = \{g \cdot f : f \in \mathcal{F}\}$ is VC-subgraph,
(vii) $\mathcal{F} \circ \psi = \{f(\psi) : f \in \mathcal{F}\}$ is VC-subgraph,
(viii) $\phi \circ \mathcal{F} = \{\phi(f) : f \in \mathcal{F}\}$ is VC-subgraph for monotone $\phi$.

Proof. The subgraphs of $f \wedge g$ and $f \vee g$ are the intersection and the union of the subgraphs of $f$ and $g$, thus (i) and (ii) are consequences of Proposition 7.3. To see that (iii) holds, note that the sets $\{f > 0\}$ are one-to-one images of the intersections of the subgraphs with the set $\mathcal{X} \times \{0\}$. Hence the class $\{\mathcal{F} > 0\}$ is VC by (ii) and (iv) of the preceding proposition. The subgraphs of the class $-\mathcal{F}$ are the images of the open supergraphs $\{(x, t) : t > f(x)\}$ of $\mathcal{F}$ under the map $(x, t) \mapsto (x, -t)$, and the open supergraphs are the complements of the closed subgraphs, which are VC. Thus (iv) follows from the previous proposition. For (v), observe that the subgraphs of $g + \mathcal{F}$ shatter a given set of points $(x_1, t_1), \dots, (x_n, t_n)$ if and only if the subgraphs of $\mathcal{F}$ shatter the set of points $(x_i, t_i - g(x_i))$. Thus $g + \mathcal{F}$ is VC-subgraph. For (vi), the subgraph of the function $fg$ is the union of the sets
\[
\begin{aligned}
C^+ &= \{(x, t) : t < f(x) g(x),\ g(x) > 0\}, \\
C^- &= \{(x, t) : t < f(x) g(x),\ g(x) < 0\}, \\
C^0 &= \{(x, t) : t < 0,\ g(x) = 0\}.
\end{aligned}
\]
Thus it suffices to show that these sets are VC in $(\mathcal{X} \cap \{g > 0\}) \times \mathbb{R}$, $(\mathcal{X} \cap \{g < 0\}) \times \mathbb{R}$ and $(\mathcal{X} \cap \{g = 0\}) \times \mathbb{R}$, respectively. For instance, $\{i : (x_i, t_i) \in C^-\}$ is the set of indices of the points $(x_i, t_i/g(x_i))$ picked out by the open supergraphs of $\mathcal{F}$. These are the complements of the closed subgraphs and hence form a VC-class. The subgraphs of $\mathcal{F} \circ \psi$ are the inverse images of the subgraphs of functions in $\mathcal{F}$ under the map $(z, t) \mapsto (\psi(z), t)$. Hence (vii) follows from (v) of Proposition 7.3. To prove statement (viii), assume that the subgraphs of $\phi \circ \mathcal{F}$ shatter the set of points $(x_1, t_1), \dots, (x_n, t_n)$. Choose $f_1, \dots, f_m$ from $\mathcal{F}$ such that the subgraphs of the functions $\phi \circ f_j$ pick out all $m = 2^n$ subsets. For each $i$, set $s_i = \max\{f_j(x_i) : \phi(f_j(x_i)) \le t_i\}$. Now, $s_i < f_j(x_i)$ if and only if $t_i < \phi \circ f_j(x_i)$, for every $i$ and $j$, and the subgraphs of $f_1, \dots, f_m$ shatter the points $(x_i, s_i)$. This completes the proof. □

Definition 7.7 A class $\mathcal{F}$ of real functions on a set $\mathcal{X}$ is called a Euclidean class for the envelope function $F$ if there exist constants $A$ and $V$ such that, for $0 < \varepsilon \le 1$,
\[
N(\varepsilon\|F\|_{Q,1}, \mathcal{F}, L_1(Q)) \le A\,\varepsilon^{-V}
\]
whenever $0 < \|F\|_{Q,1} = QF < \infty$.

Remark 7.5 It is important to observe that the constants $A$ and $V$ must not depend on $Q$.

Remark 7.6 Note that if $\mathcal{F}$ is Euclidean, then for each $r > 1$ and $0 < \varepsilon \le 1$,
\[
N(\varepsilon\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le A\,2^{rV}\varepsilon^{-rV}
\]
whenever $0 < QF^r < \infty$, as follows from the definition applied to $N(2(\varepsilon/2)^r\|F\|_{\mu,1}, \mathcal{F}, L_1(\mu))$ for the measure $\mu(\cdot) = Q(\cdot\,(2F)^{r-1})$.

Euclidean classes enjoy some stability properties, such as the following one.

Proposition 7.6 Assume that $\mathcal{F}$ and $\mathcal{G}$ are Euclidean classes of functions with envelopes $F$ and $G$ respectively, and suppose that $Q$ is a measure with $QF^r < \infty$ and $QG^r < \infty$ for some $r \ge 1$. Then the class of functions
\[
\mathcal{F} + \mathcal{G} = \{f + g : f \in \mathcal{F},\ g \in \mathcal{G}\}
\]
is Euclidean for the envelope $F + G$ and
\[
N((2\varepsilon + 2\delta)\|F + G\|_{Q,r}, \mathcal{F} + \mathcal{G}, L_r(Q)) \le N(\varepsilon\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \cdot N(\delta\|G\|_{Q,r}, \mathcal{G}, L_r(Q)).
\]
7.2 Convex Hulls

Definition 7.8 Let $\mathcal{Y}$ be a vector space and $A \subset \mathcal{Y}$. The convex hull of $A$ is the set
\[
\text{conv}(A) = \Bigl\{\sum_{i=1}^{k} t_i y_i :\ y_i \in A,\ t_i \ge 0,\ \sum_{j} t_j = 1 \text{ for some integer } k\Bigr\}.
\]

Definition 7.9 Let $\mathcal{F}$ be a class of functions. The convex hull of $\mathcal{F}$ is defined as the set
\[
\text{conv}(\mathcal{F}) = \Bigl\{\sum_{i=1}^{m} \alpha_i f_i :\ f_i \in \mathcal{F},\ \alpha_i > 0,\ \sum_{i=1}^{m} \alpha_i = 1\Bigr\}.
\]

Definition 7.10 Let $\mathcal{F}$ be a class of functions. The symmetric convex hull of $\mathcal{F}$ is
\[
\text{sconv}(\mathcal{F}) = \Bigl\{\sum_{i=1}^{m} \alpha_i f_i :\ f_i \in \mathcal{F},\ \sum_{i=1}^{m} |\alpha_i| \le 1\Bigr\}.
\]

Definition 7.11 A collection of measurable functions $\mathcal{F}$ is a VC-hull class if there exists a VC-class $\mathcal{G}$ of functions such that every $f \in \mathcal{F}$ is the pointwise limit of a sequence of functions $f_m$ contained in $\text{sconv}(\mathcal{G})$.

Suppose that $\mathcal{F}$ is a class of measurable functions. An upper bound for the covering numbers of the convex hull $\text{conv}(\mathcal{F})$ in $L_2$-norm can be obtained once an upper bound for the covering numbers of the class $\mathcal{F}$ in $L_2$-norm is known.

Theorem 7.4 Assume that $Q$ is a probability measure on $(\mathcal{X}, \mathcal{A})$, and let $\mathcal{F}$ be a class of measurable functions with measurable square integrable envelope $F$ such that $0 < QF^2 < \infty$ and, for $0 < \varepsilon \le 1$,
\[
N(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \le C\left(\frac{1}{\varepsilon}\right)^{V}.
\]
Then there exists a constant $K$ depending on $C$ and $V$ only such that
\[
\log N(\varepsilon\|F\|_{Q,2}, \text{conv}(\mathcal{F}), L_2(Q)) \le K\left(\frac{1}{\varepsilon}\right)^{2V/(V+2) + \delta}.
\]

Proof. The power $2V/(V + 2)$ is sharp; moreover, $2V/(V + 2) < 2$ for any $V < \infty$. Hence the convex hull $\mathcal{G} = \text{conv}(\mathcal{F})$ of a polynomial class $\mathcal{F}$ satisfies the uniform entropy condition
\[
\int_0^{\infty} \sup_Q \sqrt{\log N(\varepsilon\|G\|_{Q,2}, \mathcal{G}, L_2(Q))}\;d\varepsilon < \infty,
\]
provided $\|G\|_{Q,2}^2 \equiv \int G^2\,dQ$ is finite for some envelope function $G$ of $\mathcal{G}$. □

This result can be extended to $L_r$-metrics for $1 < r < \infty$.

Theorem 7.5 Assume that $Q$ is a probability measure on $(\mathcal{X}, \mathcal{A})$, and let $\mathcal{F}$ be a class of measurable functions with measurable envelope $F$ such that $QF^r < \infty$ and, for $0 < \varepsilon < 1$ and $r > 1$,
\[
N(\varepsilon\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le C\left(\frac{1}{\varepsilon}\right)^{V}.
\]
Then there exists a constant $K$ depending on $C$, $V$ and $r$ such that
\[
\log N(\varepsilon \|F\|_{Q,r}, \mathrm{conv}(\mathcal{F}), L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{1/\left(\min(1 - 1/r,\, 1/2) + 1/V\right)}.
\]

We complete the chapter with an example.

Example 7.6 Consider the class of all distribution functions on $\mathbb{R}^d$. Set $\mathcal{G}_d \equiv \{1_{[t, \infty)} : t \in \mathbb{R}^d\}$. Then $\mathcal{G}_d$ is VC with $V(\mathcal{G}_d) = d + 1$, and the envelope function is 1. Hence an upper bound for the covering numbers is given by equation (7.3) of Theorem 7.2:
\[
N(\varepsilon, \mathcal{G}_d, L_r(Q)) \le K \varepsilon^{-rd}, \qquad 0 < \varepsilon \le 1.
\]
The entropy of $\mathrm{conv}(\mathcal{G}_d)$ satisfies
\[
\log N(\varepsilon, \mathrm{conv}(\mathcal{G}_d), L_r(Q)) \le K \varepsilon^{-\gamma(r,d)},
\qquad
\gamma(r, d) =
\begin{cases}
\dfrac{2rd}{rd + 2}, & r \ge 2, \\[2mm]
\dfrac{rd}{(r-1)d + 1}, & 1 < r \le 2.
\end{cases}
\]
Computing $\gamma$ at the extreme values, we obtain $\gamma(2, d) = 2d/(d + 1)$ and $\gamma(r, d) \nearrow d$ as $r \searrow 1$. In particular $\gamma(2, 1) = 1 = \gamma(1, 1)$, $\gamma(2, 2) = 4/3$, and $\gamma(r, 2) \nearrow 2$ as $r \searrow 1$.

Chapter 8 Bracketing Numbers

We have already seen two ways of controlling bracketing numbers; recall Lemma 5.1 and Lemma 5.2. Our goal here is to describe some of the other available results for larger classes of functions. Control of bracketing numbers typically comes via results in approximation theory. Bounds are available in the literature for many interesting classes: see for example Kolmogorov and Tikhomirov (1959), Birman and Solomjak (1967), Clements (1963), DeVore and Lorentz (1993), and Birgé and Massart (2000). We give a few examples in this chapter. Many of the available results are stated in terms of the supremum norm $\|\cdot\|_\infty$; these yield bounds on $L_r(Q)$ bracketing numbers via the following easy lemma (see Exercise 9.1).

Lemma 8.1 For any class of measurable real-valued functions $\mathcal{F}$ on $(\mathcal{X}, \mathcal{A})$, any probability measure $Q$ and any $1 \le r < \infty$,
\[
N(\varepsilon, \mathcal{F}, L_r(Q)) \le N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q))
\quad \text{and} \quad
N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le N(\varepsilon/2, \mathcal{F}, \|\cdot\|_\infty)
\]
for every $\varepsilon > 0$.

Proof. Let $k = N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q))$ and let $\{[l_i, u_i]\}_{i=1}^k$ be the $\varepsilon$-brackets. Then for any $f \in \mathcal{F}$ there exists $i$ such that $u_i(x) \ge f(x) \ge l_i(x)$ for every $x$.
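It follows that
\[
\Big\| f - \frac{l_i + u_i}{2} \Big\|_{L_r(Q)}
\le \Big\| \frac{f}{2} - \frac{l_i}{2} \Big\|_{L_r(Q)} + \Big\| \frac{f}{2} - \frac{u_i}{2} \Big\|_{L_r(Q)}
= \frac{1}{2} \|f - l_i\|_{L_r(Q)} + \frac{1}{2} \|f - u_i\|_{L_r(Q)}.
\]
Because $0 \le u_i - f \le u_i - l_i$, we have $\int |u_i - f|^r\, dQ \le \int |u_i - l_i|^r\, dQ$, that is,
\[
\|f - u_i\|_{L_r(Q)} \le \|l_i - u_i\|_{L_r(Q)} < \varepsilon.
\]
Similarly $\|f - l_i\|_{L_r(Q)} < \varepsilon$, and then
\[
\Big\| f - \frac{l_i + u_i}{2} \Big\|_{L_r(Q)} < \varepsilon.
\]
Finally,
\[
\mathcal{F} \subseteq \bigcup_{i=1}^k B\Big( \frac{u_i + l_i}{2}, \varepsilon, L_r(Q) \Big).
\]
This proves the first inequality.

Now we show that $N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le N(\varepsilon/2, \mathcal{F}, \|\cdot\|_\infty) = l$. Let $\mathcal{F} \subseteq \bigcup_{i=1}^l B(f_i, \varepsilon/2, \|\cdot\|_\infty)$. Then for any $f \in \mathcal{F}$ there exists $f_i$ such that $\|f - f_i\|_\infty < \varepsilon/2$, so that pointwise
\[
f_i - \frac{\varepsilon}{2} \le f \le f_i + \frac{\varepsilon}{2},
\]
and clearly $\{[f_i - \varepsilon/2, f_i + \varepsilon/2]\}_{i=1}^l$ forms a collection of $l$ $\varepsilon$-brackets that cover $\mathcal{F}$. □

8.1 Smooth Functions

First consider the collection of smooth functions on a bounded set $\mathcal{X}$ in $\mathbb{R}^d$ with uniformly bounded derivatives of a given order $\alpha > 0$, defined as follows. Let $\underline{\alpha}$ denote the greatest integer smaller than $\alpha$, and for any vector $k = (k_1, \ldots, k_d)$ of $d$ integers let
\[
D^k = \frac{\partial^{k_\cdot}}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}},
\qquad \text{where } k_\cdot = \sum_{j=1}^d k_j.
\]
Then for a function $f : \mathcal{X} \to \mathbb{R}$, define
\[
\|f\|_\alpha = \max_{k_\cdot \le \underline{\alpha}} \sup_x |D^k f(x)|
+ \max_{k_\cdot = \underline{\alpha}} \sup_{x, y} \frac{|D^k f(x) - D^k f(y)|}{\|x - y\|^{\alpha - \underline{\alpha}}},
\]
where the suprema are taken over all $x, y$ in the interior of $\mathcal{X}$ with $x \ne y$. Let $C_M^\alpha(\mathcal{X})$ be the set of all continuous functions $f : \mathcal{X} \to \mathbb{R}$ with $\|f\|_\alpha \le M$. The following theorem goes back to Kolmogorov and Tikhomirov (1959).

Theorem 8.1 Suppose that $\mathcal{X}$ is a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior. Then there exists a constant $K$ depending only on $\alpha$ and $d$ such that
\[
\log N(\varepsilon, C_1^\alpha(\mathcal{X}), \|\cdot\|_\infty) \le K \lambda(\mathcal{X}^1) \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha} \tag{8.1}
\]
for every $\varepsilon > 0$. (Here $\lambda(\mathcal{X}^1)$ is the Lebesgue measure of the set $\mathcal{X}^1 = \{x : \|x - \mathcal{X}\| < 1\}$.)

By application of Lemma 8.1, this yields the following corollary.

Corollary 8.1 Let $\mathcal{X}$ be a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior.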
Then there exists a constant $K$ depending only on $\alpha$, $\lambda(\mathcal{X}^1)$, and $d$ such that
\[
\log N_{[\,]}(\varepsilon, C_1^\alpha(\mathcal{X}), L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha}
\]
for every $r \ge 1$, $\varepsilon > 0$, and probability measure $Q$ on $\mathbb{R}^d$.

Example 8.1 Let $\mathcal{F}_\alpha = C_1^\alpha[0, 1]$ for $0 < \alpha \le 1$, the class of all Lipschitz functions of degree $\alpha \le 1$ on the unit interval $[0, 1]$. Then $\log N(\varepsilon, C_1^\alpha[0, 1], L_2(Q)) \le K (1/\varepsilon)^{1/\alpha}$ for all $\varepsilon > 0$, and hence $\mathcal{F}_\alpha$ is universal Donsker for $\alpha > 1/2$. Similarly, for $\mathcal{F}_{d,\alpha} = C_1^\alpha[0, 1]^d$, we conclude that $\mathcal{F}_{d,\alpha}$ is universal Donsker for $\alpha > d/2$. [It follows from a result of Strassen and Dudley (1969) that this is sharp in a sense: if $\alpha = d/2$, then the class $\mathcal{F}_{d,\alpha}$ is not even pre-Gaussian for $Q = \lambda$ on $[0, 1]^d$.]

If we replace the uniform bounds in the definition of the norm used to define the classes $C_M^\alpha(\mathcal{X})$ by bounds on $L_p$-norms of derivatives, then the resulting classes of functions are the Sobolev classes $W_p^\alpha(\mathcal{X})$, defined as follows. For $\alpha \in \mathbb{N}$ and $p \ge 1$, define
\[
\|f\|_{p,\alpha} = \|f\|_{L_p} + \Big\{ \sum_{k_\cdot = \alpha} \|D^k f\|_{L_p}^p \Big\}^{1/p},
\]
where $L_p = L_p(\mathcal{X}, \mathcal{B}, \lambda)$. If $\alpha$ is not an integer, define
\[
\|f\|_{p,\alpha} = \|f\|_{L_p} + \Big\{ \sum_{k_\cdot = \underline{\alpha}} \int_{\mathcal{X}} \int_{\mathcal{X}} \frac{|D^k f(x) - D^k f(y)|^p}{\|x - y\|^{p(\alpha - \underline{\alpha}) + d}}\, dx\, dy \Big\}^{1/p}.
\]
The Sobolev space $W_p^\alpha(\mathcal{X})$ is the set of real valued functions on $\mathcal{X}$ with $\|f\|_{p,\alpha} < \infty$. Let $D_M^{\alpha,p}(\mathcal{X}) = \{f \in W_p^\alpha(\mathcal{X}) : \|f\|_{p,\alpha} \le M\}$. Birman and Solomjak (1967) proved the following entropy bound.

Theorem 8.2 (Birman and Solomjak) Suppose that $\mathcal{X}$ is a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior. Then there exists a constant $K$ depending only on $r$ and $d$ such that
\[
\log N(\varepsilon, D_1^{\alpha,p}([0, 1]^d), \|\cdot\|_{L_q}) \le K \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha} \tag{8.2}
\]
for every $\varepsilon > 0$ and $1 \le q \le \infty$ when $p > d/\alpha$, and for every $\varepsilon > 0$ and $1 \le q < q^* := p(1 - p\alpha/d)^{-1}$ when $p \le d/\alpha$.

Theorem 8.2 has recently been extended to balls in the Besov space $B_{p,\infty}^\alpha([0, 1]^d)$ by Birgé and Massart (2000). Here is the definition of these spaces in the case $d = 1$, following DeVore and Lorentz (1993). Suppose that $[a, b]$ is a compact interval in $\mathbb{R}$.
For an integer $r$ define the $r$-th order differences of a function $f : [a, b] \to \mathbb{R}$ by
\[
\Delta_h^r(f, x) = \sum_{k=0}^r \binom{r}{k} (-1)^{r-k} f(x + kh),
\]
where $x, x + kh \in [a, b]$. The $L_p$-modulus of smoothness $\omega_r(f, y, [a, b])_p$ is then defined by
\[
[\omega_r(f, y, [a, b])_p]^p = \sup_{0 < h < y} \int_a^{b - rh} |\Delta_h^r(f, x)|^p\, dx \qquad \text{for } y > 0.
\]
For given $\alpha > 0$ and $p > 0$, define $\|f\|_{B_p^\alpha}$ by
\[
\|f\|_{B_p^\alpha} = \sup_{y > 0} y^{-\alpha} \omega_r(f, y, [a, b])_p.
\]
The Besov space $B_{p,\infty}^\alpha([a, b])$ is the collection of all functions $f \in L_p([a, b])$ with $\|f\|_{B_p^\alpha} < \infty$. This generalizes to functions on bounded subsets of $\mathbb{R}^d$ as follows.

Theorem 8.3 (Birgé and Massart) Suppose that $p > 0$ and $1 \le q \le \infty$. Let
\[
V_M(B_{p,\infty}^\alpha([0, 1]^d)) = \{f \in B_{p,\infty}^\alpha([0, 1]^d) : \|f\|_{B_p^\alpha} \le M\}.
\]
Then, for a constant $K$ depending on $d$, $\alpha$, $p$, and $q$,
\[
\log N(\varepsilon, V_M(B_{p,\infty}^\alpha([0, 1]^d)), L_q) \le K \Big(\frac{M}{\varepsilon}\Big)^{d/\alpha}
\]
provided that $\alpha > (d/p - d/q)_+$.

The results stated so far in this section apply to functions $f$ defined on a bounded subset $\mathcal{X}$ of Euclidean space. By adding hypotheses in the form of moment conditions on the underlying probability measure, the entropy bounds can be generalized to classes of functions on $\mathbb{R}^d$. Here is an extension of this type for the Hölder classes treated for bounded domains in Theorem 8.1.

Corollary 8.2 (van der Vaart) Suppose that $\mathbb{R}^d = \bigcup_{j=1}^\infty I_j$ is a partition of $\mathbb{R}^d$ into bounded, convex sets $I_j$ with nonempty interior, and let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^d \to \mathbb{R}$ such that the restrictions $\mathcal{F}|_{I_j}$ are in $C_{M_j}^\alpha(I_j)$ for every $j$. Then there is a constant $K$ depending only on $\alpha$, $V$, $r$ and $d$ such that
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{V}
\Big( \sum_{j=1}^\infty \lambda(I_j^1)^{\frac{r}{V + r}} M_j^{\frac{Vr}{V + r}} Q(I_j)^{\frac{V}{V + r}} \Big)^{\frac{V + r}{r}}
\]
for every $\varepsilon > 0$, $V \ge d/\alpha$, and probability measure $Q$.

Proof. See van der Vaart and Wellner (1996), page 158, and van der Vaart (1994). □

8.2 Monotone Functions

As we have seen in Chapter 6, the class $\mathcal{F}$ of bounded monotone functions on $\mathbb{R}$ has $L_2(Q)$ uniform entropy bounded by a constant times $1/\varepsilon$ via the convex hull Theorem 7.4.
It follows that $\mathcal{F}$ is Donsker for every probability measure $P$ on $\mathbb{R}$. Another way to prove this is via bracketing. The following theorem was proved by Van de Geer (1991) using the methods of Birman and Solomjak (1967).

Theorem 8.4 Let $\mathcal{F}$ be the class of all monotone functions $f : \mathbb{R} \to [0, 1]$. Then
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)
\]
for every probability measure $Q$, every $r \ge 1$, and a constant $K$ depending on $r$ only.

Proof. See van der Vaart and Wellner (1996), pages 159-162, for a complete proof. □

The bracketing entropy bound is very useful in applications because of the relative ease of bounding suprema of empirical processes in terms of bracketing integrals, as developed in Chapter 7.

8.3 Convex Functions and Convex Sets

To deal with convex sets in a metric space $(\mathbb{D}, d)$, we first introduce a natural metric, the Hausdorff metric: for $C, D \subset \mathbb{D}$, let
\[
h(C, D) = \sup_{x \in C} d(x, D) \vee \sup_{x \in D} d(x, C).
\]
When restricted to closed subsets, this yields a metric (which can be infinite). The following result of Bronštein (1976) gives the entropy of the collection of all compact, convex subsets of a fixed, bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with respect to the Hausdorff metric.

Lemma 8.2 Suppose that $\mathcal{C}_d$ is the class of all compact, convex subsets of a fixed bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with $d \ge 2$. Then there are constants $0 < K_1 < K_2 < \infty$ such that
\[
K_1 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2} \le \log N(\varepsilon, \mathcal{C}_d, h) \le K_2 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2}.
\]

Proof. See Bronštein (1976) or Dudley (1999), pages 269-281. □

There is an immediate corollary of Lemma 8.2 for $L_r(Q)$ bracketing numbers when $Q$ is absolutely continuous with respect to Lebesgue measure on $\mathcal{X}$ with bounded density.

Corollary 8.3 Let $\mathcal{C}_d$ be the class of all compact, convex subsets of a fixed bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with $d \ge 2$, and suppose that $Q$ is a probability distribution on $\mathcal{X}$ with bounded density $q$. Then
\[
\log N_{[\,]}(\varepsilon, \mathcal{C}_d, L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{(d-1)r/2}
\]
for every $\varepsilon > 0$ and a constant $K$ depending only on $\mathcal{X}$, $\|q\|_\infty$, and $d$.

Proof. See van der Vaart and Wellner (1996), page 163. □
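As a concrete illustration of the Hausdorff metric $h$ used in Lemma 8.2, the following minimal sketch evaluates $h(C, D)$ for two finite point sets in the plane with the Euclidean distance. It is not part of the original notes: the function names and the two example sets are hypothetical, and finite sets are used only because the suprema and infima then reduce to maxima and minima.

```python
import math

def dist_to_set(x, A):
    """d(x, A) = inf{ d(x, y) : y in A } for a finite set A."""
    return min(math.dist(x, a) for a in A)

def hausdorff(C, D):
    """h(C, D) = sup_{x in C} d(x, D)  v  sup_{x in D} d(x, C) for finite sets."""
    return max(
        max(dist_to_set(x, D) for x in C),
        max(dist_to_set(y, C) for y in D),
    )

# Hypothetical example: vertices of the unit square and of a copy shifted by 0.5.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]

print(hausdorff(square, shifted))
```

Each vertex of one set lies at distance 0.5 from the nearest vertex of the other, so both one-sided suprema coincide here. In general the two one-sided quantities differ, which is why the maximum of the two is taken.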
Note that for $r = 2$ the exponent in the bound in Corollary 8.3 is $d - 1$, which is $< 2$ for $d = 2$ (and hence $\mathcal{C}_2$ is $P$-Donsker for measures $P$ with bounded Lebesgue density), but is $\ge 2$ when $d \ge 3$. Bolthausen (1978) showed that $\mathcal{C}_2$ is Donsker. Dudley (1984), (1999) studied the boundary case $d = 3$ and showed that when $P$ is Lebesgue measure $\lambda = \lambda_d$ on $[0, 1]^d$, for each $\delta > 0$ there is an $M = M(\delta) < \infty$ such that
\[
P\big( \|\mathbb{G}_n\|_{\mathcal{C}_3} > M (\log n)^{1/2} (\log \log n)^{-\delta - 1/2} \big) \to 1 \qquad \text{as } n \to \infty;
\]
it follows in particular that $\mathcal{C}_3$ is not $\lambda_d$-Donsker.

Now consider convex functions $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a compact, convex subset of $\mathbb{R}^d$. If we also require that the functions be uniformly Lipschitz, then an entropy bound with respect to the uniform metric can be derived from the preceding result.

Corollary 8.4 Suppose that $\mathcal{F}$ is the class of all convex functions $f : \mathcal{X} \to [0, 1]$ defined on a compact, convex subset $\mathcal{X}$ of $\mathbb{R}^d$ and satisfying $|f(x) - f(y)| \le L \|x - y\|$ for every $x, y \in \mathcal{X}$. Then
\[
\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le K (1 + L)^{d/2} \Big(\frac{1}{\varepsilon}\Big)^{d/2}
\]
for all $\varepsilon > 0$, for a constant $K$ that depends on $d$ and the set $\mathcal{X}$ only.

Proof. See van der Vaart and Wellner (1996), page 164. □

8.4 Lower layers

A set $C \subset \mathbb{R}^d$ is called a lower layer if and only if $x \in C$ and $y \le x$ implies $y \in C$. Here $y \le x$ means that $y_j \le x_j$ for $j = 1, \ldots, d$, where $y = (y_1, \ldots, y_d)$ and $x = (x_1, \ldots, x_d)$. Let $\mathcal{LL}_d$ denote the collection of all lower layers in $\mathbb{R}^d$ with nonempty complement, and let
\[
\mathcal{LL}_{d,1} = \{L \cap [0, 1]^d : L \in \mathcal{LL}_d,\ L \cap [0, 1]^d \ne \emptyset\}.
\]
Lower layers arise naturally in problems involving functions $f : \mathbb{R}^d \to \mathbb{R}$ that are monotone in the sense of being increasing (nondecreasing) in each of their arguments. For such a function the level sets appear as the boundaries of sets which are lower layers: for $t \in \mathbb{R}$,
\[
\{x \in \mathbb{R}^d : f(x) \le t\} = C
\]
is a lower layer (if $t$ is in the interior of the range of $f$).

Recall that for a metric space $(\mathbb{D}, d)$, $x \in \mathbb{D}$, and a set $A \subset \mathbb{D}$, $d(x, A) = \inf\{d(x, y) : y \in A\}$.
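Further, the Hausdorff pseudometric $h$ for sets $A, B \subset \mathbb{D}$ is given by
\[
h(A, B) = \max\Big\{ \sup_{x \in A} d(x, B),\ \sup_{y \in B} d(y, A) \Big\}.
\]
It is not hard to show that $h$ is a metric on the class of closed, bounded, nonempty subsets of $\mathbb{D}$. The following theorem concerning the behavior of the covering numbers and bracketing numbers for lower layers is from Dudley (1999), page 266.

Theorem 8.5 For $d \ge 2$, as $\varepsilon \downarrow 0$ the following assertions hold:
\[
\log N(\varepsilon, \mathcal{LL}_{d,1}, h) \asymp \log N(\varepsilon, \mathcal{LL}_{d,1}, L_1(\lambda)) \asymp \varepsilon^{1-d},
\]
and
\[
\log N_{[\,]}(\varepsilon, \mathcal{LL}_{d,1}, L_1(\lambda)) \asymp \varepsilon^{1-d}.
\]

For other results on lower layers and related statistical problems involving monotone functions, see Wright (1981) and Hanson, Pledger, and Wright (1973).

8.5 Exercises

Exercise 8.1 Suppose that $\mathcal{F}$ is the class of all differentiable functions $f$ from $[0, 1]$ to $[0, 1]$ with $\|f'\|_\infty \le 1$. Show that for some constant $K$
\[
\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \frac{K}{\varepsilon} \qquad \text{for all } \varepsilon > 0.
\]
(Hint: Consider approximations of the form
\[
\tilde f(x) = \sum_{j=1}^{M+1} \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor 1_{(a_{j-1}, a_j]}(x) + \varepsilon \Big\lfloor \frac{f(0)}{\varepsilon} \Big\rfloor 1_{\{0\}}(x)
\]
with $a_0 = 0$, $a_1 = \varepsilon$, \ldots, $a_M = M\varepsilon$, $a_{M+1} = 1$, and $M = \lfloor 1/\varepsilon \rfloor$, so that $M + 1 \le 2/\varepsilon$.)

Solution. Suppose that $x \in (a_{j-1}, a_j]$. Then
\[
|\tilde f(x) - f(x)| = \Big| \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - f(x) \Big|
\le \Big| \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - f(a_j) \Big| + |f(a_j) - f(x)|
= \Big| \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - f(a_j) \Big| + |f'(\tilde x)|\, |a_j - x|
\le 2\varepsilon
\]
for some $\tilde x \in [x, a_j]$, by hypothesis and by the construction of the $a_j$'s. Note also that $|\tilde f(0) - f(0)| \le \varepsilon \le 2\varepsilon$, so that $\|f - \tilde f\|_\infty \le 2\varepsilon$.

The number of sets in the partition $\{0\}, (a_{j-1}, a_j]$, $j = 1, \ldots, M+1$, is $M + 2$, and $M + 2 \le 2/\varepsilon$ for sufficiently small $\varepsilon$. The total number of choices for $\tilde f(0)$ is at most $\lfloor 1/\varepsilon \rfloor + 1 \le 2/\varepsilon$ for $0 < \varepsilon \le 1$ (because $0 \le f(0) \le 1$). But
\[
|\tilde f(a_k) - \tilde f(a_{k-1})| \le |\tilde f(a_k) - f(a_k)| + |f(a_k) - f(a_{k-1})| + |\tilde f(a_{k-1}) - f(a_{k-1})| \le 3\varepsilon.
\]
Then
\[
\tilde f(a_{k-1}) - 3\varepsilon \le \tilde f(a_k) \le \tilde f(a_{k-1}) + 3\varepsilon,
\]
that is,
\[
\varepsilon \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor - 3\varepsilon \le \tilde f(a_k) \le \varepsilon \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor + 3\varepsilon,
\qquad
\Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor - 3 \le \frac{\tilde f(a_k)}{\varepsilon} \le \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor + 3.
\]
So, once $\tilde f(0)$ is fixed, there are at most 7 choices for $\tilde f(a_1)$, and then at most 7 for $\tilde f(a_2)$, and so on.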
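Hence the total number of possible functions $\tilde f$ is at most
\[
\Big( \Big\lfloor \frac{1}{\varepsilon} \Big\rfloor + 1 \Big) 7^{M+1} \le \frac{2}{\varepsilon}\, 7^{2/\varepsilon}
\]
for $0 < \varepsilon \le 1$, and finally
\[
N(2\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \frac{2}{\varepsilon}\, (49)^{1/\varepsilon},
\]
leading to the conclusion of the proposition. □

Exercise 8.2 Suppose that $\mathcal{F} = \big\{ f : [0, 1] \to [0, 1] \,\big|\, \int_0^1 (f'(x))^2\, dx \le 1 \big\}$. Show that for $\lambda$ = Lebesgue measure on $[0, 1]$ there is a constant $K$ so that
\[
\log N(\varepsilon, \mathcal{F}, L_2(\lambda)) \le \frac{K}{\varepsilon} \log\Big(\frac{K}{\varepsilon}\Big) \qquad \text{for all } \varepsilon > 0.
\]

Solution. As in Exercise 8.1, set $a_0 = 0$, $a_1 = \varepsilon$, \ldots, $a_M = M\varepsilon$, $a_{M+1} = 1$ with $M = \lfloor 1/\varepsilon \rfloor$, so that $M + 1 \le 2/\varepsilon$. For $g \in \mathcal{F}$, define
\[
\bar g_k = \frac{\int_{a_{k-1}}^{a_k} g(x)\, dx}{a_k - a_{k-1}}
\qquad \text{and let} \qquad
\tilde g = \sum_{k=1}^{M+1} \bar g_k 1_{(a_{k-1}, a_k]}.
\]
For $x \in (a_{k-1}, a_k]$, we have
\[
|g(x) - \bar g_k|^2 \le \frac{\varepsilon^2 \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du}{a_k - a_{k-1}}.
\]
Note then that
\[
\int_0^1 |g(x) - \tilde g(x)|^2\, dx = \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} |g(x) - \bar g_k|^2\, dx
\le \sum_{k=1}^{M+1} \varepsilon^2 \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du \le \varepsilon^2.
\]
Next define
\[
g^* = \sum_{k=1}^{M+1} \varepsilon \Big\lfloor \frac{\bar g_k}{\varepsilon} \Big\rfloor 1_{(a_{k-1}, a_k]}.
\]
The total number of all such functions is dominated by $(K_1/\varepsilon)^{2/\varepsilon}$, since the total number of sets of the form $(a_{k-1}, a_k]$ is $\le 2/\varepsilon$, and $\lfloor \bar g_k/\varepsilon \rfloor$ takes at most $2/\varepsilon$ values since $0 \le \bar g_k \le 1$. We will show that $\|g^* - \tilde g\|_{L_2(\lambda)} \le \varepsilon$. This together with the fact that $\|\tilde g - g\|_{L_2(\lambda)} \le \varepsilon$ gives $\|g^* - g\|_{L_2(\lambda)} \le 2\varepsilon$. Thus $N(2\varepsilon, \mathcal{F}, L_2(\lambda)) \le (K_1/\varepsilon)^{2/\varepsilon}$ and hence, replacing $\varepsilon$ by $\varepsilon/2$,
\[
N(\varepsilon, \mathcal{F}, L_2(\lambda)) \le \Big(\frac{K}{\varepsilon}\Big)^{K/\varepsilon}
\]
for a sufficiently large constant $K$. This establishes the desired result.

It remains to show the intermediate technical steps. First,
\[
\|g^* - \tilde g\|_{L_2(\lambda)}^2
= \int_0^1 \Big[ \sum_{k=1}^{M+1} \bar g_k 1_{(a_{k-1}, a_k]} - \sum_{k=1}^{M+1} \varepsilon \Big\lfloor \frac{\bar g_k}{\varepsilon} \Big\rfloor 1_{(a_{k-1}, a_k]} \Big]^2 d\lambda
= \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} \Big[ \bar g_k - \varepsilon \Big\lfloor \frac{\bar g_k}{\varepsilon} \Big\rfloor \Big]^2 d\lambda
\]
\[
= \varepsilon^2 \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} \Big[ \frac{\bar g_k}{\varepsilon} - \Big\lfloor \frac{\bar g_k}{\varepsilon} \Big\rfloor \Big]^2 d\lambda
\le \varepsilon^2 \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} d\lambda
= \varepsilon^2 \sum_{k=1}^{M+1} (a_k - a_{k-1}) = \varepsilon^2.
\]
Finally, to show that
\[
|g(x) - \bar g_k|^2 \le \frac{\varepsilon^2 \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du}{a_k - a_{k-1}}, \qquad x \in (a_{k-1}, a_k],
\]
consider
\begin{align*}
\Big| g(x) - \frac{\int_{a_{k-1}}^{a_k} g(y)\, dy}{a_k - a_{k-1}} \Big|^2
&= \Big| \int_{a_{k-1}}^{a_k} \frac{g(y) - g(x)}{a_k - a_{k-1}}\, dy \Big|^2 \\
&= \frac{1}{(a_k - a_{k-1})^2} \Big| -\int_{a_{k-1}}^{x} \Big( \int_y^x g'(u)\, du \Big) dy + \int_x^{a_k} \Big( \int_x^y g'(u)\, du \Big) dy \Big|^2 \\
&= \frac{1}{(a_k - a_{k-1})^2} \Big| -\int_{a_{k-1}}^{x} g'(u) \Big( \int_{a_{k-1}}^{u} dy \Big) du + \int_x^{a_k} g'(u) \Big( \int_u^{a_k} dy \Big) du \Big|^2 \\
&= \frac{1}{(a_k - a_{k-1})^2} \Big| -\int_{a_{k-1}}^{x} (u - a_{k-1}) g'(u)\, du + \int_x^{a_k} (a_k - u) g'(u)\, du \Big|^2 \\
&= \Big| \int_{a_{k-1}}^{a_k} \frac{\xi(u) g'(u)}{a_k - a_{k-1}}\, du \Big|^2
\quad \text{with } \xi(u) = (a_{k-1} - u) 1\{u \le x\} + (a_k - u) 1\{x < u\} \\
&\le \frac{1}{a_k - a_{k-1}} \int_{a_{k-1}}^{a_k} \xi^2(u)\, g'(u)^2\, du \\
&\le \frac{\varepsilon^2}{a_k - a_{k-1}} \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du,
\end{align*}
where the next-to-last inequality is the Cauchy-Schwarz inequality and the last inequality follows because $\xi^2(u) \le \varepsilon^2$. □

Chapter 9 Multiplier Inequalities and CLT

9.1 The unconditional multiplier CLT

If we write $Z_i = \delta_{X_i} - P$, then the empirical process can be rewritten as $\mathbb{G}_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i$. The Donsker theorem says that
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i \Rightarrow \mathbb{G} \qquad \text{in } \ell^\infty(\mathcal{F}),
\]
where $\mathbb{G}$ is a tight Brownian bridge process.

Now suppose that $\xi_1, \ldots, \xi_n$ are i.i.d. real random variables which are also independent of $Z_1, \ldots, Z_n$, and consider the process $\frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i$. If the $\xi_i$ have mean zero and satisfy a certain moment condition, the hypothesis that $\mathcal{F}$ is a Donsker class is necessary and sufficient for the multiplier CLT:
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i \Rightarrow \sigma \mathbb{G} \qquad \text{in } \ell^\infty(\mathcal{F}),
\]
where $\sigma^2 = \mathrm{Var}(\xi_1)$. For a random variable $\xi$, set
\[
\|\xi\|_{2,1} = \int_0^\infty \sqrt{P(|\xi| > t)}\, dt.
\]
It is easily seen that $\|\xi\|_{2,1} < \infty$ implies that $\|\xi\|_2 < \infty$. In fact
\begin{align*}
E|\xi|^2 &= \int_0^\infty P(|\xi|^2 > t)\, dt \\
&= \int_0^\infty P(|\xi| > t)\, 2t\, dt \qquad \text{(by a change of variable)} \\
&= \int_0^\infty 2t \sqrt{P(|\xi| > t)} \sqrt{P(|\xi| > t)}\, dt \\
&\le 2 \int_0^\infty t \sqrt{\frac{E|\xi|^2}{t^2}} \sqrt{P(|\xi| > t)}\, dt \qquad \text{(by Markov's inequality)} \\
&= 2 \sqrt{E|\xi|^2} \int_0^\infty \sqrt{P(|\xi| > t)}\, dt,
\end{align*}
and then
\[
\sqrt{E|\xi|^2} \le 2 \int_0^\infty \sqrt{P(|\xi| > t)}\, dt, \qquad \text{that is,} \qquad \|\xi\|_2 \le 2 \|\xi\|_{2,1}.
\]
The following multiplier inequalities give an upper and a lower bound for the expectation of the supremum norm of the multiplier process in terms of a version symmetrized by Rademacher random variables.

Lemma 9.1 Suppose that $Z_1, \ldots, Z_n$ are i.i.d. stochastic processes with $E^* \|Z_i\|_{\mathcal{F}} < \infty$, independent of the Rademacher variables $\varepsilon_1, \ldots, \varepsilon_n$. Suppose that $\xi_1, \ldots, \xi_n$ are i.i.d.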
mean zero random variables independent of $Z_1, \ldots, Z_n$ and satisfying $\|\xi\|_{2,1} < \infty$. Then, for any $1 \le n_0 \le n$,
\begin{align*}
\frac{1}{2} \|\xi\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}}
&\le E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}} \\
&\le 2 (n_0 - 1)\, E^* \|Z_1\|_{\mathcal{F}}\, E \max_{1 \le i \le n} \frac{|\xi_i|}{\sqrt{n}}
+ 4 \|\xi\|_{2,1} \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^k \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\end{align*}
If the $\xi_i$'s are symmetric about zero, then the constants $1/2$, $2$ and $4$ can be replaced by 1.

Proof. Define $\varepsilon_1, \ldots, \varepsilon_n$ independent of $\xi_1, \ldots, \xi_n$ on their own factor of a product probability space. Suppose first that the $\xi_i$'s are symmetric. Then the random variables $\varepsilon_i |\xi_i|$ have the same distribution as the $\xi_i$'s, and the inequality on the left follows from
\begin{align*}
E^* \Big\| \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}}
&= E^* \Big\| \sum_{i=1}^n \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}} \\
&= E^* E_\xi \Big\| \sum_{i=1}^n \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}} \qquad \text{(by a property of conditional expectation)} \\
&\ge E^* \Big\| E_\xi \Big[ \sum_{i=1}^n \varepsilon_i |\xi_i| Z_i \Big] \Big\|_{\mathcal{F}} \qquad \text{(by Jensen and convexity of the sup norm)} \\
&= E^* \Big\| \sum_{i=1}^n \varepsilon_i Z_i\, E_\xi |\xi_i| \Big\|_{\mathcal{F}} \qquad \text{(by independence)} \\
&= \|\xi_1\|_1\, E^* \Big\| \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\end{align*}
For the general case, let $\eta_1, \ldots, \eta_n$ be an independent copy of $\xi_1, \ldots, \xi_n$. Then $\|\xi_i\|_1 \le \|\xi_i - \eta_i\|_1$ because
\[
E|\xi_i - \eta_i| = E\big\{ E\big[ |\xi_i - \eta_i| \,\big|\, \xi_i \big] \big\} \ge E\big| E[\xi_i - \eta_i \mid \xi_i] \big| = E|\xi_i - E\eta_i| = E|\xi_i|.
\]
Therefore $\|\xi_i\|_1$ can be replaced by $\|\xi_i - \eta_i\|_1$ on the left side. The variable $\xi_i - \eta_i$ is symmetric, so the inequality proved above applies to it and gives
\begin{align*}
\|\xi\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}}
&\le \|\xi - \eta\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}} \\
&\le E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i |\xi_i - \eta_i| Z_i \Big\|_{\mathcal{F}} \\
&= E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}} \\
&\le 2 E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}}.
\end{align*}
In the last step we have used the triangle inequality and the fact that the $\xi_i$'s and the $\eta_i$'s are identically distributed. Thus the inequality on the left has been proved.

To prove the inequality on the right side, start again with the case of symmetric $\xi_i$'s. Let $\tilde\xi_1 \ge \cdots \ge \tilde\xi_n$ be the reversed order statistics of the random variables $|\xi_1|, \ldots, |\xi_n|$. By the definition of $Z_1, \ldots, Z_n$ as fixed functions of the coordinates on the product space $(\mathcal{X}^n, \mathcal{B}^n)$, it follows that for any fixed $\xi_1, \ldots, \xi_n$,
\[
E_\varepsilon E_Z^* \Big\| \sum_{i=1}^n \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}} = E_\varepsilon E_Z^* \Big\| \sum_{i=1}^n \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}}.
\]
By Fubini's theorem for outer measures (Lemma 1.2.7 of van der Vaart and Wellner (1996)), the joint outer expectation $E^*$ can be replaced by $E_{\xi,\varepsilon} E_Z^*$. Thus it follows by the triangle inequality that, for any $n_0 \le n$,
\begin{align*}
E^* \Big\| \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}}
&= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^n \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}}
= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^n \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}} \\
&\le E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}}
+ E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=n_0}^{n} \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}} \\
&= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}}
+ E^* \Big\| \sum_{i=n_0}^{n} \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}} \qquad \text{(by Fubini)}.
\end{align*}
For the first term in the last display, the triangle inequality and $\tilde\xi_i \le \tilde\xi_1$ give
\[
E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde\xi_i Z_i \Big\|_{\mathcal{F}}
\le E_{\xi} E_Z^* \sum_{i=1}^{n_0 - 1} \tilde\xi_1 \|Z_i\|_{\mathcal{F}}
\le E \tilde\xi_1\, (n_0 - 1)\, E^* \|Z_1\|_{\mathcal{F}}.
\]
Now write $\tilde\xi_i = \sum_{k=i}^n (\tilde\xi_k - \tilde\xi_{k+1})$ in the second term (with $\tilde\xi_{n+1} = 0$) and change the order of summation to find that the second term equals
\[
E^* \Big\| \sum_{k=n_0}^{n} (\tilde\xi_k - \tilde\xi_{k+1}) \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}
\le E \Big\{ \sum_{k=n_0}^{n} \sqrt{k}\, (\tilde\xi_k - \tilde\xi_{k+1}) \Big\}\, \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
Since $k = \#\{i \le n : |\xi_i| \ge t\}$ on $\tilde\xi_{k+1} < t \le \tilde\xi_k$, the first expectation in the last display can be written as
\begin{align*}
E \sum_{k=n_0}^{n} \int_{\tilde\xi_{k+1}}^{\tilde\xi_k} \sqrt{k}\, dt
&\le E \int_0^\infty \sqrt{\#\{i \le n : |\xi_i| \ge t\}}\, dt \\
&\le \int_0^\infty \sqrt{E\, \#\{i \le n : |\xi_i| \ge t\}}\, dt \qquad \text{(by Jensen)} \\
&= \int_0^\infty \sqrt{n P(|\xi_1| \ge t)}\, dt = \sqrt{n}\, \|\xi\|_{2,1}.
\end{align*}
Combining these pieces yields the upper bound in the case of symmetric variables $\xi_i$.
For asymmetric multipliers $\xi_i$, note that
\[
E^* \Big\| \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}}
= E^* \Big\| \sum_{i=1}^n (\xi_i - E\eta_i) Z_i \Big\|_{\mathcal{F}}
\le E^* \Big\| \sum_{i=1}^n (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}}.
\]
Then apply the bound already derived for symmetric multipliers to the right side of this display:
\[
E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}}
\le (n_0 - 1)\, E \max_{1 \le i \le n} \frac{|\xi_i - \eta_i|}{\sqrt{n}}\, E^* \|Z_1\|_{\mathcal{F}}
+ \|\xi - \eta\|_{2,1} \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
For the first term use the triangle inequality, $E \max_i |\xi_i - \eta_i| \le 2 E \max_i |\xi_i|$, while for the second one note that
\[
\|\xi - \eta\|_{2,1} \le 4 \|\xi\|_{2,1}.
\]
In fact, for any pair of random variables $\xi$ and $\eta$ we have
\[
P(|\xi + \eta| > t) \le P(|\xi| > t/2) + P(|\eta| > t/2)
\]
and $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$, so
\begin{align*}
\|\xi + \eta\|_{2,1} &= \int_0^\infty \sqrt{P(|\xi + \eta| > t)}\, dt \\
&\le \int_0^\infty \sqrt{P(|\xi| > t/2)}\, dt + \int_0^\infty \sqrt{P(|\eta| > t/2)}\, dt \\
&= 2 \int_0^\infty \sqrt{P(|\xi| > t)}\, dt + 2 \int_0^\infty \sqrt{P(|\eta| > t)}\, dt \\
&= 2 \|\xi\|_{2,1} + 2 \|\eta\|_{2,1}.
\end{align*}
This completes the proof. □

The main application of Lemma 9.1 is to the unconditional multiplier central limit theorem.

Theorem 9.1 Suppose that $\mathcal{F}$ is a class of measurable functions on a probability space $(\mathcal{X}, \mathcal{A}, P)$. Suppose that $\xi_1, \ldots, \xi_n$ are i.i.d. real random variables with mean zero, variance 1, and $\|\xi_1\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Then the sequence
\[
n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)
\]
converges to a tight limit process in $\ell^\infty(\mathcal{F})$ if and only if $\mathcal{F}$ is $P$-Donsker. When either convergence holds, the limit process in each case is a (tight) $P$-Brownian bridge process $\mathbb{G}$.

Proof. Since both the empirical process $n^{-1/2} \sum_{i=1}^n (\delta_{X_i} - P)$ and the multiplier process $n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$ do not change if indexed by the class of functions $\{f - Pf : f \in \mathcal{F}\}$ instead of $\mathcal{F}$, it may be assumed without loss of generality that $Pf = 0$ for every $f$. Marginal convergence of both sequences of processes is equivalent to $\mathcal{F} \subset L_2(P)$.
It suffices to show that the asymptotic equicontinuity conditions for the empirical and the multiplier processes are equivalent. If $\mathcal{F}$ is Donsker, then its envelope function $F$ possesses a weak second moment: $P^*(F > x) = o(x^{-2})$ as $x \to \infty$ (see e.g. Lemma 2.3.9 in van der Vaart and Wellner (1996), page 113). By the same lemma, convergence of the multiplier process to a tight limit implies that $P^*(|\xi F| > x) = o(x^{-2})$. In particular $P^* F < \infty$. For $Z_i = \delta_{X_i} - P$, since $Pf = 0$ for all $f \in \mathcal{F}$ we have that
\[
E^* \|Z_1\|_{\mathcal{F}} = E^* F(X) = P^* F < \infty.
\]
Since $\|\xi\|_{2,1} < \infty$ implies that $E(\xi^2) < \infty$, it follows that
\[
E \max_{1 \le i \le n} |\xi_i| / \sqrt{n} \to 0.
\]
Consider the multiplier inequalities of Lemma 9.1. By what we claimed above, the first term on the far right side converges to 0, and we have
\begin{align*}
\frac{1}{2} \|\xi\|_1 \limsup_{n \to \infty} E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}_\delta}
&\le \limsup_{n \to \infty} E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}_\delta} \\
&\le 4 \|\xi\|_{2,1} \sup_{k \ge n_0} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}_\delta}
\end{align*}
for every $n_0$ and $\delta > 0$. By the symmetrization inequalities of Section 4.1, the Rademacher random variables in these inequalities can be deleted at the cost of changing the constants by factors of two. Since $0 < \|\xi\|_1 < \infty$ and $\|\xi\|_{2,1} < \infty$, this yields the conclusion that, for every $\delta_n \downarrow 0$,
\[
E^* \Big\| n^{-1/2} \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}_{\delta_n}} \to 0
\qquad \text{if and only if} \qquad
E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i \Big\|_{\mathcal{F}_{\delta_n}} \to 0.
\]
These are the $L_1$-versions of the asymptotic equicontinuity conditions, but they are equivalent to the versions in probability. Consider for example the case of the empirical process. Since $\mathcal{F}$ is Donsker, the random variable $\|Z_1\|_{\mathcal{F}}^*$ possesses a weak second moment. This implies that
\[
E^* \max_{1 \le i \le n} \frac{\|Z_i\|_{\mathcal{F}}^*}{\sqrt{n}} \to 0.
\]
In view of the triangle inequality, the same is true with $\mathcal{F}$ replaced by $\mathcal{F}_{\delta_n}$. The asymptotic equicontinuity condition corresponds to the convergence to zero in probability of $\| n^{-1/2} \sum_{i=1}^n Z_i \|_{\mathcal{F}_{\delta_n}}$; this implies pointwise convergence to zero of the sequence of their quantile functions. Applying the Hoffman-Jørgensen inequality of Section 4.4 then gives the condition in terms of moments. □
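Theorem 9.1 can be illustrated numerically. The following Monte Carlo sketch is not part of the original notes and rests on assumptions chosen only for illustration: $X_i \sim$ Uniform$(0,1)$, the Donsker class $\mathcal{F} = \{1_{[0,t]} : t \in [0,1]\}$, and standard normal multipliers $\xi_i$ (mean zero, variance one, $\|\xi\|_{2,1} < \infty$). The function names are hypothetical. It compares the average supremum norm of the empirical process with that of the multiplier process; since both converge to the same Brownian bridge, the two averages should be close for large $n$.

```python
import math
import random

def sup_empirical(xs):
    """||n^{-1/2} sum_i (delta_{X_i} - P)||_F for F = {1[0,t]}, P = Uniform(0,1)."""
    n = len(xs)
    best = 0.0
    for i, x in enumerate(sorted(xs)):
        # The empirical d.f. jumps from i/n to (i+1)/n at the i-th order statistic.
        best = max(best, abs(i / n - x), abs((i + 1) / n - x))
    return math.sqrt(n) * best

def sup_multiplier(xs, xis):
    """||n^{-1/2} sum_i xi_i (delta_{X_i} - P)||_F for the same class F."""
    n = len(xs)
    total = sum(xis)
    partial = 0.0
    best = 0.0
    for x, xi in sorted(zip(xs, xis)):
        # t -> partial - t * total is linear between data points, so it is
        # enough to check the left limit and the value at each data point.
        best = max(best, abs(partial - x * total))
        partial += xi
        best = max(best, abs(partial - x * total))
    return best / math.sqrt(n)

def monte_carlo(n=200, reps=300, seed=0):
    """Average both supremum norms over independent replications."""
    rng = random.Random(seed)
    emp = mult = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        xis = [rng.gauss(0.0, 1.0) for _ in range(n)]
        emp += sup_empirical(xs)
        mult += sup_multiplier(xs, xis)
    return emp / reps, mult / reps

mean_emp, mean_mult = monte_carlo()
print(f"empirical: {mean_emp:.3f}, multiplier: {mean_mult:.3f}")
```

With all multipliers equal to one, `sup_multiplier` reduces to `sup_empirical`, which gives a quick consistency check of the two routines.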
9.2 Conditional multiplier CLT's

While the unconditional multiplier CLT, Theorem 9.1, is useful, the deeper conditional multiplier CLT's involve conditioning on the original $X_i$'s and examining the convergence properties of the resulting sums as a function of the random multipliers. The following two theorems assert weak convergence of $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$ in probability and given almost every sequence $X_1, X_2, \ldots$, respectively, and are of interest for statistics in connection with bootstrap results (see Chapter 14).

Conditional weak convergence in probability must be formulated in terms of a metric on conditional laws. Since conditional laws do not exist without proper measurability, we use the bounded dual Lipschitz distance based on outer expectations. In fact weak convergence, $\mathbb{G}_n \Rightarrow \mathbb{G}$, of a sequence of random elements $\mathbb{G}_n$ in $\ell^\infty(\mathcal{F})$ to a separable limit $\mathbb{G}$ is equivalent to
\[
\sup_{H \in BL_1} |E^* H(\mathbb{G}_n) - E H(\mathbb{G})| \to 0.
\]
Here $BL_1$ is the set of all functions $H : \ell^\infty(\mathcal{F}) \to [0, 1]$ such that $|H(z_1) - H(z_2)| \le \|z_1 - z_2\|_{\mathcal{F}}$ for every $z_1, z_2$.

Theorem 9.2 Suppose that $\mathcal{F}$ is a class of measurable functions and that $\xi_1, \ldots, \xi_n$ are i.i.d. random variables with mean zero, variance 1, and $\|\xi\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Let $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$. Then the following assertions are equivalent:

(i) $\mathcal{F}$ is Donsker.

(ii) $\sup_{H \in BL_1} |E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ in outer probability, and the sequence $\mathbb{G}_n'$ is asymptotically measurable.

For the almost sure conditional convergence we also use a condition in terms of the bounded dual Lipschitz distance.

Theorem 9.3 Suppose that $\mathcal{F}$ is a class of measurable functions and that $\xi_1, \ldots, \xi_n$ are i.i.d. random variables with mean zero, variance 1, and $\|\xi\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Let $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$. Then the following assertions are equivalent:

(i) $\mathcal{F}$ is Donsker and $P^* \|f - Pf\|_{\mathcal{F}}^2 < \infty$.
(ii) $\sup_{H \in BL_1} |E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ outer almost surely, and the sequence $E_\xi H(\mathbb{G}_n')^* - E_\xi H(\mathbb{G}_n')_*$ converges almost surely to zero for every $H \in BL_1$.

Note that (ii) implies that $|E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ for every $H \in BL_1$, for almost every sequence $X_1, X_2, \ldots$. By the portmanteau theorem, this is then also true for every continuous, bounded $H$. Thus, the sequence $\mathbb{G}_n'$ converges in distribution to $\mathbb{G}$ given almost every sequence $X_1, X_2, \ldots$.

Part II Empirical Processes: Applications

Chapter 10 Consistency of Maximum Likelihood Estimators

Consistency of maximum likelihood estimators is well established for regular parametric models. For nonparametric models, even the definition of a maximum likelihood estimator poses certain problems, and it is clear that such estimators are not consistent in general. We first prove a general result for nonparametric maximum likelihood estimation in a convex class of densities. The results in this chapter are based on the papers of Pfanzagl (1988) and Van de Geer (1993), (1996).

Consider a class $\mathcal{P}$ of densities on a measurable space $(\mathcal{X}, \mathcal{A})$ with respect to a fixed $\sigma$-finite measure $\mu$. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_0$ with density $p_0 \in \mathcal{P}$. Let
\[
\hat p_n \equiv \arg\max_{p \in \mathcal{P}} \mathbb{P}_n \log p.
\]
For $0 < \alpha \le 1$, let $\varphi_\alpha(t) = (t^\alpha - 1)/(t^\alpha + 1)$ for $t \ge 0$, and $\varphi_\alpha(t) = -1$ for $t < 0$. Then $\varphi_\alpha$ is bounded and continuous for each $\alpha \in (0, 1]$. For $0 < \beta < 1$ define
\[
h_\beta^2(p, q) \equiv 1 - \int p^\beta q^{1-\beta}\, d\mu.
\]
Note that
\[
h_{1/2}^2(p, q) \equiv h^2(p, q) = \frac{1}{2} \int \{\sqrt{p} - \sqrt{q}\}^2\, d\mu
\]
yields the Hellinger distance between $p$ and $q$. By Hölder's inequality, $h_\beta(p, q) \ge 0$ with equality if and only if $p = q$ a.e. $\mu$.

Proposition 10.1 Suppose that $\mathcal{P}$ is convex. Then
\[
h_{1-\alpha/2}^2(\hat p_n, p_0) \le (\mathbb{P}_n - P_0) \Big( \varphi_\alpha\Big( \frac{\hat p_n}{p_0} \Big) \Big).
\]
In particular, when $\alpha = 1$ we have, with $\varphi \equiv \varphi_1$,
\[
h^2(\hat p_n, p_0) = h_{1/2}^2(\hat p_n, p_0) \le (\mathbb{P}_n - P_0) \Big( \varphi\Big( \frac{\hat p_n}{p_0} \Big) \Big) = (\mathbb{P}_n - P_0) \Big( \frac{2 \hat p_n}{\hat p_n + p_0} \Big).
\]
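The quantities $\varphi_\alpha$ and $h_\beta^2$ appearing in Proposition 10.1 are easy to evaluate for discrete densities. The following minimal sketch is not part of the original notes: it assumes that $\mu$ is counting measure on a three-point space, and the function names and the two densities are hypothetical. It checks the identity $h_{1/2}^2 = h^2$ and the rewriting $\varphi(t) = 2t/(t+1) - 1$ used in the last display.

```python
import math

def phi(t, alpha=1.0):
    """phi_alpha(t) = (t^alpha - 1) / (t^alpha + 1) for t >= 0, and -1 for t < 0."""
    if t < 0:
        return -1.0
    ta = t ** alpha
    return (ta - 1.0) / (ta + 1.0)

def h_beta_sq(p, q, beta):
    """h_beta^2(p, q) = 1 - sum_x p(x)^beta q(x)^(1 - beta) (counting measure)."""
    return 1.0 - sum(pi ** beta * qi ** (1.0 - beta) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """h^2(p, q) = (1/2) sum_x (sqrt(p(x)) - sqrt(q(x)))^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

# Two hypothetical densities on a three-point sample space.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

print(h_beta_sq(p, q, 0.5), hellinger_sq(p, q))   # the two values agree
print(h_beta_sq(p, q, 0.75))                      # beta = 1 - alpha/2 with alpha = 1/2
print(phi(2.0), 2 * 2.0 / (2.0 + 1.0) - 1.0)      # phi(t) = 2t/(t+1) - 1
```

The agreement for $\beta = 1/2$ holds because both densities sum to one, so that $\frac{1}{2}\sum(\sqrt{p} - \sqrt{q})^2 = 1 - \sum\sqrt{pq}$.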
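Corollary 10.1 Suppose that $\{\varphi(p/p_0) : p \in \mathcal{P}\}$ is a $P_0$-Glivenko-Cantelli class. Then for each $0 < \alpha \le 1$, $h_{1-\alpha/2}(\hat p_n, p_0) \to_{a.s.} 0$.

Proof. Since $\mathcal{P}$ is convex and $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$, it follows that
\[
\mathbb{P}_n \log \frac{\hat p_n}{(1 - t) \hat p_n + t p_1} \ge 0
\]
for all $0 \le t \le 1$ and every $p_1 \in \mathcal{P}$; this holds in particular for $p_1 = p_0$. Note that equality holds if $t = 0$. Differentiation of the left side with respect to $t$ at $t = 0$ yields
\[
\mathbb{P}_n \frac{p_1}{\hat p_n} \le 1 \qquad \text{for every } p_1 \in \mathcal{P}.
\]
If $L : (0, \infty) \to \mathbb{R}$ is increasing and $t \mapsto L(1/t)$ is convex, then Jensen's inequality yields
\[
\mathbb{P}_n L\Big( \frac{\hat p_n}{p_1} \Big) \ge L\Big( \frac{1}{\mathbb{P}_n (p_1/\hat p_n)} \Big) \ge L(1) = \mathbb{P}_n L\Big( \frac{p_1}{p_1} \Big).
\]
Choosing $L = \varphi_\alpha$ and $p_1 = p_0$ in this last inequality and noting that $L(1) = 0$, it follows that

(a) $\quad 0 \le \mathbb{P}_n \varphi_\alpha(\hat p_n/p_0) = (\mathbb{P}_n - P_0) \varphi_\alpha(\hat p_n/p_0) + P_0 \varphi_\alpha(\hat p_n/p_0)$;

see van der Vaart and Wellner (1996), page 330, and Pfanzagl (1988), pages 141-143. Now we show that

(b) $\quad \displaystyle P_0 \varphi_\alpha(p/p_0) = \int \frac{p^\alpha - p_0^\alpha}{p^\alpha + p_0^\alpha}\, dP_0 \le -\Big( 1 - \int p_0^\beta p^{1-\beta}\, d\mu \Big)$

for $\beta = 1 - \alpha/2$. Note that this holds if and only if
\[
-1 + 2 \int \frac{p^\alpha}{p_0^\alpha + p^\alpha}\, p_0\, d\mu \le -1 + \int p_0^\beta p^{1-\beta}\, d\mu,
\]
or
\[
\int p_0^\beta p^{1-\beta}\, d\mu \ge 2 \int \frac{p^\alpha}{p_0^\alpha + p^\alpha}\, p_0\, d\mu.
\]
But this holds if
\[
p_0^\beta p^{1-\beta} \ge 2\, \frac{p^\alpha}{p_0^\alpha + p^\alpha}\, p_0.
\]
With $\beta = 1 - \alpha/2$, this becomes
\[
\frac{1}{2} (p_0^\alpha + p^\alpha) \ge p_0^{\alpha/2} p^{\alpha/2} = \sqrt{p_0^\alpha p^\alpha},
\]
and this holds by the arithmetic mean-geometric mean inequality. Thus (b) holds. Combining (b) with (a) yields the claim of the proposition. The corollary follows by noting that $\varphi(t) = (t - 1)/(t + 1) = 2t/(t + 1) - 1$. □

The bound given in Proposition 10.1 is one of a family of results of this type. Here is another one which does not require that the family $\mathcal{P}$ be convex.

Proposition 10.2 (Van de Geer) Suppose that $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$. Then
\[
h^2(\hat p_n, p_0) \le (\mathbb{P}_n - P_0) \Big( \Big( \sqrt{\frac{\hat p_n}{p_0}} - 1 \Big) 1\{p_0 > 0\} \Big).
\]

Proof. Since $\hat p_n$ maximizes $\mathbb{P}_n \log p$,
\begin{align*}
0 &\le \frac{1}{2} \int_{[p_0 > 0]} \log \frac{\hat p_n}{p_0}\, d\mathbb{P}_n \\
&\le \int_{[p_0 > 0]} \Big( \sqrt{\frac{\hat p_n}{p_0}} - 1 \Big) d\mathbb{P}_n \qquad \text{since } \log(1 + x) \le x \\
&= \int_{[p_0 > 0]} \Big( \sqrt{\frac{\hat p_n}{p_0}} - 1 \Big) d(\mathbb{P}_n - P_0) + P_0 \Big( \Big( \sqrt{\frac{\hat p_n}{p_0}} - 1 \Big) 1\{p_0 > 0\} \Big) \\
&= \int_{[p_0 > 0]} \Big( \sqrt{\frac{\hat p_n}{p_0}} - 1 \Big) d(\mathbb{P}_n - P_0) - h^2(\hat p_n, p_0),
\end{align*}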
where the last equality follows by direct calculation and the definition of the Hellinger metric $h$. □

Proposition 10.3 (Birgé and Massart) If $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$, then
\[
h^2((\hat p_n + p_0)/2, p_0) \le (\mathbb{P}_n - P_0) \Big( \frac{1}{2} \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big),
\]
and
\[
h^2(\hat p_n, p_0) \le 24\, h^2\Big( \frac{\hat p_n + p_0}{2}, p_0 \Big).
\]

Proof. By concavity of $\log$,
\[
\log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \ge \frac{1}{2} \log\Big( \frac{\hat p_n}{p_0} \Big) 1_{[p_0 > 0]}.
\]
Thus
\begin{align*}
0 &\le \frac{1}{4} \mathbb{P}_n \Big( \log\Big( \frac{\hat p_n}{p_0} \Big) 1_{[p_0 > 0]} \Big) \\
&\le \frac{1}{2} \mathbb{P}_n \Big( \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big) \\
&= \frac{1}{2} (\mathbb{P}_n - P_0) \Big( \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big) + \frac{1}{2} P_0 \Big( \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big) \\
&= \frac{1}{2} (\mathbb{P}_n - P_0) \Big( \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big) - \frac{1}{2} K(P_0, (\hat P_n + P_0)/2) \\
&\le \frac{1}{2} (\mathbb{P}_n - P_0) \Big( \log\Big( \frac{\hat p_n + p_0}{2 p_0} \Big) 1_{[p_0 > 0]} \Big) - h^2(P_0, (\hat P_n + P_0)/2),
\end{align*}
where we used Exercise 10.2 at the last step. The second claim follows from Exercise 10.3. □

Corollary 10.2 (Hellinger consistency of MLE) Suppose that either $\{(\sqrt{p/p_0} - 1) 1\{p_0 > 0\} : p \in \mathcal{P}\}$ or $\{\frac{1}{2} \log\big(\frac{p + p_0}{2 p_0}\big) 1\{p_0 > 0\} : p \in \mathcal{P}\}$ is a $P_0$-Glivenko-Cantelli class. Then $h(\hat p_n, p_0) \to_{a.s.} 0$.

The following examples show how the Glivenko-Cantelli preservation theorems of Chapter 5 can be used to verify the hypotheses of Corollary 10.1 and Corollary 10.2.

Example 10.1 (Interval censoring, case I) Suppose that $Y \sim F$ on $\mathbb{R}^+$ and $T \sim G$. Here $Y$ is the time of some event of interest, and $T$ is an "observation time". Unfortunately, we do not observe $(Y, T)$; instead what is observed is $X = (1\{Y \le T\}, T) \equiv (\Delta, T)$. Our goal is to estimate $F$, the distribution of $Y$. Let $P_0$ be the distribution corresponding to $F_0$, and suppose that $(\Delta_1, T_1), \ldots, (\Delta_n, T_n)$ are i.i.d. as $(\Delta, T)$. Note that the conditional distribution of $\Delta$ given $T$ is simply Bernoulli$(F(T))$, and hence the density of $(\Delta, T)$ with respect to the dominating measure $\# \times G$ (here $\#$ denotes counting measure on $\{0, 1\}$) is given by
\[
p_F(\delta, t) = F(t)^\delta (1 - F(t))^{1-\delta}.
\]
Note that the sample space in this case is
\[
\mathcal{X} = \{(\delta, t) : \delta \in \{0, 1\},\ t \in \mathbb{R}^+\} = \{(1, t) : t \in \mathbb{R}^+\} \cup \{(0, t) : t \in \mathbb{R}^+\} := \mathcal{X}_1 \cup \mathcal{X}_2.
\]
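To make the case I density $p_F(\delta, t) = F(t)^\delta (1 - F(t))^{1-\delta}$ concrete, the following sketch simulates such data and evaluates the average log-likelihood $\mathbb{P}_n \log p_F$. It is not part of the original notes. The choices $F_0 = \mathrm{Exp}(1)$, $G = \mathrm{Uniform}(0, 3)$ and all function names are assumptions made only for illustration, and the maximizer is computed through the standard characterization (not derived in these notes) of the case I NPMLE as the isotonic regression of the indicators $\Delta_i$ ordered by $T_i$, here via pool-adjacent-violators.

```python
import math
import random

def pava(y):
    """Nondecreasing least-squares fit of y by pool-adjacent-violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def avg_log_lik(F_vals, deltas):
    """P_n log p_F with p_F(d, t) = F(t)^d (1 - F(t))^(1 - d)."""
    total = 0.0
    for F, d in zip(F_vals, deltas):
        p = F if d == 1 else 1.0 - F
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total / len(deltas)

rng = random.Random(1)
n = 500
T = sorted(rng.uniform(0.0, 3.0) for _ in range(n))   # observation times, G = Uniform(0, 3)
Y = [rng.expovariate(1.0) for _ in range(n)]          # unobserved event times, F0 = Exp(1)
delta = [1 if y <= t else 0 for y, t in zip(Y, T)]    # observed indicators 1{Y <= T}

F_hat = pava(delta)                                   # candidate NPMLE at the ordered T's
F_0 = [1.0 - math.exp(-t) for t in T]                 # true d.f. at the ordered T's

print(avg_log_lik(F_hat, delta), avg_log_lik(F_0, delta))
```

By construction the fitted values are nondecreasing and lie in $[0, 1]$, and the average log-likelihood at the fitted values is at least as large as at the true $F_0$.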
Now, the class of functions $\{p_F : F \text{ a d.f. on } \mathbb{R}^+\}$ is a universal Glivenko-Cantelli class by an application of Theorem 5.7, since on $\mathcal{X}_1$, $p_F(1, t) = F(t)$, while on $\mathcal{X}_2$, $p_F(0, t) = 1 - F(t)$, where $F$ is a distribution function (and hence bounded and monotone nondecreasing). Furthermore the class of functions $\{p_F/p_{F_0} : F \text{ a d.f. on } \mathbb{R}^+\}$ is $P_0$-Glivenko-Cantelli by an application of Theorem 5.6: take $\mathcal{F}_1 = \{p_F : F \text{ a d.f. on } \mathbb{R}^+\}$, $\mathcal{F}_2 = \{1/p_{F_0}\}$, and $\varphi(u, v) = uv$. Then both $\mathcal{F}_1$ and $\mathcal{F}_2$ are $P_0$-Glivenko-Cantelli classes, $\varphi$ is continuous, and $\mathcal{H} = \varphi(\mathcal{F}_1, \mathcal{F}_2)$ has $P_0$-integrable envelope $1/p_{F_0}$. Finally, a further application of Theorem 5.6 with $\varphi(t) = (t-1)/(t+1)$ shows that the hypothesis of Corollary 10.1 holds: $\{\varphi(p_F/p_{F_0}) : F \text{ a d.f. on } \mathbb{R}^+\}$ is $P_0$-Glivenko-Cantelli. Hence the conclusion of the corollary holds and we conclude that
\[
h^2(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0 \qquad \text{as } n \to \infty.
\]
Now note that $h^2(p, p_0) \ge d_{TV}^2(p, p_0)/2$ and we compute
\[
d_{TV}(p_{\hat F_n}, p_{F_0}) = \int |\hat F_n(t) - F_0(t)|\, dG(t) + \int |1 - \hat F_n(t) - (1 - F_0(t))|\, dG(t) = 2 \int |\hat F_n(t) - F_0(t)|\, dG(t),
\]
so we conclude that
\[
\int |\hat F_n(t) - F_0(t)|\, dG(t) \to_{a.s.} 0
\]
as $n \to \infty$. Since $\hat F_n$ and $F_0$ are bounded (by one), we can also conclude that
\[
\int |\hat F_n(t) - F_0(t)|^r\, dG(t) \to_{a.s.} 0
\]
for each $r \ge 1$, in particular for $r = 2$.

Example 10.2 (Mixed case interval censoring). Our goal in this example is to use the theory developed so far to give a proof of the consistency result of Schick and Yu (2000) for the Maximum Likelihood Estimator (MLE) $\hat F_n$ for "mixed case" interval censored data. Our proof is based on Proposition 10.1 and Corollary 10.1.

Suppose that $Y$ is a random variable taking values in $\mathbb{R}^+ = [0, \infty)$ with distribution function $F \in \mathcal{F} = \{\text{all d.f.'s } F \text{ on } \mathbb{R}^+\}$. Unfortunately we are not able to observe $Y$ itself. What we do observe is a vector of times $T_K = (T_{K,1}, \ldots, T_{K,K})$, where $K$, the number of times, is itself random, together with the interval $(T_{K,j-1}, T_{K,j}]$ into which $Y$ falls (with $T_{K,0} \equiv 0$, $T_{K,K+1} \equiv \infty$).
More formally, we assume that $K$ is an integer-valued random variable, that $\underline{T} = \{T_{k,j} : j = 1, \ldots, k,\ k = 1, 2, \ldots\}$ is a triangular array of "potential observation times", and that $Y$ and $(K, \underline{T})$ are independent. Let $X = (\Delta_K, T_K, K)$, with a possible value $x = (\delta_k, t_k, k)$, where $\Delta_k = (\Delta_{k,1}, \ldots, \Delta_{k,k+1})$ with $\Delta_{k,j} = 1_{(T_{k,j-1}, T_{k,j}]}(Y)$, $j = 1, 2, \ldots, k+1$, and $T_k$ is the $k$-th row of the triangular array $\underline{T}$. Suppose we observe $n$ i.i.d. copies $X_1, X_2, \ldots, X_n$ of $X$, where $X_i = (\Delta^{(i)}_{K^{(i)}}, T^{(i)}_{K^{(i)}}, K^{(i)})$, $i = 1, 2, \ldots, n$. Here $(Y^{(i)}, \underline{T}^{(i)}, K^{(i)})$, $i = 1, 2, \ldots$, are the underlying i.i.d. copies of $(Y, \underline{T}, K)$.

We first note that conditionally on $K$ and $T_K$, the vector $\Delta_K$ has a multinomial distribution:
\[
(\Delta_K \mid K, T_K) \sim \text{Multinomial}_{K+1}(1, \Delta F_K),
\]
where
\[
\Delta F_K \equiv \big(F(T_{K,1}),\ F(T_{K,2}) - F(T_{K,1}),\ \ldots,\ 1 - F(T_{K,K})\big).
\]
Suppose for the moment that the distribution $G_k$ of $(T_K \mid K = k)$ has density $g_k$, and let $p_k \equiv P(K = k)$. Then a density of $X$ is given by
\[
p_F(x) \equiv p_F(\delta, t_k, k) = \prod_{j=1}^{k+1} \big(F(t_{k,j}) - F(t_{k,j-1})\big)^{\delta_{k,j}}\, g_k(t)\, p_k, \tag{1}
\]
where $t_{k,0} \equiv 0$, $t_{k,k+1} \equiv \infty$. In general,
\[
p_F(x) \equiv p_F(\delta, t_k, k) = \prod_{j=1}^{k+1} \big(F(t_{k,j}) - F(t_{k,j-1})\big)^{\delta_{k,j}} = \sum_{j=1}^{k+1} \delta_{k,j} \big(F(t_{k,j}) - F(t_{k,j-1})\big) \tag{2}
\]
is a density of $X$ with respect to the dominating measure $\nu$, where $\nu$ is determined by the joint distribution of $(K, \underline{T})$, and it is this version of the density of $X$ with which we will work throughout the rest of this example. Thus the log-likelihood function for $F$ of $X_1, \ldots, X_n$ is given by
\[
\frac{1}{n} l_n(F \mid \underline{X}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K^{(i)}+1} \Delta^{(i)}_{K,j} \log\Big( F\big(T^{(i)}_{K^{(i)},j}\big) - F\big(T^{(i)}_{K^{(i)},j-1}\big) \Big) = \mathbb{P}_n m_F,
\]
where
\[
m_F(X) = \sum_{j=1}^{K+1} \Delta_{K,j} \log\big(F(T_{K,j}) - F(T_{K,j-1})\big) \equiv \sum_{j=1}^{K+1} \Delta_{K,j} \log(\Delta F_{K,j})
\]
and where we have ignored the terms not involving $F$. We also note that
\[
P m_F(X) = P\left( \sum_{j=1}^{K+1} \Delta F_{0,K,j} \log(\Delta F_{K,j}) \right).
\]

The (Nonparametric) Maximum Likelihood Estimator (MLE) $\hat F_n$ is the distribution function $\hat F_n(t)$ which puts all its mass at the observed time points and maximizes the log-likelihood $l_n(F \mid \underline{X})$. It can be calculated via the iterative convex minorant algorithm proposed in Groeneboom and Wellner (1992) for case 2 interval censored data.

By Proposition 10.1 with $\alpha = 1$ and $\varphi \equiv \varphi_1$ as before, it follows that
\[
h^2(p_{\hat F_n}, p_{F_0}) \le (\mathbb{P}_n - P_0)\big(\varphi(p_{\hat F_n}/p_{F_0})\big),
\]
where $\varphi$ is bounded and continuous on $[0, \infty)$. Now the collection of functions $\mathcal{G} \equiv \{p_F : F \in \mathcal{F}\}$ is easily seen to be a Glivenko-Cantelli class of functions. This can be seen by applying Theorem 5.7 to the collections $\mathcal{G}_k$, $k = 1, 2, \ldots$, obtained from $\mathcal{G}$ by restricting to the sets $K = k$. For fixed $k$, the collections $\mathcal{G}_k = \{p_F(\delta, t_k, k) : F \in \mathcal{F}\}$ are $P_0$-Glivenko-Cantelli classes since $\mathcal{F}$ is a uniform Glivenko-Cantelli class, and since the functions $p_F$ are continuous transformations of the classes of functions $x \mapsto \delta_{k,j}$ and $x \mapsto F(t_{k,j})$ for $j = 1, \ldots, k+1$; hence $\mathcal{G}$ is $P$-Glivenko-Cantelli by Theorem 5.6. Note that the single function $p_{F_0}$ is trivially $P_0$-Glivenko-Cantelli since it is uniformly bounded, and the single function $1/p_{F_0}$ is also $P_0$-Glivenko-Cantelli since $P_0(1/p_{F_0}) < \infty$. Thus by Proposition 5.2 with $g = 1/p_{F_0}$ and $\mathcal{F} = \mathcal{G} = \{p_F : F \in \mathcal{F}\}$, it follows that $\mathcal{G}' \equiv \{p_F/p_{F_0} : F \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. Finally another application of Theorem 5.6 shows that the collection
\[
\mathcal{H} \equiv \{\varphi(p_F/p_{F_0}) : F \in \mathcal{F}\}
\]
is also $P_0$-Glivenko-Cantelli. When combined with Corollary 10.1, this yields the following theorem.

Theorem 10.1. The NPMLE $\hat F_n$ satisfies
\[
h(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0.
\]

To relate this result to a recent theorem of Schick and Yu (2000), it remains only to understand the relationship between their $L_1(\mu)$ metric and the Hellinger metric $h$ between $p_F$ and $p_{F_0}$. Let $\mathcal{B}$ denote the collection of Borel sets in $\mathbb{R}$.
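On $\mathcal{B}$ we define measures $\mu$ and $\tilde\mu$ as follows: for $B \in \mathcal{B}$,
\[
\mu(B) = \sum_{k=1}^{\infty} P(K = k) \sum_{j=1}^{k} P(T_{k,j} \in B \mid K = k), \tag{3}
\]
and
\[
\tilde\mu(B) = \sum_{k=1}^{\infty} P(K = k)\, \frac{1}{k} \sum_{j=1}^{k} P(T_{k,j} \in B \mid K = k). \tag{4}
\]
Let $d$ be the $L_1(\mu)$ metric on the class $\mathcal{F}$; thus for $F_1, F_2 \in \mathcal{F}$,
\[
d(F_1, F_2) = \int |F_1(t) - F_2(t)|\, d\mu(t).
\]
The measure $\mu$ was introduced by Schick and Yu (2000); note that $\mu$ is a finite measure if $E(K) < \infty$. Note that $d(F_1, F_2)$ can also be written in terms of an expectation as
\[
d(F_1, F_2) = E_{(K, \underline{T})}\left[ \sum_{j=1}^{K+1} |F_1(T_{K,j}) - F_2(T_{K,j})| \right] \tag{5}
\]
(the term $j = K+1$ vanishes since $T_{K,K+1} \equiv \infty$). As Schick and Yu (2000) observed, consistency of the NPMLE $\hat F_n$ in $L_1(\mu)$ holds under virtually no further hypotheses.

Theorem 10.2 (Schick and Yu). Suppose that $E(K) < \infty$. Then $d(\hat F_n, F_0) \to_{a.s.} 0$.

Proof. We will show that Theorem 10.2 follows from Theorem 10.1 and the following lemma. □

Lemma 10.1.
\[
\frac{1}{2} \left\{ \int |\hat F_n - F_0|\, d\tilde\mu \right\}^2 \le h^2(p_{\hat F_n}, p_{F_0}).
\]

Proof. We know that
\[
h^2(p_{\hat F_n}, p_{F_0}) \le d_{TV}(p_{\hat F_n}, p_{F_0}) \le \sqrt{2}\, h(p_{\hat F_n}, p_{F_0}),
\]
where, with $y_{k,0} = -\infty$, $y_{k,k+1} = \infty$,
\[
h^2(p_{\hat F_n}, p_{F_0}) = \sum_{k=1}^{\infty} P(K = k) \int \sum_{j=1}^{k+1} \Big\{ [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})]^{1/2} - [F_0(y_{k,j}) - F_0(y_{k,j-1})]^{1/2} \Big\}^2\, dG_k(y)
\]
while
\[
d_{TV}(p_{\hat F_n}, p_{F_0}) = \sum_{k=1}^{\infty} P(K = k) \int \sum_{j=1}^{k+1} \Big| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \Big|\, dG_k(y).
\]
Note that
\[
\sum_{j=1}^{k+1} \Big| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \Big|
= \sum_{j=1}^{k+1} \big| (\hat F_n - F_0)(y_{k,j-1}, y_{k,j}] \big|
\ge \max_{1 \le j \le k+1} |\hat F_n(y_{k,j}) - F_0(y_{k,j})|,
\]
so integrating across this inequality with respect to $G_k(y)$ yields (writing $G_{k,j}$ for the $j$-th marginal of $G_k$)
\[
\begin{aligned}
\int \sum_{j=1}^{k+1} \Big| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \Big|\, dG_k(y)
&\ge \max_{1 \le j \le k} \int |\hat F_n(y_{k,j}) - F_0(y_{k,j})|\, dG_{k,j}(y_{k,j}) \\
&\ge \frac{1}{k} \sum_{j=1}^{k} \int |\hat F_n(y_{k,j}) - F_0(y_{k,j})|\, dG_{k,j}(y_{k,j}).
\end{aligned}
\]
By multiplying across by $P(K = k)$ and summing over $k$, this yields
\[
d_{TV}(p_{\hat F_n}, p_{F_0}) \ge \int |\hat F_n - F_0|\, d\tilde\mu,
\]
and hence
\[
\text{(a)} \qquad h^2(p_{\hat F_n}, p_{F_0}) \ge \frac{1}{2} \left\{ \int |\hat F_n - F_0|\, d\tilde\mu \right\}^2.
\]
□

The measure $\tilde\mu$ figuring in Lemma 10.1 is not the same as the measure $\mu$ of Schick and Yu (2000) because of the factor $1/k$.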
Note that this factor means that the measure $\tilde\mu$ is always a finite measure, even if $E(K) = \infty$. It is clear that $\tilde\mu(B) \le \mu(B)$ for every Borel set $B$, and that $\mu \ll \tilde\mu$. The following lemma (Lemma 2.2 of Schick and Yu (2000)) together with Lemma 10.1 shows that Theorem 10.1 implies the result of Schick and Yu once again.

Lemma 10.2. Suppose that $\mu$ and $\tilde\mu$ are two finite measures, and that $g, g_1, g_2, \ldots$ are measurable functions with range in $[0, 1]$. Suppose that $\mu$ is absolutely continuous with respect to $\tilde\mu$. Then $\int |g_n - g|\, d\tilde\mu \to 0$ implies that $\int |g_n - g|\, d\mu \to 0$.

Proof. Write
\[
\int |g_n - g|\, d\mu = \int |g_n - g|\, \frac{d\mu}{d\tilde\mu}\, d\tilde\mu
\]
and use the dominated convergence theorem applied to a.e. convergent subsequences. □

Example 10.3 (Exponential scale mixtures). Suppose that $\mathcal{P} = \{P_G : G \text{ a d.f. on } \mathbb{R}\}$ where the measures $P_G$ are scale mixtures of exponential distributions with mixing distribution $G$:
\[
p_G(x) = \int_0^\infty y e^{-yx}\, dG(y).
\]
We first show that the map $G \mapsto p_G(x)$ is continuous with respect to the topology of vague convergence for distributions $G$. This follows easily since the kernels for our mixing family are bounded, continuous, and satisfy $y e^{-xy} \to 0$ as $y \to \infty$ for every $x > 0$. Since vague convergence of distribution functions implies that integrals of bounded continuous functions vanishing at infinity converge, it follows that $p(x, G)$ is continuous with respect to the vague topology for every $x > 0$. This implies, moreover, that the family $\mathcal{F} = \{p_G/(p_G + p_0) : G \text{ is a d.f. on } \mathbb{R}\}$ is pointwise, for a.e. $x$, continuous in $G$ with respect to the vague topology. Since the family of sub-distribution functions $G$ on $\mathbb{R}$ is compact for (a metric for) the vague topology (see e.g. Bauer (1972), page 241), and the family of functions $\mathcal{F}$ is uniformly bounded by 1, we conclude from Lemma 5.1 that $N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\varepsilon > 0$. Since $\varphi(p_G/p_0) = 2 p_G/(p_G + p_0) - 1$ for $\varphi(t) = (t-1)/(t+1)$, the hypothesis of Corollary 10.1 is satisfied. Thus it follows from Corollary 10.1 that the MLE $\hat G_n$ of $G_0$ satisfies
\[
h(p_{\hat G_n}, p_{G_0}) \to_{a.s.} 0.
\]
By uniqueness of Laplace transforms, this implies that $\hat G_n$ converges weakly to $G_0$ with probability 1. This method of proof is due to Pfanzagl (1988); in this case we recover a result of Jewell (1982). (See also Van de Geer (2000), Example 4.2.4, page 54.)

Example 10.4 ($k$-monotone densities). Suppose that $\mathcal{P}_k = \{P_G : G \text{ a d.f. on } \mathbb{R}\}$ where the measures $P_G$ are scale mixtures of Beta$(1, k)$ distributions with mixing distribution $G$:
\[
p_G(x) = \int_0^\infty y \left(1 - \frac{yx}{k}\right)_+^{k-1} dG(y) = \int_0^{k/x} y \left(1 - \frac{yx}{k}\right)^{k-1} dG(y), \qquad x > 0.
\]
With $k = 1$, the class $\mathcal{P}_1$ coincides with the class of monotone decreasing densities on $\mathbb{R}^+$ studied by Prakasa Rao (1969); the class $\mathcal{P}_2$ corresponds to the class of convex decreasing densities studied by Groeneboom, Jongbloed, and Wellner (2001). Of course the case $k = \infty$ is just Example 10.3. To prove consistency of the MLE, we again show that the map $G \mapsto p_G(x)$ is continuous with respect to the topology of vague convergence for distributions $G$. This follows easily since the kernels for this mixing family are bounded, continuous, and satisfy $y(1 - yx/k)_+^{k-1} \to 0$ as $y \to 0$ or $y \to \infty$ for every $x > 0$. Since vague convergence of distribution functions implies that integrals of bounded continuous functions vanishing at infinity converge, it follows that $p(x, G)$ is continuous with respect to the vague topology for every $x > 0$. By the same argument as in Example 10.3 it follows that $N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\varepsilon > 0$, and hence from Corollary 10.1 that the MLE $\hat G_n$ of $G_0$ satisfies
\[
h(p_{\hat G_n}, p_{G_0}) \to_{a.s.} 0.
\]
This implies that $\tau(\hat G_n, G_0) \to_{a.s.} 0$ for any metric $\tau$ for the vague topology (see Exercise 10.4), and hence that $d_{BL}(\hat G_n, G_0) \to_{a.s.} 0$ (since $G_0$ is a proper distribution function). This gives another proof of a result of Balabdaoui (2003).

Example 10.5 (Current status competing risks data). Suppose that $(X_1, X_2, \ldots, X_J, T)$ is a $(J+1)$-vector of non-negative, real-valued random variables.
We assume that $T$ is independent of $(X_1, \ldots, X_J)$, and that $T \sim G$. Let $X_{(1)}$ be the minimum of $X_1, X_2, \ldots, X_J$, let $F_j$ be the cumulative incidence function for $X_j$,
\[
F_j(t) = P(X_j \le t,\ X_j = X_{(1)}),
\]
and define
\[
S(t) = 1 - \sum_{j=1}^{J} F_j(t) \equiv 1 - F(t).
\]
Let $\Delta_j^* = 1\{X_j = X_{(1)}\}$ and $\Delta_j = 1\{X_{(1)} \le T\} \Delta_j^*$ for $j = 1, \ldots, J$. Suppose we observe $(\Delta_1, \ldots, \Delta_J, T)$. Finally, set $\Delta_\cdot = \sum_{j=1}^{J} \Delta_j = 1\{X_{(1)} \le T\}$. Then, conditionally on $T = t$, the distribution of $(\Delta_1, \ldots, \Delta_J, 1 - \Delta_\cdot)$ is multinomial:
\[
(\Delta_1, \ldots, \Delta_J, 1 - \Delta_\cdot) \sim \text{Multinomial}_{J+1}\big(1, (F_1(t), \ldots, F_J(t), S(t))\big).
\]
Note that the $F_j$'s are monotone nondecreasing, while $S$ is monotone nonincreasing. Thus the joint density $p_F$ for one observation is given by
\[
p_F(\delta_1, \ldots, \delta_J, \delta_{J+1}, t) = \prod_{j=1}^{J+1} F_j(t)^{\delta_j}
\]
with respect to $\# \times G$, where $\#$ denotes counting measure on $\{0, 1\}^{J+1}$, $\delta_{J+1} = 1 - \sum_{j=1}^{J} \delta_j$ and $F_{J+1} = S$, and $F = (F_1, \ldots, F_J) \in \mathcal{F}_J$, the class of $J$-tuples of nondecreasing functions summing pointwise to no more than 1.

Suppose we observe $(\Delta_{1i}, \ldots, \Delta_{Ji}, T_i)$, $i = 1, \ldots, n$, i.i.d. as $(\Delta_1, \ldots, \Delta_J, T)$. Our goal is to estimate $F_1, \ldots, F_J$. These models are of current interest in the biostatistics literature; see e.g. Jewell and Kalbfleisch (2001) or Jewell, van der Laan, and Henneman (2001).

This is a convex model, so Proposition 10.1 and Corollary 10.1 apply. To show that the class of functions
\[
\{\varphi(p_F/p_{F_0}) : F = (F_1, \ldots, F_J) \in \mathcal{F}_J\}
\]
is $P_0$-Glivenko-Cantelli, we first use Theorem 5.7 applied to $\{p_F : F \in \mathcal{F}_J\}$ and the partition $\{\mathcal{X}_j\}_{j=1}^{J+1}$, where $\mathcal{X}_j = \{(0, \ldots, 1, 0, \ldots, 0, t) : t \in \mathbb{R}\}$ with the 1 in the $j$-th position for $j = 1, \ldots, J$, and $\mathcal{X}_{J+1} = \{(0, \ldots, 0, t) : t \in \mathbb{R}\}$. Then the functions $p_F|_{\mathcal{X}_j}$ are bounded and monotone nondecreasing for $j = 1, \ldots, J$, and bounded and monotone nonincreasing for $j = J+1$, and hence are (universal) Glivenko-Cantelli. The conclusion from Theorem 5.7 is that $\mathcal{P} = \{p_F : F \in \mathcal{F}_J\}$ is Glivenko-Cantelli.
The next step is just as in both Examples 10.1 and 10.2: since $1/p_{F_0}$ is $P_0 = P_{F_0}$ integrable, the collection $\mathcal{P}$ is uniformly bounded, and $\varphi(u, v) = uv$ is continuous, it follows from Proposition 5.2 that $\mathcal{P}/p_{F_0} = \{p_F/p_{F_0} : F \in \mathcal{F}_J\}$ is $P_0$-Glivenko-Cantelli. Finally, it follows from Theorem 5.6 that $\{\varphi(p_F/p_{F_0}) : F \in \mathcal{F}_J\}$ with $\varphi(t) = (t-1)/(t+1)$ is $P_0$-Glivenko-Cantelli. We conclude that
\[
h(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0 \qquad \text{as } n \to \infty.
\]
By the familiar inequality relating Hellinger and total variation distance, we conclude that
\[
d_{TV}(p_{\hat F_n}, p_{F_0}) = \sum_{j=1}^{J+1} \int |\hat F_{nj}(t) - F_{0j}(t)|\, dG(t) \to_{a.s.} 0.
\]

Example 10.6 (Cox model with interval censored data). Suppose that conditional on a covariate vector $Z$, $Y$ has conditional survival function
\[
1 - F(y \mid Z) = (1 - F(y))^{\exp(\beta^T Z)},
\]
where $\beta \in \mathbb{R}^d$, $Z \in \mathbb{R}^d$, and $F$ is a distribution function on $\mathbb{R}^+$. For simplicity of notation we will write this in terms of survival functions as $S(y \mid z) = S(y)^{\exp(\beta^T z)}$. Suppose that conditionally on $Z$ the pair of random variables $(U, V)$ has conditional distribution $G(\cdot \mid Z)$ with $P(U < V \mid Z) = 1$, and that conditionally on $Z$ the pair $(U, V)$ is independent of $Y$. Finally, suppose that $Z$ has distribution $H$ on $\mathbb{R}^d$. Suppose that we observe only i.i.d. copies of $X = (\Delta_1, \Delta_2, \Delta_3, U, V, Z)$ where
\[
\Delta = (\Delta_1, \Delta_2, \Delta_3) = \big(1_{[0, U]}(Y),\ 1_{(U, V]}(Y),\ 1_{(V, \infty)}(Y)\big).
\]
Based on $X_1, \ldots, X_n$ i.i.d. as $X$, our goal is to estimate $\beta$ and $F$. The parameter space is $\Theta = \mathbb{R}^d \times \{\text{all d.f.'s on } \mathbb{R}^+\}$. The conditional distribution of $\Delta$ given $U, V, Z$ is just multinomial with one trial, three cells, and cell probabilities
\[
\big(1 - S(U \mid Z),\ S(U \mid Z) - S(V \mid Z),\ S(V \mid Z)\big).
\]
Thus
\[
p_{\beta, F}(\delta, u, v, z) = (1 - S(u \mid z))^{\delta_1} (S(u \mid z) - S(v \mid z))^{\delta_2} S(v \mid z)^{\delta_3}
\]
with respect to the dominating measure given by the product of counting measure on $\{0, 1\}^3$ with $G \times H$. As in the previous examples, we first use Theorem 5.7 applied to $\{p_{\beta, F} : F$ a d.f.
on $\mathbb{R}^+$, $\beta \in \mathbb{R}^d\}$, and the partition $\{\mathcal{X}_j\}_{j=1}^{3}$ where $\mathcal{X}_j$ corresponds to $\delta_j = 1$ for $j = 1, 2, 3$. On $\mathcal{X}_1$ the class of functions we need to consider is $\{1 - S(t)^{\exp(\beta^T z)} : F \text{ a d.f. on } \mathbb{R}^+,\ \beta \in \mathbb{R}^d\}$. Up to the leading constant 1, this is of the form $\phi(\mathcal{G}_1, \mathcal{G}_2)$ where $\mathcal{G}_1 = \{S = 1 - F : F \text{ a d.f. on } \mathbb{R}^+\}$, $\mathcal{G}_2 = \{\exp(\beta^T z) : \beta \in \mathbb{R}^d\}$, and $\phi(r, s) = r^s$. Now $\mathcal{G}_1$ is a universal Glivenko-Cantelli class (since it is a class of uniformly bounded decreasing functions), and $\mathcal{G}_2$ is a Glivenko-Cantelli class if we assume that $\beta \in K \subset \mathbb{R}^d$ for some compact set $K$. Then $|\beta^T Z| \le M|Z|$ with $M = \sup_{\beta \in K} |\beta|$ gives an envelope for $\beta^T Z$, and hence $G_2(x) = \exp(M|z|)$ is an integrable envelope for $\{\exp(\beta^T z) : \beta \in K\}$ if $E \exp(M|Z|) < \infty$. Thus $\mathcal{G}_2$ is $P$-Glivenko-Cantelli under these two assumptions. Furthermore, all the functions $\phi(g_1, g_2) = g_1^{g_2}$ with $g_i \in \mathcal{G}_i$ for $i = 1, 2$ are uniformly bounded by 1. We conclude from Theorem 5.6 that the class $\{p_{\beta, F}(1, 0, 0, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+,\ \beta \in K\}$ is a $P$-Glivenko-Cantelli class of functions under these same two assumptions. Similarly, under these same assumptions the class $\{p_{\beta, F}(0, 0, 1, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+,\ \beta \in K\}$ is a $P$-Glivenko-Cantelli class of functions, and so is $\{p_{\beta, F}(0, 1, 0, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+,\ \beta \in K\}$ since it is the difference of two $P$-Glivenko-Cantelli classes. Much as in Examples 10.1 and 10.2 it follows that $\{\varphi(p_{\beta, F}/p_{\beta_0, F_0}) : F \text{ a d.f. on } \mathbb{R}^+,\ \beta \in K\}$ is $P$-Glivenko-Cantelli, where $\varphi(t) = \sqrt{t}$. Thus it follows from Proposition 10.2 that the MLE $\hat\theta_n = (\hat\beta_n, \hat F_n)$ satisfies
\[
h(p_{\hat\beta_n, \hat F_n}, p_{\beta_0, F_0}) \to_{a.s.} 0.
\]
Since convergence in the Hellinger metric implies convergence in the total variation metric, the convergence in the last display implies that the total variation distance also converges to zero, where
\[
\begin{aligned}
d_{TV}(p_{\hat\beta_n, \hat F_n}, p_{\beta_0, F_0})
&= \int\!\!\int \Big| \hat S_n(u)^{\exp(\hat\beta_n^T z)} - S_0(u)^{\exp(\beta_0^T z)} \Big|\, dG(u, v \mid z)\, dH(z) \\
&\quad + \int\!\!\int \Big| \hat S_n(u)^{\exp(\hat\beta_n^T z)} - \hat S_n(v)^{\exp(\hat\beta_n^T z)} - S_0(u)^{\exp(\beta_0^T z)} + S_0(v)^{\exp(\beta_0^T z)} \Big|\, dG(u, v \mid z)\, dH(z) \\
&\quad + \int\!\!\int \Big| \hat S_n(v)^{\exp(\hat\beta_n^T z)} - S_0(v)^{\exp(\beta_0^T z)} \Big|\, dG(u, v \mid z)\, dH(z) \\
&\ge \int \Big| \hat S_n(t)^{\exp(\hat\beta_n^T z)} - S_0(t)^{\exp(\beta_0^T z)} \Big|\, d\mu(t, z).
\end{aligned} \tag{1}
\]
In the last inequality of this display we have dropped the middle term and combined the two end terms by defining the measure $\mu$ on $\mathbb{R} \times \mathbb{R}^d$ by
\[
\mu(A \times C) = \int_C G(A \times \mathcal{V} \mid z)\, dH(z) + \int_C G(\mathcal{U} \times A \mid z)\, dH(z) = P(U \in A, Z \in C) + P(V \in A, Z \in C)
\]
for $A \in \mathcal{B}$, $C \in \mathcal{B}^d$ (here $\mathcal{U}$ and $\mathcal{V}$ denote the ranges of $U$ and $V$).

We will examine the special case in which $d = 1$ and $Z$ takes on the two values 0 and 1 with probabilities $1 - p$ and $p$ respectively, with $p \in (0, 1)$. We will assume, moreover, that $F$ is continuous. In this special case the right side of (1) can be rewritten as
\[
\int \Big| \hat S_n(t)^{\exp(\hat\beta_n z)} - S_0(t)^{\exp(\beta_0 z)} \Big|\, d\mu(t, z)
= \int |\hat S_n(t) - S_0(t)|\, d\mu(t, 0) + \int \Big| \hat S_n(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\beta_0)} \Big|\, d\mu(t, 1). \tag{2}
\]
Since the left side of (1) converges to zero almost surely, we conclude that $\hat S_n(t) \to_{a.s.} S_0(t)$ for $\mu(\cdot, 0)$-a.e. $t$. If $\mu(\cdot, 1) \ll \mu(\cdot, 0)$, then it follows immediately by dominated convergence that
\[
\int \Big| \hat S_n(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\hat\beta_n)} \Big|\, d\mu(t, 1) \to_{a.s.} 0,
\]
and hence also, from (2) and the triangle inequality, that
\[
\int \Big| S_0(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\beta_0)} \Big|\, d\mu(t, 1) \to_{a.s.} 0.
\]
If $\mu((\mathrm{supp}(S_0))^\circ, 1) > 0$ (where $(\mathrm{supp}(S_0))^\circ$ denotes the interior of the support of the measure corresponding to $S_0$), this implies that $\hat\beta_n \to_{a.s.} \beta_0$.

10.1 Exercises

Exercise 10.1. Show that for any two probability measures $P$ and $Q$,
\[
h^2(P, Q) \le d_{TV}(P, Q) \le \sqrt{2}\, h(P, Q) \big(1 - (1/2) h^2(P, Q)\big)^{1/2} \le \sqrt{2}\, h(P, Q),
\]
where $d_{TV}(P, Q) = (1/2) \int |p - q|\, d\mu = \sup_A |P(A) - Q(A)|$ for any measure $\mu$ dominating both $P$ and $Q$.

Solution.
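Note that
\[
\begin{aligned}
d_{TV}(P, Q) &= \frac{1}{2} \int |p - q|\, d\mu \\
&= \frac{1}{2} \int \big| (\sqrt{p} - \sqrt{q})(\sqrt{p} + \sqrt{q}) \big|\, d\mu \\
&\le \frac{1}{2} \left\{ \int (\sqrt{p} + \sqrt{q})^2\, d\mu \right\}^{1/2} \left\{ 2 \cdot \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2\, d\mu \right\}^{1/2} \\
&= \frac{1}{2} \sqrt{2}\, h(P, Q) \left\{ 2 + 2 \int \sqrt{pq}\, d\mu \right\}^{1/2} \\
&= \frac{1}{2} \sqrt{2}\, h(P, Q) \big\{ 2 + 2(1 - h^2(P, Q)) \big\}^{1/2} \\
&= \sqrt{2}\, h(P, Q) \left\{ 1 - \frac{1}{2} h^2(P, Q) \right\}^{1/2} \\
&\le \sqrt{2}\, h(P, Q).
\end{aligned}
\]
For the lower bound, note that $|\sqrt{p} - \sqrt{q}| \le \sqrt{p} + \sqrt{q}$, so that
\[
h^2(P, Q) = \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2\, d\mu \le \frac{1}{2} \int |\sqrt{p} - \sqrt{q}| (\sqrt{p} + \sqrt{q})\, d\mu = \frac{1}{2} \int |p - q|\, d\mu = d_{TV}(P, Q).
\]
This finishes the proof of the inequalities. Now we want to show that
\[
\frac{1}{2} \int |p - q|\, d\mu = \sup_A |P(A) - Q(A)|.
\]
We have
\[
P(A) - Q(A) = \int_A (p - q)\, d\mu \le \int_{A \cap \{p > q\}} (p - q)\, d\mu \le \int_{\{p > q\}} (p - q)\, d\mu
\]
and
\[
|P(A) - Q(A)| \le \int_A |p - q|\, d\mu = \int_{A \cap \{p > q\}} (p - q)\, d\mu + \int_{A \cap \{q > p\}} (q - p)\, d\mu \le \int_{\{p > q\}} (p - q)\, d\mu + \int_{\{q > p\}} (q - p)\, d\mu = \int |p - q|\, d\mu.
\]
But $\int (p - q)\, d\mu = 0$, and this implies
\[
\int_{\{p > q\}} (p - q)\, d\mu + \int_{\{p < q\}} (p - q)\, d\mu = 0,
\qquad \text{i.e.} \qquad
\int_{\{p > q\}} (p - q)\, d\mu = \int_{\{q > p\}} (q - p)\, d\mu = \frac{1}{2} \int |p - q|\, d\mu.
\]
Therefore
\[
P(A) - Q(A) \le \int_{\{p > q\}} (p - q)\, d\mu = \frac{1}{2} \int |p - q|\, d\mu
\qquad \text{and similarly} \qquad
Q(A) - P(A) \le \int_{\{q > p\}} (q - p)\, d\mu = \frac{1}{2} \int |p - q|\, d\mu.
\]
This implies
\[
|P(A) - Q(A)| \le \frac{1}{2} \int |p - q|\, d\mu = d_{TV}(P, Q) \quad \text{for all } A,
\qquad \text{and} \qquad
\sup_A |P(A) - Q(A)| \le d_{TV}(P, Q),
\]
with equality when $A = \{p > q\}$ or $A = \{q > p\}$. □

Exercise 10.2. Show that for any two probability measures $P$ and $Q$, the Kullback-Leibler "distance" $K(P, Q) = P(\log(p/q))$ satisfies
\[
K(P, Q) \ge 2 h^2(P, Q) \ge 0.
\]

Solution. We must show that
\[
K(P, Q) = P \log\frac{p}{q} \ge 2 h^2(P, Q) \ge 0,
\]
i.e.
\[
\int \log\left(\frac{p(x)}{q(x)}\right) p(x)\, d\mu(x) \ge \int \big(p^{1/2}(x) - q^{1/2}(x)\big)^2\, d\mu(x) = \int \big(p(x) + q(x) - 2\sqrt{p(x) q(x)}\big)\, d\mu(x) = 2 - 2 \int p^{1/2}(x) q^{1/2}(x)\, d\mu(x).
\]
This is equivalent to
\[
\frac{1}{2} \int \log\left(\frac{p(x)}{q(x)}\right) p(x)\, d\mu(x) \ge 1 - \int p^{1/2}(x) q^{1/2}(x)\, d\mu(x) = 1 - \int \frac{q^{1/2}(x)}{p^{1/2}(x)}\, p(x)\, d\mu(x),
\]
which in turn is equivalent to
\[
-\frac{1}{2} \int \log\left(\frac{q(x)}{p(x)}\right) p(x)\, d\mu(x) \ge 1 - \int \frac{q^{1/2}(x)}{p^{1/2}(x)}\, p(x)\, d\mu(x),
\]
or
\[
\frac{1}{2} \int \log\left(\frac{q(x)}{p(x)}\right) p(x)\, d\mu(x) \le \int \left( \frac{q^{1/2}(x)}{p^{1/2}(x)} - 1 \right) p(x)\, d\mu(x).
\]
But this holds since
\[
\frac{1}{2} \log v \le v^{1/2} - 1 \qquad \text{for all } v > 0
\]
(where we choose $v = q(x)/p(x)$). Indeed, for all $v > 0$,
\[
\frac{1}{2} \log v \le v^{1/2} - 1
\iff v^{1/2} - 1 - \frac{1}{2} \log v \ge 0
\iff e^{\frac{1}{2} \log v} - 1 - \frac{1}{2} \log v \ge 0,
\]
and the last inequality holds because $e^x \ge 1 + x$ for every real $x$. □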
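The inequalities of Exercises 10.1 and 10.2 are easy to check numerically for distributions on a finite set, where the dominating measure is counting measure. The sketch below is only a sanity check under that assumption; the helper names (`hellinger_sq`, `total_variation`, `kullback_leibler`, `random_pmf`, `check_inequalities`) are hypothetical and use the normalizations $h^2 = \tfrac{1}{2}\sum(\sqrt{p}-\sqrt{q})^2$ and $d_{TV} = \tfrac{1}{2}\sum|p-q|$ of the exercises.

```python
import math
import random


def hellinger_sq(p, q):
    """h^2(P, Q) = (1/2) * sum (sqrt(p) - sqrt(q))^2 for probability vectors p, q."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def total_variation(p, q):
    """d_TV(P, Q) = (1/2) * sum |p - q|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kullback_leibler(p, q):
    """K(P, Q) = sum p * log(p / q); q is assumed positive wherever p is."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_pmf(k, rng):
    """A random probability vector with k strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]


def check_inequalities(p, q, tol=1e-12):
    """True if h^2 <= d_TV <= sqrt(2) h sqrt(1 - h^2/2) and K >= 2 h^2 hold (up to tol)."""
    h2 = hellinger_sq(p, q)
    h = math.sqrt(h2)
    tv = total_variation(p, q)
    kl = kullback_leibler(p, q)
    return (h2 <= tv + tol
            and tv <= math.sqrt(2) * h * math.sqrt(1 - h2 / 2) + tol
            and kl >= 2 * h2 - tol)


if __name__ == "__main__":
    rng = random.Random(0)
    ok = all(check_inequalities(random_pmf(5, rng), random_pmf(5, rng))
             for _ in range(1000))
    print("all inequalities hold on 1000 random pairs:", ok)
```

A check of this kind cannot replace the proofs above, but it is a quick way to catch a misplaced constant such as the factor $1/2$ in the definition of $h^2$.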
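Exercise 10.3. Show that for any nonnegative numbers $p$ and $q$,
\[
|(2p)^{1/2} - (p + q)^{1/2}| \le |p^{1/2} - q^{1/2}| \le (1 + \sqrt{2})\, |(2p)^{1/2} - (p + q)^{1/2}|.
\]
This implies that for measures $P$ and $Q$ the Hellinger distances $h(P, Q)$ and $h(P, (P + Q)/2)$ satisfy
\[
2 h^2(P, (P + Q)/2) \le h^2(P, Q) \le 2 (1 + \sqrt{2})^2 h^2(P, (P + Q)/2) \le 12\, h^2(P, (P + Q)/2).
\]
(Hint: To prove the first inequalities, prove them first for $p = 0$. In the second case of $p \ne 0$, divide through by $p$ and rewrite the inequalities in terms of $r = q/p$, then (for the inequality on the right) consider the cases $r \ge 1$ and $0 < r \le 1$.)

Solution. We first show how the pointwise inequalities imply the inequalities between Hellinger distances. We have
\[
\begin{aligned}
2 h^2\left(P, \frac{P + Q}{2}\right) &= \int \left( p^{1/2}(x) - \left(\frac{p(x) + q(x)}{2}\right)^{1/2} \right)^2 d\mu(x) \\
&= \frac{1}{2} \int \big( (2p(x))^{1/2} - (p(x) + q(x))^{1/2} \big)^2\, d\mu(x) \\
&\le \frac{1}{2} \int \big( p^{1/2}(x) - q^{1/2}(x) \big)^2\, d\mu(x) \\
&= h^2(P, Q).
\end{aligned}
\]
Similarly, the other inequalities involving Hellinger distances can be established.

Now we want to show the pointwise inequalities. We begin with
\[
|p^{1/2} - q^{1/2}| \le (1 + \sqrt{2})\, |(2p)^{1/2} - (p + q)^{1/2}|.
\]
We consider the case $p \ge q$. We have to show that
\[
p^{1/2} - q^{1/2} \le (1 + \sqrt{2}) \big( (2p)^{1/2} - (p + q)^{1/2} \big),
\]
i.e.
\[
p^{1/2} \left[ 1 - \left(\frac{q}{p}\right)^{1/2} \right] \le (\sqrt{2} + 1)\, p^{1/2} \left[ 2^{1/2} - \left(1 + \frac{q}{p}\right)^{1/2} \right].
\]
It is sufficient to show that
\[
1 - \left(\frac{q}{p}\right)^{1/2} \le (\sqrt{2} + 1) \left[ 2^{1/2} - \left(1 + \frac{q}{p}\right)^{1/2} \right],
\qquad \text{i.e.} \qquad
\frac{1}{\sqrt{2} + 1} \le \frac{2^{1/2} - (1 + q/p)^{1/2}}{1 - (q/p)^{1/2}}, \qquad 0 \le \frac{q}{p} < 1.
\]
So we ask whether
\[
\frac{1}{\sqrt{2} + 1} \le \frac{\sqrt{2} - (1 + x)^{1/2}}{1 - x^{1/2}} \qquad \text{for all } 0 \le x < 1,
\]
that is (since $1/(\sqrt{2} + 1) = \sqrt{2} - 1$),
\[
\frac{\sqrt{2} - (1 + x)^{1/2}}{1 - x^{1/2}} \ge \sqrt{2} - 1,
\qquad \text{i.e.} \qquad
\sqrt{2} - \sqrt{1 + x} \ge \sqrt{2} - \sqrt{2}\sqrt{x} - 1 + \sqrt{x},
\]
and simplifying,
\[
\sqrt{2}\sqrt{x} + 1 - (\sqrt{x} + \sqrt{1 + x}) \ge 0.
\]
Let
\[
\xi(x) = \sqrt{2}\sqrt{x} + 1 - (\sqrt{x} + \sqrt{1 + x}).
\]
Then $\xi(0) = 0 = \xi(1)$ and
\[
\begin{aligned}
\xi'(x) &= \frac{\sqrt{2}}{2\sqrt{x}} - \frac{1}{2\sqrt{x}} - \frac{1}{2\sqrt{1 + x}} \\
&= \frac{1}{\sqrt{2}\sqrt{x}} - \frac{1}{2\sqrt{x}} - \frac{1}{2\sqrt{1 + x}} \\
&= \frac{1}{\sqrt{x}} \left[ \left( \frac{1}{\sqrt{2}} - \frac{1}{2} \right) - \frac{1}{2} \sqrt{\frac{x}{1 + x}} \right].
\end{aligned}
\]
Note that $\xi'(x)$ is greater than, equal to, or less than zero according as $\frac{1}{\sqrt{2}} - \frac{1}{2}$ is greater than, equal to, or less than $\frac{1}{2}\sqrt{x/(1 + x)}$, and this means that $x$ is less than, equal to, or greater than some $x^*$, because $\sqrt{x/(1 + x)}$ is increasing in $x$. Thus $\xi$ increases on $[0, x^*]$ and decreases afterwards; together with $\xi(0) = \xi(1) = 0$ this shows that $\xi(x) \ge 0$ for all $x \in [0, 1]$, with a maximum at $x^*$.

Now we show that
\[
|(2p)^{1/2} - (p + q)^{1/2}| \le |p^{1/2} - q^{1/2}|.
\]
Once again consider the case $p > q$. We have to show that
\[
(2p)^{1/2} - (p + q)^{1/2} \le p^{1/2} - q^{1/2},
\qquad \text{i.e.} \qquad
(2p)^{1/2} + q^{1/2} \le p^{1/2} + (p + q)^{1/2}.
\]
Equivalently, we have to show that
\[
\left[ \sqrt{2} - \left(1 + \frac{q}{p}\right)^{1/2} \right] p^{1/2} \le \left[ 1 - \left(\frac{q}{p}\right)^{1/2} \right] p^{1/2},
\]
that is
\[
\sqrt{2} - \left(1 + \frac{q}{p}\right)^{1/2} \le 1 - \left(\frac{q}{p}\right)^{1/2},
\qquad \text{i.e.} \qquad
\sqrt{2} - 1 \le \left(1 + \frac{q}{p}\right)^{1/2} - \left(\frac{q}{p}\right)^{1/2}, \qquad 0 \le \frac{q}{p} \le 1.
\]
It is sufficient to show that
\[
\sqrt{2} - 1 \le (1 + x)^{1/2} - x^{1/2} \qquad \text{for } 0 \le x \le 1.
\]
Let $\Psi(x) = (1 + x)^{1/2} - x^{1/2}$. Then $\Psi(0) = 1$, $\Psi(1) = \sqrt{2} - 1$ and
\[
\Psi'(x) = \frac{1}{2\sqrt{1 + x}} - \frac{1}{2\sqrt{x}} \le 0,
\]
therefore
\[
\Psi(1) \le \Psi(x) \qquad \text{for all } x \in [0, 1].
\]
This completes the proof. □

Exercise 10.4. We will say that $\theta_0$ is identifiable for the metric $\tau$ on $\bar\Theta \supset \Theta$ if for all $\theta \in \bar\Theta$, $h(p_\theta, p_{\theta_0}) = 0$ implies that $\tau(\theta, \theta_0) = 0$. Prove the following claim: Suppose that $\Theta \subset \bar\Theta$ where $(\bar\Theta, \tau)$ is a compact metric space. Suppose that $\theta \mapsto p_\theta$ is $\mu$-almost everywhere continuous and that $\theta_0$ is identifiable for $\tau$. Then $h(p_{\theta_n}, p_{\theta_0}) \to 0$ implies that $\tau(\theta_n, \theta_0) \to 0$. (Hint: See Van de Geer (1993), page 37.)

Solution. By assumption, $\bar\Theta$ is compact for the metric $\tau$, $\Theta \subseteq \bar\Theta$, $\theta \mapsto p_\theta$ is $\mu$-almost everywhere continuous, and $\theta_0$ is identifiable for $\tau$. We must show that
\[
h(p_{\theta_n}, p_{\theta_0}) \to 0 \implies \tau(\theta_n, \theta_0) \to 0.
\]
Consider the sequence $\{\theta_n\}$. Given any subsequence $\{\theta_{n'}\}$ of $\{\theta_n\}$, we can find a further subsequence $\{\theta_{n''}\}$ that converges to some $\theta \in \bar\Theta$ (by compactness). Then $p(\theta_{n''}, x) \to p(\theta, x)$ for $\mu$-almost every $x$. By Scheffé's theorem,
\[
\frac{1}{2} \int |p(\theta_{n''}, x) - p(\theta, x)|\, d\mu(x) \to 0,
\qquad \text{i.e.} \qquad
d_{TV}\big(p(\theta_{n''}, \cdot), p(\theta, \cdot)\big) \to 0.
\]
This implies that
\[
h^2\big(p(\theta_{n''}, \cdot), p(\theta, \cdot)\big) \to 0.
\]
But, by hypothesis,
\[
h^2\big(p(\theta_{n''}, \cdot), p(\theta_0, \cdot)\big) \to 0.
\]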
Let $s_n(\cdot) = p(\theta_{n''}, \cdot)^{1/2}$, $s_0(\cdot) = p(\theta_0, \cdot)^{1/2}$ and $s(\cdot) = p(\theta, \cdot)^{1/2}$. Then
\[
h(p_{\theta_{n''}}, p_{\theta_0}) = \|s_n - s_0\|_{L_2(\mu)}, \qquad h(p_{\theta_{n''}}, p_\theta) = \|s_n - s\|_{L_2(\mu)}
\]
(up to the constant factor $1/\sqrt{2}$ coming from the definition of $h$). Now
\[
\|s_0 - s\|_{L_2(\mu)} \le \|s_n - s_0\|_{L_2(\mu)} + \|s_n - s\|_{L_2(\mu)} < \varepsilon
\]
eventually, for any pre-fixed $\varepsilon > 0$. Therefore
\[
\|s_0 - s\|_{L_2(\mu)} = 0, \qquad \text{i.e.} \qquad h(p_\theta, p_{\theta_0}) = 0.
\]
By identifiability this implies that $\tau(\theta, \theta_0) = 0$, and then $\tau(\theta_{n''}, \theta_0) \to 0$ since $\theta_{n''} \to \theta$. Thus every subsequence of $\{\theta_n\}$ has a further subsequence along which $\tau(\theta_{n''}, \theta_0) \to 0$, and this shows that $\tau(\theta_n, \theta_0) \to 0$, completing the proof. □

Chapter 11

M-Estimators: the Argmax Continuous Mapping Theorem

We begin this chapter by recalling the definition of M-estimators. Suppose we are interested in a parameter $\theta$ attached to the distribution of i.i.d. observations $X_1, \ldots, X_n$. A popular method for finding an estimator $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ is to maximize a criterion function of the type
\[
\theta \mapsto \mathbb{P}_n m_\theta = \frac{1}{n} \sum_{i=1}^{n} m_\theta(X_i).
\]
An estimator maximizing $\mathbb{P}_n m_\theta$ over $\Theta$ is called an M-estimator.

Suppose now that $M_n$ and $M$ are stochastic processes indexed by a metric space $H$; typically $M_n(h) = \mathbb{P}_n m_h$ for a collection of real-valued functions $m_h$ defined on the sample space, and $M$ is either a deterministic function (such as $M(h) = P(m_h)$) or a (limiting) stochastic process. Let the "estimators" $\hat h_n$ and $\hat h$ be points of (near) maximum of the "criterion functions" $M_n(h)$ and $M(h)$ respectively. We suppose that
\[
\hat h_n = \arg\max_h M_n(h) \qquad \text{and} \qquad \hat h = \arg\max_h M(h)
\]
are well defined. In the most basic version of this set-up we frequently begin by thinking of $M_n(\theta) = \mathbb{P}_n \log p_\theta$ and $M(\theta) = P_0 \log p_\theta$ for $\theta \in \Theta$; i.e. $m_\theta = \log p_\theta$.

In this chapter we want to study the conditions under which convergence in distribution of the criterion functions implies convergence in distribution of their points of maximum, the sequence of M-estimators $\hat h_n$, to the point of maximum $\hat h$ of the limit criterion function.
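As a small illustration of the definition of an M-estimator, the following sketch maximizes an empirical criterion $\mathbb{P}_n m_\theta$ over a finite grid of parameter values. The criterion $m_\theta(x) = -|x - \theta|$, whose maximizer is the sample median, as well as the helper names `m_estimate` and `make_grid`, the simulated Gaussian sample and the grid size, are assumptions chosen only for this example; a grid search is a crude stand-in for an exact argmax.

```python
import random
import statistics


def m_estimate(data, m, grid):
    """Return the grid point maximizing the empirical criterion P_n m_theta."""
    def criterion(theta):
        return sum(m(theta, x) for x in data) / len(data)
    return max(grid, key=criterion)


def make_grid(lo, hi, num):
    """Equally spaced grid of num points on [lo, hi], together with its step."""
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)], step


rng = random.Random(0)
sample = [rng.gauss(1.0, 2.0) for _ in range(101)]  # odd n: unique sample median
grid, step = make_grid(min(sample), max(sample), 2001)

# m_theta(x) = -|x - theta|: the maximizer of P_n m_theta is the sample median.
theta_hat = m_estimate(sample, lambda theta, x: -abs(x - theta), grid)
print("grid argmax  :", theta_hat)
print("sample median:", statistics.median(sample))
```

Replacing the criterion by $m_\theta(x) = -(x - \theta)^2$ gives the sample mean, and $m_\theta = \log p_\theta$ gives the maximum likelihood estimator, which is the case considered most often in this chapter.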
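To this end we begin with the following lemma.

Lemma 11.1. Suppose that $A, B \subset H$. Assume that $\hat h \in H$ satisfies
\[
M(\hat h) > \sup_{h \notin G,\, h \in A} M(h) = \sup_{h \in G^c \cap A} M(h)
\]
for all open $G$ with $\hat h \in G$. Suppose that $\hat h_n$ satisfies
\[
M_n(\hat h_n) \ge \sup_h M_n(h) - o_p(1).
\]
If $M_n \Rightarrow M$ in $\ell^\infty(A \cup B)$, then for every closed set $F$,
\[
\limsup_{n \to \infty} P^*(\hat h_n \in F \cap A) \le P(\hat h \in F \cup B^c).
\]
If $M_n \Rightarrow M$ in $\ell^\infty(H)$, then we can take $A = B = H$ to conclude, by the portmanteau theorem for weak convergence, that $\hat h_n \Rightarrow \hat h$.

The following theorem follows from Lemma 11.1.

Theorem 11.1 (argmax continuous mapping). Suppose that $M_n \Rightarrow M$ in $\ell^\infty(K)$ for every compact $K \subset H$. Suppose that $h \mapsto M(h)$ is upper semicontinuous and has a unique point of maximum $\hat h$. Suppose, moreover, that $M_n(\hat h_n) \ge \sup_h M_n(h) - o_p(1)$, and that $\hat h_n$ is tight (in $H$). Then
\[
\hat h_n \Rightarrow \hat h \qquad \text{in } H.
\]

Proof (of Lemma 11.1). Suppose that $F$ is closed. By the continuous mapping theorem it follows that
\[
\sup_{h \in F \cap A} M_n(h) - \sup_{h \in B} M_n(h) \Rightarrow \sup_{h \in F \cap A} M(h) - \sup_{h \in B} M(h).
\]
In what follows we write $\sup_C M_n$ for $\sup_{h \in C} M_n(h)$, and similarly for $M$. Now
\[
\{\hat h_n \in F \cap A\} = \Big( \{\hat h_n \in F \cap A\} \cap \big\{ \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \big\} \Big) \cup \Big( \{\hat h_n \in F \cap A\} \cap \big\{ \sup_{F \cap A} M_n < \sup_B M_n - o_p(1) \big\} \Big),
\]
where the second event implies
\[
M_n(\hat h_n) \le \sup_{F \cap A} M_n < \sup_B M_n - o_p(1) \le \sup_H M_n - o_p(1)
\]
and hence is empty in view of the hypothesis. Hence
\[
\{\hat h_n \in F \cap A\} \subset \big\{ \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \big\},
\]
and it follows that
\[
\begin{aligned}
\limsup_{n \to \infty} P(\hat h_n \in F \cap A) &\le \limsup_{n \to \infty} P\big( \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \big) \\
&\le P\big( \sup_{F \cap A} M \ge \sup_B M \big) \\
&\le P(\hat h \in F \cup B^c);
\end{aligned}
\]
to see the last inequality, note that
\[
\begin{aligned}
\{\hat h \in F \cup B^c\}^c &= \{\hat h \in F^c\} \cap \{\hat h \in B\} \\
&= \Big( \{\hat h \in F^c \cap B\} \cap \big\{ \sup_{F \cap A} M < \sup_B M \big\} \Big) \cup \Big( \{\hat h \in F^c \cap B\} \cap \big\{ \sup_{F \cap A} M \ge \sup_B M \big\} \Big) \\
&\subset \big\{ \sup_{F \cap A} M < \sup_B M \big\} \cup \big\{ M(\hat h) > \sup_{F \cap A} M \ge \sup_B M \ge M(\hat h) \big\} \\
&= \big\{ \sup_{F \cap A} M < \sup_B M \big\} \cup \emptyset.
\end{aligned}
\]
□

Proof (of Theorem 11.1). Take $A = B = K$ in Lemma 11.1. Then
\[
M(\hat h) > \sup_{h \in G^c \cap K} M(h).
\]
(If not, then there is a sequence $\{h_m\} \subset G^c \cap K$, which is compact, satisfying $M(h_m) \to M(\hat h)$. But we can choose a further subsequence (call it $h_m$ again) with $h_m \to h \in G^c \cap K$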
since $K$ is compact, and then
\[
M(\hat h) = \lim_m M(h_m) \le M(h)
\]
by upper semicontinuity of $M$, and this implies that there is another maximizer. But this contradicts our uniqueness hypothesis.) By Lemma 11.1 with $A = B = K$,
\[
\begin{aligned}
\limsup_{n \to \infty} P(\hat h_n \in F) &\le \limsup_{n \to \infty} P(\hat h_n \in F \cap K) + \limsup_{n \to \infty} P(\hat h_n \in K^c) \\
&\le P(\hat h \in F \cup K^c) + \limsup_{n \to \infty} P(\hat h_n \in K^c) \\
&\le P(\hat h \in F) + P(\hat h \in K^c) + \limsup_{n \to \infty} P(\hat h_n \in K^c),
\end{aligned}
\]
where the second and third terms can be made arbitrarily small by choice of $K$. Hence we conclude that
\[
\limsup_{n \to \infty} P(\hat h_n \in F) \le P(\hat h \in F),
\]
and we conclude from the portmanteau theorem that $\hat h_n \Rightarrow \hat h$ in $H$. □

We will use this theorem in two different ways:

A. First scenario (the results are applied to the original parameter): $H = \Theta$, $M_n(\theta) = \mathbb{P}_n m_\theta = \int m_\theta(x)\, d\mathbb{P}_n(x)$, and $M(\theta) = P_0 m_\theta = \int m_\theta(x)\, dP_0(x)$ is deterministic. Here $\hat h_n = \hat\theta_n$ and $\hat h = \theta_0$, and often $m_\theta(x) = \log p_\theta(x)$ for $x \in \mathcal{X}$, $\theta \in \Theta$.

B. Second scenario (the results are applied to a local parameter): $H = \dot\Theta(\theta_0)$, $\tilde M_n(h) = s_n\big(M_n(\theta_0 + r_n^{-1} h) - M_n(\theta_0)\big)$ for some sequences $r_n \to \infty$ and $s_n \to \infty$ (often $s_n = r_n^2$), and $\tilde M(h)$ is random. In this case $\hat h_n = r_n(\hat\theta_n - \theta_0)$, and $\hat h = \arg\max \tilde M(h)$ is also random. For $\dot\Theta(\theta_0)$ we can often take the collection $\{h : \theta_t - \theta_0 - th = o(t) \text{ for some } \{\theta_t\} \subset \Theta\}$.

In the set-up of our first scenario, where the limit criterion function is typically nonrandom, the theorem yields consistency results.

Corollary 11.1 (consistency). Suppose that $M_n$ are stochastic processes indexed by $\Theta$ and suppose that $M : \Theta \to \mathbb{R}$ is deterministic.

A. Suppose that:
(i) $\|M_n - M\|_\Theta \to_p 0$.
(ii) There exists $\theta_0 \in \Theta$ such that $M(\theta_0) > \sup_{\theta \notin G} M(\theta)$ for all open $G$ with $\theta_0 \in G$.
Then any sequence $\hat\theta_n$ with $M_n(\hat\theta_n) \ge \sup_{\theta \in \Theta} M_n(\theta) - o_p(1)$ satisfies $\hat\theta_n \to_p \theta_0$.

B. Suppose that $\|M_n - M\|_K \to_p 0$ for all compact $K \subset \Theta$, and that the map $\theta \mapsto M(\theta)$ is upper semicontinuous with a unique maximum at $\theta_0$.
Suppose that $\{\hat\theta_n\}$ is tight. Then $\hat\theta_n \to_p \theta_0$.

Proof. This follows immediately from Theorem 11.1. □

Suppose that an estimator $\hat\theta_n$ maximizes the criterion function $\theta \mapsto M_n(\theta)$. In obtaining a limit distribution of a sequence of M-estimators, Theorem 11.1 is usually not applied to the original criterion functions $\theta \mapsto M_n(\theta)$, but to a rescaled and "localized" criterion function of the form
\[
\tilde M_n(h) = s_n \left( M_n\left(\theta_0 + \frac{h}{r_n}\right) - M_n(\theta_0) \right), \tag{1}
\]
where $\theta_0$ is the "true" value of $\theta$, $r_n \to \infty$ is the "rate of convergence" of the estimator, and $s_n \to \infty$. If this new sequence of processes converges weakly, then Theorem 11.1 will yield a limit theorem for $\hat h_n = r_n(\hat\theta_n - \theta_0)$. Thus we will typically proceed in steps in studying the limiting distribution of M-estimators $\hat\theta_n$ of Euclidean parameters.

Step 1: Prove that $\hat\theta_n$ is consistent: $\hat\theta_n \to_p \theta_0$.

Step 2: Establish a rate of convergence $r_n$ of the sequence $\hat\theta_n$, or equivalently, show that the sequence of "local estimators" $\hat h_n = r_n(\hat\theta_n - \theta_0)$ is tight.

Step 3: Show that an appropriate localized criterion function $\tilde M_n(h)$ as in (1) converges in distribution (i.e. weakly) to a limit process $M$ in $\ell^\infty(\{h : \|h\| \le K\})$ for every $K$. If the limit process $M$ has sample functions which are upper semicontinuous with a unique maximum $\hat h$, then the final conclusion is that the sequence $r_n(\hat\theta_n - \theta_0) \Rightarrow \hat h$.

Example 11.1 (Parametric maximum likelihood). Suppose that we observe $X_1, \ldots, X_n$ from a density $p_\theta$ where $\theta \in \Theta \subset \mathbb{R}^d$. Then the maximum likelihood estimator $\hat\theta_n$ (assuming that it exists and is unique) satisfies
\[
M_n(\hat\theta_n) = \sup_{\theta \in \Theta} M_n(\theta),
\]
where $M_n(\theta) = n^{-1} \sum_{i=1}^{n} \log p_\theta(X_i) = \mathbb{P}_n m_\theta(X)$ with $m_\theta(x) = \log p_\theta(x)$.
If $p_\theta$ is smooth enough as a function of $\theta$, then the sequence of local log-likelihood ratios is locally asymptotically normal (here $r_n = \sqrt{n}$ and $s_n = n = r_n^2$): under $P_0 = P_{\theta_0}$,
$$\begin{aligned}
n\left(M_n(\theta_0 + n^{-1/2}h) - M_n(\theta_0)\right) &= \sum_{i=1}^n \log\frac{p_{\theta_0 + h/\sqrt{n}}}{p_{\theta_0}}(X_i) \\
&= h^{\top}\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot\ell_{\theta_0}(X_i) - \frac{1}{2}h^{\top}I(\theta_0)h + o_{P_0}(1),
\end{aligned}$$
where $\dot\ell_\theta$ is the score function for the model (usually $\nabla_\theta \log p_\theta$), and $I(\theta_0)$ is the Fisher information matrix. The finite-dimensional distributions of the stochastic processes on the right side of the display converge in law to the finite-dimensional laws of a Gaussian process, that is,
$$n\left(M_n(\theta_0 + n^{-1/2}h) - M_n(\theta_0)\right) \to_d M(h) = h^{\top}\Delta - \frac{1}{2}h^{\top}I(\theta_0)h,$$
where $\Delta \sim N_d(0, I(\theta_0))$. Then, for each fixed $h$, $M(h) \sim N\left(-\frac{1}{2}h^{\top}I(\theta_0)h,\; h^{\top}I(\theta_0)h\right)$. If $\theta_0$ is an interior point of $\Theta$, then the sequence $\hat h_n$ typically converges in distribution to the maximizer $\hat h$ of this process over all $h \in \mathbb{R}^d$, that is:
$$\hat h_n = \sqrt{n}(\hat\theta_n - \theta_0) = \arg\max_{h}\left\{M_n(\theta_0 + n^{-1/2}h) - M_n(\theta_0)\right\} \to_d \hat h = \arg\max_{h \in \mathbb{R}^d}\{M(h)\}.$$
Assuming that $I(\theta_0)$ is invertible, we can write
$$M(h) = -\frac{1}{2}\left(h - I^{-1}(\theta_0)\Delta\right)^{\top} I(\theta_0)\left(h - I^{-1}(\theta_0)\Delta\right) + \frac{1}{2}\Delta^{\top}I^{-1}(\theta_0)\Delta,$$
and it follows that $M$ is maximized by $\hat h = I^{-1}(\theta_0)\Delta \sim N_d(0, I^{-1}(\theta_0))$ with maximum value $\frac{1}{2}\Delta^{\top}I^{-1}(\theta_0)\Delta$. If we could strengthen the finite-dimensional convergence indicated above to convergence as a process in $\ell^\infty(\{h : \|h\| \le K\})$, then the above arguments would yield
$$\hat h_n = \sqrt{n}(\hat\theta_n - \theta_0) \to_d \hat h \sim N_d(0, I^{-1}(\theta_0)).$$
We will take this approach in Chapters 12 and 13.

The classical results on asymptotic normality of maximum likelihood estimators make the convergence in the last display rigorous by imposing rather strong smoothness conditions. Our approach in Chapters 12 and 13 will instead follow the theorem below, which requires considerably weaker smoothness hypotheses than the classical conditions.

Theorem 11.2 (van der Vaart (1998)) Suppose that the model $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean at an inner point $\theta_0$ of $\Theta \subset \mathbb{R}^k$.
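Furthermore, suppose that there exists a measurable function $\dot\ell$ with $P_{\theta_0}\dot\ell^2 < \infty$ such that, for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,
$$|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)| \le \dot\ell(x)\,\|\theta_1 - \theta_2\|.$$
If the Fisher information matrix $I_{\theta_0}$ is nonsingular and $\hat\theta_n$ is consistent, then
$$\sqrt{n}(\hat\theta_n - \theta_0) = I_{\theta_0}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot\ell_{\theta_0}(X_i) + o_{P_{\theta_0}}(1).$$
In particular, the sequence $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically normal with mean zero and covariance matrix $I_{\theta_0}^{-1}$.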
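
To make the conclusion of Theorem 11.2 concrete, the following is a minimal simulation sketch, not part of the original notes. It assumes a one-dimensional exponential model $p_\theta(x) = \theta e^{-\theta x}$, chosen here only for illustration; in this model the maximum likelihood estimator is $\hat\theta_n = 1/\bar X_n$ and the Fisher information is $I(\theta) = 1/\theta^2$, so the limit variance of $\sqrt{n}(\hat\theta_n - \theta_0)$ should be $\theta_0^2$. The function name `simulate_scaled_mle_errors`, the seed, and the values of `theta0`, `n`, and `reps` are all hypothetical choices.

```python
import math
import random
import statistics


def simulate_scaled_mle_errors(theta0, n, reps, seed=0):
    """Return `reps` draws of sqrt(n) * (theta_hat - theta0) for Exp(rate=theta0) data.

    Illustrative assumption: the model is exponential with rate theta,
    for which the MLE has the closed form 1 / (sample mean).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.expovariate(theta0) for _ in range(n)]
        theta_hat = 1.0 / statistics.fmean(sample)  # closed-form MLE of the rate
        draws.append(math.sqrt(n) * (theta_hat - theta0))
    return draws


theta0 = 2.0
draws = simulate_scaled_mle_errors(theta0, n=400, reps=2000)

empirical_mean = statistics.fmean(draws)
empirical_var = statistics.variance(draws)
limit_var = theta0 ** 2  # inverse Fisher information, since I(theta) = 1 / theta^2

print(f"empirical mean:     {empirical_mean:.3f} (limit: 0)")
print(f"empirical variance: {empirical_var:.3f} (limit: {limit_var:.3f})")
```

With a moderately large `n`, the empirical mean should be close to zero and the empirical variance close to $\theta_0^2$. The agreement is only approximate at any finite sample size, since the theorem is a statement about the limit.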