EMPIRICAL PROCESSES: THEORY AND APPLICATIONS

From the lectures of the
"Corso Estivo di Statistica e Calcolo delle Probabilità" (Summer Course in Statistics and Probability)
Torgnon (Aosta)
July 2003
Jon A. Wellner, University of Washington
Moulinath Banerjee, University of Michigan
Edited by Sergio Venturini
with the collaboration of
D. Ait Aoudio, S. Antignani, R. Argiento, A. Barla, S. Bianconcini, G. Cappelletti, B.
Casella, M. Copetti, P. De Blasi, V. Edefonti, G. Esposito, A. Farcomeni, B.
Martinucci, E. Masiello, C. May, P. Nastro, L. Sangalli, C. Valerio
Contents
I Empirical Processes: Theory
1 Introduction
1.1 Some History
1.2 Examples
2 Weak convergence: the fundamental theorems
2.1 Exercises
3 Maximal Inequalities and Chaining
3.1 Orlicz norms and the Pisier inequality
3.2 Gaussian and sub-Gaussian processes via Hoeffding's Inequality
3.3 Bernstein's inequality and ψ1-Orlicz norms for maxima
3.4 Exercises
4 Inequalities for sums of independent processes
4.1 Symmetrization inequalities
4.2 The Ottaviani Inequality
4.3 Levy's Inequalities
4.4 Hoffman-Jørgensen Inequalities
5 Glivenko-Cantelli Theorems
5.1 Glivenko-Cantelli classes F
5.2 Universal and Uniform Glivenko-Cantelli classes
5.3 Preservation of the GC property
5.4 Exercises
6 Donsker Theorems: Uniform CLT's
6.1 Uniform Entropy Donsker Theorem
6.2 Bracketing Entropy Donsker Theorems
6.3 Donsker Theorem for Classes Changing with Sample Size
6.4 Universal and Uniform Donsker Classes
6.5 Exercises
7 VC-theory: bounding uniform covering numbers
7.1 Introduction
7.2 Convex Hulls
8 Bracketing Numbers
8.1 Smooth Functions
8.2 Monotone Functions
8.3 Convex Functions and Convex Sets
8.4 Lower layers
8.5 Exercises
9 Multiplier Inequalities and CLT
9.1 The unconditional multiplier CLT
9.2 Conditional multiplier CLT's
II Empirical Processes: Applications
10 Consistency of Maximum Likelihood Estimators
10.1 Exercises
11 M-Estimators: the Argmax Continuous Mapping Theorem
12 Rates of convergence
13 M-Estimators and Z-Estimators
13.1 M-Estimators, continued
13.2 Z-Estimators: Huber's Z-Theorem
13.3 Z-Estimators: van der Vaart's Z-Theorem
14 Bootstrap Empirical Processes
14.1 Introduction
14.1.1 The general idea
14.1.2 Consistency of the Bootstrap Estimator
14.2 The Empirical Bootstrap
14.2.1 Basic definitions and results
14.2.2 The Delta Method for the Empirical Bootstrap
14.3 The Exchangeable Bootstrap
15 Semiparametric Models
15.1 Tangent spaces and Information
15.2 Lower Bounds
15.3 Efficient Score Functions
15.4 Semiparametric models and Empirical Processes
15.5 Efficient MLE in Semiparametric Mixture Models
15.6 Example: Errors in variables
III Special topics
16 Cube root asymptotics
16.1 Introduction
16.2 Limiting processes and relevant functionals
17 Asymptotic Theory for Monotone Functions
18 Split Point Estimation in Decision Trees
18.1 Split Point Estimation in Non Parametric Regression
18.2 Split Point Estimation for a Hazard Function
Bibliography
List of Figures
16.1 The Greatest Convex Minorant $G_{1,1}$ of $W(t) + t^2$
16.2 The unconstrained one-sided convex minorants $G^{L}_{1,1}$ and $G^{R}_{1,1}$ of $W(t) + t^2$
16.3 The constrained one-sided convex minorants $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$ of $W(t) + t^2$
16.4 The minorants $G_{1,1}$, $G^{0}_{1,1}$, $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$
16.5 Close-up view of $G_{1,1}$, $G^{0}_{1,1}$, $G^{LC}_{1,1}$ and $G^{RC}_{1,1}$
17.1 Cusum diagram and greatest convex minorant
17.2 The universality of D
Part I
Empirical Processes: Theory
Chapter 1
Introduction
1.1 Some History
Empirical process theory began in the 1930’s and 1940’s with the study of the empirical
distribution function Fn and the corresponding empirical process. If X1 , . . . , Xn are i.i.d.
real-valued random variables with distribution function F (and corresponding probability
measure P on R), then the empirical distribution function is
\[ F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{(-\infty, x]}(X_i), \qquad x \in \mathbb{R}, \]
and the corresponding empirical process is
\[ Z_n(x) = \sqrt{n}\,(F_n(x) - F(x)). \]
Two of the basic results concerning Fn and Zn are the Glivenko-Cantelli theorem and
the Donsker theorem:
Theorem 1.1 (Glivenko-Cantelli (1933))
\[ \|F_n - F\|_\infty = \sup_{-\infty < x < \infty} |F_n(x) - F(x)| \to_{a.s.} 0. \]
Theorem 1.2 (Donsker (1952))
\[ Z_n \Rightarrow Z \equiv \mathbb{U}(F) \quad \text{in } D(\mathbb{R}, \|\cdot\|_\infty), \]
where $\mathbb{U}$ is a standard Brownian bridge process on [0, 1].
Remember that a standard Brownian bridge process on [0, 1] is a zero-mean Gaussian
process $\mathbb{U}$ with covariance function
\[ E(\mathbb{U}(s)\mathbb{U}(t)) = s \wedge t - st, \qquad s, t \in [0, 1]. \]
With the symbol ⇒ we denote weak convergence of stochastic processes in the sense that
will be specified in Chapter 2.
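As a rough numerical illustration of Theorems 1.1 and 1.2, the sketch below computes $F_n$ and the sup distance $\|F_n - F\|_\infty$ for simulated data. It assumes $F$ is the Uniform(0,1) distribution function; the function names `empirical_cdf` and `sup_distance_uniform` are illustrative choices, not notation from these notes.

```python
import bisect
import math
import random

def empirical_cdf(sample):
    """Return F_n as a function x -> (1/n) * #{i : X_i <= x}."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

def sup_distance_uniform(sample):
    """sup_x |F_n(x) - x| when F is the Uniform(0,1) distribution function.

    F_n is a step function, so the supremum is reached at an order
    statistic, either at the jump or just before it.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)  # fixed seed, assumed only for reproducibility
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    d = sup_distance_uniform(sample)
    # d should shrink with n, while sqrt(n) * d stays of order one.
    print(n, d, math.sqrt(n) * d)
```

The first printed column of distances reflects the Glivenko-Cantelli statement; the rescaled column is the sup norm of $Z_n$, whose limit law is that of the sup norm of a Brownian bridge.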
Remark 1.1 In the statement of Donsker's theorem we have ignored measurability difficulties related to the fact that $D(\mathbb{R}, \|\cdot\|_\infty)$ is a nonseparable Banach space. For
most of this text (the exceptions are Chapters 2 and 3) we will continue to ignore
these difficulties. For a complete treatment of the necessary weak convergence theory,
see van der Vaart and Wellner (1996), Part I - Stochastic Convergence. The occasional
stars as superscripts on P ’s and functions refer to outer measures in the first case, and
minimal measurable envelopes in the second case. We recommend ignoring the ∗’s on a
first reading.
The need for generalizations of Theorems 1.1 and 1.2 became apparent in the 1950’s
and 1960's. In particular, it became apparent that when the observations are in a
more general sample space $\mathcal{X}$ (such as $\mathbb{R}^d$, or a Riemannian manifold, or some space
of functions, etc.), then the empirical distribution function is not as natural. It becomes
much more natural to consider the empirical measure $\mathbb{P}_n$ indexed by some class of subsets
$\mathcal{C}$ of the sample space $\mathcal{X}$, or, more generally yet, $\mathbb{P}_n$ indexed by some class of real-valued
functions $\mathcal{F}$ defined on $\mathcal{X}$.
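Suppose now that $X_1, \ldots, X_n$ are i.i.d. $P$ on $\mathcal{X}$. Then the empirical measure $\mathbb{P}_n$ is
defined by
\[ \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}; \]
thus for any Borel set $A \subset \mathcal{X}$
\[ \mathbb{P}_n(A) = \frac{1}{n} \sum_{i=1}^{n} 1_A(X_i) = \frac{\#\{i \le n : X_i \in A\}}{n}. \]
For a real-valued function $f$ on $\mathcal{X}$, we write
\[ \mathbb{P}_n(f) = \int f \, d\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i). \]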
If C is a collection of subsets of X , then
{Pn (C) : C ∈ C}
is the empirical measure indexed by C. If F is a collection of real-valued functions defined
on X , then
{Pn (f ) : f ∈ F}
is the empirical measure indexed by F.
The empirical process $\mathbb{G}_n$ is defined by
\[ \mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P), \]
thus $\{\mathbb{G}_n(C) : C \in \mathcal{C}\}$ is the empirical process indexed by $\mathcal{C}$, while $\{\mathbb{G}_n(f) : f \in \mathcal{F}\}$ is
the empirical process indexed by $\mathcal{F}$.
(Of course the case of sets is a special case of indexing by functions, obtained by taking
$\mathcal{F} = \{1_C : C \in \mathcal{C}\}$.)
Note that the classical empirical distribution function for real-valued random variables can be viewed as the special case of the general theory for which X = R ,
C = {(−∞, x] : x ∈ R}.
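The definitions above translate directly into code. The following is a minimal sketch, assuming the sample is a plain list of reals and that the value $P(f)$ is known and supplied by the caller; the names `empirical_measure`, `empirical_process` and `indicator_half_line` are hypothetical.

```python
import math

def empirical_measure(sample, f):
    """P_n(f) = (1/n) * sum_i f(X_i)."""
    return sum(f(x) for x in sample) / len(sample)

def empirical_process(sample, f, Pf):
    """G_n(f) = sqrt(n) * (P_n(f) - P(f)); Pf is the known value of P(f)."""
    return math.sqrt(len(sample)) * (empirical_measure(sample, f) - Pf)

def indicator_half_line(x0):
    """The function 1_{(-inf, x0]}, so that P_n of it equals F_n(x0)."""
    return lambda x: 1.0 if x <= x0 else 0.0

sample = [0.1, 0.2, 0.5, 0.9]
f = indicator_half_line(0.3)
print(empirical_measure(sample, f))       # F_n(0.3) for this sample
print(empirical_process(sample, f, 0.3))  # Z_n(0.3) if P is Uniform(0,1)
```

Indexing by the half-line indicators recovers the classical $F_n$ and $Z_n$; any other function $f$ with known mean can be passed in the same way.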
Two central questions for the general theory are:
(i) For what classes of sets C or functions F does a natural generalization of the
Glivenko-Cantelli Theorem 1.1 hold?
(ii) For what classes of sets C or functions F does a natural generalization of the
Donsker Theorem 1.2 hold?
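If $\mathcal{F}$ is a class of functions for which
\[ \|\mathbb{P}_n - P\|^*_{\mathcal{F}} = \Big( \sup_{f \in \mathcal{F}} |\mathbb{P}_n(f) - P(f)| \Big)^* \to_{a.s.} 0, \]
then we say that $\mathcal{F}$ is a P-Glivenko-Cantelli class of functions.
If $\mathcal{F}$ is a class of functions for which
\[ \mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P) \Rightarrow \mathbb{G} \quad \text{in } \ell^\infty(\mathcal{F}), \]
where $\mathbb{G}$ is a mean-zero P-Brownian bridge process with (uniformly) continuous sample
paths with respect to the semi-metric $\rho_P(f, g)$ defined by
\[ \rho_P^2(f, g) = \mathrm{Var}_P(f(X) - g(X)), \]
then we say that $\mathcal{F}$ is a P-Donsker class of functions.
Here
\[ \ell^\infty(\mathcal{F}) = \Big\{ x : \mathcal{F} \to \mathbb{R} \ \text{such that} \ \|x\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |x(f)| < \infty \Big\}, \]
and $\mathbb{G}$ is a P-Brownian bridge process on $\mathcal{F}$ if it is a mean-zero Gaussian process with
covariance function
\[ E\{\mathbb{G}(f)\mathbb{G}(g)\} = P(fg) - P(f)P(g). \]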
Answers to these questions began to emerge during the 1970's, especially in the work
of Vapnik and Chervonenkis (1971) and Dudley (1978), with notable contributions in
the 1970's and 1980's by David Pollard, Evarist Giné, Joel Zinn, Michel Talagrand,
Peter Gaenssler, and many others. We will give statements of some generalizations of
Theorems 1.1 and 1.2 in later chapters. As will become apparent, however, the methods
developed apply beyond the specific context of empirical processes of i.i.d. random
variables. Many of the maximal inequalities and inequalities for processes apply much
more generally. The tools developed will apply to maxima and suprema of large families
of random variables in considerable generality.
The main focus in the second half of these notes will be on applications of these results to
problems in statistics. Thus we briefly consider several examples in which the usefulness of the
general theory becomes apparent. The third part is instead dedicated to
an overview of some recent research topics that involve the theory of empirical processes.
1.2 Examples
A commonly recurring theme in statistics is that we want to prove consistency or asymptotic normality of some statistic which is not a sum of independent random variables,
but can be related to some natural sum of random functions indexed by a parameter in
a suitable (metric) space. The following examples illustrate the basic idea.
Example 1.1 Suppose that $X, X_1, \ldots, X_n, \ldots$ are i.i.d. random variables with $E|X_1| < \infty$,
and let $\mu = E(X)$. Consider the absolute deviations about the sample mean
\[ D_n = \mathbb{P}_n |X - \bar{X}_n| = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}_n|, \]
as an estimator of scale. This is an average of the dependent random variables $|X_i - \bar{X}_n|$. There
are several routes available for showing that
\[ D_n \to_{a.s.} d \equiv E|X - E(X)|, \tag{1.1} \]
but the methods we will develop in these notes lead to study of the random functions
\[ D_n(t) = \mathbb{P}_n |X - t| \qquad \text{for } |t - \mu| \le \delta \]
for $\delta > 0$. Note that this is just the empirical measure indexed by the collection of
functions
\[ \mathcal{F} = \{ x \mapsto |x - t| : |t - \mu| \le \delta \}, \]
and $D_n(\bar{X}_n) = D_n$.
As we will see, this collection of functions is a VC–subgraph class of functions with
an integrable envelope function F , and hence empirical process theory can be used to
establish the desired convergence.
We might try showing (1.1) directly, but the corresponding central limit theorem is
trickier. (By the way, this example was one of the illustrative examples considered by
Pollard (1989)).
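The random function $D_n(t)$ of Example 1.1 is easy to simulate. The sketch below assumes, purely for illustration, that $X$ is Exponential(1), for which $\mu = 1$ and $d = E|X - 1| = 2/e$; the function name `D` is a hypothetical stand-in for $D_n(\cdot)$.

```python
import math
import random

def D(sample, t):
    """D_n(t) = P_n |X - t|, the mean absolute deviation about the point t."""
    return sum(abs(x - t) for x in sample) / len(sample)

rng = random.Random(1)  # fixed seed, assumed only for reproducibility
sample = [rng.expovariate(1.0) for _ in range(20_000)]
x_bar = sum(sample) / len(sample)

D_n = D(sample, x_bar)   # D_n = D_n(X_bar_n)
d = 2.0 / math.e         # E|X - 1| for the Exponential(1) distribution
print(D_n, d)            # the two values should be close for large n
```

Evaluating `D(sample, t)` over a grid of `t` near $\mu$ shows the whole random function at once, which is the object the empirical process approach controls uniformly.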
Example 1.2 Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n), \ldots$ are i.i.d. $F_0$ on $\mathbb{R}^2$, and let $\mathbb{F}_n$
denote their (classical!) empirical distribution function
\[ \mathbb{F}_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} 1_{(-\infty, x] \times (-\infty, y]}(X_i, Y_i). \]
Consider the empirical distribution function of the random variables $\mathbb{F}_n(X_i, Y_i)$, $i = 1, \ldots, n$,
\[ \mathbb{K}_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{[\mathbb{F}_n(X_i, Y_i) \le t]}, \qquad t \in [0, 1]. \]
Once again the random variables $\{\mathbb{F}_n(X_i, Y_i)\}_{i=1}^{n}$ are dependent. In this case we are
already studying a stochastic process indexed by $t \in [0, 1]$. The empirical process method
leads to study of the process $\mathbb{K}_n$ indexed by $t \in [0, 1]$ and $F \in \mathcal{F}_2$, the class of all
distribution functions on $\mathbb{R}^2$,
\[ \mathbb{K}_n(t, F) = \frac{1}{n} \sum_{i=1}^{n} 1_{[F(X_i, Y_i) \le t]} = \mathbb{P}_n 1_{[F(X, Y) \le t]}, \qquad t \in [0, 1], \quad F \in \mathcal{F}_2, \]
or perhaps with $\mathcal{F}_2$ replaced by the smaller class of functions $\mathcal{F}_{2,\delta} = \{F \in \mathcal{F}_2 : \|F - F_0\|_\infty \le \delta\}$.
Note that this is the empirical measure indexed by the collection of functions
\[ \mathcal{F} = \{ (x, y) \mapsto 1_{[F(x, y) \le t]} : t \in [0, 1], \ F \in \mathcal{F}_2 \}, \]
or the subset thereof obtained by replacing $\mathcal{F}_2$ by $\mathcal{F}_{2,\delta}$, and $\mathbb{K}_n(t, \mathbb{F}_n) = \mathbb{K}_n(t)$.
Can we prove that $\mathbb{K}_n(t) \to_{a.s.} K(t) = P(F_0(X, Y) \le t)$ uniformly in $t \in [0, 1]$?
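To make the objects in Example 1.2 concrete, here is a small sketch that evaluates the bivariate empirical distribution function at the sample points and then forms $\mathbb{K}_n(t)$. The data layout (a list of pairs) and the names `F_n` and `K_n` are assumptions made for this illustration; the quadratic-time loop is chosen for clarity, not efficiency.

```python
import random

def F_n(data, x, y):
    """Bivariate empirical df: fraction of points (X_i, Y_i) with X_i <= x and Y_i <= y."""
    return sum(1 for (a, b) in data if a <= x and b <= y) / len(data)

def K_n(data, t):
    """Empirical df of the dependent variables F_n(X_i, Y_i), evaluated at t."""
    values = [F_n(data, a, b) for (a, b) in data]
    return sum(1 for v in values if v <= t) / len(values)

rng = random.Random(2)  # fixed seed, assumed only for reproducibility
data = [(rng.random(), rng.random()) for _ in range(300)]  # independent uniforms
for t in (0.1, 0.25, 0.5, 1.0):
    print(t, K_n(data, t))
```

Each value `F_n(data, a, b)` depends on the whole sample, which is exactly the dependence noted in the example.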
Chapter 2
Weak convergence: the fundamental theorems
In this chapter we give a characterization of convergence in law of sample bounded
processes.
Let $T$ be a set and let $\{X_n(t), t \in T\}_{n \in \mathbb{N}}$ be a sequence of stochastic processes indexed
by the set $T$, with $X_n$ defined on the probability space $(\Omega, \mathcal{A}, P)$. Assume that the
processes have versions with almost all their trajectories bounded and let us continue
denoting by $X_n$ their sample bounded versions. Then $X_n(\cdot) \in \ell^\infty(T)$ almost surely,
where $\ell^\infty(T)$ is the space of all bounded functions on $T$. The space $\ell^\infty(T)$, equipped
with the sup norm $\|\cdot\|_T$, is a Banach space that is separable only if $T$ is finite. We
do not assume that the finite-dimensional laws of the processes $X_n$ correspond to the
finite-dimensional laws of (individually) tight Borel measures on $\ell^\infty(T)$. (Recall that a
Borel probability measure $\mu$ is called tight if for every $\varepsilon > 0$ there exists a compact set
$K$ with $\mu(K) \ge 1 - \varepsilon$; a random variable $X$ is called tight if its law $P \circ X^{-1}$ is tight.)
Now let $X(t)$, $t \in T$, with $X$ defined on the probability space $(\Omega, \mathcal{A}, P)$, be a sample
bounded process whose finite-dimensional laws do correspond to the finite-dimensional
laws of a tight Borel probability measure on $\ell^\infty(T)$. Then we say that $X_n$ converges in
law (or, weakly) to $X$ uniformly in $t \in T$, and write
\[ X_n \Rightarrow X \quad \text{in } \ell^\infty(T), \tag{2.1} \]
if
\[ E^* H(X_n) \to E H(X) \]
for all bounded continuous functions $H : \ell^\infty(T) \to \mathbb{R}$. As with usual convergence in law,
if $F$ is a continuous function on $\ell^\infty(T)$ with values in another metric space and $F(X_n)$
is measurable, then (2.1) implies $F(X_n) \Rightarrow F(X)$ in the usual way.
The following theorem, known as the Portmanteau theorem, gives equivalent ways of
describing weak convergence in a metric space $(D, d)$.
Theorem 2.1 (Portmanteau theorem) Let $X_n$, $n \in \mathbb{N}$, and $X$ be random variables
that take values in a metric space $(D, d)$. Then the following are equivalent:
(i) $X_n \Rightarrow X$ in $D$;
(ii) $E^* f(X_n) \to E f(X)$ for all real bounded, uniformly continuous $f$ on $D$;
(iii) $\limsup_n P^*(X_n \in F) \le P(X \in F)$ for all closed sets $F \subset D$;
(iv) $\liminf_n P^*(X_n \in G) \ge P(X \in G)$ for all open sets $G \subset D$;
(v) $\lim_n P^*(X_n \in A) = P(X \in A)$ for all Borel sets $A$ with $P(X \in \partial A) = 0$.
Proof.
The implication (i) ⇒ (ii) is trivial.
(ii) ⇒ (iii). Consider $F$ closed and let
\[ f(x) := \Big( 1 - \frac{d(x, F)}{\varepsilon} \Big)^{+}, \qquad \text{where } d(x, F) := \inf_{y \in F} d(x, y) \text{ and } \varepsilon > 0. \]
The function $f$ so defined is bounded and continuous, even uniformly
continuous, because $|f(x) - f(y)| \le d(x, y)/\varepsilon$. Moreover $x \in F$ implies $f(x) = 1$, while
$x \notin F^{\varepsilon} := \{z : d(z, F) < \varepsilon\}$ implies $d(x, F) \ge \varepsilon$, and hence $f(x) = 0$. Therefore we
have
\[ 1_F(x) \le f(x) \le 1_{F^{\varepsilon}}(x) \]
and hence
\[ \limsup_n P^*(X_n \in F) \le \limsup_n E^* f(X_n) = E f(X) \le P(X \in F^{\varepsilon}). \]
Since $F$ is closed, letting $\varepsilon \downarrow 0$ we get (iii).
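The sandwich $1_F \le f \le 1_{F^{\varepsilon}}$ used in the step (ii) ⇒ (iii) can be checked numerically in a simple case. The sketch below assumes $D = \mathbb{R}$ with the usual distance and the closed set $F = [0, 1]$; both choices, and the names `dist_to_interval` and `f_eps`, are illustrative only.

```python
def dist_to_interval(x, a=0.0, b=1.0):
    """d(x, F) for the closed interval F = [a, b] on the real line."""
    return max(a - x, 0.0, x - b)

def f_eps(x, eps):
    """f(x) = (1 - d(x, F) / eps)^+ from the proof of (ii) => (iii)."""
    return max(1.0 - dist_to_interval(x) / eps, 0.0)

eps = 0.1
grid = [k / 100 - 0.5 for k in range(201)]  # points in [-0.5, 1.5]
for x in grid:
    in_F = 1.0 if dist_to_interval(x) == 0.0 else 0.0
    in_F_eps = 1.0 if dist_to_interval(x) < eps else 0.0
    assert in_F <= f_eps(x, eps) <= in_F_eps  # 1_F <= f <= 1_{F^eps}

# Lipschitz bound |f(x) - f(y)| <= d(x, y) / eps on neighbouring grid points
# (a tiny tolerance absorbs floating-point rounding).
for x, y in zip(grid, grid[1:]):
    assert abs(f_eps(x, eps) - f_eps(y, eps)) <= abs(x - y) / eps + 1e-9
```

As `eps` decreases, `f_eps` decreases pointwise to the indicator of $F$, which is the limiting step at the end of the argument.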
(iii) ⇔ (iv). This equivalence follows easily by complementation.
(iii) & (iv) ⇒ (v). Let $A$ be a Borel set with $P(X \in \partial A) = 0$, and denote by $A^{\circ}$ its interior
and by $\bar{A}$ its closure; then conditions (iii) and (iv) together imply
\[ P(X \in \bar{A}) \ge \limsup_n P^*(X_n \in \bar{A}) \ge \limsup_n P^*(X_n \in A) \ge \liminf_n P^*(X_n \in A) \ge \liminf_n P^*(X_n \in A^{\circ}) \ge P(X \in A^{\circ}). \]
Since $P(X \in \partial A) = 0$, the extreme terms here coincide with $P(X \in A)$ and (v) follows.
(v) ⇒ (i). Without loss of generality we may assume that the bounded $f$ satisfies
$0 < f < 1$. Then
\[ E f(X) = \int_0^{\infty} P\{f(X) > t\}\,dt = \int_0^{1} P\{f(X) > t\}\,dt, \]
and similarly for $E^* f(X_n)$. If
$f$ is continuous, then $\partial\{f > t\} \subset \{f = t\}$, and hence $P(X \in \partial\{f > t\}) = 0$ except for
countably many $t$. By condition (v) and the bounded convergence theorem, we get
\[ E^* f(X_n) = \int_0^{1} P^*\{f(X_n) > t\}\,dt \to \int_0^{1} P\{f(X) > t\}\,dt = E f(X). \]
□
The following proposition gives a description of the sample bounded processes $X$ that
do induce a tight Borel measure on $\ell^\infty(T)$.
Proposition 2.1 (de la Peña and Giné (1999), van der Vaart and Wellner (1996))
Let $X(t)$, $t \in T$, be a sample bounded stochastic process. Then the finite-dimensional
distributions of $X$ are those of a tight Borel probability measure on $\ell^\infty(T)$ if and only if
there exists a pseudometric $\rho$ on $T$ for which $(T, \rho)$ is totally bounded and such that $X$
has a version with almost all its sample paths uniformly continuous for $\rho$.
Proof.
Let us assume the probability law of $X$ is a tight Borel measure on $\ell^\infty(T)$. Then there
exists a sequence $K_m$, $m \in \mathbb{N}$, of compact sets in $\ell^\infty(T)$ such that
\[ P\Big( X \in \bigcup_{m=1}^{\infty} K_m \Big) = 1, \]
and let $K = \bigcup_{m=1}^{\infty} K_m$.
Then we will show that the pseudometric $\rho$ defined on $T$ by
\[ \rho(s, t) = \sum_{m=1}^{\infty} 2^{-m} \big( 1 \wedge \rho_m(s, t) \big) \]
with
\[ \rho_m(s, t) = \sup\{ |x(s) - x(t)| : x \in K_m \} \]
makes $(T, \rho)$ totally bounded. To show this, given $\varepsilon > 0$, let $k$ be such that
\[ \sum_{m=k+1}^{\infty} 2^{-m} < \frac{\varepsilon}{4} \]
and let $\{x_1, \ldots, x_r\}$ be a finite subset of $\bigcup_{m=1}^{k} K_m$ that is $\varepsilon/4$-dense in $\bigcup_{m=1}^{k} K_m$ for the sup
norm, that is, for each $x \in \bigcup_{m=1}^{k} K_m$ there is an integer $i \le r$ such that $\|x - x_i\|_T \le \varepsilon/4$.
Such a finite set of functions exists by the compactness of $\bigcup_{m=1}^{k} K_m$. The subset $A$ of $\mathbb{R}^r$
defined by $\{(x_1(t), \ldots, x_r(t)) : t \in T\}$ is bounded since $x_1, \ldots, x_r$ are bounded functions.
Therefore $A$ is totally bounded (in $\mathbb{R}^r$ bounded is the same as totally bounded). Hence
there exists a finite set $T_{\varepsilon} = \{t_j : 1 \le j \le N\}$ such that, for each $t \in T$, there is a $j \le N$
for which $\max_{1 \le i \le r} |x_i(t) - x_i(t_j)| \le \varepsilon/4$. Then $T_{\varepsilon}$ is $\varepsilon$-dense in $T$ for the pseudometric
$\rho$: if $t$ and $t_j$ are as above, then, for $x \in K_m$, $m \le k$, it follows that
\[ |x(t) - x(t_j)| \le |x(t) - x_i(t)| + |x_i(t) - x_i(t_j)| + |x_i(t_j) - x(t_j)| \le 2\|x - x_i\|_T + |x_i(t) - x_i(t_j)| \]
and choosing $i$ such that $\|x - x_i\|_T \le \varepsilon/4$ we get
\[ |x(t) - x(t_j)| \le \frac{3\varepsilon}{4}, \]
thus
\[ \rho_m(t, t_j) = \sup_{x \in K_m} |x(t) - x(t_j)| \le \frac{3\varepsilon}{4} \]
and hence
\[ \rho(t, t_j) \le \sum_{m=1}^{k} 2^{-m} \rho_m(t, t_j) + \sum_{m=k+1}^{\infty} 2^{-m} \le \frac{3\varepsilon}{4} \sum_{m=1}^{k} 2^{-m} + \frac{\varepsilon}{4} < \varepsilon. \]
This proves $(T, \rho)$ is totally bounded. Moreover, the functions $x \in K_m$ are uniformly
$\rho$-continuous, since, if $x \in K_m$, then $|x(s) - x(t)| \le \rho_m(s, t) \le 2^m \rho(s, t)$ for all $s, t \in T$
with $\rho(s, t) \le 1$. Since $P(X \in K) = 1$, the identity map of $(\ell^\infty(T), \mathcal{B}, P \circ X^{-1})$ (where $\mathcal{B}$
denotes the Borel σ-algebra on $\ell^\infty(T)$) yields a version of $X$ with almost all its sample
paths in $K$, hence in $C_u(T, \rho)$, the space of bounded uniformly $\rho$-continuous functions
on $T$. This proves the direct half of the proposition.
Conversely, let $X(t)$, $t \in T$, be a process with a version whose sample paths are almost
all in $C_u(T, \rho)$ for a metric or pseudometric $\rho$ on $T$ for which $(T, \rho)$ is totally bounded,
and let us continue to denote such a version by $X$. We can assume all the trajectories of
$X$ are uniformly continuous. The map $X : \Omega \to C_u(T, \rho)$ is Borel measurable because
the random vectors $(X(t_1), \ldots, X(t_k))$, $t_i \in T$, $k \in \mathbb{N}$, are Borel measurable and the
Borel σ-algebra of $C_u(T, \rho)$ is generated by the "finite-dimensional sets" $\{x \in C_u(T, \rho) : (x(t_1), \ldots, x(t_k)) \in A\}$
for all Borel sets $A$ of $\mathbb{R}^k$, $t_i \in T$, $k \in \mathbb{N}$. Hence the probability
law of $X$ is a Borel measure on $C_u(T, \rho)$. This space is complete, since $\ell^\infty(T)$ is complete
and $C_u(T, \rho)$ is closed in $\ell^\infty(T)$ (the uniform limit of uniformly continuous functions is
still uniformly continuous), and it is separable, since $(T, \rho)$ is totally bounded and thus
separable. Ulam's theorem says that if $S$ is complete and separable, then each probability
measure on $(S, \mathcal{S})$ is tight (see for example Billingsley (1968), Theorem 1.4, page 10).
Thus the induced probability law of $X$ is tight on $C_u(T, \rho)$. But a tight Borel measure
on $C_u(T, \rho)$ is tight also on $\ell^\infty(T)$, since a compact set in $C_u(T, \rho)$ is compact also in
$\ell^\infty(T)$.
□
The following theorem characterizes weak convergence in $\ell^\infty(T)$ in terms of asymptotic equicontinuity and convergence of finite-dimensional distributions.
Definition 2.1 A sequence $\{X_n\}$ in $\ell^\infty(T)$ is said to be asymptotically uniformly equicontinuous in probability with respect to the pseudometric $\rho$ if for every $\varepsilon, \eta > 0$ there exists
a $\delta > 0$ such that
\[ \limsup_n P^*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \tag{2.2} \]
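Condition (2.2) can be explored by simulation for the classical uniform empirical process. The sketch below is a rough Monte Carlo check under several assumptions that are not in the text: the data are Uniform(0,1), $\rho(s,t) = |s - t|$, the supremum is approximated on a grid of 101 points, and the names `uniform_empirical_process`, `modulus` and `exceed_frequency` are hypothetical.

```python
import bisect
import math
import random

def uniform_empirical_process(sample, grid):
    """Z_n(t) = sqrt(n) * (F_n(t) - t) on a grid, for Uniform(0,1) data."""
    xs = sorted(sample)
    n = len(xs)
    return [math.sqrt(n) * (bisect.bisect_right(xs, t) / n - t) for t in grid]

def modulus(values, grid, delta):
    """max |Z(s) - Z(t)| over grid pairs with |s - t| < delta."""
    m = len(grid)
    return max(
        abs(values[i] - values[j])
        for i in range(m)
        for j in range(i, m)
        if grid[j] - grid[i] < delta
    )

def exceed_frequency(n, delta, eps, reps, rng):
    """Monte Carlo frequency of {modulus > eps} over `reps` samples of size n."""
    grid = [k / 100 for k in range(101)]
    hits = 0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        z = uniform_empirical_process(sample, grid)
        hits += modulus(z, grid, delta) > eps
    return hits / reps

rng = random.Random(0)  # fixed seed, assumed only for reproducibility
for delta in (0.5, 0.1, 0.02):
    print(delta, exceed_frequency(n=400, delta=delta, eps=0.5, reps=100, rng=rng))
```

For a fixed path the modulus is nondecreasing in `delta`, so the estimated frequencies tend to fall as `delta` shrinks; this is only a finite-grid, finite-sample indication of (2.2), not a proof.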
Theorem 2.2 The following are equivalent:
(i) All the finite-dimensional distributions of the sample bounded processes $X_n$ converge
in law and there exists a pseudometric $\rho$ on $T$ such that both $(T, \rho)$ is totally bounded
and the processes $X_n$ are asymptotically uniformly equicontinuous in probability with
respect to $\rho$;
(ii) There exists a process $X$ whose law is a tight Borel probability measure
on $\ell^\infty(T)$ and such that
\[ X_n \Rightarrow X \quad \text{in } \ell^\infty(T). \]
If (i) holds, then the process $X$ in (ii), which is completely determined by the limiting
finite-dimensional distributions of $\{X_n\}$, has a version with sample paths in $C_u(T, \rho)$. If
$X$ in (ii) has a version with almost all its trajectories in $C_u(T, \gamma)$ for some pseudometric
$\gamma$ for which $(T, \gamma)$ is totally bounded, then (i) holds with the pseudometric $\rho$ taken to be
$\gamma$.
Proof.
Suppose (i) holds. Let $T_\infty$ be a countable $\rho$-dense subset of $T$, and let $T_k$, $k \in \mathbb{N}$,
be finite subsets of $T$ satisfying $T_k \uparrow T_\infty$. Such sets exist since any totally bounded set is
separable. The limit laws of the finite-dimensional distributions of the processes $X_n$ are
consistent and thus define a stochastic process $X$ on $T$. Moreover, by the Portmanteau
Theorem for finite-dimensional convergence in law, for every $\varepsilon > 0$ and $\delta > 0$,
\[ P\Big\{ \sup_{s,t \in T_k : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big\} \le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_k : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \varepsilon \Big\} \le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \varepsilon \Big\}. \]
Taking the limit as $k \to \infty$ of the left-hand side of this inequality we have
\[ \lim_{k \to \infty} P\Big\{ \sup_{s,t \in T_k : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big\} = P\Big\{ \bigcup_k \Big( \sup_{s,t \in T_k : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big) \Big\} = P\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big\}. \]
Thus we get
\[ P\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big\} \le \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \varepsilon \Big\}. \]
Taking the limit as $\delta \to 0$ and using the asymptotic equicontinuity condition we have
that
\[ \lim_{\delta \to 0} P\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X(s) - X(t)| > \varepsilon \Big\} \le \lim_{\delta \to 0} \liminf_{n \to \infty} P_*\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \varepsilon \Big\} \le \lim_{\delta \to 0} \limsup_{n \to \infty} P^*\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| > \varepsilon \Big\} = 0. \]
Thus we can find a sequence $\delta_m \downarrow 0$ such that
\[ P\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta_m} |X(s) - X(t)| > 2^{-m} \Big\} \le 2^{-m}. \]
Hence it follows by Borel-Cantelli that
\[ P\Big\{ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta_m} |X(s) - X(t)| > 2^{-m} \ \text{infinitely often} \Big\} = 0, \]
that is, there exists $m(\omega) < \infty$ almost surely such that
\[ \sup_{s,t \in T_\infty : \rho(s,t) \le \delta_m} |X(s, \omega) - X(t, \omega)| \le 2^{-m} \qquad \forall\, m > m(\omega). \]
Therefore $X(t, \omega)$ is a $\rho$-uniformly continuous function of $t \in T_\infty$ for almost every $\omega$; $T$
being totally bounded, the restriction to $T_\infty$ of $X(t, \omega)$ is also bounded. The extension
to $T$ by uniform continuity of the restriction of $X$ to $T_\infty$ (only the $\omega$ set where $X$ is
uniformly continuous needs to be considered) yields a version of $X$ with sample paths all in
$C_u(T, \rho)$. Then it follows from Proposition 2.1 that the law of $X$ exists as a tight Borel
measure on $\ell^\infty(T)$.
To prove weak convergence note that, since $(T, \rho)$ is totally bounded, for every $\delta > 0$
there exists a finite set of points $\{t_1, \ldots, t_{N(\delta)}\}$ that is $\delta$-dense in $(T, \rho)$, i.e.
$T \subset \bigcup_{i=1}^{N(\delta)} B(t_i, \delta)$, where $B(t_i, \delta)$ is the open ball with center $t_i$ and radius $\delta$. Thus, for
each $t \in T$ we can choose $\pi_\delta(t) \in \{t_1, \ldots, t_{N(\delta)}\}$ so that $\rho(\pi_\delta(t), t) < \delta$. Then we define
processes $X_{n,\delta}$, $n \in \mathbb{N}$, and $X_\delta$ by
\[ X_{n,\delta}(t) = X_n(\pi_\delta(t)), \qquad X_\delta(t) = X(\pi_\delta(t)), \qquad t \in T. \]
These are approximations of $X_n$ and $X$ that take at most a finite number $N(\delta)$ of values.
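The construction of the $\delta$-dense set and the map $\pi_\delta$ can be sketched for a finite index set. The greedy rule below is one convenient way to produce such a net and is an assumption of this illustration, not the construction used in the proof; the names `greedy_net`, `project` and `discretize` are hypothetical.

```python
def greedy_net(points, rho, delta):
    """Greedily pick centers so that every point is within rho-distance < delta of one."""
    centers = []
    for t in points:
        if all(rho(t, c) >= delta for c in centers):
            centers.append(t)  # t is not yet covered, so it becomes a center
    return centers

def project(t, centers, rho):
    """pi_delta(t): a nearest center; by construction rho(pi_delta(t), t) < delta."""
    return min(centers, key=lambda c: rho(c, t))

def discretize(path, centers, rho):
    """X_delta(t) = X(pi_delta(t)) for a path given as a function of t."""
    return lambda t: path(project(t, centers, rho))

rho = lambda s, t: abs(s - t)
T = [k / 100 for k in range(101)]       # finite stand-in for the index set
centers = greedy_net(T, rho, delta=0.1)
X = lambda t: t * (1.0 - t)             # an arbitrary uniformly continuous path
X_delta = discretize(X, centers, rho)
print(len(centers), max(abs(X(t) - X_delta(t)) for t in T))
```

The discretized path takes at most `len(centers)` distinct values, and its sup distance to the original path is controlled by the oscillation of the path over distances smaller than `delta`.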
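Convergence of the finite-dimensional distributions of $X_n$ to those of $X$ implies that
\[ X_{n,\delta} \Rightarrow X_\delta \quad \text{in } \ell^\infty(T). \tag{2.3} \]
Furthermore, uniform continuity of the sample paths of $X$ yields
\[ \lim_{\delta \to 0} \|X - X_\delta\|_T = 0 \quad \text{a.s.} \tag{2.4} \]
Indeed, since $\rho(t, \pi_\delta(t)) < \delta$ for every $t \in T$, we get
\[ \|X - X_\delta\|_T = \sup_{t \in T} |X(t) - X(\pi_\delta(t))| \le \sup_{s,t \in T : \rho(s,t) < \delta} |X(s) - X(t)| \quad \text{a.s.}, \]
and the right-hand side tends to zero as $\delta \to 0$ by uniform continuity of the sample paths; thus $\|X - X_\delta\|_T \to 0$. Now let $H : \ell^\infty(T) \to \mathbb{R}$ be
bounded and continuous. Then, using the triangle inequality we have that
\[ |E^* H(X_n) - E H(X)| \le |E^* H(X_n) - E H(X_{n,\delta})| + |E H(X_{n,\delta}) - E H(X_\delta)| + |E H(X_\delta) - E H(X)| \equiv I_{n,\delta} + II_{n,\delta} + III_\delta. \]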
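In order to prove the convergence part of (ii) it suffices to show that $\lim_{\delta \to 0} \limsup_{n \to \infty}$
of each of these quantities is zero. This is true for $II_{n,\delta}$ by (2.3). Next we show it for $III_\delta$.
Given $\varepsilon > 0$, let $K \subset \ell^\infty(T)$ be a compact set such that $P(X \in K^c) < \varepsilon/(6\|H\|_\infty)$. By
Exercise 2.1, there exists a $\tau > 0$ such that, if $x \in K$ and $y \in \ell^\infty(T)$ with $\|x - y\|_T < \tau$,
then $|H(x) - H(y)| < \varepsilon/6$. Let $\delta_1 > 0$ be such that $P(\|X_\delta - X\|_T \ge \tau) < \varepsilon/(6\|H\|_\infty)$ for
all $\delta < \delta_1$; this can be done by virtue of (2.4). Then it follows that
\[ |E H(X_\delta) - E H(X)| \le 2\|H\|_\infty P\big( [X \in K^c] \cup [\|X_\delta - X\|_T \ge \tau] \big) + \sup\{ |H(x) - H(y)| : x \in K, \ \|x - y\|_T < \tau \} \le 2\|H\|_\infty \Big( \frac{\varepsilon}{6\|H\|_\infty} + \frac{\varepsilon}{6\|H\|_\infty} \Big) + \frac{\varepsilon}{6} < \varepsilon, \]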
so that $\lim_{\delta \to 0} III_\delta = 0$ holds. To show that $\lim_{\delta \to 0} \limsup_{n \to \infty} I_{n,\delta} = 0$, choose $\varepsilon$, $\tau$,
and $K$ as above. Then we have
\[ |E^* H(X_n) - E H(X_{n,\delta})| \le 2\|H\|_\infty \big\{ P^*(\|X_n - X_{n,\delta}\|_T \ge \tau/2) + P(X_{n,\delta} \in (K_{\tau/2})^c) \big\} + \sup\{ |H(x) - H(y)| : x \in K, \ \|x - y\|_T < \tau \}, \tag{2.5} \]
where $K_{\tau/2}$ is the $\tau/2$ open neighborhood of the set $K$ for the sup norm. The inequality
in the previous display can be checked as follows: if $X_{n,\delta} \in K_{\tau/2}$ and $\|X_n - X_{n,\delta}\|_T < \tau/2$,
then there exists $x \in K$ such that $\|x - X_{n,\delta}\|_T < \tau/2$ and $\|x - X_n\|_T < \tau$. Now, the
asymptotic equicontinuity hypothesis implies that there is a $\delta_2$ such that
\[ \limsup_{n \to \infty} P^*\{ \|X_n - X_{n,\delta}\|_T \ge \tau/2 \} < \frac{\varepsilon}{6\|H\|_\infty} \qquad \forall\, \delta < \delta_2, \]
and finite-dimensional convergence yields
\[ \limsup_{n \to \infty} P^*\{ X_{n,\delta} \in (K_{\tau/2})^c \} \le P^*\{ X_\delta \in (K_{\tau/2})^c \} \le \frac{\varepsilon}{6\|H\|_\infty}. \]
Hence we conclude from (2.5) that, for $\delta < \delta_1 \wedge \delta_2$,
\[ \limsup_{n \to \infty} |E^* H(X_n) - E H(X_{n,\delta})| < \varepsilon, \]
and this completes the proof that (i) implies (ii).
Let us now prove the converse implication. If (ii) holds, then by Proposition 2.1 there
is a pseudometric $\rho$ on $T$ which makes $(T, \rho)$ totally bounded and such that $X$ has a
version (that we still denote by $X$) with all its sample paths in $C_u(T, \rho)$. Now consider
the closed set $F_{\delta,\varepsilon}$ defined by
\[ F_{\delta,\varepsilon} = \Big\{ x \in \ell^\infty(T) : \sup_{s,t \in T : \rho(s,t) \le \delta} |x(s) - x(t)| \ge \varepsilon \Big\}. \]
Applying the Portmanteau theorem we have that
\[ \limsup_{n \to \infty} P^*\Big\{ \sup_{s,t \in T : \rho(s,t) \le \delta} |X_n(s) - X_n(t)| \ge \varepsilon \Big\} \le P\Big\{ \sup_{s,t \in T : \rho(s,t) \le \delta} |X(s) - X(t)| \ge \varepsilon \Big\}. \]
Taking limits as $\delta \to 0$ yields asymptotic equicontinuity in view of the $\rho$-uniform continuity of the sample paths of $X$. Thus (ii) implies (i).
□
The following is an obvious corollary of Theorem 2.2 for the empirical process $\mathbb{G}_n$ indexed by a class of measurable real-valued functions $\mathcal{F}$ on the probability space $(\mathcal{X}, \mathcal{A}, P)$,
with the pseudometric $\rho_P$ defined by
\[ \rho_P^2(f, g) = \mathrm{Var}_P(f(X) - g(X)) = P(f - g)^2 - [P(f - g)]^2. \]
Corollary 2.1 Let $\mathcal{F}$ be a class of measurable functions on $(\mathcal{X}, \mathcal{A})$. Then the following
are equivalent:
(i) $(\mathcal{F}, \rho_P)$ is totally bounded and $\mathbb{G}_n$ is asymptotically uniformly equicontinuous in probability with respect to $\rho_P$;
(ii) $\mathcal{F}$ is P-Donsker, i.e.
\[ \mathbb{G}_n \Rightarrow \mathbb{G} \quad \text{in } \ell^\infty(\mathcal{F}), \]
where $\mathbb{G}$ is a mean-zero P-Brownian bridge with uniformly continuous sample paths
with respect to $\rho_P$.
Proof.
(i) ⇒ (ii). From Theorem 2.2, all we need to show is that the finite-dimensional
distributions of $\mathbb{G}_n$ converge to those of $\mathbb{G}$ (recall that $\mathbb{G}$ is a mean-zero Gaussian
process with covariance function $E\{\mathbb{G}(f)\mathbb{G}(g)\} = P(fg) - P(f)P(g)$). But this follows
directly from the Multivariate Central Limit Theorem: for $f_1, \ldots, f_k \in \mathcal{F}$,
\[ \begin{pmatrix} \mathbb{G}_n f_1 \\ \mathbb{G}_n f_2 \\ \vdots \\ \mathbb{G}_n f_k \end{pmatrix} = \sqrt{n} \begin{pmatrix} (1/n) \sum_{i=1}^{n} f_1(X_i) - P(f_1) \\ (1/n) \sum_{i=1}^{n} f_2(X_i) - P(f_2) \\ \vdots \\ (1/n) \sum_{i=1}^{n} f_k(X_i) - P(f_k) \end{pmatrix} \Rightarrow N[0, C], \]
where $C = [c_{i,j}]_{i=1,\ldots,k;\, j=1,\ldots,k}$ with $c_{i,j} = \mathrm{Cov}(f_i(X_1), f_j(X_1))$; but
\[ \mathrm{Cov}(f_i(X_1), f_j(X_1)) = E[f_i(X_1) f_j(X_1)] - E[f_i(X_1)] E[f_j(X_1)] = P(f_i f_j) - P(f_i) P(f_j). \]
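The covariance identity behind the finite-dimensional convergence can be checked by simulation. The sketch below assumes $P$ is Uniform(0,1) with $f(x) = x$ and $g(x) = x^2$, so that $P(fg) - P(f)P(g) = 1/4 - 1/6 = 1/12$; these choices and the names `G_n` and `simulated_covariance` are illustrative assumptions.

```python
import math
import random

def G_n(sample, f, Pf):
    """G_n(f) = sqrt(n) * (P_n(f) - P(f)), with P(f) supplied as Pf."""
    n = len(sample)
    return math.sqrt(n) * (sum(f(x) for x in sample) / n - Pf)

def simulated_covariance(n, reps, rng):
    """Monte Carlo estimate of E[G_n(f) G_n(g)] for f(x) = x, g(x) = x^2."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        total += G_n(sample, lambda x: x, 1 / 2) * G_n(sample, lambda x: x * x, 1 / 3)
    return total / reps

exact = 1 / 4 - (1 / 2) * (1 / 3)  # P(fg) - P(f) P(g) under Uniform(0,1)
estimate = simulated_covariance(n=100, reps=2000, rng=random.Random(0))
print(exact, estimate)  # the estimate should be close to 1/12
```

Because the $X_i$ are i.i.d., $E[\mathbb{G}_n(f)\mathbb{G}_n(g)]$ equals $\mathrm{Cov}(f(X_1), g(X_1))$ for every $n$, so the agreement does not depend on $n$ being large; only the Gaussian shape of the joint law is asymptotic.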
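(ii) ⇒ (i). From Theorem 2.2, all we need to show is that $(\mathcal{F}, \rho_P)$ is totally bounded.
Since $\mathbb{G}$ is tight, from Proposition 2.1 it follows that there exists a pseudometric $\rho$ on $\mathcal{F}$
for which $(\mathcal{F}, \rho)$ is totally bounded and such that $\mathbb{G}$ has a version with almost all its
sample paths uniformly continuous for $\rho$: for every pair $f, g \in \mathcal{F}$,
\[ |\mathbb{G}(f) - \mathbb{G}(g)| \le \alpha\, \rho(f, g) \quad \text{a.s. for some } \alpha > 0. \]
Thus
\[ |\mathbb{G}(f) - \mathbb{G}(g)|^2 \le (\alpha\, \rho(f, g))^2 \quad \text{a.s.} \]
and
\[ \big( E|\mathbb{G}(f) - \mathbb{G}(g)|^2 \big)^{1/2} \le \alpha\, \rho(f, g). \tag{2.6} \]
But
\[ E|\mathbb{G}(f) - \mathbb{G}(g)|^2 = \mathrm{Var}(\mathbb{G}(f)) + \mathrm{Var}(\mathbb{G}(g)) - 2\,\mathrm{Cov}(\mathbb{G}(f), \mathbb{G}(g)) = \mathrm{Var}(f(X_1)) + \mathrm{Var}(g(X_1)) - 2\,\mathrm{Cov}(f(X_1), g(X_1)) = \mathrm{Var}(f(X_1) - g(X_1)) = \rho_P^2(f, g). \]
Hence equation (2.6) implies that if $f \in B_\rho(f_i, \varepsilon)$ then $f \in B_{\rho_P}(f_i, \alpha\varepsilon)$, where
$B_\rho(f_i, \varepsilon)$ and $B_{\rho_P}(f_i, \varepsilon)$ denote the open balls of center $f_i$ and radius $\varepsilon$ in $(\mathcal{F}, \rho)$ and
$(\mathcal{F}, \rho_P)$ respectively. This shows that $(\mathcal{F}, \rho_P)$ is also totally bounded.
□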
We close this chapter by defining asymptotic tightness and showing two characterizations of this property.
Definition 2.2 A sequence $\{X_n\}$ in $\ell^\infty(T)$ is said to be asymptotically tight if for every
$\varepsilon > 0$ there exists a compact set $K \subset \ell^\infty(T)$ such that
\[ \liminf_{n \to \infty} P_*(X_n \in K^\delta) \ge 1 - \varepsilon, \qquad \forall\, \delta > 0. \]
Here $K^\delta = \{y \in \ell^\infty(T) : d(y, K) < \delta\}$ is the "δ-enlargement" of $K$.
Asymptotic tightness can be given a more concrete form, either through finite approximation or by connecting tightness to (asymptotic, uniform, equi-) continuity of the
sample paths.
The idea of finite approximation is that for any $\varepsilon > 0$ the index set $T$ can be partitioned into finitely many subsets $T_i$ such that (asymptotically) the variation of the
sample paths $t \mapsto X_n(t)$ is less than $\varepsilon$ on every one of the sets $T_i$. More precisely, it is
assumed that for every $\varepsilon, \eta > 0$, there exists a partition $T = \bigcup_{i=1}^{k} T_i$ such that
\[ \limsup_n P^*\Big( \sup_{1 \le i \le k} \ \sup_{s,t \in T_i} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \tag{2.7} \]
Under this condition the asymptotic behaviour of the process can be described within error margin
$\varepsilon, \eta > 0$ by the behaviour of the marginal $(X_n(t_1), \ldots, X_n(t_k))$ for arbitrary fixed points
$t_i \in T_i$. If the process can thus be reduced to a finite set of coordinates for any $\varepsilon, \eta > 0$ and
the sequences of marginal distributions are tight, then the sequence $X_n$ is asymptotically
tight.
We are now ready to give the two characterizations.
Theorem 2.3 The following are equivalent:
(i) The sequence {Xn } is asymptotically tight;
(ii) $\{X_n(t)\}$ is asymptotically tight in $\mathbb{R}$ for every $t$ in $T$ and, for all $\varepsilon, \eta > 0$, there exists a finite partition $T = \bigcup_{i=1}^k T_i$ such that (2.7) holds;
(iii) $\{X_n(t)\}$ is asymptotically tight in $\mathbb{R}$ for every $t$ in $T$ and there exists a pseudometric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $\{X_n\}$ is asymptotically uniformly $\rho$-equicontinuous in probability.
Proof.
(ii) ⇒ (i). For any partition $T = \bigcup_{i=1}^k T_i$ and points $t_i \in T_i$, the norm $\|X_n\|_T$ is bounded by $\sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)|$. Indeed,
\begin{align*}
\|X_n\|_T = \sup_{t\in T} |X_n(t)| &= \sup_i \sup_{t\in T_i} |X_n(t)| \\
&\le \sup_i \Big\{ \sup_{t\in T_i} |X_n(t) - X_n(t_i)| + |X_n(t_i)| \Big\} \\
&\le \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)|. \tag{2.8}
\end{align*}
Let us choose a partition such that (2.7) holds. Note that $\sup_i |X_n(t_i)|$ is asymptotically tight, being the maximum of finitely many asymptotically tight sequences of real variables; that is, for every $\xi > 0$ there exists a constant $M$ such that
\[ \liminf_{n\to\infty} P_*\Big( \sup_i |X_n(t_i)| \le M + \delta \Big) \ge 1 - \xi \qquad \forall \delta > 0. \tag{2.9} \]
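From (2.8) we have
\begin{align*}
\liminf_{n\to\infty} & P_*\big( \|X_n\|_T \le (M + \varepsilon) + \delta \big) \\
&\ge \liminf_{n\to\infty} P_*\Big( \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| + \sup_i |X_n(t_i)| \le M + \varepsilon + \delta \Big) \\
&\ge \liminf_{n\to\infty} P_*\Big( \big\{ \sup_i |X_n(t_i)| \le M + \delta \big\} \cap \big\{ \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| \le \varepsilon \big\} \Big) \\
&= 1 - \limsup_{n\to\infty} P^*\Big( \big\{ \sup_i |X_n(t_i)| > M + \delta \big\} \cup \big\{ \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \varepsilon \big\} \Big) \\
&\ge 1 - \limsup_{n\to\infty} \Big\{ P^*\Big( \sup_i |X_n(t_i)| > M + \delta \Big) + P^*\Big( \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \varepsilon \Big) \Big\} \\
&\ge 1 - \limsup_{n\to\infty} P^*\Big( \sup_i |X_n(t_i)| > M + \delta \Big) - \limsup_{n\to\infty} P^*\Big( \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \varepsilon \Big) \\
&= \liminf_{n\to\infty} P_*\Big( \sup_i |X_n(t_i)| \le M + \delta \Big) - \limsup_{n\to\infty} P^*\Big( \sup_i \sup_{t\in T_i} |X_n(t) - X_n(t_i)| > \varepsilon \Big)
\end{align*}
and, from (2.9) and (2.7), we get
\[ \liminf_{n\to\infty} P_*\big( \|X_n\|_T \le (M + \varepsilon) + \delta \big) \ge 1 - \xi - \eta, \qquad \forall \delta > 0, \]
that is, the sequence $\|X_n\|_T$ is asymptotically tight in $\mathbb{R}$.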
Fix $\zeta > 0$ and a sequence $\varepsilon_m \downarrow 0$. Take a constant $M$ such that $\limsup P^*(\|X_n\|_T > M) < \zeta$, and for each $\varepsilon = \varepsilon_m$ and $\eta = 2^{-m}\zeta$, take a partition $T = \bigcup_{i=1}^k T_i$ as in (2.7). For the moment $m$ is fixed and we do not let it appear in the notation. Let $\{z_1, \ldots, z_p\}$ be the set of all functions in $\ell^\infty(T)$ that are constant on each $T_i$ and take only the values $0, \pm\varepsilon_m, \ldots, \pm\lfloor M/\varepsilon_m \rfloor \varepsilon_m$ (there is a finite number of such functions). Let $K_m$ be the union of the $p$ closed balls of radius $\varepsilon_m$ around the $z_i$. Then, by construction, the two conditions
\[ \|X_n\|_T \le M, \qquad \sup_{1\le i\le k}\ \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon_m \]
imply that $X_n \in K_m$. This is true for each fixed $m$.
Let $K = \bigcap_{m=1}^\infty K_m$. Then $K$ is closed and totally bounded (by construction of the $K_m$ and because $\varepsilon_m \downarrow 0$) and hence compact. Furthermore, for every $\delta > 0$, there is a $j$ with $K^\delta \supset \bigcap_{m=1}^j K_m$. If not, then there would be a sequence $y_j$ not in $K^\delta$, but with $y_j \in \bigcap_{m=1}^j K_m$ for every $j$. This would have a subsequence contained in one of the balls making up $K_1$, a further subsequence eventually contained in one of the balls making up $K_2$, and so on. The “diagonal” sequence, formed by taking the first of the first subsequence, the second of the second subsequence and so on, would eventually be contained in a ball of radius $\varepsilon_j$ for every $j$; hence it would be Cauchy. Its limit would be in $K$, contradicting the fact that $d(y_j, K) \ge \delta$ for every $j$.
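Conclude that if $X_n$ is not in $K^\delta$, then it is not in $\bigcap_{m=1}^j K_m$ for some fixed $j$. Then
\begin{align*}
\limsup_{n\to\infty} P^*\big( X_n \notin K^\delta \big) &\le \limsup_{n\to\infty} P^*\Big( X_n \notin \bigcap_{m=1}^j K_m \Big) \\
&\le \limsup_{n\to\infty} P^*\Big( \{\|X_n\|_T > M\} \cup \bigcup_{m=1}^j \Big\{ \sup_{1\le i\le k}\ \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon_m \Big\} \Big) \\
&\le \limsup_{n\to\infty} P^*\big( \|X_n\|_T > M \big) + \sum_{m=1}^j \limsup_{n\to\infty} P^*\Big( \sup_{1\le i\le k}\ \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon_m \Big) \\
&\le \zeta + \sum_{m=1}^j 2^{-m}\zeta < 2\zeta. \tag{2.10}
\end{align*}
Thus, for every $\varepsilon > 0$ there exists a compact set $K \subset \ell^\infty(T)$ such that
\[ \liminf_{n\to\infty} P_*\big( X_n \in K^\delta \big) \ge 1 - \varepsilon \qquad \forall \delta > 0, \]
that is, $X_n$ is asymptotically tight.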
(iii) ⇒ (ii). For every $\varepsilon, \eta > 0$ take $\delta > 0$ such that (2.2) holds. Since $T$ is totally bounded, it can be covered with finitely many balls of radius $\delta$. Construct a partition by disjointifying these balls.
(i) ⇒ (iii). Let $K_1 \subset K_2 \subset \cdots$ be compacts in $\ell^\infty(T)$ with $\liminf P_*(X_n \in K_m^\varepsilon) \ge 1 - 1/m$ for every $\varepsilon > 0$. For every fixed $m$, define a semimetric $\rho_m$ on $T$ by
\[ \rho_m(s,t) = \sup_{z\in K_m} |z(s) - z(t)|, \qquad s, t \in T. \]
Then $(T, \rho_m)$ is totally bounded. Indeed, cover $K_m$ by finitely many balls of (arbitrarily small) radius $\eta$, centered at $z_1, \ldots, z_k$. Partition $\mathbb{R}^k$ into cubes of edge $\eta$, and for every cube pick at most one $t \in T$ such that $(z_1(t), \ldots, z_k(t))$ is in the cube. Since $z_1, \ldots, z_k$ are uniformly bounded, this gives finitely many points $t_1, \ldots, t_p$. The quantity $\rho_m(t, t_i)$ can be bounded by
\[ 2 \sup_{z\in K_m} \inf_j \|z - z_j\|_T + \sup_j |z_j(t_i) - z_j(t)|. \]
Indeed,
\begin{align*}
|z(t) - z(t_i)| &\le |z(t) - z_j(t)| + |z_j(t) - z_j(t_i)| + |z_j(t_i) - z(t_i)| \\
&\le 2\|z - z_j\|_T + |z_j(t) - z_j(t_i)|;
\end{align*}
since this holds for every $j$,
\begin{align*}
|z(t) - z(t_i)| &\le \inf_j \big\{ 2\|z - z_j\|_T + |z_j(t) - z_j(t_i)| \big\} \\
&\le \inf_j \big\{ 2\|z - z_j\|_T + \sup_j |z_j(t) - z_j(t_i)| \big\} \\
&\le 2 \inf_j \|z - z_j\|_T + \sup_j |z_j(t) - z_j(t_i)|,
\end{align*}
and, taking the sup over $z \in K_m$,
\[ \rho_m(t, t_i) \le 2 \sup_{z\in K_m} \inf_j \|z - z_j\|_T + \sup_j |z_j(t_i) - z_j(t)|. \]
Since for every $t$ there is a $t_i$ for which $(z_1(t), \ldots, z_k(t))$ and $(z_1(t_i), \ldots, z_k(t_i))$ fall in the same cube, the balls $\{t : \rho_m(t, t_i) < 3\eta\}$ cover $T$.
Next set
\[ \rho(s,t) = \sum_{m=1}^\infty 2^{-m} \big( \rho_m(s,t) \wedge 1 \big). \]
Fix $\eta > 0$. Take a natural number $k$ with $2^{-k} < \eta$. Cover $T$ with finitely many $\rho_k$-balls of radius $\eta$. Let $t_1, \ldots, t_p$ be their centers. Since $\rho_1 \le \rho_2 \le \ldots$, for every $t$ there is a $t_i$ with $\rho(t, t_i) < 2\eta$. Indeed,
\begin{align*}
\rho(t, t_i) &\le \sum_{m=1}^k 2^{-m} \rho_m(t, t_i) + \sum_{m=k+1}^\infty 2^{-m} \\
&\le \rho_k(t, t_i) \sum_{m=1}^k 2^{-m} + 2^{-k} \sum_{m=1}^\infty 2^{-m} \\
&\le \rho_k(t, t_i) + 2^{-k} \\
&< \rho_k(t, t_i) + \eta.
\end{align*}
Thus $(T, \rho)$ is totally bounded, too.
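It is clear from the definitions that $|z(s) - z(t)| \le \rho_m(s,t)$ for every $z \in K_m$ and that $\rho_m(s,t) \wedge 1 \le 2^m \rho(s,t)$. Thus, for any $\varepsilon \le 1$, if $\rho(s,t) < 2^{-m}\varepsilon$ then $|z(s) - z(t)| \le \varepsilon$ for every $z \in K_m$. Moreover, by the triangle inequality, for any pair $s, t \in T$,
\begin{align*}
|z(s) - z(t)| &\le |z(s) - z_0(s)| + |z_0(s) - z_0(t)| + |z_0(t) - z(t)| \\
&\le 2\|z - z_0\|_T + |z_0(s) - z_0(t)|.
\end{align*}
Finally, if $z \in K_m^\varepsilon$ then there exists a $z_0 \in K_m$ such that $\|z - z_0\|_T < \varepsilon$. Thus
\[ K_m^\varepsilon \subset \Big\{ z : \sup_{\rho(s,t) < 2^{-m}\varepsilon} |z(s) - z(t)| \le 3\varepsilon \Big\}. \]
Hence, for given $\varepsilon$ and $m$, and for $\delta < 2^{-m}\varepsilon$,
\begin{align*}
\liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\varepsilon \Big) &\ge \liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<2^{-m}\varepsilon} |X_n(s) - X_n(t)| \le 3\varepsilon \Big) \\
&\ge \liminf_{n\to\infty} P_*\big( X_n \in K_m^\varepsilon \big) \ge 1 - 1/m.
\end{align*}
Thus $\{X_n\}$ is asymptotically uniformly $\rho$-equicontinuous in probability.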
$\Box$
2.1 Exercises
Exercise 2.1 Show that if $H : \ell^\infty(T) \to \mathbb{R}$ is bounded and continuous, and $K \subset \ell^\infty(T)$ is compact, then for every $\varepsilon > 0$ there is a $\delta > 0$ such that, if $x \in K$ and $y \in \ell^\infty(T)$ with $\|y - x\|_T < \delta$, then $|H(x) - H(y)| < \varepsilon$.
Solution. Since $H$ is continuous, for any fixed $z \in K$ and for every $\varepsilon > 0$ there exists a $\delta(z) > 0$ such that: if $v \in \ell^\infty(T)$ with $\|v - z\|_T < \delta(z)$, then $|H(v) - H(z)| < \varepsilon/2$. Denoting by $B(z; \delta(z)/2)$ the open ball of center $z$ and radius $\delta(z)/2$, we have that
\[ K \subset \bigcup_{z\in K} B(z; \delta(z)/2). \]
Since $K$ is compact, there exist $z_1, \ldots, z_n$ such that
\[ K \subset \bigcup_{i=1}^n B(z_i; \delta(z_i)/2). \]
Let $\delta := \min\{\delta(z_1)/2, \delta(z_2)/2, \ldots, \delta(z_n)/2\}$. Then $\delta$ does the job. Indeed, take $x \in K$ and $y \in \ell^\infty(T)$ with $\|y - x\|_T < \delta$. Since $x \in K$ there exists $z_k$ such that $x \in B(z_k; \delta(z_k)/2)$. Thus, by the triangle inequality,
\[ \|y - z_k\|_T \le \|y - x\|_T + \|x - z_k\|_T < \delta + \delta(z_k)/2 \le \delta(z_k). \]
Finally, since both $x$ and $y$ lie within $\delta(z_k)$ of $z_k$, the triangle inequality and the continuity of $H$ give
\[ |H(x) - H(y)| \le |H(x) - H(z_k)| + |H(y) - H(z_k)| < \varepsilon/2 + \varepsilon/2 = \varepsilon. \]
$\Box$
Exercise 2.2 Prove that if $X_n \Rightarrow X$ in $\ell^\infty(T)$ and $g : \ell^\infty(T) \to D$ for a metric space $(D, d)$ is continuous, then $g(X_n) \Rightarrow g(X)$ in $(D, d)$.
Solution. Since $X_n \Rightarrow X$ in $\ell^\infty(T)$, we have that, for every bounded and continuous function $f : \ell^\infty(T) \to \mathbb{R}$,
\[ E[f(X_n)] \to E[f(X)]. \tag{2.11} \]
We want to prove that, given $g : \ell^\infty(T) \to D$ continuous, for every bounded and continuous $h : D \to \mathbb{R}$,
\[ E[h(g(X_n))] \to E[h(g(X))]. \tag{2.12} \]
But, thanks to the continuity of $g$ and $h$ and the boundedness of $h$, the composition $h \circ g$ is bounded and continuous. Hence (2.11), applied with $f = h \circ g$, implies (2.12).
$\Box$
Chapter 3
Maximal Inequalities and Chaining
3.1 Orlicz norms and the Pisier inequality
Let $\psi$ be a Young modulus, that is, a convex increasing unbounded function $\psi : [0, \infty) \to [0, \infty)$ satisfying $\psi(0) = 0$. For any random variable $X$, the $L_\psi$-Orlicz norm of $X$ is defined as
\[ \|X\|_\psi = \inf\Big\{ c > 0 : E\psi\Big( \frac{|X|}{c} \Big) \le 1 \Big\}. \]
The function
\[ \psi_p(x) = e^{x^p} - 1 \tag{3.1} \]
is a Young modulus for each $p \ge 1$.
Moreover, it is easy to see that for every $p \ge 1$ there exists $c_p < \infty$ such that the inequality $\|X\|_p \le c_p \|X\|_{\psi_1}$ holds for any random variable $X$. In detail, we can show that the previous relationship holds for
\[ c_p = (\Gamma(p+1))^{1/p}. \]
Proof. Without loss of generality, we can assume that $X \ge 0$. In this case, by definition, we get that
\[ \|X\|_{\psi_1} = \inf\big\{ c > 0 : E\big( e^{X/c} - 1 \big) \le 1 \big\}, \]
while, on the other side, we have that
\[ \|X\|_p = (E X^p)^{1/p}. \]
Thus, it remains to show that
\[ (E X^p)^{1/p} \le (\Gamma(p+1))^{1/p} \inf\big\{ c > 0 : E\big( e^{X/c} - 1 \big) \le 1 \big\}. \]
To do this, the following inequality is crucial:
\[ x^p \le \Gamma(p+1)\,(e^x - 1), \qquad \text{for } x \ge 0,\ p \ge 1. \]
From it we get, for any $c > 0$,
\[ \Big( \frac{X}{c} \Big)^p \le \Gamma(p+1) \big( e^{X/c} - 1 \big), \]
while, taking the expectation,
\[ E\Big[ \Big( \frac{X}{c} \Big)^p \Big] \le \Gamma(p+1)\, E\big( e^{X/c} - 1 \big) \]
and also
\[ \frac{E(X^p)}{\Gamma(p+1)\, c^p} \le E\big( e^{X/c} - 1 \big). \]
Now, we can consider the set
\[ D = \big\{ c > 0 : E\big( e^{X/c} - 1 \big) \le 1 \big\}. \]
If $c \in D$, we get
\[ \frac{E(X^p)}{\Gamma(p+1)\, c^p} \le 1 \]
and
\[ E(X^p) \le c^p \cdot \Gamma(p+1). \]
Thus, taking the infimum over $c \in D$ on the right side,
\[ E(X^p) \le \inf_{c\in D} c^p\, \Gamma(p+1) = \|X\|_{\psi_1}^p\, \Gamma(p+1), \]
and finally
\[ \|X\|_p \le \|X\|_{\psi_1} (\Gamma(p+1))^{1/p}. \]
Obviously, if $p$ is an integer, the relationship becomes simply
\[ \|X\|_p \le \|X\|_{\psi_1} (p!)^{1/p}. \]
In order to conclude the proof, it remains only to prove the crucial inequality used in it. First of all, we can note that, for any integer $m \ge 1$,
\[ e^x - 1 \ge \frac{x^m}{m!} = \frac{x^m}{\Gamma(m+1)}, \]
from which it follows that
\[ x^m \le \Gamma(m+1)\,(e^x - 1). \]
Then, for any p ≥ 1, if x ≥ 1, we have
xp ≤ xp+1 ≤ Γ (p + 2) (ex − 1) = Γ (p + 1) (ex − 1),
while, if x ≤ 1, we have
xp ≤ x ≤ Γ (p + 1) (ex − 1).
2
We say that a Young modulus is of exponential type if the following two conditions are satisfied:
\[ \limsup_{\min\{x,y\}\to\infty} \frac{\psi^{-1}(xy)}{\psi^{-1}(x)\,\psi^{-1}(y)} < \infty \]
and
\[ \limsup_{x\to\infty} \frac{\psi^{-1}(x^2)}{\psi^{-1}(x)} < \infty. \]
(It is actually the second of these two conditions which forces the exponential type; the first condition is satisfied by Young functions of the form $\psi(x) = x^p$, $p \ge 1$.) Note that $\psi_p$ defined in (3.1) satisfies these conditions (since $\psi_p^{-1}(x) = (\log(x+1))^{1/p}$). In what follows, if a variable $X$ is not necessarily measurable, we write $\|X\|_\psi^*$ for $\big\| |X|^* \big\|_\psi$, where $|X|^*$ is the measurable envelope of $|X|$.
The following lemma gives a simple way of bounding $\|X\|_{\psi_p}$.
Lemma 3.1 Suppose that $X$ is a random variable with $P(|X| > x) \le K \exp(-Cx^p)$ for all $x > 0$ and some positive constants $K$ and $C$, with $p \ge 1$. Then the $\psi_p$ Orlicz norm satisfies $\|X\|_{\psi_p} \le ((1+K)/C)^{1/p}$.
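Proof. Without loss of generality, we can assume that $X \ge 0$. By definition of the $\psi_p$ Orlicz norm, we have
\[ \|X\|_{\psi_p} = \inf\big\{ \lambda > 0 : E\big( e^{X^p/\lambda^p} - 1 \big) \le 1 \big\}. \]
At the same time, we have
\[ \|X^p\|_{\psi_1} = \inf\big\{ \xi > 0 : E\big( e^{X^p/\xi} - 1 \big) \le 1 \big\}, \]
from which it follows that
\[ \|X^p\|_{\psi_1} = \|X\|_{\psi_p}^p. \]
Thus, it suffices to show that
\[ \|X^p\|_{\psi_1} \le \frac{1+K}{C}. \]
Now, let $Z = X^p$. It remains to prove that
\[ \|Z\|_{\psi_1} \le \frac{1+K}{C}, \]
and, for this purpose, it is useful to note that
\[ P(Z > z) = P(X > z^{1/p}) \le K \exp\{-Cz\}. \]
Moreover, by definition,
\[ \|Z\|_{\psi_1} = \inf\big\{ \alpha > 0 : E\big( e^{Z/\alpha} - 1 \big) \le 1 \big\}, \]
so it suffices to show that
\[ E\big( e^{Z/\alpha_0} - 1 \big) \le 1, \qquad \text{with } \alpha_0 = \frac{1+K}{C}. \]
We have
\begin{align*}
E\, e^{Z/\alpha_0} &= \int_0^1 P\big( e^{Z/\alpha_0} > y \big)\, dy + \int_1^\infty P\big( e^{Z/\alpha_0} > y \big)\, dy \\
&\le 1 + \int_1^\infty P\big( e^{Z/\alpha_0} > y \big)\, dy \\
&= 1 + \int_1^\infty P(Z > \alpha_0 \log y)\, dy \\
&\le 1 + \int_1^\infty K \exp\Big( -C \cdot \frac{1+K}{C} \log y \Big)\, dy \\
&= 1 + \int_1^\infty \frac{K}{y^{1+K}}\, dy = 1 + \big[ -y^{-K} \big]_1^\infty = 2.
\end{align*}
Then, it follows that
\[ E\big( e^{Z/\alpha_0} - 1 \big) = E\, e^{Z/\alpha_0} - 1 \le 2 - 1 = 1. \]
(For more details, see also van der Vaart and Wellner (1996), page 96.)
$\Box$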
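As a quick numerical illustration of Lemma 3.1 (a sketch added for the reader, not part of the original lectures), take a standard exponential variable: $P(X > x) = e^{-x}$, so the lemma applies with $K = C = p = 1$ and gives $\|X\|_{\psi_1} \le 2$. Since $E e^{X/c} = c/(c-1)$ for $c > 1$, the norm can also be located by bisection on the defining condition. The function names below are illustrative choices, not notation from the text.

```python
import math

def psi1_moment_exp(c):
    """E[exp(X/c)] - 1 for X ~ Exp(1); the expectation is finite only for c > 1."""
    if c <= 1.0:
        return math.inf
    return c / (c - 1.0) - 1.0

def orlicz_norm_psi1_exp(tol=1e-12):
    """Smallest c with E[exp(X/c)] - 1 <= 1, found by bisection."""
    lo, hi = 1.0, 10.0  # the condition fails at lo and holds at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi1_moment_exp(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

# Constants of the tail bound P(X > x) <= K exp(-C x^p) for X ~ Exp(1).
K, C, p = 1.0, 1.0, 1.0
lemma_bound = ((1.0 + K) / C) ** (1.0 / p)
norm = orlicz_norm_psi1_exp()
print(norm, lemma_bound)
```

In this particular case the bound of the lemma is attained, which suggests that the constant $(1+K)/C$ cannot be improved in general.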
Once we have knowledge of (or bounds for) the individual Orlicz norms of some
family of random variables {Xk }, then we can also control the Orlicz norm of a particular
weighted supremum of the family. This is the content of the following proposition.
Proposition 3.1 (de la Peña and Giné) Let $\psi$ be a Young modulus of exponential type. Then there exists a finite constant $C_\psi$ such that for every sequence of random variables $\{X_k\}$
\[ \Big\| \sup_k \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \le C_\psi \sup_k \|X_k\|_\psi. \tag{3.2} \]
Proof. We can delete a finite number of terms from the supremum on the left side as long as the number of terms deleted depends only on $\psi$. Furthermore, by homogeneity it suffices to prove that the inequality holds in the case that $\sup_k \|X_k\|_\psi = 1$.
Let $M \ge 1/2$ and let $a > 0$, $b > 0$ be constants such that
\[ \psi^{-1}(xy) \le a\, \psi^{-1}(x)\, \psi^{-1}(y) \tag{a} \]
and
\[ \psi^{-1}(x^2) \le b\, \psi^{-1}(x) \]
for all $x, y \ge M$. Define
\begin{align*}
k_0 &= \max\Big\{ 5,\ \psi\Big( \frac{1}{b}\, \psi^{-1}(M) \Big),\ M \Big\}, \\
c &= \max\Big\{ \frac{\psi^{-1}(M^2)}{\psi^{-1}(1/2)},\ \frac{\psi^{-1}(M)}{\psi^{-1}(1/2)},\ b \Big\}, \\
\gamma &= abc.
\end{align*}
For this choice of $c$ we have, by the properties of $\psi$, that $\psi(c\,\psi^{-1}(t)) \ge t^2$ for $t \ge 1/2$; this is easy for $t \ge M$ since $c \ge b$ and hence
\[ t^2 \le \psi(b\,\psi^{-1}(t)) \le \psi(c\,\psi^{-1}(t)), \]
while, for $1/2 \le t < M$,
\[ \psi(c\,\psi^{-1}(t)) \ge \psi(c\,\psi^{-1}(1/2)) \ge M^2 > t^2. \]
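Thus for $t \ge 1/2$ we have
\begin{align*}
\Pr\Big\{ \psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big) \ge t \Big\} &= \Pr\Big\{ \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)\,\psi^{-1}(t)} \ge 1 \Big\} \\
&\le \sum_{k=k_0}^\infty \Pr\big\{ |X_k| \ge \gamma\,\psi^{-1}(k)\,\psi^{-1}(t) \big\} \\
&= \sum_{k=k_0}^\infty \Pr\big\{ \psi(|X_k|) \ge \psi\big( \gamma\,\psi^{-1}(k)\,\psi^{-1}(t) \big) \big\} \\
&\le \sum_{k=k_0}^\infty \frac{1}{\psi\big( \gamma\,\psi^{-1}(k)\,\psi^{-1}(t) \big)} \\
&\le \sum_{k=k_0}^\infty \frac{1}{\psi\big( b\,\psi^{-1}(k) \big)\, \psi\big( c\,\psi^{-1}(t) \big)} \\
&\le \sum_{k=k_0}^\infty \frac{1}{k^2 t^2} \le \frac{1}{4t^2},
\end{align*}
using $k_0 \ge 5$ at the last step and taking $x = \psi(b\,\psi^{-1}(k))$, $y = \psi(c\,\psi^{-1}(t))$ in (a) to get the next to last inequality. Hence it follows that
\begin{align*}
E\,\psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big) &\le \frac{1}{2} + \int_{1/2}^\infty \Pr\Big\{ \psi\Big( \sup_{k\ge k_0} \frac{|X_k|}{\gamma\,\psi^{-1}(k)} \Big) \ge t \Big\}\, dt \\
&\le \frac{1}{2} + \frac{1}{4} \int_{1/2}^\infty t^{-2}\, dt = \frac{1}{2} + \frac{1}{2} = 1.
\end{align*}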
Thus we have proved that
\[ \Big\| \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \le \gamma = \gamma_\psi. \]
To complete the proof, note that
\begin{align*}
\Big\| \sup_{k\ge 1} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi &= \Big\| \sup_{k<k_0} \frac{|X_k|}{\psi^{-1}(k)} \vee \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \\
&\le \Big\| \sup_{k<k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi + \Big\| \sup_{k\ge k_0} \frac{|X_k|}{\psi^{-1}(k)} \Big\|_\psi \\
&\le \sum_{k<k_0} \frac{1}{\psi^{-1}(k)} + \gamma_\psi \equiv C_\psi.
\end{align*}
$\Box$
The following corollary of the proposition is a result similar to van der Vaart and
Wellner (1996), Lemma 2.2.2, page 96.
Corollary 3.1 If $\psi$ is a Young function of exponential type and $\{X_k\}_{k=1}^m$ is any finite collection of random variables, then
\[ \Big\| \sup_{1\le k\le m} |X_k| \Big\|_\psi \le C_\psi\, \psi^{-1}(m) \sup_{1\le k\le m} \|X_k\|_\psi \tag{3.3} \]
where $C_\psi$ is a finite constant depending only on $\psi$.
To apply these basic inequalities to processes $\{X(t) : t \in T\}$, we need to introduce several notions concerning the size of the index set $T$. For any $\varepsilon > 0$, the covering number $N(\varepsilon, T, d)$ of the metric or pseudo-metric space $(T, d)$ is the smallest number of open balls of radius at most $\varepsilon$ and centers in $T$ needed to cover $T$; that is
\[ N(\varepsilon, T, d) = \min\Big\{ k : \text{there exist } t_1, \ldots, t_k \in T \text{ such that } T \subset \bigcup_{i=1}^k B(t_i, \varepsilon) \Big\}. \]
The packing number $D(\varepsilon, T, d)$ is the largest $k$ for which there exist $k$ points $t_1, \ldots, t_k$ in $T$ at least $\varepsilon$ apart for the metric $d$; i.e. $d(t_i, t_j) \ge \varepsilon$ if $i \ne j$. The metric entropy or $\varepsilon$-entropy of $(T, d)$ is $\log N(\varepsilon, T, d)$, and the $\varepsilon$-capacity is $\log D(\varepsilon, T, d)$. Covering numbers and packing numbers are equivalent in the following sense:
\[ D(2\varepsilon, T, d) \le N(\varepsilon, T, d) \le D(\varepsilon, T, d) \tag{3.4} \]
as can be easily checked.
Proof. Let $k = N(\varepsilon, T, d)$.
To prove the second inequality, we can fix a point $t_1 \in T$ and consider the ball of radius $\varepsilon$ around it. If $k > 1$, this ball does not cover $T$, so there exists a point $t_2 \in T$ which is not in this ball (i.e. $d(t_1, t_2) \ge \varepsilon$). Consider now the balls of radius $\varepsilon$ centered at $t_1$ and $t_2$. If $k > 2$, there exists $t_3 \in T$ such that $t_3 \notin B_\varepsilon(t_1)$ and $t_3 \notin B_\varepsilon(t_2)$. Proceeding in this way, we get $t_1, t_2, \ldots, t_{k-1}$ which are all $\varepsilon$-separated (by construction) and, since fewer than $k$ balls cannot cover $T$, finally get $t_k$ that is $\varepsilon$-separated from the rest. So, we have found $k$ points $t_1, t_2, \ldots, t_k$ in $T$ such that
\[ d(t_i, t_j) \ge \varepsilon \quad \text{for } i \ne j, \]
from which it follows that $N(\varepsilon, T, d) \le D(\varepsilon, T, d)$.
To prove the first inequality, we can take any $k + 1$ points in $T$, say $s_1, s_2, \ldots, s_{k+1}$. By definition of $N(\varepsilon, T, d)$, we can find $k$ points in $T$, say $t_1, t_2, \ldots, t_k$, such that
\[ T \subseteq \bigcup_{i=1}^k B_\varepsilon(t_i). \]
Then, by the pigeonhole principle, there exist two points $s_i, s_j$ both lying in the same ball $B_\varepsilon(t_l)$, that is
\[ d(s_i, s_j) < 2\varepsilon. \]
This shows that we cannot find $k + 1$ points in $T$ which are $2\varepsilon$-separated, and this proves the first inequality.
$\Box$
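The equivalence (3.4) can be checked by hand on small examples. The following sketch (an illustration added here, with hypothetical function names) computes covering and packing numbers for a finite set of points on the real line with $d(s,t) = |s - t|$. It relies on the assumption that, on the line, the left-to-right greedy rules used below give the exact minimum cover and maximum packing; in a general metric space this would not be the case.

```python
def packing_number(points, eps):
    """Largest number of points that are pairwise at least eps apart (greedy, left to right)."""
    count, last = 0, None
    for p in sorted(points):
        if last is None or p - last >= eps:
            count += 1
            last = p
    return count

def covering_number(points, eps):
    """Smallest number of open balls of radius eps, centered at points of the set, covering it."""
    pts = sorted(points)
    n, i, count = len(pts), 0, 0
    while i < n:
        left = pts[i]
        # Choose as center the rightmost point still within eps of the leftmost uncovered point.
        j = i
        while j + 1 < n and pts[j + 1] - left < eps:
            j += 1
        center = pts[j]
        count += 1
        # Skip every point covered by the open ball around the chosen center.
        i = j + 1
        while i < n and pts[i] - center < eps:
            i += 1
    return count

T = list(range(21))          # the points 0, 1, ..., 20
eps = 3
D_2eps = packing_number(T, 2 * eps)
N_eps = covering_number(T, eps)
D_eps = packing_number(T, eps)
print(D_2eps, N_eps, D_eps)  # the three quantities in (3.4)
```

Integer points and an integer radius are used so that the comparisons are not affected by floating point rounding.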
As is well-known, if $T \subset \mathbb{R}^m$ is totally bounded and $d$ is equivalent to the Euclidean metric, then
\[ N(\varepsilon, T, d) \le \frac{K}{\varepsilon^m} \]
for some constant $K$. For example, if $T$ is the ball $B(0, R)$ in $\mathbb{R}^m$ with radius $R$, then the bound in the last display holds with $K = (6R)^m$.
As we will see in the next chapters, there are a variety of interesting cases in which the set $T$ is a space of functions and a bound of the same form as the Euclidean case holds (and hence such classes are called “Euclidean classes” by some authors). On the other hand, for many spaces of functions $T$, the covering numbers grow exponentially fast as $\varepsilon \downarrow 0$; for these classes we will typically have a bound of the form
\[ \log N(\varepsilon, T, d) \le \frac{K}{\varepsilon^r} \]
for some finite constant $K$ and $r > 0$; in these cases the value of $r$ will turn out to be crucial, as we will show later.
The following theorem is our first result involving a chaining argument. Its proof is
simpler than the corresponding result in van der Vaart and Wellner (1996) (Theorem
2.2.4, page 98), but it holds only for Young functions of exponential type.
Theorem 3.1 (de la Peña and Giné) Let $(T, d)$ be a pseudo-metric space, let $\{X(t) : t \in T\}$ be a stochastic process indexed by $T$, and let $\psi$ be a Young modulus of exponential type such that
\[ \|X(t) - X(s)\|_\psi \le d(s,t), \qquad s, t \in T. \tag{3.5} \]
Then there exists a constant $K$ dependent only on $\psi$ such that, for all finite subsets $S \subset T$, $t_0 \in T$, and $\delta > 0$, the following inequalities hold:
\[ \Big\| \max_{t\in S} |X(t)| \Big\|_\psi \le \|X(t_0)\|_\psi + K \int_0^D \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon, \tag{3.6} \]
where $D$ is the diameter of $(T, d)$, and
\[ \Big\| \max_{s,t\in S,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi \le K \int_0^\delta \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon. \tag{3.7} \]
Proof. If $(T, d)$ is not totally bounded, then the right hand sides of (3.6) and (3.7) are infinite. Hence we can assume that $(T, d)$ is totally bounded and has diameter less than 1. For a finite set $S \subset T$ and $t_0 \in T$, the set $S \cup \{t_0\}$ is also finite, so we may assume that $t_0 \in S$. We can also assume that $X(t_0) = 0$ (if not, consider the process $Y(t) = X(t) - X(t_0)$). For each non-negative integer $k$ let $\{s_1^k, \ldots, s_{N_k}^k\} \equiv S_k \subset S$ be the centers of $N_k \equiv N(2^{-k}, S, d)$ open balls of radius at most $2^{-k}$ and centers in $S$ that cover $S$. Note that $S_0$ consists of just one point, which we may take to be $t_0$. For each $k$, let $\pi_k : S \to S_k$ be a function satisfying $d(s, \pi_k(s)) < 2^{-k}$ for all $s \in S$; such a function clearly exists by definition of the set $S_k$. Furthermore, since $S$ is finite, there is an integer $k_S$ such that for $k \ge k_S$ and $s \in S$ we have $d(\pi_k(s), s) = 0$. Then by (3.5) it follows that $X(s) = X(\pi_k(s))$ a.s. Therefore, for $s \in S$,
\[ X(s) = \sum_{k=1}^{k_S} \big( X(\pi_k(s)) - X(\pi_{k-1}(s)) \big) \]
almost surely.
Now by the triangle inequality for the metric $d$ we have
\[ d(\pi_k(s), \pi_{k-1}(s)) \le d(\pi_k(s), s) + d(s, \pi_{k-1}(s)) < 2^{-k} + 2^{-(k-1)} = 3 \cdot 2^{-k}. \]
It therefore follows from Proposition 3.1 (in the form of Corollary 3.1) that
\begin{align*}
\Big\| \max_{s\in S} |X(s)| \Big\|_\psi &\le \sum_{k=1}^{k_S} \Big\| \max_{t\in S_k,\ s\in S_{k-1},\ d(s,t) < 3\cdot 2^{-k}} |X(t) - X(s)| \Big\|_\psi \\
&\le 3 C_\psi \sum_{k=1}^{k_S} 2^{-k}\, \psi^{-1}(N_k N_{k-1}) \\
&\le K \sum_{k=1}^{k_S} 2^{-k}\, \psi^{-1}(N_k),
\end{align*}
where we used the second condition defining a Young modulus of exponential type in the last step. This implies (3.6) since $N(2\varepsilon, S, d) \le N(\varepsilon, T, d)$ for every $\varepsilon > 0$ (to see this, note that, if an $\varepsilon$-ball with center in $T$ intersects $S$, it is contained in a $2\varepsilon$-ball with center in $S$), and then by bounding the sum in the last display by the integral in (3.6).
To prove (3.7), for $\delta > 0$ set $V = \{(s,t) : s, t \in T,\ d(s,t) \le \delta\}$, and for $v \in V$ define the process
\[ Y(v) = X(t_v) - X(s_v) \quad \text{where } v = (s_v, t_v). \]
For $u, v \in V$ define the pseudo-metric $\rho(u,v) = \|Y(u) - Y(v)\|_\psi$. We can assume that $\delta \le \mathrm{diam}(T)$; also note that
\[ \mathrm{diam}_\rho(V) = \sup_{u,v\in V} \rho(u,v) \le 2 \sup_{v\in V} \|Y(v)\|_\psi \le 2\delta, \]
and furthermore
\[ \rho(u,v) \le \|X(t_v) - X(t_u)\|_\psi + \|X(s_v) - X(s_u)\|_\psi \le d(t_v, t_u) + d(s_v, s_u). \]
It follows that, if $t_1, \ldots, t_N$ are the centers of a covering of $T$ by $N = N(\varepsilon, T, d)$ open balls of radius at most $\varepsilon$, then the set of open balls with centers in $\{(t_i, t_j) : 1 \le i, j \le N\}$ and $\rho$-radius $2\varepsilon$ covers $V$. Not all the $(t_i, t_j)$ need to be in $V$, but if the $2\varepsilon$ ball about $(t_i, t_j)$ has a non-empty intersection with $V$, then it is contained in a ball of radius $4\varepsilon$ centered at a point in $V$. Thus we have
\[ N(4\varepsilon, V, \rho) \le N^2(\varepsilon, T, d). \]
The process $\{Y(v) : v \in V\}$ satisfies (3.5) for the metric $\rho$. Thus we can apply (3.6) to the process $Y$ with the choice $v_0 = (s, s)$ for any $s \in S$, for which $Y(v_0) = 0$. We therefore find that
\begin{align*}
\Big\| \max_{s,t\in S,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi &\le K \int_0^{2\delta} \psi^{-1}(N(r, V, \rho))\, dr \\
&\le K \int_0^{2\delta} \psi^{-1}(N^2(r/4, T, d))\, dr \\
&\le K \int_0^{\delta/2} \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon,
\end{align*}
where the constant $K$ may change from line to line and we used the second property of a Young modulus of exponential type in the last step.
$\Box$
A process $\{X(t) : t \in T\}$, where $(T, d)$ is a metric space (or a pseudo-metric space), is separable if there exists a countable set $T_0 \subset T$ and a subset $\Omega_0 \subset \Omega$ with $P(\Omega_0) = 1$ such that for all $\omega \in \Omega_0$, $t \in T$, and $\varepsilon > 0$, $X(t, \omega)$ is in the closure of $\{X(s, \omega) : s \in T_0 \cap B(t, \varepsilon)\}$. If $X$ is separable, then it is easily seen that
\[ \Big\| \sup_{t\in T} |X(t)| \Big\|_\psi = \sup_{S\subset T,\ S \text{ finite}} \Big\| \max_{t\in S} |X(t)| \Big\|_\psi \]
and similarly for
\[ \Big\| \sup_{d(s,t)\le\delta,\ s,t\in T} |X(s) - X(t)| \Big\|_\psi. \]
As is well known, if $(T, d)$ is a separable metric or pseudo-metric space and $X$ is uniformly continuous in probability for $d$, then $X$ has a separable version. Since $N(\varepsilon, T, d) < \infty$ for all $\varepsilon > 0$ implies that $(T, d)$ is totally bounded and the condition (3.5) implies that $X$ is uniformly continuous in probability, the following corollary is an easy consequence of the preceding theorem.
Corollary 3.2 Suppose that $(T, d)$ is a pseudo-metric space of diameter $D$, and let $\psi$ be a Young modulus of exponential type such that
\[ \int_0^D \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon < \infty. \tag{3.8} \]
If $\{X(t) : t \in T\}$ is a stochastic process satisfying (3.5), then, for a version of $X$ with all sample paths in $C_u(T, d)$, which we continue to denote by $X$,
\[ \Big\| \sup_{t\in T} |X(t)| \Big\|_\psi \le \|X(t_0)\|_\psi + K \int_0^D \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon \tag{3.9} \]
and
\[ \Big\| \sup_{s,t\in T,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_\psi \le K \int_0^\delta \psi^{-1}(N(\varepsilon, T, d))\, d\varepsilon. \tag{3.10} \]
Corollary 3.3 (Giné, Mason and Zaitsev (2003)) Let $\psi$ be a Young modulus of exponential type, let $(T, d)$ be a totally bounded pseudometric space, and let $\{X_t : t \in T\}$ be a stochastic process indexed by $T$, with the property that there exist $C < \infty$ and $0 < \gamma < \mathrm{diam}(T)$ such that
\[ \|X_s - X_t\|_\psi \le C\, d(s,t) \tag{3.11} \]
whenever $\gamma \le d(s,t) < \mathrm{diam}(T)$. Then there exists a constant $L$ depending only on $\psi$ such that, for any $\gamma < \delta \le \mathrm{diam}(T)$,
\[ \Big\| \sup_{d(s,t)\le\delta} |X_s - X_t| \Big\|_\psi^* \le 2 \Big\| \sup_{d(s,t)\le\gamma} |X_s - X_t| \Big\|_\psi^* + C L \int_{\gamma/2}^\delta \psi^{-1}(D(\varepsilon, T, d))\, d\varepsilon. \tag{3.12} \]
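Proof. Let $T_\gamma$ be a maximal subset of $T$ satisfying $d(s,t) \ge \gamma$ for $s \ne t \in T_\gamma$. Then $\mathrm{card}(T_\gamma) = D(\gamma, T, d)$. If $s, t \in T$ and $d(s,t) \le \delta$, let $s_\gamma$ and $t_\gamma$ be points in $T_\gamma$ such that $d(s, s_\gamma) < \gamma$ and $d(t, t_\gamma) < \gamma$, which exist by the maximality property of $T_\gamma$. Then $d(s_\gamma, t_\gamma) < \delta + 2\gamma < 3\delta$. Since
\[ |X_s - X_t| \le |X_s - X_{s_\gamma}| + |X_t - X_{t_\gamma}| + |X_{s_\gamma} - X_{t_\gamma}|, \]
we obtain
\[ \Big\| \sup_{d(s,t)\le\delta} |X_s - X_t| \Big\|_\psi^* \le 2 \Big\| \sup_{d(s,t)<\gamma} |X_s - X_t| \Big\|_\psi^* + \Big\| \max_{d(s,t)<3\delta;\ s,t\in T_\gamma} |X_s - X_t| \Big\|_\psi. \tag{a} \]
Now, the process $X_s$ restricted to the finite set $T_\gamma$ satisfies inequality (3.11) for all $s, t \in T_\gamma$, and therefore we can apply Theorem 3.1 to the restriction to $T_\gamma$ of $X_s / C$ to conclude that
\begin{align*}
\Big\| \max_{d(s,t)<3\delta;\ s,t\in T_\gamma} |X_s - X_t| / C \Big\|_\psi &\le L \int_0^{3\delta} \psi^{-1}(D(\varepsilon, T_\gamma, d))\, d\varepsilon \\
&\le 3L \int_0^\delta \psi^{-1}(D(\varepsilon, T_\gamma, d))\, d\varepsilon, \tag{b}
\end{align*}
where $L$ is a constant that depends only on $\psi$. Now we note that $D(\varepsilon, T_\gamma, d) \le D(\varepsilon, T, d)$ for all $\varepsilon > 0$ and that, moreover, $D(\varepsilon, T_\gamma, d) = \mathrm{card}(T_\gamma) = D(\gamma, T, d)$ for all $\varepsilon \le \gamma$. Hence,
\begin{align*}
\int_0^\delta \psi^{-1}(D(\varepsilon, T_\gamma, d))\, d\varepsilon &\le \gamma\, \psi^{-1}(D(\gamma, T, d)) + \int_\gamma^\delta \psi^{-1}(D(\varepsilon, T, d))\, d\varepsilon \\
&\le 3 \int_{\gamma/2}^\delta \psi^{-1}(D(\varepsilon, T, d))\, d\varepsilon,
\end{align*}
and this, in combination with the previous inequalities (a) and (b), gives the corollary.
$\Box$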
Corollary 3.3 gives an example of “restricted” or “stopped” chaining. Giné and Zinn (1984) use restricted chaining with $\gamma = n^{-1/4}$ at stage $n$, but other choices are of interest in the applications of Giné, Mason, and Zaitsev (2003): they take $\gamma = \rho\, n^{-1/2}$, $\rho$ arbitrary.
3.2 Gaussian and sub-Gaussian processes via Hoeffding’s Inequality
Recall that a process $X(t)$, $t \in T$, is called a Gaussian process if all the finite-dimensional distributions are multivariate normal. As indicated previously, the natural pseudometric $\rho_X$ defined by
\[ \rho_X^2(s,t) = E[(X(s) - X(t))^2], \qquad s, t \in T \]
is very convenient and useful in this setting. Here is a further corollary of Corollary 3.2 due to Dudley (1967).
Corollary 3.4 Suppose that $X(t)$, $t \in T$, is a Gaussian process with
\[ \int_0^D \sqrt{\log N(\varepsilon, T, \rho_X)}\, d\varepsilon < \infty. \]
Then there exists a version of $X$ (which we continue to denote by $X$) with almost all of its sample paths in $C_u(T, \rho_X)$ which satisfies
\[ \Big\| \sup_{t\in T} |X(t)| \Big\|_{\psi_2} \le \|X(t_0)\|_{\psi_2} + K \int_0^D \sqrt{\log N(\varepsilon, T, \rho_X)}\, d\varepsilon \tag{3.13} \]
for any fixed $t_0 \in T$, and
\[ \Big\| \sup_{s,t\in T,\ \rho_X(s,t)\le\delta} |X(t) - X(s)| \Big\|_{\psi_2} \le K \int_0^\delta \sqrt{\log N(\varepsilon, T, \rho_X)}\, d\varepsilon \tag{3.14} \]
for all $0 < \delta \le D = \mathrm{diam}(T)$.
Proof. By direct computation, if $Z \sim N(0,1)$ then
\[ E \exp(Z^2/c^2) = \frac{1}{\sqrt{1 - 2/c^2}} < \infty \]
for $c^2 > 2$. Choosing $c^2 = 8/3$ yields $E \exp(Z^2/c^2) = 2$. Hence $\|Z\|_{\psi_2} = \sqrt{8/3}$. By homogeneity this yields $\|\sigma Z\|_{\psi_2} = \sigma \sqrt{8/3}$. Thus it follows that
\[ \|X(t) - X(s)\|_{\psi_2} = \sqrt{\tfrac{8}{3}}\, \big( E[(X(t) - X(s))^2] \big)^{1/2} = \sqrt{\tfrac{8}{3}}\, \rho_X(s,t), \]
so we can choose $\psi = \psi_2$ and $\rho = \sqrt{8/3}\, \rho_X$ in Corollary 3.2. The inequalities (3.9) and (3.10) yield (3.13) and (3.14) for different constants $K$ after noting two easy facts.
First,
\[ \psi_2^{-1}(x) = \sqrt{\log(1 + x)} \le C \sqrt{\log x}, \qquad x \ge 2, \]
for an absolute constant $C = \sqrt{\log(3)/\log(2)} < 1.26$; and $N(\cdot, T, \rho)$ is monotone decreasing with $N(D/2, T, \rho) \ge 2$, $N(D, T, \rho) = 1$. It follows that for $0 < \delta \le D/2$ we have
\[ \int_0^\delta \sqrt{\log(1 + N(\varepsilon, T, \rho))}\, d\varepsilon \le C \int_0^\delta \sqrt{\log N(\varepsilon, T, \rho)}\, d\varepsilon, \]
and, for $D/2 < \delta \le D$,
\begin{align*}
\int_0^\delta \sqrt{\log(1 + N(\varepsilon, T, \rho))}\, d\varepsilon &\le 2 \int_0^{D/2} \sqrt{\log(1 + N(\varepsilon, T, \rho))}\, d\varepsilon \\
&\le 2C \int_0^{D/2} \sqrt{\log N(\varepsilon, T, \rho)}\, d\varepsilon \\
&\le 2C \int_0^\delta \sqrt{\log N(\varepsilon, T, \rho)}\, d\varepsilon.
\end{align*}
Second, for any positive constant $b > 0$,
\[ \int_0^\delta \sqrt{\log N(\varepsilon, T, b\rho)}\, d\varepsilon = b \int_0^{\delta/b} \sqrt{\log N(\varepsilon, T, \rho)}\, d\varepsilon \]
by an easy change of variables. Combining these facts with (3.9) and (3.10) yields the claimed inequalities.
$\Box$
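The two numerical constants in the proof above can be confirmed with a few lines of code. The sketch below (added as an illustration; the function name and the integration grid are arbitrary choices) evaluates $E\exp(Z^2/c^2)$ at $c^2 = 8/3$ both from the closed form and by a trapezoid rule, and checks the inequality $\sqrt{\log(1+x)} \le C\sqrt{\log x}$ at a few points $x \ge 2$.

```python
import math

def gaussian_psi2_moment(c2, half_width=40.0, step=0.001):
    """Trapezoid approximation of E[exp(Z^2 / c2)] for Z ~ N(0,1), valid for c2 > 2."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        z = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        # Integrand: exp(z^2 / c2) times the standard normal density (up to the constant).
        total += weight * math.exp(z * z / c2 - 0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)

c2 = 8.0 / 3.0
closed_form = 1.0 / math.sqrt(1.0 - 2.0 / c2)   # formula from the proof
numeric = gaussian_psi2_moment(c2)

C = math.sqrt(math.log(3.0) / math.log(2.0))     # constant in the first easy fact
gaps = [C * math.sqrt(math.log(x)) - math.sqrt(math.log(1.0 + x))
        for x in (2.0, 3.0, 10.0, 100.0, 1e6)]
print(closed_form, numeric, C, gaps)
```

Equality in the first fact holds at $x = 2$, which is why the first entry of `gaps` is zero up to rounding.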
The previous proof applies, virtually without change, to sub-Gaussian processes.
First recall that a process $X(t)$, $t \in T$, is sub-Gaussian with respect to the pseudometric $d$ on $T$ if
\[ \Pr(|X(s) - X(t)| > x) \le 2 \exp\Big( -\frac{x^2}{2 d^2(s,t)} \Big), \qquad s, t \in T,\ x > 0. \]
Moreover the process $X$ is sub-Gaussian in this sense with $d$ taken to be a constant multiple of $\rho_X$ if and only if
\[ \|X(s) - X(t)\|_{\psi_2} \le C \big( E[(X(t) - X(s))^2] \big)^{1/2} = C \rho_X(s,t) \tag{3.15} \]
for some $C < \infty$ and all $s, t \in T$.
Example 3.1 Suppose that $\varepsilon_1, \ldots, \varepsilon_n$ are independent Rademacher random variables (that is, $\Pr(\varepsilon_j = \pm 1) = 1/2$ for $j = 1, \ldots, n$), and let
\[ X(t) = \sum_{i=1}^n t_i \varepsilon_i, \qquad t = (t_1, \ldots, t_n) \in \mathbb{R}^n. \]
Then it follows from Hoeffding’s inequality that
\[ \Pr(|X(s) - X(t)| > x) \le 2 \exp\Big( -\frac{x^2}{2\|s - t\|^2} \Big), \]
where $\|\cdot\|$ denotes the Euclidean norm. Hence for any subset $T \subset \mathbb{R}^n$ the process $\{X(t) : t \in T\}$ is sub-Gaussian with respect to the Euclidean norm and we have
\[ \|X(t) - X(s)\|_{\psi_2} \le \sqrt{6}\, \|s - t\| \]
by Lemma 3.1 (applied with $K = 2$, $C = 1/(2\|s-t\|^2)$ and $p = 2$). If $T$ also satisfies
\[ \int_0^D \sqrt{\log(1 + N(\varepsilon, T, \|\cdot\|))}\, d\varepsilon < \infty, \tag{3.16} \]
then $\{X(t) : t \in T\}$ has bounded continuous sample paths on $T$. This example will play a key role in the development for empirical processes in the next chapters, where we will proceed by first symmetrizing the empirical process with Rademacher random variables and by conditioning on the $X_i$’s generating the empirical process.
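For small $n$ the tail probability in Example 3.1 can be computed exactly by enumerating all $2^n$ sign patterns, which gives a direct comparison with the Hoeffding bound. The following sketch is an added illustration; the vector `v` stands for the difference $s - t$ and is an arbitrary choice, as are the function names.

```python
import itertools
import math

def rademacher_tail(v, x):
    """Exact P(|sum_i v_i * eps_i| > x), enumerating every sign pattern."""
    n = len(v)
    hits = sum(
        1
        for signs in itertools.product((-1, 1), repeat=n)
        if abs(sum(vi * si for vi, si in zip(v, signs))) > x
    )
    return hits / 2 ** n

def hoeffding_bound(v, x):
    """Right-hand side 2 exp(-x^2 / (2 ||v||^2)) of the displayed inequality."""
    norm_sq = sum(vi * vi for vi in v)
    return 2.0 * math.exp(-x * x / (2.0 * norm_sq))

v = (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)   # plays the role of s - t
levels = (0.5, 5.5, 10.5, 15.5, 20.5)
table = [(x, rademacher_tail(v, x), hoeffding_bound(v, x)) for x in levels]
for x, exact, bound in table:
    print(x, exact, bound)
```

The exact tail is well below the bound at every level, as expected: Hoeffding's inequality is a worst-case statement over all coefficient vectors of a given Euclidean norm.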
Here is a statement of the resulting bounds for sub-Gaussian processes.
Corollary 3.5 Suppose that $X(t)$, $t \in T$, is a sub-Gaussian process with respect to the pseudo-metric $d$ on $T$ satisfying
\[ \int_0^D \sqrt{\log N(\varepsilon, T, d)}\, d\varepsilon < \infty. \]
Then there exists a version of $X$ (which we continue to denote by $X$) with almost all of its sample paths in $C_u(T, d)$ which satisfies
\[ \Big\| \sup_{t\in T} |X(t)| \Big\|_{\psi_2} \le \|X(t_0)\|_{\psi_2} + K \int_0^D \sqrt{\log N(\varepsilon, T, d)}\, d\varepsilon \tag{3.17} \]
for any fixed $t_0 \in T$, and
\[ \Big\| \sup_{s,t\in T,\ d(s,t)\le\delta} |X(t) - X(s)| \Big\|_{\psi_2} \le K \int_0^\delta \sqrt{\log N(\varepsilon, T, d)}\, d\varepsilon \tag{3.18} \]
for all $0 < \delta \le D = \mathrm{diam}(T)$.
3.3 Bernstein’s inequality and $\psi_1$-Orlicz norms for maxima
Suppose that $Y_1, \ldots, Y_n$ are independent random variables with $EY_i = 0$ and $P(|Y_i| \le M) = 1$ for $i = 1, \ldots, n$. Bernstein’s inequality gives a bound on the tail of the absolute value of the sum $\sum_{i=1}^n Y_i$. We will derive it from Bennett’s inequality.
Lemma 3.2 (Bennett’s inequality) Suppose that $Y_1, \ldots, Y_n$ are independent random variables with $Y_i \le M$ almost surely for all $i = 1, \ldots, n$ and zero means. Then
\[ P\Big( \sum_{i=1}^n Y_i > x \Big) \le \exp\Big( -\frac{x^2}{2V}\, \psi\Big( \frac{Mx}{V} \Big) \Big), \tag{3.19} \]
where $V \ge \mathrm{Var}\big( \sum_{i=1}^n Y_i \big) = \sum_{i=1}^n \mathrm{Var}(Y_i)$ and $\psi$ is the function given by
\[ \psi(x) = 2h(1 + x)/x^2 \]
with
\[ h(x) = x(\log x - 1) + 1, \qquad x > 0. \]
Lemma 3.3 (Bernstein’s inequality) If $Y_1, \ldots, Y_n$ are independent random variables with $|Y_i| \le M$ almost surely for all $i = 1, \ldots, n$ and zero means, then
\[ P\Big( \Big| \sum_{i=1}^n Y_i \Big| > x \Big) \le 2 \exp\Big( -\frac{x^2}{2(V + Mx/3)} \Big) \tag{3.20} \]
where $V \ge \mathrm{Var}\big( \sum_{i=1}^n Y_i \big) = \sum_{i=1}^n \mathrm{Var}(Y_i)$.
Proof. Set $\sigma_i^2 = \mathrm{Var}(Y_i)$, $i = 1, \ldots, n$. For each $r > 0$
\[ P\Big( \sum_{i=1}^n Y_i > x \Big) \le e^{-rx} \prod_{i=1}^n E e^{rY_i} = e^{-rx} \prod_{i=1}^n E\big[ 1 + rY_i + (1/2) r^2 Y_i^2\, g(rY_i) \big] \tag{a} \]
where $g(x) = 2(e^x - 1 - x)/x^2$ is non-negative, increasing and convex for $x \in \mathbb{R}$. Thus
\[ E\big[ 1 + rY_i + (1/2) r^2 Y_i^2\, g(rY_i) \big] = 1 + (1/2) r^2 E\big[ Y_i^2\, g(rY_i) \big] \le 1 + (1/2) r^2 \sigma_i^2\, g(rM) \]
for $i = 1, \ldots, n$. Substituting this bound into (a) and then using $1 + u \le e^u$ shows that the right side of (a) is bounded by
\begin{align*}
e^{-rx} \prod_{i=1}^n \exp\big( r^2 \sigma_i^2\, g(rM)/2 \big) &= \exp\Big( -rx + \frac{r^2 g(rM)}{2} \sum_{i=1}^n \sigma_i^2 \Big) \\
&= \exp\Big( -rx + \frac{e^{rM} - 1 - rM}{M^2} \sum_{i=1}^n \sigma_i^2 \Big) \\
&\le \exp\Big( -rx + \frac{e^{rM} - 1 - rM}{M^2}\, V \Big).
\end{align*}
Minimizing this upper bound with respect to $r$ shows that it is minimized by the choice $r = M^{-1} \log(1 + Mx/V)$. Plugging this in and using the definition of $\psi$ yields the claimed inequality. Lemma 3.3 follows by noting that $\psi(x) \ge (1 + x/3)^{-1}$ and applying the resulting one-sided bound to both $Y_i$ and $-Y_i$.
$\Box$
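The last step of the proof rests on the elementary inequality $\psi(x) \ge (1 + x/3)^{-1}$, which says that the one-sided Bennett bound is never worse than the one-sided Bernstein bound. The sketch below (an added illustration with arbitrary values of $V$ and $M$; the function names are not from the text) evaluates both bounds on a small grid.

```python
import math

def h(x):
    """h(x) = x (log x - 1) + 1, as in Lemma 3.2."""
    return x * (math.log(x) - 1.0) + 1.0

def psi(x):
    """psi(x) = 2 h(1 + x) / x^2, the function appearing in Bennett's inequality."""
    return 2.0 * h(1.0 + x) / (x * x)

def bennett_bound(x, V, M):
    """One-sided Bennett bound exp(-(x^2 / 2V) psi(Mx / V))."""
    return math.exp(-(x * x) / (2.0 * V) * psi(M * x / V))

def bernstein_bound(x, V, M):
    """One-sided Bernstein bound exp(-x^2 / (2 (V + Mx/3)))."""
    return math.exp(-(x * x) / (2.0 * (V + M * x / 3.0)))

grid = (0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0)
psi_gap = [psi(x) - 1.0 / (1.0 + x / 3.0) for x in grid]

V, M = 1.0, 1.0   # illustrative values only
comparison = [(x, bennett_bound(x, V, M), bernstein_bound(x, V, M)) for x in grid]
for row in comparison:
    print(row)
```

For small $x$ the two bounds are nearly identical, while for large $x$ the extra logarithmic factor in Bennett's exponent makes it noticeably smaller.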
Note that for large $x$ the upper bound in Bernstein’s inequality is of the form $\exp(-3x/(2M))$ while for $x$ close to zero the bound is of the form $\exp(-x^2/(2V))$. This suggests that it might be possible to bound the maximum of random variables satisfying a Bernstein type inequality by a combination of the $\psi_1$ and $\psi_2$ Orlicz norms. The following proposition makes this explicit.
Proposition 3.2 Suppose that $X_1, \ldots, X_m$ are arbitrary random variables satisfying the probability tail bound
\[ P(|X_i| > x) \le 2 \exp\Big( -\frac{x^2}{2(d + cx)} \Big) \]
for all $x > 0$ and $i = 1, \ldots, m$ for fixed positive numbers $c$ and $d$. Then there is a universal constant $K < \infty$ so that
\[ \Big\| \max_{1\le i\le m} |X_i| \Big\|_{\psi_1} \le K \Big\{ c \log(1 + m) + \sqrt{d}\, \sqrt{\log(1 + m)} \Big\}. \]
Proof. Note that the hypothesis implies that
\[ P(|X_i| > x) \le \begin{cases} 2 \exp(-x^2/(4d)) & \text{if } x \le d/c, \\ 2 \exp(-x/(4c)) & \text{if } x > d/c. \end{cases} \]
Hence it follows that the random variables $|X_i| 1_{[|X_i| \le d/c]}$ and $|X_i| 1_{[|X_i| > d/c]}$ satisfy, respectively,
\[ P\big( |X_i| 1_{[|X_i| \le d/c]} > x \big) \le 2 \exp(-x^2/(4d)), \qquad x > 0 \]
and
\[ P\big( |X_i| 1_{[|X_i| > d/c]} > x \big) \le 2 \exp(-x/(4c)), \qquad x > 0. \]
Then Lemma 3.1 implies that
\[ \big\| |X_i| 1_{[|X_i| \le d/c]} \big\|_{\psi_2} \le \sqrt{12 d} \]
and
\[ \big\| |X_i| 1_{[|X_i| > d/c]} \big\|_{\psi_1} \le 12 c \]
for $i = 1, \ldots, m$. This yields
\begin{align*}
\Big\| \max_{1\le i\le m} |X_i| \Big\|_{\psi_1} &\le \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| \le d/c]} \Big\|_{\psi_1} + \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| > d/c]} \Big\|_{\psi_1} \\
&\le C \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| \le d/c]} \Big\|_{\psi_2} + \Big\| \max_{1\le i\le m} |X_i| 1_{[|X_i| > d/c]} \Big\|_{\psi_1} \\
&\le K \Big\{ \sqrt{d}\, \sqrt{\log(1 + m)} + c \log(1 + m) \Big\},
\end{align*}
where the second inequality follows from the fact that for any random variable $V$ we have $\|V\|_{\psi_1} \le C \|V\|_{\psi_2}$ for some constant $C$, and the third inequality follows from Corollary 3.1 applied with $\psi = \psi_2$ and with $\psi = \psi_1$.
$\Box$
3.4 Exercises
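Exercise 3.1 Show that the constant random variable $X = 1$ has $\|X\|_{\psi_p} = (\log 2)^{-1/p}$ for $\psi_p(x) = \exp(x^p) - 1$.
Solution.
By definition of the $L_\psi$-Orlicz norm of $X$ in the case proposed, we have
\[
\begin{aligned}
\|X\|_{\psi_p} &= \inf\big\{c > 0 : E\big(e^{X^p/c^p} - 1\big) \le 1\big\} \\
&= \inf\big\{c > 0 : e^{c^{-p}} - 1 \le 1\big\} \\
&= \inf\big\{c > 0 : e^{c^{-p}} \le 2\big\}.
\end{aligned}
\]
It follows immediately that
\[
e^{c^{-p}} \le 2 \iff c^{-p} \le \log 2 \iff \frac{1}{c} \le (\log 2)^{1/p} \iff c \ge (\log 2)^{-1/p},
\]
and this means exactly that $\|X\|_{\psi_p} = (\log 2)^{-1/p}$. □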
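As a sanity check on Exercise 3.1, the infimum in the definition of the Orlicz norm can be located by bisection. This is a minimal sketch for a nonzero constant random variable only; the function name and the tolerance are illustrative choices, not part of the notes.

```python
import math

def orlicz_norm_constant(value, p, tol=1e-12):
    """Bisection for inf{c > 0 : exp((|value|/c)^p) - 1 <= 1}, value != 0 constant.

    The condition exp(t) - 1 <= 1 is tested as t <= log 2 to avoid overflow.
    """
    def feasible(c):
        return (abs(value) / c) ** p <= math.log(2.0)

    lo, hi = 1e-6, 1.0
    while not feasible(hi):          # grow hi until it lies in the feasible set
        hi *= 2.0
    while hi - lo > tol:             # feasibility is monotone in c
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    for p in (0.5, 1.0, 2.0):
        print(p, orlicz_norm_constant(1.0, p), math.log(2.0) ** (-1.0 / p))
```

The two printed values in each row should agree to within the bisection tolerance.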
Exercise 3.2 Let $\psi$ be a Young modulus. Show that if $0 \le X_n \uparrow X$ almost surely, then $\|X_n\|_\psi \uparrow \|X\|_\psi$.
(Hint: use the monotone convergence theorem to show that $\lim E\psi(X_n / (r \|X\|_\psi)) > 1$ for any $r < 1$).
Solution.
Since $X_1 \le X_2 \le \cdots \le X$, the sequence $\|X_n\|_\psi$ is non-decreasing and, moreover, bounded above by $\|X\|_\psi$. Thus $\|X_n\|_\psi$ converges.
Let $L$ be its limit and suppose that $L < \|X\|_\psi$ (obviously, $L \le \|X\|_\psi$). Under this hypothesis, we can write $L = r \|X\|_\psi$ for some $0 < r < 1$.
Now, we can note that
\[
\frac{X_n}{r \|X\|_\psi} \uparrow \frac{X}{r \|X\|_\psi} \quad \text{a.s.}
\qquad \text{and} \qquad
\psi\Big(\frac{X_n}{r \|X\|_\psi}\Big) \uparrow \psi\Big(\frac{X}{r \|X\|_\psi}\Big) \quad \text{a.s.}
\]
By the monotone convergence theorem, we get
\[
E\Big[\psi\Big(\frac{X_n}{r \|X\|_\psi}\Big)\Big] \uparrow E\Big[\psi\Big(\frac{X}{r \|X\|_\psi}\Big)\Big],
\]
and, by the definition of the Orlicz norm and by the fact that $r < 1$, also
\[
E\Big[\psi\Big(\frac{X}{r \|X\|_\psi}\Big)\Big] > 1.
\]
On the other hand, as $r \|X\|_\psi \ge \|X_n\|_\psi$ for all $n$, it follows that, for all $n$,
\[
E\Big[\psi\Big(\frac{X_n}{r \|X\|_\psi}\Big)\Big] \le 1,
\]
and, passing to the limit as before, this also means
\[
E\Big[\psi\Big(\frac{X}{r \|X\|_\psi}\Big)\Big] \le 1.
\]
Obviously, this leads to a contradiction. It follows that $L = \lim_n \|X_n\|_\psi = \|X\|_\psi$ and the proof is complete. □
Exercise 3.3 Show that the infimum in the definition of an Orlicz norm is attained (at $\|X\|_\psi$).
Solution.
Without loss of generality, we can consider the case in which $X > 0$. By definition of the Orlicz norm, we have
\[
\|X\|_\psi = \tilde{c} = \inf\Big\{c > 0 : E\Big[\psi\Big(\frac{X}{c}\Big)\Big] \le 1\Big\},
\]
where $\psi$ is a Young modulus.
In order to solve the exercise, we have to show that $\tilde{c}$ itself belongs to the set introduced by the definition. For this purpose, we can take any sequence $c_n$ strictly decreasing to $\tilde{c}$. By definition of $\tilde{c}$,
\[
E\Big[\psi\Big(\frac{X}{c_n}\Big)\Big] \le 1 \quad \text{for each } n.
\]
Moreover, we have that
\[
\frac{X}{c_n} \uparrow \frac{X}{\tilde{c}} \quad \text{a.s.},
\]
which leads, by continuity of $\psi$ (due to its convexity), to
\[
\psi\Big(\frac{X}{c_n}\Big) \uparrow \psi\Big(\frac{X}{\tilde{c}}\Big)
\]
and to
\[
E\Big[\psi\Big(\frac{X}{c_n}\Big)\Big] \uparrow E\Big[\psi\Big(\frac{X}{\tilde{c}}\Big)\Big]
\]
by the monotone convergence theorem.
Finally, this shows exactly that
\[
E\Big[\psi\Big(\frac{X}{\tilde{c}}\Big)\Big] \le 1,
\]
and, then, $\tilde{c}$ indeed belongs to the set. □
Exercise 3.4 Let $\psi$ be a Young modulus.
1. Show that its conjugate function $\psi^*$ defined by
\[
\psi^*(y) = \sup_{x > 0} \{xy - \psi(x)\}, \qquad y \ge 0,
\]
is a Young modulus.
2. Moreover, show that
\[
\|X\|_1 \le \sqrt{2} \max\big\{\|X\|_\psi, \|X\|_{\psi^*}\big\}.
\]
Solution.
1. $\psi^*$ satisfies the properties of a Young modulus.
i. $\psi^*(0) = 0$.
ii. $\psi^*$ is an increasing function.
Given any two real numbers $y_1$ and $y_2$ such that $0 \le y_1 \le y_2$, for each $x > 0$,
\[
x y_1 - \psi(x) \le x y_2 - \psi(x),
\]
which implies
\[
\sup_{x > 0} \{x y_1 - \psi(x)\} \le \sup_{x > 0} \{x y_2 - \psi(x)\}.
\]
iii. $\psi^*$ is a convex function.
By the definition of $\psi^*$,
\[
\psi^*((1 - \alpha) y_1 + \alpha y_2) = \sup_{x > 0} \{(1 - \alpha) x y_1 + \alpha x y_2 - \psi(x)\}.
\]
Now, for any fixed $x > 0$,
\[
\begin{aligned}
(1 - \alpha) x y_1 + \alpha x y_2 - \psi(x) &= (1 - \alpha)[x y_1 - \psi(x)] + \alpha [x y_2 - \psi(x)] \\
&\le (1 - \alpha) \psi^*(y_1) + \alpha \psi^*(y_2).
\end{aligned}
\]
Finally, for the convexity, notice that
\[
\begin{aligned}
\sup_{x > 0} \{x((1 - \alpha) y_1 + \alpha y_2) - \psi(x)\} &= \sup_{x > 0} \{(1 - \alpha) x y_1 + \alpha x y_2 - \psi(x)\} \\
&\le (1 - \alpha) \psi^*(y_1) + \alpha \psi^*(y_2).
\end{aligned}
\]
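2. To prove the inequality, consider two sets $\mathcal{C}$ and $\mathcal{D}$ defined as follows:
\[
\mathcal{C} = \big[\|X\|_\psi, \infty\big), \qquad \mathcal{D} = \big[\|X\|_{\psi^*}, \infty\big).
\]
The two sets satisfy the following properties:
\[
C \in \mathcal{C} \iff E\Big[\psi\Big(\frac{X}{C}\Big)\Big] \le 1,
\qquad
D \in \mathcal{D} \iff E\Big[\psi^*\Big(\frac{Y}{D}\Big)\Big] \le 1.
\]
Now, choose two random variables $X, Y$ independent and identically distributed (we may take $X \ge 0$, since the norms involved depend only on $|X|$). Then for any $C \in \mathcal{C}$ and $D \in \mathcal{D}$, we have
\[
\frac{X}{C} \cdot \frac{Y}{D} \le \psi\Big(\frac{X}{C}\Big) + \psi^*\Big(\frac{Y}{D}\Big)
\;\Rightarrow\;
E\Big[\frac{XY}{CD}\Big] \le E\Big[\psi\Big(\frac{X}{C}\Big)\Big] + E\Big[\psi^*\Big(\frac{Y}{D}\Big)\Big],
\]
which implies, since $E(XY) = (EX)^2$ by independence,
\[
\frac{(EX)^2}{CD} \le 2
\qquad \text{or} \qquad
\|X\|_1^2 \le 2CD.
\]
Thus, choosing $C = \|X\|_\psi$ and $D = \|X\|_{\psi^*}$ we obtain the desired inequality
\[
(EX)^2 = \|X\|_1^2 \le 2 \|X\|_\psi \|X\|_{\psi^*} \le 2 \max\big\{\|X\|_\psi, \|X\|_{\psi^*}\big\}^2,
\]
so that
\[
\|X\|_1 \le \sqrt{2} \max\big\{\|X\|_\psi, \|X\|_{\psi^*}\big\}.
\]
□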
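The conjugate function and Young's inequality $xy \le \psi(x) + \psi^*(y)$ used in the solution can be explored numerically. The sketch below approximates $\psi^*$ on a grid for the illustrative choice $\psi(x) = e^x - 1$, whose conjugate is $y \log y - y + 1$ for $y > 1$ and $0$ otherwise. The grid includes $x = 0$, which does not change the supremum since $\psi(0) = 0$. The grid bounds and function names are assumptions of this example.

```python
import math

def conjugate(psi, y, x_max=20.0, steps=20000):
    """Grid approximation of psi*(y) = sup_{x >= 0} {x*y - psi(x)}."""
    return max(x * y - psi(x)
               for x in (x_max * i / steps for i in range(steps + 1)))

def psi_exp(x):
    """Young modulus psi(x) = e^x - 1."""
    return math.exp(x) - 1.0

def psi_exp_conjugate(y):
    """Closed form of the conjugate of psi(x) = e^x - 1."""
    return y * math.log(y) - y + 1.0 if y > 1.0 else 0.0

if __name__ == "__main__":
    for y in (0.5, 1.0, 2.0, 5.0):
        print(y, conjugate(psi_exp, y), psi_exp_conjugate(y))
```

For each $y$ the grid value should sit just below or at the closed form, since a grid supremum can only underestimate the true supremum.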
Exercise 3.5 Suppose that $Z$ is a standard Normal random variable. Show that for all $z \ge 0$,
\[
P(|Z| > z) \le \exp\Big(-\frac{z^2}{2}\Big).
\]
Solution.
If $Z$ is a standard Normal random variable then $Z^2$ is a Chi-square random variable with one degree of freedom. Notice that
\[
\{|Z| > z\} \iff \{Z^2 > z^2\}
\]
and then for the corresponding probability it holds that
\[
P(|Z| > z) = P(Z^2 > z^2).
\]
Then, in order to prove the required inequality, it suffices to show that, for any $\lambda > 0$,
\[
P(\chi^2_1 > \lambda) \le \exp(-\lambda/2).
\]
But
\[
P(\chi^2_1 > \lambda) \le P(\chi^2_2 > \lambda) = P(\mathrm{Exp}(1/2) > \lambda) = e^{-\lambda/2},
\]
where $\mathrm{Exp}(1/2)$ refers to a random variable with Exponential distribution with parameter $1/2$. □
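The bound of Exercise 3.5 is easy to check numerically, because $P(|Z| > z) = \mathrm{erfc}(z/\sqrt{2})$ for a standard Normal $Z$. The following sketch compares the exact two-sided tail with $\exp(-z^2/2)$ on a grid; the function names are illustrative.

```python
import math

def two_sided_normal_tail(z):
    """Exact P(|Z| > z) for standard normal Z via the complementary error function."""
    return math.erfc(z / math.sqrt(2.0))

def tail_bound(z):
    """The sub-Gaussian bound exp(-z^2 / 2)."""
    return math.exp(-z * z / 2.0)

if __name__ == "__main__":
    for z in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(z, two_sided_normal_tail(z), tail_bound(z))
```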
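Exercise 3.6 Suppose that
\[
Pr(|X(s) - X(t)| > x) \le K \exp\big(-C x^2 / d^2(s, t)\big)
\]
for a given stochastic process $X$ and certain positive constants $K$ and $C$. Then the process $X$ is sub-Gaussian for a multiple of the distance $d$.
Solution.
We look for a constant $D = f(K, C) > 0$ such that
\[
\min\big\{1, K e^{-C x^2}\big\} \le 2 e^{-D x^2} \qquad \text{for every } x \ge 0.
\]
If such a constant exists the conclusion follows immediately: a probability is at most 1, so applying the inequality with $x$ replaced by $x / d(s, t)$ we would have
\[
Pr(|X(s) - X(t)| > x) \le 2 \exp\Big(-\frac{D x^2}{d^2(s, t)}\Big) = 2 \exp\Big(-\frac{x^2}{2 d^{*2}(s, t)}\Big)
\]
with $d^*$ actually a multiple of $d$:
\[
2 d^{*2}(s, t) = d^2(s, t)/D \;\Rightarrow\; d^{*2}(s, t) = \frac{d^2(s, t)}{2D} \;\Rightarrow\; d^*(s, t) = (2D)^{-1/2} d(s, t).
\]
Now we prove the crucial claim
\[
\exists D = f(K, C) : \quad \min\big\{1, K e^{-C x^2}\big\} \le 2 e^{-D x^2} \qquad \text{for every } x \ge 0.
\]
Let us restrict for the moment to the case $K > 2$. We have
\[
K e^{-C x^2} \le 1 \iff e^{-C x^2} \le 1/K \iff C x^2 \ge \log K \iff x \ge \Big(\frac{\log K}{C}\Big)^{1/2}.
\]
Thus
\[
\min\big\{1, K e^{-C x^2}\big\} =
\begin{cases}
K e^{-C x^2} & \text{if } x \ge \big(\frac{\log K}{C}\big)^{1/2}, \\
1 & \text{if } x < \big(\frac{\log K}{C}\big)^{1/2}.
\end{cases}
\]
In other words, in order to show the inequality we need to ensure the existence of a constant $D$ satisfying the two following conditions:
i. $2 e^{-D x^2} \ge 1$ for $x < \big(\frac{\log K}{C}\big)^{1/2}$;
ii. $2 e^{-D x^2} \ge K e^{-C x^2}$ for $x \ge \big(\frac{\log K}{C}\big)^{1/2}$.
Condition i holds if and only if $D \log K / C \le \log 2$. Condition ii is equivalent to $(C - D) x^2 \ge \log(K/2)$ for all $x^2 \ge \log K / C$; for $D \le C$ the left side is smallest at $x^2 = \log K / C$, where the requirement again reduces to $D \log K / C \le \log 2$. Hence the two conditions are jointly satisfied when
\[
D \le C \, \frac{\log 2}{\log K},
\]
so that we can choose $D = C \log 2 / \log K$.
It remains to treat the case $K \le 2$. But this is a trivial verification. If $K \le 2$ then
\[
\min\big\{1, K e^{-C x^2}\big\} \le K e^{-C x^2}.
\]
We need to find $D$ such that for any $x$
\[
K e^{-C x^2} \le 2 e^{-D x^2},
\]
that is
\[
\frac{K}{2} \le e^{(C - D) x^2}.
\]
Since $K/2 \le 1$, it is sufficient to choose $D = C$. □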
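The key inequality $\min\{1, K e^{-Cx^2}\} \le 2 e^{-Dx^2}$ can be checked on a grid. In this sketch the constant $D$ is taken, as an assumption of the example, to be $C \log 2 / \log K$ when $K > 2$ and $C$ when $K \le 2$; the function names and grid are illustrative.

```python
import math

def sub_gaussian_constant(K, C):
    """Candidate D (assumed choice): C if K <= 2, else C * log 2 / log K."""
    return C if K <= 2.0 else C * math.log(2.0) / math.log(K)

def holds_on_grid(K, C, x_max=10.0, steps=10000):
    """Check min(1, K e^{-C x^2}) <= 2 e^{-D x^2} at grid points of [0, x_max]."""
    D = sub_gaussian_constant(K, C)
    for i in range(steps + 1):
        x = x_max * i / steps
        lhs = min(1.0, K * math.exp(-C * x * x))
        rhs = 2.0 * math.exp(-D * x * x)
        if lhs > rhs * (1.0 + 1e-12):   # small slack for the equality point
            return False
    return True

if __name__ == "__main__":
    for K in (0.5, 2.0, 3.0, 10.0, 100.0):
        print(K, sub_gaussian_constant(K, 1.0), holds_on_grid(K, 1.0))
```

When $K > 2$ the two sides touch at $x^2 = \log K / C$, which is why a small relative slack is allowed in the comparison.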
Chapter 4
Inequalities for sums of independent processes
In this chapter several inequalities for sums of independent stochastic processes are presented, in particular
1. symmetrization inequalities,
2. the Ottaviani inequality,
3. Lévy's inequalities,
4. Hoffmann-Jørgensen inequalities.
4.1 Symmetrization inequalities
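Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with probability distribution $P$ on the measurable space $(\mathcal{X}, \mathcal{A})$. For some class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$, consider the process
\[
(\mathbb{P}_n - P) f = \frac{1}{n} \sum_{i=1}^n \big(f(X_i) - P f\big), \qquad f \in \mathcal{F}.
\]
Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher random variables, independent of $(X_1, \ldots, X_n)$. It will be more useful to consider the symmetrized process
\[
\mathbb{P}_n^0 f = \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i), \qquad f \in \mathcal{F},
\]
or
\[
\mathbb{P}_n^\dagger f = \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big(f(X_i) - P f\big) = \mathbb{P}_n^0 f - \bar{\varepsilon}_n P f, \qquad f \in \mathcal{F},
\]
where $\bar{\varepsilon}_n = n^{-1} \sum_{i=1}^n \varepsilon_i$.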
It will be convenient in the following to generalize the treatment beyond the empirical process setting. Consider sums of independent stochastic processes $\{Z_i(f) : f \in \mathcal{F}\}$. The processes $Z_i$ need not possess any measurability beyond the measurability of all marginals $Z_i(f)$, but for computing outer expectations it will be understood that the underlying probability space is the product space $\prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i) \times (\mathcal{Z}, \mathcal{C}, Q)$ and each $Z_i$ is a function of the $i$-th coordinate of $(x, z) = (x_1, \ldots, x_n, z)$ only. The additional Rademacher or other random variables are understood to be functions of the $(n + 1)$-st coordinate $z$ only. The empirical process corresponds to taking $Z_i(f) = f(X_i) - P f$.
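Lemma 4.1 Let $Z_1, \ldots, Z_n$ be independent stochastic processes with mean zero. Then for any nondecreasing, convex $\Phi : \mathbb{R} \to \mathbb{R}$ and arbitrary functions $\mu_i : \mathcal{F} \to \mathbb{R}$,
\[
E^* \Phi\Big(\frac{1}{2} \Big\| \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}}\Big)
\le E^* \Phi\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}}\Big)
\le E^* \Phi\Big(2 \Big\| \sum_{i=1}^n \varepsilon_i (Z_i - \mu_i) \Big\|_{\mathcal{F}}\Big).
\]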
Proof.
Let $Y_1, \ldots, Y_n$ be an independent copy of $Z_1, \ldots, Z_n$, defined formally as the coordinate projections on the last $n$ coordinates in the product space $\prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i) \times (\mathcal{Z}, \mathcal{C}, Q) \times \prod_{i=1}^n (\mathcal{X}_i, \mathcal{A}_i, P_i)$. Since $E Y_i(f) = 0$, the left side of the proposition is an average of expressions of the type
\[
E_Z^* \Phi\Big(\Big\| \frac{1}{2} \sum_{i=1}^n e_i \big(Z_i(f) - E Y_i(f)\big) \Big\|_{\mathcal{F}}\Big),
\]
where $(e_1, \ldots, e_n) \in \{-1, 1\}^n$. By convexity of $\Phi$ and the norm $\|\cdot\|_{\mathcal{F}}$, it follows from Jensen's inequality that this expression is bounded above by
\[
E_{Z,Y}^* \Phi\Big(\Big\| \frac{1}{2} \sum_{i=1}^n e_i \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}\Big)
= E_{Z,Y}^* \Phi\Big(\Big\| \frac{1}{2} \sum_{i=1}^n \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}\Big).
\]
Finally, apply the triangle inequality and convexity of $\Phi$.
To prove the inequality on the right, note that for fixed values of the $Z_i$'s we have
\[
\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}}
= \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^n \big(Z_i(f) - E Y_i(f)\big) \Big|
\le E_Y^* \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^n \big(Z_i(f) - Y_i(f)\big) \Big|,
\]
where $E_Y^*$ is the outer expectation with respect to $Y_1, \ldots, Y_n$ computed for $P^n$ for given fixed values of $Z_1, \ldots, Z_n$. This in combination with Jensen's inequality yields
\[
\Phi\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}}\Big)
\le E_Y \Phi\Big(\Big\| \sum_{i=1}^n \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}^{*Y}\Big),
\]
where $*Y$ denotes the minimal measurable majorant of the supremum with respect to $Y_1, \ldots, Y_n$, still with $Z_1, \ldots, Z_n$ fixed. Because $\Phi$ is continuous and nondecreasing, the $*Y$ inside $\Phi$ can be moved to $E_Y^*$. Now take the expectation with respect to $Z_1, \ldots, Z_n$ to get
\[
E^* \Phi\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}}\Big)
\le E_Z^* E_Y^* \Phi\Big(\Big\| \sum_{i=1}^n \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}\Big).
\]
Here the repeated outer expectation can be bounded above by the joint outer expectation $E^*$ by Lemma 1.2.6 of van der Vaart and Wellner (1996). Note that adding a minus sign in front of a term $[Z_i(f) - Y_i(f)]$ has the effect of exchanging $Z_i$ and $Y_i$. By construction of the underlying probability space as a product space, the outer expectation of any function $f(Z_1, \ldots, Z_n, Y_1, \ldots, Y_n)$ remains unchanged under permutations of its $2n$ arguments. The resulting expression
\[
E^* \Phi\Big(\Big\| \sum_{i=1}^n e_i \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}\Big)
\]
is the same for any $n$-tuple $(e_1, \ldots, e_n) \in \{-1, 1\}^n$. Thus
\[
E^* \Phi\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}}\Big)
\le E_\varepsilon E_{Z,Y}^* \Phi\Big(\Big\| \sum_{i=1}^n \varepsilon_i \big(Z_i(f) - Y_i(f)\big) \Big\|_{\mathcal{F}}\Big).
\]
Now add and subtract $\mu_i$ inside the right side and use the triangle inequality and convexity of $\Phi$ to show that the right side of the preceding display is bounded above by
\[
\frac{1}{2} E_\varepsilon E_{Z,Y}^* \Phi\Big(2 \Big\| \sum_{i=1}^n \varepsilon_i \big(Z_i(f) - \mu_i(f)\big) \Big\|_{\mathcal{F}}\Big)
+ \frac{1}{2} E_\varepsilon E_{Z,Y}^* \Phi\Big(2 \Big\| \sum_{i=1}^n \varepsilon_i \big(Y_i(f) - \mu_i(f)\big) \Big\|_{\mathcal{F}}\Big).
\]
Perfectness of coordinate projections implies that the expectation $E_{Z,Y}^*$ is the same as $E_Z^*$ and $E_Y^*$ in the two terms, respectively. Finally, replace the repeated outer expectations by a joint outer expectation and note that the two resulting terms are equal. □
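Corollary 4.1 For every nondecreasing, convex $\Phi : \mathbb{R} \to \mathbb{R}$ and class of measurable functions $\mathcal{F}$,
\[
E^* \Phi\Big(\frac{1}{2} \big\| \mathbb{P}_n^\dagger \big\|_{\mathcal{F}}\Big)
\le E^* \Phi\big(\| \mathbb{P}_n - P \|_{\mathcal{F}}\big)
\le E^* \Phi\big(2 \| \mathbb{P}_n^0 \|_{\mathcal{F}}\big) \wedge E^* \Phi\big(2 \big\| \mathbb{P}_n^\dagger \big\|_{\mathcal{F}}\big).
\]
We will frequently use these symmetrization inequalities with the choice $\Phi(x) = x$. Although the hypothesis that $\Phi$ is a convex function rules out the choice $\Phi(x) = 1\{x > a\}$, there is a corresponding symmetrization inequality for probabilities which is also useful.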
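For intuition, the inequality $E\|\mathbb{P}_n - P\|_{\mathcal{F}} \le 2 E\|\mathbb{P}_n^0\|_{\mathcal{F}}$ (the case $\Phi(x) = x$) can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not a proof: it takes $P$ = Uniform(0,1) and $\mathcal{F} = \{1_{[0,t]} : t \in [0,1]\}$, and the function names, sample size and seed are arbitrary choices.

```python
import random

def sup_norms(n, rng):
    """One draw of (||P_n - P||_F, ||P_n^0||_F) for F = {1[0,t]}, P = Uniform(0,1)."""
    xs = sorted(rng.random() for _ in range(n))
    # Signs are i.i.d. and independent of the sample, so attaching them to
    # the order statistics does not change the joint distribution.
    eps = [rng.choice((-1, 1)) for _ in range(n)]
    centered, symmetrized, partial = 0.0, 0.0, 0
    for k, x in enumerate(xs, start=1):
        # sup_t |F_n(t) - t| is attained at or just before a sample point
        centered = max(centered, abs(k / n - x), abs((k - 1) / n - x))
        partial += eps[k - 1]
        symmetrized = max(symmetrized, abs(partial) / n)
    return centered, symmetrized

def monte_carlo(n=200, reps=500, seed=0):
    """Monte Carlo estimates of E||P_n - P||_F and 2 E||P_n^0||_F."""
    rng = random.Random(seed)
    draws = [sup_norms(n, rng) for _ in range(reps)]
    lhs = sum(d[0] for d in draws) / reps
    rhs = 2.0 * sum(d[1] for d in draws) / reps
    return lhs, rhs

if __name__ == "__main__":
    print(monte_carlo())
```

Both estimates are of order $n^{-1/2}$, with the symmetrized side noticeably larger in this example.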
Lemma 4.2 For arbitrary stochastic processes $Z_1, \ldots, Z_n$ and arbitrary functions $\mu_1, \ldots, \mu_n : \mathcal{F} \to \mathbb{R}$,
\[
\beta_n(x) P^*\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}} > x\Big)
\le 2 P^*\Big(4 \Big\| \sum_{i=1}^n \varepsilon_i (Z_i - \mu_i) \Big\|_{\mathcal{F}} > x\Big)
\]
for every $x > 0$ and $\beta_n(x) \le \inf_{f \in \mathcal{F}} P\big(\big| \sum_{i=1}^n Z_i(f) \big| < x/2\big)$. In particular this is true for i.i.d. mean-zero processes, and $\beta_n(x) = 1 - (4n/x^2) \sup_f \mathrm{Var}[Z_1(f)]$.
Proof.
Let $Y_1, \ldots, Y_n$ be an independent copy of $Z_1, \ldots, Z_n$, suitably defined on a product space as previously. If $\| \sum_{i=1}^n Z_i \|_{\mathcal{F}} > x$, then there is certainly some $f$ for which $| \sum_{i=1}^n Z_i(f) | > x$. Fix a realization $Z_1, \ldots, Z_n$ and $f$ for which both are the case. For this fixed realization
\[
\begin{aligned}
\beta &\le P_Y^*\Big(\Big| \sum_{i=1}^n Y_i(f) \Big| < \frac{x}{2}\Big) \\
&\le P_Y^*\Big(\Big| \sum_{i=1}^n Y_i(f) - \sum_{i=1}^n Z_i(f) \Big| > \frac{x}{2}\Big) \\
&\le P_Y^*\Big(\Big\| \sum_{i=1}^n (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x}{2}\Big).
\end{aligned}
\]
The far left and far right sides do not depend on the particular $f$. Integrate the two sides out with respect to $Z_1, \ldots, Z_n$ over this set to obtain
\[
\beta P^*\Big(\Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}} > x\Big)
\le P_Z^* P_Y^*\Big(\Big\| \sum_{i=1}^n (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x}{2}\Big).
\]
By symmetry, the right side equals
\[
E_\varepsilon P_Z^* P_Y^*\Big(\Big\| \sum_{i=1}^n \varepsilon_i (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x}{2}\Big).
\]
In view of the triangle inequality, this expression is not bigger than
\[
2 P^*\Big(\Big\| \sum_{i=1}^n \varepsilon_i (Y_i - \mu_i) \Big\|_{\mathcal{F}} > \frac{x}{4}\Big).
\]
Processes $Z_1, \ldots, Z_n$ with mean zero satisfy the condition for the given $\beta$ in view of Chebyshev's inequality. □
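The Chebyshev step at the end of the proof can be written out as follows. This is a sketch for the i.i.d. mean-zero case, assuming $\mathrm{Var}[Z_1(f)]$ is finite for every $f$.

```latex
\[
P\Big(\Big|\sum_{i=1}^n Z_i(f)\Big| \ge \frac{x}{2}\Big)
  \le \frac{\mathrm{Var}\big[\sum_{i=1}^n Z_i(f)\big]}{(x/2)^2}
  = \frac{4n\,\mathrm{Var}[Z_1(f)]}{x^2},
\]
so that
\[
\inf_{f \in \mathcal{F}} P\Big(\Big|\sum_{i=1}^n Z_i(f)\Big| < \frac{x}{2}\Big)
  \ge 1 - \frac{4n}{x^2} \sup_{f \in \mathcal{F}} \mathrm{Var}[Z_1(f)] = \beta_n(x).
\]
```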
Lemma 4.3 (Second symmetrization lemma for probabilities) Suppose that $\{Z(f) : f \in \mathcal{F}\}$ and $\{Y(f) : f \in \mathcal{F}\}$ are independent stochastic processes indexed by $\mathcal{F}$. Suppose that $x > \epsilon > 0$. Then
\[
\beta_n(\epsilon) P^*\Big(\sup_{f \in \mathcal{F}} |Z(f)| > x\Big) \le P^*\Big(\sup_{f \in \mathcal{F}} |Z(f) - Y(f)| > x - \epsilon\Big),
\]
where $\beta_n(\epsilon) \le \inf_{f \in \mathcal{F}} P(|Y(f)| \le \epsilon)$.
Proof.
We suppose that $Z$ and $Y$ are defined on a product space $(\Omega \times \Omega', \mathcal{B} \times \mathcal{B}')$. If $\|Z\|_{\mathcal{F}} > x$, then there is some $f \in \mathcal{F}$ for which $|Z(f)| > x$. Fix an outcome $\omega \in \Omega$ and $f \in \mathcal{F}$ so that $|Z(f, \omega)| > x$. Then we have
\[
\beta_n(\epsilon) \le P_Y^*(|Y(f)| \le \epsilon) \le P_Y^*(|Z(f, \omega) - Y(f)| > x - \epsilon)
\le P_Y^*\big(\|Z(\cdot, \omega) - Y\|_{\mathcal{F}} > x - \epsilon\big).
\]
The far left and far right sides do not depend on the particular $f$, and the inequality holds on the set $\{\|Z\|_{\mathcal{F}} > x\}$. Integration of the two sides with respect to $Z$ over this set yields the stated conclusion. □
4.2 The Ottaviani Inequality
Throughout this section $S_n$ equals the partial sum $X_1 + \cdots + X_n$ of independent stochastic processes $X_1, \ldots, X_n$. The process $X_i$ is called symmetric if $X_i$ and $-X_i$ have the same distribution. Independence of the stochastic processes $X_1, \ldots, X_n$ is understood in the sense that each of the processes is defined on a product probability space
\[
\prod_{i=1}^\infty (\Omega_i, \mathcal{A}_i, P_i) \tag{4.1}
\]
with $X_i$ depending on the $i$-th coordinate of $(\omega_1, \omega_2, \ldots)$ only.
Proposition 4.1 (Ottaviani inequality) Let $X_1, \ldots, X_n$ be independent stochastic processes indexed by an arbitrary set. Then for $\lambda, \eta > 0$,
\[
P^*\Big(\max_{k \le n} \|S_k\| > \lambda + \eta\Big) \le \frac{P^*(\|S_n\| > \lambda)}{1 - \max_{k \le n} P^*(\|S_n - S_k\| > \eta)}.
\]
Proof.
Let $A_k$ be the event that $\|S_k\|^*$ is the first $\|S_j\|^*$ that is strictly greater than $\lambda + \eta$:
\[
A_k = \{\|S_1\|^* \le \lambda + \eta, \ldots, \|S_{k-1}\|^* \le \lambda + \eta, \|S_k\|^* > \lambda + \eta\}.
\]
The event on the left is the disjoint union of $A_1, \ldots, A_n$. Since $\|S_n - S_k\|^*$ is independent of $\|S_1\|^*, \ldots, \|S_k\|^*$,
\[
\begin{aligned}
P(A_k) \min_{j \le n} P(\|S_n - S_j\|^* \le \eta) &\le P(A_k, \|S_n - S_k\|^* \le \eta) \\
&\le P(A_k, \|S_n\|^* > \lambda),
\end{aligned}
\]
since $\|S_k\|^* > \lambda + \eta$ on $A_k$. Summing up over $k$ yields the result. □
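For a concrete feel for the Ottaviani inequality, both sides can be computed exactly for a simple random walk by enumerating all sign sequences. This is only a sketch for real-valued steps $\pm 1$ with equal probability; the function name and the values of $n$, $\lambda$, $\eta$ are illustrative choices.

```python
from itertools import product

def ottaviani_sides(n, lam, eta):
    """Exact left and right sides of Ottaviani's inequality for a fair +-1 walk."""
    paths = list(product((-1, 1), repeat=n))
    total = len(paths)
    max_exceeds = 0          # paths with max_k |S_k| > lam + eta
    final_exceeds = 0        # paths with |S_n| > lam
    tail_exceeds = [0] * n   # tail_exceeds[k-1]: paths with |S_n - S_k| > eta
    for signs in paths:
        partial, sums = 0, []
        for s in signs:
            partial += s
            sums.append(partial)
        if max(abs(v) for v in sums) > lam + eta:
            max_exceeds += 1
        if abs(sums[-1]) > lam:
            final_exceeds += 1
        for k in range(n):
            if abs(sums[-1] - sums[k]) > eta:
                tail_exceeds[k] += 1
    left = max_exceeds / total
    denominator = 1.0 - max(tail_exceeds) / total
    return left, (final_exceeds / total) / denominator

if __name__ == "__main__":
    print(ottaviani_sides(10, 2, 4))
```

The bound is only informative when the denominator is positive, which is the case for the parameters used here.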
4.3 Lévy's Inequalities
Proposition 4.2 (Lévy's inequalities) Let $X_1, \ldots, X_n$ be independent, symmetric stochastic processes indexed by an arbitrary set. Then for every $\lambda > 0$
\[
P^*\Big(\max_{k \le n} \|S_k\| > \lambda\Big) \le 2 P^*(\|S_n\| > \lambda),
\]
\[
P^*\Big(\max_{k \le n} \|X_k\| > \lambda\Big) \le 2 P^*(\|S_n\| > \lambda).
\]
Proof.
Let $A_k$ be the event that $\|S_k\|^*$ is the first $\|S_j\|^*$ that is strictly greater than $\lambda$:
\[
A_k = \{\|S_1\|^* \le \lambda, \ldots, \|S_{k-1}\|^* \le \lambda, \|S_k\|^* > \lambda\}.
\]
The event on the left in the first inequality is the disjoint union of $A_1, \ldots, A_n$. Write $T_n$ for the sum of the sequence $X_1, \ldots, X_k, -X_{k+1}, \ldots, -X_n$. By the triangle inequality, $2\|S_k\|^* \le \|S_n\|^* + \|T_n\|^*$. It follows that
\[
P(A_k) \le P(A_k, \|S_n\|^* > \lambda) + P(A_k, \|T_n\|^* > \lambda) = 2 P(A_k, \|S_n\|^* > \lambda)
\]
since $X_1, \ldots, X_n$ are symmetric. Summing up over $k$ yields the first inequality.
To prove the second inequality, let $A_k$ be the event that $\|X_k\|^*$ is the first $\|X_j\|^*$ that is strictly greater than $\lambda$. Write $T_n$ for the sum of the variables $-X_1, \ldots, -X_{k-1}, X_k, -X_{k+1}, \ldots, -X_n$. By the triangle inequality, $2\|X_k\|^* \le \|S_n\|^* + \|T_n\|^*$. The rest of the proof goes exactly as before. □
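Both of Lévy's inequalities can be verified exactly in a small example by enumerating all paths. The sketch below assumes real-valued i.i.d. steps uniform on the symmetric set $\{-2, -1, 1, 2\}$; this distribution, the function name and the parameter values are illustrative choices.

```python
from itertools import product

def levy_sides(n, lam, steps=(-2, -1, 1, 2)):
    """Exact P(max_k |S_k| > lam), P(max_k |X_k| > lam) and 2 P(|S_n| > lam).

    The X_k are i.i.d. uniform on the symmetric set `steps`, so all paths of
    length n are equally likely and counting paths gives exact probabilities.
    """
    max_sum = max_step = final = total = 0
    for path in product(steps, repeat=n):
        total += 1
        partial, running_max = 0, 0
        for step in path:
            partial += step
            running_max = max(running_max, abs(partial))
        max_sum += running_max > lam
        max_step += max(abs(s) for s in path) > lam
        final += abs(partial) > lam
    return max_sum / total, max_step / total, 2.0 * final / total

if __name__ == "__main__":
    for lam in (0.5, 1.5, 3.0, 5.0):
        print(lam, levy_sides(6, lam))
```

In each printed triple the first two entries should not exceed the third.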
4.4 Hoffmann-Jørgensen Inequalities
Proposition 4.3 (Hoffmann-Jørgensen inequalities) Let $X_1, \ldots, X_n$ be independent stochastic processes indexed by an arbitrary set. Then for any $\lambda, \eta > 0$,
\[
P^*\Big(\max_{k \le n} \|S_k\| > 3\lambda + \eta\Big) \le P^*\Big(\max_{k \le n} \|S_k\| > \lambda\Big)^2 + P^*\Big(\max_{k \le n} \|X_k\| > \eta\Big).
\]
If $X_1, \ldots, X_n$ are independent and symmetric, then also
\[
P^*\big(\|S_n\| > 2\lambda + \eta\big) \le 4 P^*(\|S_n\| > \lambda)^2 + P^*\Big(\max_{k \le n} \|X_k\| > \eta\Big).
\]
Proof. Let $A_k = \{\|S_1\|^* \le \lambda, \ldots, \|S_{k-1}\|^* \le \lambda, \|S_k\|^* > \lambda\}$. Then the $A_k$'s are disjoint and $\bigcup_{k=1}^n A_k = \{\max_{k \le n} \|S_k\|^* > \lambda\}$. By the triangle inequality
\[
\|S_j\|^* \le \|S_{k-1}\|^* + \|X_k\|^* + \|S_j - S_k\|^*, \qquad \forall j \ge k.
\]
By construction of $A_k$, conclude that on $A_k$
\[
\max_{j \ge k} \|S_j\|^* \le \lambda + \max_{k \le n} \|X_k\|^* + \max_{j > k} \|S_j - S_k\|^*.
\]
Since the processes $X_j$ are independent, we obtain for every $k$
\[
\begin{aligned}
P\Big(A_k, \max_{k \le n} \|S_k\|^* > 3\lambda + \eta\Big)
&\le P\Big(A_k, \max_{k \le n} \|X_k\|^* > \eta\Big) + P(A_k) P\Big(\max_{m > k} \|S_m - S_k\|^* > 2\lambda\Big) \\
&\le P\Big(A_k, \max_{k \le n} \|X_k\|^* > \eta\Big) + P(A_k) P\Big(\max_{k \le n} \|S_k\|^* > \lambda\Big),
\end{aligned}
\]
since in the probability on the far right the variable $\max_{m > k} \|S_m - S_k\|^*$ can be bounded by $2 \max_{k \le n} \|S_k\|^*$. Next sum over $k$ to obtain the first inequality of the proposition.
To prove the second inequality, first use the same method as above to show that
\[
P(A_k, \|S_n\|^* > 2\lambda + \eta) \le P\Big(A_k, \max_{k \le n} \|X_k\|^* > \eta\Big) + P(A_k) P(\|S_n - S_k\|^* > \lambda).
\]
Since $\|S_n - S_k\|^* \le \max_{k \le n} \|S_n - S_k\|^*$, summation over $k$ yields
\[
P(\|S_n\|^* > 2\lambda + \eta) \le P\Big(\max_{k \le n} \|X_k\|^* > \eta\Big) + P\Big(\max_{k \le n} \|S_k\|^* > \lambda\Big) P\Big(\max_{k \le n} \|S_n - S_k\|^* > \lambda\Big).
\]
The processes $S_k$ and $S_n - S_k$ are the partial sums of the symmetric processes $X_1, \ldots, X_n$ and $X_n, \ldots, X_2$ respectively. Application of Lévy's inequality to both probabilities on the far right concludes the proof. □
Proposition 4.4 (Hoffmann-Jørgensen's inequality for moments) Let $0 < p < \infty$ and suppose that $X_1, \ldots, X_n$ are independent stochastic processes indexed by an arbitrary index set $T$. Then there exist constants $C_p$ and $0 < u_p < 1$ such that
\[
E^* \max_{k \le n} \|S_k\|^p \le C_p \Big(E^* \max_{k \le n} \|X_k\|^p + F^{-1}(u_p)^p\Big),
\]
where $F^{-1}$ is the quantile function of the random variable $\max_{k \le n} \|S_k\|^*$. Furthermore, if $X_1, \ldots, X_n$ are symmetric, then there exist constants $K_p$ and $0 < \upsilon_p < 1$ such that
\[
E^* \|S_n\|^p \le K_p \Big(E^* \max_{k \le n} \|X_k\|^p + G^{-1}(\upsilon_p)^p\Big),
\]
where $G^{-1}$ is the quantile function of the random variable $\|S_n\|^*$. For $p \ge 1$, the last inequality is also valid for mean-zero processes (with different constants).
Proof.
Take $\lambda = \eta = t$ in the first inequality of the preceding proposition to find that, for any $x > 0$,
\[
\begin{aligned}
E^* \max_{k \le n} \|S_k\|^p
&= 4^p \int_0^\infty P\Big(\max_{k \le n} \|S_k\|^* > 4t\Big) \, d(t^p) \\
&\le (4x)^p + 4^p \int_x^\infty P\Big(\max_{k \le n} \|S_k\|^* > t\Big)^2 d(t^p) + 4^p \int_x^\infty P\Big(\max_{k \le n} \|X_k\|^* > t\Big) \, d(t^p) \\
&\le (4x)^p + 4^p P\Big(\max_{k \le n} \|S_k\|^* > x\Big) E^* \max_{k \le n} \|S_k\|^p + 4^p E^* \max_{k \le n} \|X_k\|^p.
\end{aligned}
\]
Now choose $x$ such that $4^p P(\max_{k \le n} \|S_k\|^* > x)$ is bounded by $1/2$. By rearranging terms the first inequality follows. The second inequality can be proved in a similar way, this time using the second inequality of the preceding proposition.
The inequality for zero-mean processes follows from the inequality for symmetric processes by symmetrization and desymmetrization: it follows from Jensen's inequality that $E^* \|S_n\|^p$ is bounded by $E^* \|S_n - T_n\|^p$, where $T_n$ is the sum of $n$ independent copies of $X_1, \ldots, X_n$. □
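The rearrangement step in the proof of the first inequality can be made explicit. This is a sketch that assumes $E^* \max_{k \le n} \|S_k\|^p < \infty$, so that the term can be subtracted from both sides; the explicit values of $u_p$ and $C_p$ below are one admissible choice, not necessarily the ones intended in the notes.

```latex
Write $a = E^* \max_{k \le n} \|S_k\|^p$ and $b = E^* \max_{k \le n} \|X_k\|^p$.
If $4^p P(\max_{k \le n} \|S_k\|^* > x) \le 1/2$, the last display gives
\[
a \le (4x)^p + \tfrac{1}{2}\, a + 4^p b
\quad\Longrightarrow\quad
a \le 2 \cdot 4^p \big( x^p + b \big).
\]
The condition on $x$ holds for $x = F^{-1}(u_p)$ with $u_p = 1 - \tfrac{1}{2}\, 4^{-p}$,
which yields the first inequality with $C_p = 2 \cdot 4^p$.
```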
Chapter 5
Glivenko-Cantelli Theorems
5.1 Glivenko-Cantelli classes $\mathcal{F}$
In this chapter we will prove two types of Glivenko-Cantelli theorems via symmetrization and the maximal inequalities developed in Chapter 3.
To begin, we first need to define entropy with bracketing. Let $(\mathcal{F}, \|\cdot\|)$ be a subset of a normed space of real functions $f : \mathcal{X} \to \mathbb{R}$; usually we will take $\|\cdot\|$ to be the supremum norm or the $L_r(Q)$ norm for some $r \ge 1$ and a probability measure $Q$ on the measurable space $(\mathcal{X}, \mathcal{A})$.
Given two functions $l$ and $u$ on $\mathcal{X}$, the bracket $[l, u]$ is the set of all functions $f \in \mathcal{F}$ with $l \le f \le u$. The functions $l$ and $u$ need not belong to $\mathcal{F}$, but are assumed to have finite norms. An $\epsilon$-bracket is a bracket $[l, u]$ with $\|u - l\| \le \epsilon$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets needed to cover $\mathcal{F}$. The entropy with bracketing is the logarithm of the bracketing number.
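Theorem 5.1 Let $\mathcal{F}$ be a class of measurable functions such that
\[
N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty
\]
for every $\epsilon > 0$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli, that is
\[
\|\mathbb{P}_n - P\|_{\mathcal{F}}^* = \Big(\sup_{f \in \mathcal{F}} |\mathbb{P}_n f - P f|\Big)^* \to_{a.s.} 0.
\]
Proof.
Fix $\epsilon > 0$. Choose finitely many $\epsilon$-brackets $[l_i, u_i]$, $i = 1, \ldots, m$, with $m = N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$, whose union contains $\mathcal{F}$ and such that $P(u_i - l_i) < \epsilon$ for all $1 \le i \le m$. Thus, for every $f \in \mathcal{F}$ there is a bracket $[l_i, u_i]$ such that
\[
(\mathbb{P}_n - P) f \le (\mathbb{P}_n - P) u_i + P(u_i - f) \le (\mathbb{P}_n - P) u_i + \epsilon.
\]
Similarly
\[
(P - \mathbb{P}_n) f \le (P - \mathbb{P}_n) l_i + P(f - l_i) \le (P - \mathbb{P}_n) l_i + \epsilon.
\]
It follows that
\[
\sup_{f \in \mathcal{F}} |(\mathbb{P}_n - P) f| \le \max_{1 \le i \le m} (\mathbb{P}_n - P) u_i \vee \max_{1 \le i \le m} (P - \mathbb{P}_n) l_i + \epsilon,
\]
where the right side converges almost surely to $\epsilon$ by the strong law of large numbers for real random variables (applied $2m$ times). Thus $\limsup_n \|\mathbb{P}_n - P\|_{\mathcal{F}}^* \le \epsilon$ almost surely for every $\epsilon > 0$. □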
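The classical special case is the class of indicators $1_{[0,t]}$ under the Uniform(0,1) distribution, where $\|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_t |F_n(t) - t|$. The sketch below computes this supremum from the order statistics; the function name and the sample sizes are illustrative, and the random samples are for demonstration only.

```python
import random

def sup_distance_uniform(sample):
    """sup_t |F_n(t) - t| against the Uniform(0,1) distribution function."""
    xs = sorted(sample)
    n = len(xs)
    # At the k-th order statistic F_n jumps from (k-1)/n to k/n, so the
    # supremum is attained at or just before one of the sample points.
    return max(max(abs(k / n - x), abs((k - 1) / n - x))
               for k, x in enumerate(xs, start=1))

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000, 10000):
        print(n, sup_distance_uniform([rng.random() for _ in range(n)]))
```

For the deterministic sample of midpoints $(i - 1/2)/n$ the function returns $1/(2n)$, which is the smallest possible value for a sample of size $n$.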
An envelope function for a class of real functions $\mathcal{F}$ on a measurable space $(\mathcal{X}, \mathcal{A})$ is any function $F$ on $\mathcal{X}$ such that $|f(x)| \le F(x)$ for all $x \in \mathcal{X}$ and all $f \in \mathcal{F}$. The minimal envelope function is $x \mapsto \sup_{f \in \mathcal{F}} |f(x)|$. From the theorem just proved it follows that any class $\mathcal{F}$ satisfying the bracketing hypothesis automatically has a measurable envelope function.
One of the simplest settings to which this theorem applies involves a collection of functions $f = f(\cdot, t)$ indexed or parametrized by $t \in T$, a compact subset of a metric space $(D, d)$. Here is the basic lemma; it goes back to Wald (1949) and Le Cam (1953).
Lemma 5.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where the functions $f : \mathcal{X} \times T \to \mathbb{R}$ are continuous in $t$ for $P$-almost all $x \in \mathcal{X}$. Suppose that $T$ is compact and that the envelope function $F$ defined by $F(x) = \sup_{t \in T} |f(x, t)|$ satisfies $P^* F < \infty$. Then
\[
N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty
\]
for every $\epsilon > 0$, and hence $\mathcal{F}$ is $P$-Glivenko-Cantelli.
Proof.
Define, for $x \in \mathcal{X}$, $t \in T$, and $\rho > 0$,
\[
\psi(x; t, \rho) := \sup_{s \in T,\, d(s, t) < \rho} |f(x, s) - f(x, t)|.
\]
Since $f$ is continuous in $t$, for any countable set $D$ dense in $\{s \in T : d(s, t) < \rho\}$,
\[
\psi(x; t, \rho) = \sup_{s \in D,\, d(s, t) < \rho} |f(x, s) - f(x, t)|,
\]
and hence $\psi(\cdot; t, \rho)$ is a measurable function for each $t \in T$ and $\rho > 0$. Note that $\psi(x; t, \rho) \to 0$ as $\rho \to 0$ for $P$-almost every $x$ and $\psi(x; t, \rho) \le 2 F^*(x)$ with $P F^* < \infty$, so the dominated convergence theorem yields
\[
P \psi(X; t, \rho) = \int \psi(x; t, \rho) \, dP(x) \to 0
\]
as $\rho \to 0$.
Fix $\delta > 0$. For each $t \in T$ choose $\rho_t$ so small that $P \psi(X; t, \rho_t) \le \delta$. This yields an open cover of $T$: the balls $B_t := \{s \in T : d(s, t) < \rho_t\}$ work. By compactness of $T$ there is a finite sub-cover $B_{t_1}, \ldots, B_{t_k}$ of $T$. In terms of this finite sub-cover, define brackets for $\mathcal{F}$ by
\[
l_j(x) = f(x, t_j) - \psi(x; t_j, \rho_{t_j}), \qquad
u_j(x) = f(x, t_j) + \psi(x; t_j, \rho_{t_j}), \qquad j = 1, \ldots, k.
\]
Then $P(u_j - l_j) = 2 P \psi(X; t_j, \rho_{t_j}) \le 2\delta$ and for $t \in B_{t_j}$ we have
\[
l_j(x) \le f(x, t) \le u_j(x).
\]
Hence
\[
N_{[\,]}(2\delta, \mathcal{F}, L_1(P)) \le k.
\]
□
The next lemma further quantifies the finiteness given by Lemma 5.1 by imposing a Lipschitz type condition rather than just continuity.
Lemma 5.2 Suppose that $\{f(\cdot, t) : t \in T\}$ is a class of functions satisfying
\[
|f(x, t) - f(x, s)| \le d(s, t) F(x) \qquad \forall s, t \in T, \; x \in \mathcal{X}
\]
for some metric $d$ on the index set, and a function $F$ on the sample space $\mathcal{X}$. Then, for any norm $\|\cdot\|$,
\[
N_{[\,]}(2\epsilon \|F\|, \mathcal{F}, \|\cdot\|) \le N(\epsilon, T, d).
\]
Proof.
Let $t_1, \ldots, t_k$ be an $\epsilon$-net for $T$ with respect to $d$. This can be done with $k = N(\epsilon, T, d)$ points. Then the brackets $[f(\cdot, t_j) - \epsilon F, f(\cdot, t_j) + \epsilon F]$ cover $\mathcal{F}$, and are of size at most $2\epsilon \|F\|$. □
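The bracket construction in this proof can be checked on a toy example. The sketch below assumes $f(x, t) = |x - t|$ on $[0,1] \times [0,1]$, for which $|f(x,t) - f(x,s)| \le |s - t|$, so $F \equiv 1$ and $d(s,t) = |s - t|$. The example function, the grid and the function name are illustrative choices, not taken from the notes.

```python
import math

def lipschitz_brackets(eps, grid=101):
    """Check the bracket construction for f(x, t) = |x - t| on [0, 1] x [0, 1].

    Returns the number k of net points and whether every f(., t) on the grid
    lies in some bracket [f(., t_j) - eps, f(., t_j) + eps].
    """
    k = math.ceil(1.0 / (2.0 * eps) - 1e-9)
    centers = [min(1.0, (2 * j + 1) * eps) for j in range(k)]   # eps-net of [0, 1]
    points = [i / (grid - 1) for i in range(grid)]
    tol = 1e-9                                                  # floating-point slack
    for t in points:
        c = min(centers, key=lambda s: abs(s - t))              # nearest net point
        if abs(c - t) > eps + tol:
            return k, False
        for x in points:
            if abs(abs(x - t) - abs(x - c)) > eps + tol:
                return k, False
    return k, True

if __name__ == "__main__":
    print(lipschitz_brackets(0.1))
    print(lipschitz_brackets(0.05))
```

With $\epsilon = 0.1$ the net has five points, so five brackets of sup-norm width $2\epsilon$ suffice in this example.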
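Lemma 5.3 Suppose that for every $\theta$ in a compact subset $U$ of $\mathbb{R}^d$ the class $\mathcal{F}_\theta = \{f_{\theta,\gamma} : \gamma \in \Gamma\}$ satisfies
\[
\log N_{[\,]}(\epsilon, \mathcal{F}_\theta, L_2(P)) \le K \Big(\frac{1}{\epsilon}\Big)^W
\]
for a constant $W < 2$ and $K$ not depending on $\theta$. Suppose in addition that for every $\theta_1, \theta_2$, and $\gamma \in \Gamma$
\[
|f_{\theta_1,\gamma} - f_{\theta_2,\gamma}| \le F |\theta_1 - \theta_2|
\]
for a function $F$ with $P F^2 < \infty$. Then $\mathcal{F} = \cup_{\theta \in U} \mathcal{F}_\theta$ satisfies
\[
\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)) \le d \log(1/\epsilon) + K \Big(\frac{1}{\epsilon}\Big)^W.
\]
Proof.
Fix $d = 2$. Take a square $Q$ with side $L$ such that $U \subset Q$. Since $U \subset \mathbb{R}^2$ is compact, this is doable. Find the smallest integer $M$ such that $L/M < \epsilon$ and divide $Q$ into $M^2$ subsquares. Let $S_i$ denote the $i$-th square. Now $L/(M - 1) \ge \epsilon$ implies $M^2 \le 4L^2/\epsilon^2$.
Consider $S_i \cap U$ for each $i$ and suppose that it is non-empty. Fix $\theta_i \in U \cap S_i$ and consider $\epsilon$-brackets $[l_1, u_1], \ldots, [l_{N_i}, u_{N_i}]$, where $N_i \le \exp\big(K (1/\epsilon)^W\big)$ is the number of $\epsilon$-brackets needed to cover $\mathcal{F}_{\theta_i}$. Consider any other $\theta \in U \cap S_i$. Now
\[
|f(\theta, \gamma) - f(\theta_i, \gamma)| \le F \|\theta - \theta_i\|, \qquad \forall \gamma,
\]
and
\[
f(\theta_i, \gamma) - F \|\theta - \theta_i\| \le f(\theta, \gamma) \le f(\theta_i, \gamma) + F \|\theta - \theta_i\|.
\]
Fix $\gamma$. Then there exists an $\epsilon$-bracket $[l_j, u_j]$ such that $l_j \le f(\theta_i, \gamma) \le u_j$ ($j \le N_i$). Thus
\[
l_j - F \|\theta - \theta_i\| \le f(\theta, \gamma) \le u_j + F \|\theta - \theta_i\|.
\]
Now, since $\theta$ and $\theta_i$ lie in a square of side length $L/M < \epsilon$, we have $\|\theta - \theta_i\| \le \sqrt{2L^2/M^2} \le \sqrt{2}\,\epsilon$ and hence
\[
l_j - \sqrt{2}\,\epsilon F \le f(\theta, \gamma) \le u_j + \sqrt{2}\,\epsilon F.
\]
It follows that $[l_j - \sqrt{2}\,\epsilon F, u_j + \sqrt{2}\,\epsilon F]_{j=1}^{N_i}$ form brackets for the class
\[
\bigcup_{\theta \in U \cap S_i} \mathcal{F}_\theta.
\]
Furthermore note that
\[
\big\| u_j + \sqrt{2}\,\epsilon F - \big(l_j - \sqrt{2}\,\epsilon F\big) \big\|_{L_2(P)}
\le \|u_j - l_j\|_{L_2(P)} + 2\sqrt{2}\,\epsilon \|F\|_{L_2(P)}
\le \epsilon \big(1 + 2\sqrt{2}\, \|F\|_{L_2(P)}\big).
\]
Since there are at most $4L^2/\epsilon^2$ subsquares, it follows that at most $(4L^2/\epsilon^2) \exp\big(K (1/\epsilon)^W\big)$ brackets of size $\epsilon (1 + 2\sqrt{2}\, \|F\|_{L_2(P)})$ are needed to cover $\mathcal{F} = \cup_{\theta \in U} \mathcal{F}_\theta$. Conclude that the number of $\epsilon$-brackets needed is dominated by a constant times $(1/\epsilon^2) \exp\big(K (1/\epsilon)^W\big)$. □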
Theorem 5.2 (Vapnik-Chervonenkis (1981), Pollard (1981), Giné-Zinn (1984))
Let $\mathcal{F}$ be a $P$-measurable class of measurable functions that is $L_1(P)$-bounded. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli if and only if
(i) $P^* F < \infty$ and
(ii) $\lim_{n \to \infty} \dfrac{E^* \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))}{n} = 0$
for all $M < \infty$ and $\epsilon > 0$, where $\mathcal{F}_M$ is the class $\{f 1\{F \le M\} : f \in \mathcal{F}\}$.
Proof.
By the symmetrization inequality given by Corollary 4.1, measurability of the class $\mathcal{F}$, and Fubini's theorem,
\[
\begin{aligned}
E^* \|\mathbb{P}_n - P\|_{\mathcal{F}}
&\le 2 E_X E_\varepsilon \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{F}} \\
&\le 2 E_X E_\varepsilon \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{F}_M} + 2 P^* F 1\{F > M\},
\end{aligned}
\]
by the triangle inequality, for every $M > 0$. For sufficiently large $M$ the last term is arbitrarily small. To prove convergence in mean, it suffices to show that the first term converges to zero for fixed $M$. To do this, fix $X_1, \ldots, X_n$. If $\mathcal{G}$ is an $\epsilon$-net over $\mathcal{F}_M$ in $L_2(\mathbb{P}_n)$, then it is also an $\epsilon$-net in $L_1(\mathbb{P}_n)$ (since $L_2(\mathbb{P}_n)$ norms are larger than $L_1(\mathbb{P}_n)$ norms via Cauchy-Schwarz). Hence it follows that
\[
E_\varepsilon \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{F}_M}
\le E_\varepsilon \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\mathcal{G}} + \epsilon. \tag{5.1}
\]
The cardinality of $\mathcal{G}$ can be chosen equal to $N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))$. We now use the maximal inequality of Corollary 3.1 with $\psi_2(x) = \exp(x^2) - 1$, to conclude that the right side of the last display is bounded by a constant multiple of
\[
\sqrt{1 + \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))} \; \sup_{f \in \mathcal{G}} \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big\|_{\psi_2 | X} + \epsilon,
\]
where the Orlicz norms $\|\cdot\|_{\psi_2 | X}$ are taken over $\varepsilon_1, \ldots, \varepsilon_n$ with $X_1, \ldots, X_n$ fixed. By Example 3.1, these $\psi_2$-norms can be bounded by $\sqrt{6/n}\, (\mathbb{P}_n f^2)^{1/2} \le \sqrt{6/n}\, M$ since $f \in \mathcal{G} \subset \mathcal{F}_M$. Hence the right side of the last display is bounded above by
\[
\sqrt{1 + \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))} \, \sqrt{6/n}\, M + \epsilon \to_p \epsilon
\]
in outer probability. This shows that the left side of (5.1) converges to zero in probability. Since it is bounded by $M$, its expectation with respect to $X_1, \ldots, X_n$ converges to zero by the dominated convergence theorem.
This concludes the proof that $E^* \|\mathbb{P}_n - P\|_{\mathcal{F}} \to 0$. To see that $\|\mathbb{P}_n - P\|_{\mathcal{F}}^*$ also converges to zero almost surely, note that it is a reverse sub-martingale with respect to a suitable filtration, and hence almost sure convergence follows from the reverse sub-martingale convergence theorem. □
Before treating examples, it is useful to specialize Theorem 5.2 to the class of indicator functions of some class of sets $\mathcal{C}$. In this setting the random entropy condition can be restated in terms of a quantity which will arise naturally in Chapter 8 in the context of VC theory: for $n$ points $x_1, \ldots, x_n$ in $\mathcal{X}$ and a class $\mathcal{C}$ of subsets of $\mathcal{X}$, set
\[
\Delta_n^{\mathcal{C}}(x_1, \ldots, x_n) \stackrel{\mathrm{def}}{=} \#\{C \cap \{x_1, \ldots, x_n\} : C \in \mathcal{C}\}.
\]
Then the sufficiency part of the following theorem follows from Theorem 5.2.
Theorem 5.3 (Vapnik-Chervonenkis-Steele GC theorem) If $\mathcal{C}$ is a $P$-measurable class of sets, then the following are equivalent:
(i) $\|\mathbb{P}_n - P\|_{\mathcal{C}}^* \to_{a.s.} 0$,
(ii) $n^{-1} E \log \Delta_n^{\mathcal{C}}(X_1, \ldots, X_n) \to 0$.
Proof.
We first show that (ii) implies (i). Since $\mathcal{F} = \{1_C : C \in \mathcal{C}\}$ has constant envelope function 1, the first condition of Theorem 5.2 holds trivially and we need only show that (ii) implies the random entropy condition in this case. To see this, note that for any $r > 0$
\[
N(\epsilon, \mathcal{F}, L_r(\mathbb{P}_n)) \le N(\epsilon^{r^{-1} \vee 1}, \mathcal{F}, L_\infty(\mathbb{P}_n)) \le \big(2/\epsilon^{r^{-1} \vee 1}\big)^n, \tag{a}
\]
where
\[
\|f - g\|_{L_r(\mathbb{P}_n)} = \{\mathbb{P}_n |f - g|^r\}^{1/(r \vee 1)}, \qquad
\|f - g\|_{L_\infty(\mathbb{P}_n)} = \max_{1 \le i \le n} |f(X_i) - g(X_i)|.
\]
Now if $C_1, \ldots, C_k$, with $k = N(\epsilon, \mathcal{C}, L_\infty(\mathbb{P}_n))$, form an $\epsilon$-net for $\mathcal{C}$ for the $L_\infty(\mathbb{P}_n)$ metric, and $\epsilon < 1$, then if $C \in \mathcal{C}$ satisfies
\[
\max_{1 \le i \le n} \big(1_{C - C_j}(X_i) + 1_{C_j - C}(X_i)\big) = \max_{1 \le i \le n} |1_C(X_i) - 1_{C_j}(X_i)| < \epsilon
\]
for some $j \in \{1, \ldots, k\}$, then the left side must be zero, and hence no $X_i$ is in any $C - C_j$ or $C_j - C$. Thus it follows that
\[
\begin{aligned}
k &= \#\{\{X_1, \ldots, X_n\} \cap C_j, \text{ for some } C_j, \; j = 1, \ldots, k\} \\
&= \#\{\{X_1, \ldots, X_n\} \cap C, \; C \in \mathcal{C}\};
\end{aligned}
\]
in other words, for all $\epsilon < 1$,
\[
\Delta_n^{\mathcal{C}}(X_1, \ldots, X_n) = N(\epsilon, \mathcal{C}, L_\infty(\mathbb{P}_n)). \tag{b}
\]
Combining (a) and (b), we see that condition (ii) of Theorem 5.3 implies the random entropy condition of Theorem 5.2, and sufficiency of (ii) follows. □
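The quantity $\Delta_n^{\mathcal{C}}$ can be computed directly for simple classes on the real line. The sketch below counts the subsets of a finite point set picked out by half-lines $(-\infty, t]$ and by intervals $(s, t]$; the function names are illustrative. For $n$ distinct points the counts are $n + 1$ and $n(n+1)/2 + 1$, so $n^{-1} \log \Delta_n^{\mathcal{C}} \to 0$ for both classes.

```python
import math

def delta_half_lines(points):
    """Number of distinct subsets {x_i : x_i <= t} picked out by C = {(-inf, t]}."""
    xs = sorted(set(points))
    thresholds = [xs[0] - 1.0] + xs      # one threshold below all points, then each point
    return len({frozenset(p for p in points if p <= t) for t in thresholds})

def delta_intervals(points):
    """Number of distinct subsets {x_i : s < x_i <= t} picked out by C = {(s, t]}."""
    xs = sorted(set(points))
    cuts = [xs[0] - 1.0] + xs
    return len({frozenset(p for p in points if s < p <= t)
                for s in cuts for t in cuts if s <= t})

if __name__ == "__main__":
    pts = [0.3, 0.1, 0.7, 0.5]
    print(delta_half_lines(pts), delta_intervals(pts))
    for n in (10, 100, 1000):
        print(n, math.log(n + 1) / n)    # n^{-1} log Delta_n for half-lines
```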
Example 5.1 Suppose that $\mathcal{X} = \mathbb{R}^d$ and
\[
\mathcal{F} = \{x \mapsto 1_{(-\infty, t]}(x) : t \in \mathbb{R}^d\} = \{1_C : C \in \mathcal{C}\}
\]
where $\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}^d\}$. Then, as will be proved in Chapter 7, for all probability measures $Q$ on $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^d, \mathcal{B}_d)$,
\[
N(\epsilon, \mathcal{F}, L_1(Q)) \le M (K/\epsilon)^d
\]
for constants $M = M_d$ and $K$ and every $\epsilon > 0$. Therefore
\[
\log N(\epsilon, \mathcal{F}, L_1(Q)) \le \log M + d \log(K/\epsilon),
\]
and the conditions of Theorem 5.2 hold easily with the constant envelope function $F \equiv 1$. Thus $\mathcal{F}$ is $P$-GC for all $P$ on $(\mathbb{R}^d, \mathcal{B}_d)$. Note that for $f_t = 1_{(-\infty, t]} \in \mathcal{F}$, the corresponding functions $t \mapsto P(f_t) = P(X \le t)$ and $t \mapsto \mathbb{P}_n(f_t) = \mathbb{P}_n(X \le t)$ are the classical distribution function of $X \sim P$ and the corresponding classical empirical distribution function. Thus the conclusion may be restated as
\[
\|\mathbb{P}_n(X \le \cdot) - P(X \le \cdot)\|_\infty = \sup_{t \in \mathbb{R}^d} |\mathbb{P}_n(X \le t) - P(X \le t)| \to_{a.s.} 0.
\]
Example 5.2 Suppose that $\mathcal{X} = \mathbb{R}^d$ and
\[
\mathcal{F} = \{x \mapsto 1_{(s, t]}(x) : s, t \in \mathbb{R}^d, \; s \le t\} = \{1_C : C \in \mathcal{C}\}
\]
where $\mathcal{C} = \{(s, t] : s, t \in \mathbb{R}^d, \; s \le t\}$. Then, as will be proved in Chapter 7, for all probability measures $Q$ on $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^d, \mathcal{B}_d)$,
\[
N(\epsilon, \mathcal{F}, L_1(Q)) \le M (K/\epsilon)^{2d}
\]
for constants $M = M_d$ and $K$ and every $\epsilon > 0$. Therefore
\[
\log N(\epsilon, \mathcal{F}, L_1(Q)) \le \log M + 2d \log(K/\epsilon),
\]
and the conditions of Theorem 5.2 again hold with the constant envelope function $F \equiv 1$. Thus $\mathcal{F}$ is $P$-GC for all $P$ on $(\mathbb{R}^d, \mathcal{B}_d)$. Since $\mathcal{F}$ is in a one-to-one correspondence with the class of sets $\mathcal{C}$, the class of all (upper closed) rectangles in this case, we also say that $\mathcal{C}$ is $P$-Glivenko-Cantelli for all $P$.
5.2 Universal and Uniform Glivenko-Cantelli classes
If $\mathcal{F}$ is $P$-Glivenko-Cantelli for all $P$ on $(\mathcal{X}, \mathcal{A})$, then we say that $\mathcal{F}$ is a universal Glivenko-Cantelli class.
A still stronger Glivenko-Cantelli property is formulated in terms of the uniformity of the convergence in the probability measure $P$ on $(\mathcal{X}, \mathcal{A})$. We let $\mathcal{P} = \mathcal{P}(\mathcal{X}, \mathcal{A})$ be the set of all probability measures on the measurable space $(\mathcal{X}, \mathcal{A})$. We say that $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class if for all $\epsilon > 0$
\[
\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} Pr_P^*\Big(\sup_{m \ge n} \|\mathbb{P}_m - P\|_{\mathcal{F}} > \epsilon\Big) \to 0 \qquad \text{as } n \to \infty.
\]
For $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$, $n = 1, 2, \ldots$, and $r \in (0, \infty)$, we define on $\mathcal{F}$ the pseudo-distances
\[
e_{x,r}(f, g) = \Big\{ n^{-1} \sum_{i=1}^n |f(x_i) - g(x_i)|^r \Big\}^{r^{-1} \wedge 1},
\]
\[
e_{x,\infty}(f, g) = \max_{1 \le i \le n} |f(x_i) - g(x_i)|, \qquad f, g \in \mathcal{F}.
\]
Let $N(\epsilon, \mathcal{F}, e_{x,r})$ denote the $\epsilon$-covering number of $(\mathcal{F}, e_{x,r})$, $\epsilon > 0$. Then define, for $n = 1, 2, \ldots$, $\epsilon > 0$, and $r \in (0, \infty]$, the quantities
\[
N_{n,r}(\epsilon, \mathcal{F}) \stackrel{\mathrm{def}}{=} \sup_{x \in \mathcal{X}^n} N(\epsilon, \mathcal{F}, e_{x,r}).
\]
Theorem 5.4 (Dudley, Giné and Zinn (1991)) Suppose that $\mathcal{F}$ is a class of uniformly bounded functions such that $\mathcal{F}$ is image admissible Suslin. Then the following are equivalent:
(a) $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class.
(b) $\dfrac{\log N_{n,r}(\epsilon, \mathcal{F})}{n} \to 0$ for all $\epsilon > 0$ for some (all) $r \in (0, \infty]$.
Proof.
We first show that (b) with $r=1$ implies (a).
Let $\{\epsilon_i\}$ be a sequence of
Rademacher random variables independent of the $X_i$. By uniform boundedness of $\mathcal F$, $M=\sup_{f\in\mathcal F}\|f\|_\infty<\infty$. By Lemma 4.2 with $x=n\epsilon$ and boundedness of $\mathcal F$ it follows that for all
$\epsilon>0$ and for all $n$ sufficiently large we have
$$Pr\{\|\mathbb P_n-P\|_{\mathcal F}>\epsilon\}\le4\,Pr\Big\{\Big\|\sum_{i=1}^n\epsilon_if(X_i)\Big\|_{\mathcal F}>n\epsilon/4\Big\}.$$
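For $n=1,2,\ldots$, let $x_n(\omega)=(X_1(\omega),\ldots,X_n(\omega))\in\mathcal X^n$.
By definition of $N(\epsilon,\mathcal F,e_{x,1})$, for
each $\omega$ there is a function $\pi_n=\pi_n^\omega:\mathcal F\to\mathcal F$ with $\mathrm{card}\{\pi_nf:f\in\mathcal F\}=N(\epsilon/8,\mathcal F,e_{x_n(\omega),1})$
and $e_{x_n(\omega),1}(f,\pi_nf)\le\epsilon/8$, $f\in\mathcal F$. By Hoeffding's inequality:
$$Pr\Big\{\Big\|\sum_{i=1}^n\epsilon_if(X_i)\Big\|_{\mathcal F}>n\epsilon/4\Big\}\le E_P\,Pr_\epsilon\Big\{\Big\|\sum_{i=1}^n\epsilon_i\pi_nf(X_i)\Big\|_{\mathcal F}>n\epsilon/8\Big\}$$
$$\le2E\{N(\epsilon/8,\mathcal F,e_{x_n(\omega),1})\}\exp(-n\epsilon^2/(128M^2)),$$
where the interchange of $E_P$ and $E_\epsilon$ is justified by the image admissible Suslin condition.
By the hypothesis (b) with $r=1$, for all $n$ sufficiently large we have $N(\epsilon/8,\mathcal F,e_{x,1})\le\exp(n\epsilon^2/(256M^2))$ for all $x\in\mathcal X^n$. Therefore we can conclude that
$$Pr\{\|\mathbb P_n-P\|_{\mathcal F}>\epsilon\}\le8\exp(-n\epsilon^2/(256M^2))$$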
for sufficiently large $n$. Summing up over $n$, it follows that there is an $N$ so that for
$n\ge N$ we have
$$\sup_{P\in\mathcal P}\sum_{k\ge n}Pr\{\|\mathbb P_k-P\|_{\mathcal F}>\epsilon\}\le8\sum_{k=n}^\infty\exp(-k\epsilon^2/(256M^2))\le8\,\frac{\exp(-n\epsilon^2/(256M^2))}{1-\exp(-\epsilon^2/(256M^2))},$$
where the right term goes to 0 as $n\to\infty$. This completes the proof of (a).
The proof that (a) implies (b) uses Gaussian symmetrization techniques, so it will be
treated in Chapter 9. □
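To get a feel for how slowly the final bound in the proof takes effect, the geometric tail can be evaluated directly. This is a minimal sketch: the function names and the sample values of $\epsilon$ and $M$ are assumptions made for illustration, and the bound itself is only valid for $n$ beyond the (unspecified) threshold $N$ of the proof.

```python
import math


def tail_bound(n: int, eps: float, M: float) -> float:
    """8 * sum_{k >= n} exp(-k eps^2 / (256 M^2)), summed in closed form."""
    c = eps ** 2 / (256.0 * M ** 2)
    return 8.0 * math.exp(-n * c) / (1.0 - math.exp(-c))


def first_n_below(target: float, eps: float, M: float) -> int:
    """Smallest n >= 1 with tail_bound(n, eps, M) <= target (target > 0)."""
    c = eps ** 2 / (256.0 * M ** 2)
    guess = math.log(8.0 / (target * (1.0 - math.exp(-c)))) / c
    n = max(1, math.ceil(guess))
    # Correct possible off-by-one effects of floating point rounding.
    while n > 1 and tail_bound(n - 1, eps, M) <= target:
        n -= 1
    while tail_bound(n, eps, M) > target:
        n += 1
    return n
```

For instance, `first_n_below(0.01, 1.0, 0.25)` returns the first sample size at which the bound drops below one percent for these illustrative values; the constant 256 in the exponent makes this threshold large even for moderate $\epsilon/M$.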
5.3 Preservation of the GC property
Our goal now is to present several results concerning the stability of the Glivenko-Cantelli
property of one or more classes of functions under composition with functions $\phi$.
Theorem 5.5 Suppose that $\mathcal F_1,\ldots,\mathcal F_k$ are $P$–Glivenko-Cantelli classes of functions,
and that $\phi:\mathbb R^k\to\mathbb R$ is continuous. Then $\mathcal H\equiv\phi(\mathcal F_1,\ldots,\mathcal F_k)$ is $P$–Glivenko-Cantelli
provided that it has an integrable envelope function.
Proof.
We will prove the thesis for classes of functions $\mathcal F_i$ which are appropriately measurable.
Let $F_1,\ldots,F_k$ and $H$ be integrable envelopes for $\mathcal F_1,\ldots,\mathcal F_k$ and $\mathcal H$ respectively,
and set $F=F_1\vee\ldots\vee F_k$. For $M\in(0,\infty)$, define
$$\mathcal H_M\equiv\{\phi(f)1_{[F\le M]}:f=(f_1,\ldots,f_k)\in\mathcal F_1\times\ldots\times\mathcal F_k\equiv\mathcal F\}.$$
Now
$$\|(\mathbb P_n-P)\phi(f)\|_{\mathcal F}\le(\mathbb P_n+P)H1_{[F>M]}+\|(\mathbb P_n-P)h\|_{\mathcal H_M}.$$
The expectation of the first term on the right converges to 0 as $M\to\infty$. Hence it suffices
to show that $\mathcal H_M$ is $P$–Glivenko-Cantelli for every fixed $M$. Let $\delta=\delta(\epsilon)$ be the $\delta$ of
Lemma 5.5 below for $\phi:[-M,M]^k\to\mathbb R$, $\epsilon>0$, and $\|\cdot\|$ the $L_1(\mathbb P_n)$-norm $\|\cdot\|_1$. Then
for any $(f_j,g_j)\in\mathcal F_j$, $j=1,\ldots,k$,
$$\mathbb P_n|f_j-g_j|1_{[F_j\le M]}\le\frac{\delta}{k},\qquad j=1,\ldots,k,$$
implies that
$$\mathbb P_n|\phi(f_1,\ldots,f_k)-\phi(g_1,\ldots,g_k)|1_{[F\le M]}\le\epsilon.$$
It follows that
$$N(\epsilon,\mathcal H_M,L_1(\mathbb P_n))\le\prod_{j=1}^kN\Big(\frac{\delta}{k},\mathcal F_j1_{[F_j\le M]},L_1(\mathbb P_n)\Big).$$
Thus $E^*\log N(\epsilon,\mathcal H_M,L_1(\mathbb P_n))=o(n)$ for every $\epsilon>0$, $M<\infty$. This implies that
$$E^*\log N(\epsilon,(\mathcal H_M)_N,L_1(\mathbb P_n))=o(n)$$
for $(\mathcal H_M)_N$ the functions $h1\{H\le N\}$ for $h\in\mathcal H_M$. Thus $\mathcal H_M$ is strong Glivenko-Cantelli
for $P$ by Theorem 5.1. This concludes the proof that $\mathcal H=\phi(\mathcal F)$ is weak Glivenko-Cantelli.
Because it has an integrable envelope, it is strong Glivenko-Cantelli. □
Theorem 5.6 (Dudley (1998a)) Suppose that $\mathcal F$ is a $P$–Glivenko-Cantelli class for $P$
with $\|P\|_{\mathcal F}<\infty$, $J$ is a possibly unbounded interval including the ranges of all $f\in\mathcal F$, $\phi$
is continuous and monotone on $J$, and for some finite constants $c,d$, $|\phi(y)|\le c|y|+d$ for
all $y\in J$. Then $\phi(\mathcal F)$ is also a strong Glivenko-Cantelli class for $P$.
Proof.
Given classes $\mathcal F_1,\ldots,\mathcal F_k$ of functions $f_i:\mathcal X\to\mathbb R$ and a function
$\phi:\mathbb R^k\to\mathbb R$, let $\phi(\mathcal F_1,\ldots,\mathcal F_k)$ be the class of functions $x\mapsto\phi(f_1(x),\ldots,f_k(x))$, where
$f_i\in\mathcal F_i$, $i=1,\ldots,k$. With this notation the thesis follows from Theorem 5.5. □
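Proposition 5.1 (Dudley (1998b)) Suppose that $\mathcal F$ is a strong Glivenko-Cantelli class
for $P$ with $\|P\|_{\mathcal F}<\infty$, and $g$ is a fixed bounded function ($\|g\|_\infty=K<\infty$, $K>0$). Then
the class of functions $g\cdot\mathcal F\equiv\{g\cdot f:f\in\mathcal F\}$ is a strong Glivenko-Cantelli class for $P$.
Proof.
Take $\mathcal F_1=\{g\}$, $\mathcal F_2=\mathcal F$, and $\phi:\mathbb R^2\to\mathbb R$ given by $\phi(u,v)=uv$, which is continuous,
in Theorem 5.5. Now $\mathcal F_1$ and $\mathcal F_2$ are GC and, for an integrable envelope $F$ of $\mathcal F$,
$$|g\cdot f|=|\phi(f_1,f_2)|\le KF\qquad\forall f_1\in\mathcal F_1,\ f_2\in\mathcal F_2,$$
and $P(KF)<\infty$. By Theorem 5.5 it follows that $g\cdot\mathcal F$ is a strong Glivenko-Cantelli
class for $P$. □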
Proposition 5.2 (Giné and Zinn (1984)) Suppose that $\mathcal F$ is a uniformly bounded
strong Glivenko-Cantelli class for $P$, and $g\in L_1(P)$ is a fixed function. Then the class
of functions $g\cdot\mathcal F\equiv\{g\cdot f:f\in\mathcal F\}$ is a strong Glivenko-Cantelli class for $P$.
Proof.
Take $\mathcal F_1=\{g\}$, $\mathcal F_2=\mathcal F$, and $\phi:\mathbb R^2\to\mathbb R$ given by $\phi(u,v)=uv$, which is continuous,
in Theorem 5.5. Now $\mathcal F_1$ and $\mathcal F_2$ are GC and, in view of Theorem 5.5, it suffices to check
that $\phi(\mathcal F_1,\mathcal F_2)$ has an integrable envelope. Since $\mathcal F$ is uniformly bounded, $\|f\|_\infty\le K$ for
all $f\in\mathcal F$, for some $K>0$, and it is easily checked that $K\cdot|g|$ is an integrable envelope
for $g\cdot\mathcal F$. □
Lemma 5.4 Any strong $P$–Glivenko-Cantelli class $\mathcal F$ is totally bounded in $L_1(P)$ if and
only if $\|P\|_{\mathcal F}<\infty$. Furthermore, for any $r\in(1,\infty)$, if $\mathcal F$ has an envelope that is contained
in $L_r(P)$, then $\mathcal F$ is also totally bounded in $L_r(P)$.
Proof.
A class that is totally bounded is also bounded. Thus for the first statement
we only need to prove that a strong Glivenko-Cantelli class $\mathcal F$ with $\|P\|_{\mathcal F}<\infty$ is totally
bounded in $L_1(P)$.
It is well known that such a class has an integrable envelope (e.g. see Giné and Zinn
(1983) to conclude first that $P^*\|f-Pf\|_{\mathcal F}<\infty$; next the claim follows from the
triangle inequality $\|f\|_{\mathcal F}\le\|f-Pf\|_{\mathcal F}+\|P\|_{\mathcal F}$). There is no loss of generality in assuming
that the class $\mathcal F$ possesses an envelope that is finite everywhere.
Now, suppose that there exists a sequence of finitely discrete probability measures
$P_n$ such that
$$L_n=\sup\{|(P_n-P)|f-g||:f,g\in\mathcal F\}\to0.$$
Then for every $\epsilon>0$, there exists $n_0$ such that $L_{n_0}<\epsilon$. For this $n_0$ there exists a finite
$\epsilon$–net $f_1,\ldots,f_N$ over $\mathcal F$ relative to the $L_1(P_{n_0})$–norm, because restricted to the support
of $P_{n_0}$ the functions $f$ are uniformly bounded by the finite envelope and hence covering
$\mathcal F$ in $L_1(P_{n_0})$ is like covering a compact set in $\mathbb R^{n_0}$. Now, for any $f\in\mathcal F$ there is an $f_i$ such
that
$$P|f-f_i|\le L_{n_0}+P_{n_0}|f-f_i|<2\epsilon.$$
It follows that $\mathcal F$ is totally bounded in $L_1(P)$.
To conclude the proof it suffices to select a sequence $P_n$. This can be constructed
as a sequence of realizations of the empirical measure if we know that the class $|\mathcal F-\mathcal F|$
is $P$–GC. It is immediate from the definition of a Glivenko-Cantelli class that $\mathcal F-\mathcal F$ is
$P$–GC. Next, by Dudley's theorem, and by previous propositions, the classes $(\mathcal F-\mathcal F)^+$
and $(\mathcal F-\mathcal F)^-$ are $P$–Glivenko-Cantelli. Then the sum of these two classes is $P$–GC and
hence the proof is complete.
If $\mathcal F$ has an envelope in $L_r(P)$, then $\mathcal F$ is totally bounded in $L_r(P)$ if the class $\mathcal F_M$ of
functions $f\cdot1\{F\le M\}$ is totally bounded in $L_r(P)$ for every fixed $M$. We have proved that
the class $\mathcal F_M$ is $P$–GC and hence this class is totally bounded in $L_1(P)$. But then it is
also totally bounded in $L_r(P)$, because $P|f|^r\le P|f|\,M^{r-1}$ for any $f$ that is bounded by $M$ and
we can construct the $\epsilon$–net over $\mathcal F_M$ in $L_1(P)$ to consist of functions that are bounded
by $M$. □
Lemma 5.5 Suppose that $\varphi:K\to\mathbb R$ is continuous and $K\subset\mathbb R^k$ is compact. Then for
every $\epsilon>0$ there exists $\delta>0$ such that for all $n$ and for all $a_1,\ldots,a_n,b_1,\ldots,b_n\in K\subset\mathbb R^k$,
$$\frac1n\sum_{i=1}^n\|a_i-b_i\|<\delta\ \Rightarrow\ \frac1n\sum_{i=1}^n|\varphi(a_i)-\varphi(b_i)|<\epsilon.$$
Here $\|\cdot\|$ can be any norm on $\mathbb R^k$; in particular it can be $\|x\|_r=\big(\sum_{i=1}^k|x_i|^r\big)^{1/r}$, $r\in[1,+\infty)$, or $\|x\|_\infty\equiv\max_{1\le i\le k}|x_i|$ for $x=(x_1,\ldots,x_k)\in\mathbb R^k$.
Proof.
Let $U_n$ be uniform on $\{1,\ldots,n\}$, and set $X_n=a_{U_n}$, $Y_n=b_{U_n}$. Then we can
write
$$\frac1n\sum_{i=1}^n\|a_i-b_i\|=E\|X_n-Y_n\|$$
and
$$\frac1n\sum_{i=1}^n|\varphi(a_i)-\varphi(b_i)|=E|\varphi(X_n)-\varphi(Y_n)|.$$
Hence, it suffices to show that for every $\epsilon>0$ there exists $\delta>0$ such that for all $(X,Y)$,
random vectors in $K\subset\mathbb R^k$,
$$E\|X-Y\|<\delta\ \Rightarrow\ E|\varphi(X)-\varphi(Y)|<\epsilon.$$
Suppose not. Then for some $\epsilon>0$ and for all $m=1,2,\ldots$ there exists $(X_m,Y_m)$
such that
$$E\|X_m-Y_m\|<\frac1m,\qquad E|\varphi(X_m)-\varphi(Y_m)|\ge\epsilon.$$
But, since $\{(X_m,Y_m)\}$ is tight, there exists a subsequence $(X_{m'},Y_{m'})\to_d(X,Y)$. Then, it follows that
$$E\|X-Y\|=\lim_{m'\to\infty}E\|X_{m'}-Y_{m'}\|=0$$
so that $X=Y$ a.s., while, on the other hand,
$$0=E|\varphi(X)-\varphi(Y)|=\lim_{m'\to\infty}E|\varphi(X_{m'})-\varphi(Y_{m'})|\ge\epsilon>0.$$
This contradiction proves the claim. □
Another potentially useful preservation theorem is one based on building up Glivenko-Cantelli classes from the restriction of a class of functions to elements of a partition of
the sample space. The following theorem is related to the result of van der Vaart (1996)
for Donsker classes.
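Theorem 5.7 Suppose that $\mathcal F$ is a class of functions on $(\mathcal X,\mathcal A,P)$, and $\{\mathcal X_i\}$ is a partition
of $\mathcal X$: $\bigcup_{i=1}^\infty\mathcal X_i=\mathcal X$, $\mathcal X_i\cap\mathcal X_j=\emptyset$ for $i\ne j$. Suppose that $\mathcal F_j\equiv\{f\cdot1_{\mathcal X_j}:f\in\mathcal F\}$ is
$P$–Glivenko-Cantelli for each $j$, and $\mathcal F$ has an integrable envelope function $F$. Then $\mathcal F$
is itself $P$–Glivenko-Cantelli.
Proof.
Since
$$f=f\sum_{j=1}^\infty1_{\mathcal X_j}=\sum_{j=1}^\infty f1_{\mathcal X_j},$$
it follows that
$$E^*\|\mathbb P_n-P\|_{\mathcal F}\le\sum_{j=1}^\infty E^*\|\mathbb P_n-P\|_{\mathcal F_j}\to0$$
by the dominated convergence theorem, since each term in the sum converges to zero by
the hypothesis that each $\mathcal F_j$ is $P$–Glivenko-Cantelli, and we have
$$E^*\|\mathbb P_n-P\|_{\mathcal F_j}\le E^*\mathbb P_n(F\cdot1_{\mathcal X_j})+P(F\cdot1_{\mathcal X_j})\le2P(F\cdot1_{\mathcal X_j}),$$
where $\sum_{j=1}^\infty P(F\cdot1_{\mathcal X_j})=P(F)<\infty$. □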
5.4 Exercises
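Exercise 5.1 Show that, if $\mathcal F$ is a class of functions satisfying the bracketing entropy
hypothesis of Theorem 5.1, then $\mathcal F$ has a measurable envelope $F$ satisfying $PF<\infty$.
Solution.
Fix $\epsilon>0$. By the entropy hypothesis $N_{[\,]}(\epsilon,\mathcal F,L_1(P))<\infty$, so $\mathcal F$ is covered by finitely many brackets $[f_j,g_j]$, $j=1,\ldots,n$, and
$$F:=\max_{1\le j\le n}\{|f_j|,|g_j|\}$$
is a measurable envelope; we claim that $\int|F|\,dP<\infty$.
Suppose, by contradiction, that
$$\infty=\int F\,dP=\int\max_{1\le j\le n}\{|f_j|,|g_j|\}\,dP\le\sum_{j=1}^n\int(|f_j|+|g_j|)\,dP.$$
Then $\int|f_j|\,dP=\infty$ or $\int|g_j|\,dP=\infty$ for some $j$,
but in the definition of the bracket $[f_j,g_j]$ the endpoints have finite norms: $\|f_j\|_{P,1},\|g_j\|_{P,1}<\infty$. □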
Exercise 5.2 Suppose that $\mathcal X=\mathbb R$ and that $X\sim P$.
(i) For $0<M<\infty$ and $a\in\mathbb R$, let $f(x,t)=|x-t|$,
and $\mathcal F=\mathcal F_{a,M}=\{f(x,t):|t-a|\le M\}$.
(ii) For $a\in\mathbb R$, let
$f(x,t)=|x-t|-|x-a|$, and $\mathcal F=\mathcal F_a=\{f(x,t):|t-a|\le M\}$.
Show that $N_{[\,]}(\epsilon,\mathcal F,L_1(P))<\infty$ for every $\epsilon>0$ for the classes $\mathcal F$ in (i) if $E|X|<\infty$,
and in (ii) without the hypothesis $E|X|<\infty$. Compute the envelope function for the
two classes.
Solution.
(i) We use Lemma 6.2:
suppose that $\{f(\cdot,t):t\in T\}$ is a class of functions satisfying
$$|f(x,t)-f(x,s)|\le d(s,t)F(x)\qquad\forall s,t\in T,\ x\in\mathcal X,$$
for some metric $d$ on the index set. Then for any norm $\|\cdot\|$ (we choose
the $L_1(P)$ norm),
$$N_{[\,]}(2\epsilon\|F\|,\mathcal F,\|\cdot\|)\le N(\epsilon,T,d)\quad\text{or}\quad N_{[\,]}(\epsilon,\mathcal F,\|\cdot\|)\le N\Big(\frac{\epsilon}{2\|F\|},T,d\Big).$$
In the present case, $T=[a-M,a+M]$ and
$$|f(x,t)-f(x,s)|=\big||x-t|-|x-s|\big|\le|t-s|=d(t,s)\cdot1\qquad\text{for } t,s\in T.$$
Also, $N\big(\frac{\epsilon}{2\|F\|},T,d\big)<\infty$ (observe $F\equiv1$, so $\|F\|=1$), since $T$ is compact with respect to
the Euclidean metric, and the brackets are of the form
$$\Big[f(\cdot,t_j)-\frac{\epsilon}{2\|F\|},\ f(\cdot,t_j)+\frac{\epsilon}{2\|F\|}\Big],$$
where $t_1,t_2,\ldots,t_k$ is an $\frac{\epsilon}{2\|F\|}$–net over $T$. This can be done with $k=N\big(\frac{\epsilon}{2\|F\|},T,d\big)$
points. We check that $P(|u_j|),P(|l_j|)<\infty$:
$$P(|l_j|)=P\Big|f(x,t_j)-\frac{\epsilon}{2\|F\|}\Big|\le\frac{\epsilon}{2\|F\|}+E[|X-t_j|]\le\frac{\epsilon}{2\|F\|}+|t_j|+E[|X|]<\infty,$$
and similarly for $u_j=f(x,t_j)+\frac{\epsilon}{2\|F\|}$.
(ii) In this case $a$ is fixed, $T=[a-M,a+M]$,
$$|f(x,t)-f(x,s)|=\big||x-t|-|x-s|\big|\le|t-s|=d(t,s)\cdot1,$$
and as in (i) we can get $\{[l_j,u_j]\}_{j=1}^k$ with $k=N\big(\frac{\epsilon}{2\|F\|},T,d\big)<\infty$, such that
$$\forall f\in\mathcal F_a,\ \exists j:\ l_j(x)\le f(x)\le u_j(x),\qquad l_j(x)=f(x,t_j)-\frac{\epsilon}{2}\quad\text{and}\quad u_j(x)=f(x,t_j)+\frac{\epsilon}{2}.$$
Here $t_1,t_2,\ldots,t_k$ is an $\frac{\epsilon}{2\|F\|}$–net over $T$ and
$$P(|l_j|)\le\frac{\epsilon}{2}+E\big||X-t_j|-|X-a|\big|<\infty.$$
Note that $|x-t_j|-|x-a|$ is bounded in absolute magnitude by $|t_j-a|$. So the hypothesis
$E[|X|]<\infty$ is not needed to ensure that the edges of the bracket have finite $L_1(P)$
norms. □
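The bracket construction used in part (i) can be checked on a grid. The sketch below is illustrative only: the function names, the choice of equally spaced net centers and the grids are assumptions made here, and it relies on $F\equiv1$ so that each bracket is $f(\cdot,t_j)\pm\epsilon/2$.

```python
import math
from typing import Callable, List, Tuple


def net_points(a: float, M: float, eps: float) -> List[float]:
    """Centers of an (eps/2)-net of T = [a - M, a + M] for the Euclidean metric."""
    k = max(1, math.ceil(2.0 * M / eps))
    width = 2.0 * M / k  # cell width, at most eps
    return [a - M + (j + 0.5) * width for j in range(k)]


def bracket(tj: float, eps: float) -> Tuple[Callable[[float], float], Callable[[float], float]]:
    """Bracket [f(., tj) - eps/2, f(., tj) + eps/2] for f(x, t) = |x - t|."""
    lower = lambda x: abs(x - tj) - eps / 2.0
    upper = lambda x: abs(x - tj) + eps / 2.0
    return lower, upper


def nearest_center(t: float, pts: List[float]) -> float:
    """Net point closest to t; within eps/2 of t whenever t lies in T."""
    return min(pts, key=lambda p: abs(t - p))
```

Because $|f(x,t)-f(x,t_j)|\le|t-t_j|\le\epsilon/2$, every $f(\cdot,t)$ with $t\in T$ lies in the bracket of its nearest center, and each bracket has $L_1(P)$-size $\epsilon$ regardless of $P$.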
Exercise 5.3 Suppose that $\mathcal F$ is a $P$–Glivenko-Cantelli class of measurable functions;
that is, $\|\mathbb P_n-P\|_{\mathcal F}^*\to_{a.s.}0$ as $n\to\infty$. Show that this implies $P^*\|f-Pf\|_{\mathcal F}<\infty$. Thus if
$\|P\|_{\mathcal F}=\sup_{f\in\mathcal F}|Pf|<\infty$, then $P^*F<\infty$ for an envelope function $F$.
Solution.
$$E^*\|f(X_j)-Pf\|_{\mathcal F}=\int_0^\infty P^*(\|f(X_j)-Pf\|_{\mathcal F}>t)\,dt$$
and this is finite if and only if
$$\sum_{n=1}^\infty P^*(\|f(X_j)-Pf\|_{\mathcal F}>n)<\infty.$$
But, since the $X_i$ are i.i.d.,
$$\sum_{n=1}^\infty P^*(\|f(X_j)-Pf\|_{\mathcal F}>n)=\sum_{n=1}^\infty P^*(\|f(X_n)-Pf\|_{\mathcal F}>n)$$
and
$$\sum_{n=1}^\infty P^*(\|f(X_n)-Pf\|_{\mathcal F}>n)<\infty\quad\text{if}\quad P(\limsup_nA_n)=0,$$
where, in the present problem, $A_n=\{\|f(X_n)-Pf\|_{\mathcal F}^*>n\}$. Here we use the second
Borel-Cantelli lemma: if $\{A_n\}$ is a sequence of independent events, then
$$\sum P(A_n)=\infty\ \Rightarrow\ P(\limsup_{n\to\infty}A_n)=1;$$
consequently,
$$P(\limsup_{n\to\infty}A_n)=0\ \Rightarrow\ \sum P(A_n)<\infty.$$
Since the $X_n$'s are i.i.d., the $\{A_n\}$ are indeed independent.
We need to show that $P\big(\big\|\frac1n(f(X_n)-Pf)\big\|_{\mathcal F}^*>1\ \text{i.o.}\big)=0$. This holds if we prove
that $\big\|\frac1n(f(X_n)-Pf)\big\|_{\mathcal F}^*\to_{a.s.}0$. But
$$\frac1n(f(X_n)-Pf)=\frac1n\sum_{i=1}^n(f(X_i)-Pf)-\frac{n-1}{n}\Big\{\frac{1}{n-1}\sum_{i=1}^{n-1}(f(X_i)-Pf)\Big\}.$$
Hence
$$\Big\|\frac1n(f(X_n)-Pf)\Big\|_{\mathcal F}^*\le\|\mathbb P_n-P\|_{\mathcal F}^*+\frac{n-1}{n}\|\mathbb P_{n-1}-P\|_{\mathcal F}^*\to_{a.s.}0$$
by hypothesis, as $n\to\infty$. □
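The algebraic identity that drives the solution, writing the last centered summand as a difference of two centered averages, can be verified numerically for a single function $f$. This is a small sketch; the function name and the sample numbers are assumptions chosen for illustration.

```python
from typing import Sequence


def last_term_via_means(values: Sequence[float], pf: float) -> float:
    """(1/n)(f(X_n) - Pf) computed as (P_n - P)f - ((n-1)/n)(P_{n-1} - P)f.

    values holds f(X_1), ..., f(X_n) with n >= 2, and pf stands for Pf.
    """
    n = len(values)
    pn_f = sum(values) / n                 # P_n f
    pn1_f = sum(values[:-1]) / (n - 1)     # P_{n-1} f
    return (pn_f - pf) - (n - 1) / n * (pn1_f - pf)


vals = [2.0, -1.0, 0.5, 3.0]
direct = (vals[-1] - 0.7) / len(vals)      # (1/n)(f(X_n) - Pf) computed directly
```

Both `direct` and `last_term_via_means(vals, 0.7)` evaluate the same quantity, which is why almost sure convergence of $\|\mathbb P_n-P\|_{\mathcal F}^*$ forces the last summand, divided by $n$, to vanish as well.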
Exercise 5.4 For a class of functions $\mathcal F$ with envelope $F$ and $0<M<\infty$, let $\mathcal F_M=\{f1\{F\le M\}:f\in\mathcal F\}$. Show that the $L_r(Q)$-entropy numbers $N(\epsilon,\mathcal F_M,L_r(Q))$ are smaller than those
of $\mathcal F$ for any probability measure $Q$ and for numbers $M>0$ and $r\ge1$.
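Solution.
Consider now $\Theta\in U\subset\mathbb R^d$, $U$ compact.
Let $\mathcal F_\Theta=\{f_{\Theta,\nu}:\nu\in\Gamma\}$, given
$$\log N_{[\,]}(\epsilon,\mathcal F_\Theta,L_2(P))\le k\Big(\frac1\epsilon\Big)^w,\qquad0<w<2,$$
where $k$ does not depend on $\Theta$. Also: there exists $F$ with $PF^2<\infty$ such that
$$|f_{\Theta_1,\nu}-f_{\Theta_2,\nu}|\le F\,|\Theta_1-\Theta_2|,\qquad\forall\ \Theta_1,\Theta_2,\nu.$$
Then, with
$$\mathcal F=\bigcup_{\Theta\in U}\mathcal F_\Theta,$$
$$\log N_{[\,]}(\epsilon,\mathcal F,L_2(P))<d\cdot\log\Big(\frac1\epsilon\Big)+k\Big(\frac1\epsilon\Big)^w,$$
hence
$$\int_0^\infty\sqrt{\log N_{[\,]}(\epsilon,\mathcal F,L_2(P))}\,d\epsilon<\infty.$$
(Do not care about the "$\infty$" in the upper limit of the integral, because the bracketing
numbers are going down and then the crucial point is the zero.) □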
Exercise 5.5 Suppose that $\mathcal F,\mathcal F_1,\mathcal F_2$ are $P$–Glivenko-Cantelli classes of functions. Show
that the following classes are also $P$–Glivenko-Cantelli:
(i) $\{a_1f_1+a_2f_2:f_i\in\mathcal F_i,|a_i|\le1\}$;
(ii) $\mathcal F_1+\mathcal F_2$;
(iii) the class of functions that are both the pointwise limit and the $L_1(P)$–limit of a
sequence in $\mathcal F$.
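Solution.
(i) Let $\mathcal F_0$ denote the class in (i). Note that
$$|(\mathbb P_n-P)(a_1f_1+a_2f_2)|\le|a_1||(\mathbb P_n-P)f_1|+|a_2||(\mathbb P_n-P)f_2|\le|(\mathbb P_n-P)f_1|+|(\mathbb P_n-P)f_2|$$
(for any $f_1\in\mathcal F_1$, $f_2\in\mathcal F_2$ and $|a_i|\le1$, $i=1,2$) and
$$\sup_{a_1f_1+a_2f_2:\,f_i\in\mathcal F_i,\,|a_i|\le1}|(\mathbb P_n-P)(a_1f_1+a_2f_2)|\le\sum_{i=1,2}\ \sup_{f_i\in\mathcal F_i}|(\mathbb P_n-P)f_i|,$$
that is
$$\|\mathbb P_n-P\|_{\mathcal F_0}\le\|\mathbb P_n-P\|_{\mathcal F_1}+\|\mathbb P_n-P\|_{\mathcal F_2},$$
thus
$$\|\mathbb P_n-P\|_{\mathcal F_0}^*\le\|\mathbb P_n-P\|_{\mathcal F_1}^*+\|\mathbb P_n-P\|_{\mathcal F_2}^*$$
with
$$\|\mathbb P_n-P\|_{\mathcal F_i}^*\to_{a.s.}0,$$
and hence
$$\|\mathbb P_n-P\|_{\mathcal F_0}^*\to_{a.s.}0,$$
showing that $\mathcal F_0$ is a GC class.
(ii) Since $\|\mathbb P_n-P\|_{\mathcal F_1\cup\mathcal F_2}^*\le\|\mathbb P_n-P\|_{\mathcal F_1}^*+\|\mathbb P_n-P\|_{\mathcal F_2}^*$ (as before), the rest of the argument
follows from the previous case.
(iii) Let $\hat{\mathcal F}$ denote the class in (iii) and take $g\in\hat{\mathcal F}$. Then there exist $g_m\in\mathcal F$ with
$$g_m(t)\to g(t)\ \ \forall t\qquad\text{and}\qquad\int|g_m-g|\,dP\to0\ \Rightarrow\ Pg_m\to Pg,$$
and note that
$$\frac1n\sum_{i=1}^ng_m(x_i)-Pg_m\to\frac1n\sum_{i=1}^ng(x_i)-Pg\qquad\text{as } m\to\infty$$
($n$ is fixed), hence
$$\Big|\frac1n\sum_{i=1}^ng_m(x_i)-Pg_m\Big|\to\Big|\frac1n\sum_{i=1}^ng(x_i)-Pg\Big|$$
$$\Rightarrow\ \sup_{m\ge1}\Big|\frac1n\sum_{i=1}^ng_m(x_i)-Pg_m\Big|\ge\Big|\frac1n\sum_{i=1}^ng(x_i)-Pg\Big|$$
$$\Rightarrow\ \|\mathbb P_n-P\|_{\mathcal F}\ge|(\mathbb P_n-P)g|$$
$$\Rightarrow\ \|\mathbb P_n-P\|_{\hat{\mathcal F}}\le\|\mathbb P_n-P\|_{\mathcal F}$$
$$\Rightarrow\ \|\mathbb P_n-P\|_{\hat{\mathcal F}}^*\to_{a.s.}0.\qquad\square$$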
Chapter 6
Donsker Theorems: Uniform CLT's
In this chapter we will develop Donsker theorems, or equivalently, uniform Central Limit
Theorems, for classes of functions and sets. The proofs of these theorems will rely heavily
on the techniques developed in Chapter 3 and Chapter 4. An important by-product of
these proofs will be some new bounds on the expectations of suprema of the empirical
process indexed by functions or sets.
6.1 Uniform Entropy Donsker Theorem
Suppose that $\mathcal F$ is a class of functions on a probability space $(\mathcal X,\mathcal A,P)$, and suppose that
$X_1,\ldots,X_n$ are i.i.d. $\sim P$. As in Chapter 1 we let $\{\mathbb G_n(f):f\in\mathcal F\}$ denote the empirical
process indexed by $\mathcal F$:
$$\mathbb G_n(f)=\sqrt n(\mathbb P_n-P)(f),\qquad f\in\mathcal F.$$
To have convergence in law of all the finite-dimensional distributions, it suffices
that $\mathcal F\subset L_2(P)$. If also
$$\mathbb G_n\Rightarrow\mathbb G\quad\text{in }\ell^\infty(\mathcal F)$$
where, necessarily, $\mathbb G$ is a $P$–Brownian bridge process with almost all sample paths in
$C_u(\mathcal F,\rho_P)$, then we say that $\mathcal F$ is a $P$–Donsker class.
Our first theorem giving sufficient conditions for a class $\mathcal F$ to be a $P$–Donsker class
will be formulated in terms of uniform entropy as follows: suppose that $F$ is an envelope
function for the class $\mathcal F$ and that
$$\int_0^\infty\sup_Q\sqrt{\log N(\epsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\epsilon<\infty\qquad(6.1)$$
where the supremum is taken over all finitely discrete measures $Q$ on $(\mathcal X,\mathcal A)$ with
$\|F\|_{Q,2}^2=\int F^2\,dQ>0$. Then we say that $\mathcal F$ satisfies the uniform entropy condition.
Here is the resulting theorem:
Theorem 6.1 Suppose that $\mathcal F$ is a class of measurable functions with envelope function
$F$ satisfying:
(a) the uniform entropy condition (6.1) holds;
(b) $P^*F^2<\infty$;
(c) the classes $\mathcal F_\delta=\{f-g:f,g\in\mathcal F,\|f-g\|_{P,2}<\delta\}$ and $\mathcal F_\infty^2$ are $P$–measurable $\forall\delta>0$.
Then $\mathcal F$ is $P$–Donsker.
Proof.
Let $\delta>0$. By Markov's inequality and the symmetrization Corollary 4.1,
$$P^*(\|\mathbb G_n\|_{\mathcal F_\delta}>x)\le\frac2xE^*\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_if(X_i)\Big\|_{\mathcal F_\delta}$$
(remember that $X$ is symmetric if $\mathcal L(-X)=\mathcal L(X)$, and Lévy's inequality works for sums
of independent symmetric variables. The random variable $\epsilon X$ is symmetric since $-\epsilon=_d\epsilon$ and $\epsilon$ is
a Rademacher random variable independent of $X$).
Now, the supremum on the right side is measurable by the assumption (c), so Fubini's
theorem applies and the outer expectation can be calculated as $E_XE_\epsilon$. Thus we fix
$X_1,\ldots,X_n$, and bound the inner expectation over the Rademacher random variables
$\epsilon_i$, $i=1,\ldots,n$. By Hoeffding's inequality, the process $f\mapsto\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_if(X_i)$ is sub-Gaussian
for the $L_2(\mathbb P_n)$-seminorm $\|f\|_n$ given by
$$\|f\|_n^2=\mathbb P_nf^2=\frac1n\sum_{i=1}^nf^2(X_i).$$
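Thus the maximal inequality for sub-Gaussian processes, Corollary 3.5, yields
$$\text{(a)}\qquad E_\epsilon\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_if(X_i)\Big\|_{\mathcal F_\delta}\lesssim\int_0^\infty\sqrt{\log N(\epsilon,\mathcal F_\delta,L_2(\mathbb P_n))}\,d\epsilon.$$
The set $\mathcal F_\delta$ fits in a single ball of radius $\epsilon$ once $\epsilon$ is larger than $\theta_n$ given by
$$\theta_n^2=\sup_{f\in\mathcal F_\delta}\|f\|_n^2=\Big\|\frac1n\sum_{i=1}^nf^2(X_i)\Big\|_{\mathcal F_\delta}.$$
Also, note that the covering numbers of the class $\mathcal F_\delta$ are bounded by covering numbers of
$\mathcal F_\infty=\{f-g:f,g\in\mathcal F\}$, and the latter satisfy $N(\epsilon,\mathcal F_\infty,L_2(Q))\le N^2(\epsilon/2,\mathcal F,L_2(Q))$ for
every measure $Q$.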
Thus we can limit the integral in (a) to the interval $(0,\theta_n)$, change variables, and bound
the resulting integral above by a supremum over measures $Q$: we find that the right side
of (a) is bounded by
$$\int_0^{\theta_n}\sqrt{\log N(\epsilon,\mathcal F_\delta,L_2(\mathbb P_n))}\,d\epsilon\le\sqrt2\int_0^{\theta_n/\|F\|_n}\sqrt{\log N(\epsilon\|F\|_n,\mathcal F,L_2(\mathbb P_n))}\,d\epsilon\cdot\|F\|_n$$
$$\le\sqrt2\int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N(\epsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\epsilon\cdot\|F\|_n.$$
The integrand is integrable by assumption (a). Furthermore, $\|F\|_n^2$ is bounded below by
$\|F_*\|_n^2$, which converges almost surely to its expectation, which may be assumed positive.
Now apply the Cauchy-Schwarz inequality to conclude that (up to an absolute constant)
the expected value of the bound in the last display is bounded by
$$\text{(b)}\qquad\Big\{E_X\Big(\int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N(\epsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\epsilon\Big)^2\Big\}^{1/2}\{E_X\|F\|_n^2\}^{1/2}.$$
This bound converges to something bounded above by
$$\text{(c)}\qquad\int_0^{\delta/\|F_*\|_{P,2}}\sup_Q\sqrt{\log N(\epsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\epsilon\cdot\|F^*\|_{P,2}$$
if we can show that
$$\text{(d)}\qquad\theta_n^*\le\delta+o_p(1).$$
To show that this holds, note first that $\sup\{Pf^2:f\in\mathcal F_\delta\}\le\delta^2$. Since $\mathcal F_\delta\subset\mathcal F_\infty$, (d)
holds if
$$\|\mathbb P_nf^2-Pf^2\|_{\mathcal F_\infty}^*\to_p0;$$
i.e. if $\mathcal F_\infty^2$ is a weak $P$–Glivenko-Cantelli class. But $\mathcal F_\infty^2$ has an integrable envelope $(2F)^2$,
and is measurable by assumption. Furthermore, the covering number $N(\epsilon\|2F\|_n^2,\mathcal F_\infty^2,L_1(\mathbb P_n))$
is bounded by the covering number $N(\epsilon\|F\|_n,\mathcal F_\infty,L_2(\mathbb P_n))$ since, for any pair $f,g\in\mathcal F_\infty$,
$$\mathbb P_n|f^2-g^2|=\mathbb P_n|f-g||f+g|\le\mathbb P_n(|f-g|(4F))\le\|f-g\|_n\|4F\|_n,$$
which is at most $4\epsilon\|F\|_n^2=\epsilon\|2F\|_n^2$ whenever $\|f-g\|_n\le\epsilon\|F\|_n$.
By the uniform entropy assumption (a), $N(\epsilon\|F\|_n,\mathcal F_\infty,L_2(\mathbb P_n))$ is bounded by a fixed
number, so its logarithm is certainly $o_p(n)$, as required by the Glivenko-Cantelli Theorem
5.2. Letting $\delta\downarrow0$ we see that the asymptotic equicontinuity holds.
It remains only to prove that $\mathcal F$ is totally bounded in $L_2(P)$. By the result of the
previous paragraph, there exists a sequence of discrete measures $P_n$ with $\|P_nf^2-Pf^2\|_{\mathcal F_\infty}$
converging to zero. Choose $n$ sufficiently large so that the supremum is bounded by $\epsilon^2$.
By assumption $N(\epsilon,\mathcal F,L_2(P_n))$ is finite. But an $\epsilon$-net for $\mathcal F$ in $L_2(P_n)$ is a $\sqrt2\epsilon$-net in
$L_2(P)$. Thus $\mathcal F$ is $P$–Donsker by Theorem 2.2. □
It will be useful to record the result of the method of proof just used in terms of a
general inequality. For a class of functions $\mathcal F$ with envelope function $F$ and $\delta>0$, let
$$J(\delta,\mathcal F)=\sup_Q\int_0^\delta\sqrt{1+\log N(\epsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\epsilon,\qquad(6.2)$$
where the supremum is over all discrete probability measures $Q$ with $\|F\|_{Q,2}>0$. It is
clearly true that $J(1,\mathcal F)<\infty$ if $\mathcal F$ satisfies the uniform-entropy condition (6.1).
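Theorem 6.2 Let $\mathcal F$ be a $P$–measurable class of measurable functions with measurable
envelope function $F$. Then, for $p\ge1$,
$$\big\|\,\|\mathbb G_n\|_{\mathcal F}^*\big\|_{P,p}\lesssim\big\|J(\theta_n,\mathcal F)\|F\|_n\big\|_{P,p}\lesssim J(1,\mathcal F)\|F\|_{P,2\vee p}.\qquad(6.3)$$
Here $\theta_n=(\sup_{f\in\mathcal F}\|f\|_n)^*/\|F\|_n$, where $\|\cdot\|_n$ is the $L_2(\mathbb P_n)$-seminorm, and the inequalities
are valid up to constants depending only on $p$. In particular, when $p=1$,
$$E\|\mathbb G_n\|_{\mathcal F}^*\lesssim E\{J(\theta_n,\mathcal F)\|F\|_n\}\lesssim J(1,\mathcal F)\|F\|_{P,2}.\qquad(6.4)$$
Proof.
See van der Vaart and Wellner (1996), page 240. □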
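Proposition 6.1 (Le Cam (1981), Giné and Zinn (1984), (1986)) Suppose that
$\mathcal F\subset L_2(\mathcal X,\mathcal A,P)$. Suppose that the functions $f$ in $\mathcal F$ take values in $[-1,1]$ and are
centered: $Pf=0$ for all $f\in\mathcal F$.
(i) Let $M_n\equiv\sqrt n\sup_{f\in\mathcal F}Pf^2\equiv\sqrt n\sigma^2$ and suppose that $t,\rho$ are positive numbers such
that $\lambda\equiv t^{1/2}-2^{1/2}M_n^{1/2}-2\rho>0$. Then
$$Pr^*\Big\{\Big\|\sum_{i=1}^nf^2(X_i)\Big\|_{\mathcal F}>tn^{1/2}\Big\}\le E^*\Big\{1\wedge8N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\exp(-\lambda^2n^{1/2}/4)\Big\}.$$
This implies that for all $v\ge\sqrt{47}\,\sigma>2(2+2^{1/2})\sigma$,
$$Pr^*\Big\{\Big\|\sqrt{\mathbb P_nf^2}\Big\|_{\mathcal F}>v\Big\}\le E^*\Big\{1\wedge8N(\sigma,\mathcal F,L_2(\mathbb P_n))\exp(-v^2n/16)\Big\}.$$
(ii) In particular, if $\sigma^2$ is any number satisfying $\sup_{f\in\mathcal F}Pf^2\le\sigma^2\le PF^2$ and $\mathcal F$
satisfies
$$N(\epsilon,\mathcal F,L_2(Q))\le\Big(\frac{A\|F\|_{Q,2}}{\epsilon}\Big)^V,\qquad0<\epsilon<\|F\|_{Q,2},$$
for some $A\ge e$ and $V\ge1$, then, for all $t\ge47n\sigma^2>4(2+2^{1/2})^2n\sigma^2$,
$$Pr^*\Big\{\Big\|\sum_{i=1}^nf^2(X_i)\Big\|_{\mathcal F}>t\Big\}\le E^*\Big\{1\wedge8\Big(\frac{A\|F\|_{Q,2}}{\sigma}\Big)^V\exp(-t/16)\Big\}.$$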
Proof.
Let $\epsilon_1,\ldots,\epsilon_n$ be i.i.d. Rademacher random variables that are independent of
the $X_i$'s. Set
$$S_+(f)=\sum_{\{i\le n:\,\epsilon_i=1\}}f^2(X_i),\qquad S_-(f)=\sum_{\{i\le n:\,\epsilon_i=-1\}}f^2(X_i).$$
From the above definition, it follows that
$$S_+(f)=\sum_{i=1}^n\frac{\epsilon_i+1}{2}f^2(X_i),\qquad S_-(f)=\sum_{i=1}^n\frac{1-\epsilon_i}{2}f^2(X_i).$$
Then $S_+(f)$ and $S_-(f)$ have the same distribution and are conditionally independent
given $\{\epsilon_i\}_{i=1}^n$. Moreover,
$$S_+(f)-S_-(f)=\sum_{i=1}^n\epsilon_if^2(X_i),\qquad S_+(f)+S_-(f)=\sum_{i=1}^nf^2(X_i),$$
and, recalling the definition of $M_n$,
$$E\big[S_-^{1/2}(f)\big]^2=E\{S_-(f)\}=\frac12nPf^2\le\frac12n^{1/2}M_n.$$
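By the triangle inequality for the Euclidean distance in $\mathbb R^n$ and observing that
$\sqrt a+\sqrt b\le\sqrt2\sqrt{a+b}$, we have
$$\Big|S_+^{1/2}(f)-S_-^{1/2}(f)-\big(S_+^{1/2}(g)-S_-^{1/2}(g)\big)\Big|$$
$$\le\Big\{\sum_{i=1}^n\frac{\epsilon_i+1}{2}\big(f(X_i)-g(X_i)\big)^2\Big\}^{1/2}+\Big\{\sum_{i=1}^n\frac{1-\epsilon_i}{2}\big(f(X_i)-g(X_i)\big)^2\Big\}^{1/2}$$
$$\le\sqrt2\Big\{\sum_{i=1}^n\big(f(X_i)-g(X_i)\big)^2\Big\}^{1/2}=\sqrt2\sqrt n\big\{\mathbb P_n(f-g)^2\big\}^{1/2}.$$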
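Hence it follows
$$Pr\Big\{\Big\|\sum_{i=1}^nf^2(X_i)\Big\|_{\mathcal F}>tn^{1/2}\Big\}=Pr\Big\{\|S_+(f)+S_-(f)\|_{\mathcal F}>tn^{1/2}\Big\}$$
$$\le2Pr\Big\{\|S_+^{1/2}(f)\|_{\mathcal F}>t^{1/2}n^{1/4}/2^{1/2}\Big\}$$
$$\le4E_\epsilon P_X\Big\{\|S_+^{1/2}(f)-S_-^{1/2}(f)\|_{\mathcal F}>t^{1/2}n^{1/4}/2^{1/2}-M_n^{1/2}n^{1/4}\Big\}$$
$$=4E_XP_\epsilon\Big\{\|S_+^{1/2}(f)-S_-^{1/2}(f)\|_{\mathcal F}>(t^{1/2}-2^{1/2}M_n^{1/2})\,n^{1/4}/2^{1/2}\Big\}.\qquad(6.5)$$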
To get the second inequality we have used the symmetrization lemma for probabilities,
applying the Chebyshev’s inequality to calculate βn ().
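Suppose that $\mathcal F_{\rho/n^{1/4}}$ is a finite subset of $\mathcal F$, $\rho/n^{1/4}$-dense with respect to $L_2(\mathbb P_n)$. It
follows that $\mathcal F_{\rho/n^{1/4}}$ can be chosen to be of cardinality $N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))$. So, we have
$$P_\epsilon\Big\{\|S_+^{1/2}(f)-S_-^{1/2}(f)\|_{\mathcal F}>(t^{1/2}-2^{1/2}M_n^{1/2})\,n^{1/4}/2^{1/2}\Big\}$$
$$\le N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\times\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{|S_+^{1/2}(f)-S_-^{1/2}(f)|>(t^{1/2}-2^{1/2}M_n^{1/2}-2\rho)\,n^{1/4}/2^{1/2}\Big\}$$
$$=N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\times\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{\frac{|S_+(f)-S_-(f)|}{S_+^{1/2}(f)+S_-^{1/2}(f)}>(t^{1/2}-2^{1/2}M_n^{1/2}-2\rho)\,n^{1/4}/2^{1/2}\Big\}.\qquad(6.6)$$
Making use of the inequality $\sqrt{x+y}\le\sqrt x+\sqrt y$ in Eq. (6.6), and setting $\lambda\equiv t^{1/2}-2^{1/2}M_n^{1/2}-2\rho$,
we obtain
$$N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\times\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{\frac{|S_+(f)-S_-(f)|}{S_+^{1/2}(f)+S_-^{1/2}(f)}>(t^{1/2}-2^{1/2}M_n^{1/2}-2\rho)\,n^{1/4}/2^{1/2}\Big\}$$
$$\le N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\times\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{\frac{|\sum_{i=1}^n\epsilon_if^2(X_i)|}{\{\sum_{i=1}^nf^2(X_i)\}^{1/2}}>\lambda\,n^{1/4}/2^{1/2}\Big\}.\qquad(6.7)$$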
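From Eq. (6.7) it follows that
$$N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{\frac{|\sum_{i=1}^n\epsilon_if^2(X_i)|}{\{\sum_{i=1}^nf^2(X_i)\}^{1/2}}>\lambda\,n^{1/4}/2^{1/2}\Big\}$$
$$\le N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\sup_{f\in\mathcal F_{\rho/n^{1/4}}}P_\epsilon\Big\{\frac{|\sum_{i=1}^n\epsilon_if^2(X_i)|}{\{\sum_{i=1}^nf^4(X_i)\}^{1/2}}>\lambda\,n^{1/4}/2^{1/2}\Big\}$$
$$\le N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\,2\exp\Big(-\frac{\lambda^2n^{1/2}}{4}\Big),\qquad(6.8)$$
where we have used the hypothesis $|f|\le1$ (so that $f^4\le f^2$) to get the first inequality, and Hoeffding's
inequality for Eq. (6.8).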
From Eqs. (6.6), (6.7), (6.8) we have
$$P_\epsilon\Big\{\|S_+^{1/2}(f)-S_-^{1/2}(f)\|_{\mathcal F}>(t^{1/2}-2^{1/2}M_n^{1/2})\,n^{1/4}/2^{1/2}\Big\}\le N(\rho/n^{1/4},\mathcal F,L_2(\mathbb P_n))\,2\exp\Big(-\frac{\lambda^2n^{1/2}}{4}\Big).$$
Combining this result with Eq. (6.5), we finally obtain the first conclusion of point (i).
To obtain the second conclusion, take $\rho/n^{1/4}=\sigma$ and $t=v^2\sqrt n$ so that $\lambda=n^{1/4}\{v-(2+\sqrt2)\sigma\}$. So, for $v\ge2(2+\sqrt2)\sigma$, it follows that
$$\frac{\lambda^2n^{1/2}}{4}=\frac n4\{v-(2+\sqrt2)\sigma\}^2\ge\frac{v^2n}{16}.$$
Part (ii) of the proposition easily follows from the first. □
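The numerical constants appearing in the proposition and in the last step of the proof can be checked mechanically. The sketch below is illustrative: the helper names and the sample triples $(v,\sigma,n)$ are assumptions, and the code only verifies the arithmetic of the substitution $\rho=\sigma n^{1/4}$, $t=v^2\sqrt n$, not the probabilistic statement.

```python
import math

C = 2.0 + math.sqrt(2.0)  # the constant 2 + 2^{1/2}


def lam(t: float, M_n: float, rho: float) -> float:
    """lambda = t^{1/2} - 2^{1/2} M_n^{1/2} - 2 rho from part (i)."""
    return math.sqrt(t) - math.sqrt(2.0 * M_n) - 2.0 * rho


def exponents(v: float, sigma: float, n: int):
    """Return (lambda^2 n^{1/2} / 4, v^2 n / 16) for rho = sigma n^{1/4}, t = v^2 n^{1/2}."""
    rho = sigma * n ** 0.25
    t = v ** 2 * math.sqrt(n)
    M_n = math.sqrt(n) * sigma ** 2
    l = lam(t, M_n, rho)
    return l ** 2 * math.sqrt(n) / 4.0, v ** 2 * n / 16.0


# sqrt(47) is just above 2(2 + sqrt 2), and 47 is just above 4(2 + sqrt 2)^2.
margins = (math.sqrt(47.0) - 2.0 * C, 47.0 - 4.0 * C * C)
```

Both entries of `margins` are small positive numbers, which is the only role of the constant 47: it is a convenient integer upper bound for $4(2+\sqrt2)^2\approx46.63$.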
6.2 Bracketing Entropy Donsker Theorems
The second main empirical central limit theorem uses bracketing entropy rather than
uniform entropy. The simplest version of this theorem is due to Ossiander (1987).
Theorem 6.3 Suppose that $\mathcal F$ is a class of measurable functions satisfying
$$\int_0^\infty\sqrt{\log N_{[\,]}(\epsilon,\mathcal F,L_2(P))}\,d\epsilon<\infty.\qquad(6.9)$$
Then $\mathcal F$ is $P$–Donsker.
We will actually prove a more general result from van der Vaart and Wellner (1996).
The finiteness of the $L_2(P)$–bracketing integral implies that $P^*F^2<\infty$ for an envelope
function $F$. This condition is not necessary for a class $\mathcal F$ to be Donsker. Rather,
we know that every $P$–Donsker class $\mathcal F$ satisfies $P(\|f-Pf\|_{\mathcal F}^*>x)=o(x^{-2})$ as $x$ tends
to infinity. Consequently, if $\|Pf\|_{\mathcal F}<\infty$, then $\mathcal F$ possesses an envelope function $F$ with
a weak second moment (meaning that $P(F>x)=o(x^{-2})$ as $x\to\infty$). Similarly, the
$L_2(P)$-norm used in the brackets can be replaced by a weaker norm that makes the
bracketing numbers smaller and the convergence of the integral easier. So, we define the
$L_{2,\infty}(P)$-norm of a function $f$ as
$$\|f\|_{P,2,\infty}=\sup_{x>0}\{x^2P(|f(X)|>x)\}^{1/2}.$$
Actually this is not a norm because it does not satisfy the triangle inequality. However,
it can be shown that there exists a norm equivalent to $\|\cdot\|_{2,\infty}$ up to a constant multiple.
Note that $\|f\|_{P,2,\infty}\le\|f\|_{P,2}$, so that the bracketing numbers relative to $L_{2,\infty}(P)$ are
smaller.
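For a finitely supported distribution the weak norm $\|f\|_{P,2,\infty}$ can be computed exactly and compared with the $L_2(P)$-norm. This is a sketch under the assumption that $f(X)$ takes the listed values with the listed probabilities; the function names are introduced here for illustration. It uses the fact that, between consecutive values of $|f|$, the tail probability is constant while $x^2$ increases, so the supremum is approached as $x$ increases to a support point $v$, where $P(|f|>x)\to P(|f|\ge v)$.

```python
import math
from typing import Sequence


def weak_l2_norm(values: Sequence[float], probs: Sequence[float]) -> float:
    """sup_x {x^2 P(|f| > x)}^{1/2} when f(X) = values[i] with probability probs[i]."""
    best = 0.0
    for v in set(abs(u) for u in values):
        # Limit of P(|f| > x) as x increases to v.
        tail = sum(p for u, p in zip(values, probs) if abs(u) >= v)
        best = max(best, v * v * tail)
    return math.sqrt(best)


def l2_norm(values: Sequence[float], probs: Sequence[float]) -> float:
    """Ordinary L2(P)-norm {E f(X)^2}^{1/2}."""
    return math.sqrt(sum(p * u * u for u, p in zip(values, probs)))


vals = [1.0, 2.0, 3.0]
pr = [1.0 / 3.0] * 3
```

For the uniform distribution on $\{1,2,3\}$ used above, the weak norm is $\sqrt3$ while the $L_2$-norm is $\sqrt{14/3}$, consistent with $\|f\|_{P,2,\infty}\le\|f\|_{P,2}$; for a constant function the two coincide.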
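Theorem 6.4 Let $\mathcal F$ be a class of measurable functions such that
$$\int_0^\infty\sqrt{\log N_{[\,]}(\epsilon,\mathcal F,L_{2,\infty}(P))}\,d\epsilon+\int_0^\infty\sqrt{\log N(\epsilon,\mathcal F,L_2(P))}\,d\epsilon<\infty.\qquad(6.10)$$
Suppose also that the envelope function $F$ of $\mathcal F$ has a weak second moment
$$\lim_{x\to\infty}x^2P(F(X)>x)=0.$$
Then $\mathcal F$ is $P$–Donsker.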
Proof.
The following proof is from van der Vaart and Wellner (1996).
For each natural number $q$, there exists a partition $\{\mathcal F_{qi}\}_{i=1}^{N_q}$ of $\mathcal F$ into $N_q$ disjoint subsets
such that
(a) $\sum_{q\ge1}2^{-q}\sqrt{\log N_q}<\infty$,
(b) $\big\|(\sup_{f,g\in\mathcal F_{qi}}|f-g|)^*\big\|_{P,2,\infty}<2^{-q}$,
(c) $\sup_{f,g\in\mathcal F_{qi}}\|f-g\|_{P,2}<2^{-q}$.
To see this, first cover $\mathcal F$ separately with the minimal numbers of $L_2(P)$-balls and
$L_{2,\infty}(P)$-brackets of size $2^{-q}$, disjointify, and then take the intersection of the two partitions. If $N_q^1$ and $N_q^2$ are the numbers of sets in the two partitions, the total number of
sets in the new partition will be $N_q=N_q^1N_q^2$. Noting that
$$\sqrt{\log N_q}=\sqrt{\log N_q^1+\log N_q^2}\le\sqrt{\log N_q^1}+\sqrt{\log N_q^2},$$
condition (a) holds if it is satisfied for both $N_q^1$ and $N_q^2$. Conditions (b) and (c) follow
from how the partition is constructed. The sequence of partitions can, without loss of
generality, be chosen as successive refinements. Indeed, first construct a sequence of
partitions $\{\bar{\mathcal F}_{qi}\}_{i=1}^{\bar N_q}$, $(q=1,2,\ldots)$, $\mathcal F=\cup_{i=1}^{\bar N_q}\bar{\mathcal F}_{qi}$, possibly without this property. Then,
take the partition at stage $q$ consisting of the intersections of the form $\cap_{p=1}^q\bar{\mathcal F}_{pi_p}$. So, we
obtain partitions into $N_q=\bar N_1\cdots\bar N_q$ sets. Noting that (see Exercise 6.1)
$$\sum_{q=1}^\infty2^{-q}\sqrt{\log N_q}\le2\sum_{p=1}^\infty2^{-p}\sqrt{\log\bar N_p},$$
we conclude that condition (a) continues to hold.
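The inequality quoted from Exercise 6.1 can be sanity-checked on a truncated sequence of partition sizes. This is a sketch only: the sequence $\bar N_p=2^p$ and the function names are assumptions made here, and a finite check does not replace the interchange-of-summation argument behind the exercise.

```python
import math
from typing import List


def refined_counts(nbar: List[int]) -> List[int]:
    """N_q = Nbar_1 * ... * Nbar_q for the successively refined partitions."""
    out, prod = [], 1
    for m in nbar:
        prod *= m
        out.append(prod)
    return out


def weighted_entropy_sum(counts: List[int]) -> float:
    """sum over q >= 1 of 2^{-q} sqrt(log counts[q-1]) for the given finite list."""
    return sum(2.0 ** (-q) * math.sqrt(math.log(N)) for q, N in enumerate(counts, start=1))


nbar = [2 ** p for p in range(1, 21)]          # hypothetical sizes of the unrefined partitions
lhs = weighted_entropy_sum(refined_counts(nbar))
rhs = 2.0 * weighted_entropy_sum(nbar)
```

Here `lhs` stays below `rhs`, reflecting $\sqrt{\log N_q}\le\sum_{p\le q}\sqrt{\log\bar N_p}$ followed by summing the geometric weights $\sum_{q\ge p}2^{-q}=2\cdot2^{-p}$.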
Now for each $q$, we choose a fixed element $f_{qi}$ from each partitioning set $\mathcal F_{qi}$, and
define
$$\text{(d)}\qquad\pi_qf=f_{qi},\qquad\Delta_qf=\Big(\sup_{g,h\in\mathcal F_{qi}}|h-g|\Big)^*,\qquad\text{if } f\in\mathcal F_{qi}.$$
By this definition, if $f$ runs through $\mathcal F$, $\pi_qf$ and $\Delta_qf$ run through a set of just $N_q$
functions. Recalling Theorem 2.3, to conclude that $\mathcal F$ is a $P$–Donsker class, it suffices to
show that the sequence $\|\mathbb G_n(f-\pi_{q_0}f)\|_{\mathcal F}^*\to_P0$ as $n\to\infty$ followed by $q_0\to\infty$.
For each fixed $n$ and $q\ge q_0$ define truncation levels $a_q$ and indicator functions $A_qf$,
$B_qf$:
$$a_q=2^{-q}/\sqrt{\log N_{q+1}},$$
$$A_{q-1}f=1\{\Delta_{q_0}f\le\sqrt na_{q_0},\ldots,\Delta_{q-1}f\le\sqrt na_{q-1}\},$$
$$B_qf=A_{q-1}f\,1\{\Delta_qf>\sqrt na_q\},$$
$$B_{q_0}f=1\{\Delta_{q_0}f>\sqrt na_{q_0}\}.$$
Because the partitions are nested, $A_qf$ and $B_qf$ are constant in $f$ on each set $\mathcal F_{qi}$ of the
partition at level $q$. Now, consider the following decomposition (pointwise in $x$)
$$\text{(e)}\qquad f-\pi_{q_0}f=(f-\pi_{q_0}f)B_{q_0}f+\sum_{q=q_0+1}^\infty(f-\pi_qf)B_qf+\sum_{q=q_0+1}^\infty(\pi_qf-\pi_{q-1}f)A_{q-1}f,$$
based on the idea of writing
$$f-\pi_{q_0}f=(f-\pi_{q_1}f)+\sum_{q=q_0+1}^{q_1}(\pi_qf-\pi_{q-1}f)$$
for the largest $q_1=q_1(f,x)$ such that each link $|\pi_qf-\pi_{q-1}f|$ is bounded by $\sqrt na_q$ (note
that $|\pi_qf-\pi_{q-1}f|\le\Delta_{q-1}f$). To obtain decomposition (e) rigorously, note that for the
indicator functions $B_qf$ there are only two possible cases:
(i) $B_qf=0$ for all $q$,
(ii) there is a unique $q=q_1$ such that $B_{q_1}f=1$.
In case (i), since $B_qf=0$ for all $q$, we have $A_qf=1$ for all $q$. So, on the right
side of decomposition (e), the first two terms vanish and the third is an infinite series,
$\sum_{q=q_0+1}^\infty(\pi_qf-\pi_{q-1}f)$, whose $q$-th partial sum telescopes to $\pi_qf-\pi_{q_0}f$ and converges,
by definition of $A_qf$, to $f-\pi_{q_0}f$, i.e. the left side. In case (ii), the condition $B_{q_1}f=1$ yields
$A_{q-1}f=1$ if and only if $q\le q_1$. So, decomposition (e) immediately follows.
Now apply the empirical process $\mathbb G_n=\sqrt n(\mathbb P_n-P)$ to each of the three terms of
decomposition (e) separately, and take the suprema over $f\in\mathcal F$ for each term. It will be
shown that the resulting three terms converge to zero in probability as $n\to\infty$ followed
by $q_0\to\infty$. For the first term, we have
$$|f-\pi_{q_0}f|B_{q_0}f\le2F\,1\{2F>\sqrt na_{q_0}\},$$
so that
$$\|\mathbb G_n(f-\pi_{q_0}f)B_{q_0}f\|_{\mathcal F}\le\sqrt n(\mathbb P_n+P)\,2F\,1\{2F>\sqrt na_{q_0}\}.\qquad(6.11)$$
From Eq. (6.11), taking the expected values, we finally obtain
$$E^*\|\mathbb G_n(f-\pi_{q_0}f)B_{q_0}f\|_{\mathcal F}\le4\sqrt nP^*F\,1\{2F>\sqrt na_{q_0}\}.\qquad(6.12)$$
Recalling that each random variable $X$ with a weak second moment satisfies $E|X|1\{|X|>t\}=o(t^{-1})$ as $t\to\infty$ (see Exercise 6.2), and that, by hypothesis, the envelope function
$F$ has a weak second moment, we conclude that the right side of Eq. (6.12) converges to
zero as $n\to\infty$, for each fixed $q_0$.
To study the second and third terms, note that for a fixed bounded function
$f$, Bernstein's inequality yields
$$P(|\mathbb G_n(f)|>x)\le2\exp\Big(-\frac12\,\frac{x^2}{Pf^2+(1/3)\|f\|_\infty x/\sqrt n}\Big).$$
So, by virtue of Proposition 3.2, for any finite set $\mathcal F$ with cardinality at least 2, we have
$$\text{(f)}\qquad E\|\mathbb G_n(f)\|_{\mathcal F}\lesssim\max_f\frac{\|f\|_\infty}{\sqrt n}\log|\mathcal F|+\max_f\|f\|_{P,2}\sqrt{\log|\mathcal F|},$$
i.e., the left side $E\|\mathbb G_n(f)\|_{\mathcal F}$ is bounded by a constant times the right side of inequality
(f).
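The two regimes of Bernstein's inequality, Gaussian-type for moderate $x$ and exponential-type for large $x/\sqrt n$, are what produce the two terms in (f). The sketch below simply evaluates the tail bound; the function name and the numerical inputs are assumptions for illustration, with `var` standing for $Pf^2$ and `sup_norm` for $\|f\|_\infty$.

```python
import math


def bernstein_tail(x: float, var: float, sup_norm: float, n: int) -> float:
    """Right side 2 exp(-(1/2) x^2 / (Pf^2 + (1/3) ||f||_inf x / sqrt(n)))."""
    denom = var + sup_norm * x / (3.0 * math.sqrt(n))
    return 2.0 * math.exp(-0.5 * x * x / denom)


gaussian_limit = 2.0 * math.exp(-2.0)               # value 2 exp(-x^2 / (2 Pf^2)) at x = 2, Pf^2 = 1
moderate_n = bernstein_tail(2.0, 1.0, 1.0, 100)     # correction term still visible
large_n = bernstein_tail(2.0, 1.0, 1.0, 10 ** 12)   # correction term negligible
```

As $n$ grows with $x$ fixed the bound decreases to the sub-Gaussian value `gaussian_limit`, while for small $n$ the term $\|f\|_\infty x/(3\sqrt n)$ inflates the bound; this is the reason for truncating at levels proportional to $\sqrt n\,a_q$ in the chaining argument.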
We begin by studying the second term. Since the partitions are nested, it follows that
$\Delta_qfB_qf\le\Delta_{q-1}fB_qf$. Moreover, for any non-negative random variable $X$ we have the
following inequalities (see Exercise 6.3)
$$\|X\|_{2,\infty}^2\le\sup_{t>0}t\,EX1\{X>t\}\le2\|X\|_{2,\infty}^2.\qquad(6.13)$$
So, making use of Eq. (6.13) and condition (b), we obtain
$$\sqrt na_qP(\Delta_qfB_qf)=\sqrt na_qP(\Delta_qfA_{q-1}f\,1\{\Delta_qf>\sqrt na_q\})\le\sqrt na_qP(\Delta_qf\,1\{\Delta_qf>\sqrt na_q\})$$
$$\le2\|\Delta_qf\|_{2,\infty}^2\le2(2^{-q})^2=2\cdot2^{-2q}.\qquad(6.14)$$
Moreover, for $q>q_0$, we have
$$\Delta_qfB_qf\le\Delta_{q-1}fB_qf\le\sqrt na_{q-1},$$
so that, by Eq. (6.14),
$$P(\Delta_qfB_qf)^2\le\sqrt na_{q-1}P(\Delta_qf\,1\{\Delta_qf>\sqrt na_q\})\le2\,\frac{a_{q-1}}{a_q}\,2^{-2q}.\qquad(6.15)$$
Note that
$$\|\mathbb G_n(f-\pi_qf)B_qf\|_{\mathcal F}=\|\sqrt n(\mathbb P_n-P)(f-\pi_qf)B_qf\|_{\mathcal F}$$
$$\le\|\sqrt n(\mathbb P_n+P)\Delta_qfB_qf\|_{\mathcal F}=\|\sqrt n(\mathbb P_n-P+2P)\Delta_qfB_qf\|_{\mathcal F}$$
$$\le\|\sqrt n(\mathbb P_n-P)\Delta_qfB_qf\|_{\mathcal F}+2\sqrt n\|P\Delta_qfB_qf\|_{\mathcal F}$$
$$=\|\mathbb G_n\Delta_qfB_qf\|_{\mathcal F}+2\sqrt n\|P\Delta_qfB_qf\|_{\mathcal F}.\qquad(6.16)$$
Taking the expectation of each side of Eq. (6.16) and making use of Eqs. (6.14), (6.15) and
inequality (f), we finally obtain
$$E^*\Big\|\sum_{q_0+1}^\infty\mathbb G_n(f-\pi_qf)B_qf\Big\|_{\mathcal F}\le\sum_{q_0+1}^\infty E^*\|\mathbb G_n\Delta_qfB_qf\|_{\mathcal F}+\sum_{q_0+1}^\infty2\sqrt n\|P\Delta_qfB_qf\|_{\mathcal F}$$
$$\lesssim\sum_{q_0+1}^\infty\Big[a_{q-1}\log N_q+\sqrt{\frac{a_{q-1}}{a_q}}\,2^{-q}\sqrt{\log N_q}+2\sqrt n\,\frac{2\cdot2^{-2q}}{\sqrt na_q}\Big].$$
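Since $a_q$ is decreasing, the factor $\sqrt{a_{q-1}/a_q} \ge 1$ in the middle term of the previous display can be replaced by its square $a_{q-1}/a_q$, so that
\[
\begin{aligned}
E^*\Bigl\|\sum_{q_0+1}^{\infty} G_n(f - \pi_q f)B_q f\Bigr\|_{\mathcal{F}}
&\lesssim \sum_{q_0+1}^{\infty}\Bigl[\, a_{q-1}\log N_q + \frac{a_{q-1}}{a_q}\,2^{-q}\sqrt{\log N_q} + \frac{4}{a_q}\,2^{-2q} \Bigr] \\
&= \sum_{q_0+1}^{\infty}\Bigl[\, 2^{-(q-1)}\sqrt{\log N_q} + 2^{-(q-1)}\sqrt{\log N_{q+1}} + 4\cdot 2^{-q}\sqrt{\log N_{q+1}} \Bigr].
\end{aligned}
\tag{6.17}
\]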
The series (6.17) can be bounded by a multiple of $\sum_{q_0+1}^{\infty} 2^{-q}\sqrt{\log N_q}$, and this upper bound is independent of n and converges to zero as q0 → ∞. We conclude that series (6.17) converges to zero as q0 → ∞.
Finally, we have to analyse the third term of decomposition (e). Since the partitions are nested, it follows that
\[
|\pi_q f - \pi_{q-1} f|\,A_{q-1} f \le \Delta_{q-1} f\,A_{q-1} f \le \sqrt{n}\,a_{q-1}.
\tag{6.18}
\]
Moreover, since $|\pi_q f - \pi_{q-1} f| \le \Delta_{q-1} f \le 2^{-(q-1)}$, we obtain
\[
P|\pi_q f - \pi_{q-1} f|^2 \le \bigl(2^{-(q-1)}\bigr)^2 = 4\cdot 2^{-2q}.
\tag{6.19}
\]
Noting that there are at most $N_q$ functions $\pi_q f - \pi_{q-1} f$ and at most $N_{q-1}$ functions $A_{q-1} f$, and making use of inequality (f) and Eqs. (6.18), (6.19), we thus have
\[
E^*\Bigl\|\sum_{q_0+1}^{\infty} G_n(\pi_q f - \pi_{q-1} f)A_{q-1} f\Bigr\|_{\mathcal{F}}
\lesssim \sum_{q_0+1}^{\infty}\Bigl[\, a_{q-1}\log N_q + 2^{-q}\sqrt{\log N_q} \Bigr].
\tag{6.20}
\]
Again this upper bound (6.20) is independent of n and converges to zero as q0 → ∞. This completes the proof. □
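In the following we derive bounds on the expected value of $\|G_n f\|_{\mathcal{F}}$ for classes $\mathcal{F}$ that possess a finite bracketing entropy integral (6.9). More generally, for a given norm $\|\cdot\|$, we can define the bracketing integral of a class of functions $\mathcal{F}$ with envelope F by
\[
J_{[\,]}(\delta, \mathcal{F}, \|\cdot\|) = \int_0^{\delta} \sqrt{1 + \log N_{[\,]}(\varepsilon\|F\|, \mathcal{F}, \|\cdot\|)}\;d\varepsilon.
\]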
The basic bracketing maximal inequality uses the L2 (P )-norm.
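Theorem 6.5 Let $\mathcal{F}$ be a class of measurable functions with measurable envelope function F. For a given η > 0, set
\[
a(\eta) = \frac{\eta\,\|F\|_{P,2}}{\sqrt{1 + \log N_{[\,]}(\eta\|F\|_{P,2}, \mathcal{F}, L_2(P))}}.
\]
Then, for every η > 0,
\[
E^*\|G_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(\eta, \mathcal{F}, L_2(P))\,\|F\|_{P,2} + \sqrt{n}\,P F 1\{F > \sqrt{n}\,a(\eta)\}
+ \bigl\|\,\|f\|_{P,2}\bigr\|_{\mathcal{F}}\,\sqrt{1 + \log N_{[\,]}(\eta\|F\|_{P,2}, \mathcal{F}, L_2(P))}.
\]
If $\|f\|_{P,2} < \delta\|F\|_{P,2}$ for every $f \in \mathcal{F}$, then taking η = δ in the last display yields
\[
E^*\|G_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(\delta, \mathcal{F}, L_2(P))\,\|F\|_{P,2} + \sqrt{n}\,P F 1\{F > \sqrt{n}\,a(\delta)\}.
\]
Hence, for any class $\mathcal{F}$,
\[
E^*\|G_n f\|_{\mathcal{F}} \lesssim J_{[\,]}(1, \mathcal{F}, L_2(P))\,\|F\|_{P,2}.
\]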
6.3 Donsker Theorem for Classes Changing with Sample Size
The Glivenko-Cantelli and Donsker theorems concern the empirical process for different
n, but each time with the same indexing class F. This is sufficient for a large number of
applications, but in other cases it may be necessary to allow the class F to change with
n.
Suppose that Fn is a sequence of classes of measurable functions fn,t : X → R indexed
by a parameter t which belongs to a common index set T , i.e. Fn = {fn,t : t ∈ T }. We
want to study the weak convergence of the stochastic processes
Zn (t) = Gn fn,t
(6.21)
as elements of l∞ (T ). We know that weak convergence is equivalent to marginal convergence and asymptotic tightness. The marginal convergence to a Gaussian process follows
under the conditions of the Lindeberg theorem; sufficient conditions for tightness can be
given in terms of the entropies of the classes Fn .
We shall assume that there is a semimetric ρ for the index set T for which (T, ρ) is totally bounded and that relates to the L2-metric in that
\[
\sup_{\rho(s,t) < \delta_n} P(f_{n,s} - f_{n,t})^2 \to 0 \quad \text{for every } \delta_n \downarrow 0.
\tag{6.22}
\]
Furthermore, we suppose that the classes $\mathcal{F}_n$ possess envelope functions $F_n$ that satisfy the Lindeberg condition
\[
P F_n^2 = O(1), \qquad P F_n^2 1\{F_n > \varepsilon\sqrt{n}\} \to 0 \quad \text{for every } \varepsilon > 0.
\tag{6.23}
\]
The other hypothesis needed is the control on entropy: we can use either bracketing or uniform entropy. However, it will be convenient to formulate the hypothesis in terms of the modified bracketing entropy integral:
\[
\tilde{J}_{[\,]}(\delta, \mathcal{F}, \|\cdot\|) = \int_0^{\delta} \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)}\;d\varepsilon.
\]
Theorem 6.6 Let Fn = {fn,t : t ∈ T } be a class of measurable functions indexed by
(T, ρ) which is totally bounded. Suppose that conditions (6.22) and (6.23) hold. If either
J˜[ ] (δn , Fn , L2 (P )) → 0, for every δn ↓ 0, or J(δn , Fn , L2 ) → 0, for every δn ↓ 0 and all
the classes Fn are P –measurable, then the processes {Zn (t) : t ∈ T } defined by (6.21)
converge weakly to a tight Gaussian process Z provided that the sequence of covariance
functions Kn (s, t) = P (fn,s fn,t )−P (fn,s ) P (fn,t ) converges pointwise on T ×T . If K(s, t),
s, t ∈ T , denotes the limit of the covariance functions, then it is a covariance function
and the limit process Z is a mean zero Gaussian process with covariance function K.
Proof.
The following proof is under the bracketing entropy condition.
For every δ > 0, we can use the semimetric ρ and condition (6.22) to partition T into finitely many sets $T_1, \ldots, T_k$ such that, for every sufficiently large n,
\[
\max_{1 \le i \le k}\ \sup_{s,t \in T_i} P(f_{n,s} - f_{n,t})^2 < \delta^2.
\]
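Next we apply Theorem 6.5 to obtain
\[
E^* \max_{1 \le i \le k}\ \sup_{s,t \in T_i} |G_n(f_{n,s} - f_{n,t})|
\lesssim \tilde{J}_{[\,]}(\delta, \tilde{\mathcal{F}}_n, L_2(P)) + \frac{P \tilde{F}_n^2 1\{\tilde{F}_n > \tilde{a}_n(\delta)\sqrt{n}\}}{\tilde{a}_n(\delta)},
\tag{6.24}
\]
where $\tilde{a}_n(\delta)$ is the number $a(\delta/\|\tilde{F}_n\|_{P,2})$ of Theorem 6.5 evaluated for the class of functions $\tilde{\mathcal{F}}_n = \mathcal{F}_n - \mathcal{F}_n$ with envelope $\tilde{F}_n$:
\[
\tilde{a}_n(\delta) = \frac{\delta}{\sqrt{1 + \log N_{[\,]}(\delta, \tilde{\mathcal{F}}_n, L_2(P))}}.
\]
The number $\tilde{a}_n(\delta)$ can be bounded below, up to constants, by the corresponding number $a_n(\delta)$ for the class $\mathcal{F}_n$ and its envelope, i.e.
\[
a_n(\delta) = \frac{\delta}{\sqrt{1 + 2\log N_{[\,]}(\delta/2, \mathcal{F}_n, L_2(P))}}.
\]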
Because $\tilde{J}_{[\,]}(\delta_n, \mathcal{F}_n, L_2(P)) \to 0$ for every δn ↓ 0, we must have $\tilde{J}_{[\,]}(\delta, \mathcal{F}_n, L_2(P)) = O(1)$ for every δ > 0, and hence $a_n(\delta)$ is bounded away from zero. Consequently, the number $\tilde{a}_n(\delta)$ is also bounded away from zero and so, by the Lindeberg condition (6.23), the second term on the right side of Eq. (6.24) converges to zero as n → ∞, for every fixed δ > 0. The first term on the right side of Eq. (6.24) can be made arbitrarily small as n → ∞ by choosing δ sufficiently small. This shows that asymptotic equicontinuity holds.
Convergence of the finite-dimensional distributions follows from the Lindeberg condition (6.23) and from the hypothesized convergence of the covariance functions.
□
6.4 Universal and Uniform Donsker Classes
If F is P –Donsker for all probability measures P on (X , A), then we say that F is a
universal Donsker class.
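Moreover, denoting by $\mathcal{P} = \mathcal{P}(\mathcal{X}, \mathcal{A})$ the set of all probability measures on the measurable space $(\mathcal{X}, \mathcal{A})$, we say that $\mathcal{F}$ is a uniform Donsker class if
\[
\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} d^*_{BL}(G_{n,P}, G_P) \to 0 \quad \text{as } n \to \infty;
\]
here $d^*_{BL}$ is the dual bounded Lipschitz metric
\[
d^*_{BL}(G_{n,P}, G_P) = \sup_{H \in BL_1} |E^* H(G_{n,P}) - E H(G_P)|,
\]
and $BL_1$ is the collection of all functions $H : l^\infty(\mathcal{F}) \to \mathbb{R}$ which are uniformly bounded by 1 and satisfy $|H(z_1) - H(z_2)| \le \|z_1 - z_2\|_{\mathcal{F}}$.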
We say that $\mathcal{F}$ is a bounded Donsker class if
\[
\|G_{n,P}\|^*_{\mathcal{F}} = O_P(1).
\tag{6.25}
\]
If the envelope function F of the class $\mathcal{F}$ is such that $\sup_x F(x) < \infty$, then, using the Hoffman-Jørgensen inequality, it can be shown (see Exercise 6.4) that condition (6.25) is equivalent to
\[
\limsup_{n \to \infty} E^*_P \|G_{n,P}\|_{\mathcal{F}} < \infty.
\tag{6.26}
\]
If condition (6.25) holds for every $P \in \mathcal{P}$, we say that $\mathcal{F}$ is a universal bounded Donsker class. Similarly, if
\[
\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}} E^*_P \|G_{n,P}\|_{\mathcal{F}} < \infty,
\tag{6.27}
\]
we say that $\mathcal{F}$ is a uniform bounded Donsker class.
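To make the boundedness condition (6.26) concrete, the following sketch estimates the mean of the supremum norm of the empirical process by simulation for one particular class and one particular P. The setting is an assumption chosen for illustration: the class of indicators of half-lines (−∞, t] with P uniform on (0, 1), for which the norm equals √n times the Kolmogorov-Smirnov distance. The function names are hypothetical.

```python
import math
import random

def sup_empirical_process(sample):
    """sqrt(n) * sup_t |F_n(t) - t| for a sample assumed to come from Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    dist = 0.0
    for i, x in enumerate(xs):
        # The supremum is attained just before or at an order statistic.
        dist = max(dist, (i + 1) / n - x, x - i / n)
    return math.sqrt(n) * dist

def mean_sup(n, reps=400, seed=1):
    """Monte Carlo estimate of E ||G_n|| over the half-line class."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += sup_empirical_process([rng.random() for _ in range(n)])
    return total / reps

for n in (50, 200, 800):
    print(n, round(mean_sup(n), 3))
```

The estimates should stay of the same order as n grows, which is the behaviour that (6.26) requires; the simulation says nothing about other measures P, so it does not address the universal or uniform versions.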
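Theorem 6.7 Let $\mathcal{C}$ be a countable class of sets in $\mathcal{X}$ satisfying the universal bounded Donsker class property:
\[
\lim_{M \to \infty}\ \limsup_{n \to \infty} P^*\{\|G_{n,P}\|_{\mathcal{C}} > M\} = 0 \quad \text{for all } P \in \mathcal{P}.
\]
Then $\mathcal{C}$ is a VC-class.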
Proof. Applying the Hoffman-Jørgensen and symmetrization inequalities we obtain, respectively, the following two inequalities:
\[
\sup_n \sqrt{n}\,E\|P_n - P\|_{\mathcal{C}} < \infty,
\tag{6.28}
\]
\[
E\Bigl\|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\bigl(1_C(X_i) - P C\bigr)\Bigr\|_{\mathcal{C}} \le 2E\|P_n - P\|_{\mathcal{C}}.
\tag{6.29}
\]
Making use of Eqs. (6.28) and (6.29) it follows that
\[
\sqrt{n}\,E\Bigl\|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\bigl(1_C(X_i) - P C\bigr)\Bigr\|_{\mathcal{C}} \le 2\sqrt{n}\,E\|P_n - P\|_{\mathcal{C}} < \infty.
\tag{6.30}
\]
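From Eq. (6.30), we have
\[
\frac{1}{\sqrt{n}}\,E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \le \frac{1}{\sqrt{n}}\,E\Bigl|\sum_{i=1}^{n} \varepsilon_i\Bigr| + 2\sqrt{n}\,E\|P_n - P\|_{\mathcal{C}},
\tag{6.31}
\]
so that, for the Rademacher complexity of $\mathcal{C}$ at P,
\[
R(P) \equiv \sup_n \frac{1}{\sqrt{n}}\,E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}},
\]
we obtain the following bound:
\[
\begin{aligned}
R(P) &\le \sup_n \frac{1}{\sqrt{n}}\,E\Bigl|\sum_{i=1}^{n} \varepsilon_i\Bigr| + \sup_n 2\sqrt{n}\,E\|P_n - P\|_{\mathcal{C}} \\
&\le \sup_n 2\sqrt{n}\,E\|P_n - P\|_{\mathcal{C}} + \sqrt{2\pi} < \infty,
\end{aligned}
\]
where we used Hoeffding's inequality at the last step. Thus,
\[
R(P) < \infty, \quad \forall P.
\tag{6.32}
\]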
We now show that there exists a constant M < ∞ such that
\[
R(P) \le M, \quad \forall P.
\tag{6.33}
\]
To this end, we consider two measures $P^0$ and $P^1$ on $(\mathcal{X}, \mathcal{A})$ and define $P = \alpha P^0 + (1 - \alpha)P^1$. We want to show that $R(P) \ge \alpha R(P^0)$. Let $X_i^0$ and $X_i^1$ be i.i.d. $P^0$ and $P^1$ respectively, and let $\lambda_i$ be i.i.d. Bernoulli random variables with parameter 1 − α, independent of the $X_i^0$'s and $X_i^1$'s. With these definitions, we have $X_i =_d X_i^{\lambda_i}$ and, by the contraction principle,
\[
E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i^0)\,1_{[\lambda_i = 0]}\Bigr\|_{\mathcal{C}}.
\]
From the above inequality, using Jensen's inequality, we obtain
\[
E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge \alpha\,E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i^0)\Bigr\|_{\mathcal{C}},
\]
and hence $R(P) \ge \alpha R(P^0)$.
Now suppose that Eq. (6.33) is false. Then there exists a sequence of measures $P_k$ on $(\mathcal{X}, \mathcal{A})$ such that $R(P_k) \ge 4^k$ for every k. Defining P as
\[
P = \sum_{j=1}^{\infty} 2^{-j} P_j = 2^{-k} P_k + (1 - 2^{-k}) \sum_{j \ne k} \frac{2^{-j} P_j}{1 - 2^{-k}},
\]
we find $R(P) \ge 2^{-k} R(P_k) \ge 2^{-k} 2^{2k} = 2^k$ for every k, and this yields R(P) = ∞, contradicting condition (6.32). Thus condition (6.33) holds.
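Now suppose that $\mathcal{C}$ is not VC. Then, for every k there is a set $A = A_k = \{x_1, \ldots, x_k\} \subset \mathcal{X}$ such that $\mathcal{C}$ shatters A, i.e. $\#\{C \cap A : C \in \mathcal{C}\} = 2^k$. Then for each $\alpha \in \mathbb{R}^k$ we have
\[
\sum_{i=1}^{k} |\alpha_i| = \sum_{i=1}^{k} \alpha_i^+ + \sum_{i=1}^{k} \alpha_i^- \le 2\max\Bigl\{\sum_{i=1}^{k} \alpha_i^+,\ \sum_{i=1}^{k} \alpha_i^-\Bigr\} \le 2\,\Bigl\|\sum_{i=1}^{k} \alpha_i 1_C(x_i)\Bigr\|_{\mathcal{C}},
\tag{6.34}
\]
where the last inequality in (6.34) holds with equality when C picks out the set of $x_i$'s corresponding to those $\alpha_i$'s yielding the maximum of $\sum \alpha_i^+$ and $\sum \alpha_i^-$. Now take $P = k^{-1}\sum_{i=1}^{k} \delta_{x_i}$ and choose n so large that $n > (4M)^2$. Then choose $k > 2n^2$ and let $\Omega_0 \equiv \cap_{i \ne j}[X_i \ne X_j]$. We have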
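\[
P(\Omega_0^c) = P\Bigl(\bigcup_{i \ne j \le n} [X_i = X_j]\Bigr) \le \sum_{i \ne j \le n} P(X_i = X_j) \le n^2 k^{-1} < 1/2,
\]
so that $P(\Omega_0) \ge 1/2$. Thus, recalling that R(P) ≤ M and the inequality (6.34), we obtain
\[
M\sqrt{n} \ge E\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}} \ge E\Bigl\{\Bigl\|\sum_{i=1}^{n} \varepsilon_i 1_C(X_i)\Bigr\|_{\mathcal{C}}\,1_{\Omega_0}\Bigr\} \ge \frac{n}{2}\,P(\Omega_0) \ge \frac{n}{4}.
\]
This inequality yields $n \le 4M\sqrt{n}$, which contradicts our choice of $n > (4M)^2$. It follows that $\mathcal{C}$ is VC. □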
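The key combinatorial step in the proof above is inequality (6.34). A minimal brute-force check is sketched below, under the assumption that the class shatters the points, so that the supremum over the class is a maximum over all subsets of indices. The helper name `sup_over_subsets` and the test vector are made up for illustration.

```python
from itertools import combinations

def sup_over_subsets(alpha):
    """Max over all index subsets A of |sum_{i in A} alpha_i|.

    When a class shatters x_1, ..., x_k, every subset A is picked out by some
    set in the class, so this equals the sup norm on the right of (6.34).
    """
    k = len(alpha)
    best = 0.0
    for r in range(k + 1):
        for idx in combinations(range(k), r):
            best = max(best, abs(sum(alpha[i] for i in idx)))
    return best

alpha = [1.5, -2.0, 0.25, -0.5, 3.0]          # arbitrary illustrative coefficients
pos = sum(a for a in alpha if a > 0)          # sum of positive parts
neg = sum(-a for a in alpha if a < 0)         # sum of negative parts
sup_norm = sup_over_subsets(alpha)
l1_norm = sum(abs(a) for a in alpha)
print(l1_norm, pos, neg, sup_norm)
```

With Rademacher signs in place of the coefficients, the same identity gives a supremum of at least n/2 on the event where all sample points are distinct, which is how the lower bound n/4 arises.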
6.5 Exercises
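Exercise 6.1 Suppose that $\{\bar{N}_q\}_{q=1}^{\infty}$ satisfy $\sum_q 2^{-q}(\log \bar{N}_q)^{1/2} < \infty$. Show that $N_q = \bar{N}_1 \cdots \bar{N}_q$ also satisfies $\sum_q 2^{-q}(\log N_q)^{1/2} < \infty$.
Solution.
\[
\begin{aligned}
\sum_{q \ge 1} 2^{-q}\sqrt{\log N_q}
&= \sum_{q \ge 1} 2^{-q}\sqrt{\log \prod_{p=1}^{q} \bar{N}_p}
 = \sum_{q \ge 1} 2^{-q}\sqrt{\sum_{p=1}^{q} \log \bar{N}_p} \\
&\le \sum_{q \ge 1} 2^{-q} \sum_{p=1}^{q} \sqrt{\log \bar{N}_p}
 = \sum_{p=1}^{\infty} \sqrt{\log \bar{N}_p} \sum_{q \ge p} 2^{-q} \\
&= 2\sum_{p=1}^{\infty} 2^{-p}\sqrt{\log \bar{N}_p}.
\end{aligned}
\]
Since $\sum_{p=1}^{\infty} 2^{-p}\sqrt{\log \bar{N}_p} < \infty$ by hypothesis, it follows that $\sum_{q \ge 1} 2^{-q}\sqrt{\log N_q} < \infty$. □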
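The chain of inequalities in this solution also holds for every finite truncation of the two series, which makes it easy to check numerically. The sketch below does so for an assumed sequence log N̄_q = q² (chosen only for illustration); `weighted_sums` is a hypothetical helper name.

```python
import math

def weighted_sums(log_nbar):
    """Return (sum_q 2^-q sqrt(log N_q), 2 * sum_p 2^-p sqrt(log Nbar_p)).

    log_nbar[q-1] is log Nbar_q; log N_q is the running sum of these values
    because N_q is the product Nbar_1 * ... * Nbar_q.
    """
    lhs = 0.0
    rhs = 0.0
    cumulative = 0.0
    for q, value in enumerate(log_nbar, start=1):
        cumulative += value
        lhs += 2.0 ** (-q) * math.sqrt(cumulative)
        rhs += 2.0 ** (-q) * math.sqrt(value)
    return lhs, 2.0 * rhs

log_nbar = [float(q * q) for q in range(1, 41)]  # assumed example: log Nbar_q = q^2
lhs, rhs = weighted_sums(log_nbar)
print(lhs, rhs)
```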
Exercise 6.2 Suppose that X is a random variable satisfying the weak second moment
condition t2 P (|X| > t) → 0 as t → ∞. Show that tE{|X|1{|X| > t}} → 0 as t → ∞.
Solution. Without loss of generality, we can assume X ≥ 0, because for a general random variable it suffices to replace X by |X|. We have
\[
\begin{aligned}
t\,E(X 1\{X > t\}) &= t\int_0^{\infty} P(X 1\{X > t\} > x)\,dx \\
&= t\int_0^{t} P(X 1\{X > t\} > x)\,dx + t\int_t^{\infty} P(X 1\{X > t\} > x)\,dx.
\end{aligned}
\tag{6.35}
\]
Since
\[
P(X 1\{X > t\} > x) =
\begin{cases}
P(X > t) & \text{if } 0 < x \le t, \\
P(X > x) & \text{if } x > t,
\end{cases}
\]
Eq. (6.35) becomes
\[
t\,E(X 1\{X > t\}) = t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx.
\]
By hypothesis, given ε > 0, we can find T such that for t ≥ T we have $t^2 P(X > t) < \varepsilon/2$, i.e. $P(X > t) < \varepsilon/(2t^2)$. Then, for t ≥ T,
\[
t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx \le \frac{\varepsilon}{2} + t\int_t^{\infty} \frac{\varepsilon}{2x^2}\,dx \le \varepsilon,
\]
so that $t\,E(X 1\{X > t\}) \le \varepsilon$. □
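Exercise 6.3 Show that for any non-negative random variable X we have the inequalities
\[
\|X\|_{2,\infty}^2 \le \sup_{t>0}\, t\,E\{X 1\{X > t\}\} \le 2\|X\|_{2,\infty}^2,
\]
where
\[
\|X\|_{2,\infty}^2 = \sup_{t>0}\, t^2 P(X > t).
\]
Solution. Recalling Exercise 6.2, we have
\[
\begin{aligned}
t\,E(X 1\{X > t\}) &= t^2 P(X > t) + t\int_t^{\infty} P(X > x)\,dx \\
&= t^2 P(X > t) + t\int_t^{\infty} \frac{1}{x^2}\,x^2 P(X > x)\,dx \\
&\le t^2 P(X > t) + t\,\|X\|_{2,\infty}^2 \int_t^{\infty} \frac{1}{x^2}\,dx \\
&= t^2 P(X > t) + \|X\|_{2,\infty}^2 \le 2\|X\|_{2,\infty}^2.
\end{aligned}
\]
It follows that
\[
t\,E(X 1\{X > t\}) \le 2\|X\|_{2,\infty}^2
\]
and hence
\[
\sup_{t>0}\, t\,E(X 1\{X > t\}) \le 2\|X\|_{2,\infty}^2.
\]
On the other hand,
\[
t\,E(X 1\{X > t\}) \ge t^2 P(X > t),
\]
so that
\[
\sup_{t>0}\, t\,E(X 1\{X > t\}) \ge \sup_{t>0}\, t^2 P(X > t) = \|X\|_{2,\infty}^2.
\]
□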
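The two-sided inequality of Exercise 6.3 can be checked in closed form for a simple law. The sketch below assumes, purely as an example, that P(X > t) = min(1, t⁻³), i.e. X has density 3x⁻⁴ on [1, ∞); then E[X 1{X > t}] = 3/(2 max(t, 1)²). Both suprema are evaluated on a grid, and the function names are hypothetical.

```python
def weak_l2_term(t):
    """t^2 P(X > t) for the assumed law P(X > t) = min(1, t^-3)."""
    return t * t * min(1.0, t ** -3)

def truncated_moment_term(t):
    """t E[X 1{X > t}] for the same law, using E[X 1{X > t}] = 1.5 / max(t, 1)^2."""
    s = max(t, 1.0)
    return t * 1.5 / (s * s)

grid = [k / 100.0 for k in range(1, 2001)]   # t in (0, 20], includes t = 1
weak_norm_sq = max(weak_l2_term(t) for t in grid)
middle = max(truncated_moment_term(t) for t in grid)
print(weak_norm_sq, middle, 2 * weak_norm_sq)
```

For this law both suprema are attained at t = 1, so the grid evaluation is exact and the middle quantity lies strictly between the two bounds.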
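Exercise 6.4 If the envelope function F of the class $\mathcal{F}$ is such that $\sup_x F(x) < \infty$, then
\[
\|G_{n,P}\|^*_{\mathcal{F}} = O_P(1) \iff \limsup_{n \to \infty} E^*_P \|G_{n,P}\|_{\mathcal{F}} < \infty.
\]
Solution. Let $\limsup_{n \to \infty} E^*_P \|G_{n,P}\|_{\mathcal{F}} < \infty$. Using Markov's inequality, we have
\[
P(\|G_{n,P}\|^*_{\mathcal{F}} > k) \le \frac{E^*_P \|G_{n,P}\|_{\mathcal{F}}}{k},
\]
so that
\[
\limsup_{n \to \infty} P(\|G_{n,P}\|^*_{\mathcal{F}} > k) \le \frac{1}{k}\,\limsup_{n \to \infty} E^*_P \|G_{n,P}\|_{\mathcal{F}}.
\]
Since
\[
\lim_{k \to \infty} \frac{1}{k}\,\limsup_{n \to \infty} E^*_P \|G_{n,P}\|_{\mathcal{F}} = 0,
\]
we obtain
\[
\lim_{k \to \infty}\ \limsup_{n \to \infty} P(\|G_{n,P}\|^*_{\mathcal{F}} > k) = 0,
\]
and so $\|G_{n,P}\|^*_{\mathcal{F}} = O_P(1)$.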
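Conversely, let $\|G_{n,P}\|^*_{\mathcal{F}} = O_P(1)$. By the Hoffman-Jørgensen inequality, if $X_1, \ldots, X_n$ are independent mean zero stochastic processes indexed by an arbitrary set T, then there exist constants $K_p$ and $0 < v_p < 1$ such that
\[
E^*\|S_n\|^p \le K_p\Bigl(E^* \max_{k \le n} \|X_k\|^p + G_n^{-1}(v_p)^p\Bigr),
\tag{6.36}
\]
where $G_n^{-1}$ denotes the quantile function of $\|S_n\|^*$. Take p = 1, $T = \mathcal{F}$, $X_i(f) = \frac{1}{\sqrt{n}}\bigl(f(X_i) - Pf\bigr)$. Then
\[
S_n(f) = \sum_{i=1}^{n} \frac{1}{\sqrt{n}}\bigl(f(X_i) - Pf\bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} f(X_i) - \sqrt{n}\,Pf \equiv G_{n,P} f,
\]
and so $\|S_n\|^* = \|G_{n,P}\|^*_{\mathcal{F}}$ and $G_n^{-1}$ is the quantile function of $\|G_{n,P}\|^*_{\mathcal{F}}$. Hence Eq. (6.36) becomes
\[
E^*\|G_{n,P}\|_{\mathcal{F}} \le K_1\Bigl(E^* \frac{1}{\sqrt{n}} \max_{i \le n} \|f(X_i) - Pf\|_{\mathcal{F}} + G_n^{-1}(v_1)\Bigr).
\tag{6.37}
\]
Since $\|G_{n,P}\|^*_{\mathcal{F}}$ is $O_P(1)$, the quantile $G_n^{-1}(v_1)$, for a fixed $v_1$, is O(1). Moreover, by the hypothesis of a bounded envelope function, there exists a constant M < ∞ such that $\sup_{f \in \mathcal{F}} |f(x)| \le M$. It follows that
\[
\max_{i \le n} \|f(X_i) - Pf\|_{\mathcal{F}} \le 2M.
\tag{6.38}
\]
From Eq. (6.38), recalling Eq. (6.37), we conclude that $E^*\|G_{n,P}\|_{\mathcal{F}} = O(1)$. □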
Chapter 7
VC-theory: bounding uniform
covering numbers
In this chapter we treat some classes of sets which are defined through combinatorial properties. It is important to remark that these classes satisfy the entropy conditions for the Donsker theorem and the Glivenko-Cantelli theorem. Thus they are P-Glivenko-Cantelli and P-Donsker under suitable moment conditions on their envelope function, provided the classes are suitably measurable.
7.1 Introduction
Let X be a set and C a collection of subsets of X . Consider an arbitrary n-point set
{x1 , . . . , xn }. Then we can give the following definitions.
Definition 7.1 We say that the collection C picks out a certain subset from {x1 , . . . , xn } if
this subset can be written as C ∩ {x1 , . . . , xn }, for some set C ∈ C .
Definition 7.2 The collection C is said to shatter {x1 , . . . , xn } if C picks out each of its $2^n$ subsets.
For every finite point set $\{x_1, \ldots, x_n\}$ in $\mathcal{X}$, we set
\[
\Delta_n(\mathcal{C}, x_1, \ldots, x_n) \equiv \#\{C \cap \{x_1, \ldots, x_n\} : C \in \mathcal{C}\},
\]
so that $\Delta_n(\mathcal{C}, x_1, \ldots, x_n)$ denotes the number of subsets of $\{x_1, \ldots, x_n\}$ picked out by the collection $\mathcal{C}$. Moreover, if we set
\[
m^{\mathcal{C}}(n) \equiv \max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n),
\]
we can define the following numbers:
\[
V(\mathcal{C}) \equiv \inf\{n : m^{\mathcal{C}}(n) < 2^n\}, \qquad
S(\mathcal{C}) \equiv \sup\{n : m^{\mathcal{C}}(n) = 2^n\}.
\]
Remark 7.1 In words, the VC-index V (C ) of the class C is the smallest n such that C shatters no set of size n. By definition, a VC-class of sets picks out strictly fewer than $2^n$ subsets from any set of n ≥ V (C ) elements.
Remark 7.2 Note that the infimum over the empty set is taken to be infinity and the
supremum over the empty set is taken to be −1. So we can conclude that V (C ) = ∞ if
and only if C shatters sets of arbitrarily large size.
Remark 7.3 It is easy to show that the following equality holds:
S(C ) = V (C ) − 1.
Now we are able to give the following definition.
Definition 7.3 A collection C is called a VC-class if V (C ) < ∞, or equivalently S(C ) <
∞.
The next property follows directly from the definitions.
Proposition 7.1 A class C of subsets of a set X has V (C ) = 0, or equivalently S(C ) =
−1, if and only if C = ∅. Also, V (C ) = 1, or equivalently S(C ) = 0 if and only if
C contains exactly one set. Thus S(C ) ≥ 1 if and only if C contains at least two sets.
Proof. The first two statements are consequences of the definition. A class C shatters the empty set if and only if C contains at least one set. If C contains at least two sets, then for some A, B ∈ C and x ∈ X , x ∈ A\B. Then C shatters {x}, so S(C ) ≥ 1. Conversely, if S(C ) ≥ 1, then C contains at least two sets. □
Before introducing the main result of this chapter, involving the covering numbers of any VC-class, it is useful to give some examples and preliminary results.
Example 7.1 Let X = R and C = { (−∞, b ] : b ∈ R }. C shatters no two-point set {x1 , x2 }, because it cannot pick out {x1 ∨ x2 }. Hence its VC-index is 2 and C is a VC-class.
Example 7.2 Let X = R and C = { (a, b ] : a, b ∈ R, a < b }. C shatters no three-point set {x1 , x2 , x3 } with x1 < x2 < x3 , because it cannot pick out {x1 , x3 }. Hence its VC-index is 3 and C is a VC-class.
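The two examples above can be verified by brute force on a small grid of points. The sketch below is only an illustration under stated assumptions: points and endpoints are restricted to a finite grid (which is harmless here because what these classes pick out depends only on the ordering of the points), and the names `picked_out`, `shatters`, `vc_index_on_grid`, `half_lines` and `intervals` are hypothetical.

```python
from itertools import combinations

def picked_out(points, sets):
    """Distinct subsets of `points` of the form C ∩ points, C ranging over `sets`."""
    return {frozenset(x for x in points if contains(x)) for contains in sets}

def shatters(points, sets):
    """True if every one of the 2^n subsets of `points` is picked out."""
    return len(picked_out(points, sets)) == 2 ** len(points)

def vc_index_on_grid(make_sets, grid, max_n=4):
    """Smallest n such that no n-point subset of the grid is shattered."""
    sets = make_sets(grid)
    for n in range(max_n + 1):
        if not any(shatters(pts, sets) for pts in combinations(grid, n)):
            return n
    return None  # every size up to max_n is shattered on this grid

def half_lines(grid):
    """Membership tests for (-inf, b], b ranging over cut points around the grid."""
    cuts = [grid[0] - 1] + list(grid)
    return [lambda x, b=b: x <= b for b in cuts]

def intervals(grid):
    """Membership tests for (a, b] with a < b taken from the same cut points."""
    cuts = [grid[0] - 1] + list(grid)
    return [lambda x, a=a, b=b: a < x <= b for a in cuts for b in cuts if a < b]

grid = [1, 2, 3, 4, 5]
print(vc_index_on_grid(half_lines, grid), vc_index_on_grid(intervals, grid))
```

The search is exponential in the number of points, so it is only meant for tiny grids like this one.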
Remark 7.4 Let $\mathcal{X} = \mathbb{R}^d$. Suppose that $\mathcal{C} = \{(-\infty, b] : b \in \mathbb{R}^d\}$; it can be shown that $\mathcal{C}$ is a VC-class and its VC-index is d + 1. Similarly, if $\mathcal{C} = \{(a, b] : a, b \in \mathbb{R}^d,\ a < b\}$, it can be proved that $\mathcal{C}$ is VC with VC-index 2d + 1.
For a VC-class, the following lemma implies that $m^{\mathcal{C}}(n)$ grows at most polynomially in n.
Lemma 7.1 (VC, Sauer and Shelah) For a VC-class of sets with VC-index $V(\mathcal{C})$, setting $S = S(\mathcal{C})$, it holds that
\[
m^{\mathcal{C}}(n) \le \sum_{j=0}^{S} \binom{n}{j} \le \Bigl(\frac{ne}{S}\Bigr)^{S}, \quad \text{for } n \ge S.
\tag{7.1}
\]
Proof.
Begin with the first inequality. By definition, for a VC-class all shattered sets are among those of size at most $V(\mathcal{C}) - 1 = S$. Now, the sum in (7.1) is the number of subsets of $\{x_1, \ldots, x_n\}$ of size at most S, hence an upper bound on the number of subsets shattered by $\mathcal{C}$, and we know that the number of shattered subsets gives an upper bound on $\Delta_n(\mathcal{C}, x_1, \ldots, x_n)$. It follows that $m^{\mathcal{C}}(n)$ is bounded above by the same sum. For the second inequality, let Y ∼ Binomial(n, 1/2); by Markov's inequality we get
\[
\sum_{j=0}^{S} \binom{n}{j} = 2^n \sum_{j=0}^{S} \binom{n}{j}\Bigl(\frac{1}{2}\Bigr)^{n} = 2^n P(Y \le S) \le 2^n E\,r^{Y-S}, \quad \text{for any } 0 < r \le 1.
\]
Recalling that Y is a binomial random variable, we can compute this expectation and obtain
\[
2^n E\,r^{Y-S} = 2^n r^{-S}\Bigl(\frac{1}{2} + \frac{r}{2}\Bigr)^{n} = r^{-S}(1 + r)^n;
\]
by choosing r = S/n and recalling the definition of e, the previous quantity becomes
\[
\Bigl(\frac{n}{S}\Bigr)^{S}\Bigl(1 + \frac{S}{n}\Bigr)^{n} \le \Bigl(\frac{n}{S}\Bigr)^{S} e^{S}.
\]
□
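The second inequality in (7.1) is easy to check numerically. The sketch below compares the binomial sum with (ne/S)^S over a small range of n and S; the range and the function names are arbitrary choices made for illustration.

```python
import math

def sauer_sum(n, s):
    """Sum of binomial coefficients C(n, j) for j = 0, ..., s."""
    return sum(math.comb(n, j) for j in range(s + 1))

def sauer_upper(n, s):
    """The polynomial bound (n e / s)^s, valid for n >= s >= 1."""
    return (n * math.e / s) ** s

# Check the inequality on a small grid of (n, S) pairs with n >= S >= 1.
violations = [
    (n, s)
    for s in range(1, 6)
    for n in range(s, 61)
    if sauer_sum(n, s) > sauer_upper(n, s)
]
print(violations)
```

For the intervals (a, b] of Example 7.2 one has S = 2, and the binomial sum 1 + n + n(n − 1)/2 is exactly the number of subsets of n ordered points that consist of consecutive points, so the first inequality in (7.1) can be attained.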
The following theorem gives two sufficient conditions for S(C ) = 1.
Theorem 7.1 Suppose that C is a collection of at least two subsets of a set X . Then
S(C ) = 1 if either of the following statements holds:
(i) the collection C is linearly ordered by inclusion,
(ii) any two sets in C are disjoint.
Proof. By Proposition 7.1, in either case S(C ) ≥ 1. Under (i), suppose that C shatters {x, y}. Let A, B ∈ C with A ∩ {x, y} = {x} and B ∩ {x, y} = {y}. Because C is linearly ordered by inclusion, it must be that A ⊂ B or B ⊂ A, yielding a contradiction. Under (ii), we can argue as in part (i): take A and B as above and C ∈ C with {x, y} ⊂ C; then C is disjoint from neither A nor B, which gives a contradiction. □
We now give an example of a class of sets for which the VC-property fails.
Example 7.3 Let $\mathcal{X} = [0, 1]$ and let $\mathcal{C}$ be the class of all finite subsets of $\mathcal{X}$. Let P be the uniform (Lebesgue) law on [0, 1]. Then $S(\mathcal{C}) = \infty$ and so $\mathcal{C}$ is not a VC-class. Moreover, for any possible value of $P_n$, we have $P_n(A) = 1$ for some $A = \{X_1, \ldots, X_n\} \in \mathcal{C}$ while P(A) = 0. Thus $\|P_n - P\|_{\mathcal{C}} = \sup_{A \in \mathcal{C}} |(P_n - P)(A)| = 1$ for all n. It follows that $\mathcal{C}$ is not a Glivenko-Cantelli class for P, nor a Donsker class.
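We now find an upper bound for the covering numbers of VC-classes.
Theorem 7.2 There exists a universal constant K such that for any VC-class $\mathcal{C}$ of sets, any probability measure Q, any r ≥ 1, and 0 < ε ≤ 1,
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) \le \Bigl(\frac{K\log(3e/\varepsilon^r)}{\varepsilon^r}\Bigr)^{S(\mathcal{C})} \le \Bigl(\frac{K'}{\varepsilon}\Bigr)^{r S(\mathcal{C}) + \delta}, \quad \delta > 0.
\tag{7.2}
\]
Moreover,
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) \le \tilde{K}\,V(\mathcal{C})\Bigl(\frac{4e}{\varepsilon^r}\Bigr)^{S(\mathcal{C})},
\tag{7.3}
\]
where $\tilde{K}$ is universal.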
Proof. The proof of (7.3) is very long, so it is omitted. We prove instead inequality (7.2), which is a weaker result. The upper bound for a general r is an easy consequence of the bound for r = 1. Thus let r = 1, fix 0 < ε ≤ 1 and let m be the packing number for the collection $\mathcal{C}$, i.e. $m = D(\varepsilon, \mathcal{C}, L_1(Q))$. We know that $N(\varepsilon, \mathcal{C}, L_1(Q)) \le D(\varepsilon, \mathcal{C}, L_1(Q))$, so if $m \le (K\log(3e))^{S(\mathcal{C})}$ the claim is trivially obtained. Thus, assuming $m > (K\log(3e))^{S(\mathcal{C})}$, it suffices to show the claimed bound when $\log m > S(\mathcal{C}) \ge 1$, so that m > e > 2. By definition of packing number, there exist m sets $C_1, \ldots, C_m \in \mathcal{C}$ such that for any pair i ≠ j
\[
Q(C_i \,\triangle\, C_j) = E_Q|1_{C_i} - 1_{C_j}| > \varepsilon.
\]
Let $X_1, \ldots, X_n$ be i.i.d. Q. Observe that $C_i$ and $C_j$ pick out the same subset of $\{X_1, \ldots, X_n\}$ if and only if $X_k \notin C_i \triangle C_j$ for all k ≤ n. If each $C_i \triangle C_j$ contains some $X_k$, then all $C_i$'s pick out different subsets, and $\mathcal{C}$ picks out at least m subsets from $\{X_1, \ldots, X_n\}$. Thus we compute the probability that this event does not occur:
\[
\begin{aligned}
Q\bigl([\text{for all } i \ne j,\ X_k \in C_i \triangle C_j \text{ for some } k \le n]^c\bigr)
&= Q\bigl([\text{for some } i \ne j,\ X_k \notin C_i \triangle C_j \text{ for all } k \le n]\bigr) \\
&\le \sum_{i<j} Q\bigl([X_k \notin C_i \triangle C_j \text{ for all } k \le n]\bigr) \\
&\le \binom{m}{2} \max_{i<j}\,[1 - Q(C_i \triangle C_j)]^n \\
&\le \binom{m}{2}(1 - \varepsilon)^n \le \binom{m}{2} e^{-n\varepsilon}.
\end{aligned}
\tag{7.4}
\]
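The bound (7.4) is strictly less than 1 when n is large enough. In particular this holds if
\[
n > \frac{\log\binom{m}{2}}{\varepsilon} = \frac{\log(m(m-1)/2)}{\varepsilon}.
\]
For all m ≥ 1 we have $m(m-1)/2 \le m^2$, thus the bound (7.4) is less than 1 if
\[
n = 2\log m/\varepsilon
\]
(rounded up to an integer, so that n ≤ 3 log m/ε because log m > 1 and ε ≤ 1). For this value of n we have
\[
Q\bigl([\text{for all } i \ne j,\ X_k \in C_i \triangle C_j \text{ for some } k \le n]\bigr) > 0.
\]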
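Thus we can find n points $X_1(\omega), \ldots, X_n(\omega)$ such that
\[
m \le \Delta_n(\mathcal{C}, X_1(\omega), \ldots, X_n(\omega)) \le \max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n) \le \Bigl(\frac{en}{S}\Bigr)^{S},
\tag{7.5}
\]
where (7.5) holds by virtue of Lemma 7.1 with $S = S(\mathcal{C}) = V(\mathcal{C}) - 1$. If n = 2 log m/ε, inequality (7.5) implies that
\[
m \le \Bigl(\frac{3e\log m}{S\varepsilon}\Bigr)^{S}
\iff \frac{m^{1/S}}{\log m} \le \frac{3e}{S\varepsilon}
\iff g(m^{1/S}) \le \frac{3e}{\varepsilon},
\tag{7.6}
\]
where the function g is defined as g(x) = x/ log x. Inequality (7.6) yields
\[
m^{1/S} \le \frac{e}{e-1}\,\frac{3e}{\varepsilon}\,\log\frac{3e}{\varepsilon}
\tag{7.7}
\]
or
\[
D(\varepsilon, \mathcal{C}, L_1(Q)) = m \le \Bigl(\frac{e}{e-1}\,\frac{3e}{\varepsilon}\,\log\frac{3e}{\varepsilon}\Bigr)^{S}.
\tag{7.8}
\]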
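Recalling that $N(\varepsilon, \mathcal{C}, L_1(Q)) \le D(\varepsilon, \mathcal{C}, L_1(Q))$, inequality (7.2) holds for r = 1 with $K = 3e^2/(e-1)$. If r > 1, we have
\[
\|1_C - 1_D\|_{L_1(Q)} = Q(C \,\triangle\, D) = \|1_C - 1_D\|^r_{L_r(Q)},
\]
so that
\[
N(\varepsilon, \mathcal{C}, L_r(Q)) = N(\varepsilon^r, \mathcal{C}, L_1(Q)) \le \Bigl(K\varepsilon^{-r}\log\frac{3e}{\varepsilon^r}\Bigr)^{S}.
\]
This completes the proof of (7.2). □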
Definition 7.4 Let f : X → R be a function. The subgraph of f is the set
{(x, t) ∈ X × R : t < f (x)} .
Definition 7.5 Let F be a class of real-valued functions on X . If the collection of all the subgraphs of the functions in F forms a VC-class of sets in X × R, we say that F is a VC-subgraph class, or just VC-class. We denote by V (F ) the VC-index of the set of subgraphs of functions in F .
The following theorem gives an important result which involves covering numbers of VC-classes of functions. These are bounded by a polynomial in 1/ε.
Theorem 7.3 For a VC-subgraph class $\mathcal{F}$ with envelope function F, for any r ≥ 1, for any probability measure Q with $\|F\|_{Q,r} > 0$, and for 0 < ε < 1, there exists a universal constant K such that
\[
N(\varepsilon\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le K\,V(\mathcal{F})\,(16e)^{V(\mathcal{F})}\Bigl(\frac{1}{\varepsilon}\Bigr)^{r(V(\mathcal{F})-1)}.
\tag{7.9}
\]
Proof.
For each $f \in \mathcal{F}$, denote by $C_f$ its subgraph and by $\mathcal{C}$ the collection of all $C_f$'s. Let λ be the Lebesgue measure on $\mathbb{R}$; by Fubini's theorem we obtain
\[
Q|f - g| = (Q \times \lambda)(C_f \,\triangle\, C_g).
\]
Renormalize Q × λ to a probability measure on $\{(x, t) : |t| \le F(x)\}$ by defining $P = (Q \times \lambda)/(2QF)$. By Theorem 7.2 we can find a universal constant K such that
\[
N(\varepsilon\,2QF, \mathcal{F}, L_1(Q)) = N(\varepsilon, \mathcal{C}, L_1(P)) \le K\,V(\mathcal{F})\Bigl(\frac{4e}{\varepsilon}\Bigr)^{V(\mathcal{F})-1}.
\]
For r > 1, denote by R the probability measure with density $F^{r-1}/Q(F^{r-1})$ with respect to Q. Then
\[
Q|f - g|^r \le Q|f - g|(2F)^{r-1} = 2^{r-1}\,R|f - g|\;Q(F^{r-1}).
\]
Thus the $L_r(Q)$-distance is bounded by the distance $2\,(Q(F^{r-1}))^{1/r}\,\|f - g\|^{1/r}_{R,1}$. By (7.3) we conclude that
\[
N(\varepsilon\,2\|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le N(\varepsilon^r RF, \mathcal{F}, L_1(R)) \le K\,V(\mathcal{F})\Bigl(\frac{8e}{\varepsilon^r}\Bigr)^{V(\mathcal{F})-1}.
\]
□
The following propositions give basic methods for generating VC-classes of sets and functions. We first introduce the following definition.
Definition 7.6 Let $\mathcal{F}$ be a collection of real-valued functions on a set $\mathcal{X}$. We define the following sets:
\[
\begin{aligned}
\mathrm{pos}(f) &= \{x : f(x) > 0\}, & \mathrm{pos}(\mathcal{F}) &= \{\mathrm{pos}(f) : f \in \mathcal{F}\}; \\
\mathrm{nn}(f) &= \{x : f(x) \ge 0\}, & \mathrm{nn}(\mathcal{F}) &= \{\mathrm{nn}(f) : f \in \mathcal{F}\}.
\end{aligned}
\]
Proposition 7.2 Let F be an r-dimensional real vector space of functions on X , let g
be any real function on X , and let g + F ≡ {g + f : f ∈ F } . Then
(i) S(pos(g + F )) = S(nn(g + F )) = r
(ii) S(pos(F )) = S(nn(F )) = r
(iii) S(F ) ≤ r + 1
Proof. We prove (ii). Let $v = \dim(\mathcal{F}) + 1 = r + 1$ and let $x_1, \ldots, x_v$ be v distinct points of $\mathcal{X}$. Define the map $A : \mathcal{F} \to \mathbb{R}^v$ as $A(f) = (f(x_1), \ldots, f(x_v))$. Since $\dim(\mathcal{F}) = r = v - 1$, we also have $\dim(A(\mathcal{F})) \le v - 1$. Thus we can find a nonzero vector $b = (b_1, \ldots, b_v) \in \mathbb{R}^v$ which is orthogonal to $A(\mathcal{F})$, i.e.
\[
0 = \sum_{i=1}^{v} b_i f(x_i) \quad \text{for all } f \in \mathcal{F},
\]
and thus
\[
\sum_{i:\, b_i \ge 0} b_i f(x_i) = -\sum_{i:\, b_i < 0} b_i f(x_i).
\]
Assume, without loss of generality, that $\{i \le v : b_i < 0\}$ is not empty. If there were a function $f \in \mathcal{F}$ such that $\{f \ge 0\} \cap \{x_1, \ldots, x_v\} = \{x_i : b_i \ge 0\}$, the left side of the last equality would be non-negative, while the right side would be strictly negative. This yields a contradiction. Thus there exists a subset of $\{x_1, \ldots, x_v\}$ which is not obtained as the intersection of $\{x_1, \ldots, x_v\}$ and $\{f \ge 0\}$ for any $f \in \mathcal{F}$. Hence $\mathrm{nn}(\mathcal{F})$ is VC and $S(\mathrm{nn}(\mathcal{F})) \le r$. But since $\dim(\mathcal{F}) = r$, there is some set $\{x_1, \ldots, x_r\}$ with $A(\mathcal{F}) = \mathbb{R}^r$, thus all subsets of $\{x_1, \ldots, x_r\}$ are of the form $B \cap \{x_1, \ldots, x_r\}$ for $B \in \mathrm{nn}(\mathcal{F})$. This implies that $S(\mathrm{nn}(\mathcal{F})) \ge r$. Hence we have shown that $S(\mathrm{nn}(\mathcal{F})) = r$. Finally, we know that for any set $\mathcal{X}$ and any $\mathcal{C} \subset 2^{\mathcal{X}}$, the class $\mathcal{D}$ of complements in $\mathcal{X}$ of the sets in $\mathcal{C}$ satisfies $S(\mathcal{D}) = S(\mathcal{C})$. Hence $S(\mathrm{pos}(\mathcal{F})) = r$ by taking complements. □
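Example 7.4 Suppose that $\mathcal{X} = \mathbb{R}^d$ and set
\[
H(u, t) = \{y \in \mathbb{R}^d : \langle y, u \rangle \le t\}.
\]
Consider the set $\mathcal{C} = \mathcal{H}_d = \{H(u, t) : u \in \mathbb{R}^d,\ t > 0\}$. Let $\mathcal{F}$ be the space spanned by 1 and the coordinate functions $x_1, \ldots, x_d$; then $\dim(\mathcal{F}) = d + 1$. Moreover,
\[
H(u, t) = \{x \in \mathbb{R}^d : \langle x, u \rangle \le t\} = \{x \in \mathbb{R}^d : t - \langle x, u \rangle \ge 0\} = \{x : f_{t,u}(x) \ge 0\},
\]
where $f_{t,u}(x) = t - \langle x, u \rangle$ ranges in $\mathcal{F}$. By Proposition 7.2, $S(\mathcal{H}_d) = d + 1$.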
Example 7.5 Suppose that $\mathcal{X} = \mathbb{R}^d$ and consider
\[
B(x, t) = \{y \in \mathbb{R}^d : |y - x| \le t\}.
\]
Set $\mathcal{C} = \mathcal{B}_d = \{B(x, t) : x \in \mathbb{R}^d,\ t > 0\}$. Let $\mathcal{F}$ be the space spanned by the functions $f_j(x) = x_j$ (j = 1, . . . , d) and the constant function 1, and let g be the function defined as $g(x) = -|x|^2$. Thus $\dim(\mathcal{F}) = d + 1$. Moreover,
\[
\begin{aligned}
B(x, t) = \{y : |y - x| \le t\} &= \{y : |y|^2 - 2\langle y, x \rangle + |x|^2 \le t^2\} \\
&= \{y : 2\langle y, x \rangle - |y|^2 - |x|^2 + t^2 \ge 0\} \\
&= \{y : g(y) + f_{t,x}(y) \ge 0\},
\end{aligned}
\]
where $f_{t,x}(y) = 2\langle y, x \rangle - |x|^2 + t^2$ ranges in $\mathcal{F}$. Since $\mathcal{B}_d = \mathrm{nn}(g + \mathcal{F})$ and $S(\mathrm{nn}(g + \mathcal{F})) = d + 1$ by Proposition 7.2, it follows that $S(\mathcal{B}_d) = d + 1$.
The next propositions may be regarded as stability properties.
Proposition 7.3 Assume that C and D are VC-classes of subsets of a set X , and suppose that φ : X → Y and ψ : Z → X are fixed functions. Then each of the following statements holds:
(i) C c = {C c : C ∈ C } is VC and S(C c ) = S(C ),
(ii) C ⊓ D = {C ∩ D : C ∈ C , D ∈ D} is VC,
(iii) C ⊔ D = {C ∪ D : C ∈ C , D ∈ D} is VC,
(iv) φ(C ) is VC if φ is one-to-one,
(v) ψ −1 (C ) is VC and S(ψ −1 (C )) ≤ S(C ) with equality if ψ is onto X ,
(vi) the sequential closure of C for pointwise convergence of indicator functions is VC,
(vii) for VC-classes C and D in sets X and Y , C × D = {C × D : C ∈ C , D ∈ D} is VC.
Proof. By definition $\mathcal{C}^c$ picks out the points of a given set $\{x_1, \ldots, x_m\}$ that $\mathcal{C}$ does not pick out. Hence if $\mathcal{C}$ shatters a given set of points, so does $\mathcal{C}^c$. Thus $\mathcal{C}$ is VC if and only if $\mathcal{C}^c$ is VC, and the VC indices are equal. The proof that the collection of all intersections is VC is easy upon using Lemma 7.1, according to which a VC-class can pick out only a polynomial number of subsets. From n points $\mathcal{C}$ can pick out at most $O(n^{S(\mathcal{C})})$ subsets; from each of these subsets $\mathcal{D}$ can pick out at most $O(n^{S(\mathcal{D})})$ further subsets. Thus $\mathcal{C} \sqcap \mathcal{D}$ can pick out at most $O(n^{S(\mathcal{C}) + S(\mathcal{D})})$ subsets. For large n this is well below $2^n$. The result for the unions follows from a combination of (i) and (ii), because $C \cup D = (C^c \cap D^c)^c$. To prove (iv), observe that if $\phi(\mathcal{C})$ shatters $\{y_1, \ldots, y_n\}$, then each $y_i$ must be in the range of φ and there are $x_1, \ldots, x_n$ such that φ is a bijection between $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$. Hence $\mathcal{C}$ must shatter $\{x_1, \ldots, x_n\}$. To prove (v), note that if $\psi^{-1}(\mathcal{C})$ shatters $\{z_1, \ldots, z_n\}$, then all $\psi(z_i)$ must be different, and the restriction of ψ to $z_1, \ldots, z_n$ is a bijection onto its range, so $\psi^{-1}(\mathcal{C})$ is VC by the previous result. The proof of (vii) is an immediate consequence of (ii): in fact $\mathcal{C} \times \mathcal{Y}$ and $\mathcal{X} \times \mathcal{D}$ are VC-classes, and hence their intersection $\mathcal{C} \times \mathcal{D}$ is VC by (ii). Finally we prove (vi). Consider any set of points $x_1, \ldots, x_n$ and any set $\bar{C}$ in the sequential closure. Suppose that $\bar{C}$ is the pointwise limit of a net $C_\alpha$; then for large α we get
\[
1_{\bar{C}}(x_i) = 1_{C_\alpha}(x_i) \quad \text{for each } i.
\]
For such α the set $C_\alpha$ picks out the same subset as $\bar{C}$, so every subset picked out by the closure is also picked out by $\mathcal{C}$. □
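Proposition 7.4 Let ⊙ denote one of ⊓, ⊔, or × and set
\[
S_{\odot}(j, k) = \max\{S(\mathcal{C} \odot \mathcal{D}) : S(\mathcal{C}) = j,\ S(\mathcal{D}) = k\}.
\]
Then, for each $j, k \in \mathbb{N}$,
\[
S_{\sqcap}(j, k) = S_{\sqcup}(j, k) = S_{\times}(j, k) \equiv S(j, k)
\]
and
\[
S(j, k) \le \sup\{r \in \mathbb{N} : {}_rC_{\le j}\;{}_rC_{\le k} \ge 2^r\} \equiv T(j, k),
\]
where
\[
{}_rC_{\le j} = \sum_{l=0}^{j} \binom{r}{l}.
\]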
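Proof. The first equality follows by taking complements. For the second, for given j and k, we can consider large enough finite sets in place of $\mathcal{X}$ and $\mathcal{Y}$, so we can assume $\mathcal{X} = \mathcal{Y}$. We have $S_{\sqcap}(j, k) \le S_{\times}(j, k)$ by restricting to the diagonal in $\mathcal{X} \times \mathcal{X}$. On the other hand, let $\Pi_X$ and $\Pi_Y$ be the projections of $\mathcal{X} \times \mathcal{Y}$ onto $\mathcal{X}$ and $\mathcal{Y}$. Let $\mathcal{C} \subset 2^{\mathcal{X}}$ and $\mathcal{A} \subset 2^{\mathcal{Y}}$. Set $\mathcal{F} = \{\Pi_X^{-1}(C) : C \in \mathcal{C}\}$ and $\mathcal{B} = \{\Pi_Y^{-1}(A) : A \in \mathcal{A}\}$. Then $S(\mathcal{F}) = S(\mathcal{C})$ and $S(\mathcal{B}) = S(\mathcal{A})$. Since $\Pi_X^{-1}(C) \cap \Pi_Y^{-1}(A) \equiv C \times A$, it follows that $S(\mathcal{F} \sqcap \mathcal{B}) \ge S(\mathcal{C} \times \mathcal{A})$, and thus the proof is complete. □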
The next proposition gives the corresponding stability properties for VC-classes of functions.
Proposition 7.5 Assume that F and G are VC-subgraph classes of functions on a set
X , and suppose that g : X → R and φ : R → R and ψ : Z → X are fixed functions.
Then each of the following statements holds
(i) F ∨ G = {f ∨ g : f ∈ F , g ∈ G} is VC-subgraph,
(ii) F ∧ G = {f ∧ g : f ∈ F , g ∈ G} is VC-subgraph,
(iii) {F > 0} = {{f > 0} : f ∈ F } is VC,
(iv) −F is VC-subgraph,
(v) g + F = {g + f : f ∈ F } is VC-subgraph,
(vi) g · F = {g · f : f ∈ F } is VC-subgraph,
(vii) F ◦ ψ = {f (ψ) : f ∈ F } is VC-subgraph,
(viii) φ ◦ F = {φ( f ) : f ∈ F } is VC-subgraph for monotone φ.
Proof. The subgraphs of $f \vee g$ and $f \wedge g$ are the union and the intersection of the subgraphs of f and g, thus (i) and (ii) are consequences of Proposition 7.3. To see that (iii) holds, note that the sets {f > 0} are one-to-one images of the intersections of the subgraphs with the set $\mathcal{X} \times \{0\}$. Hence the class $\{\mathcal{F} > 0\}$ is VC by (ii) and (iv) of the preceding proposition. The subgraphs of the class $-\mathcal{F}$ are the images of the open subgraphs of $\mathcal{F}$ under the map (x, t) → (x, −t), and the open subgraphs are the complements of the closed subgraphs, which are VC. Thus (iv) follows from the previous proposition. For (v), observe that the subgraphs of $\mathcal{F} + g$ shatter a given set of points $(x_1, t_1), \ldots, (x_n, t_n)$ if and only if the subgraphs of $\mathcal{F}$ shatter the set $(x_i, t_i - g(x_i))$. Thus we get that $g + \mathcal{F}$ is VC-subgraph. For (vi), the subgraph of the function fg is the union of the sets
\[
\begin{aligned}
C^+ &= \{(x, t) : t < f(x)g(x),\ g(x) > 0\}, \\
C^- &= \{(x, t) : t < f(x)g(x),\ g(x) < 0\}, \\
C^0 &= \{(x, t) : t < 0,\ g(x) = 0\}.
\end{aligned}
\]
Thus it suffices to show that these sets are VC in $(\mathcal{X} \cap \{g > 0\}) \times \mathbb{R}$, $(\mathcal{X} \cap \{g < 0\}) \times \mathbb{R}$ and $(\mathcal{X} \cap \{g = 0\}) \times \mathbb{R}$. For instance, let $\{i : (x_i, t_i) \in C^-\}$ be the set of indices of the points $(x_i, t_i/g(x_i))$ picked out by the open subgraphs of $\mathcal{F}$. These are the complements of the closed subgraphs and hence form a VC-class. The subgraphs of $\mathcal{F} \circ \psi$ are the inverse images of the subgraphs of the functions in $\mathcal{F}$ under the map (z, t) → (ψ(z), t). Hence (vii) follows from (v) of Proposition 7.3. To prove statement (viii), assume that the subgraphs of $\phi \circ \mathcal{F}$ shatter the set of points $(x_1, t_1), \ldots, (x_n, t_n)$. Choose $f_1, \ldots, f_m$ from $\mathcal{F}$ such that the subgraphs of the functions $\phi \circ f_j$ pick out all $m = 2^n$ subsets. For each i, set $s_i = \max\{f_j(x_i) : \phi(f_j(x_i)) \le t_i\}$. Now, $s_i < f_j(x_i)$ if and only if $t_i < \phi \circ f_j(x_i)$, for every i and j, and the subgraphs of $f_1, \ldots, f_m$ shatter the points $(x_i, s_i)$. This completes the proof. □
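Definition 7.7 A class $\mathcal{F}$ of real functions on a set $\mathcal{X}$ is called a Euclidean class for the envelope function $F$ if there exist constants $A$ and $V$ such that, for $0 < \varepsilon \le 1$, one has
\[
N(\varepsilon \|F\|_{Q,1}, \mathcal{F}, L_1(Q)) \le A \varepsilon^{-V},
\]
for every measure $Q$ with $0 < \|F\|_{Q,1} = QF < \infty$.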
Remark 7.5 It is important to observe that the constants $A$ and $V$ must not depend on $Q$.
Remark 7.6 Note that if $\mathcal{F}$ is Euclidean, then for each $r > 1$ and $0 < \varepsilon \le 1$, one has
\[
N(\varepsilon \|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le A\, 2^{rV} \varepsilon^{-rV},
\]
whenever $0 < QF^r < \infty$, as follows from the definition applied to $N(2(\varepsilon/2)^r \|F\|_{\mu,1}, \mathcal{F}, L_1(\mu))$ for the measure $\mu(\cdot) = Q(\cdot\,(2F)^{r-1})$.
Euclidean classes enjoy several stability properties, such as the following one.
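Proposition 7.6 Assume that $\mathcal{F}$ and $\mathcal{G}$ are Euclidean classes of functions with envelopes $F$ and $G$ respectively, and suppose that $Q$ is a measure with $QF^r < \infty$ and $QG^r < \infty$ for some $r \ge 1$. Then the class of functions
\[
\mathcal{F} + \mathcal{G} = \{f + g : f \in \mathcal{F},\ g \in \mathcal{G}\}
\]
is Euclidean for the envelope $F + G$ and
\[
N((2\varepsilon + 2\delta) \|F + G\|_{Q,r}, \mathcal{F} + \mathcal{G}, L_r(Q))
\le N(\varepsilon \|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \cdot N(\delta \|G\|_{Q,r}, \mathcal{G}, L_r(Q)).
\]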
7.2 Convex Hulls
Definition 7.8 Let $\mathcal{Y}$ be a vector space and $A \subset \mathcal{Y}$. The convex hull of $A$ is the set
\[
\mathrm{conv}(A) = \Big\{ \sum_{i=1}^{k} t_i y_i :\ y_i \in A,\ t_i \ge 0,\ \sum_{j} t_j \le 1 \Big\}
\]
for some integer $k$.
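Definition 7.9 Let $\mathcal{F}$ be a class of functions. The convex hull of $\mathcal{F}$ is defined as the set
\[
\mathrm{conv}(\mathcal{F}) = \Big\{ \sum_{i=1}^{m} \alpha_i f_i :\ f_i \in \mathcal{F},\ \alpha_i > 0,\ \sum_{i=1}^{m} \alpha_i = 1 \Big\}.
\]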
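Definition 7.10 Let $\mathcal{F}$ be a class of functions. The symmetric convex hull of $\mathcal{F}$ is
\[
\mathrm{sconv}(\mathcal{F}) = \Big\{ \sum_{i=1}^{m} \alpha_i f_i :\ f_i \in \mathcal{F},\ \sum_{i=1}^{m} |\alpha_i| \le 1 \Big\}.
\]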
Definition 7.11 A collection of measurable functions F is a VC-hull class if there exists
a VC-class G of functions, such that f ∈ F is the pointwise limit of a sequence of
functions fm contained in sconv(G).
Suppose that $\mathcal{F}$ is a class of measurable functions. An upper bound for the $L_2$ covering numbers of the convex hull $\mathrm{conv}(\mathcal{F})$ can be obtained once an upper bound for the $L_2$ covering numbers of the class $\mathcal{F}$ itself is known.
Theorem 7.4 Assume that $Q$ is a probability measure on $(\mathcal{X}, \mathcal{A})$, and let $\mathcal{F}$ be a class of measurable functions with measurable square integrable envelope $F$ such that $0 < QF^2 < \infty$, and for $0 < \varepsilon \le 1$
\[
N(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \le C \Big(\frac{1}{\varepsilon}\Big)^{V}.
\]
Then there exists a constant $K$ depending on $C$ and $V$ only such that
\[
\log N(\varepsilon \|F\|_{Q,2}, \mathrm{conv}(\mathcal{F}), L_2(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{2V/(V+2)+\delta}.
\]
Proof.
The power $2V/(V+2)$ is sharp; moreover, $2V/(V+2) < 2$ for any $V < \infty$. Hence the convex hull $\mathcal{G} = \mathrm{conv}(\mathcal{F})$ of a polynomial class $\mathcal{F}$ satisfies the uniform entropy condition
\[
\int_0^\infty \sup_Q \sqrt{\log N(\varepsilon \|G\|_{Q,2}, \mathcal{G}, L_2(Q))}\, d\varepsilon < \infty,
\]
provided $\|G\|_{Q,2}^2 \equiv \int G^2\, dQ$ is finite for some envelope function $G$ of $\mathcal{G}$. □
This result can be extended to Lr -metrics for 1 < r < ∞.
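Theorem 7.5 Assume that $Q$ is a probability measure on $(\mathcal{X}, \mathcal{A})$, and let $\mathcal{F}$ be a class of measurable functions with measurable envelope $F$ such that $QF^r < \infty$, and for $0 < \varepsilon < 1$ and $r > 1$,
\[
N(\varepsilon \|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le C \Big(\frac{1}{\varepsilon}\Big)^{V}.
\]
Then there exists a constant $K$ depending on $C$, $V$ and $r$ such that
\[
\log N(\varepsilon \|F\|_{Q,r}, \mathrm{conv}(\mathcal{F}), L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{1/\left(\min(1 - 1/r,\ 1/2) + 1/V\right)}.
\]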
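We complete the chapter with an example.
Example 7.6 Consider the class of all distribution functions on $\mathbb{R}^d$. Set $\mathcal{G}_d \equiv \{1_{[t, \infty)} : t \in \mathbb{R}^d\}$. $\mathcal{G}_d$ is VC and $V(\mathcal{G}_d) = d + 1$. The envelope function is 1. Hence an upper bound for the covering numbers is given by equation (7.3) of Theorem 7.2:
\[
N(\varepsilon, \mathcal{G}_d, L_r(Q)) \le K \varepsilon^{-rd},
\]
for $0 < \varepsilon \le 1$. The entropy of $\mathrm{conv}(\mathcal{G}_d)$ satisfies
\[
\log N(\varepsilon, \mathrm{conv}(\mathcal{G}_d), L_r(Q)) \le K \varepsilon^{-\gamma(r,d)},
\]
where
\[
\gamma(r, d) =
\begin{cases}
\dfrac{2rd}{rd + 2}, & r \ge 2, \\[2mm]
\dfrac{rd}{(r-1)d + 1}, & 1 < r \le 2.
\end{cases}
\]
Computing $\gamma$ at the extreme values, we obtain $\gamma(2, d) = 2d/(d + 1)$ and $\gamma(r, d) \nearrow d$ as $r \searrow 1$. In particular $\gamma(2, 1) = 1 = \gamma(1, 1)$, $\gamma(2, 2) = 4/3$, and $\gamma(r, 2) \nearrow 2$ as $r \searrow 1$.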
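The exponent $\gamma(r, d)$ of Example 7.6 is easy to tabulate. The following is a minimal sketch, assuming exact rational arithmetic is convenient; the function name `gamma_exponent` is hypothetical and not part of the notes.

```python
from fractions import Fraction

def gamma_exponent(r, d):
    """Entropy exponent gamma(r, d) of Example 7.6; defined here for r > 1."""
    r = Fraction(r)
    if r <= 1:
        raise ValueError("r must exceed 1")
    if r >= 2:
        # Branch r >= 2: 2rd / (rd + 2)
        return 2 * r * d / (r * d + 2)
    # Branch 1 < r <= 2: rd / ((r - 1)d + 1)
    return r * d / ((r - 1) * d + 1)

# The two branches agree at r = 2, and gamma(r, 2) approaches 2 as r decreases to 1.
values_near_one = [float(gamma_exponent(Fraction(1000 + k, 1000), 2)) for k in (100, 10, 1)]
```

Evaluating `gamma_exponent(2, d)` for a few values of `d` reproduces the formula $2d/(d+1)$ quoted in the example.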
Chapter 8
Bracketing Numbers
We have already seen two ways of controlling bracketing numbers; recall Lemma 5.1 and Lemma 5.2. Our goal here is to describe some of the other available results for larger classes of functions.
Control of bracketing numbers typically comes via results in approximation theory. Bounds are available in the literature for many interesting classes: see for example Kolmogorov and Tikhomirov (1959), Birman and Solomjak (1967), Clements (1963), DeVore and Lorentz (1993), and Birgé and Massart (2000). We give a few examples in this chapter.
Many of the available results are stated in terms of the supremum norm $\|\cdot\|_\infty$; these yield bounds on $L_r(Q)$ bracketing numbers via the following easy lemma (see Exercise 9.1).
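Lemma 8.1 For any class of measurable real-valued functions $\mathcal{F}$ on $(\mathcal{X}, \mathcal{A})$, and any $1 \le r < \infty$,
\[
N(\varepsilon, \mathcal{F}, L_r(Q)) \le N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)),
\]
and
\[
N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le N(\varepsilon/2, \mathcal{F}, \|\cdot\|_\infty)
\]
for every $\varepsilon > 0$.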
Proof. Let $k = N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q))$ and let $\{[l_i, u_i]\}_{i=1}^k$ be the $\varepsilon$-brackets. Then, for any $f \in \mathcal{F}$ there exists $i$ such that $u_i(x) \ge f(x) \ge l_i(x)$ for every $x$. It follows that
\[
\Big\| f - \frac{l_i + u_i}{2} \Big\|_{L_r(Q)}
\le \Big\| \frac{f}{2} - \frac{l_i}{2} \Big\|_{L_r(Q)} + \Big\| \frac{f}{2} - \frac{u_i}{2} \Big\|_{L_r(Q)}
= \frac{1}{2} \|f - l_i\|_{L_r(Q)} + \frac{1}{2} \|f - u_i\|_{L_r(Q)}.
\]
Because $u_i - f \le u_i - l_i$ and $0 \le u_i - f$ we have $\int |u_i - f|^r\, dQ \le \int |u_i - l_i|^r\, dQ$, that is,
\[
\|f - u_i\|_{L_r(Q)} \le \|l_i - u_i\|_{L_r(Q)} < \varepsilon.
\]
Similarly we obtain $\|f - l_i\|_{L_r(Q)} < \varepsilon$, and then
\[
\Big\| f - \frac{l_i + u_i}{2} \Big\|_{L_r(Q)} < \varepsilon.
\]
Finally
\[
\mathcal{F} \subseteq \bigcup_{i=1}^{k} B\Big( \frac{u_i + l_i}{2}, \varepsilon, L_r(Q) \Big).
\]
This proves the first inequality.
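Now we show that $N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le N(\varepsilon/2, \mathcal{F}, \|\cdot\|_\infty) = l$. Let $\mathcal{F} \subseteq \bigcup_{i=1}^{l} B(f_i, \varepsilon/2, \|\cdot\|_\infty)$. Then for any $f \in \mathcal{F}$ there exists $f_i$ such that
\[
\|f - f_i\|_\infty < \frac{\varepsilon}{2}.
\]
Then pointwise we have
\[
f_i - \frac{\varepsilon}{2} \le f \le f_i + \frac{\varepsilon}{2},
\]
and clearly $\big\{ [f_i - \frac{\varepsilon}{2},\ f_i + \frac{\varepsilon}{2}] \big\}_{i=1}^{l}$ forms $l$ $\varepsilon$-brackets that cover $\mathcal{F}$. □
8.1 Smooth Functions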
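First consider the collection of smooth functions on a bounded set $\mathcal{X}$ in $\mathbb{R}^d$ with uniformly bounded derivatives of a given order $\alpha > 0$, defined as follows: let $\underline{\alpha}$ denote the greatest integer smaller than $\alpha$, and for any vector $k = (k_1, \ldots, k_d)$ of $d$ integers, let
\[
D^k = \frac{\partial^{k_\cdot}}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}},
\qquad \text{where } k_\cdot = \sum_{j=1}^{d} k_j.
\]
Then for a function $f : \mathcal{X} \to \mathbb{R}$, define
\[
\|f\|_\alpha = \max_{k_\cdot \le \underline{\alpha}} \sup_x \big| D^k f(x) \big|
+ \max_{k_\cdot = \underline{\alpha}} \sup_{x, y} \frac{\big| D^k f(x) - D^k f(y) \big|}{\|y - x\|^{\alpha - \underline{\alpha}}},
\]
where the suprema are taken over all $x, y$ in the interior of $\mathcal{X}$ with $x \ne y$. Let $C_M^\alpha(\mathcal{X})$ be the set of all continuous functions $f : \mathcal{X} \to \mathbb{R}$ with $\|f\|_\alpha \le M$. The following theorem goes back to Kolmogorov and Tikhomirov (1959).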
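Theorem 8.1 Suppose that $\mathcal{X}$ is a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior. Then there exists a constant $K$ depending only on $\alpha$ and $d$ such that
\[
\log N(\varepsilon, C_1^\alpha(\mathcal{X}), \|\cdot\|_\infty) \le K \lambda(\mathcal{X}^1) \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha} \qquad (8.1)
\]
for every $\varepsilon > 0$. (Here $\lambda(\mathcal{X}^1)$ is the Lebesgue measure of the set $\mathcal{X}^1 = \{x : \|x - \mathcal{X}\| < 1\}$.)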
By application of Lemma 8.1, this yields the following corollary.
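Corollary 8.1 Let $\mathcal{X}$ be a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior. Then there exists a constant $K$ depending only on $\alpha$, $\lambda(\mathcal{X}^1)$, and $d$ such that
\[
\log N_{[\,]}(\varepsilon, C_1^\alpha(\mathcal{X}), L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha}
\]
for every $r \ge 1$, $\varepsilon > 0$, and probability measure $Q$ on $\mathbb{R}^d$.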
Example 8.1 Let $\mathcal{F}_\alpha = C_1^\alpha[0, 1]$ for $0 < \alpha \le 1$, the class of all Lipschitz functions of degree $\alpha \le 1$ on the unit interval $[0, 1]$. Then $\log N(\varepsilon, C_1^\alpha[0, 1], L_2(Q)) \le K (1/\varepsilon)^{1/\alpha}$ for all $\varepsilon > 0$, and hence $\mathcal{F}_\alpha$ is universal Donsker for $\alpha > 1/2$. Similarly, for $\mathcal{F}_{d,\alpha} = C_1^\alpha[0, 1]^d$, we conclude that $\mathcal{F}_{d,\alpha}$ is universal Donsker for $\alpha > d/2$. [It follows from a result of Strassen and Dudley (1969) that this is sharp in a sense: if $\alpha = d/2$, then the class $\mathcal{F}_{d,\alpha}$ is not even pre-Gaussian for $Q = \lambda$ on $[0, 1]^d$.]
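The threshold $\alpha > d/2$ in Example 8.1 reflects when the entropy integral $\int_0^1 \sqrt{(1/\varepsilon)^{d/\alpha}}\, d\varepsilon$ is finite. The sketch below evaluates that integral in closed form; the constant $K$ is taken to be 1 for illustration, and the function name `entropy_integral` is an assumption of this sketch rather than notation from the notes.

```python
import math

def entropy_integral(d, alpha):
    """Integral over (0, 1] of eps**(-d / (2 * alpha)); math.inf when it diverges.

    This is the entropy integral for the bound (1/eps)**(d/alpha) with K = 1.
    """
    exponent = d / (2.0 * alpha)
    if exponent >= 1.0:
        # eps**(-exponent) is not integrable at 0 once exponent >= 1
        return math.inf
    return 1.0 / (1.0 - exponent)
```

For `d = 1` the integral is finite exactly when `alpha > 0.5`, matching the universal Donsker range stated in the example.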
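If we replace the uniform bounds in the definition of the norm used to define the classes $C_M^\alpha(\mathcal{X})$ by bounds on $L_p$-norms of derivatives, then the resulting classes of functions are the Sobolev classes $W_p^\alpha(\mathcal{X})$ defined as follows. For $\alpha \in \mathbb{N}$ and $p \ge 1$, define
\[
\|f\|_{p,\alpha} = \|f\|_{L_p} + \Big\{ \sum_{k_\cdot = \alpha} \|D^k f\|_{L_p}^p \Big\}^{1/p},
\]
where $L_p = L_p(\mathcal{X}, \mathcal{B}, \lambda)$. If $\alpha$ is not an integer, define
\[
\|f\|_{p,\alpha} = \|f\|_{L_p} + \Big\{ \sum_{k_\cdot = \underline{\alpha}} \int_{\mathcal{X}} \int_{\mathcal{X}} \frac{\big| D^k f(x) - D^k f(y) \big|^p}{\|x - y\|^{p(\alpha - \underline{\alpha}) + d}}\, dx\, dy \Big\}^{1/p}.
\]
The Sobolev space $W_p^\alpha(\mathcal{X})$ is the set of real valued functions on $\mathcal{X}$ with $\|f\|_{p,\alpha} < \infty$. Let $D_M^{\alpha,p}(\mathcal{X}) = \{f \in W_p^\alpha(\mathcal{X}) : \|f\|_{p,\alpha} \le M\}$. Birman and Solomjak (1967) proved the following entropy bound.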
Theorem 8.2 (Birman and Solomjak) Suppose that $\mathcal{X}$ is a bounded, convex subset of $\mathbb{R}^d$ with nonempty interior. Then there exists a constant $K$ depending only on $r$ and $d$ such that
\[
\log N(\varepsilon, D_1^{\alpha,p}([0, 1]^d), \|\cdot\|_{L_q}) \le K \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha} \qquad (8.2)
\]
for every $\varepsilon > 0$ and $1 \le q \le \infty$ when $p > d/\alpha$, and for $1 \le q < q^* := p(1 - p\alpha/d)^{-1}$ when $p \le d/\alpha$.
Theorem 8.2 has recently been extended to balls in the Besov space $B_{p,\infty}^\alpha([0, 1]^d)$ by Birgé and Massart (2000). Here is the definition of these spaces in the case $d = 1$ following DeVore and Lorentz (1993). Suppose that $[a, b]$ is a compact interval in $\mathbb{R}$. For an integer $r$ define the $r$-th order differences of a function $f : [a, b] \to \mathbb{R}$ by
\[
\Delta_h^r(f, x) = \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k} f(x + kh),
\]
where $x, x + kh \in [a, b]$. The $L_p$-modulus of smoothness $\omega_r(f, y, [a, b])_p$ is then defined by
\[
[\omega_r(f, y, [a, b])_p]^p = \sup_{0 < h \le y} \int_a^{b - rh} |\Delta_h^r(f, x)|^p\, dx \qquad \text{for } y > 0.
\]
For given $\alpha > 0$ and $p > 0$, define $\|f\|_{B_p^\alpha}$ by
\[
\|f\|_{B_p^\alpha} = \sup_{y > 0} y^{-\alpha} \omega_r(f, y, [a, b])_p.
\]
The Besov space $B_{p,\infty}^\alpha([a, b])$ is the collection of all functions $f \in L_p([a, b])$ with $\|f\|_{B_p^\alpha} < \infty$.
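The $r$-th order difference used in the definition of the modulus of smoothness can be computed directly from its binomial formula. The sketch below is illustrative only; the function name `rth_difference` is hypothetical.

```python
from math import comb

def rth_difference(f, x, h, r):
    """r-th order forward difference: sum_k C(r, k) * (-1)**(r - k) * f(x + k*h)."""
    return sum(comb(r, k) * (-1) ** (r - k) * f(x + k * h) for k in range(r + 1))

def square(x):
    return x * x

# For a quadratic, the second difference equals 2*h**2 and the third difference vanishes.
second = rth_difference(square, 1.0, 0.5, 2)
third = rth_difference(square, 0.0, 1.0, 3)
```

Differences of order $r$ annihilate polynomials of degree below $r$, which is why the modulus $\omega_r$ measures smoothness beyond polynomial behaviour.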
This generalizes to functions on bounded subsets of $\mathbb{R}^d$ as follows:
Theorem 8.3 (Birgé and Massart) Suppose that $p > 0$ and $1 \le q \le \infty$. Let
\[
V_M(B_{p,\infty}^\alpha([0, 1]^d)) = \{f \in B_{p,\infty}^\alpha([0, 1]^d) : \|f\|_{B_p^\alpha} \le M\}.
\]
Then, for a constant $K$ depending on $d$, $\alpha$, $p$, and $q$,
\[
\log N(\varepsilon, V_M(B_{p,\infty}^\alpha([0, 1]^d)), L_q) \le K \Big(\frac{M}{\varepsilon}\Big)^{d/\alpha}
\]
provided that $\alpha > (d/p - d/q)_+$.
The results stated so far in this section apply to functions $f$ defined on a bounded subset $\mathcal{X}$ of Euclidean space. By adding hypotheses in the form of moment conditions on the underlying probability measure, the entropy bounds can be generalized to classes of functions on $\mathbb{R}^d$. Here is an extension of this type for the Hölder classes treated for bounded domains in Theorem 8.1.
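Corollary 8.2 (van der Vaart) Suppose that $\mathbb{R}^d = \bigcup_{j=1}^\infty I_j$ is a partition of $\mathbb{R}^d$ into bounded, convex sets $I_j$ with nonempty interior, and let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^d \to \mathbb{R}$ such that the restrictions $\mathcal{F}_{|I_j}$ are in $C_{M_j}^\alpha(I_j)$ for every $j$. Then there is a constant $K$ depending only on $\alpha$, $V$, $r$ and $d$ such that
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{V}
\Big( \sum_{j=1}^{\infty} \lambda(I_j^1)^{\frac{r}{V+r}} M_j^{\frac{Vr}{V+r}} Q(I_j)^{\frac{V}{V+r}} \Big)^{\frac{V+r}{r}},
\]
for every $\varepsilon > 0$, $V \ge d/\alpha$, and probability measure $Q$.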
Proof. See van der Vaart and Wellner (1996), page 158, and van der Vaart (1994). □
8.2 Monotone Functions
As we have seen in Chapter 6, the class $\mathcal{F}$ of bounded monotone functions on $\mathbb{R}$ has $L_2(Q)$ uniform entropy bounded by a constant times $1/\varepsilon$ via the convex hull Theorem 7.4. It follows that $\mathcal{F}$ is Donsker for every probability measure $P$ on $\mathbb{R}$. Another way to prove this is via bracketing. The following theorem was proved by van de Geer (1991) by use of the methods of Birman and Solomjak (1967).
Theorem 8.4 Let $\mathcal{F}$ be the class of all monotone functions $f : \mathbb{R} \to [0, 1]$. Then
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \le \frac{K}{\varepsilon}
\]
for every probability measure $Q$, every $r \ge 1$, and a constant $K$ depending on $r$ only.
Proof. See van der Vaart and Wellner (1996), pages 159-162 for a complete proof. □
The bracketing entropy bound is very useful in applications because of the relative ease of bounding suprema of empirical processes in terms of bracketing integrals, as developed in Chapter 7.
8.3 Convex Functions and Convex Sets
To deal with convex sets in a metric space $(\mathbb{D}, d)$, we first introduce a natural metric, the Hausdorff metric: for $C, D \subset \mathbb{D}$, let
\[
h(C, D) = \sup_{x \in C} d(x, D) \vee \sup_{x \in D} d(x, C).
\]
When restricted to closed subsets, this yields a metric (which can be infinite). The following result of Bronštein (1976) gives the entropy of the collection of all compact, convex subsets of a fixed, bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with respect to the Hausdorff metric.
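For finite point sets the Hausdorff distance can be computed by brute force straight from the definition. The following sketch assumes the Euclidean metric on tuples of coordinates; the helper names `dist_to_set` and `hausdorff` are hypothetical.

```python
import math

def dist_to_set(x, A):
    """d(x, A) = min over a in A of the Euclidean distance from x to a (A finite, nonempty)."""
    return min(math.dist(x, a) for a in A)

def hausdorff(A, B):
    """h(A, B) = max( sup_{x in A} d(x, B), sup_{y in B} d(y, A) ) for finite sets."""
    return max(max(dist_to_set(a, B) for a in A),
               max(dist_to_set(b, A) for b in B))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0)]
```

Both one-sided suprema are needed: with the sets above, every point of `B` lies in `A`, yet `A` contains a point at distance 1 from `B`.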
Lemma 8.2 Suppose that $\mathcal{C}_d$ is the class of all compact, convex subsets of a fixed bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with $d \ge 2$. Then there are constants $0 < K_1 < K_2 < \infty$ such that
\[
K_1 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2} \le \log N(\varepsilon, \mathcal{C}_d, h) \le K_2 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2}.
\]
Proof. See Bronštein (1976) or Dudley (1999), pages 269-281. □
There is an immediate corollary of Lemma 8.2 for Lr (Q) bracketing numbers when Q
is absolutely continuous with respect to Lebesgue measure on X with bounded density.
Corollary 8.3 Let $\mathcal{C}_d$ be the class of all compact, convex subsets of a fixed bounded subset $\mathcal{X}$ of $\mathbb{R}^d$ with $d \ge 2$, and suppose that $Q$ is a probability distribution on $\mathcal{X}$ with bounded density $q$. Then
\[
\log N_{[\,]}(\varepsilon, \mathcal{C}_d, L_r(Q)) \le K \Big(\frac{1}{\varepsilon}\Big)^{(d-1)r/2},
\]
for every $\varepsilon > 0$ and a constant $K$ depending only on $\mathcal{X}$, $\|q\|_\infty$, and $d$.
Proof. See van der Vaart and Wellner (1996), page 163. □
Note that for $r = 2$ the exponent in the bound in Corollary 8.3 is $d - 1$, which is $< 2$ for $d = 2$ (and hence $\mathcal{C}_2$ is $P$-Donsker for measures $P$ with bounded Lebesgue density), but is $\ge 2$ when $d \ge 3$. Bolthausen (1978) showed that $\mathcal{C}_2$ is Donsker. Dudley (1984), (1999) studied the boundary case $d = 3$ and showed that when $P$ is Lebesgue measure $\lambda = \lambda_d$ on $[0, 1]^d$, for each $\delta > 0$ there is an $M = M(\delta) < \infty$ such that
\[
P\Big( \|\mathbb{G}_n\|_{\mathcal{C}_3} > M (\log n)^{1/2} (\log \log n)^{-\delta - 1/2} \Big) \to 1 \qquad \text{as } n \to \infty;
\]
it follows in particular that $\mathcal{C}_3$ is not $\lambda_d$-Donsker.
Now consider convex functions $f : \mathcal{X} \to \mathbb{R}$ where $\mathcal{X}$ is a compact, convex subset of $\mathbb{R}^d$. If we also require that the functions be uniformly Lipschitz, then an entropy bound with respect to the uniform metric can be derived from the preceding result.
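Corollary 8.4 Suppose that $\mathcal{F}$ is the class of all convex functions $f : \mathcal{X} \to [0, 1]$ defined on a compact, convex subset $\mathcal{X}$ of $\mathbb{R}^d$ satisfying $|f(x) - f(y)| \le L \|x - y\|$ for every $x, y \in \mathcal{X}$. Then
\[
\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le K (1 + L)^{d/2} \Big(\frac{1}{\varepsilon}\Big)^{d/2}
\]
for all $\varepsilon > 0$, for a constant $K$ that depends on $d$ and the set $\mathcal{X}$ only.
Proof. See van der Vaart and Wellner (1996), page 164. □
8.4 Lower layers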
A set $C \subset \mathbb{R}^d$ is called a lower layer if and only if $x \in C$ and $y \le x$ implies $y \in C$. Here $y \le x$ means that $y_j \le x_j$ for $j = 1, \ldots, d$ where $y = (y_1, \ldots, y_d)$ and $x = (x_1, \ldots, x_d)$. Let $\mathcal{LL}_d$ denote the collection of all lower layers in $\mathbb{R}^d$ with nonempty complement, and let
\[
\mathcal{LL}_{d,1} = \{L \cap [0, 1]^d : L \in \mathcal{LL}_d,\ L \cap [0, 1]^d \ne \emptyset\}.
\]
Lower layers arise naturally in problems involving functions $f : \mathbb{R}^d \to \mathbb{R}$ that are monotone in the sense of being increasing (nondecreasing) in each of their arguments. For such a function the level sets appear as the boundaries of sets which are lower layers: for $t \in \mathbb{R}$
\[
\{x \in \mathbb{R}^d : f(x) \le t\} = C
\]
is a lower layer (if $t$ is in the interior of the range of $f$?). Recall that for a metric space $(\mathbb{D}, d)$, $x \in \mathbb{D}$, and a set $A \subset \mathbb{D}$,
\[
d(x, A) = \inf\{d(x, y) : y \in A\}.
\]
Further, the Hausdorff pseudometric $h$ for sets $A, B \subset \mathbb{D}$ is given by
\[
h(A, B) = \max\Big\{ \sup_{x \in A} d(x, B),\ \sup_{y \in B} d(y, A) \Big\}.
\]
It is not hard to show that $h$ is a metric on the class of closed, bounded, nonempty subsets of $\mathbb{D}$.
The following Theorem concerning the behavior of the covering numbers and bracketing numbers for lower layers is from Dudley (1999), page 266.
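Theorem 8.5 For $d \ge 2$, as $\varepsilon \downarrow 0$ the following assertions hold:
\[
\log N(\varepsilon, \mathcal{LL}_{d,1}, h) \asymp \log N(\varepsilon, \mathcal{LL}_{d,1}, L_1(\lambda)) \asymp \varepsilon^{1-d},
\]
and
\[
\log N_{[\,]}(\varepsilon, \mathcal{LL}_{d,1}, L_1(\lambda)) \asymp \varepsilon^{1-d}.
\]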
For other results on lower layers and related statistical problems involving monotone
functions, see Wright (1981) and Hanson, Pledger, and Wright (1973).
8.5 Exercises
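Exercise 8.1 Suppose that $\mathcal{F}$ is the class of all differentiable functions $f$ from $[0, 1]$ to $[0, 1]$ with $\|f'\|_\infty \le 1$. Show that for some constant $K$
\[
\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \frac{K}{\varepsilon} \qquad \text{for all } \varepsilon > 0.
\]
(Hint: Consider approximations of the form
\[
\tilde{f}(x) = \sum_{j=1}^{M+1} \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor 1_{(a_{j-1}, a_j]}(x) + \varepsilon \Big\lfloor \frac{f(0)}{\varepsilon} \Big\rfloor 1_{\{0\}}(x)
\]
with $a_0 = 0$, $a_1 = \varepsilon$, ..., $a_M = M\varepsilon$, $a_{M+1} = 1$, and $M = \lfloor 1/\varepsilon \rfloor$, so that $M + 1 \le 2/\varepsilon$.)
Solution. Suppose that $x \in (a_{j-1}, a_j]$. Then
\[
\big| \tilde{f}(x) - f(x) \big|
= \Big| \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - f(x) \Big|
\le \Big| \varepsilon \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - f(a_j) \Big| + |f(a_j) - f(x)|
= \varepsilon \Big| \Big\lfloor \frac{f(a_j)}{\varepsilon} \Big\rfloor - \frac{f(a_j)}{\varepsilon} \Big| + |a_j - x|\, |f'(\tilde{x})|
\le 2\varepsilon, \qquad \tilde{x} \in [x, a_j],
\]
by hypothesis and construction of the $a_j$'s. Note that also $|\tilde{f}(0) - f(0)| \le \varepsilon \le 2\varepsilon$, and hence $\|f - \tilde{f}\|_\infty \le 2\varepsilon$.
The number of cells of the partition $\{0\}$, $(a_{j-1}, a_j]$, $j = 1, \ldots, M+1$, is $M + 2$, and $M + 2 \le 2/\varepsilon$ for sufficiently small $\varepsilon$. The total number of choices for $\tilde{f}(0)$ is clearly $1/\varepsilon$ (because $0 \le f(0) \le 1$). But
\[
\big| \tilde{f}(a_k) - \tilde{f}(a_{k-1}) \big| \le \big| \tilde{f}(a_k) - f(a_k) \big| + \big| f(a_k) - f(a_{k-1}) \big| + \big| \tilde{f}(a_{k-1}) - f(a_{k-1}) \big| \le 3\varepsilon.
\]
Then
\[
\tilde{f}(a_{k-1}) - 3\varepsilon \le \tilde{f}(a_k) \le \tilde{f}(a_{k-1}) + 3\varepsilon,
\]
\[
\varepsilon \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor - 3\varepsilon \le \tilde{f}(a_k) \le \varepsilon \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor + 3\varepsilon,
\]
\[
\Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor - 3 \le \frac{\tilde{f}(a_k)}{\varepsilon} \le \Big\lfloor \frac{f(a_{k-1})}{\varepsilon} \Big\rfloor + 3.
\]
So, once $\tilde{f}(0)$ is fixed, there are at most 7 choices for $\tilde{f}(a_1)$ and then at most 7 for $\tilde{f}(a_2)$, and so on. Hence the total number of possible functions is
\[
\le \frac{1}{\varepsilon}\, 7^{M+1} \le \frac{1}{\varepsilon}\, 7^{2/\varepsilon},
\]
and finally
\[
N(2\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \frac{1}{\varepsilon}\, (49)^{1/\varepsilon},
\]
leading to the conclusion of the proposition. □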
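The step-function approximation from the hint of Exercise 8.1 can be checked numerically. The sketch below builds $\tilde{f}$ for one sample function; the choice $f(x) = 0.5 + 0.5 \sin x$ (which maps $[0,1]$ into $[0,1]$ and has $|f'| \le 1$) and the names `step_approximation` and `sample_f` are assumptions made for illustration.

```python
import math

def step_approximation(f, eps):
    """Return the piecewise-constant approximation f_tilde from the hint of Exercise 8.1."""
    M = math.floor(1.0 / eps)
    # Knots a_0 = 0, a_1 = eps, ..., a_M = M * eps, a_{M+1} = 1
    knots = [j * eps for j in range(M + 1)] + [1.0]

    def f_tilde(x):
        if x == 0.0:
            return eps * math.floor(f(0.0) / eps)
        for j in range(1, M + 2):
            if knots[j - 1] < x <= knots[j]:
                # On (a_{j-1}, a_j] use f(a_j) rounded down to the eps-grid
                return eps * math.floor(f(knots[j]) / eps)
        raise ValueError("x must lie in [0, 1]")

    return f_tilde

def sample_f(x):
    # Hypothetical member of the class: values in [0, 1], derivative bounded by 1
    return 0.5 + 0.5 * math.sin(x)

eps = 0.1
approx = step_approximation(sample_f, eps)
grid = [i / 1000 for i in range(1001)]
sup_error = max(abs(approx(x) - sample_f(x)) for x in grid)
```

On the evaluation grid the observed error stays below $2\varepsilon$, in line with the bound $\|f - \tilde{f}\|_\infty \le 2\varepsilon$ derived in the solution.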
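Exercise 8.2 Suppose that $\mathcal{F} = \big\{ f : [0, 1] \to [0, 1] \ \big|\ \int_0^1 (f'(x))^2\, dx \le 1 \big\}$. Show that for $\lambda$ = Lebesgue measure on $[0, 1]$ there is a constant $K$ so that
\[
\log N(\varepsilon, \mathcal{F}, L_2(\lambda)) \le \frac{K}{\varepsilon} \log\Big(\frac{K}{\varepsilon}\Big) \qquad \text{for all } \varepsilon > 0.
\]
Solution. As in the previous Exercise 8.1, set $a_0 = 0$, $a_1 = \varepsilon$, ..., $a_M = M\varepsilon$, $a_{M+1} = 1$ with $M = \lfloor 1/\varepsilon \rfloor$, so that $M + 1 \le 2/\varepsilon$. For $g \in \mathcal{F}$, define
\[
\bar{g}_k = \frac{\int_{a_{k-1}}^{a_k} g(x)\, dx}{a_k - a_{k-1}}
\]
and let
\[
\tilde{g} = \sum_{k=1}^{M+1} \bar{g}_k 1_{(a_{k-1}, a_k]}.
\]
For $x \in (a_{k-1}, a_k]$, we have
\[
|g(x) - \bar{g}_k|^2 \le \varepsilon^2\, \frac{\int_{a_{k-1}}^{a_k} |g'(u)|^2\, du}{a_k - a_{k-1}}.
\]
Note then that
\[
\int_0^1 |g(x) - \tilde{g}(x)|^2\, dx = \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} |g(x) - \bar{g}_k|^2\, dx
\le \varepsilon^2 \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du \le \varepsilon^2.
\]
Next define
\[
g^* = \sum_{k=1}^{M+1} \varepsilon \Big\lfloor \frac{\bar{g}_k}{\varepsilon} \Big\rfloor 1_{(a_{k-1}, a_k]}.
\]
Now, the total number of all such functions is dominated by $(K_1/\varepsilon)^{2/\varepsilon}$, since the total number of sets of the form $(a_{k-1}, a_k]$ is $\le 2/\varepsilon$, and $\lfloor \bar{g}_k/\varepsilon \rfloor \le 2/\varepsilon$ since $0 \le \bar{g}_k \le 1$. We will show that
\[
\|g^* - \tilde{g}\|_{L_2(\lambda)} \le \varepsilon.
\]
This together with the fact that $\|\tilde{g} - g\|_{L_2(\lambda)} \le \varepsilon$ gives $\|g^* - g\|_{L_2(\lambda)} \le 2\varepsilon$. Thus
\[
N(\varepsilon, \mathcal{F}, L_2(\lambda)) \le \Big(\frac{K}{\varepsilon}\Big)^{K/\varepsilon} \qquad \text{for } K > \max\{K_1, 2\}.
\]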
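This then establishes the desired result. It remains to show the intermediate technical steps. Consider that
\[
\|g^* - \tilde{g}\|_{L_2(\lambda)}^2
= \int_0^1 \Big[ \sum_{k=1}^{M+1} \bar{g}_k 1_{(a_{k-1}, a_k]} - \sum_{k=1}^{M+1} \varepsilon \Big\lfloor \frac{\bar{g}_k}{\varepsilon} \Big\rfloor 1_{(a_{k-1}, a_k]} \Big]^2 d\lambda
= \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} \Big[ \bar{g}_k - \varepsilon \Big\lfloor \frac{\bar{g}_k}{\varepsilon} \Big\rfloor \Big]^2 d\lambda
\]
\[
= \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} \varepsilon^2 \Big[ \frac{\bar{g}_k}{\varepsilon} - \Big\lfloor \frac{\bar{g}_k}{\varepsilon} \Big\rfloor \Big]^2 d\lambda
\le \sum_{k=1}^{M+1} \int_{a_{k-1}}^{a_k} \varepsilon^2\, d\lambda
= \sum_{k=1}^{M+1} \varepsilon^2 (a_k - a_{k-1}) = \varepsilon^2,
\]
so that $\|g^* - \tilde{g}\|_{L_2(\lambda)} \le \varepsilon$.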
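Finally, to show that
\[
|g(x) - \bar{g}_k|^2 \le \varepsilon^2\, \frac{\int_{a_{k-1}}^{a_k} |g'(u)|^2\, du}{a_k - a_{k-1}}, \qquad x \in (a_{k-1}, a_k],
\]
consider
\[
|g(x) - \bar{g}_k|^2 = \Big| g(x) - \frac{\int_{a_{k-1}}^{a_k} g(y)\, dy}{a_k - a_{k-1}} \Big|^2
= \Big| \int_{a_{k-1}}^{a_k} \frac{g(y) - g(x)}{a_k - a_{k-1}}\, dy \Big|^2
\]
\[
= \frac{1}{(a_k - a_{k-1})^2} \Big| - \int_{a_{k-1}}^{x} \Big( \int_y^x g'(u)\, du \Big) dy + \int_x^{a_k} \Big( \int_x^y g'(u)\, du \Big) dy \Big|^2
\]
\[
= \frac{1}{(a_k - a_{k-1})^2} \Big| - \int_{a_{k-1}}^{x} \Big( \int_{a_{k-1}}^{u} dy \Big) g'(u)\, du + \int_x^{a_k} \Big( \int_u^{a_k} dy \Big) g'(u)\, du \Big|^2
\]
\[
= \frac{1}{(a_k - a_{k-1})^2} \Big| - \int_{a_{k-1}}^{x} (u - a_{k-1}) g'(u)\, du + \int_x^{a_k} (a_k - u) g'(u)\, du \Big|^2
\]
\[
= \Big| \int_{a_{k-1}}^{a_k} \frac{\xi(u) g'(u)}{a_k - a_{k-1}}\, du \Big|^2
\qquad \text{with } \xi(u) = (a_{k-1} - u) 1\{u \le x\} + (a_k - u) 1\{x < u\}
\]
\[
\le \frac{1}{a_k - a_{k-1}} \int_{a_{k-1}}^{a_k} \xi^2(u)\, g'(u)^2\, du
\le \frac{\varepsilon^2}{a_k - a_{k-1}} \int_{a_{k-1}}^{a_k} |g'(u)|^2\, du,
\]
where the first inequality is the Cauchy-Schwarz inequality and the last inequality follows because $\xi^2(u) \le \varepsilon^2$. □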
Chapter 9
Multiplier Inequalities and CLT
9.1 The unconditional multiplier CLT
If we write $Z_i = \delta_{X_i} - P$, then the empirical process can be rewritten as $\mathbb{G}_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i$. The Donsker theorem says that
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i \Rightarrow \mathbb{G} \qquad \text{in } \ell^\infty(\mathcal{F}),
\]
where $\mathbb{G}$ is a tight Brownian bridge process.
Now suppose that $\xi_1, \ldots, \xi_n$ are i.i.d. real random variables which are also independent of $Z_1, \ldots, Z_n$ and consider the process $\frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i Z_i$. If the $\xi_i$ have mean zero and satisfy a certain moment condition, the hypothesis that $\mathcal{F}$ is a Donsker class is necessary and sufficient for the multiplier CLT:
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i Z_i \Rightarrow \sigma \mathbb{G} \qquad \text{in } \ell^\infty(\mathcal{F}),
\]
where $\sigma^2 = \mathrm{Var}(\xi_1)$.
For a random variable $\xi$, set
\[
\|\xi\|_{2,1} = \int_0^\infty \sqrt{P(|\xi| > t)}\, dt.
\]
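It is easily seen that $\|\xi\|_{2,1} < \infty$ implies that $\|\xi\|_2 < \infty$. In fact
\[
E[|\xi|^2] = \int_0^\infty P(|\xi|^2 > t)\, dt
= \int_0^\infty P(|\xi| > t)\, 2t\, dt \qquad \text{(by a change of variable)}
\]
\[
= \int_0^\infty 2t \sqrt{P(|\xi| > t)} \sqrt{P(|\xi| > t)}\, dt
\le 2 \int_0^\infty t \sqrt{\frac{1}{t^2} E[|\xi|^2]}\, \sqrt{P(|\xi| > t)}\, dt \qquad \text{(by Markov's inequality)}
\]
\[
= 2 \sqrt{E[|\xi|^2]} \int_0^\infty \sqrt{P(|\xi| > t)}\, dt,
\]
and then
\[
\sqrt{E[|\xi|^2]} \le 2 \int_0^\infty \sqrt{P(|\xi| > t)}\, dt,
\qquad \text{that is,} \qquad \|\xi\|_2 \le 2 \|\xi\|_{2,1}.
\]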
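The norm $\|\xi\|_{2,1}$ can be approximated numerically from the tail function $t \mapsto P(|\xi| > t)$. The sketch below uses a midpoint rule on a truncated range; the function name `norm_21`, the truncation points, and the two example distributions (standard Laplace and Uniform$(-1,1)$) are choices made for this illustration.

```python
import math

def norm_21(tail, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of sqrt(tail(t)) over [0, upper]."""
    h = upper / steps
    return h * sum(math.sqrt(tail((i + 0.5) * h)) for i in range(steps))

# Standard Laplace: P(|xi| > t) = exp(-t), so ||xi||_{2,1} = 2 in closed form and ||xi||_2 = sqrt(2).
laplace_21 = norm_21(lambda t: math.exp(-t), upper=60.0)

# Uniform(-1, 1): P(|xi| > t) = 1 - t on [0, 1], so ||xi||_{2,1} = 2/3 and ||xi||_2 = 1/sqrt(3).
uniform_21 = norm_21(lambda t: max(0.0, 1.0 - t), upper=1.0)
```

In both cases the $L_2$ norm is well below $2\|\xi\|_{2,1}$, consistent with the inequality $\|\xi\|_2 \le 2\|\xi\|_{2,1}$.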
The following multiplier inequalities give upper and lower bounds for the expected supremum norm of the multiplier process in terms of a version symmetrized by Rademacher random variables.
Lemma 9.1 Suppose that $Z_1, \ldots, Z_n$ are i.i.d. stochastic processes with $E^* \|Z_i\|_{\mathcal{F}} < \infty$ independent of the Rademacher variables $\varepsilon_1, \ldots, \varepsilon_n$. Suppose that $\xi_1, \ldots, \xi_n$ are i.i.d. mean zero random variables independent of $Z_1, \ldots, Z_n$ satisfying $\|\xi\|_{2,1} < \infty$. Then, for any $1 \le n_0 \le n$,
\[
\frac{1}{2} \|\xi\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}}
\le E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}}
\]
\[
\le 2 (n_0 - 1)\, E^* \|Z_1\|_{\mathcal{F}}\, E \max_{1 \le i \le n} \frac{|\xi_i|}{\sqrt{n}}
+ 4 \|\xi\|_{2,1} \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
If the $\xi_i$'s are symmetric about zero, then the constants 1/2, 2 and 4 can be replaced by 1.
Proof. Define $\varepsilon_1, \ldots, \varepsilon_n$ independent of $\xi_1, \ldots, \xi_n$ on their own factor of a product probability space. Suppose that the $\xi_i$'s are symmetric. Then the random variables $\varepsilon_i |\xi_i|$ have the same distribution as the $\xi_i$'s, and the inequality on the left follows from
\[
E^* \Big\| \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}}
= E^* \Big\| \sum_{i=1}^{n} \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}}
= E^* E_\xi \Big\| \sum_{i=1}^{n} \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}} \qquad \text{(by property of conditional expectation)}
\]
\[
\ge E^* \Big\| E_\xi \Big[ \sum_{i=1}^{n} \varepsilon_i |\xi_i| Z_i \Big] \Big\|_{\mathcal{F}} \qquad \text{(by Jensen and convexity of the sup norm)}
\]
\[
= E^* \Big\| \sum_{i=1}^{n} \varepsilon_i Z_i\, E_\xi |\xi_i| \Big\|_{\mathcal{F}} \qquad \text{(by independence)}
\]
\[
= \|\xi\|_1\, E^* \Big\| \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
For the general case, let $\eta_1, \ldots, \eta_n$ be an independent copy of $\xi_1, \ldots, \xi_n$. Then $\|\xi_i\|_1 \le \|\xi_i - \eta_i\|_1$ because
\[
E |\xi_i - \eta_i| = E\, E\big[ |\xi_i - \eta_i| \,\big|\, \xi_i \big] \ge E \big| E[\xi_i - \eta_i \,|\, \xi_i] \big| = E |\xi_i - E\eta_i|.
\]
Therefore $\|\xi_i\|_1$ can be replaced by $\|\xi_i - \eta_i\|_1$ on the left side. The variable $\xi_i - \eta_i$ is symmetric, so we may apply the inequality proved above to it, obtaining
\[
\|\xi\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}}
\le \|\xi - \eta\|_1\, E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}}
\le E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i |\xi_i - \eta_i| Z_i \Big\|_{\mathcal{F}}
\]
\[
= E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}}
\le 2 E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}}.
\]
In the last step we have used the triangle inequality and the fact that the $\xi_i$'s and the $\eta_i$'s are identically distributed. Thus the inequality on the left has been proved.
To prove the inequality on the right side, start again with the case of symmetric $\xi_i$'s. Let $\tilde{\xi}_1 \ge \ldots \ge \tilde{\xi}_n$ be the reversed order statistics of the random variables $|\xi_1|, \ldots, |\xi_n|$. By the definition of $Z_1, \ldots, Z_n$ as fixed functions of the coordinates on the product space $(\mathcal{X}^n, \mathcal{B}^n)$, it follows that for any fixed $\xi_1, \ldots, \xi_n$,
\[
E_\varepsilon E_Z^* \Big\| \sum_{i=1}^{n} \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}}
= E_\varepsilon E_Z^* \Big\| \sum_{i=1}^{n} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}.
\]
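By Fubini's theorem for outer measures (Lemma 1.2.7 of van der Vaart and Wellner (1996)), the joint outer expectation $E^*$ can be replaced by $E_{\xi,\varepsilon} E_Z^*$. Thus it follows by the triangle inequality that, for any $n_0 \le n$,
\[
E^* \Big\| \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}}
= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n} \varepsilon_i |\xi_i| Z_i \Big\|_{\mathcal{F}}
= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
\]
\[
\le E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
+ E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=n_0}^{n} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
\]
\[
= E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
+ E^* \Big\| \sum_{i=n_0}^{n} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}} \qquad \text{(by Fubini)}.
\]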
For the first term in the last display, the triangle inequality together with $\tilde{\xi}_i \le \tilde{\xi}_1 = \max_{1 \le i \le n} |\xi_i|$ gives
\[
E_{\xi,\varepsilon} E_Z^* \Big\| \sum_{i=1}^{n_0 - 1} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
\le E_\xi E_Z^* \Big[ \tilde{\xi}_1 \sum_{i=1}^{n_0 - 1} \|Z_i\|_{\mathcal{F}} \Big]
\le E \tilde{\xi}_1\, (n_0 - 1)\, E^* \|Z_1\|_{\mathcal{F}}.
\]
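Now write $\tilde{\xi}_i = \sum_{k=i}^{n} (\tilde{\xi}_k - \tilde{\xi}_{k+1})$ in the second term (with $\tilde{\xi}_{n+1} = 0$) and change the order of summation to find that the second term equals
\[
E^* \Big\| \sum_{i=n_0}^{n} \varepsilon_i \tilde{\xi}_i Z_i \Big\|_{\mathcal{F}}
= E^* \Big\| \sum_{k=n_0}^{n} (\tilde{\xi}_k - \tilde{\xi}_{k+1}) \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}
\le E \Big\{ \sum_{k=n_0}^{n} \sqrt{k}\, (\tilde{\xi}_k - \tilde{\xi}_{k+1}) \Big\} \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
Since $k = \#\{i \le n : |\xi_i| \ge t\}$ on $\tilde{\xi}_{k+1} < t \le \tilde{\xi}_k$, the first expectation in the last display can be written as
\[
E \sum_{k=n_0}^{n} \int_{\tilde{\xi}_{k+1}}^{\tilde{\xi}_k} \sqrt{k}\, dt
\le E \int_0^\infty \sqrt{\#\{i \le n : |\xi_i| \ge t\}}\, dt
\le \int_0^\infty \sqrt{E\, \#\{i \le n : |\xi_i| \ge t\}}\, dt \qquad \text{(by Jensen)}
\]
\[
= \int_0^\infty \sqrt{n P(|\xi_1| \ge t)}\, dt = \sqrt{n}\, \|\xi\|_{2,1}.
\]
Combining these pieces yields the upper bound in the case of symmetric variables $\xi_i$.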
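For asymmetric multipliers $\xi_i$, note that
\[
E^* \Big\| \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}}
= E^* \Big\| \sum_{i=1}^{n} (\xi_i - E\eta_i) Z_i \Big\|_{\mathcal{F}}
\le E^* \Big\| \sum_{i=1}^{n} (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}}.
\]
Then apply the bound already derived for symmetric multipliers to the right side in the above display:
\[
E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (\xi_i - \eta_i) Z_i \Big\|_{\mathcal{F}}
\le (n_0 - 1)\, E \max_{1 \le i \le n} \frac{|\xi_i - \eta_i|}{\sqrt{n}}\, E^* \|Z_1\|_{\mathcal{F}}
+ \|\xi - \eta\|_{2,1} \max_{n_0 \le k \le n} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}}.
\]
For the first term use the triangle inequality, which gives $E \max_i |\xi_i - \eta_i| \le 2 E \max_i |\xi_i|$, while for the second one note that $\|\xi - \eta\|_{2,1} \le 4 \|\xi\|_{2,1}$. In fact for any pair of random variables $\xi$ and $\eta$ we have that
\[
P(|\xi + \eta| > t) \le P(|\xi| > t/2) + P(|\eta| > t/2)
\]
and $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$, so
\[
\|\xi + \eta\|_{2,1} = \int_0^\infty \sqrt{P(|\xi + \eta| > t)}\, dt
\le \int_0^\infty \sqrt{P(|\xi| > t/2)}\, dt + \int_0^\infty \sqrt{P(|\eta| > t/2)}\, dt
\]
\[
= 2 \int_0^\infty \sqrt{P(|\xi| > t)}\, dt + 2 \int_0^\infty \sqrt{P(|\eta| > t)}\, dt
= 2 \|\xi\|_{2,1} + 2 \|\eta\|_{2,1}.
\]
This completes the proof. □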
The main application of Lemma 9.1 is to the unconditional multiplier central limit
theorem.
Theorem 9.1 Suppose that $\mathcal{F}$ is a class of measurable functions on a probability space $(\mathcal{X}, \mathcal{A}, P)$. Suppose that $\xi_1, \ldots, \xi_n$ are i.i.d. real random variables with mean zero, variance 1, and $\|\xi_1\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Then the sequence
\[
n^{-1/2} \sum_{i=1}^{n} \xi_i (\delta_{X_i} - P)
\]
converges to a tight limit process in $\ell^\infty(\mathcal{F})$ if and only if $\mathcal{F}$ is $P$-Donsker. When either convergence holds, the limit process in each case is a (tight) $P$-Brownian bridge process $\mathbb{G}$.
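To make the multiplier process concrete, the following sketch evaluates $n^{-1/2} \sum_i \xi_i (\delta_{X_i} - P) f$ for the indicator class $f_t = 1_{[0,t]}$ with $P$ uniform on $[0,1]$, so that $P f_t = t$. The class, the Gaussian multipliers, the sample size, and the name `multiplier_process` are all assumptions of this illustration, not part of the theorem.

```python
import math
import random

def multiplier_process(xs, xis, ts):
    """Values n^{-1/2} * sum_i xi_i * (1{x_i <= t} - t) for each t in ts.

    Assumes the X_i are drawn from the uniform distribution on [0, 1].
    """
    n = len(xs)
    return [
        sum(xi * ((1.0 if x <= t else 0.0) - t) for x, xi in zip(xs, xis)) / math.sqrt(n)
        for t in ts
    ]

random.seed(0)
n = 500
xs = [random.random() for _ in range(n)]          # observations X_i
xis = [random.gauss(0.0, 1.0) for _ in range(n)]  # mean-zero, variance-one multipliers
ts = [k / 50 for k in range(51)]
path = multiplier_process(xs, xis, ts)
sup_norm = max(abs(v) for v in path)              # sup norm over the grid of indices t
```

Replacing the multipliers by the constant 1 recovers the ordinary empirical process evaluated on the same grid.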
Proof. Since both the empirical process $n^{-1/2} \sum_{i=1}^n (\delta_{X_i} - P)$ and the multiplier process $n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$ do not change if indexed by the class of functions $\{f - Pf : f \in \mathcal{F}\}$ instead of $\mathcal{F}$, it may be assumed without loss of generality that $Pf = 0$ for every $f$. Marginal convergence of both sequences of processes is equivalent to $\mathcal{F} \subset L_2(P)$. It suffices to show that the asymptotic equicontinuity conditions for the empirical and the multiplier processes are equivalent. If $\mathcal{F}$ is Donsker, then its envelope function $F$ possesses a weak second moment: $P^*(F > x) = o(x^{-2})$ as $x \to \infty$ (see e.g. Lemma 2.3.9 in van der Vaart and Wellner (1996), page 113). By the same lemma, convergence of the multiplier process to a tight limit implies that $P^*(|\xi F| > x) = o(x^{-2})$. In particular $P^* F < \infty$. For $Z_i = \delta_{X_i} - P$, since $Pf = 0$ for all $f \in \mathcal{F}$ we have that
\[
E^* \|Z_1\|_{\mathcal{F}} = E^* F(X) = P^* F < \infty.
\]
Since $\|\xi\|_{2,1} < \infty$ implies that $E(\xi^2) < \infty$, it follows that
\[
E \max_{1 \le i \le n} |\xi_i| / \sqrt{n} \to 0.
\]
Consider the multiplier inequalities of Lemma 9.1; by what we claimed above, the first term on the far right side converges to 0 and we have
\[
\frac{1}{2} \|\xi\|_1 \limsup_{n \to \infty} E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}_\delta}
\le \limsup_{n \to \infty} E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}_\delta}
\le 4 \|\xi\|_{2,1} \sup_{k \ge n_0} E^* \Big\| \frac{1}{\sqrt{k}} \sum_{i=n_0}^{k} \varepsilon_i Z_i \Big\|_{\mathcal{F}_\delta}
\]
for every $n_0$ and $\delta > 0$. By the symmetrization inequalities of Section 4.1, the Rademacher random variables in these inequalities can be deleted at the cost of changing the constants by factors of two. Since $\|\xi\|_1 < \infty$ and $\|\xi\|_{2,1} < \infty$, this yields the conclusion that, for every sequence $\delta_n \downarrow 0$,
\[
E^* \Big\| n^{-1/2} \sum_{i=1}^{n} Z_i \Big\|_{\mathcal{F}_{\delta_n}} \to 0
\quad \text{if and only if} \quad
E^* \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i Z_i \Big\|_{\mathcal{F}_{\delta_n}} \to 0.
\]
These are the $L_1$-versions of the asymptotic equicontinuity conditions. But they are equivalent to the probability versions. Consider for example the case of the empirical process. Since $\mathcal{F}$ is Donsker, the random variable $\|Z_1\|_{\mathcal{F}}^*$ possesses a weak second moment. This implies that
\[
E^* \max_{1 \le i \le n} \frac{\|Z_i\|_{\mathcal{F}}^*}{\sqrt{n}} \to 0.
\]
In view of the triangle inequality, the same is true with $\mathcal{F}$ replaced by $\mathcal{F}_{\delta_n}$. The asymptotic equicontinuity condition corresponds to the convergence to zero in probability of $\| n^{-1/2} \sum_{i=1}^n Z_i \|_{\mathcal{F}_{\delta_n}}$; this implies pointwise convergence to zero of the sequence of their quantile functions. Apply the Hoffman-Jørgensen inequality of Section 4.4 to obtain the condition in terms of moments. □
9.2 Conditional multiplier CLT's
While the unconditional multiplier CLT (Theorem 9.1) is useful, the deeper conditional multiplier CLTs involve conditioning on the original $X_i$'s and examining the convergence properties of the resulting sums as a function of the random multipliers. The following two theorems assert weak convergence of $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$ in probability and given almost every sequence $X_1, X_2, \ldots$, and are of interest for statistics in connection with bootstrap results (see Chapter 14).
Conditional weak convergence in probability must be formulated in terms of a metric on conditional laws. Since conditional laws do not exist without proper measurability, we utilize the bounded dual Lipschitz distance based on outer expectations. In fact weak convergence, $\mathbb{G}_n \Rightarrow \mathbb{G}$, of a sequence of random elements $\mathbb{G}_n$ in $\ell^\infty(\mathcal{F})$ to a separable limit $\mathbb{G}$ is equivalent to
\[
\sup_{H \in BL_1} |E^* H(\mathbb{G}_n) - E H(\mathbb{G})| \to 0.
\]
Here $BL_1$ is the set of all functions $H : \ell^\infty(\mathcal{F}) \to [0, 1]$ such that $|H(z_1) - H(z_2)| \le \|z_1 - z_2\|_{\mathcal{F}}$ for every $z_1, z_2$.
Theorem 9.2 Suppose that $\mathcal{F}$ is a class of measurable functions and that $\xi_1, \ldots, \xi_n$ are i.i.d. random variables with mean zero, variance 1, and $\|\xi\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Let $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$. Then the following assertions are equivalent:
(i) $\mathcal{F}$ is Donsker.
(ii) $\sup_{H \in BL_1} |E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ in outer probability, and the sequence $\mathbb{G}_n'$ is asymptotically measurable.
For the almost sure conditional convergence we also use a condition in terms of the bounded dual Lipschitz distance.
Theorem 9.3 Suppose that $\mathcal{F}$ is a class of measurable functions and that $\xi_1, \ldots, \xi_n$ are i.i.d. random variables with mean zero, variance 1, and $\|\xi\|_{2,1} < \infty$, independent of $X_1, \ldots, X_n$. Let $\mathbb{G}_n' = n^{-1/2} \sum_{i=1}^n \xi_i (\delta_{X_i} - P)$. Then the following assertions are equivalent:
(i) $\mathcal{F}$ is Donsker and $P^* \|f - Pf\|_{\mathcal{F}}^2 < \infty$.
(ii) $\sup_{H \in BL_1} |E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ outer almost surely, and the sequence $E_\xi H(\mathbb{G}_n')^* - E_\xi H(\mathbb{G}_n')_*$ converges almost surely to zero for every $H \in BL_1$.
Note that (ii) implies that $|E_\xi H(\mathbb{G}_n') - E H(\mathbb{G})| \to 0$ for every $H \in BL_1$, for almost every sequence $X_1, X_2, \ldots$. By the portmanteau theorem, this is then also true for every continuous, bounded $H$. Thus, the sequence $\mathbb{G}_n'$ converges in distribution to $\mathbb{G}$ given almost every sequence $X_1, X_2, \ldots$.
Part II
Empirical Processes: Applications
Chapter 10
Consistency of Maximum
Likelihood Estimators
Consistency of the maximum likelihood estimator is well established for regular parametric models. For nonparametric models, even the definition of the maximum likelihood estimator poses certain problems, and it is clear that such estimators are not consistent in general. We first prove a general result for nonparametric maximum likelihood estimation in a convex class of densities. The results in this chapter are based on the papers of Pfanzagl (1988) and van de Geer (1993), (1996). Consider a class $\mathcal{P}$ of densities on a measurable space $(\mathcal{X}, \mathcal{A})$, with respect to a fixed $\sigma$-finite measure $\mu$. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_0$ with density $p_0 \in \mathcal{P}$. Let
\[
\hat{p}_n \equiv \arg\max_{p \in \mathcal{P}} \mathbb{P}_n \log p.
\]
For $0 < \alpha \le 1$, let $\varphi_\alpha(t) = (t^\alpha - 1)/(t^\alpha + 1)$ for $t \ge 0$, $\varphi_\alpha(t) = -1$ for $t < 0$. Then $\varphi_\alpha$ is bounded and continuous for each $\alpha \in (0, 1]$. For $0 < \beta < 1$ define
\[
h_\beta^2(p, q) \equiv 1 - \int p^\beta q^{1-\beta}\, d\mu.
\]
Note that
\[
h_{1/2}^2(p, q) \equiv h^2(p, q) = \frac{1}{2} \int \{\sqrt{p} - \sqrt{q}\}^2\, d\mu
\]
yields the Hellinger distance between $p$ and $q$. By Hölder's inequality, $h_\beta(p, q) \ge 0$ with equality if and only if $p = q$ a.e. $\mu$.
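For densities on a finite set (with $\mu$ taken to be counting measure) the quantities $h_\beta^2$ and $h^2$ reduce to finite sums. The following sketch illustrates the identity $h_{1/2}^2 = h^2$; the function names `h2_beta` and `hellinger_sq` and the example probability vectors are assumptions of this illustration.

```python
import math

def h2_beta(p, q, beta):
    """1 - sum_x p(x)**beta * q(x)**(1 - beta), for densities w.r.t. counting measure."""
    return 1.0 - sum(pi ** beta * qi ** (1.0 - beta) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance: (1/2) * sum_x (sqrt(p(x)) - sqrt(q(x)))**2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]
```

Expanding the square in `hellinger_sq` shows why the two expressions agree when `beta = 0.5`: the cross term is exactly the sum of $\sqrt{p q}$ and both densities sum to one.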
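Proposition 10.1 Suppose that $\mathcal{P}$ is convex. Then
\[
h_{1-\alpha/2}^2(\hat{p}_n, p_0) \le (\mathbb{P}_n - P_0) \Big( \varphi_\alpha \Big( \frac{\hat{p}_n}{p_0} \Big) \Big).
\]
In particular, when $\alpha = 1$ we have, with $\varphi \equiv \varphi_1$,
\[
h^2(\hat{p}_n, p_0) = h_{1/2}^2(\hat{p}_n, p_0) \le (\mathbb{P}_n - P_0) \Big( \varphi \Big( \frac{\hat{p}_n}{p_0} \Big) \Big) = (\mathbb{P}_n - P_0) \Big( \frac{2 \hat{p}_n}{\hat{p}_n + p_0} \Big).
\]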
Corollary 10.1 Suppose that $\{\varphi(p/p_0) : p \in \mathcal{P}\}$ is a $P_0$-Glivenko-Cantelli class. Then for each $0 < \alpha \le 1$, $h_{1-\alpha/2}(\hat p_n, p_0) \to_{a.s.} 0$.
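Proof. Since $\mathcal{P}$ is convex and $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$, it follows that
$$\mathbb{P}_n \log \frac{\hat p_n}{(1-t)\hat p_n + t p_1} \ge 0$$
for all $0 \le t \le 1$ and every $p_1 \in \mathcal{P}$; this holds in particular for $p_1 = p_0$. Note that equality holds if $t = 0$. Differentiation of the left side with respect to $t$ at $t = 0$ yields
$$\mathbb{P}_n \frac{p_1}{\hat p_n} \le 1 \quad \text{for every } p_1 \in \mathcal{P} .$$
If $L : (0, \infty) \to \mathbb{R}$ is increasing and $t \mapsto L(1/t)$ is convex, then Jensen's inequality yields
$$\mathbb{P}_n L\!\left(\frac{\hat p_n}{p_1}\right) \ge L\!\left(\frac{1}{\mathbb{P}_n(p_1/\hat p_n)}\right) \ge L(1) = \mathbb{P}_n L\!\left(\frac{p_1}{p_1}\right) .$$
Choosing $L = \varphi_\alpha$ and $p_1 = p_0$ in this last inequality and noting that $L(1) = 0$, it follows that
$$\text{(a)} \qquad 0 \le \mathbb{P}_n \varphi_\alpha(\hat p_n/p_0) = (\mathbb{P}_n - P_0)\varphi_\alpha(\hat p_n/p_0) + P_0 \varphi_\alpha(\hat p_n/p_0) ;$$
see van der Vaart and Wellner (1996), page 330, and Pfanzagl (1988), pages 141-143.
Now we show that
$$\text{(b)} \qquad P_0 \varphi_\alpha(p/p_0) = \int \frac{p^\alpha - p_0^\alpha}{p^\alpha + p_0^\alpha} \, dP_0 \le -\left(1 - \int p_0^\beta p^{1-\beta} \, d\mu\right)$$
for $\beta = 1 - \alpha/2$. Note that this holds if and only if
$$-1 + 2 \int \frac{p^\alpha}{p_0^\alpha + p^\alpha}\, p_0 \, d\mu \le -1 + \int p_0^\beta p^{1-\beta} \, d\mu ,$$
or
$$\int p_0^\beta p^{1-\beta} \, d\mu \ge 2 \int \frac{p^\alpha}{p_0^\alpha + p^\alpha}\, p_0 \, d\mu .$$
But this holds if
$$p_0^\beta p^{1-\beta} \ge 2\, \frac{p^\alpha p_0}{p_0^\alpha + p^\alpha} .$$
With $\beta = 1 - \alpha/2$, this becomes
$$\frac{1}{2}(p_0^\alpha + p^\alpha) \ge p_0^{\alpha/2} p^{\alpha/2} = \sqrt{p_0^\alpha p^\alpha} ,$$
and this holds by the arithmetic mean-geometric mean inequality. Thus (b) holds. Combining (b) with (a) yields the claim of the proposition. The corollary follows by noting that $\varphi(t) = (t-1)/(t+1) = 2t/(t+1) - 1$. □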
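Inequality (b) can be checked numerically for densities with respect to counting measure. The sketch below draws random strictly positive probability vectors and compares both sides of (b); the sample size, seed, and helper names are illustrative assumptions rather than anything prescribed by the notes.

```python
import random

def phi(t, alpha):
    """phi_alpha(t) = (t^alpha - 1) / (t^alpha + 1) for t >= 0."""
    return (t**alpha - 1.0) / (t**alpha + 1.0)

def random_density(k, rng):
    """A strictly positive probability vector on k points."""
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def check_b(p, p0, alpha):
    """Return (lhs, rhs) of inequality (b) for discrete densities p and p0."""
    beta = 1.0 - alpha / 2.0
    lhs = sum(p0i * phi(pi / p0i, alpha) for pi, p0i in zip(p, p0))
    rhs = -(1.0 - sum(p0i**beta * pi ** (1.0 - beta) for pi, p0i in zip(p, p0)))
    return lhs, rhs

rng = random.Random(0)
results = []
for _ in range(1000):
    p, p0 = random_density(5, rng), random_density(5, rng)
    alpha = rng.uniform(0.05, 1.0)
    results.append(check_b(p, p0, alpha))

# (b) asserts lhs <= rhs; allow a tiny tolerance for rounding.
ok = all(lhs <= rhs + 1e-12 for lhs, rhs in results)
print(ok)
```

Because the proof establishes (b) pointwise via the arithmetic mean-geometric mean inequality, the check is expected to succeed for every draw, not merely on average.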
The bound given in Proposition 10.1 is one of a family of results of this type. Here
is another one which does not require that the family P be convex.
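Proposition 10.2 (Van de Geer) Suppose that $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$. Then
$$h^2(\hat p_n, p_0) \le (\mathbb{P}_n - P_0)\left( \left( \sqrt{\frac{\hat p_n}{p_0}} - 1 \right) 1\{p_0 > 0\} \right) .$$
Proof. Since $\hat p_n$ maximizes $\mathbb{P}_n \log p$,
$$\begin{aligned}
0 &\le \frac{1}{2} \int_{[p_0 > 0]} \log \frac{\hat p_n}{p_0} \, d\mathbb{P}_n \\
&\le \int_{[p_0 > 0]} \left( \sqrt{\frac{\hat p_n}{p_0}} - 1 \right) d\mathbb{P}_n \qquad \text{since } \log(1 + x) \le x \\
&= \int_{[p_0 > 0]} \left( \sqrt{\frac{\hat p_n}{p_0}} - 1 \right) d(\mathbb{P}_n - P_0) + P_0\left( \left( \sqrt{\frac{\hat p_n}{p_0}} - 1 \right) 1\{p_0 > 0\} \right) \\
&= \int_{[p_0 > 0]} \left( \sqrt{\frac{\hat p_n}{p_0}} - 1 \right) d(\mathbb{P}_n - P_0) - h^2(\hat p_n, p_0) ,
\end{aligned}$$
where the last equality follows by direct calculation and the definition of the Hellinger metric $h$. □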
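The "direct calculation" behind the last equality can be spelled out as follows. This is a sketch that assumes $\hat p_n$ is a probability density with respect to $\mu$ and uses the identity $h^2(p, q) = \frac12 \int (\sqrt p - \sqrt q)^2 \, d\mu = 1 - \int \sqrt{pq} \, d\mu$.

```latex
\begin{align*}
P_0\left[\left(\sqrt{\hat p_n/p_0} - 1\right) 1\{p_0 > 0\}\right]
  &= \int_{[p_0 > 0]} \sqrt{\hat p_n \, p_0} \, d\mu - \int_{[p_0 > 0]} p_0 \, d\mu \\
  &= \int \sqrt{\hat p_n \, p_0} \, d\mu - 1 \\
  &= -\Big(1 - \int \sqrt{\hat p_n \, p_0} \, d\mu\Big)
   = -h^2(\hat p_n, p_0).
\end{align*}
```

The second line uses that $\sqrt{\hat p_n p_0}$ vanishes where $p_0 = 0$ and that $p_0$ integrates to one.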
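Proposition 10.3 (Birgé and Massart) If $\hat p_n$ maximizes $\mathbb{P}_n \log p$ over $\mathcal{P}$, then
$$h^2((\hat p_n + p_0)/2, p_0) \le (\mathbb{P}_n - P_0)\left( \frac{1}{2} \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) ,$$
and
$$h^2(\hat p_n, p_0) \le 24\, h^2\!\left( \frac{\hat p_n + p_0}{2}, p_0 \right) .$$
Proof. By concavity of $\log$,
$$\log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \ge \frac{1}{2} \log\left( \frac{\hat p_n}{p_0} \right) 1_{[p_0 > 0]} .$$
Thus
$$\begin{aligned}
0 &\le \frac{1}{4} \mathbb{P}_n\left( \log\left( \frac{\hat p_n}{p_0} \right) 1_{[p_0 > 0]} \right) \\
&\le \frac{1}{2} \mathbb{P}_n\left( \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) \\
&= \frac{1}{2} (\mathbb{P}_n - P_0)\left( \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) + \frac{1}{2} P_0\left( \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) \\
&= \frac{1}{2} (\mathbb{P}_n - P_0)\left( \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) - \frac{1}{2} K(P_0, (\hat P_n + P_0)/2) \\
&\le \frac{1}{2} (\mathbb{P}_n - P_0)\left( \log\left( \frac{\hat p_n + p_0}{2 p_0} \right) 1_{[p_0 > 0]} \right) - h^2(P_0, (\hat P_n + P_0)/2) ,
\end{aligned}$$
where we used Exercise 10.2 at the last step. The second claim follows from Exercise 10.3. □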
Corollary 10.2 (Hellinger consistency of MLE) Suppose that either $\{(\sqrt{p/p_0} - 1) 1\{p_0 > 0\} : p \in \mathcal{P}\}$ or $\{\frac{1}{2} \log\left(\frac{p + p_0}{2 p_0}\right) 1\{p_0 > 0\} : p \in \mathcal{P}\}$ is a $P_0$-Glivenko-Cantelli class. Then $h(\hat p_n, p_0) \to_{a.s.} 0$.
The following examples show how the Glivenko-Cantelli preservation theorems of
Chapter 5 can be used to verify the hypotheses of Corollary 10.1 and Corollary 10.2.
Example 10.1 (Interval censoring, case I) Suppose that $Y \sim F$ on $\mathbb{R}^+$ and $T \sim G$. Here $Y$ is the time of some event of interest, and $T$ is an "observation time". Unfortunately, we do not observe $(Y, T)$; instead what is observed is $X = (1\{Y \le T\}, T) \equiv (\Delta, T)$. Our goal is to estimate $F$, the distribution of $Y$. Let $P_0$ be the distribution corresponding to $F_0$, and suppose that $(\Delta_1, T_1), \ldots, (\Delta_n, T_n)$ are i.i.d. as $(\Delta, T)$. Note that the conditional distribution of $\Delta$ given $T$ is simply Bernoulli$(F(T))$, and hence the density of $(\Delta, T)$ with respect to the dominating measure $\# \times G$ (here $\#$ denotes counting measure on $\{0, 1\}$) is given by
$$p_F(\delta, t) = F(t)^\delta (1 - F(t))^{1-\delta} .$$
Note that the sample space in this case is
$$\mathcal{X} = \{(\delta, t) : \delta \in \{0, 1\}, t \in \mathbb{R}^+\} = \{(1, t) : t \in \mathbb{R}^+\} \cup \{(0, t) : t \in \mathbb{R}^+\} := \mathcal{X}_1 \cup \mathcal{X}_2 .$$
Now, the class of functions $\{p_F : F \text{ a d.f. on } \mathbb{R}^+\}$ is a universal Glivenko-Cantelli class by an application of Theorem 5.7, since on $\mathcal{X}_1$, $p_F(1, t) = F(t)$, while on $\mathcal{X}_2$, $p_F(0, t) = 1 - F(t)$, where $F$ is a distribution function (and hence bounded and monotone nondecreasing). Furthermore the class of functions $\{p_F/p_{F_0} : F \text{ a d.f. on } \mathbb{R}^+\}$ is $P_0$-Glivenko-Cantelli by an application of Theorem 5.6: take $\mathcal{F}_1 = \{p_F : F \text{ a d.f. on } \mathbb{R}^+\}$ and $\mathcal{F}_2 = \{1/p_{F_0}\}$, and $\varphi(u, v) = uv$. Then both $\mathcal{F}_1$ and $\mathcal{F}_2$ are $P_0$-Glivenko-Cantelli classes, $\varphi$ is continuous, and $\mathcal{H} = \varphi(\mathcal{F}_1, \mathcal{F}_2)$ has $P_0$-integrable envelope $1/p_{F_0}$. Finally, a further application of Theorem 5.6 with $\varphi(t) = (t - 1)/(t + 1)$ shows that the hypothesis of Corollary 10.1 holds: $\{\varphi(p_F/p_{F_0}) : F \text{ a d.f. on } \mathbb{R}^+\}$ is $P_0$-Glivenko-Cantelli. Hence the conclusion of the corollary holds and we conclude that
$$h^2(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0 \quad \text{as } n \to \infty .$$
Now note that $h^2(p, p_0) \ge d_{TV}^2(p, p_0)/2$ and we compute
$$d_{TV}(p_{\hat F_n}, p_{F_0}) = \int |\hat F_n(t) - F_0(t)| \, dG(t) + \int |1 - \hat F_n(t) - (1 - F_0(t))| \, dG(t) = 2 \int |\hat F_n(t) - F_0(t)| \, dG(t) ,$$
so we conclude that
$$\int |\hat F_n(t) - F_0(t)| \, dG(t) \to_{a.s.} 0$$
as $n \to \infty$. Since $\hat F_n$ and $F_0$ are bounded (by one), we can also conclude that
$$\int |\hat F_n(t) - F_0(t)|^r \, dG(t) \to_{a.s.} 0$$
for each $r \ge 1$, in particular for $r = 2$.
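The consistency statement above can be illustrated by simulation. The sketch below relies on a standard fact that is not stated in these notes: in the case I (current status) model, the NPMLE of $F$ at the ordered observation times is the isotonic (nondecreasing least-squares) regression of the indicators $\Delta_i$, which the pool-adjacent-violators algorithm computes. The choices $F_0 = \text{Exponential}(1)$, $G = \text{Uniform}(0, 3)$, the sample size, and all function names are assumptions made purely for illustration.

```python
import math
import random

def pava(y):
    """Pool-adjacent-violators: nondecreasing least-squares fit to y (unit weights)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the previous block mean exceeds the last block mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def simulate_npmle(n, seed=0):
    """Simulate current status data and return (times, F_hat at times, L1 error)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.expovariate(1.0)      # event time, F_0 = Exponential(1) (assumed)
        t = rng.uniform(0.0, 3.0)     # observation time, G = Uniform(0, 3) (assumed)
        data.append((t, 1 if y <= t else 0))
    data.sort()                        # order by observation time
    times = [t for t, _ in data]
    f_hat = pava([d for _, d in data])
    # Empirical analogue of the integral of |F_hat - F_0| with respect to G.
    l1 = sum(abs(f - (1.0 - math.exp(-t))) for f, t in zip(f_hat, times)) / n
    return times, f_hat, l1

times, f_hat, l1_error = simulate_npmle(2000)
print(l1_error)
```

Rerunning `simulate_npmle` with larger `n` should show the empirical $L_1(G)$ error shrinking, in line with the almost sure convergence derived above.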
Example 10.2 (Mixed case interval censoring) Our goal in this example is to use the theory developed so far to give a proof of the consistency result of Schick and Yu (2000) for the Maximum Likelihood Estimator (MLE) $\hat F_n$ for "mixed case" interval censored data. Our proof is based on Proposition 10.1 and Corollary 10.1.
Suppose that $Y$ is a random variable taking values in $\mathbb{R}^+ = [0, \infty)$ with distribution function $F \in \mathcal{F} = \{\text{all df's } F \text{ on } \mathbb{R}^+\}$. Unfortunately we are not able to observe $Y$ itself. What we do observe is a vector of times $T_K = (T_{K,1}, \ldots, T_{K,K})$ where $K$, the number of times, is itself random, and the interval $(T_{K,j-1}, T_{K,j}]$ into which $Y$ falls (with $T_{K,0} \equiv 0$, $T_{K,K+1} \equiv \infty$). More formally, we assume that $K$ is an integer-valued random variable, that $T = \{T_{k,j}, j = 1, \ldots, k, k = 1, 2, \ldots\}$ is a triangular array of "potential observation times", and that $Y$ and $(K, T)$ are independent. Let $X = (\Delta_K, T_K, K)$, with a possible value $x = (\delta_k, t_k, k)$, where $\Delta_k = (\Delta_{k,1}, \ldots, \Delta_{k,k+1})$ with $\Delta_{k,j} = 1_{(T_{k,j-1}, T_{k,j}]}(Y)$, $j = 1, 2, \ldots, k+1$, and $T_k$ is the $k$-th row of the triangular array $T$. Suppose we observe $n$ i.i.d. copies of $X$: $X_1, X_2, \ldots, X_n$, where $X_i = (\Delta^{(i)}_{K^{(i)}}, T^{(i)}_{K^{(i)}}, K^{(i)})$, $i = 1, 2, \ldots, n$. Here $(Y^{(i)}, T^{(i)}, K^{(i)})$, $i = 1, 2, \ldots$ are the underlying i.i.d. copies of $(Y, T, K)$.
We first note that conditionally on $K$ and $T_K$, the vector $\Delta_K$ has a multinomial distribution:
$$(\Delta_K \mid K, T_K) \sim \text{Multinomial}_{K+1}(1, \Delta F_K) ,$$
where
$$\Delta F_K \equiv (F(T_{K,1}), F(T_{K,2}) - F(T_{K,1}), \ldots, 1 - F(T_{K,K})) .$$
Suppose for the moment that the distribution $G_k$ of $(T_K \mid K = k)$ has density $g_k$ and $p_k \equiv P(K = k)$. Then a density of $X$ is given by
$$\text{(1)} \qquad p_F(x) \equiv p_F(\delta, t_k, k) = \prod_{j=1}^{k+1} (F(t_{k,j}) - F(t_{k,j-1}))^{\delta_{k,j}} \, g_k(t)\, p_k ,$$
where $t_{k,0} \equiv 0$, $t_{k,k+1} \equiv \infty$. In general,
$$\text{(2)} \qquad p_F(x) \equiv p_F(\delta, t_k, k) = \prod_{j=1}^{k+1} (F(t_{k,j}) - F(t_{k,j-1}))^{\delta_{k,j}} = \sum_{j=1}^{k+1} \delta_{k,j} (F(t_{k,j}) - F(t_{k,j-1}))$$
is a density of $X$ with respect to the dominating measure $\nu$, where $\nu$ is determined by the joint distribution of $(K, T)$, and it is this version of the density of $X$ with which we will work throughout the rest of this example. Thus the log-likelihood function for $F$ of $X_1, \ldots, X_n$ is given by
$$\frac{1}{n} l_n(F \mid X) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K^{(i)}+1} \Delta^{(i)}_{K,j} \log\left( F(T^{(i)}_{K^{(i)},j}) - F(T^{(i)}_{K^{(i)},j-1}) \right) = \mathbb{P}_n m_F ,$$
where
$$m_F(X) = \sum_{j=1}^{K+1} \Delta_{K,j} \log(F(T_{K,j}) - F(T_{K,j-1})) \equiv \sum_{j=1}^{K+1} \Delta_{K,j} \log(\Delta F_{K,j})$$
and where we have ignored the terms not involving $F$. We also note that
$$P m_F(X) = P\left( \sum_{j=1}^{K+1} \Delta F_{0,K,j} \log(\Delta F_{K,j}) \right) .$$
The (Nonparametric) Maximum Likelihood Estimator (MLE) $\hat F_n$ is the distribution function $\hat F_n(t)$ which puts all its mass at the observed time points and maximizes the log-likelihood $l_n(F \mid X)$. It can be calculated via the iterative convex minorant algorithm proposed in Groeneboom and Wellner (1992) for case 2 interval censored data.
By Proposition 10.1 with $\alpha = 1$ and $\varphi \equiv \varphi_1$ as before, it follows that
$$h^2(p_{\hat F_n}, p_{F_0}) \le (\mathbb{P}_n - P_0)(\varphi(p_{\hat F_n}/p_{F_0})) ,$$
where $\varphi$ is bounded and continuous from $\mathbb{R}$ to $\mathbb{R}$. Now the collection of functions
$$\mathcal{G} \equiv \{p_F : F \in \mathcal{F}\}$$
is easily seen to be a Glivenko-Cantelli class of functions. This can be seen by first applying Theorem 5.7 to the collections $\mathcal{G}_k$, $k = 1, 2, \ldots$ obtained from $\mathcal{G}$ by restricting to the sets $K = k$. Then for fixed $k$, the collections $\mathcal{G}_k = \{p_F(\delta, t_k, k) : F \in \mathcal{F}\}$ are $P_0$-Glivenko-Cantelli classes since $\mathcal{F}$ is a uniform Glivenko-Cantelli class, and since the functions $p_F$ are continuous transformations of the classes of functions $x \to \delta_{k,j}$ and $x \to F(t_{k,j})$ for $j = 1, \ldots, k+1$, and hence $\mathcal{G}$ is $P$-Glivenko-Cantelli by Theorem 5.6. Note that the single function $p_{F_0}$ is trivially $P_0$-Glivenko-Cantelli since it is uniformly bounded, and the single function $1/p_{F_0}$ is also $P_0$-Glivenko-Cantelli since $P_0(1/p_{F_0}) < \infty$. Thus by Proposition 5.2 with $g = 1/p_{F_0}$ and $\mathcal{F} = \mathcal{G} = \{p_F : F \in \mathcal{F}\}$, it follows that $\mathcal{G}' \equiv \{p_F/p_{F_0} : F \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. Finally another application of Theorem 5.6 shows that the collection
$$\mathcal{H} \equiv \{\varphi(p_F/p_{F_0}) : F \in \mathcal{F}\}$$
is also $P_0$-Glivenko-Cantelli. When combined with Corollary 10.1, this yields the following theorem.
Theorem 10.1 The NPMLE $\hat F_n$ satisfies
$$h(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0 .$$
To relate this result to a recent theorem of Schick and Yu (2000), it remains only to understand the relationship between their $L_1(\mu)$ metric and the Hellinger metric $h$ between $p_F$ and $p_{F_0}$. Let $\mathcal{B}$ denote the collection of Borel sets in $\mathbb{R}$. On $\mathcal{B}$ we define measures $\mu$ and $\tilde\mu$ as follows: for $B \in \mathcal{B}$,
$$\text{(3)} \qquad \mu(B) = \sum_{k=1}^{\infty} P(K = k) \sum_{j=1}^{k} P(T_{k,j} \in B \mid K = k) ,$$
and
$$\text{(4)} \qquad \tilde\mu(B) = \sum_{k=1}^{\infty} P(K = k)\, \frac{1}{k} \sum_{j=1}^{k} P(T_{k,j} \in B \mid K = k) .$$
Let $d$ be the $L_1(\mu)$ metric on the class $\mathcal{F}$; thus for $F_1, F_2 \in \mathcal{F}$,
$$d(F_1, F_2) = \int |F_1(t) - F_2(t)| \, d\mu(t) .$$
The measure $\mu$ was introduced by Schick and Yu (2000); note that $\mu$ is a finite measure if $E(K) < \infty$. Note that $d(F_1, F_2)$ can also be written in terms of an expectation as
$$\text{(5)} \qquad d(F_1, F_2) = E_{(K,T)}\left[ \sum_{j=1}^{K+1} |F_1(T_{K,j}) - F_2(T_{K,j})| \right] .$$
As Schick and Yu (2000) observed, consistency of the NPMLE $\hat F_n$ in $L_1(\mu)$ holds under virtually no further hypotheses.
Theorem 10.2 (Schick and Yu). Suppose that $E(K) < \infty$. Then $d(\hat F_n, F_0) \to_{a.s.} 0$.
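Proof. We will show that Theorem 10.2 follows from Theorem 10.1 and the following Lemma. □
Lemma 10.1
$$\frac{1}{2} \left\{ \int |\hat F_n - F_0| \, d\tilde\mu \right\}^2 \le h^2(p_{\hat F_n}, p_{F_0}) .$$
Proof. We know that
$$h^2(p_{\hat F_n}, p_{F_0}) \le d_{TV}(p_{\hat F_n}, p_{F_0}) \le \sqrt{2}\, h(p_{\hat F_n}, p_{F_0})$$
where, with $y_{k,0} = -\infty$, $y_{k,k+1} = \infty$,
$$h^2(p_{\hat F_n}, p_{F_0}) = \sum_{k=1}^{\infty} P(K = k) \int \sum_{j=1}^{k+1} \left\{ [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})]^{1/2} - [F_0(y_{k,j}) - F_0(y_{k,j-1})]^{1/2} \right\}^2 dG_k(y)$$
while
$$d_{TV}(p_{\hat F_n}, p_{F_0}) = \sum_{k=1}^{\infty} P(K = k) \int \sum_{j=1}^{k+1} \left| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \right| dG_k(y) .$$
Note that
$$\sum_{j=1}^{k+1} \left| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \right| = \sum_{j=1}^{k+1} |(\hat F_n - F_0)(y_{k,j-1}, y_{k,j}]| \ge \max_{1 \le j \le k+1} |\hat F_n(y_{k,j}) - F_0(y_{k,j})| ,$$
so integrating across this inequality with respect to $G_k(y)$ yields
$$\begin{aligned}
\int \sum_{j=1}^{k+1} \left| [\hat F_n(y_{k,j}) - \hat F_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})] \right| dG_k(y)
&\ge \max_{1 \le j \le k} \int |\hat F_n(y_{k,j}) - F_0(y_{k,j})| \, dG_{k,j}(y_{k,j}) \\
&\ge \frac{1}{k} \sum_{j=1}^{k} \int |\hat F_n(y_{k,j}) - F_0(y_{k,j})| \, dG_{k,j}(y_{k,j}) .
\end{aligned}$$
By multiplying across by $P(K = k)$ and summing over $k$, this yields
$$d_{TV}(p_{\hat F_n}, p_{F_0}) \ge \int |\hat F_n - F_0| \, d\tilde\mu ,$$
and hence
$$\text{(a)} \qquad h^2(p_{\hat F_n}, p_{F_0}) \ge \frac{1}{2} \left\{ \int |\hat F_n - F_0| \, d\tilde\mu \right\}^2 .$$
□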
The measure $\tilde\mu$ figuring in Lemma 10.1 is not the same as the measure $\mu$ of Schick and Yu (2000) because of the factor $1/k$. Note that this factor means that the measure $\tilde\mu$ is always a finite measure, even if $E(K) = \infty$. It is clear that
$$\tilde\mu(B) \le \mu(B)$$
for every Borel set $B$, and that $\mu \ll \tilde\mu$. The following lemma (Lemma 2.2 of Schick and Yu (2000)) together with Lemma 10.1 shows that Theorem 10.1 implies the result of Schick and Yu once again.
Lemma 10.2 Suppose that $\mu$ and $\tilde\mu$ are two finite measures, and that $g, g_1, g_2, \ldots$ are measurable functions with range in $[0, 1]$. Suppose that $\mu$ is absolutely continuous with respect to $\tilde\mu$. Then $\int |g_n - g| \, d\tilde\mu \to 0$ implies that $\int |g_n - g| \, d\mu \to 0$.
Proof. Write
$$\int |g_n - g| \, d\mu = \int |g_n - g| \, \frac{d\mu}{d\tilde\mu} \, d\tilde\mu$$
and use the dominated convergence theorem applied to a.e. convergent subsequences. □
Example 10.3 (Exponential scale mixtures) Suppose that $\mathcal{P} = \{P_G : G \text{ a d.f. on } \mathbb{R}\}$ where the measures $P_G$ are scale mixtures of exponential distributions with mixing distribution $G$:
$$p_G(x) = \int_0^\infty y e^{-yx} \, dG(y) .$$
We first show that the map $G \to p_G(x)$ is continuous with respect to the topology of vague convergence for distributions $G$. This follows easily since kernels for our mixing family are bounded, continuous, and satisfy $y e^{-xy} \to 0$ as $y \to \infty$ for every $x > 0$. Since vague convergence of distribution functions implies that integrals of bounded continuous functions vanishing at infinity converge, it follows that $p(x, G)$ is continuous with respect to the vague topology for every $x > 0$. This implies, moreover, that the family $\mathcal{F} = \{p_G/(p_G + p_0) : G \text{ is a d.f. on } \mathbb{R}\}$ is pointwise, for a.e. $x$, continuous in $G$ with respect to the vague topology. Since the family of sub-distribution functions $G$ on $\mathbb{R}$ is compact for (a metric for) the vague topology (see e.g. Bauer (1972), page 241), and the family of functions $\mathcal{F}$ is uniformly bounded by 1, we conclude from Lemma 5.1 that $N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\varepsilon > 0$. Thus it follows from Corollary 10.1 that the MLE $\hat G_n$ of $G_0$ satisfies
$$h(p_{\hat G_n}, p_{G_0}) \to_{a.s.} 0 .$$
By uniqueness of Laplace transforms, this implies that $\hat G_n$ converges weakly to $G_0$ with probability 1. This method of proof is due to Pfanzagl (1988); in this case we recover a result of Jewell (1982). (See also Van de Geer (2000), Example 4.2.4, page 54.)
Example 10.4 (k-monotone densities) Suppose that $\mathcal{P}_k = \{P_G : G \text{ a d.f. on } \mathbb{R}\}$ where the measures $P_G$ are scale mixtures of Beta$(1, k)$ distributions with mixing distribution $G$:
$$p_G(x) = \int_0^\infty y \left( 1 - \frac{yx}{k} \right)_+^{k-1} dG(y) = \int_0^{k/x} y \left( 1 - \frac{yx}{k} \right)^{k-1} dG(y) , \qquad x > 0 .$$
With $k = 1$, the class $\mathcal{P}_1$ coincides with the class of monotone decreasing densities on $\mathbb{R}^+$ studied by Prakasa Rao (1969); the class $\mathcal{P}_2$ corresponds to the class of convex decreasing densities studied by Groeneboom, Jongbloed, and Wellner (2001). Of course the case $k = \infty$ is just Example 10.3. To prove consistency of the MLE, we again show that the map $G \to p_G(x)$ is continuous with respect to the topology of vague convergence for distributions $G$. This follows easily since kernels for this mixing family are bounded, continuous, and satisfy $y(1 - yx/k)_+^{k-1} \to 0$ as $y \to 0$ or $\infty$ for every $x > 0$. Since vague convergence of distribution functions implies that integrals of bounded continuous functions vanishing at infinity converge, it follows that $p(x, G)$ is continuous with respect to the vague topology for every $x > 0$. By the same argument as in Example 10.3 it follows that $N_{[\,]}(\varepsilon, \mathcal{F}, L_1(P)) < \infty$ for every $\varepsilon > 0$, and hence from Corollary 10.1 that the MLE $\hat G_n$ of $G_0$ satisfies
$$h(p_{\hat G_n}, p_{G_0}) \to_{a.s.} 0 .$$
This implies that $\tau(\hat G_n, G_0) \to_{a.s.} 0$ for any metric $\tau$ for the vague topology (see Exercise 10.4), and hence that $d_{BL}(\hat G_n, G_0) \to_{a.s.} 0$ (since $G_0$ is a proper distribution function). This gives another proof of a result of Balabdaoui (2003).
Example 10.5 (Current status competing risks data) Suppose that $(X_1, X_2, \ldots, X_J, T)$ is a $(J+1)$-vector of non-negative, real valued random variables. We assume that $T$ is independent of $(X_1, \ldots, X_J)$, and that $T \sim G$. Let $X_{(1)}$ be the minimum of $X_1, X_2, \ldots, X_J$, let $F_j$ be the cumulative incidence function for $X_j$,
$$F_j(t) = P(X_j \le t, X_j = X_{(1)}) ,$$
and define
$$S(t) = 1 - \sum_{j=1}^{J} F_j(t) \equiv 1 - F(t) .$$
Let $\Delta_j^* = 1\{X_j = X_{(1)}\}$ and $\Delta_j = 1\{X_{(1)} \le T\} \Delta_j^*$ for $j = 1, \ldots, J$. Suppose we observe
$$(\Delta_1, \ldots, \Delta_J, T) .$$
Finally, set $\Delta_\cdot = \sum_{j=1}^{J} \Delta_j = 1\{X_{(1)} \le T\}$. Then, conditionally on $T = t$, the distribution of $(\Delta_1, \ldots, \Delta_J, 1 - \Delta_\cdot)$ is multinomial:
$$(\Delta_1, \ldots, \Delta_J, 1 - \Delta_\cdot) \sim \text{Multinomial}_{J+1}(1, (F_1(t), \ldots, F_J(t), S(t))) .$$
Note that the $F_j$'s are monotone nondecreasing, while $S$ is monotone nonincreasing. Thus the joint density $p_F$ for one observation is given by
$$p_F(\delta_1, \ldots, \delta_J, \delta_{J+1}, t) = \prod_{j=1}^{J+1} F_j(t)^{\delta_j}$$
with respect to $\# \times G$ where $\#$ denotes counting measure on $\{0, 1\}^{J+1}$, $\delta_{J+1} = 1 - \delta_\cdot$ with $\delta_\cdot = \sum_{j=1}^{J} \delta_j$, $F_{J+1} = S$, and $F = (F_1, \ldots, F_J) \in \mathcal{F}_J$, the class of $J$-tuples of nondecreasing functions summing pointwise to no more than 1.
Suppose we observe
$$(\Delta_{1i}, \ldots, \Delta_{Ji}, T_i) , \qquad i = 1, \ldots, n$$
i.i.d. as $(\Delta_1, \ldots, \Delta_J, T)$. Our goal is to estimate $F_1, \ldots, F_J$. These models are of current interest in the biostatistics literature; see e.g. Jewell and Kalbfleisch (2001) or Jewell, van der Laan, and Henneman (2001).
This is a convex model, so Proposition 10.1 and Corollary 10.1 apply. To show that the class of functions $\{\varphi(p_F/p_{F_0}) : F = (F_1, \ldots, F_J) \in \mathcal{F}_J\}$ is $P_0$-Glivenko-Cantelli, we first use Theorem 5.7 applied to $\{p_F : F \in \mathcal{F}_J\}$ and the partition $\{\mathcal{X}_j\}_{j=1}^{J+1}$ where $\mathcal{X}_j = \{(0, \ldots, 1, 0, \ldots, 0, t) : t \in \mathbb{R}\}$, with the 1 in the $j$-th position, for $j = 1, \ldots, J$ and $\mathcal{X}_{J+1} = \{(0, \ldots, 0, t) : t \in \mathbb{R}\}$. Then the functions $p_F|_{\mathcal{X}_j}$ are bounded and monotone nondecreasing for $j = 1, \ldots, J$, and bounded and monotone nonincreasing for $j = J + 1$, and hence are (universal) Glivenko-Cantelli. The conclusion from Theorem 5.7 is that $\mathcal{P} = \{p_F : F \in \mathcal{F}_J\}$ is Glivenko-Cantelli. The next step is just as in both Examples 10.1 and 10.2: since $1/p_{F_0}$ is $P_0 = P_{F_0}$ integrable, the collection $\mathcal{P}$ is uniformly bounded, and $\varphi(u, v) = uv$ is continuous, it follows from Proposition 5.2 that $\mathcal{P}/p_{F_0} = \{p_F/p_{F_0} : F \in \mathcal{F}_J\}$ is $P_0$-Glivenko-Cantelli. Finally, it follows from Theorem 5.6 that $\{\varphi(p_F/p_{F_0}) : F \in \mathcal{F}_J\}$ with $\varphi(t) = (t - 1)/(t + 1)$ is $P_0$-Glivenko-Cantelli. We conclude that
$$h(p_{\hat F_n}, p_{F_0}) \to_{a.s.} 0 \quad \text{as } n \to \infty .$$
By the familiar inequality relating Hellinger and total variation distance, we conclude that
$$d_{TV}(p_{\hat F_n}, p_{F_0}) = \sum_{j=1}^{J+1} \int |\hat F_{nj}(t) - F_{0j}(t)| \, dG(t) \to_{a.s.} 0 .$$
Example 10.6 (Cox model with interval censored data) Suppose that conditional on a covariate vector $Z$, $Y$ has conditional survival function
$$1 - F(y \mid Z) = (1 - F(y))^{\exp(\beta^T Z)}$$
where $\beta \in \mathbb{R}^d$, $Z \in \mathbb{R}^d$, and $F$ is a distribution function on $\mathbb{R}^+$. For simplicity of notation we will write this in terms of survival functions as $S(y \mid z) = S(y)^{\exp(\beta^T z)}$. Suppose that conditionally on $Z$ the pair of random variables $(U, V)$ has conditional distribution $G(\cdot \mid Z)$ with $P(U < V \mid Z) = 1$, and that conditionally on $Z$ the pair $(U, V)$ is independent of $Y$. Finally, suppose that $Z$ has distribution $H$ on $\mathbb{R}^d$. Suppose that we observe only i.i.d. copies of $X = (\Delta_1, \Delta_2, \Delta_3, U, V, Z)$ where
$$\Delta = (\Delta_1, \Delta_2, \Delta_3) = (1_{[0,U]}(Y), 1_{(U,V]}(Y), 1_{(V,\infty)}(Y)) .$$
Based on $X_1, \ldots, X_n$ i.i.d. as $X$ our goal is to estimate $\beta$ and $F$.
The parameter space is $\Theta = \mathbb{R}^d \times \{\text{all d.f.'s on } \mathbb{R}^+\}$. The conditional distribution of $\Delta$ given $U, V, Z$ is just multinomial with one trial, three cells, and cell-probabilities
$$(1 - S(U \mid Z), S(U \mid Z) - S(V \mid Z), S(V \mid Z)) .$$
Thus
$$p_{\beta,F}(\delta, u, v, z) = (1 - S(u \mid z))^{\delta_1} (S(u \mid z) - S(v \mid z))^{\delta_2} S(v \mid z)^{\delta_3}$$
with respect to the dominating measure given by the product of counting measure on $\{0, 1\}^3 \times G \times H$.
As in the previous examples, we first use Theorem 5.7 applied to $\{p_{\beta,F} : F \text{ a d.f. on } \mathbb{R}^+, \beta \in \mathbb{R}^d\}$, and the partition $\{\mathcal{X}_j\}_{j=1}^3$ where $\mathcal{X}_j$ corresponds to $\delta_j = 1$ for $j = 1, 2, 3$. On $\mathcal{X}_1$ the class of functions we need to consider is $\{1 - S(u)^{\exp(\beta^T z)} : F \text{ a d.f. on } \mathbb{R}^+, \beta \in \mathbb{R}^d\}$. Up to the leading constant 1, this is of the form $\phi(\mathcal{G}_1, \mathcal{G}_2)$ where $\mathcal{G}_1 = \{S = 1 - F : F \text{ a d.f. on } \mathbb{R}^+\}$, $\mathcal{G}_2 = \{\exp(\beta^T z) : \beta \in \mathbb{R}^d\}$, and $\phi(r, s) = r^s$. Now $\mathcal{G}_1$ is a universal Glivenko-Cantelli class (since it is a class of uniformly bounded decreasing functions), and $\mathcal{G}_2$ is a Glivenko-Cantelli class if we assume that $\beta \in K \subset \mathbb{R}^d$ for some compact set $K$. Then $|\beta^T Z| \le M |Z|$ for $M = \sup_{\beta \in K} |\beta|$ is an envelope for $\beta^T Z$ and hence $G_2(z) = \exp(M |z|)$ is an integrable envelope for $\{\exp(\beta^T z) : \beta \in K\}$ if $E \exp(M |Z|) < \infty$. Thus $\mathcal{G}_2$ is $P$-Glivenko-Cantelli under these two assumptions. Furthermore, all the functions $\phi(g_1, g_2) = g_1^{g_2}$ with $g_i \in \mathcal{G}_i$ for $i = 1, 2$ are uniformly bounded by 1. We conclude from Theorem 5.6 that the class $\{p_{\beta,F}(1, 0, 0, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+, \beta \in K\}$ is a $P$-Glivenko-Cantelli class of functions under these same two assumptions. Similarly, under these same assumptions the class $\{p_{\beta,F}(0, 0, 1, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+, \beta \in K\}$ is a $P$-Glivenko-Cantelli class of functions, and so is $\{p_{\beta,F}(0, 1, 0, u, v, z) : F \text{ a d.f. on } \mathbb{R}^+, \beta \in K\}$ since it is the difference of two $P$-Glivenko-Cantelli classes. Much as in Examples 10.1 and 10.2 it follows that $\{\varphi(p_{\beta,F}/p_{\beta_0,F_0}) : F \text{ a d.f. on } \mathbb{R}^+, \beta \in K\}$ is $P$-Glivenko-Cantelli where $\varphi(t) = \sqrt{t}$.
Thus it follows from Proposition 10.2 that the MLE $\hat\theta_n = (\hat\beta_n, \hat F_n)$ satisfies
$$h(p_{\hat\beta_n, \hat F_n}, p_{\beta_0, F_0}) \to_{a.s.} 0 .$$
Since convergence in the Hellinger metric implies convergence in the total variation metric, the convergence in the last display implies that the total variation distance also converges to zero, where
$$\begin{aligned}
d_{TV}(p_{\hat\beta_n, \hat F_n}, p_{\beta_0, F_0})
&= \int\!\!\int \left| \hat S_n(u)^{\exp(\hat\beta_n z)} - S_0(u)^{\exp(\beta_0 z)} \right| dG(u, v \mid z) \, dH(z) \\
&\quad + \int\!\!\int \left| \hat S_n(u)^{\exp(\hat\beta_n z)} - \hat S_n(v)^{\exp(\hat\beta_n z)} - \left( S_0(u)^{\exp(\beta_0 z)} - S_0(v)^{\exp(\beta_0 z)} \right) \right| dG(u, v \mid z) \, dH(z) \\
&\quad + \int\!\!\int \left| \hat S_n(v)^{\exp(\hat\beta_n z)} - S_0(v)^{\exp(\beta_0 z)} \right| dG(u, v \mid z) \, dH(z) \\
\text{(1)} \qquad &\ge \int \left| \hat S_n(t)^{\exp(\hat\beta_n z)} - S_0(t)^{\exp(\beta_0 z)} \right| d\mu(t, z) .
\end{aligned}$$
In this last inequality of the last display we have dropped the middle term and combined the two end terms by defining the measure $\mu$ on $\mathbb{R} \times \mathbb{R}^d$ by
$$\mu(A \times C) = \int_C G(A \times \mathcal{V} \mid z) \, dH(z) + \int_C G(\mathcal{U} \times A \mid z) \, dH(z) = P(U \in A, Z \in C) + P(V \in A, Z \in C)$$
for $A \in \mathcal{B}$, $C \in \mathcal{B}_d$.
We will examine the special case in which $d = 1$ and $Z$ takes on the two values 0 and 1 with probabilities $1 - p$ and $p$ respectively with $p \in (0, 1)$. We will assume, moreover, that $F$ is continuous. In this special case the right side of (1) can be rewritten as
$$\text{(2)} \qquad \int \left| \hat S_n(t)^{\exp(\hat\beta_n z)} - S_0(t)^{\exp(\beta_0 z)} \right| d\mu(t, z) = \int |\hat S_n(t) - S_0(t)| \, d\mu(t, 0) + \int \left| \hat S_n(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\beta_0)} \right| d\mu(t, 1) .$$
Since the left side of (1) converges to zero almost surely, we conclude that $\hat S_n(t) \to_{a.s.} S_0(t)$ for $\mu(\cdot, 0)$ a.e. $t$. If $\mu(\cdot, 1) \ll \mu(\cdot, 0)$, then it follows immediately by dominated convergence that
$$\int \left| \hat S_n(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\hat\beta_n)} \right| d\mu(t, 1) \to_{a.s.} 0 ,$$
and hence also, from (2), that
$$\int \left| S_0(t)^{\exp(\hat\beta_n)} - S_0(t)^{\exp(\beta_0)} \right| d\mu(t, 1) \to_{a.s.} 0 .$$
If $\mu((\text{supp}(S_0))^\circ, 1) > 0$ (where $(\text{supp}(S_0))^\circ$ denotes the interior of the support of the measure corresponding to $S_0$), this implies that $\hat\beta_n \to_{a.s.} \beta_0$.
10.1 Exercises
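Exercise 10.1 Show that for any two probability measures $P$ and $Q$,
$$h^2(P, Q) \le d_{TV}(P, Q) \le \sqrt{2}\, h(P, Q) (1 - (1/2) h^2(P, Q))^{1/2} \le \sqrt{2}\, h(P, Q)$$
where $d_{TV}(P, Q) = (1/2) \int |p - q| \, d\mu = \sup_A |P(A) - Q(A)|$ for any measure $\mu$ dominating both $P$ and $Q$.
Solution. Note that
$$\begin{aligned}
d_{TV}(P, Q) &= \frac{1}{2} \int |p - q| \, d\mu \\
&= \frac{1}{2} \int |(\sqrt{p} - \sqrt{q})(\sqrt{p} + \sqrt{q})| \, d\mu \\
&\le \frac{1}{2} \left\{ \int (\sqrt{p} + \sqrt{q})^2 \, d\mu \right\}^{1/2} \left\{ 2 \cdot \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2 \, d\mu \right\}^{1/2} \\
&= \frac{1}{2} \sqrt{2}\, h(P, Q) \left\{ 2 + 2 \int \sqrt{pq} \, d\mu \right\}^{1/2} \\
&= \frac{1}{2} \sqrt{2}\, h(P, Q) \{2 + 2(1 - h^2(P, Q))\}^{1/2} \\
&= \sqrt{2}\, h(P, Q) \left\{ 1 - \frac{1}{2} h^2(P, Q) \right\}^{1/2} \\
&\le \sqrt{2}\, h(P, Q) .
\end{aligned}$$
The lower bound follows from $|\sqrt{p} - \sqrt{q}| \le \sqrt{p} + \sqrt{q}$, which gives $h^2(P, Q) = \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2 \, d\mu \le \frac{1}{2} \int |\sqrt{p} - \sqrt{q}| (\sqrt{p} + \sqrt{q}) \, d\mu = d_{TV}(P, Q)$. This finishes the proof of the inequalities. Now we want to show that
$$\frac{1}{2} \int |p - q| \, d\mu = \sup_A |P(A) - Q(A)| .$$
We have
$$P(A) - Q(A) = \int_A (p - q) \, d\mu \le \int_{A \cap \{p > q\}} (p - q) \, d\mu \le \int_{\{p > q\}} (p - q) \, d\mu$$
and
$$\begin{aligned}
|P(A) - Q(A)| &\le \int_A |p - q| \, d\mu \\
&= \int_{A \cap \{p > q\}} (p - q) \, d\mu + \int_{A \cap \{q > p\}} (q - p) \, d\mu \\
&\le \int_{\{p > q\}} (p - q) \, d\mu + \int_{\{q > p\}} (q - p) \, d\mu \\
&= \int |p - q| \, d\mu .
\end{aligned}$$
But
$$\int (p - q) \, d\mu = 0$$
and this implies
$$\int_{p > q} (p - q) \, d\mu + \int_{p < q} (p - q) \, d\mu = 0 ,$$
i.e.
$$\int_{p > q} (p - q) \, d\mu = \int_{q > p} (q - p) \, d\mu = \frac{1}{2} \int |p - q| \, d\mu .$$
Therefore
$$\sup_A |P(A) - Q(A)| \le \int |p - q| \, d\mu .$$
But
$$P(A) - Q(A) \le \int_{p > q} (p - q) \, d\mu = \frac{1}{2} \int |p - q| \, d\mu$$
and similarly
$$Q(A) - P(A) \le \int_{q > p} (q - p) \, d\mu = \frac{1}{2} \int |p - q| \, d\mu .$$
This implies
$$|P(A) - Q(A)| \le \frac{1}{2} \int |p - q| \, d\mu = d_{TV}(P, Q) \qquad \forall A$$
and
$$\sup_A |P(A) - Q(A)| \le d_{TV}(P, Q)$$
with equality when $A = \{p > q\}$ or $\{q > p\}$. □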
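The chain of inequalities in Exercise 10.1, and the fact that the supremum is attained at $A = \{p > q\}$, can be sanity-checked on random discrete distributions. The following Python sketch is illustrative only; the helper names, seed, and number of trials are assumptions.

```python
import math
import random

def random_density(k, rng):
    """A strictly positive probability vector on k points."""
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def hellinger(p, q):
    """h(P, Q) with h^2 = (1/2) * sum (sqrt(p_i) - sqrt(q_i))^2."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def total_variation(p, q):
    """d_TV(P, Q) = (1/2) * sum |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def diff_on_p_gt_q(p, q):
    """P(A) - Q(A) evaluated at the set A = {p > q}."""
    return sum(a - b for a, b in zip(p, q) if a > b)

rng = random.Random(0)
tol = 1e-12
checks = []
for _ in range(1000):
    p, q = random_density(6, rng), random_density(6, rng)
    h, d = hellinger(p, q), total_variation(p, q)
    upper = math.sqrt(2.0) * h * math.sqrt(1.0 - 0.5 * h * h)
    checks.append(
        h * h <= d + tol
        and d <= upper + tol
        and upper <= math.sqrt(2.0) * h + tol
        and abs(d - diff_on_p_gt_q(p, q)) < tol
    )
all_ok = all(checks)
print(all_ok)
```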
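Exercise 10.2 Show that for any two probability measures $P$ and $Q$, the Kullback-Leibler "distance" $K(P, Q) = P(\log(p/q))$ satisfies
$$K(P, Q) \ge 2 h^2(P, Q) \ge 0 .$$
Solution. For any two probability measures $P$ and $Q$
$$K(P, Q) = P \log \frac{p}{q} \ge 2 h^2(P, Q) \ge 0$$
i.e.
$$\begin{aligned}
\int \log\left( \frac{p(x)}{q(x)} \right) p(x) \, d\mu(x) &\ge \int (p^{1/2}(x) - q^{1/2}(x))^2 \, d\mu(x) \\
&= \int \left( p(x) + q(x) - 2 \sqrt{p(x) q(x)} \right) d\mu(x) \\
&= 2 - 2 \int p^{1/2}(x) q^{1/2}(x) \, d\mu(x)
\end{aligned}$$
$$\Leftrightarrow \quad \frac{1}{2} \int \log\left( \frac{p(x)}{q(x)} \right) p(x) \, d\mu(x) \ge 1 - \int p^{1/2}(x) q^{1/2}(x) \, d\mu(x) = 1 - \int \frac{q^{1/2}(x)}{p^{1/2}(x)}\, p(x) \, d\mu(x)$$
$$\Leftrightarrow \quad -\frac{1}{2} \int \log\left( \frac{q(x)}{p(x)} \right) p(x) \, d\mu(x) \ge 1 - \int \frac{q^{1/2}(x)}{p^{1/2}(x)}\, p(x) \, d\mu(x)$$
$$\Leftrightarrow \quad \frac{1}{2} \int \log\left( \frac{q(x)}{p(x)} \right) p(x) \, d\mu(x) \le \int \left( \frac{q^{1/2}(x)}{p^{1/2}(x)} - 1 \right) p(x) \, d\mu(x) .$$
But this holds since
$$\frac{1}{2} \log v \le v^{1/2} - 1 \qquad \forall v > 0$$
(where we choose $v = q(x)/p(x)$). Moreover
$$\begin{aligned}
\frac{1}{2} \log v \le v^{1/2} - 1 \quad &\forall v > 0 \\
\Leftrightarrow \quad v^{1/2} - 1 - \frac{1}{2} \log v \ge 0 \quad &\forall v > 0 \\
\Leftrightarrow \quad e^{\frac{1}{2} \log v} - 1 - \frac{1}{2} \log v \ge 0 \quad &\forall v > 0 ,
\end{aligned}$$
and the last inequality is $e^x \ge 1 + x$ with $x = \frac{1}{2} \log v$. □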
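A small numerical check of $K(P, Q) \ge 2 h^2(P, Q) \ge 0$ on random discrete distributions is sketched below. The distributions are taken strictly positive so that the logarithm is always defined; this restriction, the seed, and the helper names are assumptions made for illustration.

```python
import math
import random

def random_density(k, rng):
    """A strictly positive probability vector on k points."""
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def kullback_leibler(p, q):
    """K(P, Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """h^2(P, Q) = (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

rng = random.Random(0)
pairs = [(random_density(6, rng), random_density(6, rng)) for _ in range(1000)]
kl_ok = all(kullback_leibler(p, q) >= 2.0 * hellinger_sq(p, q) - 1e-12 for p, q in pairs)
print(kl_ok)
```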
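Exercise 10.3 Show that for any nonnegative numbers $p$ and $q$,
$$|(2p)^{1/2} - (p + q)^{1/2}| \le |p^{1/2} - q^{1/2}| \le (1 + \sqrt{2}) |(2p)^{1/2} - (p + q)^{1/2}| .$$
This implies that for measures $P$ and $Q$ the Hellinger distances $h(P, Q)$ and $h(P, (P + Q)/2)$ satisfy
$$2 h^2(P, (P + Q)/2) \le h^2(P, Q) \le 2 (1 + \sqrt{2})^2 h^2(P, (P + Q)/2) \le 12 h^2(P, (P + Q)/2) .$$
(Hint: To prove the first inequalities, prove them first for $p = 0$. In the second case of $p \ne 0$, divide through by $p$ and rewrite the inequalities in terms of $r = q/p$, then (for the inequality on the right) consider the cases $r \ge 1$ and $0 < r \le 1$.)
Solution. For any non-negative numbers $p$ and $q$
$$|(2p)^{1/2} - (p + q)^{1/2}| \le |p^{1/2} - q^{1/2}| \le (1 + \sqrt{2}) |(2p)^{1/2} - (p + q)^{1/2}| .$$
This implies that for measures $P$ and $Q$
$$2 h^2\!\left( P, \frac{P + Q}{2} \right) \le h^2(P, Q) \le 2 (1 + \sqrt{2})^2 h^2\!\left( P, \frac{P + Q}{2} \right) \le 12 h^2\!\left( P, \frac{P + Q}{2} \right) .$$
Indeed,
$$\begin{aligned}
2 h^2\!\left( P, \frac{P + Q}{2} \right) &= \int \left( p^{1/2}(x) - \left( \frac{p + q}{2} \right)^{1/2}(x) \right)^2 d\mu(x) \\
&= \frac{1}{2} \int \left( (2p(x))^{1/2} - (p(x) + q(x))^{1/2} \right)^2 d\mu(x) \\
&\le \frac{1}{2} \int (p^{1/2}(x) - q^{1/2}(x))^2 \, d\mu(x) \\
&= h^2(P, Q) .
\end{aligned}$$
Similarly, the other inequalities involving Hellinger distances can be established.
Now we want to show the pointwise inequalities:
$$|p^{1/2} - q^{1/2}| \le (1 + \sqrt{2}) |(2p)^{1/2} - (p + q)^{1/2}| .$$
We consider the case $p \ge q$. We have to show that:
$$p^{1/2} - q^{1/2} \le (1 + \sqrt{2}) ((2p)^{1/2} - (p + q)^{1/2})$$
i.e.
$$p^{1/2} \left[ 1 - \left( \frac{q}{p} \right)^{1/2} \right] \le (\sqrt{2} + 1)\, p^{1/2} \left[ 2^{1/2} - \left( 1 + \frac{q}{p} \right)^{1/2} \right] .$$
It is sufficient to show that
$$1 - \left( \frac{q}{p} \right)^{1/2} \le (\sqrt{2} + 1) \left[ 2^{1/2} - \left( 1 + \frac{q}{p} \right)^{1/2} \right]$$
i.e.
$$\frac{1}{\sqrt{2} + 1} \le \frac{2^{1/2} - \left( 1 + \frac{q}{p} \right)^{1/2}}{1 - \left( \frac{q}{p} \right)^{1/2}} , \qquad 0 \le \frac{q}{p} \le 1 .$$
We ask whether
$$\frac{1}{\sqrt{2} + 1} \le \frac{\sqrt{2} - (1 + x)^{1/2}}{1 - x^{1/2}} \qquad \forall\, 0 \le x \le 1 ,$$
that is
$$\frac{\sqrt{2} - (1 + x)^{1/2}}{1 - x^{1/2}} \ge \sqrt{2} - 1 ,$$
i.e.
$$\sqrt{2} - \sqrt{1 + x} \ge \sqrt{2} - \sqrt{2} \sqrt{x} - 1 + \sqrt{x} ,$$
and simplifying
$$\sqrt{2} \sqrt{x} + 1 - (\sqrt{x + 1} + \sqrt{x}) \ge 0 .$$
Let
$$\sqrt{2} \sqrt{x} + 1 - (\sqrt{x + 1} + \sqrt{x}) = \xi(x) ,$$
then $\xi(0) = 0 = \xi(1)$ and
$$\begin{aligned}
\xi'(x) &= \frac{1}{2 \sqrt{x}} \sqrt{2} - \frac{1}{2 \sqrt{x}} - \frac{1}{2 \sqrt{1 + x}} \\
&= \frac{1}{\sqrt{2}} \frac{1}{\sqrt{x}} - \frac{1}{2 \sqrt{x}} - \frac{1}{2 \sqrt{1 + x}} \\
&= \frac{1}{\sqrt{x}} \left[ \frac{1}{\sqrt{2}} - \frac{1}{2} - \frac{1}{2} \sqrt{\frac{x}{1 + x}} \right] .
\end{aligned}$$
Note that $\xi'(x)$ is greater than, equal to, or less than zero according as $\frac{1}{\sqrt{2}} - \frac{1}{2}$ is greater than, equal to, or less than $\frac{1}{2} \sqrt{\frac{x}{1 + x}}$, and this means that $x$ is less than, equal to, or greater than $x^*$, where $x^*$ is the point at which equality holds, because $\sqrt{\frac{x}{1 + x}}$ is increasing in $x$. This shows that $\xi(x) \ge 0$ for all $x \in [0, 1]$, with a maximum at $x^*$.
Now we show that
$$|(2p)^{1/2} - (p + q)^{1/2}| \le |p^{1/2} - q^{1/2}| .$$
Once again consider the case $p > q$. We have to show that
$$(2p)^{1/2} - (p + q)^{1/2} \le p^{1/2} - q^{1/2} ,$$
i.e.
$$(2p)^{1/2} + q^{1/2} \le p^{1/2} + (p + q)^{1/2} .$$
Equivalently, we have to show that
$$p^{1/2} \left[ \sqrt{2} - \left( 1 + \frac{q}{p} \right)^{1/2} \right] \le \left[ 1 - \left( \frac{q}{p} \right)^{1/2} \right] p^{1/2} ,$$
that is
$$\sqrt{2} - \left( 1 + \frac{q}{p} \right)^{1/2} \le 1 - \left( \frac{q}{p} \right)^{1/2} ,$$
i.e.
$$\sqrt{2} - 1 \le \left( 1 + \frac{q}{p} \right)^{1/2} - \left( \frac{q}{p} \right)^{1/2} , \qquad 0 \le \frac{q}{p} \le 1 .$$
It is sufficient to show that
$$\sqrt{2} - 1 \le (1 + x)^{1/2} - x^{1/2} .$$
Let $\Psi(x) = (1 + x)^{1/2} - x^{1/2}$, then $\Psi(0) = 1$, $\Psi(1) = \sqrt{2} - 1$ and
$$\Psi'(x) = \frac{1}{2 \sqrt{1 + x}} - \frac{1}{2 \sqrt{x}} \le 0 ,$$
therefore
$$\Psi(1) \le \Psi(x) \qquad \forall x \in [0, 1] .$$
This completes the proof. □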
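The pointwise inequalities of Exercise 10.3 can be checked on a grid of nonnegative values, including the case $q > p$ that the written solution leaves to the reader. The grid, tolerance, and function names below are illustrative assumptions.

```python
import math

def lower_term(p, q):
    """|(2p)^(1/2) - (p + q)^(1/2)|."""
    return abs(math.sqrt(2.0 * p) - math.sqrt(p + q))

def middle_term(p, q):
    """|p^(1/2) - q^(1/2)|."""
    return abs(math.sqrt(p) - math.sqrt(q))

c = 1.0 + math.sqrt(2.0)
grid = [0.05 * i for i in range(101)]   # values 0.0, 0.05, ..., 5.0
tol = 1e-12                             # rounding slack; equality occurs at q = 0

pointwise_ok = all(
    lower_term(p, q) <= middle_term(p, q) + tol
    and middle_term(p, q) <= c * lower_term(p, q) + tol
    for p in grid
    for q in grid
)
print(pointwise_ok)
```

The tolerance matters because the right-hand inequality holds with equality when $q = 0$, where $(1 + \sqrt{2})(\sqrt{2} - 1) = 1$.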
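Exercise 10.4 We will say that $\theta_0$ is identifiable for the metric $\tau$ on $\bar\Theta \supset \Theta$ if for all $\theta \in \bar\Theta$, $h(p_\theta, p_{\theta_0}) = 0$ implies that $\tau(\theta, \theta_0) = 0$. Prove the following claim: Suppose that $\Theta \subset \bar\Theta$ where $(\bar\Theta, \tau)$ is a compact metric space. Suppose that $\theta \to p_\theta$ is $\mu$-almost everywhere continuous and that $\theta_0$ is identifiable for $\tau$. Then $h(p_{\theta_n}, p_{\theta_0}) \to 0$ implies that $\tau(\theta_n, \theta_0) \to 0$. (Hint: See Van de Geer (1993), page 37.)
Solution. $\bar\Theta$ is compact for the metric $\tau$. Moreover $\Theta \subseteq \bar\Theta$, $\theta \to p_\theta$ is $\mu$-almost everywhere continuous, and $\theta_0$ is identifiable for $\tau$. We have to show that
$$h(p_{\theta_n}, p_{\theta_0}) \to 0 \quad \Rightarrow \quad \tau(\theta_n, \theta_0) \to 0 .$$
Consider the sequence $\{\theta_n\}$. Given any subsequence $\{\theta_{n'}\}$ of $\{\theta_n\}$, we can find a further subsequence $\{\theta_{n''}\}$ that converges to some $\theta \in \bar\Theta$ (by compactness). Then
$$p(\theta_{n''}, x) \to p(\theta, x)$$
for $\mu$-almost every $x$. By Scheffé's theorem,
$$\frac{1}{2} \int |p(\theta_{n''}, x) - p(\theta, x)| \, d\mu(x) \to 0$$
i.e.
$$d_{TV}(p(\theta_{n''}, \cdot), p(\theta, \cdot)) \to 0 .$$
This implies that
$$h^2(p(\theta_{n''}, \cdot), p(\theta, \cdot)) \to 0 .$$
But
$$h^2(p(\theta_{n''}, \cdot), p(\theta_0, \cdot)) \to 0 .$$
Let $s_{n''}(\cdot) = p(\theta_{n''}, \cdot)^{1/2}$, $s_0(\cdot) = p(\theta_0, \cdot)^{1/2}$ and $s(\cdot) = p(\theta, \cdot)^{1/2}$, then
$$h(p_{\theta_{n''}}, p_{\theta_0}) = \|s_{n''} - s_0\|_{L_2(\mu)} , \qquad h(p_{\theta_{n''}}, p_\theta) = \|s_{n''} - s\|_{L_2(\mu)}$$
(ignoring the factor of 1/2). Now
$$\|s_0 - s\|_{L_2(\mu)} \le \|s_{n''} - s_0\|_{L_2(\mu)} + \|s_{n''} - s\|_{L_2(\mu)} < \varepsilon ,$$
eventually for any pre-fixed $\varepsilon$. Therefore
$$\|s_0 - s\|_{L_2(\mu)} = 0$$
i.e.
$$h(p_\theta, p_{\theta_0}) = 0 .$$
This implies that
$$\tau(\theta, \theta_0) = 0 .$$
But then
$$\tau(\theta_{n''}, \theta_0) \to 0 .$$
Since every subsequence of $\{\theta_n\}$ has a further subsequence along which $\tau(\cdot, \theta_0) \to 0$, this shows that
$$\tau(\theta_n, \theta_0) \to 0 ,$$
completing the proof. □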
Chapter 11
M -Estimators: the Argmax
Continuous Mapping Theorem
We begin this chapter by recalling the definition of M-estimators. Suppose we are interested in a parameter $\theta$ attached to the distribution of i.i.d. observations $X_1, \ldots, X_n$. A popular method for finding an estimator $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ is to maximize a certain criterion function of the type
$$\theta \mapsto \mathbb{P}_n m_\theta = \frac{1}{n} \sum_{i=1}^{n} m_\theta(X_i) .$$
An estimator maximizing $\mathbb{P}_n m_\theta$ over $\Theta$ is called an M-estimator.
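As a concrete illustration of this definition, the Python sketch below maximizes $\mathbb{P}_n m_\theta$ over a finite grid of candidate values for two choices of $m_\theta$: $m_\theta(x) = -(x - \theta)^2$, whose maximizer is the sample mean, and $m_\theta(x) = -|x - \theta|$, whose maximizers are sample medians. The simulated Gaussian data, the grid, and the function names are assumptions chosen for illustration, and grid search is used only for simplicity.

```python
import random
import statistics

def criterion(theta, xs, m):
    """Empirical criterion P_n m_theta = (1/n) * sum_i m(theta, x_i)."""
    return sum(m(theta, x) for x in xs) / len(xs)

def m_estimate(xs, m, grid):
    """Maximize the empirical criterion over a finite grid of candidate thetas."""
    return max(grid, key=lambda theta: criterion(theta, xs, m))

rng = random.Random(1)
xs = [rng.gauss(2.0, 1.0) for _ in range(500)]   # hypothetical sample
grid = [i / 100 for i in range(0, 401)]          # candidate thetas in [0, 4]

theta_mean = m_estimate(xs, lambda th, x: -(x - th) ** 2, grid)
theta_median = m_estimate(xs, lambda th, x: -abs(x - th), grid)

print(theta_mean, statistics.mean(xs))
print(theta_median, statistics.median(xs))
```

Up to the grid resolution, the two maximizers agree with the sample mean and a sample median respectively.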
Suppose now that $M_n$ and $M$ are stochastic processes indexed by a metric space $H$; typically $M_n(h) = \mathbb{P}_n m_h$ for a collection of real-valued functions $m_h$ defined on the sample space and $M$ is either a deterministic function (such as $M(h) = P(m_h)$) or a (limiting) stochastic process. Let the "estimators" $\hat h_n$ and $\hat h$ be points of (near) maximum of the "criterion functions" $M_n(h)$ and $M(h)$ respectively. We suppose that
$$\hat h_n = \arg\max_h M_n(h) \quad \text{and} \quad \hat h = \arg\max_h M(h)$$
are well defined. In the most basic version of this set-up we frequently begin by thinking of $M_n(\theta) = \mathbb{P}_n \log p_\theta$ and $M(\theta) = P_0 \log p_\theta$ for $\theta \in \Theta$; i.e. $m_\theta = \log p_\theta$.
In this chapter we want to study the conditions under which the convergence in distribution of the criterion functions implies the convergence in distribution of their points of maximum, the sequence of M-estimators $\hat h_n$, to the point of maximum $\hat h$ of the limit criterion function.
To this aim we begin by presenting the following lemma:
Lemma 11.1 Suppose that $A, B \subset H$. Assume that $\hat h \in H$ satisfies
$$M(\hat h) > \sup_{h \notin G,\, h \in A} M(h) = \sup_{h \in G^c \cap A} M(h)$$
for all open $G$ with $\hat h \in G$. Suppose that $\hat h_n$ satisfies
$$M_n(\hat h_n) \ge \sup_h M_n(h) - o_p(1) .$$
If $M_n \Rightarrow M$ in $\ell^\infty(A \cup B)$, then for every closed set $F$
$$\limsup_{n \to \infty} P^*(\hat h_n \in F \cap A) \le P(\hat h \in F \cup B^c) .$$
If $M_n \Rightarrow M$ in $\ell^\infty(H)$, then we can take $A = B = H$ to conclude that, by the portmanteau theorem for weak convergence, $\hat h_n \Rightarrow \hat h$.
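The following theorem follows from Lemma 11.1.
Theorem 11.1 (argmax continuous mapping) Suppose that $M_n \Rightarrow M$ in $\ell^\infty(K)$ for every compact $K \subset H$. Suppose that $h \to M(h)$ is upper semicontinuous and has a unique point of maximum $\hat h$. Suppose, moreover, that $M_n(\hat h_n) \ge \sup_h M_n(h) - o_p(1)$, and $\hat h_n$ is tight (in $H$). Then
$$\hat h_n \Rightarrow \hat h \quad \text{in } H .$$
Proof (of Lemma 11.1). Suppose that $F$ is closed. By the continuous mapping theorem it follows that
$$\sup_{h \in F \cap A} M_n(h) - \sup_{h \in B} M_n(h) \Rightarrow \sup_{h \in F \cap A} M(h) - \sup_{h \in B} M(h) .$$
Now
$$\{\hat h_n \in F \cap A\} = \left[ \{\hat h_n \in F \cap A\} \cap \left\{ \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \right\} \right] \cup \left[ \{\hat h_n \in F \cap A\} \cap \left\{ \sup_{F \cap A} M_n < \sup_B M_n - o_p(1) \right\} \right] ,$$
where the second event implies
$$M_n(\hat h_n) \le \sup_{F \cap A} M_n \le \sup_B M_n - o_p(1) \le \sup_H M_n - o_p(1)$$
and hence is empty in view of the hypothesis. Hence
$$\{\hat h_n \in F \cap A\} \subset \left\{ \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \right\} ,$$
and it follows that
$$\begin{aligned}
\limsup_{n \to \infty} P(\hat h_n \in F \cap A) &\le \limsup_{n \to \infty} P\left( \sup_{F \cap A} M_n \ge \sup_B M_n - o_p(1) \right) \\
&\le P\left( \sup_{F \cap A} M \ge \sup_B M \right) \\
&\le P(\hat h \in F \cup B^c) ;
\end{aligned}$$
to see the last inequality, note that
$$\begin{aligned}
\{\hat h \in F \cup B^c\}^c &= \{\hat h \in F^c\} \cap \{\hat h \in B\} \\
&= \left[ \{\hat h \in F^c \cap B\} \cap \left\{ \sup_{F \cap A} M < \sup_B M \right\} \right] \cup \left[ \{\hat h \in F^c \cap B\} \cap \left\{ \sup_{F \cap A} M \ge \sup_B M \right\} \right] \\
&\subset \left\{ \sup_{F \cap A} M < \sup_B M \right\} \cup \left\{ M(\hat h) > \sup_{F \cap A} M \ge \sup_B M \ge M(\hat h) \right\} \\
&= \left\{ \sup_{F \cap A} M < \sup_B M \right\} \cup \emptyset .
\end{aligned}$$
□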
Proof. (of Theorem 11.1) Take $A = B = K$ in Lemma 11.1, where $K \subset H$ is compact. Then, for every open $G$ with $\hat h \in G$,
\[
\mathbb{M}(\hat h) > \sup_{h \in G^c \cap K} \mathbb{M}(h).
\]
(If not, then there is a sequence $\{h_m\} \subset G^c \cap K$, which is a compact set, satisfying $\mathbb{M}(h_m) \to \mathbb{M}(\hat h)$. But we can choose a subsequence (call it $h_m$ again) with $h_m \to h \in G^c \cap K$ since $K$ is compact, and then
\[
\mathbb{M}(\hat h) = \lim_{m} \mathbb{M}(h_m) \le \mathbb{M}(h)
\]
by upper semicontinuity of $\mathbb{M}$, and this implies that there is another maximizer. But this contradicts our uniqueness hypothesis.) By Lemma 11.1 with $A = B = K$,
\begin{align*}
\limsup_{n \to \infty} P(\hat h_n \in F) &\le \limsup_{n \to \infty} P(\hat h_n \in F \cap K) + \limsup_{n \to \infty} P(\hat h_n \in K^c) \\
&\le P(\hat h \in F \cup K^c) + \limsup_{n \to \infty} P(\hat h_n \in K^c) \\
&\le P(\hat h \in F) + P(\hat h \in K^c) + \limsup_{n \to \infty} P(\hat h_n \in K^c),
\end{align*}
where the second and third terms can be made arbitrarily small by choice of $K$. Hence, we conclude that
\[
\limsup_{n \to \infty} P(\hat h_n \in F) \le P(\hat h \in F),
\]
and we conclude from the portmanteau theorem that $\hat h_n \Rightarrow \hat h$ in $H$.
□
We will use this theorem in two different ways:
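A. First scenario (the results are applied to the original parameter): $H = \Theta$, $\mathbb{M}_n(\theta) = \mathbb{P}_n m_\theta = \int m_\theta(x)\,d\mathbb{P}_n(x)$, and $\mathbb{M}(\theta) = P_0 m_\theta = \int m_\theta(x)\,dP_0(x)$ is deterministic. Here $\hat h_n = \hat\theta_n$ and $\hat h = \theta_0$, and often $m_\theta(x) = \log p_\theta(x)$ for $x \in \mathcal{X}$, $\theta \in \Theta$.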
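B. Second scenario (the results are applied to a local parameter): $H = \dot\Theta(\theta_0)$, $\tilde{\mathbb{M}}_n(h) = s_n\big(\mathbb{M}_n(\theta_0 + r_n^{-1} h) - \mathbb{M}_n(\theta_0)\big)$ for some sequences $r_n \to \infty$ and $s_n \to \infty$ (often $s_n = r_n^2$), and $\tilde{\mathbb{M}}(h)$ is random. In this case $\hat h_n = r_n(\hat\theta_n - \theta_0)$, and $\hat h = \arg\max \tilde{\mathbb{M}}(h)$ is also random. For $\dot\Theta(\theta_0)$ we can often take the collection $\{h : \theta_t - \theta_0 - th = o(t) \text{ for some } \{\theta_t\} \subset \Theta\}$.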
Applied in the set-up of our first scenario, where the limit criterion function is typically nonrandom, this theorem yields results on consistency.
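Corollary 11.1 (consistency) Suppose that $\mathbb{M}_n$ are stochastic processes indexed by $\Theta$ and suppose that $\mathbb{M} : \Theta \to \mathbb{R}$ is deterministic.
A. Suppose that:
(i) $\|\mathbb{M}_n - \mathbb{M}\|_\Theta \to_p 0$.
(ii) There exists $\theta_0 \in \Theta$ such that $\mathbb{M}(\theta_0) > \sup_{\theta \notin G} \mathbb{M}(\theta)$ for all open $G$ with $\theta_0 \in G$.
Then any sequence $\hat\theta_n$ with $\mathbb{M}_n(\hat\theta_n) \ge \sup_{\theta \in \Theta} \mathbb{M}_n(\theta) - o_p(1)$ satisfies $\hat\theta_n \to_p \theta_0$.
B. Suppose that $\|\mathbb{M}_n - \mathbb{M}\|_K \to_p 0$ for all compact $K \subset \Theta$, and that the map $\theta \mapsto \mathbb{M}(\theta)$ is upper semicontinuous with a unique maximum at $\theta_0$. Suppose that $\{\hat\theta_n\}$ is tight and nearly maximizes $\mathbb{M}_n$ as in part A. Then $\hat\theta_n \to_p \theta_0$.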
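Proof. Part A follows from Lemma 11.1 with $A = B = \Theta$, and part B follows immediately from Theorem 11.1. In both cases the limit $\hat h = \theta_0$ is a constant, so convergence in distribution is equivalent to convergence in probability.
□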
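As an informal illustration of Corollary 11.1, the following sketch simulates the first scenario under assumptions that are not part of the text: a Gaussian location model with $m_\theta(x) = -(x - \theta)^2/2$, a hypothetical true value `theta0 = 1.0`, and a finite grid standing in for $\Theta$. All function names are illustrative. The grid maximizer is a near-maximizer of $\mathbb{M}_n$, and its distance from $\theta_0$ should tend to shrink as $n$ grows.

```python
import random


def criterion(theta, xs):
    """M_n(theta) = P_n m_theta with m_theta(x) = -(x - theta)^2 / 2 (assumed model)."""
    return -sum((x - theta) ** 2 for x in xs) / (2 * len(xs))


def grid_argmax(xs, lo=-5.0, hi=5.0, steps=401):
    """Maximize M_n over an evenly spaced grid on [lo, hi]; a near-maximizer of M_n."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: criterion(t, xs))


def estimation_errors(theta0=1.0, sizes=(20, 200, 2000), seed=0):
    """Return |theta_hat_n - theta0| for each sample size n (independent samples)."""
    rng = random.Random(seed)
    errors = {}
    for n in sizes:
        xs = [rng.gauss(theta0, 1.0) for _ in range(n)]
        errors[n] = abs(grid_argmax(xs) - theta0)
    return errors


if __name__ == "__main__":
    for n, err in estimation_errors().items():
        print(n, round(err, 4))
```

The grid limits the attainable accuracy to half the grid spacing, which plays the role of the $o_p(1)$ slack in the near-maximization condition only if the grid is refined as $n$ grows; here it is kept fixed for simplicity.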
Suppose that an estimator $\hat\theta_n$ maximizes the criterion function $\theta \mapsto \mathbb{M}_n(\theta)$. In obtaining a limit distribution of a sequence of M-estimators, Theorem 11.1 is usually not applied to the original criterion functions $\theta \mapsto \mathbb{M}_n(\theta)$, but to a rescaled and “localized” criterion function of the form
\[
\tilde{\mathbb{M}}_n(h) = s_n \Big( \mathbb{M}_n\Big(\theta_0 + \frac{h}{r_n}\Big) - \mathbb{M}_n(\theta_0) \Big) \tag{1}
\]
where $\theta_0$ is the “true” value of $\theta$, $r_n \to \infty$ is the “rate of convergence” of the estimator and $s_n \to \infty$. If this new sequence of processes converges weakly, then Theorem 11.1 will yield a limit theorem for $\hat h_n = r_n(\hat\theta_n - \theta_0)$. Thus we will typically proceed in three steps in studying the limiting distribution of M-estimators $\hat\theta_n$ of Euclidean parameters.
Step 1: Prove that $\hat\theta_n$ is consistent: $\hat\theta_n \to_p \theta_0$;
Step 2: Establish a rate of convergence $r_n$ of the sequence $\hat\theta_n$, or equivalently, show that the sequence of “local estimators” $\hat h_n = r_n(\hat\theta_n - \theta_0)$ is tight;
Step 3: Show that an appropriate localized criterion function $\tilde{\mathbb{M}}_n(h)$ as in (1) converges in distribution (i.e. weakly) to a limit process $\tilde{\mathbb{M}}$ in $\ell^\infty(\{h : \|h\| \le K\})$ for every $K$. If the limit process $\tilde{\mathbb{M}}$ has sample functions which are upper semicontinuous with a unique maximum $\hat h$, then the final conclusion is that the sequence $r_n(\hat\theta_n - \theta_0) \Rightarrow \hat h$.
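The relation between the localized criterion and the local estimator in Steps 2 and 3 can be checked numerically. The sketch below again assumes a Gaussian location model with $m_\theta(x) = -(x - \theta)^2/2$, $r_n = \sqrt{n}$ and $s_n = n$; these choices, the value `theta0 = 0.5`, and all function names are illustrative and not taken from the text. In this special model the localized criterion equals $h\Delta_n - h^2/2$ exactly, with $\Delta_n = \sqrt{n}(\bar X_n - \theta_0)$, so its maximizer is $\hat h_n = \sqrt{n}(\hat\theta_n - \theta_0)$.

```python
import math
import random


def m_n(theta, xs):
    """Criterion M_n(theta) = P_n m_theta with m_theta(x) = -(x - theta)^2 / 2."""
    return -sum((x - theta) ** 2 for x in xs) / (2 * len(xs))


def localized(h, xs, theta0):
    """Localized criterion s_n (M_n(theta0 + h / r_n) - M_n(theta0)), r_n = sqrt(n), s_n = n."""
    n = len(xs)
    return n * (m_n(theta0 + h / math.sqrt(n), xs) - m_n(theta0, xs))


def local_estimator(xs, theta0):
    """h_hat_n = sqrt(n) (theta_hat_n - theta0); here theta_hat_n is the sample mean."""
    n = len(xs)
    return math.sqrt(n) * (sum(xs) / n - theta0)


def grid_local_argmax(xs, theta0, radius=4.0, steps=801):
    """Maximize the localized criterion over a grid on the ball {|h| <= radius}."""
    grid = [-radius + 2 * radius * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda h: localized(h, xs, theta0))


rng = random.Random(1)
theta0 = 0.5
xs = [rng.gauss(theta0, 1.0) for _ in range(400)]
h_hat = local_estimator(xs, theta0)
h_grid = grid_local_argmax(xs, theta0)
print("h_hat from the estimator:", h_hat)
print("argmax of the localized criterion on the grid:", h_grid)
```

Restricting the grid to a ball of radius `radius` mirrors the role of the compact sets $\{h : \|h\| \le K\}$ in Step 3: the two printed values agree up to the grid spacing whenever $\hat h_n$ falls inside the ball.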
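Example 11.1 (Parametric maximum likelihood) Suppose that we observe $X_1, \ldots, X_n$ from a density $p_\theta$ where $\theta \in \Theta \subset \mathbb{R}^d$. Then the maximum likelihood estimator $\hat\theta_n$ (assuming that it exists and is unique) satisfies $\mathbb{M}_n(\hat\theta_n) = \sup_{\theta \in \Theta} \mathbb{M}_n(\theta)$ where $\mathbb{M}_n(\theta) = n^{-1} \sum_{i=1}^n \log p_\theta(X_i) = \mathbb{P}_n m_\theta(X)$ with $m_\theta(x) = \log p_\theta(x)$. If $p_\theta$ is smooth enough as a function of $\theta$, then the sequence of local log-likelihood ratios is locally asymptotically normal: under $P_0 = P_{\theta_0}$
\begin{align*}
n \big( \mathbb{M}_n(\theta_0 + n^{-1/2} h) - \mathbb{M}_n(\theta_0) \big) &= \sum_{i=1}^n \log \frac{p_{\theta_0 + h/\sqrt{n}}}{p_{\theta_0}}(X_i) \\
&= h^\top \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot l_{\theta_0}(X_i) - \frac{1}{2} h^\top I(\theta_0) h + o_{P_0}(1),
\end{align*}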
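where $\dot l_\theta$ is the score function for the model (usually $\nabla_\theta \log p_\theta$), and $I(\theta_0)$ is the Fisher information matrix. The finite-dimensional distributions of the stochastic processes on the right side of the display converge in law to the finite-dimensional laws of the Gaussian process, that is
\[
n \big( \mathbb{M}_n(\theta_0 + n^{-1/2} h) - \mathbb{M}_n(\theta_0) \big) \to_d \mathbb{M}(h) = h^\top \Delta - \frac{1}{2} h^\top I(\theta_0) h,
\]
where $\Delta \sim N_d(0, I(\theta_0))$. Thus, for each fixed $h$, $\mathbb{M}(h) \sim N\big({-\tfrac{1}{2}} h^\top I(\theta_0) h,\ h^\top I(\theta_0) h\big)$. If $\theta_0$ is an interior point of $\Theta$, then the sequence $\hat h_n$ typically converges in distribution to the maximizer $\hat h$ of this process over all $h \in \mathbb{R}^d$, that is:
\[
\hat h_n = \sqrt{n}(\hat\theta_n - \theta_0) = \arg\max_{h} \{\mathbb{M}_n(\theta_0 + n^{-1/2} h) - \mathbb{M}_n(\theta_0)\} \to_d \hat h = \arg\max_{h \in \mathbb{R}^d} \{\mathbb{M}(h)\}.
\]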
Assuming that $I(\theta_0)$ is invertible, we can write
\[
\mathbb{M}(h) = -\frac{1}{2} \big(h - I^{-1}(\theta_0)\Delta\big)^\top I(\theta_0) \big(h - I^{-1}(\theta_0)\Delta\big) + \frac{1}{2} \Delta^\top I^{-1}(\theta_0) \Delta,
\]
and it follows that $\mathbb{M}$ is maximized by $\hat h = I^{-1}(\theta_0)\Delta \sim N_d(0, I^{-1}(\theta_0))$ with maximum value $\frac{1}{2} \Delta^\top I^{-1}(\theta_0) \Delta$. If we could strengthen the finite-dimensional convergence indicated above to convergence as a process in $\ell^\infty(\{h : \|h\| \le K\})$, then the above arguments would yield
\[
\hat h_n = \sqrt{n}(\hat\theta_n - \theta_0) \to_d \hat h \sim N_d(0, I^{-1}(\theta_0)).
\]
We will take this approach in Chapters 12 and 13.
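The limit law $N_d(0, I^{-1}(\theta_0))$ can be checked informally by simulation. The sketch below assumes a one-dimensional exponential model $p_\theta(x) = \theta e^{-\theta x}$, which is not discussed in the text; for it the maximum likelihood estimator is $1/\bar X_n$ and $I(\theta) = 1/\theta^2$, so the limiting variance of $\hat h_n$ should be $\theta_0^2$. The values `theta0 = 2.0`, `n = 200`, `reps = 1000` and the function names are arbitrary illustrative choices.

```python
import math
import random
import statistics


def mle_rate(xs):
    """MLE of the rate theta in the exponential model p_theta(x) = theta * exp(-theta * x)."""
    return len(xs) / sum(xs)


def simulate_h_hat(theta0=2.0, n=200, reps=1000, seed=2):
    """Draw reps independent copies of h_hat_n = sqrt(n) (theta_hat_n - theta0)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        xs = [rng.expovariate(theta0) for _ in range(n)]
        draws.append(math.sqrt(n) * (mle_rate(xs) - theta0))
    return draws


draws = simulate_h_hat()
empirical_var = statistics.variance(draws)
limit_var = 2.0 ** 2  # I^{-1}(theta0) = theta0^2 for theta0 = 2.0
print("empirical variance of h_hat_n:", empirical_var)
print("limiting variance I^{-1}(theta0):", limit_var)
```

The two printed numbers should be close but not equal: the simulation has Monte Carlo error, and for finite $n$ the estimator $1/\bar X_n$ is slightly biased upward.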
The classical results on asymptotic normality of maximum likelihood estimators make the convergence in the last display rigorous by imposing rather strong smoothness conditions. Our approach in Chapters 12 and 13 will instead follow the theorem below, which requires considerably weaker smoothness hypotheses than the classical conditions.
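Theorem 11.2 (van der Vaart (1998)) Suppose that the model $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean at an inner point $\theta_0$ of $\Theta \subset \mathbb{R}^k$. Furthermore, suppose that there exists a measurable function $\dot\ell$ with $P_{\theta_0} \dot\ell^2 < \infty$ such that, for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,
\[
| \log p_{\theta_1}(x) - \log p_{\theta_2}(x) | \le \dot\ell(x) \, \|\theta_1 - \theta_2\|.
\]
If the Fisher information matrix $I_{\theta_0}$ is nonsingular and $\hat\theta_n$ is consistent, then
\[
\sqrt{n}(\hat\theta_n - \theta_0) = I_{\theta_0}^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i) + o_{P_{\theta_0}}(1).
\]
In particular, the sequence $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically normal with mean zero and covariance matrix $I_{\theta_0}^{-1}$.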
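The linear representation in Theorem 11.2 can be examined on a single simulated sample. The sketch below assumes the exponential model $p_\theta(x) = \theta e^{-\theta x}$ (an illustrative choice, not from the text), for which the score is $\dot\ell_\theta(x) = 1/\theta - x$, the information is $I_\theta = 1/\theta^2$, and the log-density is locally Lipschitz in $\theta$ with a square-integrable envelope. The function name and the value `theta0 = 2.0` are hypothetical. The function returns the remainder $\sqrt{n}(\hat\theta_n - \theta_0) - I_{\theta_0}^{-1} n^{-1/2} \sum_i \dot\ell_{\theta_0}(X_i)$, which the theorem says is $o_{P_{\theta_0}}(1)$.

```python
import math
import random


def expansion_remainder(theta0, n, rng):
    """Remainder of the linear expansion of the MLE in the exponential model.

    Returns sqrt(n) (theta_hat - theta0) - I^{-1} n^{-1/2} sum_i score_{theta0}(X_i),
    with theta_hat = 1 / mean(X), score_theta(x) = 1/theta - x and I = 1/theta^2.
    """
    xs = [rng.expovariate(theta0) for _ in range(n)]
    xbar = sum(xs) / n
    h_hat = math.sqrt(n) * (1.0 / xbar - theta0)
    score_sum = sum(1.0 / theta0 - x for x in xs)
    linear_term = theta0 ** 2 * score_sum / math.sqrt(n)
    return h_hat - linear_term


rng = random.Random(3)
for n in (100, 10000, 100000):
    print(n, expansion_remainder(2.0, n, rng))
```

In this model the remainder equals $\theta_0 \sqrt{n}\,\delta_n^2/(1 + \delta_n)$ with $\delta_n = \theta_0 \bar X_n - 1$, so it is of order $n^{-1/2}$ in probability. Individual samples need not give a monotonically decreasing sequence of printed values.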