M390C: Probabilistic Combinatorics

M390C: Probabilistic Combinatorics
Last updated: February 26, 2015
1
Lecture 1
We introduce the basic idea of the probabilistic method, and running examples. These
objects will reappear in later lectures, as we refine our tools.
Typical problem: prove that there exists an object with desired property T . The
probabilistic method: take a random object. Show that with positive probability, it
will have property T (and therefore such an object must exist).
This method is particularly useful when explicit construction is difficult.
We illustrate this method with three examples.
1.1
Example 1: Tournament
A tournament is a complete, directed graph, where u → v if u beats v. Say that a
tournament has property T (k) if for every set of k teams, there is one who beat them
all. Show that if k < n and
n
(1 − 2−k )n−k < 1,
k
then there exists a tournament with property T (k).
Tournament analysis: For example, property T (1) means there exists a set of cycles
than span the graph. For given n and k, constructing an explicit tournament with
property T (k) seems difficult.
Here is the probabilistic solution. Take a tournament uniformly at random: orient
each the n2 edges leftward or rightward with probability 1/2. Take a set K of k
teams. Let AK be the bad event that there does NOT exist a team who beat them
all. For each of the n − k remaining teams, the probability that it beats all of the
chosen k teams is 2−k . Thus
P(AK ) = (1 − 2−k )n−k .
1
There are
P(
n
k
set of k teams. By the union bound,1
[
K⊂V,|K|=k
AK ) ≤
X
P(AK )
K
n
=
(1 − 2−k )n−k < 1
k
by our assumption.
T
Thus the complement event { K AK }, which is property T (k), must exists with
positive probability. Thus, there exists a tournament with property T (k).
1.2
Example 2: Ramsey numbers
The complete graph on n vertices, denoted Kn , is the graph where each vertex is
connected to all other vertices. A k-coloring of a graph is an assignment of a color
to each vertex, out of k possible color choices. One can also assign colors to the
edges instead. In this case, it is called an edge coloring. A k-clique in a graph G is a
complete subgraph Kk in G. In a colored graph, a monochromatic clique is a clique
where all vertices (or edges) have the same color.
Definition 1 The Ramsey number R(k, `) is the smallest integer n such that in any
two-coloring of the edges of the complete graph on n vertices Kn by red and blue,
either there is a red Kk , or there is a blue K` .
Hidden in this definition is the claim that such a number exists (ie: finite) for any
pairs of positive integers (k, `). That is, given integers k, ` ≥ 1, the definition claims
that one can always find a large enough complete graph with a two-coloring of the
edges such that there is a monochromatic clique of either size k or `. This claim itself
is called Ramsey’s theorem.
Ramsey’s theorem. R(k, `) exists (ie: is finite) for any pair of integers k, ` ≥ 1.
Problem 1. For k ≥ 1, what is R(k, 1)? What is R(1, k)?
Problem 2. Prove Ramsey’s theorem by showing that
R(k, `) ≤ R(k − 1, `) + R(k, ` − 1).
Now that we have convinced ourselves that R(k, `) exists, the next question is, what
is it? Unfortunately, formula for arbitrary k, ` is an open problem. A fast algorithm
1
For events {A1 , . . . , Am }, the union bound says
P(A1 ∪ . . . ∪ Am ) ≤ P(A1 ) + . . . + P(Am ).
(That is, the probability that any of these events occur is at most the sum of their probabilities).
2
for computing R(k, `) is also open. In fact, we only know this number exactly for
k ≤ 5, ` ≤ 4. For other numbers we have bounds: R(5, 5) ∈ [43, 49], for example.
The situation is summarized by Joel Spencer, who attributed the story to Erdos.
Imagine a bunch of aliens, much more powerful than us, land on Earth
and demand R(5, 5). Then we should gather all of our computers and
mathematicians, and attempt to find the value. But suppose, instead,
that they ask for R(6, 6). Then we should attempt to destroy the aliens.
Problem 3. Consider the na¨ıve algorithm of finding R(5, 5) by checking all possible
two-colorings of K43 , K44 , . . . , K49 . How many such colorings are there on K43 ?
Ramsey theorem is an instance of Ramsey theory, which are theorems that roughly
says: ‘if a structure is big enough, order emerges’. Wikipedia states:
Problems in Ramsey theory typically ask: ‘how many elements of some
structure must there be to guarantee that a particular property will hold?’
Results of this kind have numerous applications - some of which we will derive in this
class. The probabilistic method is a very good tool for tackling these questions.
Our goal will be to bound the Ramsey number using the probabilistic method.
1.3
First bound on the Ramsey number
Proposition 2 If
k ≥ 3.
1−(k)
2 2 < 1, then R(k, k) > n. Thus, R(k, k) > b2k/2 c for all
n
k
Proof: First, let us prove the ‘thus’. Suppose k ≥ 3. Choose n = b2k/2 c. We shall
use the bound
nk
n
n!
< .
=
(n − k)!k!
k!
k
Apply this bound to the first term, and expand out the second term, we get
n 1−(k2) nk
2
nk 21+k/2
2
<
=
.
k
k! 2(k2 −k)/2
k! 2k2 /2
But nk = b2k/2 ck < 2k
2 /2
, and for k ≥ 3,
21+k/2 < k!,
so
nk 21+k/2
=
k! 2k2 /2
nk
2k2 /2
3
21+k/2
k!
< 1.
Thus by the first claim, R(k, k) > n.
Now, let us prove the first claim. Fix an n satisfying the hypothesis. We need to show
that there exists a coloring of Kn such that there are no monochromatic k-cliques.
The idea is to find one at random.
For each edge of Kn , color the edge red or blue each with probability 1/2. For a set
K of k vertices, let AK be the event that they form a monochromatic k-clique. Then
k
P(AK ) = 21−(2) .
Indeed, imagine that we are
coloring the edges sequentially. After fixing the color of
k
the first edge, there are 2 − 1 uncolored edges remaining. The clique is monochromatic if and only if all of these edges have the same color as our first edge, and this
k
happens with probability 21−(2) . By the union bound,
[
X
P( AK ) ≤
P(AK )
K
K n 1−(k2)
=
2
<1
k
by our assumption.
So, with positive probability, our random coloring of Kn is a coloring that does not
contain monochromatic k-cliques. Thus, such a coloring exists. Such, R(k, k) > n. 1.4
Take-away messages
The probabilistic method is the following: to find an object, take a random one (and
show that we succeed with positive probability). In Proposition 12, we needed to
find a two-coloring of Kn with no monochromatic k-cliques. The ‘object’ is a twocoloring of Kn . ‘Property T ’ is ‘contains no monochromatic k-cliques’. The random
construction is the simplest, most natural one: color each edge independently at
random with a uniformly chosen color.
There are three other take-away points in this proof.
• It is a counting argument. This is the ‘combinatorics’ part of the method.
• We applied a bound to simplify quantities that appeared. In this case, we used
the union bound. We will encouter ‘fancier’ bounds in the class, and in practice
part of the art is to apply the correct bound at the correct step.
• We could have applied more complicated random constructions: for example, we
could have chosen red with probability p, blue with probability 1 − p, and hope
4
to tune p to obtain an optimal lower bound. Sometimes this yield substantially
better results. See Lecture 3.
2
Lecture 2: Existence using expectation
Lemma 3 Let X be a random variable, E(X) = c < ∞. Either X = c with probability 1, or that each of the events {X > c} and {X < c} happen with positive
probability.
Example application:
Theorem 4 (Szele 1943) There is a tournament with n players and at least n!2−(n−1)
Hamiltonian paths.
(Recall that a Hamiltonian path in a directed graph is a path that visits each vertex
exactly once, with no edges repeated. A Hamiltonian path in a tournament is a
total ranking of players such that a > b > c > · · · . It may seem surprising that a
tournament can have exponentially many different Hamiltonian paths.
Proof: Take a random tournament as before. Let X denote the number Hamiltonian paths. For each permutation σ, let Xσ be indicator of the event that σ is a
Hamiltonian path. Such a path has n − 1 edges, so
E(Xσ ) = P({σ is Hamiltonian }) = 2−(n−1) .
There are n! permutations, so
!
E(X) = E
X
Xσ
=
X
E(Xσ ) = n!2−(n−1) .
σ
σ
Example application 2:
Let F be a family of subsets of [n]. Say that F is an antichain if there are no two
sets A, B ∈ F such that A ⊂ B. For example, if F= Fk is the family of subsets of
size k of [n], then Fk is an antichain with |Fk | = nk .
We can represent F as a graph on a subset of vertices of the n-hypercube, where two
vertices A and B are adjacent if either A ⊂ B or B ⊂ A. Then F is an antichain if
the graph is totally disconnected. Call Fk the k-th belt of the cube.
The largest belt Fk is for k = bn/2c. Can one construct a larger antichain? Sperner’s
theorem says no.
5
Theorem 5 (Sperner’s theorem)
Let F be a family of subsets of [n]. If F is an
n
antichain, then |F| ≤ bn/2c
.
Proof: View F as a graph, let fk denote the number of vertices of F in the k-th
belt of the n-hypercube. We shall prove that
n
X
fk
n ≤ 1.
k=0
(1)
k
This then proves Sperner’s theorem, since
n
n
X
X
fk
n ≥
k=0
k
k=0
fk
n
bn/2c
, ⇒
n
X
fk
k=0
n
bn/2c
≤ 1, ⇒
n
X
fk = |F| ≤
k=0
n
.
bn/2c
We shall prove the contrapositive of (1). That is, assume that
n
X
fk
n > 1.
k=0
k
We want to show that F is not an antichain.
Let π be a random permutation of [n]. Consider the sequence of sets {∅}, {π(1)},
{π(1), π(2)}, . . . {π(1), π(2), . . . , π(n)}. Let U be the number of sets in this sequence
that belongs to F. Let Uk denote the indicator of the event that the k-th set
{π(k), . . . π(k)} is in F. Note that the k-th set in the sequence is a random element of the k-th belt. Thus
fk
E(Uk ) = n .
k
Now
n
X
X
X
fk
E(U ) = E(
Uk ) =
E(Uk ) =
n > 1.
k
k
k=0
k
By Lemma 3, with positive probability, U ≥ 2. That is, there exists a permutation
π such that there are at least two sets of the sequence lies in F. But one of such set
must contain another, and thus F cannot be an antichain. 3
Lecture 3: Sample and Modify
Sometimes the first random object we constructed does not quite have the desired
property. In this case, we may need to do small modifications. In some communities,
this is called sample-and-modify.
6
3.1
Independent set
Let G = (V, E) be a graph. A set I ⊆ V is called independent if i, j ∈ I implies
(i, j) 6∈ E. The size of the largest independent set α(G) is called the independent
number of a graph. (Recall: for example, an antichain is an independent set in the
n-hypercube).
We want a lower bound on α(G).
Theorem 6 Suppose G = (V, E) has n vertices and nd/2 edges, d ≥ 1. Then α(G) ≥
n
.
2d
For example, suppose n is a multiple of d + 1, so n = (d + 1)k. Consider a graph
n
with k disjoint (d + 1)-cliques. Clearly α(G) = d+1
in this case, so our bound is only
off by a constant 2. In fact, the clique construction is tight in this case. This is an
example of Turan’s theorem, which has a probabilistic proof. (See future lectures).
Proof: We shall randomly find a large independent set. Here is a recipe for an
independent set: start with some initial set of vertices S. To each of its edges on
this induced subgraph, delete one of the incident vertices. After this, all the induced
edges have been destroyed, and thus the remaining vertices S ∗ form an independent
set.
For this construction to work, we need to start with a big enough set S with few
enough edges, so that after deletion, S ∗ is large. Specifically, let X be the number of
vertices in S, Y be the number of edges in the induced subgraph on S. Then S ∗ has
at least X − Y vertices. So we need to find S such that X − Y is large.
Choose S at random: for each vertex v ∈ V , include v in S with probability p. Then
E(X) = np
and
E(Y ) =
X
E({e ∈ G|S }) =
e∈E
So
nd 2
p.
2
nd 2
p.
2
Choose p to maximize the above quantity, we find that the optimal p is p = 1/d.
(Assuming d ≥ 1). Thus,
n
E(X − Y ) = .
2d
By Lemma 3, there exists a set S where the modified set S ∗ is an independent set
n
with at least X − Y ≥ 2d
vertices. E(X − Y ) = np −
7
3.2
Packing
Let B(x) denote the cube [0, x]d of side length x. Let C be a compact measurable
set with (Lebesgue) measure µ(C). A packing of C into B(x) is a family of mutually
disjoint copies of C, all lying inside B(x). Let f (x) denote the largest size of such a
family. The packing constant δ(C) is
δ(C) := µ(C) lim f (x)x−d .
x→∞
Note that µ(C)f (x) is the volume of space covered by copies of C inside B(x), and
x−d is the volume of B(x). So δ(C) is the maximal proportion of space that may be
packed by copies of C. One can show that this limit exists.
Our goal is to lower bound δ(C).
Theorem 7 Let C be a bounded, convex and centrally symmetric around the origin.
Then δ(C) ≥ 2−d−1 .
Proof: This example is similar to independent set, since putting copies of C i.i.d
inside B(x) does not yield a packing, so after sampling one needs to modify.
Given C, we need to find a dense packing. Here is one construction of a packing:
sample n points s1 , . . . , sn i.i.d from B(x). To each point si , puts a copy of C with
this center, that is, C + si . Now to each pair of intersecting copies, remove one of
the two copies. This gives a packing, except that some copies of C may lie outside
the box. We fix it by enlarging the box until it includes all copies, and call this the
packing of the larger box.
Let s and t be two i.i.d points from B(x). First we compute the probability that
C + s intersects C + t. By central symmetry and convexity, C = −C. Thus the two
sets intersect iff
s ∈ t + 2C.
For each given t, this event happens with probability at most
µ(2C)x−d = 2d µ(C)x−d .
So
P(C + s ∩ C + t 6= ∅) ≤ 2d µ(C)x−d .
Let s1 , . . . , sn be n i.i.d points from B(x). Let Xij be the indicator of the event that
C + si intersects C + sj . Let X be the total number of pairwise intersections. Then
X
n d
n2
E(X) =
E(Xij ) ≤
2 µ(C)x−d ≤ 2d µ(C)x−d .
2
2
i<j
8
So there exists a specific choice of n points with fewer than
intersections. After our removals, there are at least
n2 d
2 µ(C)x−d
2
pairwise
n2 d
n − 2 µ(C)x−d
2
nonintersecting copies of C. Now, we choose n to optimize this quantity. This gives
n=
xd
.
2d µ(C)
Finally we enlarge the box: let 2w be the width of the smallest cube centered at 0
that contains C. Then our constructed set is a packing of B(x + 2w). Hence
f (x + 2w) ≥
xd
,
2d µ(C)
so
δ(C) ≥ lim µ(C)f (x + 2w)(x + 2w)−d ≥ 2−d−1 .
x→∞
4
Lecture 3 (cont): Markov’s inequality (first moment method)
Theorem 8 (Markov’s inequality) Consider a random variable X ≥ 0. Let t > 0.
Then
E(X)
P(X ≥ t) ≤
t
Proof: Consider the function g : x 7→ t · 1{x≥t} . Note that
g(x) ≤ x.
Then
E(X) ≥ E(g(x)) = tP(X ≥ t).
Rearranging gives
P(X ≥ t) ≤
9
E(X)
.
t
Corollary 9 (First moment inequality) If X is a nonnegative, integer random
variable, then
P(X > 0) ≤ E(X).
Corollary 10 Let Xn be a sequence of nonnegative, integer random variables. Suppose E(Xn ) = o(1), that is, limn→∞ E(Xn ) = 0. Then limn→∞ P(Xn > 0) = 0. That
is, Xn = 0 asymptotically almost surely.
4.1
Ramsey number revisited
Recall the definition of Ramsey number.
Definition 11 The Ramsey number R(k, `) is the smallest integer n such that in
any two-coloring of the edges of the complete graph on n vertices Kn by red and blue,
either there is a red Kk , or there is a blue K` .
Recall that we proved the following
Proposition 12 If
n
k
k
21−(2) < 1, then R(k, k) > n.
Let us rewrite the proof as an application of the first moment method.
Proof: Take a random coloring: color each edge red or blue with probability 1/2
independently. Let X be the number of cliques of size k that is monochromatic. As
k
argued before,
the probability of the above event for a given k set of vertices is 21−(2) .
There are nk set of vertices of size k, so
n 1−(k2)
E(X) =
2
.
k
Thus
P(X > 0) ≤ E(X) < 1,
k
so P(X = 0) > 0. That is, if n satisfies nk 21−(2) < 1, there exists a coloring with no
monochromatic k-clique, thus R(k, k) > n. This proof is shorter, and cleaner to generalize.
Proposition 13 (Exercise) If there exists p ∈ [0, 1] with
`
n (k2)
n
p +
(1 − p)(2) < 1
k
`
then R(k, `) > n.
10
Proof: Color an edge blue with probability p, red with probability 1 − p independently at random. Let X denote the number of k-clique colored red plus the number
of `-cliques colored blue. Then
`
n
n (k2)
(1 − p)(2) .
E(X) =
p +
`
k
The proof ends in the same way as that of Proposition 13. As an aside, with sample-and-modify, we can generalize the above result. This does
not use the the first moment method, but is an illustration of the previous lecture.
Proposition 14 (Exercise) For all p ∈ [0, 1] and for all integers n,
`
n (k2)
n
R(k, `) > n −
p −
(1 − p)(2) .
k
`
Proof: As before, color an edge blue with probability p, red with probability 1 − p
independently at random. Let X denote the number of k-clique colored red plus the
number of `-cliques colored blue. Then
`
n (k2)
n
E(X) =
p +
(1 − p)(2) .
k
`
Recall that X is the number of ‘bad’ sets. By the Expectation Lemma (Lemma 3,
Lecture 2), there exists a coloring with at most E(X) such bad sets. For each bad
k
`
set, remove a vertex. This procedure removes at most nk p(2) + n` (1 − p)(2) vertices.
The coloring on the remaining
`
n (k2)
n
n−
p +
(1 − p)(2)
k
`
vertices have no ‘bad’ sets (that is, no k red clique or ` blue clique), thus R(k, `) is
at least this quantity. 4.2
4.2.1
Graph threshold, K4 in a G(n, p)
Threshold function
The Erdos-Renyi random graph G(n, p) is a random undirected graph on n vertices,
generated by including each edge independently with probability p. A property P of
a graph G is called monotone increasing if
G ∈ P, G0 ⊃ G ⇒ G0 ∈ P.
11
Example of such properties include connectedness, existence of a clique, existence
of a Hamiltonian cycle, etc. A threshold for a monotone increasing property P is a
function p∗ = p(n) such that
0 if p p∗ (p = o(p∗ ))
lim P(G(n, p) ∈ P) =
1 if p p∗ (p∗ = o(p))
n→∞
Every non-trivial monotone increasing property has a threshold function. (We omit
the proof of this theorem). Finding the threshold function for a given property is
a central problem in graph theory. Threshold proofs often consist of two parts, an
upper bound and a matching lower bound for p∗ . In many cases, the first moment
method gives one side.
4.2.2
Existence of K4
Proposition 15 If p n−2/3 , then lim P(G(n, p) ⊃ K4 ) = 0.
n→∞
Proof: Let X be the number of copies of K4 in G(n, p). Then
n 6
E(X) =
p.
4
Suppose p n−2/3 , that is, p = n−2/3 o(1). Then
n 6
E(X) =
p = O(n4 n−12/3 o(1)) = o(1).
4
That is, E(X) → 0 as n → ∞. By the first moment method, P(X > 0) ≤ E(X) → 0,
thus lim P(G(n, p) ⊃ K4 ) = 0. n→∞
5
5.1
Lecture 4: More examples of the first moment
method
Background: Big-O and little-o
Big-O. For sequences an and bn , we write
an = O(bn )
to mean that there exists some finite constant C and some large integer N such that
|an | ≤ C|bn |
12
for all n ≥ N . We say that an is big-O of bn . If the quantities involved are non-zero,
then an = O(bn ) if and only if
lim sup |an |/|bn | < ∞.
Recall that the lim sup (limit superior) of a sequence {xn } is
lim sup xn .
m→∞ m≥n
By monotone convergence, the lim sup of a bounded sequence always exist, even if
the sequence does not converge.
Little-o. We write
an = o(bn )
if an grows slower than bn , that is, if for every positive constant , there exists some
constant N () such that
|an | ≤ |bn |
for all n ≥ N (). If the quantities involved are non-zero, then an = o(bn ) if and only
if abnn → 0 as n → ∞.
The notation ∼. We write an ∼ bn if
5.2
an
bn
→ 1 as n → ∞.
Background: Stirling’s approximation
The
term n! often appears in various formulae (for example, in the binomial coefficient
n
). Often we want to bound this quantity for large n. We now collects and proves
k
some useful bounds.
Theorem 16 (Stirling’s formula) n! ∼
nn
en
√
2πn.
It is common to express large quantities by its (natural) logarithm. So another way
to write the above is
log(n!) = n log(n) − n +
1
1
log(n) + log(2π) + o(1)
2
2
First we prove that the dominant term n log(n) is the right order. In fact, often we
only need this bound.
Theorem 17 For n ≥ 1, n log n − n < log(n!) < n log(n). Thus
log(n!) ∼ n log(n).
13
Proof: Since n! < nn , log(n!) < n log(n). Now we prove the lower bound.
One
Rn
method is to use the Riemann sum with right endpoints for the integral 1 log(x)dx.
Since log is a strictly increasing function, the right endpoint sum upperbounds the
integral. So
Z n
log(2) + . . . + log(n) = log(n!) >
log(x) dx = n log(n) − n + 1 > n log(n) − n.
1
A sketch of the proof Stirling’s formula can be found on Wikipedia, for example.
Useful asymptotic bounds on binomial
coefficients come from Stirling’s. The most
2n
commonly quoted bound
in n!n , but for any large k (ie: going to ∞ as n → ∞),
, and use Stirling’s approximation to analyze the
it is worth writing nk as (n−k)!k!
asymptotic of this coefficient.
Example 1 Write
2n
n
=
(2n)!
.
n!n!
Apply Stirling’s formula to (2n)! and n!, we get
2n
22n
∼√
n
πn
Here are some other useful bounds
Proposition 18 For 1 ≤ k ≤ n,
n k
k
n
nk n · e k
≤
≤
≤
k
k!
k
Proof: First inequality: note that
n
n−1
n−2
≤
≤
...,
k
k−1
k−2
so
nn−1
n
n−k
n
=
···
≥ ( )k .
k
kk−1
1
k
Second inequality: note that
n(n − 1) · · · (n − k)
n
nk
=
≤ .
k
k!
k!
Third inequality: cancel the nk , take log both sides, rearrange, this is the inequality
k log(k) − k < log(k!),
which we already proved above. 14
5.3
Longest increasing subsequence
Let σn be a uniformly chosen random permutation of [n]. An increasing subsequence
of σn is a sequence of indices i1 < i2 < . . . < ik such that σn (i1 ) < σn (i2 ) < . . . <
σn (ik ). Let Ln be the length of a longest increasing subsequence of σn .
√
Lemma 19 P(Ln > 2e n) = o(n−1/2 ) as n → ∞. In particular, this implies
E(Ln )
√
≤ 2e.
n
n→∞
lim sup
Technical comment: Here we use the lim sup instead of lim since we do not know if
)
√ n converges as n → ∞.
the sequence E(L
n
Intuition: The lemma states that for large n,
√ for a random permutation of [n], the
longest increasing subsequence is at most O( n).
Proof: First we deal with the implications. Let ` be some value (think ` for ‘length’).
Borrowing the idea from the proof of Markov’s inequality, we use the bound
E(Ln ) ≤ `P(Ln < `) + nP(Ln ≥ `)
≤ ` + nP(Ln ≥ `)
√
Thus if we can show that for ` = 2e n, P(Ln ≥ `) → 0 faster than n−1/2 , this would
imply
E(Ln )
√
≤ 2e + o(1),
n
and thus
E(Ln )
√
≤ 2e.
n
n→∞
lim sup
Now consider the problem of bounding P(Ln ≥ `). This is difficult to bound because
Ln is the definition of Ln involves the maximum. Let Xn,` be the number of increasing
subsequences of σn with length `. Then
P(Ln ≥ `) = P(Xn,` > 0).
Now we use the first moment method to upperbound the second quantity. Note that
X
Xn,` =
1{ s is increasing in σn } .
subsequence s of length `
There are n` subsequences, and the probability of a specific subsequence s being
increasing in σn is
1/`!
15
Indeed, σn restricted to s is just a random permutation of these s indices. There are
`! ways that σn permutes s, only one of which is increasing. Thus
1 n
P(Xn,` > 0) ≤ E(Xn ) =
.
`! `
Assuming that ` → ∞, we use the bounds
`
`! ≥ ( )`
e
and
n
en
≤ ( )`
`
`
to obtain
√
1 n
e n 2`
≤(
) .
`! `
`
√
For ` = 2e n, the above bound becomes
√
2−4e
n
√
which → 0 much faster than 1/ n. In summary, we have shown that
P(Ln ≥ `) = P(Xn,` > 0) ≤ E(Xn ) ≤ 2−4e
√
n
n−1/2 ,
so P(Ln ≥ `) = o(n−1/2 ). Remark. It has been show that
√
E(Ln ) = 2 n + cn1/6 + o(n1/6 )
for c = −1.77 . . .. So our bound is loose only by a constant factor.
5.4
Connectivity threshold
Say a graph is connected if for every pair of vertices u, v, there exists a path that
connects u and v.
Our goal is to prove a lower bound for the threshold function of connectivity of
G(n, p).
Proposition 20 If p log(n)
,
n
then limn→∞ P(G(n, p) is connected ) → 1
A graph is disconnected if either it has isolated vertices, or it has disconnected components. We shall prove both events have asymptotically zero probabilities, and the
conclusion follows.
16
Lemma 21 [Exercise] Let X1 be the number of isolated vertices of G(n, p). If p =
c log(n)
for c 1, then P(X1 > 0) = o(1).
n
Proof: Note that
X1 = Z1 + . . . + Zn ,
where Zi is 1 if the i-th vertex is isolated, 0 else. Then
lim E(X1 ) = lim n(1 − p)n−1 = lim n(1 −
n→∞
n→∞
n→∞
c log n n
) = lim ne−c log(n) = n1−c .
n→∞
n
So if c 1, E(X1 ) → 0, hence P(X1 > 0) → 0. Proposition 22 Let Xk be the number of components of size k which are disconnected from the rest of the graph in G(n, p). If p = c log(n)
for c 1, then
n
n/2
X
P(Xk > 0) = o(1).
k=2
Proof: Our goal will be to show that
n/2
X
E(Xk ) = o(1).
k=2
There are nk subset of k vertices. For fixed k vertices, there are k(n − k) edges to
the rest of the graph that cannot appear, so
n
E(Xk ) =
(1 − p)k(n−k) .
k
k
Now we use k ≤ n/2 so n − k ≥ n/2, nk ≤ nk! , and k! ≥ (k/e)k to get
E(Xk ) ≤
nk
(1 − p)kn/2 ≤ (en(1 − p)n/2 )k .
k!
Now,
c log(n) n/2
) → ne−c log(n)/2 → n1−c/2 → 0
n
for c 1. Thus (en(1 − p)n/2 ) = o(1) for c 1. By the first moment
n(1 − p)n/2 = n(1 −
n/2
X
k=2
P(Xk > 0) ≤
∞
X
(en(1 − p)n/2 )k = O(n(1 − p)n/2 ) = o(1).
k=2
17
6
Lecture 6: The second moment method
Corollary 23 Suppose φ : R → R is a strictly monotone increasing function, that
is, x > y implies φ(x) > φ(y). Let X be a nonnegative random variable. Then
P(X ≥ t) = P(φ(X) ≥ φ(t)) ≤
E(φ(X))
.
φ(t)
Theorem 24 (Chebychev’s inequality) Let X be a random variable. Then
P(|X − E(X)| ≥ t) ≤
Var(X)
.
t2
Proof: Apply the previous corollary to the random variable |X − E(X)|, with
function φ : x 7→ x2 . The use of the Chebychev’s inequality is called the second moment method. The
following often-used corollary is sometimes called the second moment inequality.
Corollary 25 (Second moment inequality) Let X be a nonnegative integer valued random variable. Then
Var(X)
P(X = 0) ≤
E(X)2
Proof:
P(X = 0) ≤ P(|X − E(X)| ≥ E(X)) ≤
Var(X)
(E(X))2
In comparison, the first moment inequality is
P(X > 0) ≤ E(X).
A typical use of the first moment is to show that E(X) → 0 implies X = 0 a.s. in
the limit. Or E(X) < 1, and thereby get P(X = 0) > 0. That is, if we can show that
E(X) is ‘very small’, then the event {X = 0} happens ‘very often’.
If E(X) is large, say, E(X) → ∞, we cannot conclude that the event {X > 0} happens
‘very often’ based on the first moment. This is where the second moment is handy.
Corollary 26 Let Xn be a sequence of nonnegative integer valued random variables.
Suppose E(Xn ) → ∞. If Var(Xn ) = o(E(Xn )2 ) as n → ∞, then P(Xn > 0) → 1 as
n → ∞.
18
Corollary 27 Let Xn be a sequence of nonnegative integer valued random variables.
Suppose E(Xn ) → ∞. If Var(Xn ) = o(E(Xn )2 ) as n → ∞, then Xn ∼ E(Xn ) as
n → ∞ a.a.e.
Proof: From Chebychev’s inequality, for any > 0,
P(|X − E(X)| ≥ E(X)) ≤
Var(X)
→ 0.
(E(X))2
Intuitively, Var(X) measures the concentration of X around E(X). If Var(X) =
o(E(X)), the above corollary says that eventually X concentrates around E(X).
P
Typically, X = i Xi , where Xi is the indicator of some event Ai . Write i ∼ j if the
events Ai and Aj are not independent. Then
X
X
Var(X) =
E(Xi ) − E(Xi2 ) +
E(Xi Xj ) − E(Xi )E(Xj )
i
i∼j
≤ E(X) +
X
P(Ai ∩ Aj ).
i∼j
A typical
P application of the second moment in this case reduce to showing that
∆ := i∼j P(Ai ∩ Aj ) = o(E(X)2 ).
Corollary 28 Suppose that
PX = Xn is the sum of indicators of events {Ai }. Suppose
E(X) → ∞ as n → ∞. If i∼j P(Ai ∩ Aj ) = o(E(X)2 ), then X ∼ E(X) a.a.s.
Here is another ‘trick’ for sum of indicators. Suppose the events Ai are symmetric:
for any i 6= j, there is a measure preserving map of the underlying probability space
that sends Ai to Aj . Then
X
X
X
P(Ai ∩ Aj ) =
P(Ai )
P(Aj |Ai ).
i∼j
By symmetry, the sum
i
P
j∼i
j∼i
P(Aj |Ai ) is independent of i. Call this sum ∆∗ . Then
∆=
X
P(Ai )∆∗ = ∆∗ E(X).
i
Corollary 29 Suppose E(X) → ∞. If ∆∗ = o(E(X)), then X ∼ E(X) a.a.s.
19
6.1
K4 threshold
Recalled two lectures ago we showed that if p n−2/3 , then G(n, p) a.a.s. does NOT
have a copy of K4 . Now we shall prove that if p n−2/3 , then G(n, p) a.a.s has a
copy of K4 .
Theorem 30 n−2/3 is the threshold function for the existence of K4 .
Proof: The lower bound was established in Proposition 15. Now we shall prove the
upperbound. Suppose p n−2/3 . Let X be the number of copies of K4 in G(n, p).
As we computed previously,
n 6
E(X) =
p = O(n4 p6 ) → ∞.
4
We want to show that X > 0 a.a.s. Note that X is a sum of symmetric indicators.
Fix a K4 graph i. The indicator for K4 graph j is not independent from that of i if i
and j share either two or three vertices. There are O(n2 ) j’s that share precisely two
vertices with i (pick two vertices), and
P(Aj |Ai ) = p5 .
There are O(n) j’s that share precisely three vertices, and
P(Aj |Ai ) = p3 .
Using Corollary 29, we have
X
∆∗ =
P(Aj |Ai ) = O(n2 p5 ) + O(np3 ) = o(n4 p6 ) = o(E(X)).
j∼i
Thus X > 0 a.a.s. 6.2
Connectivity threshold
Theorem 31 The threshold for the connectivity of G(n, p) is p∗ =
log n
.
n
First we prove that this is the threshold for existence of isolated vertices.
Proposition 32 (exercise) The threshold for existence of isolated vertices is p∗ =
log n
.
n
20
Proof: Let X1 be the number of isolated vertices of G(n, p). By first moment
method (see previous lecture), we showed that p = p∗ implies P(X1 > 0) = o(1), and
thus p p∗ implies P(X1 > 0) = o(1).
We need to prove that p p∗ implies X1 > 0 with high probability. Let p = cp∗ =
c logn n . Note that
X1 = Z1 + . . . + Zn ,
where Zi is 1 if the i-th vertex is isolated, 0 else. Then
lim E(X1 ) = lim n(1 − p)n−1 = lim n(1 −
n→∞
n→∞
n→∞
c log n n
) = lim ne−c log(n) = n1−c .
n→∞
n
So if c < 1, E(X1 ) → ∞. We want to show that c < 1 implies X1 > 0 w.h.p.
Use the second moment method. In general, it is good to bound
that
X
E(X12 ) = E(X1 ) +
E(Zi Zj )
Var(X1 )
E(X1 )2
directly. Note
i∼j
All the pairs (i, j), i 6= j are correlated. For a given pair i 6= j,
E(Zi Zj ) = P(Zi = Zj = 1) = (1 − p)2n−3 .
So
X
E(Zi Zj ) = n(n − 1)(1 − p)2n−3 .
i∼j
Now
n(1 − p)n−1 + n(n − 1)(1 − p)2n−3
1
1
1
E(X12 )
=
=
+
−
.
E(X1 )2
n2 (1 − p)2n−2
n(1 − p)n−1 1 − p n(1 − p)
For p =
c log(n)
,
n
c < 1, we have
E(X 2 )
1
1
1
= lim 1−c +
−
= 1.
c
log(n)
2
n→∞ E(X)
n→∞ n
n − c log(n)
1− n
lim
So in particular,
E(X12 ) − E(X1 )2
Var(X1 )
=
→ 0,
E(X1 )2
E(X1 )2
thus P(X1 = 0) → 0 as n → ∞.
So for p =
c log(n)
n
< p∗ , a.a.s there are no isolated vertices. Proof:[Proof of the connectivity threshold] We have shown the following.
• If p p∗ , then the graph has isolated vertices and hence disconnected.
21
• If p ≥ p∗ , then the graph has no isolated vertices.
• If p = p∗ , then the graph has no isolated components of size at least 2.
It follows that for p ≥ p∗ , the graph has no isolated components, and hence is connected. 6.3
Turan’s proof of the Hardy-Ramanjuan theorem
This theorem states that for most large integers, the number of prime factors of x is
about log log n.
Theorem 33 (Hardy and Ramanujan 1920; Turan 1934) Let ω(n) → ∞ arbitrarily slowly. For x = 1, 2, . . . , n, let ν(x) be the number of prime factors of x.
Then the number of x ∈ {1, . . . , n} such that
p
|ν(x) − log log n| > ω(n) log log n
is o(1).
For this proof, reserve p to denote a prime number. We need the following result from
number theory, known as Merten’s formula
Lemma 34 (Mertens’ formula)
X1
p≤x
p
= log log x + O(1).
Proof: Here is a proof sketch to show that log log x is the correct order. For an
integer x, the higher power of a prime p which divides x! is
x
x
x
b c + b 2c + b 3c + ....
p
p
p
(Indeed, if only p is present in x!, then this power is b xp c. If both p and p2 are present
in x!, then this power is the sum of the first two terms, and so on). Thus we have
Y b x c+b x c+b x c+...
x! =
p p p2 p3
.
p≤x
Take log from both sides, and use log(x!) ∼ x log x. Since squares, cubes etc... of
primes are quite rare, and b xp c is almost the same as x/p, we get
x
X log p
p≤x
p
∼ x log x.
22
Write
S(n) =
X log p
p≤n
We have
X1
p≤x
p
=
p
.
1
(S(n) − S(n − 1))
log
n
n≤x
X
We now use summation by parts (also called Abel summation), which states that
X
X
a(n)f (n) = A(x)f (x) −
A(n)f 0 (n).
n≤x
Apply this for f (x) =
n≤x
1
,
log(x)
X1
p≤x
p
a(n) = S(n) − S(n − 1), so A(n) = S(n), we get
= S(x)
X
1
1
+
S(n)
.
log(x) n≤x
n log(n)2
Now S(x) ∼ log(x), so the first term is O(1). For the second term,
X
n≤x
S(n)
X 1
1
∼
∼ log log x.
n log(n)2
n
log
n
n≤x
Proof:[Proof of Hardy-Ramanujan]
Our goal will be to apply Chebychev’s inequality.
Choose x randomly from {1, 2, . . . , n}. For prime p, set Xp = 1 if p divides x, zero
otherwise. Our goal will be to apply Chebychev’s inequality to bound
X
ν(x) =
Xp .
p≤n
But the covariance of this quantity requires one to upper bound the number of primes
below n. This is ∼ logn n by the prime number theorem, but we can get around
introducing extra results by the following trick. Let M = n1/10 . Define
X
X :=
Xp .
p≤M
Since each x ≤ n cannot have more than 10 prime factors larger than M , we have
ν(x) − 10 ≤ X ≤ ν(x).
Thus to prove the statement for ν(x), it is sufficient to prove the same statement for
X.
23
There are bn/pc many x that is divisible by p. So
E(Xp ) =
bn/pc
1
= + O(n−1 ).
n
p
So by Mertens’ formula
E(X) =
X1
+ O(n−1 ) = log log(M ) + O(1) = log log(n) + O(1).
p
p≤M
Now we bound the variance.
Var(X) =
X
Var(Xp ) +
p≤n
X
Cov(Xp , Xq ).
p6=q≤M
Since Xp is an indicator with expectation
1
p
+ O(n−1 ),
1
1
Var(Xp ) = (1 − ) + O(n−1 ).
p
p
So the first term is log log n + O(1). For p and q distinct primes, Xp Xq = 1 if and
only if pq divides x. Thus
Cov(Xp , Xq ) = E(Xp Xq ) − E(Xp )E(Xq ) =
1
1 1
1 1
≤
−
−
−
pq
p n
q n
1 1 1
+
.
≤
n p q
bn/(pq)c bn/pc bn/qc
−
n
n
n
So
1 X
1 1
2M X 1
+
≤
= O(n−9/10 log log n) = o(1).
Cov(Xp , Xq ) ≤
n p6=q≤M p q
n p≤M p
p6=q≤M
X
Similarly, by lower bounding Cov(Xp , Xq ), one can show that
−o(1). So
Var(X) = log log n + O(1).
P
p6=q≤M
Apply Chebychev’s inequality give for any constant λ > 0
p
P(|X − log log n| > λ log log n) < λ−2 + o(1) → 0.
24
Cov(Xp , Xq ) ≥
7
7.1
Lecture 7: Variations on the second moment
method
A slight improvement
Lemma 35 Let X be a non-negative random variable, X 6≡ 0. Then
P(X = 0) ≤
Var(X)
.
(EX)2 + Var(X)
Compared to the second moment inequality, this has an extra Var(X) at the bottom.
We will see below an example application where this extra Var(X) makes a difference.
For most other applications this does not matter, the original inequality is often
sufficient.
Proof: We will prove the equivalent statement, which is
P(X > 0) ≥
(EX)2
.
E(X 2 )
By Cauchy-Schwarz inequality
(EX)2 = (E(X1X>0 ))2 ≤ E(X 2 )P(X > 0).
Rearranging gives the above. 7.1.1
Percolation on regular tree
Percolation theory models the movement of liquids through a porous material, consisting of ‘sites’ (vertices) connected by ‘bonds’ (edges). An edge (or vertex) is open
if the liquid flows through, otherwise it is close. In bond percolation, each edge is
open independently with probability p. In site percolation, each vertex is open independently with probability p. The main question is the existence of an infinite
component: on an infinite graph, for what values of p does there exist an infinite
subgraph connected by open paths?
Let Td be the infinite regular tree of degree d. Designate 0 to be the root. Consider
bond percolation on Td . Define
θ(p) = Pp (|C0 | = +∞).
Define pc (Td ) = supp∈[0,1]:θ(p)=0 . That is, pc (Td ) is the critical probability for the
existence of an infinite component in the regular infinite tree.
25
Theorem 36
pc =
1
.
d−1
Proof:
Let ∂n be the set of vertices of Td of distance n from 0. Let Xn be the number of
vertices of ∂n ∩ C0 . ∂n is called a cutset, since for C0 to be infinite, one must have
Xn > 0.
By the first moment,
θ(p) ≤ Pp (Xn > 0) ≤ Ep (Xn ) = d(d − 1)n−1 pn .
(There are d vertices in the first level, each gives d − 1 children at the next level, thus
there are d(d − 1)n−1 leaves. For each leaf, there is a unique path of length n to the
origin).
We have Ep (Xn ) → 0 for p <
1
.
d−1
Thus pc (Td ) ≥
1
.
d−1
1
, limn→∞ Pp (Xn >
Now we use the second moment. We shall prove that for p > d−1
1
0) ≥ 0, and hence pc ≤ d−1 . Note that nodes x ∼ y for x, y ∈ ∂n iff they have a
common ancestor that is not 0. Furthermore, their paths are independent starting
from the most recent common ancestor. Let x ∧ y denote the most recent common
ancestor of x and y. We sum the pairs x ∼ y by the level of x ∧ y.
Define
µn = Ep (Xn ) = d(d − 1)n−1 pn .
Ep (Xn2 ) =
X
P(x, y ∈ C0 )
x,y∈∂n
= µn +
n−1 X
XX
1{x∧y∈∂m } pm p2(n−m) .
x∈∂n m=0 y∈∂n
For a fixed x, the set y where x ∧ y ∈ ∂m has (d − 2)(d − 1)n−m−1 . All the vertices at
level n are equivalent, and there are d(d − 1)n−1 such vertices. Let
r = ((d − 1)p)−1 .
26
Since p >
1
,
d−1
r < 1. So the above equals
n−1
= µn + d(d − 1)
(d − 2)
n−1
X
(d − 1)n−m−1 p2n−m
m=0
n−1
X
2n−2
= µn + d(d − 2)(d − 1)
((d − 1)p)−m
m=0
n
= µn + µ2n
d−2 1−r
d 1−r
Dividing by µ2n which goes to ∞, we get
1
d − 2 1 − rn
E(Xn2 )
=
+
E(Xn )2
µn
d 1−r
1
d−2
1
≤
+
µn
d 1 − ((d − 1)p)−1
1
d−2
≤1+
d 1 − ((d − 1)p)−1
=: C
This bound holds for all n large enough. Thus by the variant of the second moment
inequality
θ(p) = P( for all n, Xn > 0) = lim P(Xn > 0) ≥ C −1 > 0.
n
This concludes the proof. Note that the second moment inequality does not work. This gives the bound
Var(Xn )
2µ2n − E(Xn2 )
=
µ2n
µ2n
1
d − 2 1 − rn
=2−
−
.
µn
d 1−r
P(Xn > 0) ≥ 1 −
Take p very close to
8
1
,
d−1
so r ≈ 1, then the bound becomes negative.
Lecture 8: The Local Lemma
Often one can phrase existence of a desired random object as a lack of bad events. If
the bad events A1 , . . . , An are independent, and Ai happens with probability at most
xi , then
n
n
^
Y
P( Ai ) =
P(Ai ) ≥ (1 − xi )n .
i=1
i=1
27
The Local Lemma generalizes this to the case where the bad events Ai have limited
dependencies.
Definition 37 Say that an event A is mutually independent of a set of events {Bi }
if for any subset β of events contained in {Bi }, P(A|β) = P(A).
Note that mutual independence is not symmetric (except for the case of two events).
That is, for events A, B1 and B2 ,
P(A|B1 ) = P(A) ⇔ P(B1 |A) = P(B1 ),
and
P(A|B1 , B2 ) = P(A) ⇒ P(A|B1 ) = P(A), P(A|B2 ) = P(A),
but
P(A|B1 ) = P(A), P(A|B2 ) = P(A) 6⇒ P(A|B1 , B2 ) = P(A).
Example 2 Let B1 and B2 be two independent Ber(1/2), and A be the event that
B1 = B2 . Then A is independent of B1 , A is independent of B2 , but A is not
independent of {B1 , B2 }, and thus it is not mutually independent of {B1 , B2 }.
The following proposition is useful to establish mutual independence.
Proposition 38 (Mutual independence principle) Suppose that Z1 , . . . , Zm is
an underlying sequence of independent events, and suppose that each event Ai is completely determined by some subset Si ⊂ {Z1 , . . . , Zm }. For a given i, if Si ∩ Sj = ∅
for all j = j1 , . . . , jk , then Ai is mutually independent of {Aj1 , . . . , Ajk }.
Proof: Homework Definition 39 (Dependency digraph) The dependency digraph D = (V, E) of
events A1 , . . . , An is a directed graph on n nodes, where for i = 1, . . . , n, Ai is mutually
independent of all events {Aj : (i, j) 6∈ E}.
Note that the dependency digraph is directed since mutual independence is not symmetric.
Lemma 40 (The Local Lemma) For events A1 , . . . , An with dependence digraph
D = (V, E), suppose that there are real numbers x1 , . . . , xn ∈ [0, 1] such that for all
i = 1, . . . , n
Y
P(Ai ) ≤ xi
(1 − xj ).
j:i→j∈E
28
Then
P(
n
^
n
Y
Ai ) ≥
(1 − xi ).
i=1
i=1
In particular, with positive probability no events Ai hold.
Proof:
that
2
Fix a subset S ⊂ {1, . . . , n}, |S| = s ≤ n. We will show by induction on s
Y
^
(1 − xi ),
(2)
P( Ai ) ≥
i∈S
i∈S
and for i 6∈ S,
P(Ai |
^
Aj ) ≤ xi .
(3)
j∈S
The case s = 0 is trivial. Suppose (2) and (3) are true for all S such that |S| = s0 < s.
If (3) is true, then (2) holds for a set with cardinality s. Indeed, such a new set is of
the form S ∪ {i}, so by conditional probability
!
^
^
Y
P(
Aj ) = (1 − P(Ai )) · · · 1 − P(Ai |
Aj ≥
(1 − xj ).
j∈S
j∈S∪{i}
j∈S∪{i}
Thus one can use (3) to prove the induction step in (2). It remains to prove (3) by
induction.
Split the set into S1 = {j ∈ S : i → j}, and its complement S2 = S\S1 . For any
subset T ⊆ [n] of indices, define
^
AT :=
Aj .
j∈T
Then
P(Ai |AS ) = P(Ai |AS1 ∧ AS2 ) =
P(Ai ∧ AS1 |AS2 )
.
P(AS1 |AS2 )
(4)
For the numerator, since Ai is independent of AS2 ,
P(Ai ∧ AS1 |AS2 ) ≤ P(Ai |AS2 ) = P(Ai ) ≤ xi
Y
(1 − xj ).
j:i→j∈E
2
Thanks to class discussion for pointing out issues with the earlier version of the proof.
29
(5)
For the denominator, use conditional probability and applies the induction hypothesis
in (3). Suppose S = {j1 , j2 , . . . , jr }. Then
P(AS1 |AS2 ) = P(Aj1 ∧ Aj2 . . . ∧ Ajr |AS2 )
= (1 − P(Aj1 |AS2 )) · (1 − P(Aj2 |Aj1 ∧ AS2 ) · · ·
· · · (1 − P(Ajr |AS\{jr } ))
Y
≥ (1 − xj1 )(1 − xj2 ) · · · (1 − xjr ) ≥
(1 − xj ).
j:i→j∈E
Substitute in (4) completes the induction step in (3). 8.1
Special cases
Here is a special case of the Local Lemma. This result is most powerful when the
events Ai have roughly the same probabilities, and dependencies between events are
rare.
Corollary 41 (The Local Lemma, symmetric version) For events A1 , . . . , An ,
suppose that each Ai is mutually independent of all but at most d other events Aj ,
and that P(Ai ) ≤ p for all i ∈ [n]. If
ep(d + 1) ≤ 1,
V
then P( ni=1 Ai ) > 0.
Proof: For d = 0 this is trivial. Suppose d > 0. By assumption, there exists a
1
dependency digraph where |j : i → j ∈ E| ≤ d for all i. Define xi = d+1
, we get
xi
Y
j:i→j
(1 − xj ) ≥
1
1 d
1
(1 −
) >
.
d+1
d+1
e(d + 1)
Here we used the fact that for all d ≥ 1,
(1 −
1 d 1
) > .
d+1
e
So if ep(d + 1) ≤ 1, then P(Ai ) < 1 ≤
Lemma applies. 1
e(d+1)
30
≤ xi
Q
j:i→j (1
− xj ), and the Local
8.1.1
Example: Two-coloring of hypergraphs
A hypergraph H = (V, E) is two-colorable if there is a coloring of V by two colors so
that no edge f ∈ E is monochromatic.
Theorem 42 (Erd¨
os-Lov´
asz 1975. (Exercise)) Let H = (V, E) be a hypergraph
in which every edge has at least k elements. Suppose that each edge of H intersects
at most d other edges. If e(d + 1) ≤ 2k−1 , then H is two-colorable.
Proof: Color each vertex of H independently of either color with probability 1/2.
For an edge f ∈ E, let Af be the event that it is monochromatic. Then
P(Af ) = 21−|f | ≤ 21−k .
Note that Af is mutually independent of {Ag : f ∩g = ∅} by the mutual independence
principle. The conclusion of the Local Lemma applies. The following is another special case, useful when the probability of the events Ai
can differ alot.
Corollary 43 (The Local Lemma, summation version) For events A1 , . . . , An ,
suppose that for all i,
X
1
P(Aj ) ≤ .
4
j:i→j
Vn
Then P( i=1 Ai ) > 0.
P
Proof: Take xi = 2P(Ai ). For all i, given that j:i→j xj ≤ 12 , we need to show that
1
(1 − xj ) ≥ .
2
j:i→j
Y
Let r be the cardinality of the set {j : i → j}. Since xj ∈ [0, 1], we shall prove that
the solution to the problem
r
Y
minimize (1 − xj )
subject to
j=1
r
X
1
xj = , xj ≥ 0
2
j=1
is attained at the boundary where all but one xj are zero, with minimum value 21 .
Let f be the objective function. Then
h(x) := log(f (x)) =
r
X
j=1
31
log(1 − xj ).
Note that ∇h(x) = ( −1
, . . . −1
). By Lagrange multipler method, the only critical
x1
xr
1
point of h (and hence of f ) on the interior is where x1 = . . . = xr = 2r
, which is a
clear maxmizer. Thus the minimum of the function has to be attained the boundary
of the domain, which is where at least one xj is zero. Recurse the argument, we are
done. If we apply the asymmetric version under the hypothesis of the symmetric case, we
get a bound of
4pd ≤ 1,
which is slightly worse than
ep(d + 1) ≤ 1.
However, the 14 in this proof is tight, since the optimization problem we solved has a
tight minimum of 12 .
8.1.2
Example: frugal graph coloring
Definition 44 A vertex-coloring of a graph G is proper if no edge have the same
end-point color. A proper vertex-coloring of a graph G is called β-frugal if no color
appears more than β times in the neighborhood of any vertex of G.
If ∆ is the maximum degree of a vertex of G, then a β-frugal coloring requires at
least d∆/βe + 1 many colors.
Theorem 45 (1997) For β ≥ 2, if a graph G has maximum degree ∆ ≥ β β , then G
has a β-frugal coloring with 16∆1+1/β colors.
Proof: We will pick a random coloring of G with C = 16∆1+1/β colors. Then we
will show that with positive probability, the coloring is proper and β-frugal.
There are two types of bad events: those that prevents our coloring from being proper
(type A), and those that prevents our coloring from being β-frugal (type B).
Type A events: for each edge (u, v) of G, let Auv be the event that u and v have the
same color.
Type B events: for each set of β + 1 neighbors U of some vertex, let BU be the event
that they all have the same color.
We have
P(Auv ) = 1/C
and
P(BU ) = 1/C β .
32
Each type A event is mutually independent of all but at most 2∆ other A events and
2∆ ∆
other B events.
β
Each type B event
is mutually independent of all but at most (β + 1)∆ type A events
∆
and (β + 1)∆ β type B events.
Let Ei denote an event of either type A or B. Apply the summation version of the
Local Lemma, we have
X
1
∆ 1
P(Ej ) ≤ (β + 1)∆ + (β + 1)∆
C
β Cβ
j:BU →j
∆β+1
1
∆
∆β
≤ (β + 1)∆ + (β + 1)
use
≤
C
β!C β
β!
β
β+1
β+1
=
+
16∆1/β β!16β
1
≤ .
use ∆ > β β , β ≥ 2.
4
Similarly,
X
∆ 1
1
P(Ej ) ≤ 2∆ + 2∆
β Cβ
C
j:Auv →j
X
1
P(Ej ) ≤ .
≤
4
j:B →j
U
Thus by the summation version of the Local Lemma, we conclude that with positive
probability, none of the bad events occur, so the coloring is proper and β-frugal. Note that if we used the symmetric version, then we need the
bound P(Ei )∆ ≤β p = 1/C,
∆
and the dependency set size of at least d ≥ (β + 1)∆ β ≥ (β + 1)∆( β ) . But for
∆ = β β , then
1
1 2
2
pd > β −β−1 (β + 1)β β β β −β > β β −β > 1
16
16
for β > 3. So the Symmetric Local Lemma does not work. The reason the other
version works is that most events have very small probability and only few have large
probability.
9
9.1
Lecture 9: More examples of the Local Lemma
The Lopsided Local Lemma
Let Di = {j : i → j ∈ E}. The proof of the local lemma would still go through if we
replace the condition that each Ai is mutually independent of {Aj : j 6∈ Di }, which
33
implies
^
P(Ai |
Aj ) = P(Ai ),
j6∈Di
by the weaker assumption that for each i,
^
P(Ai |
Aj ) ≤ P(Ai ).
(6)
j6∈Di
Indeed, in the proof, we had Di = S1 , {j 6∈ Di } = S2 . The inequality in line (5)
would still hold when we have
P(Ai |AS2 ) ≤ P(Ai ).
This generalization is useful when we do not have mutual independence between
events.
Definition 46 (Negative dependency digraph) The negative dependency digraph
(V, E) of events A1 , . . . , An is a directed graph on n nodes, where for i = 1, . . . , n, Di
the children of node i, then (6) holds.
A dependency digraph is a negative dependency digraph, but the converse is not
true. Thus, the Local Lemma for negative dependency digraphs generalizes the Local
Lemma. This is also called the Lopsided Local Lemma. It first appeared in the paper
of Erd¨os and Spencer titled ‘Lopsided Lov´asz Local Lemma and Latin transversals’.
Lemma 47 (The Lopsided Local Lemma) For events A1 , . . . An with negative dependency digraph (V, E), with the same hypothesis as the Local Lemma, the same
conclusion holds.
Proof: Same as the proof of the Local Lemma, with the equality in (??) replaced
by an inequality. What does (6) mean? In words, this says that event Ai is less likely to occur if its
non-neighbors do not occur. One can rewrite (6) in the form of a correlation inequality
_
_
P(Ai )P(
Aj ) ≤ P(Ai ∧
Aj ),
j6∈Di
j6∈Di
W
that is, it says that the events Ai and j6∈Di Aj are positively correlated (or more
precisely, have non-negative correlation).
34
9.1.1
Example: counting derangements
A derangement is a permutation π with no fixed points, that is, there is no i ∈ [n]
such that π(i) = i. Let Dn be the number of derangements of n. While there are
exact counts for Dn , we can use the Lopsided Local Lemma to get a lower bound
on its asymptotic. While the result is weak, this example shows how the Lopsided
version can succeed while the Local Lemma fails.
Lemma 48
lim
n→∞
1
Dn
≥ .
n!
e
Proof: Let π be a permutation chosen uniformly at random. Let
V Ai denote the
event that i is a fixed point of π. A derangement is thus the event ni=1 Ai .
Unfortunately the Local Lemma fails here, since no pair of events Ai , Aj are independent
1
1
(n − 2)!
=
6= P(Ai )P(Aj ) = 2 .
P(Ai ∧ Aj ) =
n!
n(n − 1)
n
Thus there is no mutual independence between events. On the other hand, we claim
that the graph with n vertices and no edges is a negative dependency graph for this
case. That is, for all subsets Sk of k elements in n, and for all i ∈ [n], i 6∈ Sk ,
P(Ai |ASk ) ≤ P(Ai ).
One can establish this by counting. (Homework. Hint: start with the correlation
inequality form). The intuitive idea is simple: if k other elements are not fixed, then
one of them has a positive probability of being mapped to i, and in that event, i
cannot be a fixed point. Thus the conditional probability is strictly smaller.
Apply the Lopsided Local Lemma with xi =
1
n
(the smallest possible values), we have
n
^
n
Y
1
1
P( Ai ) ≥
(1 − ) = (1 − )n → e−1
n
n
i=1
i=1
as n → ∞. In comparison, the correct value for finite n is
n
^
n
Dn X (−1)k
n! 1
P( Ai ) =
=
= b + c.
n!
k!
e
2
i=1
k=0
35
9.2
Lower bound for Ramsey number
In our examples so far we have used ‘ready-made’ version of the Local Lemma - that
is, in particular, we did not have to choose the xi ’s. In general one can formulate
this as an optimization problem, and choose the optimal xi ’s this way. (The special
cases of the Local Lemma come with choices such as xi = 2P(Ai ) for the summation
1
for the symmetric version).
version, or xi = d+1
9.2.1
Example: lower bound for R(k, 3)
Proposition 49 There exists a constant C ≈
1
27
such that
R(k, 3) > Ck / log2 k
2
This is not far from the best current bound, which is
R(k, 3) > C 0 k 2 / log k.
Proof: As before, color edges blue with probability p, red with probability 1 − p.
There are two types of bad events: let AT be the event of a triangle T being blue, BS
be the event of a clique S being red. Then
k
P(A ) = p3 , P(B ) = (1 − p)(2) .
T
S
By the mutual independence principle, each event is mutually independent of all but
events that share an edge with it.
So an event AT is adjacent to 3(n−3) < 3n other A events, and at most nk B events.
An event BS is adjacent to at most k2 (n − 2) < k 2 n/2 A events, and nk other B
events.
Since the events A and B are symmetric (in the sense that there exists a relabeling of
the graph that takes one A-event to another), we try to find xi ’s such that all events
A have the same xi = x, and all events B have the same xi = y. Thus, we need to
find p ∈ (0, 1), and real numbers x and y such that
n
p3 ≤ x(1 − x)3n (1 − y)(k ) ,
and
k
n
2
(1 − p)(2) ≤ y(1 − x)k n/2 (1 − y)(k ) .
If there exist such p, x and y, then R(k, 3) > n by the Local Lemma.
It turns out (after optimizing for the largest possible n) that the optimum is reached
with
n
−1/2
3/2
p = c1 n
, x = c3 /n , y = c4 /
,
k
which gives R(k, 3) > Ck 2 / log2 k. 36
10
Lecture 10: Correlation inequalities
The condition of negative dependency digraph can be rewritten as
_
_
P(Ai )P(
Aj ) ≤ P(Ai ∧
Aj ),
j6∈Di
j6∈Di
which states that two events are positively correlated. To apply the Lopsided Local
Lemma, one needs to establish this inequality. In the example on derangement, we
obtained this by counting. In general, there are general conditions which imply that
two events are correlated. Perhaps the most famous is the FKG inequality. In the next
two lectures we state and prove this inequality and its extension, the Four Functions
theorem, their applications in percolation, and the XYZ theorem.
10.1
Order inequalities
Theorem 50 (Chebychev’s order inequality) Let f, g : R → R be non-decreasing
functions. Let X be a random variable distributed according to probability measure µ.
Ef (X)Eg(X) ≤ E(f (X)g(X)).
(7)
Equality occurs when either f or g is a constant. The inequality in intuitive: if f (X)
and g(X) are increasing functions of a common variable X, then they are positively
correlated.
A case of particular interest is when µ is a discrete measure
Pn with support on finitely
many values x1 ≤ x2 ≤ . . . ≤ xn , with finite total mass i=1 µ(xi ). Then Chebychev’s
inequality (after normalization) states that
n
X
i=1
f (xi )µ(xi )
n
X
g(xi )µ(xi ) ≤
i=1
n
X
f (xi )g(xi )µ(xi )
i=1
n
X
µ(xi ).
(8)
i=1
The FKG inequality extends the above to the case where the underlying index set is
only partially ordered, as opposed to totally ordered. In particular, the index set is a
finite distributive lattice.
Definition 51 (Finite distributive lattice) A finite distributive lattice (L, <) consists of a finite set L, and a partial order <, for which the two functions ∧ (meet)
and ∨ (join) defined by
x ∧ y := max{z ∈ L : z ≤ x, z ≤ y}
x ∨ y := max{z ∈ L : z ≥ x, z ≥ y}
37
are well-defined and satisfy the distributive laws
x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z)
x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z).
Example 3 Let L = 2[n] be the power set (set of all subsets of [n]). Order the sets
by inclusion (ie: <:=⊂), define x ∧ y = x ∪ y, x ∨ y = x ∩ y. Then this is a finite
distributive lattice.
A finite distributive lattice can be represented as an undirected graph. A totally
ordered set, for example, is a line. Any finite distributive lattice is isomorphic to a
sublattice of (2[n] , ⊂).
Definition 52 Let (L, >) be a finite distributive lattice. Suppose µ : L → R≥0 , and
µ(x)µ(y) ≤ µ(x ∧ y)µ(x ∨ y)
for al x, y ∈ L. Then µ is called a log supermodular function.
Theorem 53 (The FKG inequality) If f, g : L → R are both increasing (or both
decreasing) functions, µ : L → R≥0 is log supermodular, then
!
!
!
!
X
X
X
X
f (x)µ(x)
g(x)µ(x) ≤
f (x)g(x)µ(x)
µ(x) .
(9)
x∈L
x∈L
x∈L
x∈L
Remark: we shall prove the FKG inequality as a corollary of the Four Functions
theorem. We shall prove this theorem by induction. Thus, the FKG inequality also
holds for a countably infinite distributive lattice.
10.2
Example application
In some applications, the sample space Ω comes with a natural partial order. For
example, in models of random graph, for two graphs G, H ∈ Ω, one can define G ≤ H
if every edge in G is present in H. Similarly, in bond percolation, for realizations
ω1 , ω2 , we define ω1 ≤ ω2 if any edge open in ω1 must be open in ω2 .
Consider a probability space (Ω, F, P) where Ω has partial order <. Say that a random
variable N on (Ω, F, P) is increasing if N (ω) ≤ N (ω 0 ) whenever ω ≤ ω 0 . Say that an
event A is increasing if its indicator function is increasing.
Suppose Ω is countable. (Such is the case for Erd¨os-Renyi random graph, or bond
percolation on a graph with countable many edges). The FKG inequality applied to
38
countable distributive lattices states that if X and Y are increasing random variables,
E(X 2 ) < ∞, E(Y 2 ) < ∞, then
E(X)E(Y ) ≤ E(XY ).
In particular, if A and B are increasing events, then
P(A)P(B) ≤ P(A ∩ B).
That is, increasing events are positively correlated.
A typical use of this property is as follows. Consider bond percolation on a graph
G with countably many edges. Let p be the probability of an edge being open. Let
Π1 , . . . , Πk be families of paths in G, Ai be the event that some path in Πi is open.
The Ai ’s are clearly increasing events, so
Pp (
k
\
Ai ) ≥ Pp (A1 )Pp (
i=1
Reiterate this to obtain
Pp (
k
\
Ai ).
i=2
k
\
Ai ) ≥
k
Y
i=1
Pp (Ai ).
i=1
Note the resemblances to the conclusion of the Local Lemma. (This is indeed no
surprise, because the condition for the more general Lopsided Local Lemma is a
correlation inequality).
Here is a concrete application. Let G = (V, E) be an infinite connected graph with
countably many edges. Consider bond percolation on G. For a vertex x, write θ(p, x)
for the probability that x lies in an infinite open cluster. For fixed x, θ(p, x) is an
increasing function in p. Define
pc (x) = sup{p : θ(p, x) = 0.}
If the graph is symmetric, then one may argue that pc (x) = pc (y) for all sites x, y ∈ V ,
and thus is independent of the choice of x. For general graph this is not intuitively
clear.
Theorem 54 The value pc (x) is independent of the choice of x.
Proof: Let x, y ∈ V . Let {x ↔ y} be the event that there is an open path from x
and y, and let {y ↔ ∞} be the event that y lies in an infinite open cluster. These
are both increasing events. So by the FKG inequality,
θ(p, x) ≥ Pp ({x ↔ y} ∩ {y ↔ ∞}) ≥ P(x ↔ y)θ(p, y).
Thus pc (x) ≤ pc (y). The same argument with x and y interchanged gives pc (y) ≤
pc (x). Thus pc (x) = pc (y). We will prove the FKG inequality as a special case of the Four Functions theorem.
39
10.3
The Four Functions theorem
Let (L, <) be a finite distributive lattice. For X, Y ⊂ L, define
X ∧ Y = {x ∧ y : x ∈ X, y ∈ Y }
X ∨ Y = {x ∨ y : x ∈ X, y ∈ Y }.
For a function φ : L → R≥0 and X ⊂ L, define
φ(X) = Σ_{x∈X} φ(x).
Theorem 55 (The Four Functions theorem) Let L be a finite distributive lattice,
and consider α, β, γ, δ : L → R≥0. If for every x, y ∈ L,
α(x)β(y) ≤ γ(x ∧ y)δ(x ∨ y),    (10)
then for every X, Y ⊂ L,
α(X)β(Y) ≤ γ(X ∧ Y)δ(X ∨ Y).    (11)
First we show how this theorem implies the FKG inequality.

Proof:[Proof of the FKG inequality] Since adding a constant to f (or to g) changes both
sides of (9) by the same amount, we may assume f, g ≥ 0. For x ∈ L, define
α(x) = µ(x)f(x),    β(x) = µ(x)g(x),
γ(x) = µ(x),        δ(x) = µ(x)f(x)g(x).
We claim that these four functions satisfy the hypothesis of the Four Functions theorem.
Indeed,
α(x)β(y) = f(x)g(y)µ(x)µ(y)
         ≤ f(x)g(y)µ(x ∧ y)µ(x ∨ y)            (µ is log-supermodular)
         ≤ f(x ∨ y)g(x ∨ y)µ(x ∧ y)µ(x ∨ y)    (f and g are increasing and nonnegative)
         = γ(x ∧ y)δ(x ∨ y).
The Four Functions theorem applied with X = Y = L now gives
(Σ_{x∈L} f(x)µ(x)) (Σ_{x∈L} g(x)µ(x)) = α(L)β(L) ≤ γ(L)δ(L) = (Σ_{x∈L} µ(x)) (Σ_{x∈L} f(x)g(x)µ(x)),
which is exactly the FKG inequality (9). □
11 Lecture 11: Proof of the Four Functions Theorem and Applications of the FKG inequality

11.1 Proof of the Four Functions theorem
(Notes forthcoming)
11.2 Some applications of the Four Functions theorem
The FKG inequality (and more generally, the Four Functions theorem) is an important
and elegant tool for proving correlations between events. Without it, proofs of 'intuitive'
statements may involve very complicated counting. The last problem of Homework 4
is an example. We now consider more examples.
11.2.1 Increasing events
As mentioned in the previous lecture, the FKG inequality can be used to show that
increasing events are positively correlated. We state some special cases here, since we
will need them in the next lecture to prove Janson’s inequality.
Consider an n-element set [n]. Choose the i-th element of this set independently with
probability pi ∈ [0, 1]. Let Pp(A) be the probability that the chosen set is exactly
A ⊆ [n]. For a family A ⊆ 2^[n], let Pp(A) be the probability that the chosen set is a
member of A.
A family of subsets A ⊆ 2^[n] is called monotone increasing if A ∈ A and A ⊆ A′ imply
A′ ∈ A. Define µ : 2^[n] → R≥0 by µ(A) = ∏_{i∈A} pi ∏_{j∉A} (1 − pj) = Pp(A). One can
check that µ is log-supermodular. With this µ and indicator functions f and g (that
is, f(A) = 0 if A ∉ A and f(A) = 1 otherwise, and similarly for g and a family B), the FKG
inequality translates to the following.
Theorem 56 Let A, B ⊆ 2^[n] be two monotone increasing families of subsets of [n].
Then for any p = (p1, . . . , pn) ∈ [0, 1]^n,
Pp (A ∧ B) ≥ Pp (A)Pp (B).
Here is a simple but non-trivial illustration of the above theorem.
Corollary 57 Suppose A1 , . . . , Ak are arbitrary subsets of [n]. Choose a random
subset A of [n] according to p as above. Then
Pp(A intersects each Ai) ≥ ∏_{i=1}^{k} Pp(A intersects Ai).
Proof:[Exercise] Let Ai = {C : C ∩ Ai ≠ ∅}. Then each Ai is a monotone increasing
family. Write
Pp(A intersects each Ai) = Pp(⋀_i Ai),
and recursively apply the previous theorem. □
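A small Monte Carlo check of Corollary 57 (not from the notes): sample a random subset and compare the joint intersection probability with the product. The ground set, the sets Ai, and the inclusion probabilities below are arbitrary choices.

# Monte Carlo check of Corollary 57: a random subset A of [n] (element i
# included independently with probability p_i) intersects each of A_1,...,A_k
# at least as often as the product of the individual probabilities.
import random

n = 8
p = [random.uniform(0.1, 0.6) for _ in range(n)]
sets = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6, 7}]      # arbitrary A_1, A_2, A_3

trials = 200000
hit_all = 0
hit_each = [0] * len(sets)
for _ in range(trials):
    A = {i for i in range(n) if random.random() < p[i]}
    hits = [bool(A & S) for S in sets]
    hit_all += all(hits)
    for j, h in enumerate(hits):
        hit_each[j] += h

lhs = hit_all / trials
rhs = 1.0
for j in range(len(sets)):
    rhs *= hit_each[j] / trials
print(f"P(A meets every A_i) ~ {lhs:.4f} >= product ~ {rhs:.4f}: {lhs >= rhs}")

Up to Monte Carlo error, the estimated joint probability should exceed the product, as the corollary predicts.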
Corollary 58 Consider the Erdős–Rényi random graph G = G(n, p). Then
P(G is planar and Hamiltonian) ≤ P(G is planar) P(G is Hamiltonian).
Proof:[Exercise] Being Hamiltonian is an increasing event, while being planar is a decreasing
event; the FKG inequality applied to one increasing and one decreasing event gives negative
correlation. □

11.2.2 Marica–Schönheim inequality and number theory
For two families of sets A and B, define
A\B = {A\B : A ∈ A, B ∈ B}.
As usual, let |A\B| denote the number of distinct elements of A\B.
Theorem 59 (The MS inequality) For all A ⊆ 2^[n], |A| ≤ |A\A|.
This is a special case of the Four Functions theorem.
Proof: Consider the set lattice 2^[n]. Choose α = β = γ = δ ≡ 1, so that α(T) = |T| for
any family T ⊆ 2^[n]. The hypothesis (10) holds trivially, so the Four Functions theorem
states that for all A, B ⊆ 2^[n],
|A||B| ≤ |A ∧ B||A ∨ B|.
Let B̄ = {[n]\b : b ∈ B} be the family of complements of the members of B. Then |B̄| = |B|,
A ∧ B̄ = A\B, and |A ∨ B̄| = |B\A| (the members of A ∨ B̄ are exactly the complements of the
members of B\A). Therefore
|A||B| = |A||B̄| ≤ |A ∨ B̄||A ∧ B̄| = |B\A||A\B|.
Now choose B = A, and we get the conclusion of the MS inequality. □

The MS inequality, discovered in 1969, is a non-trivial statement. It arose in connection
with the following result in number theory. Note that an integer is squarefree if
it is not divisible by any perfect square greater than 1.
Proposition 60 If 0 < a1 < a2 < . . . < an are all squarefree integers, then
max_{i,j} ai / gcd(ai, aj) ≥ n.
Proof: Let pk be the k-th prime. Since each ai is squarefree, there is a finite set
Si ⊂ N such that
ai = ∏_{k∈Si} pk.
Therefore,
ai / gcd(ai, aj) = ∏_{k∈Si\Sj} pk.
Let A = {S1, . . . , Sn}; the Si are distinct since the ai are. By the MS inequality,
|A\A| = |{Si\Sj}| ≥ |A| = n.
That is, there are at least n different sets of the form Si\Sj, and hence at least
n different integers of the form ai / gcd(ai, aj). Thus, the largest such value must be at
least as large as n. This is the inequality desired. □
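A quick numerical illustration of Proposition 60 (not in the original notes): take any list of distinct squarefree integers and compare max ai/gcd(ai, aj) with n. The specific list below is an arbitrary choice.

# Check Proposition 60 on a small list of distinct squarefree integers.
from math import gcd

def is_squarefree(m):
    d = 2
    while d * d <= m:
        if m % (d * d) == 0:
            return False
        d += 1
    return True

a = [2, 3, 5, 6, 7, 10, 11, 13, 14, 15]      # 0 < a_1 < ... < a_n, all squarefree
assert all(is_squarefree(x) for x in a) and a == sorted(a)

n = len(a)
best = max(a[i] // gcd(a[i], a[j]) for i in range(n) for j in range(n))
print(best, ">=", n, best >= n)              # the maximum is at least n

Following the proof, the at-least-n distinct values of ai/gcd(ai, aj) arise from the distinct sets Si\Sj of prime factors.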
11.2.3 The XYZ Theorem
In the early days of the FKG inequality, an important application was the analysis of
partially ordered sets and sorting algorithms. Many algorithms for sorting numbers
{a1, . . . , an} perform binary comparisons (ai, aj) to successively construct partial
orders P until one reaches a linear ordering. Thus, a fundamental quantity is P(ai > aj | P),
that is, the probability ai > aj given the current partial order P , and assuming that
all linear extensions of P are equally likely.
The XYZ conjecture (1980, Rival and Sands) states that for any partially ordered set
P and any three elements x, y, z ∈ P,
P(x ≤ y ∧ x ≤ z) ≥ P(x ≤ y) P(x ≤ z).
This seems intuitive: if x ≤ y, then x is small, so it is even more likely to be smaller
than z. Thus the events {x ≤ y} and {x ≤ z} are likely to be positively correlated.
This type of reasoning may be misleading. For example, consider the statement
P(x1 < x2 < x4 ∧ x1 < x3 < x4 ) ≥ P(x1 < x2 < x4 )P(x1 < x3 < x4 ).
By the same reasoning, one expects x1 to be small and x4 to be large, and thus the
statement seems true. But in fact it is false. Here is a counterexample by Mallows:
for n = 6, let P = {x2 < x5 < x6 < x3, x1 < x4}. By computation, one finds that
P(x1 < x2 < x4) = 4/15,
but
P(x1 < x2 < x4 | x1 < x3 < x4) = 1/4.
So
P(x1 < x2 < x4 ) > P(x1 < x2 < x4 |x1 < x3 < x4 ),
and rearranging gives
P(x1 < x2 < x4 ∧ x1 < x3 < x4 ) < P(x1 < x2 < x4 )P(x1 < x3 < x4 ).
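Mallows' counterexample is small enough to verify by brute force (this check is not part of the notes): enumerate the linear extensions of P and count. The script below should reproduce the values 4/15 and 1/4 quoted above.

# Brute-force verification of Mallows' counterexample: enumerate all linear
# extensions of P = {x2 < x5 < x6 < x3, x1 < x4} on {x1, ..., x6}.
from itertools import permutations
from fractions import Fraction

relations = [(2, 5), (5, 6), (6, 3), (1, 4)]          # "a before b" constraints

def pos(perm, i):
    return perm.index(i)                              # position of x_i

extensions = [perm for perm in permutations(range(1, 7))
              if all(pos(perm, a) < pos(perm, b) for a, b in relations)]

def prob(event):
    return Fraction(sum(1 for perm in extensions if event(perm)), len(extensions))

A = lambda perm: pos(perm, 1) < pos(perm, 2) < pos(perm, 4)   # x1 < x2 < x4
B = lambda perm: pos(perm, 1) < pos(perm, 3) < pos(perm, 4)   # x1 < x3 < x4

print(len(extensions), prob(A), prob(B))              # 15 extensions, 4/15, 4/15
print(prob(lambda perm: A(perm) and B(perm)) < prob(A) * prob(B))   # True

So P(A ∧ B) = 1/15 is strictly smaller than P(A)P(B) = 16/225, confirming that the naive reasoning fails here.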
The XYZ conjecture was first proved by Graham, Yao and Yao (1980) using a complicated combinatorial argument, which used the MS inequality. The second proof by
Shepp (1982) uses the FKG inequality. We now present this proof.
Theorem 61 (The XYZ theorem) Let P be a partially ordered set with n elements
a1, . . . , an. Consider the probability space of all linear extensions of P, each
extension equally likely. Then
P(a1 ≤ a2 ∧ a1 ≤ a3 ) ≥ P(a1 ≤ a2 )P(a1 ≤ a3 ).
Proof: Fix a large integer m. Let L be a set with elements of the form
x = (x1 , . . . , xn ),
where xi ∈ [m]. Define an order relation ≤ on L as follows: for x, y ∈ L, say x ≤ y if
x1 ≥ y1 and xi − x1 ≤ yi − y1 for all 2 ≤ i ≤ n.
One can show that this makes (L, ≤) a lattice with
(x ∨ y)i = max(xi − x1 , yi − y1 ) + min(x1 , y1 ), for all i ∈ [n],
(x ∧ y)i = min(xi − x1 , yi − y1 ) + max(x1 , y1 ), for all i ∈ [n].
One also needs to show that L is a distributive lattice, that is,
x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z).
This follows from the following identities for three integers a, b, c
min(a, max(b, c)) = max(min(a, b), min(a, c))
max(a, min(b, c)) = min(max(a, b), max(a, c)).
Let us show that these identities imply L is distributive. The i-th component of x ∨ (y ∧ z) is
max(xi − x1, min(yi − y1, zi − z1)) + min(x1, max(y1, z1)).
The i-th component of (x ∨ y) ∧ (x ∨ z) is
min(max(xi − x1, yi − y1), max(xi − x1, zi − z1)) + max(min(x1, y1), min(x1, z1)).
Applying the previous two identities, we see that these two quantities are equal. So
indeed L is a finite distributive lattice.
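As a sanity check (not in the notes), one can verify the distributive law for these operations on random integer tuples; the parameters n, m, and the number of samples below are arbitrary.

# Spot-check, on random integer tuples, that the operations defined above,
#   (x v y)_i = max(x_i - x_1, y_i - y_1) + min(x_1, y_1),
#   (x ^ y)_i = min(x_i - x_1, y_i - y_1) + max(x_1, y_1),
# satisfy the distributive law x v (y ^ z) = (x v y) ^ (x v z).
import random

def join(x, y):
    return tuple(max(xi - x[0], yi - y[0]) + min(x[0], y[0]) for xi, yi in zip(x, y))

def meet(x, y):
    return tuple(min(xi - x[0], yi - y[0]) + max(x[0], y[0]) for xi, yi in zip(x, y))

n, m = 5, 10
for _ in range(10000):
    x, y, z = (tuple(random.randint(1, m) for _ in range(n)) for _ in range(3))
    assert join(x, meet(y, z)) == meet(join(x, y), join(x, z))
print("distributivity holds on all sampled triples")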
Now we connect L with the partial order P through the functions µ, f, g. Define µ
by µ(x) = 1 if xi ≤ xj whenever ai ≤ aj in P , and 0 otherwise. Note that µ puts
positive mass only on tuples x that satisfy the inequalities in P. To show that µ
is log-supermodular, it suffices to check that if µ(x) = µ(y) = 1, then µ(x ∨ y) =
µ(x ∧ y) = 1.
Suppose µ(x) = µ(y) = 1. Let ai ≤ aj in P. Then xi ≤ xj and yi ≤ yj, so
(x∧y)i = min(xi−x1, yi−y1) + max(x1, y1) ≤ min(xj−x1, yj−y1) + max(x1, y1) = (x∧y)j.
So µ(x ∧ y) = 1. By the same argument with min replaced by max, µ(x ∨ y) = 1.
Define f (x) = 1 if x1 ≤ x2 and f (x) = 0 otherwise. Define g(x) = 1 if x1 ≤ x3 and
g(x) = 0 otherwise. These are trivially increasing functions by our definition of order
on L.
The FKG inequality then states (with x chosen uniformly at random from L)
P(x1 ≤ x2 ∧ x1 ≤ x3 ∧ x consistent with P) P(x consistent with P) ≥ P(x1 ≤ x2 ∧ x consistent with P) P(x1 ≤ x3 ∧ x consistent with P),
which is almost what we want, except that a tuple x ∈ L need not have distinct entries,
and hence need not correspond to a linear extension of P. However, as m → ∞, the
fraction of tuples x ∈ L with xi = xj for some i ≠ j tends to 0, and conditioned on
having distinct entries, a uniform tuple consistent with P induces a uniform linear
extension of P. Letting m → ∞, the desired inequality follows. This proves the theorem. □
12 Lecture 12: Janson's inequality

12.1 Motivations and setup
Let Ω be a finite set, and let {Ai : i ∈ I} be subsets of Ω. Choose a random subset R
of Ω by including each element r ∈ Ω independently with probability pr. Define
Bi = {Ai ⊆ R}, the event that every element of Ai was picked. Let Xi be the indicator
of the event Bi, and set X = Σ_{i∈I} Xi. We want upper and lower bounds on
P(X = 0), or equivalently, on P(⋀_{i∈I} B̄i). This is a common setup in many combinatorial
problems, many of which we have seen. (Recall: the mutual independence principle, Local
Lemma examples, random graphs, Chebyshev's inequality examples, etc.)
Consider the lower bound. Suppose the Ai's are all disjoint. Then the Bi are independent
events, so
P(⋀_{i∈I} B̄i) = ∏_{i∈I} P(B̄i).
Define M = ∏_{i∈I} P(B̄i). Now suppose some Ai's overlap. Recall Corollary 57 in the
last lecture, which shows, using the FKG inequality, that the events {R ∩ Ai ≠ ∅} are
positively correlated. By the same argument, one can show that the B̄i are also positively
correlated, and thus
M ≤ P(⋀_{i∈I} B̄i).
Now consider the upper bound. If Ai ∩ Aj = ∅, then Xi and Xj are independent.
Define i ∼ j if i ≠ j and Ai ∩ Aj ≠ ∅,
∆ = Σ_{i∼j} P(Bi ∧ Bj),
and
µ = E(X) = Σ_{i∈I} P(Bi).
Then, by Chebyshev's inequality,
P(⋀_{i∈I} B̄i) = P(X = 0) ≤ Var(X)/(EX)² ≤ (µ + ∆)/µ².    (12)
Janson’s inequality is an improvement on this upperbound in this setup, when P(Bi )
are all small.
Theorem 62 (Janson’s inequality) Let {Bi }, ∆, M , µ be defined as above. Then
^
P( Bi ) ≤ e−µ+∆/2 .
(13)
i∈I
46
If in addition P(Bi) ≤ ε for all i, then
M ≤ P(⋀_{i∈I} B̄i) ≤ M e^{(1/(1−ε)) ∆/2}.    (14)
The two upper bounds are quite similar in practice; the second form is often the more
convenient one. Note that for each i,
P(B̄i) = 1 − P(Bi) ≤ e^{−P(Bi)},
so
M ≤ e^{−µ}.
So if ∆ is small, then the upper bound is very close to the lower bound. (This makes
sense: small ∆ means the covariance terms of X are small, that is, the Xi's are close
to being independent.) In many G(n, p) applications, ε = o(1), ∆ = o(1), and µ → k as
n → ∞. Both bounds then give
P(X = 0) → e^{−k}.
In particular, one can often show (by other methods) that X →d Poisson(k). The
intuition is that X is a sum of almost independent rare events.
This is no longer the case for large ∆ (i.e., far from independence). If ∆ ≥ 2µ,
for example, then the exponent in (13) is nonnegative and the bound is useless. For
somewhat smaller ∆, the following inequality offers an improvement.

Theorem 63 (The extended Janson's inequality) Under the assumptions of the
previous theorem, suppose further that ∆ ≥ µ. Then
P(⋀_{i∈I} B̄i) ≤ e^{−µ²/(2∆)}.    (15)
Compare (12) and (15). Suppose µ → ∞ as |Ω| → ∞ (say, in applications like
random graphs, where we send n, the number of vertices, to infinity).
Suppose µ ≤ ∆, and γ = µ²/∆ → ∞. Then Chebyshev's upper bound scales as
γ^{−1}, while Janson's inequality scales as e^{−γ/2}, a vast improvement. We will revisit
this point next week, when we view the Janson and Chebyshev inequalities as tail bounds
for random variables, in particular for sums of indicators.
12.2 Proofs and generalizations
Janson’s proof of Janson’s inequality bounds the Laplace transform E(e−tX ) of X,
a technique common in proving tail bounds. (If time permits, we will consider this
proof next week). Here we follow the proof of Boppana and Spencer, which uses the
FKG inequality and resembles proof the Local Lemma.
First we need some inequalities
Lemma 64 For arbitrary events A, B, C,
P(A|B ∧ C) ≥ P(A ∧ B|C).
Proof: We have
P(A ∧ B ∧ C) = P(A|B ∧ C) P(B ∧ C) = P(A ∧ B|C) P(C).    (16)
Since P(B ∧ C) ≤ P(C), the inequality follows by rearranging. □

Proof:[Proof of Janson's inequality] As mentioned, the lower bound follows from
Theorem 56 (Exercise). We now prove the two upper bounds. The initial steps are
the same. Let |I| = m. By 'peeling' the intersection from the back, we get
P(⋀_{i∈I} B̄i) = ∏_{i=1}^{m} P(B̄i | ⋀_{1≤j<i} B̄j).    (17)
We shall upper bound each term in this product, or equivalently, lower bound
P(Bi | ⋀_{1≤j<i} B̄j).
By renumbering the events, we may assume that there are integers d1, . . . , dm, with
0 ≤ di ≤ i − 1, such that i ∼ j for 1 ≤ j ≤ di and i ≁ j for di < j < i. If di = i − 1,
then the latter set is empty. Note that this says nothing about the dependence between
i and j for j > i.
Fix an i. Use (16) for A = Bi, B = ⋀_{j=1}^{di} B̄j, C = ⋀_{j=di+1}^{i−1} B̄j. Suppose
di < i − 1, so that C is an event with positive probability (we deal with the other case
later). Then
P(Bi | ⋀_{1≤j<i} B̄j) = P(A | B ∧ C) ≥ P(A ∧ B | C) = P(A|C) P(B | A ∧ C).    (18)
By the mutual independence principle, we have P(A|C) = P(A). For the remaining
term, we bound
P(B | A ∧ C) = P(⋀_{j=1}^{di} B̄j | Bi ∧ C)
             ≥ 1 − Σ_{j=1}^{di} P(Bj | Bi ∧ C)    (union bound)
             ≥ 1 − Σ_{j=1}^{di} P(Bj | Bi)        (FKG).
In the other case (di = i − 1), C is the empty conjunction, and the analogue of (18) is
P(A|B) = P(A) P(B|A) / P(B) ≥ P(A) P(B|A),
since P(B) ≤ 1. By the union bound as argued above,
P(B|A) ≥ 1 − Σ_{j=1}^{di} P(Bj | Bi).
(Similarly, if some conditioning event has zero probability, one can check that the proof
still goes through.)
So far we have
P(Bi | ⋀_{1≤j<i} B̄j) ≥ P(Bi) − Σ_{j=1}^{di} P(Bj | Bi) P(Bi) = P(Bi) − Σ_{j=1}^{di} P(Bj ∧ Bi).
Subtracting this from 1, we get
P(B̄i | ⋀_{1≤j<i} B̄j) ≤ P(B̄i) + Σ_{j=1}^{di} P(Bj ∧ Bi).
For the bound (13), use 1 − x ≤ e^{−x} to get
P(B̄i | ⋀_{1≤j<i} B̄j) ≤ 1 − P(Bi) + Σ_{j=1}^{di} P(Bj ∧ Bi)
                      ≤ exp(−P(Bi) + Σ_{j=1}^{di} P(Bj ∧ Bi)).
Now taking the product over i, we get
P(⋀_{i=1}^{m} B̄i) ≤ exp(−Σ_i P(Bi) + Σ_i Σ_{j=1}^{di} P(Bj ∧ Bi)) = exp(−µ + ∆/2).
For the bound (14), we also use 1 + x ≤ e^x, but at another location:
P(B̄i | ⋀_{1≤j<i} B̄j) ≤ P(B̄i) (1 + (1/(1−ε)) Σ_{j=1}^{di} P(Bj ∧ Bi))    (since P(B̄i) ≥ 1 − ε)
                      ≤ P(B̄i) exp((1/(1−ε)) Σ_{j=1}^{di} P(Bj ∧ Bi)).
Again taking the product over i on both sides, the factors P(B̄i) multiply to M and
the double sum becomes ∆/2, which gives (14). □
Re-examining the proof, we in fact only used the following two properties of the Bj's
(both follow from the FKG inequality):
P(Bi | ⋀_{j∈J} B̄j) ≤ P(Bi),    (19)
valid for all index sets J ⊂ I with i ∉ J, and
P(Bi | Bk ∧ ⋀_{j∈J} B̄j) ≤ P(Bi | Bk),    (20)
valid for all index sets J ⊂ I with i, k ∉ J. So if the Bi are arbitrary events with a
dependency digraph satisfying (19) and (20), then Janson's inequality applies. (Admittedly,
I only know of examples where Janson's inequality applies directly, so this observation
is not hugely useful.)
12.2.1 Proof of the Extended Janson's inequality
Proof:[Proof of the Extended Janson's inequality, Exercise] The Extended Janson's
inequality has a probabilistic(!) proof. Here is the idea. For any subset S ⊂ I,
P(⋀_{i∈I} B̄i) ≤ P(⋀_{i∈S} B̄i).
Thus we can choose a random S, start with (13) applied to S, and optimize over S to
get a new upper bound for the entire collection I. (Without hindsight, it does seem
strange that this argument even works.)
Substituting in the definitions of µ and ∆ and taking logs, (13) applied to the subfamily
{Bi : i ∈ S} gives
−log P(⋀_{i∈S} B̄i) ≥ Σ_{i∈S} P(Bi) − (1/2) Σ_{i,j∈S: i∼j} P(Bi ∧ Bj).
This inequality holds for every subset S ⊂ I. Now let S be a random subset in which
each i ∈ I is included independently with probability p, where we will optimize p later.
Each term P(Bi) then appears with probability p, and each term P(Bi ∧ Bj) appears with
probability p². So
E(−log P(⋀_{i∈S} B̄i)) ≥ pµ − p²∆/2.
The right-hand side is maximized at p = µ/∆. By the assumption of the theorem, ∆ ≥ µ,
so p ≤ 1. Thus we get
E(−log P(⋀_{i∈S} B̄i)) ≥ µ²/(2∆).
So there must be a subset S ⊂ I for which
−log P(⋀_{i∈S} B̄i) ≥ µ²/(2∆),
and therefore
P(⋀_{i∈I} B̄i) ≤ P(⋀_{i∈S} B̄i) ≤ exp(−µ²/(2∆)).
This completes the proof. □
13 Lecture 13: Brun's sieve, and some examples

13.1 Example: Threshold of Golomb rulers
Recall that we used first and second moment methods to prove thresholds for random
graphs. This really boils down to upper and lower bounding P(X = 0) for some
counting random variable X. Thus, we can also use Janson’s inequality to prove
thresholds. Here is an example.
A set S ⊂ N is called a Golomb ruler (also known as a Sidon set) if all pairwise
differences of its elements are distinct. If we think of S as marks on a line, this
condition states that no two pairs of marks are the same distance apart. An initial
motivation for Golomb rulers was to select radio (Fourier) frequencies so as to minimize
interference.
Suppose we choose a random S ⊆ [n] by picking elements of [n] independently with
probability p. Heuristically, the smaller p is, the more likely S is to be a Golomb
ruler. How large can p be so that S is still very likely to be a Golomb ruler?
Proposition 65 Assume p = o(n^{−2/3}). If p ≫ n^{−3/4} (that is, pn^{3/4} → ∞), then
P(S is a Golomb ruler) → 0. If p ≪ n^{−3/4} (that is, pn^{3/4} → 0), then
P(S is a Golomb ruler) → 1.
Proof: For a given quadruple x = (a, b, c, d) of elements of [n] with a − b = c − d, let
Ax be the event that all of a, b, c, d lie in S. Let N be the number of such quadruples.
For large n, one can show that N ∼ αn³ for some constant α. (The idea: once we pick
three of the integers, the last one in the quadruple is determined. One needs to check
that this last element lies in [n] and is not one of the integers already picked, but the
number of invalid choices is comparatively small, so N scales as αn³.)
Let X = Σ_x 1_{Ax} be the number of bad quadruples. With µ = p⁴N, Janson's inequality
gives
−µ(1 + o(1)) ≤ log P(X = 0) ≤ −µ + ∆/2,
where the lower bound is log M = N log(1 − p⁴) ∼ −µ, since p → 0.
Consider ∆. Two distinct quadruples x ∼ y share j integers for some 1 ≤ j ≤ 3
(quadruples determining the same event are identified). If they share j integers, then
P(Ax ∧ Ay) = p^{8−j}. By the same heuristic used for counting N, the number of pairs of
quadruples which share j integers is of order n^{8−j−2}. Writing i = 8 − j, ∆ is thus a
sum of the form
∆ = Σ_{i=5}^{7} αi p^i n^{i−2}
for some constants αi. Since pn^{2/3} → 0 by assumption, every term is o(p⁴n³), so
∆ ≪ µ. So by Janson's inequality,
log P(X = 0) = log P(S is a Golomb ruler) ∼ −µ = −αp⁴n³.
The conclusion follows: if pn^{3/4} → ∞ then p⁴n³ → ∞ and P(X = 0) → 0, while if
pn^{3/4} → 0 then p⁴n³ → 0 and P(X = 0) → 1. □
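The threshold can also be seen (roughly) in simulation; the following Monte Carlo sketch is not part of the notes, and the choices of n, the trial count, and the two values of p straddling n^(−3/4) are arbitrary.

# Monte Carlo illustration of the threshold in Proposition 65.
import random

def is_golomb(S):
    S = sorted(S)
    diffs = set()
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            d = S[j] - S[i]
            if d in diffs:
                return False
            diffs.add(d)
    return True

def estimate(n, p, trials=2000):
    hits = 0
    for _ in range(trials):
        S = [x for x in range(1, n + 1) if random.random() < p]
        hits += is_golomb(S)
    return hits / trials

n = 2000
thr = n ** (-0.75)
for p in (0.2 * thr, 5 * thr):
    print(f"p = {p:.2e}: estimated P(S is a Golomb ruler) = {estimate(n, p):.3f}")

For p well below n^(−3/4) the estimate is close to 1, and for p well above it the estimate is close to 0, as the proposition predicts.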
13.2 Example: triangles in G(n, p)
Let n ≥ 3. Let X be the number of triangles in G(n, p).
Lemma 66 (Exercise) If p = c/n, then as n → ∞, the probability that the graph
does not contain any triangle tends to e^{−c³/6}.
Proof: Let Ai be the event that the i-th triple of vertices forms a triangle. Then
P(Ai) = p³ and µ = E(X) = (n choose 3) p³. Two distinct triangles share at most one
edge; for triangles i ∼ j sharing an edge,
P(Ai ∩ Aj) = p⁵.
A pair of edge-sharing triangles spans four vertices, so
∆ = Σ_{i∼j} P(Ai ∩ Aj) = Θ(n⁴ p⁵).
By Janson’s inequality,
Y
(1 − p3 ) ≤ P(X = 0) ≤ e−µ+∆/2 .
i
So if p = o(n−1/2 ), that is, pn1/2 → 0, then
log P(X = 0) ∼ −µ.
Plug in p = cn−1 and simplify, we have the desired statement. 13.3
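A quick simulation (not part of the notes) makes Lemma 66 concrete; the values of n, c, and the number of trials below are arbitrary choices, and n here is only moderately large, so agreement with the limit is approximate.

# Monte Carlo check of Lemma 66: for p = c/n, the probability that G(n, p)
# is triangle-free should be close to exp(-c^3/6) once n is moderately large.
import math
import random
from itertools import combinations

def has_triangle(n, p):
    adj = [[False] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if random.random() < p:
            adj[u][v] = adj[v][u] = True
    return any(adj[u][v] and adj[v][w] and adj[u][w]
               for u, v, w in combinations(range(n), 3))

n, c, trials = 50, 1.5, 1000
p = c / n
empirical = sum(not has_triangle(n, p) for _ in range(trials)) / trials

mu = math.comb(n, 3) * p ** 3
M = (1 - p ** 3) ** math.comb(n, 3)               # Janson lower bound
print(f"empirical P(no triangle) = {empirical:.3f}")
print(f"limit exp(-c^3/6)        = {math.exp(-c ** 3 / 6):.3f}")
print(f"Janson bounds: M = {M:.3f}, e^(-mu) = {math.exp(-mu):.3f}")

(The upper bound e^(−µ+∆/2) is omitted here; for these parameters ∆ = Θ(n⁴p⁵) is already small.)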
Brun’s sieve
Poisson approximation was the original motivation for Janson's inequality. In general,
there are two standard ways to prove that a sequence of random variables Xn converges
to a Poisson limit: convergence of characteristic functions (equivalently, of moment
generating functions), and the Chen–Stein method.
Since the moment generating function of a random variable X is
E(e^{tX}) = Σ_{r=0}^{∞} t^r E(X^r) / r!,
one would hope that if one can show
E(Xn^r) → (the r-th moment of Poisson(µ)),
then this implies Xn →d Poisson(µ). Brun's sieve uses this idea, with the small twist
that one should consider the limit of E((Xn choose r)) instead, since the r-th moment of
a Poisson random variable is ugly, while E((Poisson(µ) choose r)) = µ^r / r!. The use of
Brun's sieve to prove Poisson convergence is also called the moment method.
Theorem 67 Suppose {Xn} is a sequence of nonnegative integer-valued random variables,
each a sum of indicator variables. Suppose that there exists a constant µ such that, for
every fixed r ≥ 1,
Er(Xn) := E(Xn(Xn − 1)(Xn − 2) · · · (Xn − r + 1)) → µ^r.
Then for every t, as n → ∞,
P(Xn = t) → (µ^t / t!) e^{−µ}.
There are several ways to prove this statement. One is by inclusion-exclusion (hence
the name sieve). Another is via moment generating functions, invoking general theorems
in probability such as the Lévy convergence theorem; see, for example,
Kallenberg [Chapter 4]. We skip the proof and consider an application.
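Before the application, here is a standard illustration of Theorem 67 that is not from the lecture: let Xn be the number of fixed points of a uniform random permutation of [n]. A short computation shows E(Xn(Xn − 1) · · · (Xn − r + 1)) = 1 for every r ≤ n, so the theorem gives P(Xn = t) → e^{−1}/t!. The simulation below (with arbitrary n and trial count) compares the empirical factorial moments and distribution with these limits.

# Brun's sieve illustration: fixed points of a uniform random permutation.
import math
import random
from collections import Counter

def fixed_points(n):
    perm = list(range(n))
    random.shuffle(perm)
    return sum(1 for i, v in enumerate(perm) if i == v)

n, trials = 100, 50000
counts = Counter(fixed_points(n) for _ in range(trials))

for r in (1, 2, 3):        # empirical factorial moments E_r, all close to 1
    e_r = sum(c * math.prod(x - k for k in range(r)) for x, c in counts.items()) / trials
    print(f"E_{r} ~ {e_r:.3f}")

for t in range(4):         # empirical distribution vs the Poisson(1) limit
    print(f"P(X = {t}): empirical {counts[t] / trials:.3f}, "
          f"limit {math.exp(-1) / math.factorial(t):.3f}")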
13.3.1 Example: EPIT
In G(n, p), let EPIT stand for the property that every vertex is in a triangle.
Theorem 68 For fixed c > 0 and p = p(n), µ = µ(n), assume that
(n−1 choose 2) p³ = µ,    e^{−µ} = c/n.
Let X = X(n) be the number of vertices of G(n, p) not lying in a triangle. Then
lim_{n→∞} P(G(n, p) is EPIT) = e^{−c}.
Proof: Our goal is to apply Brun's sieve. Let Xx be the indicator of the event that the
vertex x is not in any triangle. Then X = Σ_x Xx. We need to find the limits of E(X) and
E((X choose r)). We will do this by Janson's inequality.
Fix a vertex x ∈ V. For each pair of distinct vertices y, z ∈ V \ {x}, let Bxyz be the
event that the triangle on {x, y, z} is present. Let Cx be the event that x is not in any
triangle, that is,
Cx = ⋀_{y,z} B̄xyz.
Then Xx is the indicator of Cx.
Let us use Janson's inequality to compute E(Xx). Here the events are the Bxyz over pairs
y, z. By the definition of µ, we have
Σ_{y,z} P(Bxyz) = µ.
Let us compute ∆ for these events. We have xyz ∼ xuv if and only if {y, z} ∩ {u, v} ≠ ∅. So
∆ = Σ_{y,z,z′} P(Bxyz ∧ Bxyz′) = O(n³p⁵) = o(1),
since p = n^{−2/3+o(1)}. So by Janson's inequality,
E(Xx) ∼ e^{−µ} = c/n.
Thus
E(X) = Σ_x E(Xx) ∼ c.
Fix r. Then
E((X choose r)) = Σ P(Cx1 ∧ · · · ∧ Cxr),
where the sum is over all sets of r vertices {x1, . . . , xr}. These events are symmetric, so
E((X choose r)) = (n choose r) P(Cx1 ∧ · · · ∧ Cxr) ∼ (n^r / r!) P(C1 ∧ · · · ∧ Cr).
But
P(C1 ∧ · · · ∧ Cr) = P(⋀_{i,y,z} B̄iyz),
where the conjunction runs over 1 ≤ i ≤ r and over all pairs y, z. Again, we apply
Janson's inequality to this set of events. The number of distinct triples {i, y, z} is
r (n−1 choose 2) − O(n), where the overcount comes from triples containing two of the
vertices 1, . . . , r. So
Σ P(Biyz) = p³ (r (n−1 choose 2) − O(n)) = rµ + O(n^{−1+o(1)}).
As before, ∆ is p⁵ times the number of pairs iyz ∼ jy′z′. There are O(rn³) = O(n³) such
pairs with i = j and O(r²n²) = O(n²) pairs with i ≠ j, so ∆ = o(1). Therefore,
P(C1 ∧ · · · ∧ Cr) ∼ e^{−rµ} = (c/n)^r,
and
E((X choose r)) ∼ c^r / r!,
as needed. By Theorem 67, P(X = 0) → e^{−c}, which is the statement of the theorem. □