MATS423
OPTIMAL MASS TRANSPORTATION
FALL 2014

Foreword
These are the lecture notes for the course Optimal Mass Transportation given at the University of Jyväskylä in the Fall of 2014. The course aims at providing the basics of optimal mass transportation for students who are familiar with basic abstract measure theory.
– —— –
In the course we study the Monge and Kantorovich formulations of optimal mass transportation, existence and uniqueness of optimal transport maps, the Wasserstein distance, a brief introduction to functionals and gradient flows in Wasserstein spaces, and Ricci curvature lower bounds in metric spaces via optimal mass transportation.
– —— –
The lecture notes can be found (with a possible delay) on the course website
http://users.jyu.fi/~tamaraja/MATS423/
Version: November 6, 2014.
0. Introduction
The study of optimal mass transportation has a long history, dating back to Gaspard Monge and his 1781 publication Mémoire sur la théorie des déblais et des remblais. The problem he addresses is the following: suppose you have a certain amount of soil, taken from the ground at different locations, that you want to transport to construction sites. Because transporting the soil takes a lot of resources, one wants to determine where it is optimal to send which part of the extracted soil. When faced with the problem, one wonders what the cost of transporting the soil should be. Monge considered the cost to be the distance times the mass of the soil transported.
Many years later Leonid Kantorovich tackled similar problems, this time arising in various areas of economics. In 1975 Kantorovich was awarded the Nobel Prize in economics, together with Tjalling Koopmans, for their contributions to the theory of optimum allocation of resources. Kantorovich also introduced a distance between measures coming from the optimal transport problem. This distance, which we shall study in Section 2, has many names: it is called the Kantorovich–Rubinstein distance, the Wasserstein distance, the $L^p$ transportation distance, the Prokhorov distance, and so on.
The problem of finding the optimal way to transport the mass (soil, goods, ...) is nowadays called the Monge–Kantorovich problem. In the first part of the course we will formulate the Monge–Kantorovich problem in $\mathbb{R}^n$, introduce a useful dual formulation of the problem, and study the existence and uniqueness of the solution to the Monge–Kantorovich problem. After this we will study optimal mass transportation in the more general setting of metric spaces, define there the Kantorovich–Rubinstein distance, and, as time permits, study a bit of gradient flows and Ricci curvature in metric spaces.
1. Optimal mass transportation in $\mathbb{R}^n$
In optimal mass transportation the mass is usually understood as a Borel probability measure. The assumption that the measures are probability measures is just a normalization; however, the measures should have the same total mass in order to make our formulation of the problem reasonable. Let us recall some definitions from measure theory.
Definition 1.1. The Borel σ-algebra $\mathcal{B}(\mathbb{R}^n)$ is the σ-algebra generated by the open sets of $\mathbb{R}^n$. In other words, it is the smallest set $\Sigma \subset 2^{\mathbb{R}^n}$ with the properties
(1) $\Sigma \ne \emptyset$,
(2) $A \in \Sigma \Rightarrow \mathbb{R}^n \setminus A \in \Sigma$, and
(3) $(A_i)_{i\in\mathbb{N}} \subset \Sigma \Rightarrow \bigcup_{i\in\mathbb{N}} A_i \in \Sigma$.
Definition 1.2. A Borel probability measure $\mu$ is a function $\mu : \mathcal{B}(\mathbb{R}^n) \to [0,1]$ with the properties
(1) $\mu(\emptyset) = 0$,
(2) $\mu(\mathbb{R}^n) = 1$, and
(3) $\mu\big(\bigcup_{i\in\mathbb{N}} A_i\big) = \sum_{i\in\mathbb{N}} \mu(A_i)$ for all pairwise disjoint collections $\{A_i\} \subset \mathcal{B}(\mathbb{R}^n)$.
We denote the space of all Borel probability measures on $\mathbb{R}^n$ by $\mathcal{P}(\mathbb{R}^n)$. For any closed $C \subset \mathbb{R}^n$ we denote $\mathcal{P}(C) := \{\mu \in \mathcal{P}(\mathbb{R}^n) : \mu(C) = 1\}$.
Definition 1.3. A mapping $f : \mathbb{R}^n \to \mathbb{R}^m$ is Borel measurable if $f^{-1}(A) \in \mathcal{B}(\mathbb{R}^n)$ for all $A \in \mathcal{B}(\mathbb{R}^m)$.

Suppose $T : \mathbb{R}^n \to \mathbb{R}^m$ is Borel measurable and $\mu \in \mathcal{P}(\mathbb{R}^n)$. Then the pushforward of $\mu$ through $T$ is the measure $T_\sharp\mu \in \mathcal{P}(\mathbb{R}^m)$ defined as
$$T_\sharp\mu(A) = \mu(T^{-1}(A)) \quad \text{for all } A \in \mathcal{B}(\mathbb{R}^m).$$
Notice that by the Borel measurability of $T$, the pushforward measure is indeed a Borel measure.
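For readers who like to experiment, here is a minimal numerical sketch of the pushforward for a finitely supported measure; the function name and the example map are ours, not part of the notes.

```python
import numpy as np

def pushforward(points, weights, T):
    # T_# mu: the mass weights[i] sitting at points[i] is moved to T(points[i]).
    images = np.array([T(x) for x in points])
    return images, weights

# Example: push mu = (1/2) delta_0 + (1/2) delta_1 through T(x) = 2x + 1,
# giving T_# mu = (1/2) delta_1 + (1/2) delta_3.
pts, w = pushforward(np.array([0.0, 1.0]), np.array([0.5, 0.5]), lambda x: 2 * x + 1)
print(pts, w)  # [1. 3.] [0.5 0.5]
```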
Let us now list some basic examples of Borel probability measures that will help us understand different aspects of optimal transportation.

Examples 1.4. (i) Measures $\mu \in \mathcal{P}(\mathbb{R}^n)$ that are absolutely continuous with respect to the Lebesgue measure $\mathcal{L}^n$. In other words, $\mu = f\mathcal{L}^n$ with $f : \mathbb{R}^n \to [0,\infty)$ Borel measurable and $\|f\|_1 = \int_{\mathbb{R}^n} |f(x)|\,d\mathcal{L}^n(x) = 1$. The notation '$\mu = f\mathcal{L}^n$' means
$$\mu(A) = \int_A f(x)\,d\mathcal{L}^n(x) \quad \text{for all } A \in \mathcal{B}(\mathbb{R}^n).$$
(ii) Dirac measures $\delta_x \in \mathcal{P}(\mathbb{R}^n)$ for $x \in \mathbb{R}^n$, defined by
$$\delta_x(A) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A, \end{cases}$$
and their combinations $\mu = \sum_{i\in\mathbb{N}} a_i\delta_{x_i}$ with weights $a_i \in [0,1]$ satisfying $\sum_{i\in\mathbb{N}} a_i = 1$.
(iii) Hausdorff measures $\mathcal{H}^s$, defined as
$$\mathcal{H}^s(A) = \lim_{\delta\searrow 0}\inf\left\{\sum_{i\in\mathbb{N}}\operatorname{diam}(A_i)^s : A \subset \bigcup_{i\in\mathbb{N}} A_i \text{ and } \operatorname{diam}(A_i) < \delta \text{ for all } i\in\mathbb{N}\right\},$$
weighted with a Borel measurable function $f$ as in (i) such that $f\mathcal{H}^s \in \mathcal{P}(\mathbb{R}^n)$. (In particular $f \ne 0$ on a set of positive and finite $\mathcal{H}^s$-measure.)
We will use the weak topology on the space $\mathcal{P}(\mathbb{R}^n)$. Later on we will see that the $L^p$ transportation distances metrize the weak topology. Let us recall the definition of weak convergence.
Definition 1.5. Let $\mu_k, \mu \in \mathcal{P}(\mathbb{R}^n)$. We say that $\mu_k$ converges weakly to $\mu$ if
$$\int_{\mathbb{R}^n}\varphi\,d\mu_k \to \int_{\mathbb{R}^n}\varphi\,d\mu$$
for all $\varphi \in C_b(\mathbb{R}^n) = \{\phi : \mathbb{R}^n \to \mathbb{R} \text{ continuous and bounded}\}$.
Recall that the weak convergence of $\mu_k$ to $\mu$ is equivalent to requiring that
$$\limsup_{k\to\infty}\mu_k(K) \le \mu(K) \quad \text{for all closed sets } K \subset \mathbb{R}^n,$$
as well as equivalent to requiring that
$$\liminf_{k\to\infty}\mu_k(U) \ge \mu(U) \quad \text{for all open sets } U \subset \mathbb{R}^n.$$
1.1. Monge and Kantorovich formulations of the problem. Let $c : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$ be Borel measurable. We will call this function the cost function. In Monge's original work the cost function was $c(x,y) = \|x-y\|$. Now we are ready to formulate

Monge's formulation of the optimal transport problem.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$. Minimize
$$T \mapsto \int_{\mathbb{R}^n} c(x,T(x))\,d\mu(x)$$
over all transport maps $T$ from $\mu$ to $\nu$, i.e. over all maps $T$ such that $T_\sharp\mu = \nu$.
It is easy to see that in this generality Monge's formulation can be ill-posed. The simplest way this can happen is if $\mu = \delta_x$ and $\nu = \frac12(\delta_y + \delta_z)$ with $y \ne z$. Now for every map $T$ we have $T_\sharp\mu = \delta_{T(x)} \ne \nu$. In other words, it might be that there are no transport maps to minimize over.

The next example, which is a slight modification of the previous example on Dirac measures, shows that even when there exist transport maps, the minimizer might not exist. This shows that the condition $T_\sharp\mu = \nu$ is not weakly sequentially closed (in any applicable weak topology).
Example 1.6. Define two measures on the plane as
$$\mu = \mathcal{H}^1|_{\{0\}\times[0,1]} \quad\text{and}\quad \nu = \frac12\left(\mathcal{H}^1|_{\{-1\}\times[0,1]} + \mathcal{H}^1|_{\{1\}\times[0,1]}\right)$$
and suppose that the cost function is $c(x,y) = \|x-y\|$. Now
$$T_1(0,x) = \begin{cases}(1,\,2x), & \text{if } x \le \frac12 \\ (-1,\,2x-1), & \text{if } x > \frac12\end{cases}$$
transports $\mu$ to $\nu$, so there at least exist maps transporting $\mu$ to $\nu$.
Moreover, for the transport maps
$$T_n(0,x) = \begin{cases}\left(1,\,2x-\frac{k}{2n}\right), & \text{if } \frac{k}{2n} < x \le \frac{k+1}{2n} \text{ and } k \text{ is odd} \\ \left(-1,\,2x-\frac{k}{2n}\right), & \text{if } \frac{k}{2n} < x \le \frac{k+1}{2n} \text{ and } k \text{ is even}\end{cases}$$
we have
$$\int_{\mathbb{R}^2}\|x - T_n(x)\|\,d\mu(x) \to 1.$$
[Figure: the transport maps $T_1, T_2, T_4, \dots$]
However, no transport map realizes the transport cost 1. Such a mapping $T$ should transport a.e. horizontally, which is impossible.
The ill-posedness of the optimal transport problem was removed by Kantorovich by considering more general transports.

Kantorovich's formulation of the optimal transport problem.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$. Minimize
$$\sigma \mapsto \int_{\mathbb{R}^n\times\mathbb{R}^n} c(x,y)\,d\sigma(x,y)$$
over all transport plans $\sigma$ from $\mu$ to $\nu$, i.e. over all measures $\sigma\in\mathcal{P}(\mathbb{R}^n\times\mathbb{R}^n)$ for which
$$\sigma(A\times\mathbb{R}^n) = \mu(A) \quad\text{and}\quad \sigma(\mathbb{R}^n\times A) = \nu(A) \quad\text{for all } A\in\mathcal{B}(\mathbb{R}^n).$$

We denote the set of transport plans from $\mu$ to $\nu$ by $\mathcal{A}(\mu,\nu)$. Any transport map $T$ from $\mu$ to $\nu$ naturally induces a transport plan $\sigma = (\mathrm{id}\times T)_\sharp\mu$. Therefore the set of transport plans always includes all transport maps. Moreover, $\mu\times\nu\in\mathcal{A}(\mu,\nu)$, so $\mathcal{A}(\mu,\nu)\ne\emptyset$.
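For finitely supported $\mu$ and $\nu$, Kantorovich's problem is a finite linear program over matrices with prescribed marginals. The following sketch solves it with SciPy's LP solver; the helper name kantorovich and the two-point example are ours, not part of the notes.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(mu, nu, C):
    # Minimize sum_ij C[i,j] * P[i,j] over plans P >= 0 whose row sums are mu
    # and whose column sums are nu (the discrete Kantorovich problem).
    m, n = C.shape
    A = np.zeros((m + n, m * n))
    for i in range(m):
        A[i, i * n:(i + 1) * n] = 1.0   # row sums = mu
    for j in range(n):
        A[m + j, j::n] = 1.0            # column sums = nu
    res = linprog(C.ravel(), A_eq=A, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

# Two points to two points with squared-distance cost; every plan costs 1.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
cost, plan = kantorovich(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C)
print(cost)  # 1.0
```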
Example 1.7. Let us revisit Example 1.6. The minimizing sequence $(T_n)$ induces a sequence of transport plans $(\sigma_n)$, $\sigma_n = (\mathrm{id}\times T_n)_\sharp\mu$. Define
$$\sigma = \frac12\left((\mathrm{id}\times R)_\sharp\mu + (\mathrm{id}\times L)_\sharp\mu\right)$$
with $R(x,y) = (x+1,y)$ and $L(x,y) = (x-1,y)$. Now $\sigma$ is a minimizer. Notice also that not only
$$\int_{\mathbb{R}^2\times\mathbb{R}^2}\|x-y\|\,d\sigma_n(x,y) \to \int_{\mathbb{R}^2\times\mathbb{R}^2}\|x-y\|\,d\sigma(x,y),$$
but also
$$\int_{\mathbb{R}^2\times\mathbb{R}^2}\varphi(x,y)\,d\sigma_n(x,y) \to \int_{\mathbb{R}^2\times\mathbb{R}^2}\varphi(x,y)\,d\sigma(x,y)$$
for all $\varphi\in C_b(\mathbb{R}^2\times\mathbb{R}^2)$. In other words, $\sigma_n$ converges to $\sigma$ weakly.
One might wonder how generally Monge's and Kantorovich's formulations are the same in the sense that the infima in the two problems agree. It can be shown that they agree for example when the starting measure $\mu$ has no atoms, i.e. $\mu(\{x\}) = 0$ for all $x\in\mathbb{R}^n$, and the cost function $c$ is continuous.
1.2. Existence of optimal transport plans. The proof of the existence of optimal transport plans, i.e. transport plans minimizing Kantorovich's optimal transport problem, follows the basic scheme in variational problems. The ingredients are
(1) the lower semicontinuity of $\sigma \mapsto \int_{\mathbb{R}^n\times\mathbb{R}^n} c(x,y)\,d\sigma(x,y)$, and
(2) the compactness of $\mathcal{A}(\mu,\nu)$.
Let us start with (1). It is clear that in general the lower semicontinuity cannot hold. In
order to obtain lower semicontinuity for the transport cost we assume lower semicontinuity
of the cost function c. Let us first recall what is meant by lower semicontinuity.
Definition 1.8. Let $X$ be a topological space and $f : X\to\mathbb{R}\cup\{-\infty,\infty\}$. The function $f$ is lower semicontinuous at $x_0\in X$ if for every $\varepsilon>0$ there exists a neighbourhood $U$ of $x_0$ such that $f(x)\ge f(x_0)-\varepsilon$ for every $x\in U$. The function $f$ is called lower semicontinuous if it is lower semicontinuous at every point $x_0\in X$.
Lower semicontinuity means that the function cannot jump up at the limit when we converge towards a point. Notice that lower semicontinuous functions need not be continuous.
[Figure: the graph of a lower semicontinuous function.]
Lemma 1.9. Let $(X,d)$ be a metric space and $f : X\to\mathbb{R}\cup\{\infty\}$ a lower semicontinuous function that is bounded from below by some constant $C$. Then $f$ can be written as the pointwise limit of a nondecreasing sequence $(f_n)_{n\in\mathbb{N}}$ of bounded continuous functions $f_n : X\to\mathbb{R}$.

Proof. We may assume that $f$ is not identically $+\infty$. Define
$$f_n(x) = \inf\{f(y) + n\,d(x,y) : y\in X\}.$$
Then we immediately have $|f_n(x)-f_n(y)| \le n\,d(x,y)$ and $f_n(x)\le f_m(x)\le f(x)$ for all $n\le m$ and $x,y\in X$. To see the pointwise convergence, fix $x_0\in X$ and $\varepsilon>0$. By the lower semicontinuity of $f$ there exists $\delta>0$ such that $f(x)\ge f(x_0)-\varepsilon$ for all $x\in B(x_0,\delta)$. Let $n\in\mathbb{N}$ be such that $n\delta \ge f(x_0)-C$. Then for all $y\notin B(x_0,\delta)$ we have
$$f(y) + n\,d(x_0,y) \ge C + n\,d(x_0,y) \ge C + n\delta \ge f(x_0)$$
and for $y\in B(x_0,\delta)$
$$f(y) + n\,d(x_0,y) \ge f(y) \ge f(x_0) - \varepsilon.$$
Finally, in order to make $f_n$ bounded, we can take $\min(f_n,n)$. □
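The approximating functions $f_n$ of the proof are easy to compute on a grid; the following small experiment (ours, not from the notes) shows the pointwise monotone convergence for a lower semicontinuous step function.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 401)
f = np.where(xs <= 0.0, 0.0, 1.0)  # lsc: the value at the jump is the lower one

def f_n(n):
    # f_n(x) = min_y (f(y) + n * |x - y|), as in the proof of Lemma 1.9
    D = np.abs(xs[:, None] - xs[None, :])
    return (f[None, :] + n * D).min(axis=1)

i = np.searchsorted(xs, 0.5)  # look at the fixed point x = 0.5
for n in (1, 2, 4, 8):
    print(n, f_n(n)[i])  # 0.5, 1.0, 1.0, 1.0: increases to f(0.5) = 1
```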
With Lemma 1.9 we can prove lower semicontinuity for the transport cost under very mild
assumptions on the cost function c.
Lemma 1.10. Let $c : \mathbb{R}^n\times\mathbb{R}^n\to[0,+\infty]$ be a lower semicontinuous cost function. Suppose that a sequence $(\sigma_k)_k\subset\mathcal{P}(\mathbb{R}^n\times\mathbb{R}^n)$ converges weakly to some $\sigma\in\mathcal{P}(\mathbb{R}^n\times\mathbb{R}^n)$. Then
$$\int_{\mathbb{R}^n\times\mathbb{R}^n} c\,d\sigma \le \liminf_{k\to\infty}\int_{\mathbb{R}^n\times\mathbb{R}^n} c\,d\sigma_k.$$

Proof. By Lemma 1.9 the function $c$ can be written as the pointwise limit of a nondecreasing sequence $(c_m)_{m\in\mathbb{N}}$ of bounded continuous functions $c_m : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$. By monotone convergence,
$$\int c\,d\sigma = \lim_{m\to\infty}\int c_m\,d\sigma = \lim_{m\to\infty}\lim_{k\to\infty}\int c_m\,d\sigma_k \le \liminf_{k\to\infty}\int c\,d\sigma_k,$$
where in the last inequality we just use the trivial estimate
$$\int c_m\,d\sigma_k \le \int c\,d\sigma_k. \qquad\square$$
Now that we have established the lower semicontinuity under suitable conditions, let us turn to compactness. Let us start with a general theorem giving us precompactness. (Notice that $\mathbb{R}^n$ could be replaced by any complete and separable metric space in the statement.) In the special case of Prokhorov's theorem stated below, where the underlying space is $\mathbb{R}^n$, precompactness of a collection of measures is shown to be equivalent to the intuitive condition that no mass leaks out to infinity.

Theorem 1.11 (Prokhorov's theorem). A set $P\subset\mathcal{P}(\mathbb{R}^n)$ is precompact in the weak topology if and only if it is tight, i.e. for any $\varepsilon>0$ there is a compact set $K_\varepsilon\subset\mathbb{R}^n$ such that $\mu(\mathbb{R}^n\setminus K_\varepsilon)\le\varepsilon$ for all $\mu\in P$.
Next we will prove Prokhorov's theorem. The proof relies on the Riesz representation theorem, so let us recall a version of it which will be sufficient.

Theorem 1.12 ((a version of) the Riesz representation theorem). For every positive linear functional $\phi$ on $C_c(\mathbb{R}^n)$ there exists a Borel measure $\mu$ on $\mathbb{R}^n$ such that
$$\phi(f) = \int_{\mathbb{R}^n} f(x)\,d\mu(x) \quad\text{for all } f\in C_c(\mathbb{R}^n).$$
Here $C_c(\mathbb{R}^n)$ is the space of continuous compactly supported functions on $\mathbb{R}^n$. The measure in the Riesz representation is obtained by setting first
$$\mu(U) := \sup\{\phi(f) : f\in C_c(\mathbb{R}^n),\ 0\le f\le1,\ \operatorname{spt}(f)\subset U\}$$
for all open $U$ and then defining
$$\mu(A) := \inf\{\mu(U) : A\subset U \text{ open}\}$$
for all $A\in\mathcal{B}(\mathbb{R}^n)$. One then needs to check that this gives a measure with the desired properties.
Also the following basic theorem of functional analysis, the Banach–Alaoglu theorem, comes in handy, although we will also sketch a proof of Prokhorov's theorem without directly using it.
Theorem 1.13 (Banach-Alaoglu theorem). The closed unit ball of the dual of a normed space
is weak∗ compact.
Remark 1.14. A few words on the different topologies are probably needed. Notice that we defined the weak convergence using $C_b(\mathbb{R}^n)$. This is not the same topology as the one defined by using $C_c(\mathbb{R}^n)$ (or $C_0(\mathbb{R}^n)$, consisting of functions vanishing at infinity). To see this, take $\mu_n = \mathcal{L}^1|_{[n,n+1]}$. Then $\mu_n$ does not converge weakly to any measure, but still
$$\int_{\mathbb{R}}\varphi\,d\mu_n \to 0 \quad\text{for all } \varphi\in C_0(\mathbb{R}).$$
The Banach–Alaoglu theorem naturally is true also for $C_b(\mathbb{R})$. However, $(C_b(\mathbb{R}))'$ should be identified with measures on the Stone–Čech compactification of $\mathbb{R}$. The convergence using $C_b(\mathbb{R}^n)$ is also called narrow convergence.
Lemma 1.15. Let $K\subset\mathbb{R}^n$ be compact. Then $\mathcal{P}(K)$ is compact.
Proof. Since $K$ is compact, any continuous function on $K$ is bounded and has compact support. Thus
$$C_b(K) = C_c(K) = C(K) = \{f : K\to\mathbb{R} \text{ continuous}\}.$$
Recall that $C(K)$ is a Banach space when equipped with the supremum norm
$$\|f\|_\infty = \sup_{x\in K}|f(x)|.$$
By the Banach–Alaoglu theorem the unit ball
$$B' = \{\varphi\in C(K)' : \|\varphi\|\le1\}$$
of the dual space $C(K)'$ is compact in the weak* topology. Now consider the weak* closed subset of $B'$ defined as
$$\Sigma := \{\varphi\in B' : \varphi(1) = 1, \text{ and } \varphi(f)\ge0 \text{ for all } f\in C(K) \text{ with } f\ge0\}.$$
By the Riesz representation theorem, the map $T : \mathcal{P}(K)\to\Sigma : \mu\mapsto\varphi_\mu$ with
$$\varphi_\mu(f) := \int_K f\,d\mu, \quad f\in C(K),$$
is a bijection. Since the weak topology on $\mathcal{P}(K)$ is given in duality with $C_b(K)$, the map $T$ is a homeomorphism. Thus $\mathcal{P}(K)$ is compact. □
Let us also give the same proof written without the explicit use of the Banach-Alaoglu
theorem.
Second proof of Lemma 1.15. Let us show that $\mathcal{P}(K)$ is sequentially compact. For this purpose take a sequence of measures $(\mu_k)\subset\mathcal{P}(K)$. We will extract a converging subsequence using a diagonal argument. Let $(\mu_{0,k})_{k\in\mathbb{N}}$ be defined as $\mu_{0,k} := \mu_k$. Suppose that a sequence $(\mu_{i,k})_{k\in\mathbb{N}}$ has been defined for some $i\in\mathbb{N}$. Take a finite collection of balls $(B(x_{i,j},\frac1i))_{j=1}^{N_i}$ covering $K$ and select a subsequence $(\mu_{i+1,k})_{k\in\mathbb{N}}$ of $(\mu_{i,k})_{k\in\mathbb{N}}$ such that $|\mu_{i+1,k}(B(x_{i,j},\frac1i)) - \mu_{i+1,l}(B(x_{i,j},\frac1i))| \le \frac{1}{iN_i}$ for all $j$ and all $k,l\in\mathbb{N}$. Finally define a converging subsequence $(\nu_k)_k$ by taking the diagonal $\nu_k := \mu_{k,k}$.

Let us now check the weak convergence of $(\nu_k)_k$. Take $\varphi\in C_b(K)$ and $\varepsilon>0$. Write
$$A_{k,m} := B(x_{k,m},\tfrac1k)\setminus\bigcup_{i=1}^{m-1}B(x_{k,i},\tfrac1k).$$
Since $K$ is compact, $\varphi$ is uniformly continuous. Let $\delta>0$ be such that $|\varphi(x)-\varphi(y)|<\varepsilon$ if $\|x-y\|<\delta$. Let $k\ge\frac2\delta$. Then for any $j,l\ge k$
$$\begin{aligned}
\left|\int_K\varphi\,d\nu_j - \int_K\varphi\,d\nu_l\right|
&\le \sum_{m=1}^{N_k}\left|\int_{A_{k,m}}\varphi\,d\nu_j - \int_{A_{k,m}}\varphi\,d\nu_l\right| \\
&\le \sum_{m=1}^{N_k}\left(\left|\int_{A_{k,m}}\Big(\inf_{y\in A_{k,m}}\varphi(y)-\varphi\Big)\,d\nu_j\right| + \left|\int_{A_{k,m}}\Big(\inf_{y\in A_{k,m}}\varphi(y)-\varphi\Big)\,d\nu_l\right| + \Big|\inf_{y\in A_{k,m}}\varphi(y)\Big|\,|\nu_j(A_{k,m})-\nu_l(A_{k,m})|\right) \\
&\le \varepsilon + \varepsilon + \frac1k.
\end{aligned}$$
Thus we can define a functional $\phi_\nu(\varphi) = \lim_{k\to\infty}\int_K\varphi\,d\nu_k$, which by the Riesz representation theorem corresponds to a measure $\nu\in\mathcal{P}(K)$ towards which $(\nu_k)$ converges. □
Proof of Prokhorov’s theorem in Rn . “⇒” Suppose that the claim is not true. Thus there
exists ǫ > 0 and a sequence of measures (µk )k∈N ⊂ P(Rn ) such that µk (Rn \ B(0, k)) ≥ ǫ for
all k ∈ N. By precompactness of P there exists a subsequence, still noted by µk , converging
weakly to some µ ∈ P(Rn ). But now
!
∞
[
n
1 = µ(R ) = µ
B(0, k) = lim µ(B(0, k)) ≤ lim lim inf µj (B(0, k)) ≤ 1 − ǫ,
k=1
k→∞ j→∞
k→∞
which is a contradiction.
"⇐" Take a sequence $(\mu_k)\subset P$. We want to show that the sequence has a converging subsequence. Consider the sequence $(\nu_k)\subset\mathcal{P}(\bar B(0,1))$ defined as $\nu_k = f_\sharp(\mu_k)$ with a homeomorphism $f : \mathbb{R}^n\to B(0,1)$. Since $\bar B(0,1)$ is compact, by Lemma 1.15 there exists a subsequence $(\nu_{k_j})$ of $(\nu_k)$ weakly converging to a measure $\nu\in\mathcal{P}(\bar B(0,1))$. By the tightness, for every $\varepsilon>0$
$$\nu(S(0,1)) \le \liminf_{j\to\infty}\nu_{k_j}(\bar B(0,1)\setminus f(K_\varepsilon)) \le \varepsilon.$$
Thus $\mu_{k_j}$ weakly converges to the measure $f^{-1}_\sharp\nu\in\mathcal{P}(\mathbb{R}^n)$. □
Now we can prove the desired compactness of $\mathcal{A}(\mu,\nu)$.
Lemma 1.16. Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$. Then $\mathcal{A}(\mu,\nu)$ is compact in the weak topology.

Proof. Let us start with the tightness of $\mathcal{A}(\mu,\nu)$. Because $\{\mu\}$ and $\{\nu\}$ are both tight, for every $\varepsilon>0$ there exists a compact set $K_\varepsilon\subset\mathbb{R}^n$ such that $\mu(K_\varepsilon)\ge1-\varepsilon$ and $\nu(K_\varepsilon)\ge1-\varepsilon$. Let $\sigma\in\mathcal{A}(\mu,\nu)$. Then
$$\sigma(K_\varepsilon\times K_\varepsilon) \ge 1 - \sigma((\mathbb{R}^n\setminus K_\varepsilon)\times\mathbb{R}^n) - \sigma(\mathbb{R}^n\times(\mathbb{R}^n\setminus K_\varepsilon)) = 1 - \mu(\mathbb{R}^n\setminus K_\varepsilon) - \nu(\mathbb{R}^n\setminus K_\varepsilon) \ge 1-2\varepsilon.$$
Since $K_\varepsilon\times K_\varepsilon$ is compact, we have proven the tightness of $\mathcal{A}(\mu,\nu)$. By Prokhorov's theorem $\mathcal{A}(\mu,\nu)$ is then weakly precompact in $\mathcal{P}(\mathbb{R}^n\times\mathbb{R}^n)$.

Let us next prove the closedness of $\mathcal{A}(\mu,\nu)$. Let $(\sigma_k)_{k\in\mathbb{N}}\subset\mathcal{A}(\mu,\nu)$ be a sequence converging weakly to $\sigma\in\mathcal{P}(\mathbb{R}^n\times\mathbb{R}^n)$. We have to prove that the projections of $\sigma$ are $\mu$ and $\nu$. This is clear since
$$\sigma(U\times\mathbb{R}^n) \le \liminf_{k\to\infty}\sigma_k(U\times\mathbb{R}^n) = \liminf_{k\to\infty}\mu(U) = \mu(U) \quad\text{for all open } U\subset\mathbb{R}^n,$$
and similarly
$$\sigma(\mathbb{R}^n\times U) \le \nu(U) \quad\text{for all open } U\subset\mathbb{R}^n. \qquad\square$$
Theorem 1.17 (Existence of optimal plans). Assume that $c : \mathbb{R}^n\times\mathbb{R}^n\to[0,+\infty]$ is lower semicontinuous. Then there exists a minimizer for Kantorovich's formulation of the optimal mass transportation problem.

Proof. Let $(\sigma_k)_{k\in\mathbb{N}}\subset\mathcal{A}(\mu,\nu)$ be such that
$$\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma_k(x,y) \le \inf_{\sigma\in\mathcal{A}(\mu,\nu)}\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma(x,y) + \frac1k.$$
By Lemma 1.16 the set $\mathcal{A}(\mu,\nu)$ is weakly compact and hence there exists a subsequence $(\sigma_{k_j})_j$ of $(\sigma_k)_k$ converging weakly to some $\sigma_\infty\in\mathcal{A}(\mu,\nu)$. Now by Lemma 1.10
$$\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma_\infty(x,y) \le \liminf_{j\to\infty}\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma_{k_j}(x,y) \le \liminf_{j\to\infty}\left(\inf_{\sigma\in\mathcal{A}(\mu,\nu)}\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma(x,y) + \frac{1}{k_j}\right) = \inf_{\sigma\in\mathcal{A}(\mu,\nu)}\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma(x,y),$$
and hence $\sigma_\infty$ is a minimizer for the problem. □
Notice that it might well be the case that
$$\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma_\infty(x,y) = \infty$$
in the case where the transport cost is infinite for all $\sigma\in\mathcal{A}(\mu,\nu)$. We denote by $\mathrm{Opt}(\mu,\nu)\subset\mathcal{A}(\mu,\nu)$ the set of plans $\sigma$ that minimize the optimal transportation problem. Notice that by the lower semicontinuity of the transportation cost, Lemma 1.10, the set $\mathrm{Opt}(\mu,\nu)$ is also weakly compact in the setting of Theorem 1.17.
Now that we have established the existence of minimizers, the next obvious question is
whether the minimizer is unique. This is not always the case.
Example 1.18. Suppose $c(x,y) = h(\|x-y\|)$ for some function $h : \mathbb{R}\to\mathbb{R}\cup\{-\infty,\infty\}$ and every $x,y\in\mathbb{R}^2$. Let $\mu = \frac12(\delta_{(0,0)}+\delta_{(1,1)})$ and $\nu = \frac12(\delta_{(1,0)}+\delta_{(0,1)})$. Then
$$\mathrm{Opt}(\mu,\nu) = \mathcal{A}(\mu,\nu) = \left\{t\,\tfrac12(\delta_{(0,0,1,0)}+\delta_{(1,1,0,1)}) + (1-t)\,\tfrac12(\delta_{(0,0,0,1)}+\delta_{(1,1,1,0)}) : t\in[0,1]\right\},$$
since the mass is always transported a distance one and, because of the form of the cost function, the transportation cost is then always $h(1)$.
[Figure: the measures $\mu$ and $\nu$ at the corners of the unit square.]
Moreover, the transport plans $\frac12(\delta_{(0,0,1,0)}+\delta_{(1,1,0,1)})$ and $\frac12(\delta_{(0,0,0,1)}+\delta_{(1,1,1,0)})$ are induced by optimal transport maps, while all the other (optimal) transport plans are not.
The phenomena in Example 1.18 are quite general. Let us record some of them in the following remark.

Remark 1.19. (i) Let $\sigma_1,\sigma_2\in\mathcal{A}(\mu,\nu)$. Then for any $t\in[0,1]$ also $t\sigma_1+(1-t)\sigma_2\in\mathcal{A}(\mu,\nu)$. Similarly, if $\sigma_1,\sigma_2\in\mathrm{Opt}(\mu,\nu)$, then for any $t\in[0,1]$ also $t\sigma_1+(1-t)\sigma_2\in\mathrm{Opt}(\mu,\nu)$. This is because
$$\int c\,d(t\sigma_1+(1-t)\sigma_2) = t\int c\,d\sigma_1 + (1-t)\int c\,d\sigma_2.$$
(ii) Suppose that $\sigma_1,\sigma_2\in\mathcal{A}(\mu,\nu)$ are both induced by maps and that $\sigma_1\ne\sigma_2$. Then for any $t\in(0,1)$ the measure $t\sigma_1+(1-t)\sigma_2$ is not induced by a map. In order to see this, let $T_i$ be the maps satisfying $\sigma_i = (\mathrm{id}\times T_i)_\sharp\mu$ and suppose that $t\sigma_1+(1-t)\sigma_2$ is induced by some map $T$. Then
$$(\mathrm{id}\times T)_\sharp\mu = t\sigma_1+(1-t)\sigma_2 = t(\mathrm{id}\times T_1)_\sharp\mu + (1-t)(\mathrm{id}\times T_2)_\sharp\mu,$$
and so, disintegrating with respect to $\mu$, $\delta_{T(x)} = t\delta_{T_1(x)} + (1-t)\delta_{T_2(x)}$ at $\mu$-a.e. $x\in\mathbb{R}^n$. In particular, $T_1(x) = T_2(x)$ at $\mu$-a.e. $x\in\mathbb{R}^n$ and hence $\sigma_1 = \sigma_2$.
(iii) In many cases (ii) is the way one proves uniqueness of optimal transport plans: First
one shows that any optimal transport plan is induced by a map. If there then were two
optimal transport plans their convex combination would not be induced by a map, which is
a contradiction. Hence the uniqueness.
1.3. Cyclical monotonicity and subdifferentials. Let us now study in more detail the structure of optimal transport plans. This will later lead to results showing existence of optimal transport maps and uniqueness of optimal transport plans with the approach mentioned in Remark 1.19 (iii).
The idea is to characterize optimal transport plans $\sigma\in\mathrm{Opt}(\mu,\nu)$ as
• the $\sigma\in\mathcal{A}(\mu,\nu)$ for which $\operatorname{spt}(\sigma)$ is c-cyclically monotone. c-cyclical monotonicity of a set means that one cannot decrease the cost of transport on any finite subset of the set by permuting the transport. (A more rigorous definition is given later.)
• and as the $\sigma\in\mathcal{A}(\mu,\nu)$ for which there exists a convex and lower semicontinuous function $\varphi$ such that $\sigma$ is concentrated on the graph of the c-subdifferential of $\varphi$.
Recall the definition of the support of a measure:
$$\operatorname{spt}(\mu) := \{x\in\mathbb{R}^n : \text{for all } \varepsilon>0 \text{ we have } \mu(B(x,\varepsilon))>0\}.$$
Equivalently, the support is the smallest closed set of full measure.
Recall also the notion of the classical subdifferential $\partial^-\varphi$ of a function $\varphi : \mathbb{R}\to\mathbb{R}$ used in convex analysis:
[Figure: a convex function $\varphi$ and its subdifferential $\partial^-\varphi$.]
Definition 1.20 (c-cyclical monotonicity). A set $\Gamma\subset\mathbb{R}^n\times\mathbb{R}^n$ is called c-cyclically monotone if for every $(x_i,y_i)_{i=1}^N\subset\Gamma$, $N\in\mathbb{N}$, we have
$$\sum_{i=1}^N c(x_i,y_i) \le \sum_{i=1}^N c(x_i,y_{p(i)}) \quad\text{for all permutations } p \text{ of } \{1,2,\dots,N\}.$$
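For a finite set of pairs the definition can be verified by brute force over permutations (a permutation of a subset extends to a permutation of the whole set fixing the rest, so checking the full set suffices). A small utility of our own making:

```python
import numpy as np
from itertools import permutations

def is_c_cyclically_monotone(xs, ys, c):
    # Definition 1.20 for the finite set {(xs[i], ys[i])}: no permutation p of
    # the targets may strictly decrease the total transport cost.
    n = len(xs)
    base = sum(c(xs[i], ys[i]) for i in range(n))
    return all(sum(c(xs[i], ys[p[i]]) for i in range(n)) >= base - 1e-12
               for p in permutations(range(n)))

c = lambda x, y: float(np.sum((x - y) ** 2))
xs = [np.array([0.0]), np.array([1.0])]
print(is_c_cyclically_monotone(xs, [np.array([0.0]), np.array([1.0])], c))  # True
print(is_c_cyclically_monotone(xs, [np.array([1.0]), np.array([0.0])], c))  # False
```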
Definition 1.21 (c-transforms). For a function $\varphi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ the $c_+$-transforms $\varphi^{c_+}_l, \varphi^{c_+}_r : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ are defined as
$$\varphi^{c_+}_l(x) := \inf_{y\in\mathbb{R}^n}\left(c(x,y) - \varphi(y)\right)$$
and
$$\varphi^{c_+}_r(y) := \inf_{x\in\mathbb{R}^n}\left(c(x,y) - \varphi(x)\right).$$
(Notice that, since $c(x,y)$ is not always symmetric, $\varphi^{c_+}_l$ and $\varphi^{c_+}_r$ are not always the same.) The $c_-$-transforms $\varphi^{c_-}_l, \varphi^{c_-}_r : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ of $\varphi$ are defined as
$$\varphi^{c_-}_l(x) := \sup_{y\in\mathbb{R}^n}\left(-c(x,y) - \varphi(y)\right)$$
and $\varphi^{c_-}_r(y)$ similarly. If there is no risk of confusion, we drop the subscripts $r$ and $l$.

Below are illustrations of the $c_+$- and $c_-$-transforms of a function on $\mathbb{R}$ for the cost function $c(x,y) = |x-y|$.
[Figure: $\varphi$ together with $-\varphi^{c_+}$ and $-\varphi^{c_-}$.]
Definition 1.22 (c-concavity and c-convexity). A function $\varphi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ is c-concave if there exists $\psi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ such that $\varphi = \psi^{c_+}$, and $\varphi$ is called c-convex if there exists $\psi$ such that $\varphi = \psi^{c_-}$.
Definition 1.23 (c-superdifferential and c-subdifferential). Let $\varphi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ be a c-concave function. The c-superdifferential $\partial^{c_+}\varphi\subset\mathbb{R}^n\times\mathbb{R}^n$ is defined as
$$\partial^{c_+}\varphi := \{(x,y)\in\mathbb{R}^n\times\mathbb{R}^n : \varphi(x)+\varphi^{c_+}(y) = c(x,y)\}.$$
Similarly, for a c-convex $\varphi$ the c-subdifferential $\partial^{c_-}\varphi\subset\mathbb{R}^n\times\mathbb{R}^n$ is defined as
$$\partial^{c_-}\varphi := \{(x,y)\in\mathbb{R}^n\times\mathbb{R}^n : \varphi(x)+\varphi^{c_-}(y) = -c(x,y)\}.$$
We will also write $\partial^{c_+}\varphi(x) = \{y\in\mathbb{R}^n : (x,y)\in\partial^{c_+}\varphi\}$ and similarly for $\partial^{c_-}\varphi$.
The following example shows where the above terminology originates.
Example 1.24. Let $c(x,y) = -\langle x,y\rangle$. Then the c-cyclical monotonicity reads as
$$\sum_{i=1}^N\langle x_i,y_i\rangle \ge \sum_{i=1}^N\langle x_i,y_{p(i)}\rangle \quad\text{for all permutations } p \text{ of } \{1,2,\dots,N\},$$
which is usually called cyclical monotonicity. (Notice that the same monotonicity is also equivalent to the c-cyclical monotonicity for $c(x,y) = \|x-y\|^2$, since $\|x-y\|^2 = \|x\|^2 - 2\langle x,y\rangle + \|y\|^2$.)
Let us next see what c-concavity means in this case. Take $\psi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty,+\infty\}$ and define $\varphi = \psi^{c_+}$. For $x_1,x_2\in\mathbb{R}^n$ and $t\in[0,1]$ we notice that
$$\begin{aligned}
\varphi(tx_1+(1-t)x_2) &= \inf_{y\in\mathbb{R}^n}\left(-\langle tx_1+(1-t)x_2,\,y\rangle - \psi(y)\right) \\
&= \inf_{y\in\mathbb{R}^n}\left(t(-\langle x_1,y\rangle - \psi(y)) + (1-t)(-\langle x_2,y\rangle - \psi(y))\right) \\
&\ge t\inf_{y\in\mathbb{R}^n}\left(-\langle x_1,y\rangle - \psi(y)\right) + (1-t)\inf_{y\in\mathbb{R}^n}\left(-\langle x_2,y\rangle - \psi(y)\right) \\
&= t\varphi(x_1) + (1-t)\varphi(x_2).
\end{aligned}$$
In other words, $\varphi$ is concave. Moreover, for any $x\in\mathbb{R}^n$ and $(x_i)\subset\mathbb{R}^n$ converging to $x$, assuming $\varphi(x)>-\infty$, we have
$$\varphi(x) \ge -\langle x,y_\varepsilon\rangle - \psi(y_\varepsilon) - \varepsilon \ge -\langle x_i,y_\varepsilon\rangle - \psi(y_\varepsilon) - \langle x-x_i,y_\varepsilon\rangle - \varepsilon \ge \varphi(x_i) - \langle x-x_i,y_\varepsilon\rangle - \varepsilon,$$
where $y_\varepsilon\in\mathbb{R}^n$ is chosen suitably depending on $\varepsilon>0$. Thus $\varphi$ is upper semicontinuous. (Exercise: show that $\varphi$ is actually c-concave if and only if it is concave and upper semicontinuous.)
The transform $\varphi^{c_-}(x) = \sup_{y\in\mathbb{R}^n}(\langle x,y\rangle - \varphi(y))$ is called the Legendre transform. Finally, the c-superdifferential is
$$\begin{aligned}
\partial^{c_+}\varphi(x) &= \{y\in\mathbb{R}^n : \varphi(x) + \inf_{z\in\mathbb{R}^n}(-\langle y,z\rangle - \varphi(z)) = -\langle x,y\rangle\} \\
&= \{y\in\mathbb{R}^n : \varphi(x) - \varphi(z) \ge \langle y,\,z-x\rangle \text{ for all } z\in\mathbb{R}^n\}.
\end{aligned}$$
In other words, it is the usual superdifferential.
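On a grid the c-transforms are direct minimum and maximum computations. A sketch of our own, for the cost $c(x,y) = |x-y|$ of the illustration above; it also verifies the defining inequality $\varphi^{c_+}(x) + \varphi(y) \le c(x,y)$.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)
phi = -np.abs(xs)                                # any bounded phi will do

C = np.abs(xs[:, None] - xs[None, :])            # cost matrix c(x_i, y_j)
phi_c_plus = (C - phi[None, :]).min(axis=1)      # inf_y (c(x,y) - phi(y))
phi_c_minus = (-C - phi[None, :]).max(axis=1)    # sup_y (-c(x,y) - phi(y))

# By definition of the infimum, phi^{c+}(x) + phi(y) <= c(x,y) for all x, y:
assert ((phi_c_plus[:, None] + phi[None, :]) <= C + 1e-9).all()
print(phi_c_plus[:3], phi_c_minus[:3])
```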
Theorem 1.25. Suppose $c : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ is continuous and bounded from below. Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$ be such that
$$c(x,y) \le a(x) + b(y)$$
for some $a\in L^1(\mu)$ and $b\in L^1(\nu)$. Also, let $\sigma\in\mathcal{A}(\mu,\nu)$. Then the following are equivalent:
(1) the plan $\sigma$ is optimal,
(2) the set $\operatorname{spt}(\sigma)$ is c-cyclically monotone,
(3) there exists a c-concave function $\varphi$ such that $\max\{\varphi,0\}\in L^1(\mu)$ and $\operatorname{spt}(\sigma)\subset\partial^{c_+}\varphi$.
Proof. (1) ⇒ (2): Assume that this is not the case. Then there exist $N\in\mathbb{N}$, $(x_i,y_i)_{i=1}^N\subset\operatorname{spt}(\sigma)$ and a permutation $p$ of $\{1,\dots,N\}$ such that
$$\sum_{i=1}^N c(x_i,y_i) > \sum_{i=1}^N c(x_i,y_{p(i)}).$$
Since $c$ is continuous, there exists $\varepsilon>0$ such that
$$\sum_{i=1}^N c(a_i,b_i) > \sum_{i=1}^N c(a_i,b_{p(i)}) \quad\text{for all } (a_i,b_i)\in B(x_i,\varepsilon)\times B(y_i,\varepsilon).$$
Now the idea is to modify $\sigma$ so that a positive part of the transport from $B(x_i,\varepsilon)$ to $B(y_i,\varepsilon)$ is changed to transport from $B(x_i,\varepsilon)$ to $B(y_{p(i)},\varepsilon)$ for all $i$. One way to avoid deciding where to send $a_i\in B(x_i,\varepsilon)$ in $B(y_{p(i)},\varepsilon)$ is to send it everywhere. In other words, define $P\in\mathcal{P}(\mathbb{R}^{2nN})$ as the product of the measures
$$\frac{1}{m_i}\,\sigma|_{B(x_i,\varepsilon)\times B(y_i,\varepsilon)}, \quad\text{with } m_i = \sigma(B(x_i,\varepsilon)\times B(y_i,\varepsilon)).$$
Let $\pi^{i,j}$ be the orthogonal projection from $\mathbb{R}^{2nN}$ to the $(2(i-1)+j)$:th copy of $\mathbb{R}^n$ in the product. Now, defining
$$\tilde\sigma = \sigma + \frac{\min_i m_i}{N}\sum_{i=1}^N\left((\pi^{i,1},\pi^{p(i),2})_\sharp P - (\pi^{i,1},\pi^{i,2})_\sharp P\right),$$
we obtain $\tilde\sigma\in\mathcal{A}(\mu,\nu)$ with
$$\int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\tilde\sigma(x,y) < \int_{\mathbb{R}^n\times\mathbb{R}^n}c(x,y)\,d\sigma(x,y),$$
contradicting the optimality of $\sigma$.
(2) ⇒ (3): Fix some $(\bar x,\bar y)\in\operatorname{spt}(\sigma)$ and define
$$\varphi(x) := \inf\left(c(x,y_1) - c(x_1,y_1) + c(x_1,y_2) - c(x_2,y_2) + \cdots + c(x_N,\bar y) - c(\bar x,\bar y)\right),$$
where the infimum is over all $N\in\mathbb{N}$ and $(x_i,y_i)_{i=1}^N\subset\operatorname{spt}(\sigma)$.
First of all,
$$\varphi(x) \le c(x,\bar y) - c(\bar x,\bar y) < a(x) + b(\bar y) - c(\bar x,\bar y),$$
and since $a\in L^1(\mu)$, also $\max\{\varphi,0\}\in L^1(\mu)$. Secondly, we have
$$\varphi(x) = \inf_{y_1\in\mathbb{R}^n}\left(c(x,y_1) - \psi(y_1)\right)$$
with
$$\psi(y_1) = \sup\left(c(x_1,y_1) - c(x_1,y_2) + c(x_2,y_2) - \cdots - c(x_N,\bar y) + c(\bar x,\bar y)\right),$$
where again the supremum is over all $N\in\mathbb{N}$ and $(x_i,y_i)_{i=1}^N\subset\operatorname{spt}(\sigma)$. (If $(x_1,y_1)\notin\operatorname{spt}(\sigma)$ for all $x_1\in\mathbb{R}^n$, we are taking the supremum over an empty set and by definition we then have $\psi(y_1) = -\infty$.) Thus $\varphi$ is c-concave. Finally, for any $(\hat x,\hat y)\in\operatorname{spt}(\sigma)$ we have by the definition of $\varphi$ that for every $x\in\mathbb{R}^n$ it holds that
$$\varphi(x) \le c(x,\hat y) - c(\hat x,\hat y) + \inf\left(c(\hat x,y_2) - c(x_2,y_2) + \cdots + c(x_N,\bar y) - c(\bar x,\bar y)\right) = c(x,\hat y) - c(\hat x,\hat y) + \varphi(\hat x),$$
which is the same as
$$\sup_{x\in\mathbb{R}^n}\left(\varphi(x) - c(x,\hat y)\right) = -\varphi^{c_+}(\hat y) = \varphi(\hat x) - c(\hat x,\hat y),$$
and so $\operatorname{spt}(\sigma)\subset\partial^{c_+}\varphi$.
(3) ⇒ (1): In order to show the optimality of $\sigma$ we take a competitor $\tilde\sigma\in\mathcal{A}(\mu,\nu)$. From the c-concavity of $\varphi$ it follows that
$$\varphi(x) + \varphi^{c_+}(y) = c(x,y) \quad\text{for all } (x,y)\in\operatorname{spt}(\sigma)$$
and
$$\varphi(x) + \varphi^{c_+}(y) \le c(x,y) \quad\text{for all } (x,y)\in\mathbb{R}^n\times\mathbb{R}^n.$$
Thus
$$\begin{aligned}
\int c(x,y)\,d\sigma(x,y) &= \int\left(\varphi(x)+\varphi^{c_+}(y)\right)d\sigma(x,y) = \int\varphi(x)\,d\mu(x) + \int\varphi^{c_+}(y)\,d\nu(y) \\
&= \int\left(\varphi(x)+\varphi^{c_+}(y)\right)d\tilde\sigma(x,y) \le \int c(x,y)\,d\tilde\sigma(x,y). \qquad\square
\end{aligned}$$
A consequence of Theorem 1.25 is that the optimality of the transport plan depends only
on the support of the plan.
Let us now give a uniqueness result in the simplest case, where $c(x,y) = \|x-y\|^2$ and the starting measure $\mu$ gives zero measure to every Lipschitz graph
$$G = \left\{\sum_{i=1}^{n-1}x_ie_i + f(x_1,\dots,x_{n-1})e_n : (e_i) \text{ is an orthonormal basis of } \mathbb{R}^n \text{ and } f \text{ is Lipschitz}\right\}.$$
(For example, if $\mu$ is absolutely continuous with respect to the Lebesgue measure.) Later we will sharpen the assumption on the measure $\mu$.
Theorem 1.26. Let $c(x,y) = \|x-y\|^2$ and let $\mu\in\mathcal{P}(\mathbb{R}^n)$ be such that $\mu(G) = 0$ for all Lipschitz graphs $G$. Suppose that $\nu\in\mathcal{P}(\mathbb{R}^n)$ is such that there exists a transport from $\mu$ to $\nu$ with finite cost. Then there exists a unique optimal transport plan from $\mu$ to $\nu$ and it is induced by a map.
Let us start with a simple lemma.
Lemma 1.27. Let $\sigma\in\mathcal{A}(\mu,\nu)$. Then $\sigma$ is induced by a map if and only if there exists a $\sigma$-measurable set $\Gamma\subset\mathbb{R}^n\times\mathbb{R}^n$ on which $\sigma$ is concentrated, such that for $\mu$-a.e. $x\in\mathbb{R}^n$ there exists only one $y\in\mathbb{R}^n$ with $(x,y)\in\Gamma$.

Proof. Suppose that $\sigma$ is induced by some map $T$. Then $\sigma$ is concentrated on the graph $(\mathrm{id}\times T)(\mathbb{R}^n)$, as required.

To prove the other direction, suppose that $\sigma$ is concentrated on a set $\Gamma$ with the property that for $\mu$-a.e. $x\in\mathbb{R}^n$ there exists only one $y\in\mathbb{R}^n$ such that $(x,y)\in\Gamma$. By assumption, outside a $\sigma$-negligible set $N\times\mathbb{R}^n$ the set $\Gamma$ is a graph, i.e. for all $x\in\mathbb{R}^n\setminus N$ there exists only one $y =: T(x)$ such that $(x,y)\in\Gamma$. Moreover, by the inner regularity of $\sigma$ we find a sequence of compact sets $\Gamma_i\subset\Gamma\setminus(N\times\mathbb{R}^n)$ such that
$$\sigma\left(\Gamma\setminus\bigcup_{i=1}^\infty\Gamma_i\right) = 0.$$
As a continuous image of a σ-compact set, the projection of $\bigcup_{i=1}^\infty\Gamma_i$ to the first component $\mathbb{R}^n$ is σ-compact. Moreover, $T|_{\pi(\Gamma_i)}$ is continuous from $\pi(\Gamma_i)$ to $\mathbb{R}^n$ for all $\Gamma_i$. Therefore $T$ is Borel. Since for all $\varphi\in C_c(\mathbb{R}^n\times\mathbb{R}^n)$ we have
$$\int\varphi(x,y)\,d\sigma(x,y) = \int\varphi(x,T(x))\,d\sigma(x,y) = \int\varphi(x,T(x))\,d\mu(x),$$
the equality $\sigma = (\mathrm{id}\times T)_\sharp\mu$ holds. □
We will also need the following geometric lemma.

Lemma 1.28. Let $\mu\in\mathcal{P}(\mathbb{R}^n)$ be such that $\mu(G) = 0$ for any graph $G$ of a Lipschitz function. Then for every $\varepsilon>0$ and $v\in S^{n-1}$ we have $C(x,v,\varepsilon)\cap\operatorname{spt}(\mu)\ne\emptyset$ at $\mu$-a.e. point $x\in\operatorname{spt}(\mu)$, where
$$C(x,v,\varepsilon) := \{x + tv + \varepsilon tw : t\in(0,\infty) \text{ and } w\in B^n\}.$$
[Figure: the cone $C(x,v,\varepsilon)$, a union of the balls $B(x+tv,\varepsilon t)$, $t>0$.]
Proof. Suppose that the claim is not true. Then there exist $\varepsilon>0$, $v\in S^{n-1}$ and a set $A\subset\operatorname{spt}(\mu)$ with $\mu(A)>0$ such that $C(x,v,\varepsilon)\cap\operatorname{spt}(\mu) = \emptyset$ for all $x\in A$. But now the orthogonal projection $\pi_{v^\perp} : \mathbb{R}^n\to v^\perp$ is a bi-Lipschitz map between $A$ and $\pi_{v^\perp}(A)$: writing $x'-x = hv + u$ with $u\perp v$ for $x,x'\in A$, the cone condition gives $|h| \le \|u\|/\varepsilon$. Hence $A$ is a subset of a Lipschitz graph and by assumption $\mu(A) = 0$. This is a contradiction. □
Proof of Theorem 1.26. By Theorem 1.17 there exists an optimal transport plan $\sigma$ from $\mu$ to $\nu$. By contradiction, let us assume that $\sigma$ is not induced by a map. By Lemma 1.27 the measure $\sigma$ is not concentrated on any set $\Gamma\subset\mathbb{R}^n\times\mathbb{R}^n$ with the property that for $\mu$-a.e. $x\in\mathbb{R}^n$ there exists only one $y\in\mathbb{R}^n$ such that $(x,y)\in\Gamma$. In particular, the set
$$A := \{x\in\mathbb{R}^n : \text{the set } \{y\in\mathbb{R}^n : (x,y)\in\operatorname{spt}(\sigma)\} \text{ has more than one element}\}$$
has positive $\mu$-measure.

Our aim is now to find a contradiction with the c-cyclical monotonicity of $\operatorname{spt}(\sigma)$ provided by Theorem 1.25. We will arrive at the contradiction via a discretization and a final geometric argument at a density point. Let us start with the discretization. We can write
$$A = \bigcup_{i\in\mathbb{N}}A_i$$
with
$$A_i := \left\{x\in\mathbb{R}^n : \text{there exist } (x,y_1),(x,y_2)\in\operatorname{spt}(\sigma) \text{ with } \|y_1-y_2\|\ge\frac1i\right\}.$$
Since $\mu(A)>0$, there exists $i\in\mathbb{N}$ such that $\mu(A_i)>0$. Let us fix such an $i$. Next we can cover $\mathbb{R}^n$ with balls $B(x_j,\frac{1}{10i})$, $j\in\mathbb{N}$, and write
$$A_i = \bigcup_{j,k\in\mathbb{N}}F_{j,k}$$
with
$$F_{j,k} := \left\{x\in\mathbb{R}^n : \text{there exist } (x,y_1),(x,y_2)\in\operatorname{spt}(\sigma) \text{ with } y_1\in B\big(x_j,\tfrac{1}{10i}\big) \text{ and } y_2\in B\big(x_k,\tfrac{1}{10i}\big) \text{ such that } \|y_1-y_2\|\ge\frac1i\right\}.$$
Again, we can fix $j,k\in\mathbb{N}$ such that $\mu(F_{j,k})>0$.

Take a point $z_1\in F_{j,k}$ given by Lemma 1.28 such that there exists a point $z_2\in F_{j,k}$ with $\langle z_1-z_2,\,x_j-x_k\rangle > \frac{9}{10}\|z_1-z_2\|\,\|x_j-x_k\|$. Now there exist $y_1\in B(x_k,\frac{1}{10i})$ and $y_2\in B(x_j,\frac{1}{10i})$ such that $(z_1,y_1),(z_2,y_2)\in\operatorname{spt}(\sigma)$. Now
$$\langle z_1,y_1\rangle + \langle z_2,y_2\rangle - \langle z_1,y_2\rangle - \langle z_2,y_1\rangle = \langle z_1-z_2,\,y_1-y_2\rangle < 0,$$
contradicting the c-cyclical monotonicity of $\operatorname{spt}(\sigma)$. (Recall the remark in Example 1.24.) □
[Figure: the points $z_1, z_2$ with $y_1$ near $x_k$ and $y_2$ near $x_j$.]
1.4. Dual transportation problem. Let us next connect the Kantorovich problem to a dual formulation:

Dual formulation of the optimal transport problem.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$. Maximize
$$\int\varphi(x)\,d\mu(x) + \int\psi(y)\,d\nu(y)$$
among all functions $\varphi\in L^1(\mu)$, $\psi\in L^1(\nu)$ such that
$$\varphi(x) + \psi(y) \le c(x,y) \quad\text{for all } x,y\in\mathbb{R}^n.$$
Theorem 1.29 (duality). Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$ and let $c : \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ be continuous and bounded from below such that
$$c(x,y) \le a(x) + b(y)$$
for some $a\in L^1(\mu)$ and $b\in L^1(\nu)$. Then the minimum of the Kantorovich problem equals the supremum in the dual formulation, and this supremum is attained by some couple $(\varphi,\varphi^{c_+})$ with $\varphi$ a c-concave function.
Proof. Let $\sigma\in\mathcal{A}(\mu,\nu)$. For any pair $\varphi\in L^1(\mu)$ and $\psi\in L^1(\nu)$ satisfying
$$\varphi(x) + \psi(y) \le c(x,y) \quad\text{for all } x,y\in\mathbb{R}^n,$$
we have
$$\int c(x,y)\,d\sigma(x,y) \ge \int\left(\varphi(x)+\psi(y)\right)d\sigma(x,y) = \int\varphi(x)\,d\mu(x) + \int\psi(y)\,d\nu(y).$$
Thus the supremum in the dual problem never exceeds the minimum of the Kantorovich problem.

For the other direction, take $\sigma\in\mathrm{Opt}(\mu,\nu)$ and let $\varphi$ be the c-concave function given by Theorem 1.25: $\operatorname{spt}(\sigma)\subset\partial^{c_+}\varphi$ and $\max\{\varphi,0\}\in L^1(\mu)$.
Notice that by the assumption $c(x,y)\le a(x)+b(y)$ and by the inequality $\varphi(x)+\varphi^{c_+}(y)\le c(x,y)$, we have, for a fixed $x$ with $\varphi(x)>-\infty$,
$$\int\varphi^{c_+}(y)\,d\nu(y) \le \int\left(c(x,y)-\varphi(x)\right)d\nu(y) \le \int\left(a(x)+b(y)-\varphi(x)\right)d\nu(y) < \infty.$$
Hence also $\max\{\varphi^{c_+},0\}\in L^1(\nu)$. Now
$$\int\varphi(x)\,d\mu(x) + \int\varphi^{c_+}(y)\,d\nu(y) = \int\left(\varphi(x)+\varphi^{c_+}(y)\right)d\sigma(x,y) = \int c(x,y)\,d\sigma(x,y),$$
and so $\varphi\in L^1(\mu)$, $\varphi^{c_+}\in L^1(\nu)$. Thus $(\varphi,\varphi^{c_+})$ is an admissible pair of functions for the dual problem and we have proven the claim. □
Definition 1.30 (Kantorovich potential). A c-concave function ϕ such that (ϕ, ϕc+ ) is a
maximizing pair for the dual problem is called a Kantorovich potential for the couple µ and
ν.
Let us now have another look at the existence of optimal maps with the cost $c(x,y) = \frac{\|x-y\|^2}{2}$.
Proposition 1.31. Let $\varphi : \mathbb{R}^n\to\mathbb{R}\cup\{-\infty\}$. Then $\varphi$ is c-concave if and only if $x\mapsto\bar\varphi(x) := \frac{\|x\|^2}{2} - \varphi(x)$ is convex and lower semicontinuous. In this case $\partial^{c_+}\varphi = \partial^-\bar\varphi$.
Proof. Notice that, with $\varphi = \psi^{c_+}$,
$$\varphi(x) = \inf_{y\in\mathbb{R}^n}\left(\frac{\|x-y\|^2}{2} - \psi(y)\right) = \inf_{y\in\mathbb{R}^n}\left(\frac{\|x\|^2}{2} - \langle x,y\rangle + \frac{\|y\|^2}{2} - \psi(y)\right),$$
or equivalently
$$\bar\varphi(x) = \frac{\|x\|^2}{2} - \varphi(x) = \sup_{y\in\mathbb{R}^n}\left(\langle x,y\rangle - \frac{\|y\|^2}{2} + \psi(y)\right),$$
which proves the first claim.
For the second claim, observe that
$$y\in\partial^{c_+}\varphi(x) \iff \begin{cases}\varphi(x) = \frac{\|x-y\|^2}{2} - \varphi^{c_+}(y), & \\ \varphi(z) \le \frac{\|z-y\|^2}{2} - \varphi^{c_+}(y), & \forall z\in\mathbb{R}^n,\end{cases}$$
or equivalently
$$\begin{cases}\varphi(x) - \frac{\|x\|^2}{2} = \langle x,-y\rangle + \frac{\|y\|^2}{2} - \varphi^{c_+}(y), & \\ \varphi(z) - \frac{\|z\|^2}{2} \le \langle z,-y\rangle + \frac{\|y\|^2}{2} - \varphi^{c_+}(y), & \forall z\in\mathbb{R}^n,\end{cases}$$
which is the same as
$$\varphi(z) - \frac{\|z\|^2}{2} \le \varphi(x) - \frac{\|x\|^2}{2} + \langle z-x,\,-y\rangle \quad\text{for all } z\in\mathbb{R}^n,$$
i.e.
$$-y\in\partial^+(-\bar\varphi)(x) \iff y\in\partial^-\bar\varphi(x). \qquad\square$$
Definition 1.32 (c − c hypersurface). A set E ⊂ Rn is called a c − c hypersurface, if it is
the graph of the difference of two real valued convex functions in some coordinate system.
Notice that any c − c hypersurface is contained in a countable union of Lipschitz graphs, since a convex function is locally Lipschitz. Hence
$$\mu(L) = 0 \text{ for every Lipschitz graph } L \quad\Longrightarrow\quad \mu(E) = 0 \text{ for every c − c hypersurface } E.$$
We will need the following result on the differentiability of convex functions.

Theorem 1.33. Let $A\subset\mathbb{R}^n$. Then there exists a convex function $f : \mathbb{R}^n\to\mathbb{R}$ such that $A$ is contained in the set of points of non-differentiability of $f$ if and only if $A$ can be covered by countably many c − c hypersurfaces.

Proof. (skipped)
Theorem 1.34 (Brenier). Let $\mu\in\mathcal{P}(\mathbb{R}^n)$ be such that $\int\|x\|^2\,d\mu(x)<\infty$. Then the following two conditions are equivalent:
(i) for every $\nu\in\mathcal{P}(\mathbb{R}^n)$ with $\int\|x\|^2\,d\nu(x)<\infty$ there exists only one transport plan from $\mu$ to $\nu$ and this plan is induced by a map,
(ii) for every c − c hypersurface $E\subset\mathbb{R}^n$ we have $\mu(E) = 0$.
Furthermore, when (i) and (ii) hold, the optimal map is the gradient of a convex function.
Proof. (ii) ⇒ (i): By taking $a(x) = b(x) = \|x\|^2$ we notice that
$$c(x,y) = \frac{\|x-y\|^2}{2} \le \|x\|^2 + \|y\|^2,$$
and $a\in L^1(\mu)$, $b\in L^1(\nu)$. Thus the assumptions of Theorem 1.25 are satisfied, and so any optimal plan $\sigma\in\mathrm{Opt}(\mu,\nu)$ is concentrated on the superdifferential of a Kantorovich potential $\varphi$. By Proposition 1.31 the function $\bar\varphi(x) := \|x\|^2/2 - \varphi(x)$ is convex and $\partial^{c_+}\varphi = \partial^-\bar\varphi$. Since $\bar\varphi$ is convex, by Theorem 1.33 the set of non-differentiability points of $\bar\varphi$ is contained in a union of countably many c − c hypersurfaces. By assumption, this set has zero $\mu$-measure. Thus $\bar\varphi$ is differentiable $\mu$-almost everywhere. In particular, $\partial^-\bar\varphi(x)$ is single-valued for $\mu$-almost every $x\in\mathbb{R}^n$ and by Lemma 1.27 the optimal plan is induced by a map, namely the gradient $\nabla\bar\varphi$ of a convex function. Consequently, the optimal plan is unique.
(i) ⇒ (ii): Assume that the claim is not true. Then by Theorem 1.33 there exists a convex function $\bar\varphi : \mathbb{R}^n\to\mathbb{R}$ such that the set $E$ of non-differentiability points of $\bar\varphi$ has positive $\mu$-measure. We may assume that $\bar\varphi$ has linear growth at infinity. At the non-differentiability points $x\in E$ the set $\partial^-\bar\varphi(x)$ has more than one point. Let us select, measurably for every $x\in E$, points $S(x),T(x)\in\partial^-\bar\varphi(x)$ such that $T(x)\ne S(x)$. Now define
$$\sigma := \frac12\left((\mathrm{id}\times T)_\sharp\mu + (\mathrm{id}\times S)_\sharp\mu\right).$$
Since $\bar\varphi$ has linear growth, $\nu := \pi^2_\sharp\sigma$ has compact support and thus $\int\|x\|^2\,d\nu(x)<\infty$. Since by Proposition 1.31 the support of the measure $\sigma\in\mathcal{A}(\mu,\nu)$ is c-cyclically monotone, we have $\sigma\in\mathrm{Opt}(\mu,\nu)$ by Theorem 1.25. However, the measure $\sigma$ is not induced by a map since $T(x)\ne S(x)$ for all $x\in E$ and $E$ has positive $\mu$-measure. □
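In one dimension Brenier's theorem takes a concrete form: the gradient of a convex function is a nondecreasing map, and for atomless $\mu$ the optimal map for the quadratic cost is the monotone rearrangement $T = F_\nu^{-1}\circ F_\mu$ of the distribution functions (a standard fact, not proven in these notes). A sampling-based sketch; the helper and its name are ours.

```python
import numpy as np

def brenier_map_1d(mu_samples, nu_samples):
    # Monotone rearrangement T = F_nu^{-1} o F_mu, approximated from samples.
    xs = np.sort(mu_samples)
    ys = np.sort(nu_samples)
    def T(x):
        q = np.searchsorted(xs, x, side="right") / len(xs)  # empirical F_mu(x)
        return np.quantile(ys, min(q, 1.0))                 # empirical F_nu^{-1}
    return T

rng = np.random.default_rng(0)
T = brenier_map_1d(rng.normal(0, 1, 10000), rng.normal(3, 2, 10000))
print(T(0.0))  # close to 3: the exact optimal map here is x -> 2x + 3
```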
1.5. A few applications of optimal transport. Recall that by the classical Helmholtz
decomposition a sufficiently smooth vector field can be decomposed into a curl-free component
and a divergence-free component. We can now prove a generalization of this decomposition.
Theorem 1.35 (Polar factorization). Suppose $\Omega\subset\mathbb{R}^n$ is a bounded domain and $\mu_\Omega$ the normalized Lebesgue measure on $\Omega$. Let $S\in L^2(\mu_\Omega;\mathbb{R}^n)$ be such that $\nu := S_\sharp\mu_\Omega$ gives zero measure to c − c hypersurfaces. Then there exist unique
$$s\in\mathcal{S}(\Omega) := \{s : \Omega\to\Omega \text{ Borel map with } s_\sharp\mu_\Omega = \mu_\Omega\}$$
and $\nabla\varphi$ with $\varphi$ convex, such that $S = (\nabla\varphi)\circ s$. Moreover, $s$ is the unique minimizer of
$$\int\|S-\tilde s\|^2\,d\mu_\Omega$$
among all $\tilde s\in\mathcal{S}(\Omega)$.
Before proving the polar factorization, let us formally see how it generalizes the Helmholtz decomposition. For this purpose, suppose that $\Omega$ (and everything else) is smooth. Let $u : \Omega\to\mathbb{R}^n$ be a vector field and consider the polar factorization of $S_\varepsilon := \mathrm{id} + \varepsilon u$ with $|\varepsilon|$ small. Then we have the decomposition $S_\varepsilon = (\nabla\varphi_\varepsilon)\circ s_\varepsilon$ with
$$\nabla\varphi_\varepsilon = \mathrm{id} + \varepsilon v + o(\varepsilon) \quad\text{and}\quad s_\varepsilon = \mathrm{id} + \varepsilon w + o(\varepsilon),$$
so that, expanding the composition, $u = v + w + o(1)$. For the curl-free component, notice that since $\nabla\times(\nabla\varphi_\varepsilon) = 0$, we have $\nabla\times v = 0$ and thus $v$ is the gradient of some function $p$. On the other hand, since $s_\varepsilon$ is measure preserving, we have $\nabla\cdot(w\chi_\Omega) = 0$ in the sense of distributions, giving us the divergence-free component.
Proof of Theorem 1.35. By assumption $\int\|x\|^2\,d\mu_\Omega(x)<\infty$ and $\int\|x\|^2\,d\nu(x)<\infty$. We claim that
$$\inf_{\tilde s\in\mathcal{S}(\Omega)}\int\|S(x)-\tilde s(x)\|^2\,d\mu_\Omega(x) = \min_{\sigma\in\mathcal{A}(\mu_\Omega,\nu)}\int\|x-y\|^2\,d\sigma(x,y). \tag{1.1}$$
To see this, let $\sigma_{\tilde s} := (\tilde s,S)_\sharp\mu_\Omega\in\mathcal{A}(\mu_\Omega,\nu)$ for all $\tilde s\in\mathcal{S}(\Omega)$. This already gives that the left-hand side is at least the right-hand side in (1.1). Now, take $\bar\sigma\in\mathrm{Opt}(\mu_\Omega,\nu)$, which by Theorem 1.34 is unique. Moreover, by Theorem 1.34 we have
$$\bar\sigma = (\mathrm{id},\nabla\varphi)_\sharp\mu_\Omega = (\nabla\tilde\varphi,\mathrm{id})_\sharp\nu$$
for some convex functions $\varphi,\tilde\varphi$. Now we have $\nabla\varphi\circ\nabla\tilde\varphi(y) = y$ for $\nu$-almost every $y\in\mathbb{R}^n$. Define $s := \nabla\tilde\varphi\circ S$. Then $s_\sharp\mu_\Omega = \nabla\tilde\varphi_\sharp\nu = \mu_\Omega$, and thus $s\in\mathcal{S}(\Omega)$. Also $S = \nabla\varphi\circ s$, giving the polar factorization. Furthermore,
$$\int\|x-y\|^2\,d\sigma_s(x,y) = \int\|s(x)-S(x)\|^2\,d\mu_\Omega(x) = \int\|\nabla\tilde\varphi\circ S(x) - S(x)\|^2\,d\mu_\Omega(x) = \int\|\nabla\tilde\varphi(y)-y\|^2\,d\nu(y) = \min_{\sigma\in\mathcal{A}(\mu_\Omega,\nu)}\int\|x-y\|^2\,d\sigma(x,y),$$
giving the claimed equality in (1.1).
Finally, in order to see the uniqueness of the factorization, assume that $S = (\nabla\bar\varphi)\circ\bar s$ is another polar factorization of $S$. Since $(\nabla\bar\varphi)_\sharp\mu_\Omega = ((\nabla\bar\varphi)\circ\bar s)_\sharp\mu_\Omega = \nu$, the map $\nabla\bar\varphi$ is a transport map from $\mu_\Omega$ to $\nu$. Moreover, since $\bar\varphi$ is convex, $\nabla\bar\varphi$ is optimal. By the uniqueness of the optimal map, $\nabla\bar\varphi = \nabla\varphi$. □
As another application of optimal transport we give a short proof of the isoperimetric inequality in $\mathbb{R}^n$.
Theorem 1.36 (Isoperimetric inequality). Let $E\subset\mathbb{R}^n$ be open. Then
$$\mathcal{L}^n(E)^{1-\frac1n} \le \frac{P(E)}{n\,\mathcal{L}^n(B)^{\frac1n}},$$
where $B$ is the unit ball in $\mathbb{R}^n$ and $P(E)$ is the perimeter of the set $E$.
Proof. We will give the proof without paying too much attention to smoothness issues. Let the cost function be $c(x,y) = \|x-y\|^2$. Define
$$\mu := \frac{1}{\mathcal{L}^n(E)}\mathcal{L}^n|_E \quad\text{and}\quad \nu := \frac{1}{\mathcal{L}^n(B)}\mathcal{L}^n|_B,$$
and let $T : \mathbb{R}^n\to\mathbb{R}^n$ be the optimal transport map given by Theorem 1.34. By the change of variables formula, we have
$$\frac{1}{\mathcal{L}^n(E)} = \frac{\det(\nabla T(x))}{\mathcal{L}^n(B)} \quad\text{for all } x\in E.$$
Since $T$ is the gradient of a convex function, $\nabla T(x)$ is a symmetric matrix with nonnegative eigenvalues for every $x\in E$. Thus by the inequality for arithmetic and geometric means we have
$$(\det\nabla T(x))^{\frac1n} \le \frac{\nabla\cdot T(x)}{n} \quad\text{for all } x\in E.$$
Combining the previous two observations we get
$$\frac{1}{\mathcal{L}^n(E)^{\frac1n}} = \frac{(\det\nabla T(x))^{\frac1n}}{\mathcal{L}^n(B)^{\frac1n}} \le \frac{\nabla\cdot T(x)}{n\,\mathcal{L}^n(B)^{\frac1n}} \quad\text{for all } x\in E.$$
Integrating over $E$ and using the divergence theorem we get
$$\mathcal{L}^n(E)^{1-\frac1n} \le \frac{1}{n\,\mathcal{L}^n(B)^{\frac1n}}\int_E\nabla\cdot T(x)\,dx = \frac{1}{n\,\mathcal{L}^n(B)^{\frac1n}}\int_{\partial E}\langle T(x),v(x)\rangle\,d\mathcal{H}^{n-1}(x),$$
where $v : \partial E\to\mathbb{R}^n$ is the outer unit normal vector. Since $T(x)\in B$ for all $x\in E$, we have $\langle T(x),v(x)\rangle\le1$ for all $x\in\partial E$ and thus
$$\mathcal{L}^n(E)^{1-\frac1n} \le \frac{1}{n\,\mathcal{L}^n(B)^{\frac1n}}\int_{\partial E}\langle T(x),v(x)\rangle\,d\mathcal{H}^{n-1}(x) \le \frac{P(E)}{n\,\mathcal{L}^n(B)^{\frac1n}}. \qquad\square$$
As a third application of optimal transport we will prove the standard Sobolev inequality in $\mathbb{R}^n$.

Theorem 1.37 (Sobolev inequality). Let $1\le p<n$ and define $p^* := \frac{np}{n-p}$. Then there exists a constant $C>0$ depending only on $n$ and $p$ such that
$$\|f\|_{p^*} \le C\|\nabla f\|_p \quad\text{for all } f\in W^{1,p}(\mathbb{R}^n).$$
Proof. Let $n$ and $p$ be fixed. We may assume that $f\ge0$ and $\|f\|_{p^*} = 1$. Our aim is then to prove that $\|\nabla f\|_p\ge C$ for some constant $C$ independent of $f$. Let $g : \mathbb{R}^n\to\mathbb{R}$ be a smooth nonnegative function with $\|g\|_1 = 1$, and define
$$\mu := f^{p^*}\mathcal{L}^n \quad\text{and}\quad \nu := g\mathcal{L}^n.$$
Let $T$ be the optimal transport map from $\mu$ to $\nu$ (again with the cost given by the square of the distance).
The change of variables formula gives
$$f^{p^*}(x) = \det(\nabla T(x))\,g(T(x)) \quad\text{for all } x\in\mathbb{R}^n.$$
Hence
$$\int g^{1-\frac1n} = \int g^{-\frac1n}\,g = \int(g\circ T)^{-\frac1n}\,f^{p^*} = \int\det(\nabla T)^{\frac1n}\,(f^{p^*})^{1-\frac1n}.$$
As in the previous proof, we know that $T$ is the gradient of a convex function and thus $\nabla T(x)$ is a symmetric matrix with nonnegative eigenvalues, and thus by the inequality for arithmetic and geometric means we have
$$(\det\nabla T(x))^{\frac1n} \le \frac{\nabla\cdot T(x)}{n} \quad\text{for all } x.$$
Therefore, integrating by parts,
$$\int g^{1-\frac1n} \le \frac1n\int\nabla\cdot T\,(f^{p^*})^{1-\frac1n} = -\frac{p^*}{n}\left(1-\frac1n\right)\int f^{\frac{p^*}{q}}\,T\cdot\nabla f,$$
where $\frac1p+\frac1q = 1$. By Hölder's inequality we finally get
$$\int g^{1-\frac1n} \le \frac{p^*}{n}\left(1-\frac1n\right)\left(\int f^{p^*}|T|^q\right)^{\frac1q}\left(\int|\nabla f|^p\right)^{\frac1p} = \frac{p^*}{n}\left(1-\frac1n\right)\left(\int g(y)|y|^q\,dy\right)^{\frac1q}\left(\int|\nabla f|^p\right)^{\frac1p},$$
giving the claim. □
2. $L^p$ transportation distances in metric spaces
Let us now turn to optimal mass transportation in more general metric spaces.

Definition 2.1. Let $(X,d)$ be a complete and separable metric space and $1\le p<\infty$. We define the $L^p$ transportation distance (Wasserstein distance, Kantorovich–Rubinstein distance, ...) $W_p$ between two measures $\mu,\nu\in\mathcal{P}(X)$ as
$$W_p(\mu,\nu) := \inf_{\sigma\in\mathcal{A}(\mu,\nu)}\left(\int_{X\times X}d^p(x,y)\,d\sigma(x,y)\right)^{\frac1p}.$$
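On the real line there is a handy special case (a standard fact, not proven in these notes): the optimal plan for any convex cost pairs the points in increasing order, so for equally weighted atoms $W_p$ is computed by matching sorted points. A self-contained sketch of ours:

```python
import numpy as np

def wasserstein_p_1d(xs, ys, p=2.0):
    # For n equally weighted atoms on (R, |.|), the optimal plan pairs the
    # sorted points: W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
    xs, ys = np.sort(xs), np.sort(ys)
    assert len(xs) == len(ys)
    return float((np.abs(xs - ys) ** p).mean() ** (1.0 / p))

# W_2 between two four-point measures on R, each atom of mass 1/4:
print(wasserstein_p_1d(np.array([0.0, 1.0, 2.0, 3.0]),
                       np.array([0.5, 1.5, 2.5, 3.5])))  # 0.5
```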
Remarks 2.2. (i) The definition of the distance makes sense without knowing the existence of a minimizer in the definition of $W_p(\mu,\nu)$. However, the existence follows as in the Euclidean case.

(ii) The function $W_p : \mathcal{P}(X)\times\mathcal{P}(X)\to[0,\infty]$ is typically not a distance because it may attain the value $+\infty$. For example, in $\mathbb{R}$ if $\mu = \delta_0$ and $\nu = \sum_{i=1}^\infty2^{-i}\delta_{2^i}$, we have
$$W_1(\mu,\nu) = \sum_{i=1}^\infty 2^{-i}2^i = \infty.$$
In order to have a finite distance, one restricts the function $W_p$ to a subset $\mathcal{P}_{\mu,p}(X)\subset\mathcal{P}(X)$ defined as
$$\mathcal{P}_{\mu,p}(X) := \{\nu\in\mathcal{P}(X) : W_p(\mu,\nu)<\infty\}$$
for any $\mu\in\mathcal{P}(X)$. The most commonly used subset is $\mathcal{P}_p(X) := \mathcal{P}_{\delta_x,p}(X)$ for some $x\in X$. By the triangle inequality, this definition is independent of the point $x$.
Theorem 2.3. Let $(X,d)$ be a complete separable metric space and $1\le p<\infty$. Then $(\mathcal{P}_p(X),W_p)$ is a metric space.

Proof. Obviously $W_p(\mu,\mu) = 0$ and $W_p(\mu,\nu) = W_p(\nu,\mu)$. Assume that $W_p(\mu,\nu) = 0$. Then there exists an optimal plan $\sigma\in\mathrm{Opt}(\mu,\nu)$ such that $\int d^p(x,y)\,d\sigma(x,y) = 0$, meaning that $x = y$ for $\sigma$-a.e. $(x,y)\in X\times X$. Thus $\mu = \pi^1_\sharp\sigma = \pi^2_\sharp\sigma = \nu$.
We still need to show that the triangle inequality is satisfied. The triangle inequality will also imply that
$$W_p(\mu,\nu) \le W_p(\mu,\delta_x) + W_p(\delta_x,\nu) < \infty.$$
Let $\mu_1,\mu_2,\mu_3\in\mathcal{P}_p(X)$. Take $\sigma_{1,2}\in\mathrm{Opt}(\mu_1,\mu_2)$ and $\sigma_{2,3}\in\mathrm{Opt}(\mu_2,\mu_3)$, where the optimality is measured with the cost $d^p(x,y)$. We want to construct an admissible transport $\sigma_{1,3}\in\mathcal{A}(\mu_1,\mu_3)$ by gluing together $\sigma_{1,2}$ and $\sigma_{2,3}$. The gluing can be done by using the disintegration theorem: write
$$d\sigma_{1,2}(x,y) = d\mu_2(y)\,d\sigma^y_{1,2}(x) \quad\text{and}\quad d\sigma_{2,3}(y,z) = d\mu_2(y)\,d\sigma^y_{2,3}(z),$$
and define
$$d\sigma_{1,3}(x,z) = \int d(\sigma^y_{1,2}\times\sigma^y_{2,3})(x,z)\,d\mu_2(y)$$
by integrating over $y$. Now, by the triangle inequality in $X$ and the Minkowski inequality,
$$\begin{aligned}
W_p(\mu_1,\mu_3) &\le \left(\int d^p(x,z)\,d\sigma_{1,3}(x,z)\right)^{\frac1p} = \left(\iint d^p(x,z)\,d(\sigma^y_{1,2}\times\sigma^y_{2,3})(x,z)\,d\mu_2(y)\right)^{\frac1p} \\
&\le \left(\iint d^p(x,y)\,d(\sigma^y_{1,2}\times\sigma^y_{2,3})(x,z)\,d\mu_2(y)\right)^{\frac1p} + \left(\iint d^p(y,z)\,d(\sigma^y_{1,2}\times\sigma^y_{2,3})(x,z)\,d\mu_2(y)\right)^{\frac1p} \\
&= \left(\iint d^p(x,y)\,d\sigma^y_{1,2}(x)\,d\mu_2(y)\right)^{\frac1p} + \left(\iint d^p(y,z)\,d\sigma^y_{2,3}(z)\,d\mu_2(y)\right)^{\frac1p} \\
&= \left(\int d^p(x,y)\,d\sigma_{1,2}(x,y)\right)^{\frac1p} + \left(\int d^p(y,z)\,d\sigma_{2,3}(y,z)\right)^{\frac1p} \\
&= W_p(\mu_1,\mu_2) + W_p(\mu_2,\mu_3). \qquad\square
\end{aligned}$$
Let us next look at the topology given by the Wp distance.
Theorem 2.4. Let $p\in[1,\infty)$ and $(X,d)$ complete and separable. Then for $\mu_i,\mu\in\mathcal{P}_p(X)$ we have $W_p(\mu_i,\mu)\to0$ if and only if $\mu_i\to\mu$ weakly and
$$\int_X d^p(x,x_0)\,d\mu_i(x) \to \int_X d^p(x,x_0)\,d\mu(x) \quad\text{for some } x_0\in X.$$
Proof. We will prove the claim only in the simple case where $(X,d)$ is proper (i.e. closed balls are compact). Assume first that $W_p(\mu_i,\mu)\to0$. Then
$$\left|\left(\int d^p(x,x_0)\,d\mu_i(x)\right)^{\frac1p} - \left(\int d^p(x,x_0)\,d\mu(x)\right)^{\frac1p}\right| = |W_p(\mu_i,\delta_{x_0}) - W_p(\mu,\delta_{x_0})| \le W_p(\mu_i,\mu) \to 0.$$
Since
$$\mu_i(X\setminus B(x_0,R)) \le \int_{X\setminus B(x_0,R)}\frac{d^p(x,x_0)}{R^p}\,d\mu_i(x) \le \frac{1}{R^p}W^p_p(\mu_i,\delta_{x_0}),$$
the set of measures $\{\mu_i\}$ is tight. Thus it suffices to check the weak* convergence. Since Lipschitz functions are dense in $C_c(X)$ with respect to uniform convergence, it suffices to check the convergence against Lipschitz functions $f$. Let $\sigma_i\in\mathrm{Opt}(\mu_i,\mu)$, for the cost $d^p(x,y)$, and estimate
$$\begin{aligned}
\left|\int_X f(x)\,d\mu_i(x) - \int_X f(y)\,d\mu(y)\right| &= \left|\int_{X\times X}\left(f(x)-f(y)\right)d\sigma_i(x,y)\right| \\
&\le \int_{X\times X}|f(x)-f(y)|\,d\sigma_i(x,y) \\
&\le \int_{X\times X}\mathrm{Lip}(f)\,d(x,y)\,d\sigma_i(x,y) \\
&\le \mathrm{Lip}(f)\left(\int_{X\times X}d^p(x,y)\,d\sigma_i(x,y)\right)^{\frac1p} = \mathrm{Lip}(f)\,W_p(\mu_i,\mu) \to 0.
\end{aligned}$$
Let us then prove the converse direction. Suppose the claim is not true. Then there exist a subsequence of $(\mu_i)$, still denoted by $(\mu_i)$, and $\varepsilon>0$ such that
$$W^p_p(\mu_i,\mu) \ge \varepsilon \quad\text{for all } i.$$
Now take $R>0$ such that
$$\int_{X\setminus B(x_0,R)}d^p(x,x_0)\,d\mu_i(x) < \frac{\varepsilon}{3\cdot2^{p+1}} \quad\text{for all } i.$$
Notice that
$$d^p(x,y) \le 2^p\left(d^p(x,x_0) + d^p(y,x_0)\right).$$
Let $\sigma_i\in\mathrm{Opt}(\mu_i,\mu)$. Since $\{\mu_i\}$ is tight, so is $\{\sigma_i\}$. Therefore there exists a subsequence converging weakly to some $\sigma$. Now along this subsequence
$$\varepsilon \le W^p_p(\mu_i,\mu) = \int_{X\times X}d^p(x,y)\,d\sigma_i(x,y) \le \int_{B(x_0,R)\times B(x_0,R)}d^p(x,y)\,d\sigma_i(x,y) + \frac{2\varepsilon}{3} \to \int_{B(x_0,R)\times B(x_0,R)}d^p(x,y)\,d\sigma(x,y) + \frac{2\varepsilon}{3},$$
which is a contradiction provided that we can show $\sigma\in\mathrm{Opt}(\mu,\mu)$. Since clearly $\sigma\in\mathcal{A}(\mu,\mu)$, we only need to show optimality. This is seen using cyclical monotonicity: since the $\sigma_i$ are optimal, their supports are cyclically monotone sets. Suppose $(x_j,y_j)\in\operatorname{spt}(\sigma)$ for $j = 1,\dots,N$. Since $\sigma_i\to\sigma$ weakly, there exist $(x^i_j,y^i_j)\in\operatorname{spt}(\sigma_i)$ such that
$$d(x^i_j,x_j),\ d(y^i_j,y_j)\to0 \quad\text{as } i\to\infty.$$
Thus by the continuity of the cost function $d^p$ and the cyclical monotonicity of $\operatorname{spt}(\sigma_i)$ we have that no permutation of the pairs $(x_j,y_j)$ lowers the cost for $\sigma$. Thus $\operatorname{spt}(\sigma)$ is cyclically monotone and hence $\sigma$ is optimal. □
Theorem 2.5. Let $p\in[1,\infty)$ and $(X,d)$ complete and separable. Then $(\mathcal{P}_p(X),W_p)$ is complete and separable.
Proof. Let us again prove only the simple case with $(X,d)$ proper. Let us first prove completeness. Let $(\mu_i)\subset\mathcal{P}_p(X)$ be a Cauchy sequence. Since
$$W_p(\mu_i,\delta_{x_0}) \le W_p(\mu_i,\mu_1) + W_p(\mu_1,\delta_{x_0}),$$
the sequence $(\mu_i)$ is tight. Hence there exists a subsequence converging weakly to a measure $\mu\in\mathcal{P}(X)$. For $n\in\mathbb{N}$ consider the sequence $(\sigma_i)$ of measures $\sigma_i\in\mathrm{Opt}(\mu_i,\mu_n)$, which is tight by the tightness of $(\mu_i)$. Hence there exists a subsequence weakly converging to $\sigma\in\mathcal{A}(\mu,\mu_n)$. Using this limit we get
$$W^p_p(\mu,\mu_n) \le \int_{X\times X}d^p(x,y)\,d\sigma(x,y) \le \liminf_{i\to\infty}\int_{X\times X}d^p(x,y)\,d\sigma_i(x,y) = \liminf_{i\to\infty}W^p_p(\mu_i,\mu_n).$$
Hence
$$\limsup_{n\to\infty}W_p(\mu,\mu_n) \le \limsup_{n\to\infty}\liminf_{i\to\infty}W_p(\mu_i,\mu_n) = 0.$$
Therefore $(\mathcal{P}_p(X),W_p)$ is complete.
Let $(x_i)_{i=1}^\infty\subset X$ be dense in $(X,d)$. Then
$$D := \left\{\sum_{j=1}^N a_j\delta_{x_j} : N\in\mathbb{N},\ \sum_{j=1}^N a_j = 1,\ a_j\in[0,1]\cap\mathbb{Q}\right\}$$
is dense in $(\mathcal{P}_p(X),W_p)$. To see this, take $\mu\in\mathcal{P}_p(X)$ and $\varepsilon>0$. Let $R>0$ be such that
$$\int_{X\setminus B(x_1,R)}d^p(x,x_1)\,d\mu(x) < \varepsilon^p.$$
Since $\bar B(x_1,R)$ is compact, there exists a finite set of points $\{y_j\}_{j=1}^N\subset\{x_i\}$ such that
$$B(x_1,R)\subset\bigcup_{j=1}^N B(y_j,\varepsilon).$$
Define inductively
$$A_j = B_j\setminus\bigcup_{k=1}^{j-1}B_k$$
for $j = 1,\dots,N+1$, with $B_j := B(y_j,\varepsilon)$ for $j\le N$ and $B_{N+1} := X$. By perturbing the weights a tiny amount, we may assume that $\mu(A_j)\in\mathbb{Q}$. Using the sets $A_j$ we define a measure $\nu\in D$ as
$$\nu := \sum_{j=1}^{N+1}\mu(A_j)\delta_{y_j},$$
with $y_{N+1} := x_1$. Now
$$W^p_p(\mu,\nu) \le \sum_{j=1}^{N+1}\int_{A_j}d^p(x,y_j)\,d\mu(x) \le \sum_{j=1}^N\int_{A_j}\varepsilon^p\,d\mu(x) + \varepsilon^p \le \int_X\varepsilon^p\,d\mu(x) + \varepsilon^p = 2\varepsilon^p.$$
Thus $(\mathcal{P}_p(X),W_p)$ is separable. □
Let us note that local compactness of (X, d) does not imply that (Pp (X), Wp ) is locally
compact:
Example 2.6. Take $\varepsilon>0$ and define a sequence of measures
$$\mu_n := \left(1 - \frac{\varepsilon^p}{(n-1)^p}\right)\delta_1 + \frac{\varepsilon^p}{(n-1)^p}\delta_n \in \mathcal{P}_p(\mathbb{N}).$$
Now
$$W^p_p(\mu_n,\delta_1) = \frac{\varepsilon^p}{(n-1)^p}(n-1)^p = \varepsilon^p,$$
showing on one hand that $(\mu_n)_{n\in\mathbb{N}}\subset\bar B(\delta_1,\varepsilon)$, but also that $\bar B(\delta_1,\varepsilon)$ is not a compact subset of $(\mathcal{P}_p(X),W_p)$. This can be seen from the fact that $\mu_n$ converges weakly to $\delta_1$, but not with respect to $W_p$.
2.1. Geodesic spaces. Let us now consider geodesic (complete, separable) metric spaces. Let us first recall some definitions.

Definition 2.7. Let $(X,d)$ be a metric space. A curve $\gamma : [0,1]\to X$ is called a (constant speed) geodesic if
$$d(\gamma_t,\gamma_s) = |t-s|\,d(\gamma_0,\gamma_1) \quad\text{for all } t,s\in[0,1],$$
where $\gamma_t := \gamma(t)$. We denote the set of all geodesics in $X$ by $\mathrm{Geo}(X)$.
The space $(X,d)$ is called geodesic if for every $x,y\in X$ there exists $\gamma\in\mathrm{Geo}(X)$ with $\gamma_0 = x$ and $\gamma_1 = y$.

We equip $\mathrm{Geo}(X)$ with the supremum distance
$$d(\gamma,\gamma') = \sup_{t\in[0,1]}d(\gamma_t,\gamma'_t).$$
To ease the notation, we define the evaluation map $e_t : \mathrm{Geo}(X)\to X : \gamma\mapsto\gamma_t$.
Notice first that for any metric space (X, d) the space (P1 (X), W1 ) is geodesic: For any
pair µ0 , µ1 ∈ P1 (X) we have a geodesic µt := tµ1 + (1 − t)µ0 .
For Wp with p > 1 the situation is different. We will prove the following.
Theorem 2.8. Let $(X,d)$ be complete, separable and geodesic. Then $(\mathcal{P}_p(X),W_p)$ is geodesic.
We will again only show the case with (X, d) proper. In order to obtain a geodesic of Borel
measures in (Pp (X), Wp ) we need a measurable selection theorem.
Theorem 2.9 (Theorem 6.9.6 in [1]). Let X and Y be complete and separable metric spaces
and Γ ∈ B(X × Y ). Suppose that Γx := {y ∈ Y : (x, y) ∈ Γ} is nonempty and σ-compact for
all x ∈ X. Then Γ contains the graph of some Borel mapping f : X → Y .
Let us use Theorem 2.9 to measurably select the geodesics.
Lemma 2.10. Let $(X,d)$ be a proper metric space. Then there exists a Borel map $S : X\times X\to\mathrm{Geo}(X)$ such that $S(x,y)_0 = x$ and $S(x,y)_1 = y$.
Proof. We need to show that for every $x,y\in X$ the set
$$\Gamma_{x,y} := \{\gamma\in\mathrm{Geo}(X) : \gamma_0 = x,\ \gamma_1 = y\}$$
is σ-compact and that the set
$$\Gamma = \bigcup_{x,y\in X}\{(x,y)\}\times\Gamma_{x,y}$$
is Borel.
Let $x,y\in X$. Since $(X,d)$ is proper, the set $\bar B(x,d(x,y))$ is compact. Now take a sequence $(\gamma^i)\subset\Gamma_{x,y}$. For $n\in\mathbb{N}$ the set $\bar B(x,d(x,y))$ is covered by finitely many balls of radius $d(x,y)/n$. On the other hand, for any $t,s\in[0,1]$ we have $d(\gamma^i_s,\gamma^i_t) < d(x,y)/n$ if $|t-s| < 1/n$. Hence $(\gamma^i)$ is an equicontinuous sequence in a compact set, and therefore it has a subsequence converging uniformly to some $\gamma : [0,1]\to X$. It is easy to check that also $\gamma\in\Gamma_{x,y}$. Hence $\Gamma_{x,y}$ is compact.

Similarly, if $\gamma\in\bar\Gamma$, it is easy to check that $\gamma\in\Gamma$. Hence $\Gamma$ is closed and thus Borel. Now we are in the position to use Theorem 2.9 to make the claimed Borel selection. □
Proof of Theorem 2.8. Let $\mu_0,\mu_1\in\mathcal{P}_p(X)$ and $\sigma\in\mathrm{Opt}(\mu_0,\mu_1)$. Let $S : X\times X\to\mathrm{Geo}(X)$ be the Borel map given by Lemma 2.10. Now define $\nu := S_\sharp\sigma\in\mathcal{P}_p(\mathrm{Geo}(X))$ and $\mu_t := (e_t)_\sharp\nu$ for all $t\in[0,1]$. We claim that $(\mu_t)$ is a geodesic connecting $\mu_0$ to $\mu_1$. Since $(e_0,e_1)\circ S = \mathrm{id}$, the measures $\mu_0$ and $\mu_1$ are indeed given as claimed. Now
$$W^p_p(\mu_s,\mu_t) \le \int_{\mathrm{Geo}(X)}d^p(\gamma_s,\gamma_t)\,d\nu(\gamma) = |t-s|^p\int_{\mathrm{Geo}(X)}d^p(\gamma_0,\gamma_1)\,d\nu(\gamma) = |t-s|^p\int_{X\times X}d^p(x,y)\,d\sigma(x,y) = |t-s|^p\,W^p_p(\mu_0,\mu_1).$$
Hence $W_p(\mu_s,\mu_t) \le |t-s|\,W_p(\mu_0,\mu_1)$; the reverse inequality follows from the triangle inequality, so $(\mu_t)$ is a constant speed geodesic. □
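In $\mathbb{R}^n$ the geodesic of the proof is the familiar displacement interpolation: each piece of mass of an optimal plan travels along the straight segment from $x$ to $y$. A small sketch of our own for discrete plans:

```python
import numpy as np

def displacement_interpolation(plan, X, Y, t):
    # mu_t = ((1 - t) e_0 + t e_1)_# nu for the plan: the mass plan[i, j]
    # sits at the point (1 - t) * X[i] + t * Y[j] at time t.
    idx = np.argwhere(plan > 0)
    points = np.array([(1 - t) * X[i] + t * Y[j] for i, j in idx])
    weights = np.array([plan[i, j] for i, j in idx])
    return points, weights

X = np.array([[0.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
plan = np.array([[0.5, 0.5]])  # split the Dirac at the origin in two
for t in (0.0, 0.5, 1.0):
    print(t, *displacement_interpolation(plan, X, Y, t))
```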
In the previous proof we noticed that we actually gave the geodesic $(\mu_t)\in\mathrm{Geo}(\mathcal{P}_p(X))$ using a measure on the space of geodesics. In fact, we can always "lift" a given geodesic $(\mu_t)\in\mathrm{Geo}(\mathcal{P}_p(X))$ to a measure on geodesics. This is stated in the following theorem.

Theorem 2.11. Let $p>1$ and $(X,d)$ be separable, complete and geodesic. Then for any geodesic $(\mu_t)\in\mathrm{Geo}(\mathcal{P}_p(X))$ there exists a measure $\pi\in\mathcal{P}_p(\mathrm{Geo}(X))$ such that $\mu_t = (e_t)_\sharp\pi$ for all $t\in[0,1]$.
Proof. The measure on geodesics is built inductively. First we find, similarly as in the proof of Theorem 2.8, measures $\nu_{0\to\frac12},\nu_{\frac12\to1}\in\mathcal{P}_p(\mathrm{Geo}(X))$ such that $(e_0,e_1)_\sharp\nu_{0\to\frac12}\in\mathrm{Opt}(\mu_0,\mu_{\frac12})$ and $(e_0,e_1)_\sharp\nu_{\frac12\to1}\in\mathrm{Opt}(\mu_{\frac12},\mu_1)$. Next we glue these measures together via the disintegration theorem: write
$$d\nu_{0\to\frac12}(\gamma) = d\mu_{\frac12}(x)\,d\nu^x_{0\to\frac12}(\gamma) \quad\text{and}\quad d\nu_{\frac12\to1}(\gamma) = d\mu_{\frac12}(x)\,d\nu^x_{\frac12\to1}(\gamma),$$
with $\nu^x_{0\to\frac12}$ concentrated on geodesics ending at $x$ and $\nu^x_{\frac12\to1}$ concentrated on geodesics starting from $x$.
Now define $\nu^1\in\mathcal{P}(C([0,1];X))$ as
$$d\nu^1(\gamma) = d\mu_{\frac12}(x)\,d\nu^x(\gamma)$$
with
$$d\nu^x(\gamma) = d\nu^x_{0\to\frac12}\big(\mathrm{restr}^{1/2}_0(\gamma)\big)\times d\nu^x_{\frac12\to1}\big(\mathrm{restr}^1_{1/2}(\gamma)\big),$$
where $\mathrm{restr}^{t_2}_{t_1}(\gamma) = \gamma'$ with $\gamma'_s = \gamma_{(1-s)t_1+st_2}$.
Now by the triangle inequality for $L^p$, we get
$$\begin{aligned}
W_p(\mu_0,\mu_1) &\le \left(\int_{C([0,1];X)}d^p(\gamma_0,\gamma_1)\,d\nu^1(\gamma)\right)^{\frac1p} \\
&\le \left(\int_{C([0,1];X)}d^p(\gamma_0,\gamma_{\frac12})\,d\nu^1(\gamma)\right)^{\frac1p} + \left(\int_{C([0,1];X)}d^p(\gamma_{\frac12},\gamma_1)\,d\nu^1(\gamma)\right)^{\frac1p} \\
&= \left(\int_{\mathrm{Geo}(X)}d^p(\gamma_0,\gamma_1)\,d\nu_{0\to\frac12}(\gamma)\right)^{\frac1p} + \left(\int_{\mathrm{Geo}(X)}d^p(\gamma_0,\gamma_1)\,d\nu_{\frac12\to1}(\gamma)\right)^{\frac1p} \\
&= \frac12W_p(\mu_0,\mu_1) + \frac12W_p(\mu_0,\mu_1) = W_p(\mu_0,\mu_1),
\end{aligned}$$
showing that the inequalities are actually equalities. Since also
$$\frac12W_p(\mu_0,\mu_1) \le \left(\int_{C([0,1];X)}d^p(\gamma_0,\gamma_{\frac12})\,d\nu^1(\gamma)\right)^{\frac1p} \le \max\left\{\left(\int_{C([0,1];X)}d^p(\gamma_0,\gamma_{\frac12})\,d\nu^1(\gamma)\right)^{\frac1p},\ \left(\int_{C([0,1];X)}d^p(\gamma_{\frac12},\gamma_1)\,d\nu^1(\gamma)\right)^{\frac1p}\right\} = \frac12W_p(\mu_0,\mu_1),$$
this implies that for $\nu^1$-a.e. $\gamma\in C([0,1];X)$ we have $d(\gamma_0,\gamma_{\frac12}) = \frac12d(\gamma_0,\gamma_1)$ and $d(\gamma_{\frac12},\gamma_1) = \frac12d(\gamma_0,\gamma_1)$. Thus $\nu^1$ is concentrated on $\mathrm{Geo}(X)$.

Now, using the above procedure we can define for every $n\in\mathbb{N}$ a measure $\nu^n\in\mathcal{P}(\mathrm{Geo}(X))$ with the property that
$$(e_{k2^{-n}})_\sharp\nu^n = \mu_{k2^{-n}} \quad\text{for all } k\in\{0,1,\dots,2^n\}.$$
What is left to show is that $(\nu^n)$ converges to the measure we were looking for. Since the space $(X,d)$ is assumed to be proper, the set $\mathrm{Geo}(\bar B(x_0,r))$ is compact. Hence the tightness of $(\nu^n)$ follows from the tightness of $\{\mu_0,\mu_1\}$. Thus there is a subsequence of $(\nu^n)$ converging weakly to a measure $\pi\in\mathcal{P}(\mathrm{Geo}(X))$. For $t = k2^{-n}$, $k,n\in\mathbb{N}$, the equality $(e_t)_\sharp\pi = \mu_t$ is obvious. For other $t\in[0,1]$ the equality holds since for all $n\in\mathbb{N}$
$$W_p((e_t)_\sharp\nu^m,\mu_{k2^{-n}}) \le 2^{-n}W_p(\mu_0,\mu_1) \quad\text{for all } m\ge n,$$
with suitably chosen $k\in\mathbb{N}$ depending on $n$ and $t$. □
Also the converse of Theorem 2.8 holds.
Theorem 2.12. Suppose that (X, d) is complete and separable, p > 1, and (Pp (X), Wp ) is
geodesic. Then also (X, d) is geodesic.
Proof. Take $x,y\in X$ and $(\mu_t)\in\mathrm{Geo}(\mathcal{P}_p(X))$ connecting $\delta_x$ to $\delta_y$. Since
$$W_p(\mu_{\frac12},\delta_x) = W_p(\mu_{\frac12},\delta_y) = \frac12W_p(\delta_x,\delta_y) = \frac12d(x,y),$$
we have
$$\begin{aligned}
2^{1-p}d^p(x,y) &= W^p_p(\mu_{\frac12},\delta_x) + W^p_p(\mu_{\frac12},\delta_y) = \int_X d^p(x,z)\,d\mu_{\frac12}(z) + \int_X d^p(y,z)\,d\mu_{\frac12}(z) \\
&\ge \int_X\left(d^p(x,z) + (d(x,y)-d(x,z))^p\right)d\mu_{\frac12}(z) \\
&\ge \int_X 2^{1-p}d^p(x,y)\,d\mu_{\frac12}(z) = 2^{1-p}d^p(x,y),
\end{aligned}$$
where the inequalities are thus equalities. Hence $\mu_{\frac12}$ is concentrated on the set
$$\left\{z\in X : d(x,z) = d(y,z) = \frac12d(x,y)\right\}.$$
In particular this set is nonempty. Hence there exists $x_{\frac12}\in X$ with
$$d(x,x_{\frac12}) = d(y,x_{\frac12}) = \frac12d(x,y).$$
Taking inductively midpoints between $x$ and $x_{\frac12}$, between $x_{\frac12}$ and $y$, and so on, we obtain a dense set of points on a "geodesic". By completeness this gives the geodesic between $x$ and $y$. □
Notice that if $p = 1$, the conclusion from the chain of (in)equalities in the previous proof would only be that $\mu_{\frac12}$ is concentrated on the set
$$\{z\in X : d(x,z) + d(y,z) = d(x,y)\}.$$
In particular, it could be that $\mu_{\frac12} = \frac12(\delta_x+\delta_y)$.
Definition 2.13. A geodesic space $(X,d)$ is called nonbranching if for any $\gamma^1,\gamma^2\in\mathrm{Geo}(X)$, the conditions $\gamma^1_0 = \gamma^2_0$ and $\gamma^1_t = \gamma^2_t$ for some $t\in(0,1)$ imply $\gamma^1 = \gamma^2$.
Let us first observe that (P_p(X), W_p) inherits the nonbranching property of (X, d), if p > 1. In the case p = 1, nontrivial transports can branch even in nonbranching geodesic spaces: the two curves

γ^1_t := δ_t   for all t ∈ [0, 1],

and

γ^2_t := δ_t, if t ∈ [0, 1/2],   and   γ^2_t := (2 − 2t)δ_{1/2} + (2t − 1)δ_1, if t ∈ [1/2, 1],

are both geodesics in (P(R), W_1). They start as the same geodesic and then branch at t = 1/2. Thus (P(R), W_1) is branching (i.e. not nonbranching).
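This is easy to check numerically. Below is a minimal sketch (the helper name and the sampled time pairs are mine) using SciPy's one-dimensional W_1 between weighted point masses; it confirms that both branches of γ^2 move at unit speed:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1D W_1 between weighted samples

def gamma2(t):
    """Atoms and weights of gamma^2_t from the example above."""
    if t <= 0.5:
        return [t], [1.0]
    return [0.5, 1.0], [2.0 - 2.0 * t, 2.0 * t - 1.0]

# Unit-speed check: W_1(gamma^2_s, gamma^2_t) should equal |t - s|,
# so gamma^2 is a geodesic that agrees with gamma^1_t = delta_t on [0, 1/2].
for s, t in [(0.0, 0.25), (0.25, 0.75), (0.5, 0.9), (0.1, 0.9)]:
    xs, ws = gamma2(s)
    xt, wt = gamma2(t)
    print(s, t, wasserstein_distance(xs, xt, ws, wt), abs(t - s))
```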
Theorem 2.14. Let (X, d) be a complete, separable, geodesic, nonbranching metric space and
1 < p < ∞. Then (Pp (X), Wp ) is also nonbranching.
Proof. Suppose that (Pp (X), Wp ) is branching. Thus there exist Γ, Γ′ ∈ Geo(Pp (X)) and
t ∈ (0, 1) such that Γ0 = Γ′0 and Γt = Γ′t , but still Γ 6= Γ′ meaning that there exists s ∈ (0, 1]
such that Γs 6= Γ′s . We may assume t < s.
Let π_1, π_2 ∈ P(Geo(X)) be such that (e_r)_♯π_1 = Γ_{(1−r)t+rs} and (e_r)_♯π_2 = Γ′_{(1−r)t+rs} for all r ∈ [0, 1]. Disintegrating both π_1 and π_2 with respect to e_0 (i.e. with respect to Γ_t = Γ′_t), we get two sets of measures

dπ_i(γ) = dΓ_t(x) dπ_i^x(γ),   i = 1, 2.
On a set of x ∈ X of positive Γ_t-measure we have π_1^x ≠ π_2^x, since Γ_s ≠ Γ′_s.
For the beginning of the geodesics, consider π_3 ∈ P(Geo(X)) such that (e_r)_♯π_3 = Γ_{rt} for all r ∈ [0, 1]. Now disintegrating π_3 with respect to e_1 gives

dπ_3(γ) = dΓ_t(x) dπ_3^x(γ).

For Γ_t-almost every x ∈ X the gluing of π_3^x and π_1^x lives on geodesics. The same holds for the gluing of π_3^x and π_2^x, since Γ′_0 = Γ_0 and Γ′_t = Γ_t. All in all, for x ∈ X in a set of positive Γ_t-measure we obtain branching geodesics going via x, contradicting the nonbranching assumption on (X, d).
Let us next investigate how the nonbranching assumption is connected to the existence of
optimal transport maps. We start with the observation that any inner point of a geodesic in
Pp (X) in a nonbranching X has an optimal transport map to the endpoint of the geodesic.
Theorem 2.15. Let (X, d) be a complete, separable, geodesic, nonbranching metric space
and 1 < p < ∞. Let (µt ) ∈ Geo(Pp (X)). Then for every t ∈ (0, 1) there exists a unique
σ ∈ Opt(µt , µ1 ) and it is induced by a map.
Proof. Suppose this is not the case. Let σ ∈ Opt(µt , µ1 ) be such that it is not induced by
a map. Then there exist x, y1 , y2 ∈ X such that y1 6= y2 and (x, y1 ), (x, y2 ) ∈ spt(σ). Let
σ ′ ∈ Opt(µ0 , µt ) and z ∈ X such that (z, x) ∈ spt(σ ′ ). Gluing these optimal transports
together at µt gives an optimal plan σ ′′ ∈ Opt(µ0 , µ1 ) with (z, y1 ), (z, y2 ) ∈ spt(σ ′′ ). Now
z, x, y1 lie on the same geodesic as well as z, x, y2 , as otherwise µt would not be on a geodesic
connecting µ0 and µ1 . Thus we have a contradiction with the nonbranching of (X, d).
In the Euclidean case X = Rn with p = 2 we can say more about the intermediate transport
maps of Theorem 2.15:
Theorem 2.16. Let (µt ) ∈ Geo(P2 (Rn )). Then for any t ∈ (0, 1) the unique σ ∈ Opt(µt , µ1 )
is induced by a 1/t-Lipschitz map.
Proof. For any µ_0, µ_1 ∈ P_2(R^n) and σ ∈ Opt(µ_0, µ_1), by the uniqueness of geodesics in R^n there exists only one π ∈ P(Geo(R^n)) such that (e_0, e_1)_♯π = σ. The corresponding geodesic in P_2(R^n) is

µ_t = ((1 − t)e_0 + te_1)_♯π = ((1 − t)P_1 + tP_2)_♯σ.

Thus the unique transport plan from µ_t to µ_1 is given by

((1 − t)P_1 + tP_2, P_2)_♯σ.

This plan is supported on a set

G := {((1 − t)x + ty, y) : y ∈ ∂^−ϕ(x)}

with ϕ convex. Recall that for a convex function ϕ and any (x_1, y_1), (x_2, y_2) ∈ ∂^−ϕ we have

⟨y_1 − y_2, x_1 − x_2⟩ ≥ 0.

Therefore, for ((1 − t)x_1 + ty_1, y_1), ((1 − t)x_2 + ty_2, y_2) ∈ G we have

|(1 − t)x_1 + ty_1 − ((1 − t)x_2 + ty_2)|² = (1 − t)²|x_1 − x_2|² + t²|y_1 − y_2|² + 2t(1 − t)⟨x_1 − x_2, y_1 − y_2⟩ ≥ t²|y_1 − y_2|².

Hence G is a subset of the graph of a 1/t-Lipschitz map, giving the claim.
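As a quick numeric illustration of the bound, consider the one-dimensional case, where the optimal map for quadratic cost is the monotone rearrangement, so pairing sorted atoms of µ_0 and µ_1 gives the optimal plan. The atoms below are made up for the sketch:

```python
import numpy as np

# Hypothetical atoms (equal weights 1/4 each); in 1D the optimal quadratic-cost
# map is the monotone rearrangement, i.e. sorted atoms are paired in order.
x = np.array([0.0, 1.0, 2.5, 4.0])   # atoms of mu_0
y = np.array([0.5, 2.0, 2.2, 7.0])   # atoms of mu_1 = T(x) with T monotone

t = 0.3
z = (1 - t) * x + t * y              # atoms of mu_t on the geodesic

# The transport map mu_t -> mu_1 sends z_i to y_i; its Lipschitz constant
# over the atoms should not exceed 1/t.
lip = max(abs(y[i] - y[j]) / abs(z[i] - z[j])
          for i in range(len(z)) for j in range(i))
print(lip, "<=", 1 / t)   # e.g. 1.93 <= 3.33
```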
In order to have that the optimal transport from the endpoint µ0 of the geodesic is given
by an optimal map, one has to assume something on the measure µ0 (as we already saw in the
Euclidean case, Theorem 1.34). One sufficient condition for existence of optimal transport
maps in nonbranching spaces is to have absolute continuity for the starting measure µ0 with
respect to a nice enough reference measure m on X. This is due to Cavalletti and Huesmann
[2], following the idea of Gigli [3].
Theorem 2.17. Let (X, d) be a proper, nonbranching metric space that supports a measure m such that for every compact set K ⊂ X there exist a measurable function f : [0, 1] → [0, 1] with lim sup_{t→0} f(t) > 1/2 and a positive δ ≤ 1 such that

m(A_{t,x}) ≥ f(t) m(A)   for all 0 ≤ t ≤ δ,   (2.1)

for all compact A ⊂ K and all x ∈ K, where A_{t,x} is defined as

A_{t,x} := e_t({γ ∈ Geo(X) : γ_0 ∈ A, γ_1 = x}).

Then for any µ_0, µ_1 ∈ P_2(X) with µ_0 ≪ m there exists a unique σ ∈ Opt(µ_0, µ_1) for the cost c(x, y) = d²(x, y), and this plan is induced by a map.
The result of Cavalletti and Huesmann holds for any cost function c(x, y) = h(d(x, y)) with h strictly convex and nondecreasing. The assumption (2.1) on the reference measure means that when we contract a set towards a point, its measure does not shrink to zero too fast. This is used to push the nonbranching of geodesics to nonbranching at time zero. In the Euclidean space R^n with the Lebesgue measure the control function is simply f(t) = (1 − t)^n, since the set A_{t,x} is a scaling of A by the factor (1 − t).
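Indeed, writing this out: in (R^n, |·|, L^n) the geodesic from a to x is s ↦ (1 − s)a + sx, so A_{t,x} = (1 − t)A + tx, and translation invariance together with the n-homogeneity of the Lebesgue measure gives

L^n(A_{t,x}) = L^n((1 − t)A + tx) = (1 − t)^n L^n(A),

so (2.1) holds with f(t) = (1 − t)^n and δ = 1, and lim sup_{t→0} f(t) = 1 > 1/2.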
A sketch of a proof of Theorem 2.17. Let µ0 , µ1 and σ ∈ Opt(µ0 , µ1 ) be fixed and let (ϕ, ϕc )
be the Kantorovich potentials associated with σ and
Γ := {(x, y) ∈ X × X : ϕ(x) + ϕc (y) = c(x, y)}
so that in particular σ(Γ) = 1.
The first step is to prove that the control (2.1) on the contractions towards points can be transferred to more general targets. In more detail, for any compact Λ ⊂ X × X, t ∈ [0, 1] and compact A ⊂ X one defines

A_{t,Λ} := e_t((e_0, e_1)^{−1}((A × X) ∩ Λ))

and

Λ̂ := (P_1(Λ) × P_2(Λ)) ∩ Γ.

Then one shows³ that for any compact Λ ⊂ Γ

m(A_{t,Λ̂}) ≥ f(t) m(A)   for all t ∈ [0, δ] and A ⊂ P_1(Λ).   (2.2)
Next one shows that for any Λ_1, Λ_2 ⊂ Γ such that

P_1(Λ_1) = P_1(Λ_2)   and   P_2(Λ_1) ∩ P_2(Λ_2) = ∅

it necessarily holds that m(P_1(Λ_1)) = m(P_1(Λ_2)) = 0. This is seen by taking A := P_1(Λ_1) = P_1(Λ_2) and observing that, since the sets A_{t,Λ̂_1} and A_{t,Λ̂_2} converge to A in the Hausdorff topology as t → 0, one has

m(A) = lim sup_{ǫ→0} m(A^ǫ) ≥ lim sup_{t→0} m(A_{t,Λ̂_1} ∪ A_{t,Λ̂_2}) = lim sup_{t→0} (m(A_{t,Λ̂_1}) + m(A_{t,Λ̂_2})) ≥ lim sup_{t→0} 2f(t) m(A) > m(A),

which is impossible unless m(A) = 0. Here A^ǫ := {x ∈ X : d(x, A) ≤ ǫ} and the second equality follows from the nonbranching assumption.
³This is the most technical part of the proof.
Assume then that σ is not given by a map. Then there exist a continuous map T : X → X and a compact set E ⊂ X with µ_0(E) > 0 such that {(x, T(x)) : x ∈ E} ⊂ Γ and such that for every x ∈ E there exists y ∈ X with y ≠ T(x) and (x, y) ∈ Γ. Since then

E = ⋃_{n=1}^∞ P_1({(x, y) ∈ Γ : x ∈ E, d(y, T(x)) ≥ 1/n})

and m(E) > 0 (recall that µ_0 ≪ m), there exists n ∈ N such that m(P_1(Λ)) > 0 with

Λ := {(x, y) ∈ Γ : x ∈ E, d(y, T(x)) ≥ 1/n}.
Since T is continuous, there exists δ > 0 such that d(T(x), T(z)) < 1/(2n) for all x, z ∈ E with d(x, z) ≤ 2δ. Let us take x̄ ∈ P_1(Λ) such that

m(P_1(Λ) ∩ B̄(x̄, δ)) > 0.

Now, defining

Λ_1 := {(x, T(x)) : x ∈ P_1(Λ) ∩ B̄(x̄, δ)}

and

Λ_2 := {(x, y) ∈ Λ : x ∈ B̄(x̄, δ)},

we have Λ_1, Λ_2 ⊂ Γ with P_1(Λ_1) = P_1(Λ_2) and m(P_1(Λ_1)) = m(P_1(Λ) ∩ B̄(x̄, δ)) > 0. On the other hand, for any y ∈ P_2(Λ_2) there is w ∈ P_1(Λ) ∩ B̄(x̄, δ) such that d(y, T(w)) ≥ 1/n. Hence for any z ∈ P_1(Λ_1) we have

d(y, T(z)) ≥ d(y, T(w)) − d(T(w), T(z)) ≥ 1/n − 1/(2n) = 1/(2n),

showing that P_2(Λ_1) ∩ P_2(Λ_2) = ∅. This contradicts the previous step. Thus σ is concentrated on the graph of a function, from which the uniqueness follows as in the previous proofs.
Proper metric measure spaces satisfying the assumption of Theorem 2.17 are necessarily locally doubling: for every x ∈ X there exist a radius R > 0 and a constant C > 0 such that

m(B(y, 2r)) ≤ C m(B(y, r))   for all y ∈ B(x, R) and 0 < r < R.
Problem 1. What properties of a metric measure space (X, d, m) imply the condition (2.1)?
In which geodesic metric spaces (X, d) can one find a reference measure m satisfying the
condition?
For some sufficient conditions, see [2, Remark 3.5].
3. Optimal transport and curvature
Optimal transportation has been used to give definitions of Ricci curvature lower bounds
in metric measure spaces. Before going into Ricci curvature, let us briefly visit Alexandrov
spaces, i.e. metric spaces with (generalized) sectional curvature bounds.
Curvature bounds in the sense of Alexandrov can be stated by comparing triangles in the space to triangles in model spaces. Let us define here only the cases where zero is a lower or an upper bound for the sectional curvature.
Definition 3.1. A geodesic space (X, d) is said to be positively curved in the sense of Alexandrov if for every γ ∈ Geo(X) and every z ∈ X we have

d²(γ_t, z) ≥ (1 − t) d²(γ_0, z) + t d²(γ_1, z) − t(1 − t) d²(γ_0, γ_1)   (3.1)

for all t ∈ [0, 1]. If the reverse inequality always holds, the space is called non positively curved (in the sense of Alexandrov).
Non positively curved spaces are also called CAT(0) spaces, or Hadamard spaces. If (X, d)
is both positively and non positively curved, it is a convex subset of a Hilbert space.
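In a Hilbert space, (3.1) in fact holds with equality: for the segment γ_t = (1 − t)γ_0 + tγ_1 and any z, expanding |γ_t − z|² = |(1 − t)(γ_0 − z) + t(γ_1 − z)|² and using |γ_0 − γ_1|² = |γ_0 − z|² + |γ_1 − z|² − 2⟨γ_0 − z, γ_1 − z⟩ gives

|γ_t − z|² = (1 − t)|γ_0 − z|² + t|γ_1 − z|² − t(1 − t)|γ_0 − γ_1|²,

which explains why equality in both curvature bounds forces Hilbertian behaviour.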
Theorem 3.2. Assume (X, d) is positively curved. Then (P_2(X), W_2) is positively curved.

Proof. Let (µ_t) ∈ Geo(P_2(X)) and let π ∈ P(Geo(X)) be such that

µ_t = (e_t)_♯π   for all t ∈ [0, 1].

Fix t ∈ [0, 1] and σ ∈ Opt(µ_t, ν). Gluing π and σ together at µ_t (by first disintegrating both with respect to µ_t, taking the products of the disintegrated measures, and then integrating) we obtain α ∈ P(Geo(X) × X) such that

P^1_♯α = π   and   (e_t, P^2)_♯α = σ,

with P^1(γ, x) = γ, P^2(γ, x) = x and e_t(γ, x) = γ_t. Since

(e_0, P^2)_♯α ∈ A(µ_0, ν)   and   (e_1, P^2)_♯α ∈ A(µ_1, ν),

we get

W_2²(µ_t, ν) = ∫ d²(γ_t, x) dα(γ, x) ≥ ∫ ((1 − t) d²(γ_0, x) + t d²(γ_1, x) − t(1 − t) d²(γ_0, γ_1)) dα(γ, x)
= (1 − t) ∫ d²(γ_0, x) dα(γ, x) + t ∫ d²(γ_1, x) dα(γ, x) − t(1 − t) ∫ d²(γ_0, γ_1) dα(γ, x)
≥ (1 − t) W_2²(µ_0, ν) + t W_2²(µ_1, ν) − t(1 − t) W_2²(µ_0, µ_1),

giving the claim.
Let us next observe that upper bounds on the (sectional) curvature of the space (X, d) do
not imply curvature bounds on the space (P2 (X), W2 ).
Example 3.3. Let X = R² with d the Euclidean distance. Then (X, d) is non positively (and positively) curved. Still (P_2(R²), W_2) is not non positively curved. To see this, define

µ_0 := (1/2)(δ_{(1,1)} + δ_{(5,3)}),   µ_1 := (1/2)(δ_{(−1,1)} + δ_{(−5,3)}),   ν := (1/2)(δ_{(0,0)} + δ_{(0,−4)}).

Then

A(µ_0, µ_1) = {aσ_1 + (1 − a)σ_2 : a ∈ [0, 1]}

with

σ_1 = (1/2)(δ_{((1,1),(−1,1))} + δ_{((5,3),(−5,3))})

and

σ_2 = (1/2)(δ_{((1,1),(−5,3))} + δ_{((5,3),(−1,1))}).

Since

∫ d²(x, y) d(aσ_1 + (1 − a)σ_2)(x, y) = a · (1/2)(2² + 10²) + (1 − a) · (1/2)(6² + 2² + 6² + 2²) = a · 52 + (1 − a) · 40,

we have W_2²(µ_0, µ_1) = 40. Similarly one can calculate that W_2²(µ_0, ν) = W_2²(µ_1, ν) = 30. From the above computation we also see that the unique geodesic (µ_t) connecting µ_0 to µ_1 is

µ_t = (1/2)(δ_{(1−6t,1+2t)} + δ_{(5−6t,3−2t)}).

But now µ_{1/2} = (1/2)(δ_{(−2,2)} + δ_{(2,2)}), and both couplings between µ_{1/2} and ν have the same cost, so

W_2²(µ_{1/2}, ν) = (1/2)(8 + 40) = 24 > 30/2 + 30/2 − 40/4 = 20,

whereas non positive curvature would require W_2²(µ_{1/2}, ν) ≤ 20. Hence (P_2(R²), W_2) is not non positively curved.
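These numbers are quick to verify numerically. A minimal sketch (the helper name is mine): for two-point measures with weights 1/2 the cost is affine in the mixing parameter a, so it suffices to compare the two pure matchings:

```python
import itertools
import numpy as np

def w2_squared(xs, ys):
    """W_2^2 between uniform measures on the two points in xs and in ys.

    For 2-point measures with weights 1/2 the admissible couplings form a
    segment between the two matchings and the cost is affine along it, so
    the optimum is attained at one of the two pure matchings.
    """
    cost = lambda perm: 0.5 * sum(np.sum((np.array(xs[i]) - np.array(ys[j])) ** 2)
                                  for i, j in enumerate(perm))
    return min(cost(p) for p in itertools.permutations(range(2)))

mu0 = [(1, 1), (5, 3)]
mu1 = [(-1, 1), (-5, 3)]
nu  = [(0, 0), (0, -4)]
mid = [(-2, 2), (2, 2)]          # mu_{1/2} on the geodesic from mu0 to mu1

print(w2_squared(mu0, mu1))      # 40.0
print(w2_squared(mu0, nu), w2_squared(mu1, nu))   # 30.0 30.0
print(w2_squared(mid, nu))       # 24.0 > 30/2 + 30/2 - 40/4 = 20
```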
3.1. Ricci curvature lower bounds. Let us then turn to Ricci curvature lower bounds. Let R denote the Riemann curvature tensor on a Riemannian manifold M, and let x ∈ M and u, v ∈ T_xM. The Ricci curvature Ric(u, v) ∈ R is defined as

Ric(u, v) := Σ_i ⟨R(u, e_i)v, e_i⟩,

where (e_i) is an orthonormal basis of T_xM. The manifold M is said to have Ricci curvature bounded below by λ ∈ R if

Ric(u, u) ≥ λ|u|²   for every x ∈ M and u ∈ T_xM.
One geometric interpretation of Ricci curvature is in terms of infinitesimal volume comparison. Let x ∈ M and let B ⊂ T_xM be a small neighbourhood of the origin. Since exp_x : B → M is injective and smooth, the density

ρ = d(exp_x)_♯L^d / dVol,

where L^d is the Lebesgue measure on the tangent space T_xM and Vol is the volume measure on M, is also smooth. For u ∈ B this density has the Taylor expansion
ρ(exp_x(u)) = 1 + (1/6) Ric(u, u) + o(|u|²).
Hence, where sectional curvature bounds tell us something about the tendency of geodesics to converge (in positive curvature) or diverge (in negative curvature), Ricci curvature bounds give us information on how volume elements shrink or grow. This effect of Ricci curvature can also be studied using optimal mass transport. Notice that in the above interpretation of Ricci curvature we compared to the volume measure of the manifold. Unlike in the definition of (non) positively curved spaces, such a reference measure plays a crucial role. We will usually write the reference measure as m.
Next we will take a look at how Ricci curvature lower bounds can be formulated using
optimal transport. Before this, let us list properties we would like to have for the formulation.
It should at least:
(1) agree with the Ricci curvature lower bound on Riemannian manifolds,
(2) be stable under suitable convergence,
(3) make sense on as general setting as possible, and
(4) imply useful analytic and geometric properties.
The point (1) is more or less obvious. The requirement for stability is motivated for example
by the fact that the set of Riemannian manifolds with uniform Ricci curvature lower bound,
and dimension and diameter upper bound is precompact in a natural topology, the measured
Gromov-Hausdorff topology. We will soon look at this topology.
The points (3) and (4) fight against each other. It turns out that the more properties we
require, the more we have to restrict the definitions. Let us list some of the properties one
could ask for:
• For the classical “analysis on metric spaces”: the doubling property (at least locally) for the reference measure m, meaning that there exists a constant C such that

m(B(x, 2r)) ≤ C m(B(x, r))   for every x ∈ X and r > 0,

and a Poincaré inequality (at least locally), meaning that there exist constants C and λ ≥ 1 such that for any Lipschitz function f : X → R and any ball B = B(x, r) ⊂ X we have

(1/m(B)) ∫_B |f(x) − ⟨f⟩_B| dm(x) ≤ Cr (1/m(λB)) ∫_{λB} |∇f(x)| dm(x),

where

⟨f⟩_B := (1/m(B)) ∫_B f(x) dm(x).
• Locality: if the Ricci curvature lower bounds hold locally, they should hold globally (after all, the Ricci curvature is an infinitesimal notion in the classical setting).
• Restriction: if we take a geodesically convex subset, it should also have the same
lower bound as the initial space.
• Tensorization: if X and Y have a Ricci curvature lower bound, then X × Y should have one as well.
• Variants of the classical volume comparisons: the Bishop-Gromov volume comparison, and the Brunn-Minkowski inequality. (Both will be defined later.)
• Estimates on the size of the cut-locus and possible branching of geodesics.
• Functional inequalities: Sobolev inequality (recall Theorem 1.37), log-Sobolev
inequality, HWI-inequality.
• Geometric rigidity results: splitting theorem, maximal diameter theorem, . . .
• ...
Let us then look at the different topologies used on the collection of metric measure spaces
{(X, d, m)}.
Definition 3.4. Let (Z, D) be a metric space and X, Y ⊂ Z. The Hausdorff distance between X and Y is

d_H(X, Y) := max{ sup_{x∈X} dist(x, Y), sup_{y∈Y} dist(y, X) }.

This notion can be generalized to metric spaces (X, d_X) and (Y, d_Y) by defining the Gromov-Hausdorff distance between X and Y as

d_GH(X, Y) := inf d_H(f(X), g(Y)),

where the infimum is taken over all metric spaces Z and all isometric embeddings f : X → Z and g : Y → Z.
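For finite subsets of a common metric space the Hausdorff distance is a direct translation of the definition; here is a minimal sketch (the function name and the point sets are made up):

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance between finite subsets of R^n (rows are points)."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances
    # sup_x dist(x, Y) and sup_y dist(y, X), then the maximum of the two
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.25], [1.0, 0.0]])
print(hausdorff(X, Y))  # 0.25
```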
For non-compact spaces it often makes more sense to consider pointed Gromov-Hausdorff convergence. A sequence (X_i, d_i, p_i)_{i=1}^∞ of pointed metric spaces (i.e. metric spaces (X_i, d_i) with chosen points p_i ∈ X_i) converges in the pointed Gromov-Hausdorff sense to a pointed metric space (X_∞, d_∞, p_∞) if there exist a metric space (Z, d_Z) and isometric embeddings f_i : X_i → Z for i ∈ N ∪ {∞} such that for every ǫ > 0 and R > 0 there exists i_0 ∈ N such that for every i > i_0

f_∞(B(p_∞, R)) ⊂ (f_i(B(p_i, R)))^ǫ   and   f_i(B(p_i, R)) ⊂ (f_∞(B(p_∞, R)))^ǫ.

Furthermore, if the spaces (X_i, d_i) are equipped with measures m_i, we may consider pointed measured Gromov-Hausdorff convergence, where in addition to pointed Gromov-Hausdorff convergence we require, with the above notation, that (f_i)_♯m_i converges weakly* to (f_∞)_♯m_∞ (or sometimes the convergence is required to be weak, i.e. narrow).
Let us also give another notion of convergence for metric measure spaces, the D-distance, introduced by Sturm in [4]. From now on we will assume all reference measures to be probability measures with finite second moment. In order to define the D-distance we need the following definition.
Definition 3.5. Given two metric measure spaces (X, d_X, m_X) and (Y, d_Y, m_Y) (with m_X ∈ P_2(X) and m_Y ∈ P_2(Y)), we consider the product space (X × Y, D_XY), where

D_XY((x_1, y_1), (x_2, y_2)) := (d_X²(x_1, x_2) + d_Y²(y_1, y_2))^{1/2}.

A couple (d, σ) is called an admissible coupling between (X, d_X, m_X) and (Y, d_Y, m_Y) if
• d is a pseudo distance on spt(m_X) ⊔ spt(m_Y) such that when restricted to spt(m_X) × spt(m_X) it equals d_X and when restricted to spt(m_Y) × spt(m_Y) it equals d_Y. (Recall that a pseudo distance is the same as a distance except that d(x, y) = 0 does not necessarily imply x = y.)
• σ ∈ P(spt(m_X) × spt(m_Y)) is such that (P^1)_♯σ = m_X and (P^2)_♯σ = m_Y. Here the Borel structure on spt(m_X) × spt(m_Y) is the one given by the distance D_XY.
We write A((X, d_X, m_X), (Y, d_Y, m_Y)) for the set of admissible couplings.
With the notion of couplings of metric measure spaces we can define the D-distance similarly to the W_2-distance.
Definition 3.6. The distance D between metric measure spaces (X, d_X, m_X) and (Y, d_Y, m_Y) is defined as

D((X, d_X, m_X), (Y, d_Y, m_Y)) := inf_{(d,σ)} C(d, σ),   (3.2)

where the infimum is over all (d, σ) ∈ A((X, d_X, m_X), (Y, d_Y, m_Y)) and the cost C(d, σ) is defined as

C(d, σ) := ( ∫_{spt(m_X)×spt(m_Y)} d²(x, y) dσ(x, y) )^{1/2}.
Notice that where the measured (pointed) Gromov-Hausdorff convergence requires both
the convergence of the spaces and the measures, the D-distance only cares about the supports
of the measures mX and mY . For example D((X, dX , δx ), (Y, dY , δy )) = 0 regardless of what
the spaces X and Y are.
Remark 3.7. In the definition of the D-distance it is enough to consider couplings (d, σ) where d is a real distance. Indeed, given (d, σ) ∈ A((X, d_X, m_X), (Y, d_Y, m_Y)) and ǫ > 0 we can consider

d_ǫ(z_1, z_2) := d(z_1, z_2), if (z_1, z_2) ∈ (X × X) ∪ (Y × Y),   and   d_ǫ(z_1, z_2) := d(z_1, z_2) + ǫ, if (z_1, z_2) ∈ (X × Y) ∪ (Y × X).
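To complete the remark: d_ǫ is a genuine distance, since d_ǫ(x, y) ≥ ǫ > 0 for x ∈ X and y ∈ Y, and adding the same ǫ to all cross terms preserves the triangle inequality. Moreover, by the Minkowski inequality,

C(d_ǫ, σ) = ( ∫ (d(x, y) + ǫ)² dσ(x, y) )^{1/2} ≤ C(d, σ) + ǫ,

so letting ǫ → 0 the infimum in (3.2) is unchanged if we restrict to real distances.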
Proposition 3.8. There always exists an optimal coupling of metric measure spaces, i.e. an admissible coupling realizing the infimum in (3.2).

Proof. As for the existence of minimizers for the Kantorovich problem, we need to show that the cost C is suitably lower semicontinuous and that A((X, d_X, m_X), (Y, d_Y, m_Y)) has suitable compactness properties. The weak compactness of the set of measures {σ : (d, σ) ∈ A((X, d_X, m_X), (Y, d_Y, m_Y))} is clear, since these measures are tight by the tightness of {m_X} and {m_Y}. For the pseudo distances we will only use the fact that along a minimizing sequence the distances between fixed points stay bounded. Let (X, d_X, m_X), (Y, d_Y, m_Y) be metric measure spaces as in the previous definition and suppose that (d_i, σ_i)_{i=1}^∞ ⊂ A((X, d_X, m_X), (Y, d_Y, m_Y)) is a sequence such that

C(d_i, σ_i) → D((X, d_X, m_X), (Y, d_Y, m_Y)).

Without loss of generality we may assume that the σ_i converge weakly to some σ. In order to obtain a converging subsequence of (d_i) we take a dense subset (x_j, y_j)_{j=1}^∞ ⊂ spt(m_X) × spt(m_Y). If there were a subsequence of (d_i) with d_i(x_j, y_j) → ∞ for some j, then C(d_i, σ_i) → ∞. Thus the sequences (d_i(x_j, y_j))_i stay bounded, and we can take a converging subsequence for d_i(x_1, y_1), a converging subsequence of this for d_i(x_2, y_2), and so on, and finally the diagonal sequence, which converges for all (x_j, y_j). By continuity this defines a distance d on the whole space spt(m_X) ⊔ spt(m_Y). It then follows that

C(d, σ) = D((X, d_X, m_X), (Y, d_Y, m_Y)).
We denote the set of optimal couplings (i.e. the ones realizing the infimum in (3.2)) between
metric measure spaces (X, dX , mX ) and (Y, dY , mY ) by Opt((X, dX , mX ), (Y, dY , mY )).
Definition 3.9. We call two metric measure spaces (X, dX , mX ) and (Y, dY , mY ) isomorphic
if there exists a bijective isometry f : spt(mX ) → spt(mY ) such that f♯ mX = mY .
We denote by X the set of isomorphism classes of complete and separable metric measure
spaces (X, d, m), with m ∈ P2 (X).
Proposition 3.10. (X, D) is a metric space.
Proof. Let us start by showing that D is a distance. First of all, it is clearly symmetric and
for isomorphic spaces (X, dX , mX ) and (Y, dY , mY ) we have D((X, dX , mX ), (Y, dY , mY )) = 0.
If we are able to show the triangle inequality, we get that D is always finite by going via a
Dirac mass and recalling that m ∈ P2 (X) for all the spaces in X.
In order to see the triangle inequality, take (Xi , di , mi ) ∈ X, i = 1, 2, 3. We may assume
that Xi = spt(mi ) for i = 1, 2, 3. For every ǫ > 0 there exist
(dij , σij ) ∈ A((Xi , di , mi ), (Xj , dj , mj ))
with (i, j) = (1, 2), (2, 3) such that
C(dij , σij ) ≤ D((Xi , di , mi ), (Xj , dj , mj )) + ǫ.
Moreover, we may assume the d_ij to be distances. Let us then define the distance d_123 on X_1 ⊔ X_2 ⊔ X_3 by setting

d_123(x, y) := d_12(x, y), if x, y ∈ X_1 ⊔ X_2,
d_123(x, y) := d_23(x, y), if x, y ∈ X_2 ⊔ X_3,
d_123(x, y) := inf_{z∈X_2} (d_12(x, z) + d_23(z, y)), if x ∈ X_1, y ∈ X_3,
d_123(x, y) := inf_{z∈X_2} (d_23(x, z) + d_12(z, y)), if x ∈ X_3, y ∈ X_1.

This gives a good competitor (pseudo) distance for computing D((X_1, d_1, m_1), (X_3, d_3, m_3)). We still need to find a good σ to accompany it; this is given by gluing σ_12 and σ_23 at m_2. Another way to conclude the triangle inequality, without gluing, is to consider the space (P_2(X_1 ⊔ X_2 ⊔ X_3), W_2) with the distance d_123. There we get

D((X_1, d_1, m_1), (X_3, d_3, m_3)) = inf_{(d,σ)} C(d, σ) ≤ W_2(m_1, m_3) ≤ W_2(m_1, m_2) + W_2(m_2, m_3)
≤ C(d_12, σ_12) + C(d_23, σ_23)
≤ D((X_1, d_1, m_1), (X_2, d_2, m_2)) + D((X_2, d_2, m_2), (X_3, d_3, m_3)) + 2ǫ.
In order to show that D is a distance, we still need to verify that D((X, d_X, m_X), (Y, d_Y, m_Y)) = 0 implies that X and Y are isomorphic. Let (d, σ) be an optimal coupling of (X, d_X, m_X) and (Y, d_Y, m_Y). Then

C(d, σ) = ( ∫_{spt(m_X)×spt(m_Y)} d²(x, y) dσ(x, y) )^{1/2} = 0.

Thus d(x, y) = 0 for σ-almost every (x, y) ∈ spt(m_X) × spt(m_Y). Since d is a pseudo distance and a distance when restricted to spt(m_Y), for every x ∈ spt(m_X) there exists a unique f(x) := y ∈ spt(m_Y) such that d(x, y) = 0. By the triangle inequality for d, the map f is an isometry. Now σ = (id, f)_♯m_X, so in particular m_Y = f_♯m_X. Thus X and Y are isomorphic.
Proposition 3.11. (X, D) is complete and separable.

Proof. Let us start with completeness. Let (X_i, d_i, m_i)_{i=1}^∞ be a Cauchy sequence in (X, D). Take a subsequence such that

D((X_{i_k}, d_{i_k}, m_{i_k}), (X_{i_{k+1}}, d_{i_{k+1}}, m_{i_{k+1}})) ≤ 2^{−k−1}   for all k ∈ N.

There exist (d̂_k, σ_k) ∈ A((X_{i_k}, d_{i_k}, m_{i_k}), (X_{i_{k+1}}, d_{i_{k+1}}, m_{i_{k+1}})) with d̂_k a distance such that

C(d̂_k, σ_k) ≤ 2^{−k}.

Now we recursively attach the spaces to each other by defining (X′_1, d′_1) := (X_{i_1}, d_{i_1}) and

X′_k := (X_{i_k} ⊔ X′_{k−1}) / ∼   with x ∼ y if d′_k(x, y) = 0,

where

d′_k(x, y) := d′_{k−1}(x, y), if x, y ∈ X′_{k−1},
d′_k(x, y) := d̂_{k−1}(x, y), if x, y ∈ X_{i_{k−1}} ⊔ X_{i_k},
d′_k(x, y) := inf_{z∈X_{i_{k−1}}} (d′_{k−1}(x, z) + d̂_{k−1}(z, y)), if x ∈ X′_{k−1}, y ∈ X_{i_{k−1}} ⊔ X_{i_k}.

This way we have a sequence of nested metric spaces (X′_k, d′_k) with X_{i_k} ⊂ X′_k ⊂ X′_{k+1}. We define X′ := ⋃_{k=1}^∞ X′_k and d′ := lim_{k→∞} d′_k, a distance on X′. Let (X, d) be the completion of (X′, d′). Now all the measures m_{i_k} can be naturally embedded into (X, d), and for them we have (in the W_2-distance defined from d)

W_2(m_{i_k}, m_{i_{k+1}}) ≤ C(d̂_k, σ_k) ≤ 2^{−k}.

Thus (m_{i_k})_k is a Cauchy sequence in (P_2(X), W_2). Since this space is complete, there exists a measure m ∈ P_2(X) such that

D((X_{i_k}, d_{i_k}, m_{i_k}), (X, d, m)) ≤ W_2(m_{i_k}, m) → 0.

Thus (X, D) is complete.
The space (X, D) is separable since

X_disc := { (X, d, m) ∈ X : X = {x_1, . . . , x_n}, d(x_i, x_j) ∈ Q, m = Σ_i a_i δ_{x_i}, a_i ∈ Q, n ∈ N }

is dense in (X, D).
Since we will often consider separately the absolutely continuous measures in a given metric measure space (X, d, m), we denote

P^a(X) := {µ ∈ P(X) : µ ≪ m}   and   P_p^a(X) := {µ ∈ P_p(X) : µ ≪ m}.

(Recall that µ_1 is absolutely continuous with respect to µ_2, denoted µ_1 ≪ µ_2, if µ_2(A) = 0 ⇒ µ_1(A) = 0 for all Borel A ⊂ X, and that µ_1 and µ_2 are singular with respect to each other, denoted µ_1 ⊥ µ_2, if there exists a set S ∈ B(X) such that µ_1(S) = µ_2(X \ S) = 0.)
Given a coupling (d, σ) between metric measure spaces (X, d_X, m_X) and (Y, d_Y, m_Y), we can associate with it a map σ_♯ : P_2^a(X) → P_2^a(Y) defined by

µ = ρm_X ↦ σ_♯µ := ηm_Y,   where η(y) := ∫ ρ(x) dσ_y(x),

with {σ_y} the disintegration of σ with respect to the projection on Y. Similarly we can define σ_♯^{−1} : P_2^a(Y) → P_2^a(X).
Now that we have introduced a suitable topology (or two) for the convergence of metric measure spaces, let us return to Ricci curvature lower bounds. We would like the definitions of Ricci curvature lower bounds to be stable under convergence in the D-distance. The stability will be obtained by defining the bounds using inequalities for semicontinuous functionals on the space of probability measures.
Let us next introduce the functionals we will consider. Let u : [0, ∞) → R be convex and continuous with u(0) = 0, and define

u′(∞) := lim_{z→∞} u(z)/z.

With these we can define a functional E : P(X) × P(X) → R ∪ {+∞} by setting

E(µ|ν) := ∫ u(ρ) dν + u′(∞) µ^s(X),

where µ = ρν + µ^s is the decomposition of µ into the absolutely continuous part ρν with respect to ν and the singular part µ^s with respect to ν.
The most important functionals for our purpose are the entropy functionals E_N for N ∈ (1, ∞]. They are given by the functions u_N, which are defined as

u_N(z) := N(z − z^{1−1/N})

when N < ∞, and as

u_∞(z) := z log(z).
Notice that for N < ∞

u′_N(∞) = lim_{z→+∞} N(z − z^{1−1/N})/z = N

and

u′_∞(∞) = lim_{z→+∞} (z log(z))/z = +∞.

Thus for N < ∞ we have (for µ = ρm + µ^s ∈ P(X))

E_N(µ|m) = N ∫ (ρ − ρ^{1−1/N}) dm + N µ^s(X) = N − N ∫ ρ^{−1/N} dµ

and

E_∞(µ|m) = ∫ ρ log(ρ) dm, if µ^s(X) = 0,   and   E_∞(µ|m) = +∞, if µ^s(X) > 0.

In the following we always assume that E is given by some continuous and convex function u with u(0) = 0.
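On a finite space these functionals are elementary sums; here is a minimal numeric sketch (the three-point space and the density values are made up) computing E_∞ and E_2:

```python
import numpy as np

m   = np.array([0.5, 0.3, 0.2])   # reference probability measure on 3 points
rho = np.array([1.2, 1.0, 0.5])   # density of mu w.r.t. m (no singular part)
mu  = rho * m                     # mu is a probability measure: mu.sum() == 1

# E_infinity(mu|m) = integral of rho*log(rho) dm  (the relative entropy)
E_inf = np.sum(m * rho * np.log(rho))

# E_N(mu|m) = N - N * integral of rho^(-1/N) dmu, here with N = 2
N = 2.0
E_N = N - N * np.sum(mu * rho ** (-1.0 / N))

print(E_inf, E_N)   # both are >= 0 and vanish iff mu == m
```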
Lemma 3.12. Let (X, d_X, m_X), (Y, d_Y, m_Y) ∈ X and (d, σ) ∈ A((X, d_X, m_X), (Y, d_Y, m_Y)). Then

E(σ_♯µ|m_Y) ≤ E(µ|m_X)   for all µ ∈ P_2^a(X),

and

E(σ_♯^{−1}ν|m_X) ≤ E(ν|m_Y)   for all ν ∈ P_2^a(Y).

Proof. Let us only prove the first inequality. Let µ = ρm_X and σ_♯µ = ηm_Y. By Jensen's inequality we get

E(σ_♯µ|m_Y) = ∫ u(η(y)) dm_Y(y) = ∫ u( ∫ ρ(x) dσ_y(x) ) dm_Y(y)
≤ ∫∫ u(ρ(x)) dσ_y(x) dm_Y(y) = ∫ u(ρ(x)) dσ(x, y)
= ∫ u(ρ(x)) dm_X(x) = E(µ|m_X).
Lemma 3.13. The functional E is weakly lower semicontinuous with respect to both variables. In other words, for µ_n → µ and ν_n → ν weakly we have

E(µ|ν) ≤ lim inf_{n→∞} E(µ_n|ν_n).
Let us prove Lemma 3.13 only in the case where (X, d) is compact. We will use the Legendre transform to represent the functional.
Definition 3.14. Let u : [0, ∞) → R be continuous and convex with u(0) = 0. The Legendre transform of u is defined on R as

u*(r) := sup_{s∈[0,∞)} (rs − u(s)).
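For example, for u_∞(s) = s log(s) one computes, differentiating s ↦ rs − s log(s) and solving r = log(s) + 1,

u*_∞(r) = sup_{s∈[0,∞)} (rs − s log(s)) = e^{r−1},

so in the dual representation below E_∞(·|ν) is a supremum of functionals of the form µ ↦ ∫ϕ dµ − ∫e^{ϕ−1} dν.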
Proposition 3.15. Let X be compact and let u : [0, ∞) → R be continuous and convex with u(0) = 0. Then

E(µ|ν) = sup{ ∫_X ϕ dµ − ∫_X u*(ϕ) dν : ϕ ∈ C(X), u′(1/M) ≤ ϕ ≤ u′(M), M ∈ N }.
Proof of Lemma 3.13. The function u* is continuous on [u′(1/M), u′(M)], so for a continuous ϕ with values in [u′(1/M), u′(M)] also u*(ϕ) is continuous. Thus by Proposition 3.15 the functional E is the supremum of the continuous functionals

(µ, ν) ↦ ∫_X ϕ dµ − ∫_X u*(ϕ) dν,

and thus lower semicontinuous.
Let us finally state a suitable Γ-convergence result for the functionals with respect to the D-distance.
Theorem 3.16. Suppose that lim_{n→∞} D((X_n, d_n, m_n), (X, d, m)) = 0 for spaces X_n, X ∈ X, and let (d_n, σ_n) ∈ Opt((X_n, d_n, m_n), (X, d, m)). Then
(i) for any sequence (µ_n)_{n=1}^∞ with µ_n ∈ P_2^a(X_n) such that (σ_n)_♯µ_n converges weakly to some µ ∈ P(X), it holds that

lim inf_{n→∞} E(µ_n|m_n) ≥ E(µ|m);

(ii) for any µ ∈ P_2^a(X) with bounded density there exists a sequence (µ_n)_{n=1}^∞ with µ_n ∈ P_2^a(X_n) such that W_2((σ_n)_♯µ_n, µ) → 0 and

lim sup_{n→∞} E(µ_n|m_n) ≤ E(µ|m).
References
[1] V. I. Bogachev, Measure Theory, vol. 2, Springer.
[2] F. Cavalletti and M. Huesmann, Existence and uniqueness of optimal transport maps, to appear in Ann. Inst. H. Poincaré Anal. Non Linéaire, arXiv:1301.1782.
[3] N. Gigli, Optimal maps in non branching spaces with Ricci curvature bounded from below, Geom. Funct. Anal. 22 (2012), 990–999.
[4] K.-T. Sturm, On the geometry of metric measure spaces I, Acta Math. 196 (2006), 65–131.