Linear Coding, Applications and Supremus Typicality
SHENG HUANG
Doctoral Thesis in Electrical Engineering
Stockholm, Sweden 2015
TRITA-EE 2015:008
ISSN 1653-5146
ISBN 978-91-7595-462-2
KTH, School of Electrical Engineering
Communication Theory Department
SE-100 44 Stockholm
SWEDEN
Academic dissertation which, with the permission of KTH Royal Institute of Technology (Kungl Tekniska högskolan), is submitted for public examination for the degree of Doctor of Technology in Electrical and Systems Engineering on Friday 20 March 2015 at 13:15 in lecture hall Q2, Osquldas väg 10, Stockholm.
© 2014 Sheng Huang, unless otherwise noted.
Printed by Universitetsservice US AB
Sammanfattning
This work first presents a coding theorem on linear coding over finite rings for encoding correlated discrete memoryless sources. This theorem includes, as special cases, the corresponding achievability theorems of Elias and Csiszár on linear coding over finite fields. Moreover, it is shown that, for every set of finite correlated discrete memoryless sources, there always exists a sequence of linear encoders over certain finite non-field rings that achieves the data compression limit determined by the Slepian–Wolf region. We thereby close the problem of linear coding over finite non-field rings for i.i.d. data compression with a positive confirmation regarding existence.
We also study the encoding of functions, where the decoder is interested in reconstructing a discrete mapping of data generated by several correlated i.i.d. sources and encoded individually. We propose linear coding over finite rings as an alternative solution to this problem. We show that linear coding over finite rings performs better than its finite-field counterpart, as well as Slepian–Wolf coding, in terms of achieving better coding rates for encoding several discrete functions.
In order to generalise the above achievability theorems, regarding both the data compression and the function encoding problems, to Markov sources (homogeneous irreducible Markov sources), we introduce a new concept for classifying typical sequences, termed Supremus typical sequences. The asymptotic equipartition property and a generalised version of the typicality lemma for Supremus typical sequences are proved. Compared with traditional (strong and weak) typicality, Supremus typicality allows us to derive more accessible tools and results, which let us prove that linear coding over rings is superior to other methods. In contrast, arguments based on the traditional version either fail to reach similar results, or the derived results are difficult to analyse owing to a challenging evaluation of entropy rates.
To further investigate the fundamental difference between traditional typicality and Supremus typicality, and moreover to make our results even more generally applicable, we also consider asymptotically mean stationary ergodic sources. Our results show that an induced transformation with respect to a set of finite measure on a recurrent asymptotically mean stationary dynamical system with a sigma-finite measure is asymptotically mean stationary. Consequently, the Shannon–McMillan–Breiman theorem, as well as the Shannon–McMillan theorem, holds for all reduced processes derived from recurrent asymptotically mean stationary random processes. Thus we see that the traditional typicality concept only realises the Shannon–McMillan–Breiman theorem in a global sense, whereas Supremus typicality makes the result hold simultaneously for all derived reduced sequences as well.
Abstract
This work first presents a coding theorem on linear coding over finite rings for
encoding correlated discrete memoryless sources. This theorem covers corresponding achievability theorems from Elias and Csiszár on linear coding over finite fields
as special cases. In addition, it is shown that, for any set of finite correlated discrete
memoryless sources, there always exists a sequence of linear encoders over some finite non-field rings which achieves the data compression limit, the Slepian–Wolf
region. Hence, the optimality problem regarding linear coding over finite non-field
rings for i.i.d. data compression is closed with positive confirmation with respect
to existence.
We also address the function encoding problem, where the decoder is interested
in recovering a discrete function of the data generated and independently encoded
by several correlated i.i.d. sources. We propose linear coding over finite rings as an
alternative solution to this problem. It is demonstrated that linear coding over finite
rings strictly outperforms its field counterpart, as well as the Slepian–Wolf scheme,
in terms of achieving better coding rates for encoding many discrete functions.
In order to generalise the above achievability theorems, on both the data compression and the function encoding problems, to the Markovian settings (homogeneous irreducible Markov sources), a new concept of typicality for sequences, termed
Supremus typical sequences, is introduced. The Asymptotic Equipartition Property and a generalised typicality lemma of Supremus typical sequences are proved.
Compared to traditional (strong and weak) typicality, Supremus typicality allows
us to derive more accessible tools and results, based on which it is once again proved
that linear technique over rings is superior to others. In contrast, corresponding arguments based on the traditional versions either fail to draw similar conclusions or
the derived results are often hard to analyse because it is complicated to evaluate
entropy rates.
To further investigate the fundamental difference between traditional typicality and Supremus typicality and to bring our results to a more universal setting,
asymptotically mean stationary ergodic sources, we look into the ergodic properties featured in these two concepts. Our studies prove that an induced transformation with respect to a finite measure set of a recurrent asymptotically mean
stationary dynamical system with a sigma-finite measure is asymptotically mean
stationary. Consequently, the Shannon–McMillan–Breiman Theorem, as well as the
Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any
finite-state recurrent asymptotically mean stationary random process. From this,
we see that the traditional typicality concept only realises the Shannon–McMillan–Breiman Theorem for the global sequence, while Supremus typicality engraves the simultaneous effect claimed in the previous statement into all reduced sequences as well.
Acknowledgments
I want to express my deepest gratitude to my supervisor, Professor Mikael
Skoglund, for accepting me to work in his research group. This not only led to
this thesis but also made a great impact on my future career. Mikael is extremely
kind to allow me to pursue my research interests. His helpful comments and suggestions have influenced my work in many ways. I will always remember the time
working with him.
I wish to thank all my friends and colleagues from the Communication Theory
Department for creating such a wonderful working environment. They are always
very supportive. The last few years of academic life would certainly have been less enjoyable without them.
I am also very grateful to Farshad Naghibi and Hady Ghaouch for proofreading
this thesis.
Sheng Huang
Stockholm, February 2015
Contents

Sammanfattning
Abstract
Acknowledgments
Contents

0 Introduction
  0.1 Motivations
  0.2 Outline and Contributions
  0.3 Copyright Notice
  0.4 Notation

1 Preliminaries: Finite Rings and Polynomial Functions
  1.1 Finite Rings
  1.2 Polynomial Functions

2 Linear Coding
  2.1 Linear Coding over Finite Rings
  2.2 Proof of the Achievability Theorems
  2.3 Optimality
  2.A Appendix

3 Encoding Functions of Correlated Sources
  3.1 A Polynomial Approach
  3.2 Source Coding for Computing
  3.3 Non-field Rings versus Fields I
  3.A Appendix

4 Stochastic Complements and Supremus Typicality
  4.1 Markov Chains and Stochastic Complements
  4.2 Supremus Typical Sequences
  4.A Appendix

5 Irreducible Markov Sources
  5.1 Linear Coding over Finite Rings for Irreducible Markov Sources
  5.2 Source Coding for Computing Markovian Functions
  5.3 Non-field Rings versus Fields II
  5.A Appendix

6 Extended Shannon–McMillan–Breiman Theorem
  6.1 Asymptotically Mean Stationary Dynamical Systems and Random Processes
  6.2 Induced Transformations of A.M.S. Systems
  6.3 Extended Shannon–McMillan–Breiman Theorem
  6.A Appendix

7 Asymptotically Mean Stationary Ergodic Sources
  7.1 Supremus Typicality in the Weak Sense
  7.2 Hyper Supremus Typicality in the Weak Sense
  7.3 Linear Coding over Finite Rings for A.M.S. Sources

8 Conclusion
  8.1 Summary
  8.2 Future Research Directions

Bibliography
Chapter 0
Introduction
0.1 Motivations
This thesis resulted from attempting to prove the conjecture:
Linear encoders over finite rings are optimal
for Slepian–Wolf data compression.
This problem is interesting for several reasons.
Reason one: it is “intrinsically interesting.” In 1955, Elias [Eli55] (cf. [Gal68])
introduced a binary linear coding scheme which compresses binary sources up to
their Shannon limits, also known as the Slepian–Wolf limits [SW73]. Csiszár [Csi82]
then showed that linear encoders over finite fields are optimal for all Slepian–Wolf
data compression scenarios. This settles the previous problem for the special case
when all rings considered are fields. Unfortunately, the general case of linear coding
over finite rings is left open. In fact, neither Elias' nor Csiszár's argument yields an optimality conclusion when applied to the non-field-ring scenario.
In this work, we will show that linear encoders over some classes of finite rings
can be equally optimal for Slepian–Wolf data compression. In addition, it is proved
that:
For any Slepian–Wolf data compression scenario, there always exist linear
encoders over some finite non-field rings that achieve the optimal data
compression limit.
The conjecture is then closed with regard to existence. As a matter of fact, our
general conclusion also includes the corresponding field scenarios from Elias and
Csiszár as special cases.
Reason two: linear coding over non-field rings appears superior to others in
some source network problems. As a generalisation of the Slepian–Wolf problem,
the function encoding problem considers recovering a function of the source messages, instead of the original messages, from the encoder outputs. It arises in the
following applications.
1. The function encoding problem is actually a special case of the "many-help-one" source network problem. For example, if, in Figure 0.1, Zi = Xi ⊕2 Yi for all feasible i and f3 is a constant function (namely, the encoding rate of f3 is 0), then this two-help-one problem reduces to the problem of encoding the modulo-two sum.

[Figure 0.1: Two-Help-One Source Network. Encoders f1, f2 and f3 separately encode X^n = X1, X2, · · · , Xn, Y^n = Y1, Y2, · · · , Yn and Z^n = Z1, Z2, · · · , Zn; a decoder reconstructs Ẑ^n from the three encoder outputs.]

[Figure 0.2: Network Coding. A butterfly network with nodes A to G, in which the intermediate node D forwards the single message a + b formed from the messages a and b arriving from B and C.]
2. In network coding, an intermediate node is only interested in a function of the output messages from the sources (or its preceding nodes), instead of the original messages. As shown in Figure 0.2, node D is only required to bridge the message a + b, which is a function of the outputs from B and C. If the function encoding scheme is implemented, then the required capacities of link BD and link CD can be reduced, while it is still guaranteed that a + b is decoded correctly at node D.
[Figure 0.3: Partial Data Backup. Correlated data A = [A1, A2, · · · , An], B = [B1, B2, · · · , Bn] and C = [C1, C2, · · · , Cn] are stored in three data centres R1, R2 and R3; the backup stores f(Ai, Bi, Ci) for each i.]
3. “Partial data backup”. Consider that a vast amount of correlated data
A, B and C are stored in three data centres (see Figure 0.3). The “partial
backup” reliability requirement demands that, if one of the data centres
fails, then we must be able to recover the original data stored in this
centre from the other two data centres and the backup. Since facilities
are highly reliable nowadays, failures occur with very low risk. It is even more unlikely that more than one data centre malfunctions at the same time. Therefore, the maintenance cost mainly comes from the network traffic
required to perform frequent backups. The worst solution is to transfer
all the data A, B and C and store their duplicated copies in the backup.
Yet a better one is to store a function, say f , of the data in the backup.
This will reduce not only the size of the backup data but also the required
network traffic. This is because the sum rate for encoding the function
f is usually significantly smaller than the one for encoding the original
sources A, B and C. One specific method is to view data A, B and C as
sequences of elements from set {0, 1} and let
f(0, 0, 0) = 0;    f(0, 1, 1) = 1;
f(0, 0, 1) = 3;    f(1, 0, 1) = 0;
f(0, 1, 0) = 2;    f(1, 1, 0) = 3;
f(1, 0, 0) = 1;    f(1, 1, 1) = 2.
It is easy to see that, with the backup storing this function of the data,
when one data centre fails we can recover the data with the aid of the
backup and the two available data centres.
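Incidentally, one can check that the table above coincides with f(a, b, c) = (a + 2b + 3c) mod 4. The following brute-force sketch (our own illustration, not part of the thesis) confirms the claimed recoverability: each centre's symbol is uniquely determined by the backup value and the two surviving centres.

from itertools import product

# The backup function from the table; it equals (a + 2b + 3c) mod 4 on {0,1}^3.
def f(a, b, c):
    return (a + 2 * b + 3 * c) % 4

# For every data triple, each coordinate is uniquely determined
# by the backup value f and the other two coordinates.
for pos in range(3):
    for x in product((0, 1), repeat=3):
        candidates = [v for v in (0, 1)
                      if f(*[v if i == pos else x[i] for i in range(3)])
                      == f(*x)]
        assert candidates == [x[pos]], (pos, x)
print("every single-centre failure is recoverable from f and the rest")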
However, the achievable coding rate region for encoding an arbitrary function of
sources is unknown. Making use of the binary linear coding scheme, [KM79] showed
that the Slepian–Wolf limit is often sub-optimal for encoding the modulo-two sum
of two memoryless binary sources. For the special case of symmetric binary sources,
[KM79] in addition proved that the optimal coding rate limit is a symmetric region,
known as the Körner–Marton region. Yet the case of asymmetric binary sources is
left open, although [AH83] proved that the Körner–Marton region is unfortunately
sub-optimal.
From [KM79,AH83], it is seen that the linear coding technique (over the binary
field) is the key element that allows for achieving better coding rates outside the
Slepian–Wolf region. It is easy to generalise their results to encoding functions over
other finite fields with the help of the linear coding technique (over finite fields)
from Csiszár¹. In this thesis, we propose to use linear encoders over finite rings. We
will show that:
There exist (infinitely) many function encoding scenarios, in which the
non-field ring linear coding scheme strictly outperforms its field counterpart, as well as the Slepian–Wolf scheme, in terms of achieving better
coding rates.
Notice that the function encoding problem is a sub-problem in the above applications. Hence, it is plausible that the linear coding technique over finite rings can also
provide an alternative, possibly better, solution to these applications as it does to
the function encoding problem.
Reason three: the classical typicality idea does not work beyond the independent and identically distributed (i.i.d.) source scenarios. One can call upon the
Shannon–McMillan–Breiman (SMB) Theorem [Sha48, McM53, Bre57] to generalise the Slepian–Wolf data compression theorem from i.i.d. source scenarios to stationary ergodic source scenarios [Cov75] (and to asymptotically mean stationary
(a.m.s.) ergodic source scenarios if the SMB Theorem for a.m.s. ergodic processes
from [GK80] is applied). Similarly, the generalisation can also be done for the result
on linear encoders over fields from Csiszár. Unfortunately, that is not the case when
trying to generalise our results on linear coding over rings beyond the i.i.d. case
(e.g. irreducible Markov sources). One of the technical obstacles is that Shannon’s
argument on (strong or weak) typical sequences no longer works as expected. As a
result, a new concept of typical sequence, called Supremus typical sequence, and its
corresponding asymptotic equipartition property (AEP) and conditional typicality lemmata are introduced instead. Built on these new tools, corresponding results
on linear coding over finite rings are established for both irreducible Markov and
a.m.s. ergodic source scenarios.
The major differences of the mechanisms between classical typicality and Supremus typicality are seen by investigating the dynamical systems describing the random processes (sources). It is proved that all induced systems of an a.m.s. system
are a.m.s. As a consequence:
¹ This observation is part of the motivation of Csiszár's studies on linear codes over finite fields. In [Csi82], it reads "in some source network problems linear codes appear superior to others (cf. Körner and Marton [KM79])."
The SMB Theorem simultaneously holds for all reduced processes of an
a.m.s. ergodic random process.
From this we see that the classical typical sequences and the SMB Theorem do not represent and characterise, respectively, the corresponding a.m.s. ergodic random process well enough. To be more precise, the property that the SMB Theorem holds simultaneously for all reduced processes is not featured in the classical typicality concept. On the contrary, Supremus typicality takes the effect of all reduced processes into account. Its AEP further states that all non-typical sequences in the classical sense, together with all classical typical sequences that are not Supremus typical, are "negligible in probability."
Reason four: algorithms designed for rings are easier to implement compared to
the ones for fields. This is because a finite field is normally given by its polynomial
representation. Corresponding field operations are carried out based on the polynomial operations (addition and multiplication) followed by the polynomial long
division algorithm. In contrast, implementing arithmetic of many finite rings is
rather straightforward. For instance, the arithmetic of the ring Zq of integers modulo q, for any positive integer q, is simply integer arithmetic modulo q, and the arithmetic of matrix rings is matrix addition and multiplication.
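As a rough illustration of this contrast (a sketch under our own encoding conventions, not taken from the thesis), the ring Z4 needs nothing but the % operator, whereas multiplying in the field F4 = F2[x]/(x² + x + 1) already requires a carry-less product and polynomial reduction:

# Ring Z_4: arithmetic is plain integer arithmetic modulo 4.
def z4_add(a, b): return (a + b) % 4
def z4_mul(a, b): return (a * b) % 4

# Field F_4 = F_2[x]/(x^2 + x + 1): elements are 2-bit polynomials
# a1*x + a0, encoded as integers 0..3. Addition is bitwise XOR;
# multiplication needs a carry-less product plus reduction.
def f4_add(a, b): return a ^ b

def f4_mul(a, b):
    prod = 0
    for i in range(2):            # carry-less (polynomial) product
        if (b >> i) & 1:
            prod ^= a << i
    for i in (3, 2):              # reduce modulo x^2 + x + 1 (0b111)
        if (prod >> i) & 1:
            prod ^= 0b111 << (i - 2)
    return prod

assert z4_mul(2, 2) == 0          # Z_4 has zero divisors: 2 * 2 = 0
assert all(any(f4_mul(a, b) == 1 for b in range(1, 4))
           for a in range(1, 4))  # every non-zero element of F_4 is a unit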
At the time of writing, we can only settle the conjecture to the extent of existence. Nevertheless, there have already been several interesting discoveries along the way. Hopefully, more will be unveiled as these studies are carried on in the future.
0.2 Outline and Contributions
The remainder of the thesis is divided into several chapters. We summarise the
contents in each of them along with the contributions below.
Chapter 1 introduces some fundamental algebraic concepts and some related
properties that will be used in succeeding chapters.
Chapter 2 establishes an achievability theorem on linear coding over finite rings.
This theorem includes corresponding results from [Eli55] and [Csi82] as special
cases. In addition, we will also prove the optimality part (the converse) of this
theorem in various cases. In particular, it is shown that for some finite non-field rings optimality always holds. This implies that for any Slepian–Wolf data compression scenario, there always exist linear encoders over some non-field rings that achieve the same optimal coding rate limits as their field counterparts.
Chapter 3 addresses the function encoding problem. In this problem, the first
issue raised is how to handle an arbitrary function whose algebraic structure is
unclear. We suggest a polynomial approach based on the fact that any discrete
function defined on a finite domain is equivalent to a restriction of some polynomial
function over some finite ring. Namely, we can assume that the function considered
is presented as a polynomial function over some finite ring. This allows us to use the
linear coding technique over the corresponding ring to construct encoders and achieve
better coding rates by exploring the polynomial structure. As a demonstration,
we prove that linear coding over non-field rings strictly outperforms all of its field
counterparts in terms of achieving better coding rates for encoding many functions.
Chapter 4 provides some theoretical background used to generalise results from
Chapter 2 and Chapter 3 to the Markovian settings. This chapter investigates a new
type of typicality for sequences, termed Supremus typical sequences, for irreducible
Markov sources. It is seen that Supremus typicality is a condition stronger than
classical typicality from Shannon. Even though Supremus typical sequences form an (often strictly smaller) subset of classical typical sequences, the AEP is still valid.
Furthermore, Supremus typicality possesses properties that are more accessible and
easier to analyse than its classical counterpart.
Chapter 5 generalises results from Chapter 2 and Chapter 3 to the Markovian
settings. Seemingly, this can be easily done based on the SMB Theorem and the
argument built on Shannon’s typical sequences. Unfortunately, the end results so
obtained are often difficult to analyse. This is because it involves evaluating entropy rates of functions of a Markov process. Since a function of a Markov process
is usually not Markov, the results cannot provide much insight into the achievable
coding rates (optimal or not). To overcome this, we replace the argument based on
classical typicality with the one built on Supremus typicality introduced in Chapter
4. By exploring the properties from Supremus typicality, we obtain results that do
not involve any analysis of entropy rates. In fact, calculations of the end results
are simple and straightforward. Moreover, they are optimal as shown in many
examples.
Chapter 6 is dedicated to proving that an induced transformation with respect to
a finite measure set of a recurrent a.m.s. dynamical system with a σ-finite measure is
a.m.s. Since the SMB Theorem and the Shannon–McMillan Theorem hold for any
finite-state a.m.s. ergodic process [GK80], it is concluded that the SMB Theorem,
as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced
processes of any finite-state recurrent a.m.s. ergodic random process. We term this
recursive property the Extended SMB Theorem. This theorem is important because
it provides the theoretical background to further generalise some of our results from
the Markovian settings to a more general case, a.m.s. ergodic sources. It is seen from
Chapter 4 and Chapter 5 that the idea of Supremus typicality is important for our
analysis to work. Generalising this concept from Markov sources to a.m.s. sources
can also be straightforward. However, one needs to justify a corresponding AEP
of Supremus typicality defined for a.m.s. sources in order to prove corresponding
coding theorems. The Extended SMB Theorem is a key element for proving such
an AEP as seen in Chapter 7.
Chapter 7 establishes the AEP of Supremus typicality defined for recurrent
a.m.s. ergodic sources based on the Extended SMB Theorem given in Chapter 6.
The achievability theorem of linear coding over finite rings is then generalised to
the recurrent a.m.s. ergodic sources settings.
Chapter 8 summarises the thesis and provides some suggestions on future research directions.
0.3 Copyright Notice
Parts of the material presented in this thesis are based on the author's joint works previously published in or submitted to conferences [HS12d, HS12a, HS12b, HS13b, HS13c, HS14b] and journals [HS, HS12c, HS13a, HS14a] held by or sponsored by the Institute of Electrical and Electronics Engineers (IEEE), World Scientific, or the Royal Institute of Technology (KTH). IEEE, World Scientific, or KTH holds the copyright of the published papers and will hold the copyright of the submitted papers if they are accepted. Materials (e.g., figures, graphs, tables, or textual material) are reused in this thesis with permission.
0.4 Notation
We denote random variables, their corresponding realisations or deterministic values, and their alphabets by upper case, lower case, and script letters, respectively. For a positive integer $n$, $X^n$ designates the array
$$\left\{ X^{(1)}, X^{(2)}, \cdots, X^{(n)} \right\}.$$
Suppose that $X^{(i)} = \prod_{j=1}^{m} X_j^{(i)} := \left[ X_1^{(i)}, X_2^{(i)}, \cdots, X_m^{(i)} \right]^{\mathrm{tr}}$; then $X_T^{(i)}$ ($T \subseteq \{1, 2, \cdots, m\}$) is defined to be $\prod_{j \in T} X_j^{(i)}$, and $X_T^n$ stands for
$$\left\{ X_T^{(1)}, X_T^{(2)}, \cdots, X_T^{(n)} \right\}.$$
Similarly, $x^n$, $x_T^{(i)}$, $x_T^n$, $\mathcal{X}^n$, $\mathcal{X}_T^{(i)}$ and $\mathcal{X}_T^n$ resemble the corresponding definitions. In addition, the cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$, and all logarithms in the thesis are of base 2, unless stated otherwise.
Other notation used in this thesis is listed in the following:

$\mathbb{N}$ : the set of non-negative integers
$\mathbb{N}^+$ : the set of positive integers
$\mathbb{R}$ : the set of real numbers
$\mathrm{cov}(A)$ : the convex hull of a set $A \subseteq \mathbb{R}^n$
$\mathrm{supp}(p)$ : the support of the (probability mass) function $p$
$P_X$, $p(x)$ : probability distribution of the discrete random variable $X$
$X \sim P_X$ : random variable $X$ is distributed according to $P_X$
$H(X)$ : entropy of the random variable $X$
$H(X, Y)$ : joint entropy of the random variables $X$ and $Y$
$H(X|Y)$ : conditional entropy of the random variable $X$ given $Y$
$I(X; Y)$ : mutual information between the random variables $X$ and $Y$
$I(X; Y|Z)$ : conditional mutual information between $X$ and $Y$ given $Z$
$\Pr\{E\}$ : probability of the event $E$
$\Pr\{E_1 \mid E_2\}$ : probability of the event $E_1$ conditional on $E_2$
$\mathcal{T}_\epsilon(n, X)$ : the set of all $\epsilon$-typical sequences of length $n$ with respect to $X \sim P_X$
$\mathcal{T}_\epsilon(n, \mathbf{P})$ : the set of all Markov $\epsilon$-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $\mathbf{P}$
$\mathcal{T}_{H,\epsilon}(n, \mathbf{P})$ : the set of all modified weak $\epsilon$-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $\mathbf{P}$
$\mathcal{S}_\epsilon(n, \mathbf{P})$ : the set of all Supremus $\epsilon$-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $\mathbf{P}$
$\mathcal{S}_\epsilon\left(n, X^{(n)}\right)$ : the set of all Supremus $\epsilon$-typical sequences of length $n$ with respect to the random process $X^{(n)}$
$\mathcal{H}_\epsilon\left(n, X^{(n)}\right)$ : the set of all Hyper Supremus $\epsilon$-typical sequences of length $n$ with respect to the random process $X^{(n)}$
Chapter 1
Preliminaries: Finite Rings and
Polynomial Functions
This chapter provides some fundamental algebraic concepts and related properties. Readers who are already familiar with this material may still choose to go through it quickly to identify our notation.

1.1 Finite Rings
Definition 1.1.1. The tuple [R, +, ·] is called a ring if the following criteria are
met:
1. [R, +] is an Abelian group;
2. There exists a multiplicative identity¹ 1 ∈ R, namely, 1 · a = a · 1 = a, ∀ a ∈ R;
3. ∀ a, b, c ∈ R, a · b ∈ R and (a · b) · c = a · (b · c);
4. ∀ a, b, c ∈ R, a · (b + c) = (a · b) + (a · c) and (b + c) · a = (b · a) + (c · a).
We often write R for [R, +, ·] when the operations considered are known from
the context. The operation “·” is usually written by juxtaposition, ab for a · b, for
all a, b ∈ R.
A ring [R, +, ·] is said to be commutative if ∀ a, b ∈ R, a · b = b · a. In Definition
1.1.1, the identity of the group [R, +], denoted by 0, is called the zero. A ring [R, +, ·]
is said to be finite if the cardinality |R| is finite, and |R| is called the order of R.
The set Zq of integers modulo q is a commutative finite ring with respect to modular arithmetic. For any ring R, the set of all polynomials in s indeterminates over R is an infinite ring.
¹ Sometimes a ring without a multiplicative identity is considered. Such a structure has been
called a rng. We consider rings with multiplicative identities in this thesis. However, similar results
remain valid when considering rngs instead. Although we will occasionally comment on such
results, they are not fully considered in the present work.
Proposition 1.1.1. Given $s$ rings $R_1, R_2, \cdots, R_s$, for any non-empty set $T \subseteq \{1, 2, \cdots, s\}$, the Cartesian product (see [Rot10]) $R_T = \prod_{i \in T} R_i$ forms a new ring $[R_T, +, \cdot]$ with respect to the component-wise operations defined as follows:
$$\mathbf{a}' + \mathbf{a}'' = \left( a_1' + a_1'', a_2' + a_2'', \cdots, a_{|T|}' + a_{|T|}'' \right),$$
$$\mathbf{a}' \cdot \mathbf{a}'' = \left( a_1' a_1'', a_2' a_2'', \cdots, a_{|T|}' a_{|T|}'' \right),$$
$$\forall\, \mathbf{a}' = \left( a_1', a_2', \cdots, a_{|T|}' \right),\ \mathbf{a}'' = \left( a_1'', a_2'', \cdots, a_{|T|}'' \right) \in R_T.$$

Remark 1.1. In Proposition 1.1.1, $[R_T, +, \cdot]$ is called the direct product of $\{R_i \mid i \in T\}$. It can be easily seen that $(0, 0, \cdots, 0)$ and $(1, 1, \cdots, 1)$ are the zero and the multiplicative identity of $[R_T, +, \cdot]$, respectively.
Definition 1.1.2. A non-zero element $a$ of a ring $R$ is said to be invertible if and only if there exists $b \in R$ such that $ab = ba = 1$. $b$ is called the inverse of $a$, denoted by $a^{-1}$. An invertible element of a ring is called a unit.
Remark 1.2. It can be proved that the inverse of a unit is unique. By definition,
the multiplicative identity is the inverse of itself.
Let $R^* = R \setminus \{0\}$. The ring $[R, +, \cdot]$ is a field if and only if $[R^*, \cdot]$ is an Abelian group. In other words, all non-zero elements of $R$ are invertible. All fields are
commutative rings. Zq is a field if and only if q is a prime. All finite fields of the
same order are isomorphic to each other [DF03, pp. 549]. This “unique” field of
order q is denoted by Fq . It is necessary that q is a power of a prime. More details
regarding finite fields can be found in [DF03, Ch. 14.3].
Theorem 1.1.1 (Wedderburn’s little theorem cf. [Rot10, Theorem 7.13]). Let R
be a finite ring. R is a field if and only if all non-zero elements of R are invertible.
Remark 1.3. Wedderburn’s little theorem guarantees commutativity for a finite
ring if all of its non-zero elements are invertible. Hence, a finite ring is either a field
or at least one of its elements has no inverse. However, a finite commutative ring
is not necessarily a field, e.g. Zq is not a field if q is not a prime.
Definition 1.1.3 (cf. [DF03]). The characteristic of a finite ring $R$ is defined to be the smallest positive integer $m$ such that $\sum_{j=1}^{m} 1 = 0$, where 0 and 1 are the zero and the multiplicative identity of $R$, respectively. The characteristic of $R$ is often denoted by $\mathrm{Char}(R)$.

Remark 1.4. Clearly, $\mathrm{Char}(\mathbb{Z}_q) = q$. For a finite field $\mathbb{F}_q$, $\mathrm{Char}(\mathbb{F}_q)$ is always the prime $q_0$ such that $q = q_0^n$ for some integer $n$ [Rot10, Proposition 2.137].
Proposition 1.1.2. Let $\mathbb{F}_q$ be a finite field. For any $0 \neq a \in \mathbb{F}_q$, $m = \mathrm{Char}(\mathbb{F}_q)$ if and only if $m$ is the smallest positive integer such that $\sum_{j=1}^{m} a = 0$.
Proof. Since $a \neq 0$,
$$\sum_{j=1}^{m} a = 0 \;\Rightarrow\; a^{-1} \sum_{j=1}^{m} a = a^{-1} \cdot 0 \;\Rightarrow\; \sum_{j=1}^{m} 1 = 0 \;\Rightarrow\; \sum_{j=1}^{m} a = 0.$$
The statement is proved.
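A small illustrative check of Definition 1.1.3 (our own sketch; F4 addition is realised as bitwise XOR of coefficient vectors):

def characteristic(zero, one, add):
    # Smallest positive m with 1 + 1 + ... + 1 (m times) = 0.
    s, m = one, 1
    while s != zero:
        s, m = add(s, one), m + 1
    return m

assert characteristic(0, 1, lambda a, b: (a + b) % 6) == 6   # Char(Z_6) = 6
assert characteristic(0, 1, lambda a, b: a ^ b) == 2         # Char(F_4) = 2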
Definition 1.1.4. A subset I of a ring [R, +, ·] is said to be a left ideal of R,
denoted by I ≤l R, if and only if
1. [I, +] is a subgroup of [R, +];
2. ∀ x ∈ I and ∀ a ∈ R, a · x ∈ I.
If condition 2 is replaced by
3. ∀ x ∈ I and ∀ a ∈ R, x · a ∈ I,
then I is called a right ideal of R, denoted by I ≤r R. {0} is a trivial left (right)
ideal, usually denoted by 0.
The cardinality |I| is called the order of a finite left (right) ideal I.
Remark 1.5. Let $\{a_1, a_2, \cdots, a_n\}$ be a non-empty set of elements of some ring $R$. It is easy to verify that $\langle a_1, a_2, \cdots, a_n \rangle_r := \left\{ \sum_{i=1}^{n} a_i b_i \,\middle|\, b_i \in R,\ \forall\, 1 \le i \le n \right\}$ is a right ideal and $\langle a_1, a_2, \cdots, a_n \rangle_l := \left\{ \sum_{i=1}^{n} b_i a_i \,\middle|\, b_i \in R,\ \forall\, 1 \le i \le n \right\}$ is a left ideal. Furthermore, $\langle a_1, a_2, \cdots, a_n \rangle_r = \langle a_1, a_2, \cdots, a_n \rangle_l = R$ if some $a_i$ is a unit.
It is well-known that if I ≤l R, then R is divided into disjoint cosets which are
of equal size (cardinality). For any coset J, J = x + I = {x + y|y ∈ I}, ∀ x ∈ J. The
set of all cosets forms a left module over R, denoted by R/I. Similarly, R/I becomes
a right module over R if I ≤r R [AF92]. Of course, R/I can also be considered as
a quotient group [Rot10, Ch. 1.6 and Ch. 2.9]. However, its structure is far richer
than simply being a quotient group.
Proposition 1.1.3. Let $R_i$ ($1 \le i \le s$) be a ring and $R = \prod_{i=1}^{s} R_i$. For any $A \subseteq R$, $A \le_l R$ (or $A \le_r R$) if and only if $A = \prod_{i=1}^{s} A_i$ and $A_i \le_l R_i$ (or $A_i \le_r R_i$), $\forall\, 1 \le i \le s$.
Proof. We prove the $\le_l$ case only; the $\le_r$ case follows from a similar argument. Let $\pi_i$ ($1 \le i \le s$) be the coordinate function assigning to every element in $R$ its $i$th component. Then $A \subseteq \prod_{i=1}^{s} A_i$, where $A_i = \pi_i(A)$. Moreover, for any
$$\mathbf{x} = (\pi_1(\mathbf{x}_1), \pi_2(\mathbf{x}_2), \cdots, \pi_s(\mathbf{x}_s)) \in \prod_{i=1}^{s} A_i,$$
where $\mathbf{x}_i \in A$ for all feasible $i$, we have that
$$\mathbf{x} = \sum_{i=1}^{s} e_i \mathbf{x}_i,$$
where $e_i \in R$ has the $i$th coordinate being 1 and the others being 0. If $A \le_l R$, then $\mathbf{x} \in A$ by definition. Therefore, $\prod_{i=1}^{s} A_i \subseteq A$. Consequently, $A = \prod_{i=1}^{s} A_i$. Since $\pi_i$ is a homomorphism, we also have that $A_i \le_l R_i$ for all feasible $i$. The other direction is easily verified by definition.
Remark 1.6. It is worthwhile to point out that Proposition 1.1.3 does not hold for an infinite index set, namely $R = \prod_{i \in I} R_i$, where $I$ is not finite.

For any $\emptyset \neq T \subseteq S$, Proposition 1.1.3 states that any left (right) ideal of $R_T$ is a Cartesian product of some left (right) ideals of $R_i$, $i \in T$. Let $I_i$ be a left (right) ideal of ring $R_i$. We define $I_T$ to be the left (right) ideal $\prod_{i \in T} I_i$ of $R_T$.

Let $\mathbf{x}^{\mathrm{tr}}$ be the transpose of a vector (or matrix) $\mathbf{x}$.
Definition 1.1.5. A mapping $f : R^n \to R^m$ given as
$$f(x_1, x_2, \cdots, x_n) = \left( \sum_{j=1}^{n} a_{1,j} x_j, \cdots, \sum_{j=1}^{n} a_{m,j} x_j \right)^{\mathrm{tr}}, \ \forall\, (x_1, \cdots, x_n) \in R^n, \quad (1.1.1)$$
where $a_{i,j} \in R$ for all feasible $i$ and $j$, is called a left linear mapping over the ring $R$. Similarly,
$$f(x_1, x_2, \cdots, x_n) = \left( \sum_{j=1}^{n} x_j a_{1,j}, \cdots, \sum_{j=1}^{n} x_j a_{m,j} \right)^{\mathrm{tr}}, \ \forall\, (x_1, \cdots, x_n) \in R^n,$$
defines a right linear mapping over the ring $R$. If $m = 1$, then $f$ is called a left (right) linear function over $R$.
From now on, left linear mapping (function) or right linear mapping (function)
are simply called linear mapping (function). This will not lead to any confusion
since the intended use can usually be clearly distinguished from the context.
Remark 1.7. The mapping $f$ in Definition 1.1.5 is called linear in accordance with the definition of a linear mapping (function) over a field. In fact, the two structures have several similar properties. Moreover, (1.1.1) is equivalent to
$$f(x_1, x_2, \cdots, x_n) = A (x_1, x_2, \cdots, x_n)^{\mathrm{tr}}, \ \forall\, (x_1, x_2, \cdots, x_n) \in R^n, \quad (1.1.2)$$
where $A$ is an $m \times n$ matrix over $R$ and $[A]_{i,j} = a_{i,j}$ for all feasible $i$ and $j$. $A$ is named the coefficient matrix. It is easy to prove that a linear mapping is uniquely determined by its coefficient matrix, and vice versa. The linear mapping $f$ is said to be trivial, denoted by 0, if $A$ is the zero matrix, i.e. $[A]_{i,j} = 0$ for all feasible $i$ and $j$.
Let $A$ be an $m \times n$ matrix over a ring $R$ and $f(\mathbf{x}) = A\mathbf{x}$, $\forall\, \mathbf{x} \in R^n$. For the system of linear equations
$$f(\mathbf{x}) = A\mathbf{x} = \mathbf{0}, \quad \text{where } \mathbf{0} = (0, 0, \cdots, 0)^{\mathrm{tr}} \in R^m,$$
let $S(f)$ be the set of all solutions, namely $S(f) = \{\mathbf{x} \in R^n \mid f(\mathbf{x}) = \mathbf{0}\}$. It is obvious that $S(f) = R^n$ if $f$ is trivial, i.e. $A$ is a zero matrix. If $R$ is a field, then $S(f)$ is a subspace of $R^n$. We conclude this section with a lemma regarding the cardinalities of $R^n$ and $S(f)$.
Lemma 1.1.1. For a finite ring $R$ and a linear function
$$f : \mathbf{x} \mapsto (a_1, a_2, \cdots, a_n)\mathbf{x} \quad \text{or} \quad f : \mathbf{x} \mapsto \mathbf{x}^{\mathrm{tr}} (a_1, a_2, \cdots, a_n)^{\mathrm{tr}}, \ \forall\, \mathbf{x} \in R^n,$$
we have
$$\frac{|S(f)|}{|R|^n} = \frac{1}{|I|},$$
where $I = \langle a_1, a_2, \cdots, a_n \rangle_r$ (or $I = \langle a_1, a_2, \cdots, a_n \rangle_l$). In particular, if $a_i$ is invertible for some $1 \le i \le n$, then $|S(f)| = |R|^{n-1}$.

Proof. It is obvious that the image $f(R^n) = I$ by definition. Moreover, $\forall\, x \neq y \in I$, the pre-images $f^{-1}(x)$ and $f^{-1}(y)$ satisfy $f^{-1}(x) \cap f^{-1}(y) = \emptyset$ and $|f^{-1}(x)| = |f^{-1}(y)| = |S(f)|$. Therefore, $|I|\,|S(f)| = |R|^n$, i.e. $\frac{|S(f)|}{|R|^n} = \frac{1}{|I|}$. Moreover, if $a_i$ is a unit, then $I = R$; thus $|S(f)| = |R|^n / |R| = |R|^{n-1}$.
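Lemma 1.1.1 is easy to confirm exhaustively for a small ring; the sketch below (our own, for R = Z6 and n = 2, where left and right ideals coincide because Z6 is commutative) checks |S(f)| · |I| = |R|^n for every choice of coefficients:

from itertools import product

q, n = 6, 2                        # the ring Z_6, sequences of length 2

def ideal(gens):
    # <a_1,...,a_n> = { sum a_i * b_i : b_i in Z_q }
    return {sum(a * b for a, b in zip(gens, bs)) % q
            for bs in product(range(q), repeat=len(gens))}

for gens in product(range(q), repeat=n):
    solutions = [x for x in product(range(q), repeat=n)
                 if sum(a * xi for a, xi in zip(gens, x)) % q == 0]
    I = ideal(gens)
    assert len(solutions) * len(I) == q ** n   # |S(f)| = |R|^n / |I|
print("Lemma 1.1.1 verified over Z_6 for n = 2")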
1.2 Polynomial Functions
Definition 1.2.1. A polynomial function² of $k$ variables over a finite ring $R$ is a function $g : R^k \to R$ of the form
$$g(x_1, x_2, \cdots, x_k) = \sum_{j=0}^{m} a_j x_1^{m_{1j}} x_2^{m_{2j}} \cdots x_k^{m_{kj}}, \quad (1.2.1)$$
where $a_j \in R$ and $m$ and the $m_{ij}$'s are non-negative integers. The set of all polynomial functions of $k$ variables over the ring $R$ is designated by $R[k]$.
Remark 1.8. Polynomial and polynomial function are sometimes only defined over
a commutative ring [Rot10, MS84]. It is a very delicate matter to define them over
a non-commutative ring [Hun80, Lam01], due to the fact that x1 x2 and x2 x1 can
become different objects. We choose to define “polynomial functions” with formula
(1.2.1) because those functions are within the scope of this work’s interest.
Lemma 1.2.1 (cf. [LN97, Lemma 7.40]). For any polynomial function g ∈ Fq [k],
where q is a power of a prime and k ∈ N+ , there exists a unique polynomial function
h ∈ Fq [k] of degree less than q in each variable with h = g.
² Polynomial and polynomial function are distinct concepts.
Lemma 1.2.2. Let $q$ be a power of a prime. The number of polynomial functions in $\mathbb{F}_q[k]$ ($k \in \mathbb{N}^+$) is $q^{q^k}$. Moreover, any function $g : \mathbb{F}_q^k \to \mathbb{F}_q$ is a polynomial function in $\mathbb{F}_q[k]$.

Proof. By Lemma 1.2.1, we have that
$$|\mathbb{F}_q[k]| \le |\{\text{polynomial functions of degree less than } q \text{ in each variable}\}| = q^{q^k}.$$
On the other hand, two distinct polynomials of degree less than $q$ in each variable never define the same function. Thus,
$$|\mathbb{F}_q[k]| \ge |\{\text{polynomial functions of degree less than } q \text{ in each variable}\}| = q^{q^k}.$$
Consequently, $|\mathbb{F}_q[k]| = q^{q^k}$.

In addition, let $A$ be the set of all functions with domain $\mathbb{F}_q^k$ and codomain $\mathbb{F}_q$. Obviously, $\mathbb{F}_q[k] \subseteq A$. In the meanwhile, $|A| = q^{q^k}$. Therefore, $\mathbb{F}_q[k] = A$.
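For a concrete instance of Lemma 1.2.2 (an illustration we add, with q = 3 and k = 1), enumerating the 27 polynomials of degree less than 3 over F3 yields exactly the 27 functions from F3 to F3:

from itertools import product

q = 3  # F_3 = Z_3, since 3 is prime

# Each coefficient triple (c0, c1, c2) gives the polynomial
# c0 + c1*x + c2*x^2, of degree < q in its one variable.
funcs = {tuple((c0 + c1 * x + c2 * x * x) % q for x in range(q))
         for c0, c1, c2 in product(range(q), repeat=3)}

assert len(funcs) == q ** q                        # |F_q[1]| = q^(q^1) = 27
assert funcs == set(product(range(q), repeat=q))   # every map F_3 -> F_3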
Remark 1.9. The special case of Lemma 1.2.2 with q being a prime and k = 1
can be easily verified with Fermat’s Little Theorem.
Theorem 1.2.1 (Fermat’s Little Theorem). p divides ap−1 −1 whenever p is prime
and a is coprime to p, i.e. ap = a mod p.
Definition 1.2.2. Let $g_1 : \prod_{i=1}^{s} \mathcal{X}_i \to \Omega_1$ and $g_2 : \prod_{i=1}^{s} \mathcal{Y}_i \to \Omega_2$ be two functions. If there exist bijections $\mu_i : \mathcal{X}_i \to \mathcal{Y}_i$, $\forall\, 1 \le i \le s$, and $\nu : \Omega_1 \to \Omega_2$, such that
$$g_1(x_1, x_2, \cdots, x_s) = \nu^{-1}\left( g_2(\mu_1(x_1), \mu_2(x_2), \cdots, \mu_s(x_s)) \right),$$
then $g_1$ and $g_2$ are said to be equivalent (via $\mu_1, \mu_2, \cdots, \mu_s$ and $\nu$).
Definition 1.2.3. Given a function $g : D \to \Omega$, let $\emptyset \neq S \subseteq D$. The restriction of $g$ to $S$ is defined to be the function $g|_S : S \to \Omega$ such that $g|_S : x \mapsto g(x)$, $\forall\, x \in S$.
Lemma 1.2.3. For any discrete function $g : \prod_{i=1}^{k} \mathcal{X}_i \to \Omega$ with the $\mathcal{X}_i$'s and $\Omega$ being finite, there always exist a finite ring (field) $R$ and a polynomial function $\hat{g} \in R[k]$ such that
$$\nu(g(x_1, x_2, \cdots, x_k)) = \hat{g}(\mu_1(x_1), \mu_2(x_2), \cdots, \mu_k(x_k))$$
for some injections $\mu_i : \mathcal{X}_i \to R$ ($1 \le i \le k$) and $\nu : \Omega \to R$.

Proof. For any injections $\mu_i : \mathcal{X}_i \to R$ ($1 \le i \le k$) and $\nu : \Omega \to R$, the function
$$\hat{g} = \nu \circ g\,(\mu_1', \mu_2', \cdots, \mu_k') : R^k \to R,$$
where $\mu_i'$ is the inverse mapping of $\mu_i : \mathcal{X}_i \to \mu_i(\mathcal{X}_i)$, must be a polynomial function by Lemma 1.2.2. The statement is established.
Remark 1.10. Up to equivalence, a function can be presented in many different formats. For example, the function $\min\{x, y\}$ defined on $\{0, 1\} \times \{0, 1\}$ (with ordering $0 \le 1$) can either be seen as $F_1(x, y) = xy$ on $\mathbb{Z}_2^2$ or be treated as the restriction of $F_2(x, y) = x + y - (x + y)^2$ defined on $\mathbb{Z}_3^2$ to the domain $\{0, 1\} \times \{0, 1\} \subsetneq \mathbb{Z}_3^2$.
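Both presentations are quickly verified (our own two-line check):

# min{x,y} on {0,1}^2: as x*y over Z_2, and as the restriction of
# x + y - (x + y)^2 over Z_3 to {0,1}^2.
for x in (0, 1):
    for y in (0, 1):
        assert (x * y) % 2 == min(x, y)
        assert (x + y - (x + y) ** 2) % 3 == min(x, y)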
Lemma 1.2.3 states that any discrete function defined on a finite domain is
equivalent to a restriction of some polynomial function over some finite ring (field).
As a consequence, we can restrict a problem considering an arbitrary function with
a finite domain to the problem considering only polynomial functions and their
restrictions that are equivalent to this arbitrary function. This polynomial approach
offers valuable insight into the general problem, because the algebraic structure of
a polynomial function is clearer than that of an arbitrary function. We often call ĝ
in Lemma 1.2.3 a polynomial presentation of $g$. In addition, if $\hat{g}$ admits
$$\hat{g} = h \circ k, \quad \text{where } k(x_1, x_2, \cdots, x_s) = \sum_{i=1}^{s} k_i(x_i),$$
and h, ki ’s are functions mapping R to R, then it is named a nomographic function
over R (by terminology borrowed from [Buc82]), and it is said to be a nomographic
presentation of g if g is equivalent to a restriction of ĝ.
Lemma 1.2.4. Let $\mathcal{X}_1, \mathcal{X}_2, \cdots, \mathcal{X}_s$ and $\Omega$ be some finite sets. For any discrete function $g : \prod_{i=1}^{s} \mathcal{X}_i \to \Omega$, there exists a nomographic function $\hat{g}$ over some finite ring (field) $R$ such that
$$\nu(g(x_1, x_2, \cdots, x_s)) = \hat{g}(\mu_1(x_1), \mu_2(x_2), \cdots, \mu_s(x_s))$$
for some injections $\mu_i : \mathcal{X}_i \to R$ ($1 \le i \le s$) and $\nu : \Omega \to R$.

Proof. Let $\mathbb{F}$ be a finite field such that $|\mathbb{F}| \ge |\mathcal{X}_i|$ for all $1 \le i \le s$ and $|\mathbb{F}|^s \ge |\Omega|$, and let $R$ be the splitting field of $\mathbb{F}$ of order $|\mathbb{F}|^s$ (one example of the pair $\mathbb{F}$ and $R$ is $\mathbb{Z}_p$, where $p$ is some prime, and its Galois extension of degree $s$). It is easily seen that $R$ is an $s$-dimensional vector space over $\mathbb{F}$. Hence, there exist $s$ vectors $v_1, v_2, \cdots, v_s \in R$ that are linearly independent. Let $\mu_i$ be an injection from $\mathcal{X}_i$ to the subspace generated by the vector $v_i$. It is easy to verify that $k = \sum_{i=1}^{s} \mu_i$ is injective, since $v_1, v_2, \cdots, v_s$ are linearly independent. Let $k'$ be the inverse mapping of $k : \prod_{i=1}^{s} \mathcal{X}_i \to k\left( \prod_{i=1}^{s} \mathcal{X}_i \right)$ and let $\nu : \Omega \to R$ be any injection. By the second half of Lemma 1.2.2, there exists a polynomial function $h \in R[1]$ such that $h = \nu \circ g \circ k'$. Let $\hat{g}(x_1, x_2, \cdots, x_s) = h\left( \sum_{i=1}^{s} x_i \right)$. The statement is proved.
Remark 1.11. In the above proof, k is chosen to be injective because the proof
includes the case that g is an identity function. In general, k is not necessarily
injective.
Chapter 2
Linear Coding
This chapter is dedicated to establishing an achievability theorem regarding linear coding over finite rings (LCoR). It will be seen that this includes corresponding results from [Eli55] and [Csi82] as special cases. In addition, we will also prove the optimality part (the converse) of this theorem in various cases. In particular, it is shown that for some finite non-field rings optimality is always claimed. This implies that for any Slepian–Wolf data compression scenario, there always exist linear encoders over some non-field rings that achieve the same optimal coding rate limits as their field counterparts.

From Chapter 0, we learnt that the Slepian–Wolf problem is a special case of the function encoding problem. To define the latter in rigorous terms:
Problem 2.1 (Source Coding for Computing). Let $i \in S = \{1, 2, \cdots, s\}$ be a discrete memoryless source (DMS) that randomly generates i.i.d. discrete data $X_i^{(1)}, X_i^{(2)}, \cdots, X_i^{(n)}, \cdots$, where $X_i^{(n)}$ has a finite sample space $\mathcal{X}_i$ and $X_S^{(n)} \sim p$, $\forall\, n \in \mathbb{N}^+$. For a discrete function $g : \mathcal{X}_S \to \Omega$, what is the largest region $\mathcal{R}[g] \subset \mathbb{R}_+^s$, such that, $\forall\, (R_1, R_2, \cdots, R_s) \in \mathcal{R}[g]$ and $\forall\, \epsilon > 0$, there exists an $N_0 \in \mathbb{N}^+$, such that for all $n > N_0$, there exist $s$ encoders $\phi_i : \mathcal{X}_i^n \to \{1, 2, \cdots, 2^{nR_i}\}$, $i \in S$, and one decoder $\psi : \prod_{i \in S} \{1, 2, \cdots, 2^{nR_i}\} \to \Omega^n$, with
$$\Pr\left\{ \vec{g}\left(X_S^n\right) \neq \psi\left[ \phi_1\left(X_1^n\right), \phi_2\left(X_2^n\right), \cdots, \phi_s\left(X_s^n\right) \right] \right\} < \epsilon,$$
where $\vec{g}\left(X_S^n\right) = \left[ g\left(X_S^{(1)}\right), g\left(X_S^{(2)}\right), \cdots, g\left(X_S^{(n)}\right) \right]^{\mathrm{tr}} \in \Omega^n$?

The region $\mathcal{R}[g]$ is called the achievable coding rate region for computing $g$. A rate tuple $\mathbf{R} \in \mathbb{R}^s$ is said to be achievable for computing $g$ (or simply achievable) if and only if $\mathbf{R} \in \mathcal{R}[g]$. A region $\mathcal{R} \subset \mathbb{R}^s$ is said to be achievable for computing $g$ (or simply achievable) if and only if $\mathcal{R} \subseteq \mathcal{R}[g]$.
Obviously, in the problem of source coding for computing, Problem 2.1, the decoder is only interested in recovering a function of the message(s), rather than
the original message(s), that is (are) i.i.d. generated and independently encoded by the source(s). If $g$ is an identity function, the computing problem is exactly the Slepian–Wolf source coding problem. $\mathcal{R}[g]$ is then the Slepian–Wolf region [SW73],
$$\mathcal{R}[X_1, X_2, \cdots, X_s] = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{j \in T} R_j > H(X_T | X_{T^c}),\ \forall\, \emptyset \neq T \subseteq S \right\},$$
where $T^c$ is the complement of $T$ in $S$. However, from [SW73] it is hard to draw
conclusions regarding the structure of the optimal encoders, as the corresponding
mappings are chosen randomly among all feasible mappings. This limits the scope
of their potential applications. As a complement, linear coding over finite fields (LCoF), where the Xi's are injectively mapped into subsets of some finite fields and the φi's are chosen as linear mappings over these fields, is considered. It is shown
that LCoF achieves the same encoding limit, the Slepian–Wolf region [Eli55,Csi82].
Although it seems straightforward to study linear mappings over rings (non-field
rings in particular), it has not been proved (nor denied) that linear encoding over
non-field rings can be equally optimal.
This chapter will concentrate on addressing this problem. We will prove that
linear encoding over non-field rings can be equally optimal.
2.1 Linear Coding over Finite Rings
In this section, we will present a coding rate region achieved with LCoR for the
Slepian–Wolf source coding problem, i.e. g is an identity function in Problem 2.1.
This region is exactly the Slepian–Wolf region if all the rings considered are fields.
However, being a field is not necessary, as seen in Section 2.3, where the issue of optimality is addressed.
Before proceeding, a subtlety needs to be cleared up. It is assumed that a source, say i, generates data taking values from a finite sample space Xi, while Xi does not necessarily admit any algebraic structure. We have to either assume that Xi has a certain algebraic structure, for instance that Xi is a ring, or injectively
map elements of Xi into some algebraic structure. In our subsequent discussions,
we assume that Xi is mapped into a finite ring Ri of order at least |Xi | by some
injection Φi . Hence, Xi can simply be treated as a subset Φi (Xi ) ⊆ Ri for a fixed
Φi . When required, Φi can also be selected to obtain desired outcomes.
To facilitate our discussion, the following notation is used. For $X_S \sim p$, we denote the marginal of $p$ with respect to $X_T$ ($\emptyset \neq T \subseteq S$) by $p_{X_T}$, i.e. $X_T \sim p_{X_T}$, and define $H(p_{X_T})$ to be $H(X_T)$. In addition,
$$M(\mathcal{X}_S, R_S) := \{ (\Phi_1, \Phi_2, \cdots, \Phi_s) \mid \Phi_i : \mathcal{X}_i \to R_i \text{ is injective},\ \forall\, i \in S \}$$
($|R_i| \ge |\mathcal{X}_i|$ is implicitly assumed), and $\Phi(x_T) := \prod_{i \in T} \Phi_i(x_i)$ for any $\Phi \in M(\mathcal{X}_S, R_S)$ and $x_T \in \mathcal{X}_T$.
For any $\Phi \in M(\mathcal{X}_S, R_S)$, let
$$\mathcal{R}_\Phi = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log|I_i|}{\log|R_i|} > r(T, I_T),\ \forall\, \emptyset \neq T \subseteq S,\ \forall\, 0 \neq I_i \le_l R_i \right\}, \quad (2.1.1)$$
where $r(T, I_T) = H(X_T | X_{T^c}) - H(Y_{R_T/I_T} | X_{T^c})$ and $Y_{R_T/I_T} = \Phi(X_T) + I_T$ is a random variable with sample space $R_T/I_T$ (a left module).
Theorem 2.1.1. $\mathcal{R}_\Phi$ is achievable with linear coding over the finite rings $R_1, R_2, \cdots, R_s$. In exact terms, $\forall\, \epsilon > 0$, there exists $N_0 \in \mathbb{N}^+$ such that, for all $n > N_0$, there exist linear encoders (left linear mappings, to be more precise) $\phi_i : \Phi(\mathcal{X}_i)^n \to R_i^{k_i}$ ($i \in S$) and a decoder $\psi$, such that
$$\Pr\left\{ \psi\left( \phi_1(\mathbf{X}_1), \phi_2(\mathbf{X}_2), \cdots, \phi_s(\mathbf{X}_s) \right) \neq (\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_s) \right\} < \epsilon,$$
where $\mathbf{X}_i = \left[ \Phi\left(X_i^{(1)}\right), \Phi\left(X_i^{(2)}\right), \cdots, \Phi\left(X_i^{(n)}\right) \right]^{\mathrm{tr}}$, as long as
$$\left( \frac{k_1 \log|R_1|}{n}, \frac{k_2 \log|R_2|}{n}, \cdots, \frac{k_s \log|R_s|}{n} \right) \in \mathcal{R}_\Phi.$$

Proof. The proof is given in Section 2.2.
The following is a concrete example helping to interpret this theorem.
Example 2.1.1. Consider the single source scenario, where $X_1 \sim p$ and $\mathcal{X}_1 = \mathbb{Z}_6$, specified as follows.

X_1     :  0     1    2     3    4    5
p(X_1)  :  0.05  0.1  0.15  0.2  0.2  0.3

Obviously, $\mathbb{Z}_6$ contains 3 non-trivial ideals: $I_1 = \{0, 3\}$, $I_2 = \{0, 2, 4\}$ and $\mathbb{Z}_6$. Meanwhile, $Y_{\mathbb{Z}_6/I_1}$ and $Y_{\mathbb{Z}_6/I_2}$ admit the distributions

Y_{Z_6/I_1}     :  I_1    1 + I_1   2 + I_1
p(Y_{Z_6/I_1})  :  0.25   0.3       0.45

and

Y_{Z_6/I_2}     :  I_2   1 + I_2
p(Y_{Z_6/I_2})  :  0.4   0.6

respectively. In addition, $Y_{\mathbb{Z}_6/\mathbb{Z}_6} = \mathbb{Z}_6$ is a constant. Thus, by Theorem 2.1.1, rate $R_1$ is achievable if
$$\frac{R_1 \log|I_1|}{\log|\mathbb{Z}_6|} = \frac{R_1 \log 2}{\log 6} > H(X_1) - H(Y_{\mathbb{Z}_6/I_1}) = 2.40869 - 1.53949 = 0.86920,$$
$$\frac{R_1 \log|I_2|}{\log|\mathbb{Z}_6|} = \frac{R_1 \log 3}{\log 6} > H(X_1) - H(Y_{\mathbb{Z}_6/I_2}) = 2.40869 - 0.97095 = 1.43774,$$
$$\text{and} \quad \frac{R_1 \log|\mathbb{Z}_6|}{\log|\mathbb{Z}_6|} = R_1 > H(X_1) - H(Y_{\mathbb{Z}_6/\mathbb{Z}_6}) = H(X_1) = 2.40869.$$
In other words,
$$\mathcal{R} = \{ R_1 \in \mathbb{R} \mid R_1 > \max\{2.24685, 2.34485, 2.40869\} \} = \{ R_1 \in \mathbb{R} \mid R_1 > 2.40869 = H(X_1) \}$$
is achievable with linear coding over the ring $\mathbb{Z}_6$. Obviously, $\mathcal{R}$ is just the Slepian–Wolf region $\mathcal{R}[X_1]$. Optimality is claimed.
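The numbers in this example can be reproduced as follows (a verification sketch we add; the printed values are the rounded outputs):

from math import log2

p = {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.2, 5: 0.3}  # X_1 on Z_6

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def quotient_dist(ideal):
    # Distribution of Y_{Z_6/I}: lump together the symbols in each coset.
    cosets = {}
    for x, pr in p.items():
        coset = frozenset((x + a) % 6 for a in ideal)
        cosets[coset] = cosets.get(coset, 0) + pr
    return cosets

HX = H(p)                                                 # 2.40869
for ideal, size in (({0, 3}, 2), ({0, 2, 4}, 3), (set(range(6)), 6)):
    bound = (HX - H(quotient_dist(ideal))) * log2(6) / log2(size)
    print(f"|I| = {size}: R_1 > {bound:.5f}")
# prints approximately 2.24685, 2.34485 and 2.40869; the binding
# constraint is R_1 > H(X_1), i.e. the Slepian-Wolf limit.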
Besides, we would like to point out that some of the inequalities defining (2.1.1)
are not active for specific scenarios. Two classes of these scenarios are discussed in
the following theorems.
Theorem 2.1.2. Suppose $R_i$ ($1 \le i \le s$) is a (finite) product ring $\prod_{l=1}^{k_i} R_{l,i}$ of finite rings $R_{l,i}$, and the sample space $\mathcal{X}_i$ satisfies $|\mathcal{X}_i| \le |R_{l,i}|$ for all feasible $i$ and $l$. Given injections $\Phi_{l,i} : \mathcal{X}_i \to R_{l,i}$, let
$$\Phi = (\Phi_1, \Phi_2, \cdots, \Phi_s),$$
where $\Phi_i = \prod_{l=1}^{k_i} \Phi_{l,i}$ is defined as
$$\Phi_i : x_i \mapsto (\Phi_{1,i}(x_i), \Phi_{2,i}(x_i), \cdots, \Phi_{k_i,i}(x_i)) \in R_i, \ \forall\, x_i \in \mathcal{X}_i.$$
We have that
$$\mathcal{R}_{\Phi,\mathrm{prod}} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log|I_i|}{\log|R_i|} > H(X_T | Y_{R_T/I_T}, X_{T^c}),\ \forall\, \emptyset \neq T \subseteq S,\ \forall\, I_i = \prod_{l=1}^{k_i} I_{l,i} \text{ with } 0 \neq I_{l,i} \le_l R_{l,i} \right\}, \quad (2.1.2)$$
where $Y_{R_T/I_T} = \Phi(X_T) + I_T$, is achievable with linear coding over $R_1, R_2, \cdots, R_s$. Moreover, $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,\mathrm{prod}}$.

Proof. The proof is found in Section 2.2.
Let $R$ be a finite ring and
$$\mathcal{M}_{L,R,m} = \left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, a_1, a_2, \cdots, a_m \in R \right\},$$
where $m$ is a positive integer. It is easy to verify that $\mathcal{M}_{L,R,m}$ is a ring with respect to matrix operations. Moreover, $I$ is a left ideal of $\mathcal{M}_{L,R,m}$ if and only if
$$I = \left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, a_j \in I_j \le_l R,\ \forall\, 1 \le j \le m;\ I_j \subseteq I_{j+1},\ \forall\, 1 \le j < m \right\}.$$
Define $\mathcal{O}(\mathcal{M}_{L,R,m})$ to be the set of all left ideals of the form
$$\left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, a_j \in I_j \le_l R,\ \forall\, 1 \le j \le m;\ I_j \subseteq I_{j+1},\ \forall\, 1 \le j < m;\ I_i = 0 \text{ for some } 1 \le i \le m \right\}.$$

Theorem 2.1.3. Let $R_i$ ($1 \le i \le s$) be a finite ring such that $|\mathcal{X}_i| \le |R_i|$. For any injections $\Phi_i' : \mathcal{X}_i \to R_i$, let
$$\Phi = (\Phi_1, \Phi_2, \cdots, \Phi_s),$$
where $\Phi_i : \mathcal{X}_i \to \mathcal{M}_{L,R_i,m_i}$ is defined as
$$\Phi_i : x_i \mapsto \begin{bmatrix} \Phi_i'(x_i) & 0 & \cdots & 0 \\ \Phi_i'(x_i) & \Phi_i'(x_i) & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ \Phi_i'(x_i) & \Phi_i'(x_i) & \cdots & \Phi_i'(x_i) \end{bmatrix}, \ \forall\, x_i \in \mathcal{X}_i.$$
We have that
$$\mathcal{R}_{\Phi,m} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log|I_i|}{\log|R_i|} > H(X_T | Y_{R_T/I_T}, X_{T^c}),\ \forall\, \emptyset \neq T \subseteq S,\ \forall\, I_i \le_l \mathcal{M}_{L,R_i,m_i} \text{ and } I_i \notin \mathcal{O}(\mathcal{M}_{L,R_i,m_i}) \right\}, \quad (2.1.3)$$
where $Y_{R_T/I_T} = \Phi(X_T) + I_T$, is achievable with linear coding over $\mathcal{M}_{L,R_1,m_1}, \mathcal{M}_{L,R_2,m_2}, \cdots, \mathcal{M}_{L,R_s,m_s}$. Moreover, $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,m}$.

Proof. The proof is found in Section 2.2.
Remark 2.1. The difference between (2.1.1), (2.1.2) and (2.1.3) lies in their respective restrictions defining the $I_i$'s, as highlighted in the proofs given in Section 2.2.

Remark 2.2. Without much effort, one can see that $\mathcal{R}_\Phi$ ($\mathcal{R}_{\Phi,\mathrm{prod}}$ and $\mathcal{R}_{\Phi,m}$, resp.) in Theorem 2.1.1 (Theorem 2.1.2 and Theorem 2.1.3, resp.) depends on $\Phi$ via the random variables $Y_{R_T/I_T}$, whose distributions are determined by $\Phi$. For each $i \in S$, there exist $\frac{|R_i|!}{(|R_i| - |\mathcal{X}_i|)!}$ distinct injections from $\mathcal{X}_i$ to a ring $R_i$ of order at least $|\mathcal{X}_i|$. Let $\mathrm{cov}(A)$ be the convex hull of a set $A \subseteq \mathbb{R}^s$. By a straightforward time-sharing argument, we have that
$$\mathcal{R}_l = \mathrm{cov}\left( \bigcup_{\Phi \in M(\mathcal{X}_S, R_S)} \mathcal{R}_\Phi \right) \quad (2.1.4)$$
is achievable with linear coding over $R_1, R_2, \cdots, R_s$.
Remark 2.3. From Theorem 2.3.1, one will see that (2.1.1) and (2.1.4) are the
same when all the rings are fields. Actually, both are identical to the Slepian–Wolf
region. However, (2.1.4) can be strictly larger than (2.1.1) (see Section 2.3), when
not all the rings are fields. This implies that, in order to achieve the desired rate,
a suitable injection is required. Nevertheless, be reminded that taking the convex
hull in (2.1.4) is not always needed for optimality as shown in Example 2.1.1. A
more sophisticated elaboration on this issue is found in Section 2.3.
The rest of this section provides key supporting lemmata and concepts used
to prove Theorem 2.1.1, Theorem 2.1.2 and Theorem 2.1.3. The final proofs are
presented in Section 2.2.
Lemma 2.1.1. Let $\mathbf{x}, \mathbf{y} \in R^n$ be two distinct sequences, where $R$ is a finite ring, and assume that $\mathbf{y} - \mathbf{x} = (a_1, a_2, \cdots, a_n)^{\mathrm{tr}}$. If $f : R^n \to R^k$ is a linear mapping chosen uniformly at random, i.e. generate the $k \times n$ coefficient matrix $A$ of $f$ by independently choosing each entry of $A$ from $R$ uniformly at random, then
$$\Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = |I|^{-k},$$
where $I = \langle a_1, a_2, \cdots, a_n \rangle_l$.

Proof. Let $f = (f_1, f_2, \cdots, f_k)^{\mathrm{tr}}$, where $f_i : R^n \to R$ is a random linear function. Then
$$\Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = \Pr\left\{ \bigcap_{i=1}^{k} \{f_i(\mathbf{x}) = f_i(\mathbf{y})\} \right\} = \prod_{i=1}^{k} \Pr\{f_i(\mathbf{x} - \mathbf{y}) = 0\},$$
since the $f_i$'s are independent of each other. The statement follows from Lemma 1.1.1, which assures that $\Pr\{f_i(\mathbf{x} - \mathbf{y}) = 0\} = |I|^{-1}$.
Remark 2.4. In Lemma 2.1.1, if $R$ is a field and $\mathbf{x} \neq \mathbf{y}$, then $I = R$ because every non-zero $a_i$ is a unit. Thus, $\Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = |R|^{-k}$.
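The collision probability of Lemma 2.1.1 can also be estimated empirically; the sketch below (our own, over Z6 with a difference vector whose entries generate the ideal I = {0, 2, 4}, so |I|^(-k) = 1/9 for k = 2) samples random coefficient matrices:

import random

q, n, k, trials = 6, 4, 2, 200_000
diff = [2, 4, 0, 2]               # y - x; its entries generate {0, 2, 4} in Z_6

hits = 0
for _ in range(trials):
    A = [[random.randrange(q) for _ in range(n)] for _ in range(k)]
    # f(x) = f(y) iff A(y - x) = 0
    if all(sum(a * d for a, d in zip(row, diff)) % q == 0 for row in A):
        hits += 1

print(hits / trials, "vs", (1 / 3) ** k)   # both are close to 1/9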
Definition 2.1.1 (cf. [Yeu08]). Let $X \sim p_X$ be a discrete random variable with sample space $\mathcal{X}$. The set $\mathcal{T}_\epsilon(n, X)$ of strongly $\epsilon$-typical sequences of length $n$ with respect to $X$ is defined to be
$$\left\{ \mathbf{x} \in \mathcal{X}^n \,\middle|\, \left| \frac{N(x; \mathbf{x})}{n} - p_X(x) \right| \le \epsilon, \ \forall\, x \in \mathcal{X} \right\},$$
where $N(x; \mathbf{x})$ is the number of occurrences of $x$ in the sequence $\mathbf{x}$.

The notation $\mathcal{T}_\epsilon(n, X)$ is sometimes replaced by $\mathcal{T}_\epsilon$ when the length $n$ and the random variable $X$ referred to are clear from the context.
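In computational terms, membership in T_ε(n, X) is a plain frequency test; a minimal sketch (ours, with a hypothetical pmf) for illustration:

from collections import Counter

def is_strongly_typical(seq, pmf, eps):
    """Check |N(x; seq)/n - p(x)| <= eps for every symbol x."""
    n, counts = len(seq), Counter(seq)
    return all(abs(counts.get(x, 0) / n - px) <= eps
               for x, px in pmf.items())

pmf = {0: 0.5, 1: 0.25, 2: 0.25}
print(is_strongly_typical([0, 0, 1, 2], pmf, eps=0.1))   # True
print(is_strongly_typical([0, 0, 0, 0], pmf, eps=0.1))   # False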
Now we conclude this section with the following lemma. It is a crucial part of our proofs of the achievability theorems. It generalises the classic conditional typicality lemma [CT06, Theorem 15.2.2], yet at the same time distinguishes our argument from the one for the field version.
Lemma 2.1.2. Let $(X_1, X_2) \sim p$ be a jointly distributed random variable whose sample space is a finite ring $R = R_1 \times R_2$. For any $\eta > 0$, there exists $\epsilon > 0$, such that, $\forall\, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}} \in \mathcal{T}_\epsilon(n, (X_1, X_2))$ and $\forall\, I \le_l R_1$,
$$|D(\mathbf{x}_1, I | \mathbf{x}_2)| < 2^{n\left[H(X_1 | Y_{R_1/I}, X_2) + \eta\right]}, \quad (2.1.5)$$
where
$$D(\mathbf{x}_1, I | \mathbf{x}_2) = \left\{ (\mathbf{y}, \mathbf{x}_2)^{\mathrm{tr}} \in \mathcal{T}_\epsilon \,\middle|\, \mathbf{y} - \mathbf{x}_1 \in I^n \right\}$$
and $Y_{R_1/I} = X_1 + I$ is a random variable with sample space $R_1/I$.
First Proof. Let $R_1/I = \{a_1 + I, a_2 + I, \cdots, a_m + I\}$, where $m = |R_1|/|I|$. For arbitrary $\epsilon > 0$ and integer $n$, without loss of generality, we can assume that
$$\begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} = \begin{bmatrix} x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)} \\ x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{1,1}, \mathbf{x}_{1,2}, \cdots, \mathbf{x}_{1,m} \\ \mathbf{x}_{2,1}, \mathbf{x}_{2,2}, \cdots, \mathbf{x}_{2,m} \end{bmatrix}$$
admits a structure satisfying
$$\begin{bmatrix} \mathbf{x}_{1,j} \\ \mathbf{x}_{2,j} \end{bmatrix} = \begin{bmatrix} x_1^{\left(\sum_{k=0}^{j-1} c_k + 1\right)}, x_1^{\left(\sum_{k=0}^{j-1} c_k + 2\right)}, \cdots, x_1^{\left(\sum_{k=0}^{j} c_k\right)} \\ x_2^{\left(\sum_{k=0}^{j-1} c_k + 1\right)}, x_2^{\left(\sum_{k=0}^{j-1} c_k + 2\right)}, \cdots, x_2^{\left(\sum_{k=0}^{j} c_k\right)} \end{bmatrix} \in \begin{bmatrix} a_j + I \\ R_2 \end{bmatrix}^{c_j},$$
where $c_0 = 0$ and $c_j = \sum_{r \in (a_j + I) \times R_2} N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)$, $1 \le j \le m$. For any $\mathbf{y} = \left(y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right)$ with $(\mathbf{y}, \mathbf{x}_2)^{\mathrm{tr}} \in D(\mathbf{x}_1, I|\mathbf{x}_2)$, we have $y^{(i)} - x_1^{(i)} \in I$, $\forall\, 1 \le i \le n$, by definition. Thus, $y^{(i)}$ and $x_1^{(i)}$ belong to the same coset, i.e.
$$y^{\left(\sum_{k=0}^{j-1} c_k + 1\right)}, y^{\left(\sum_{k=0}^{j-1} c_k + 2\right)}, \cdots, y^{\left(\sum_{k=0}^{j} c_k\right)} \in a_j + I, \ \forall\, 1 \le j \le m.$$
Furthermore, $\forall\, r \in R$,
$$\left. \begin{array}{l} \left| N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)/n - p(r) \right| \le \epsilon \\ \left| N\left(r, (\mathbf{y}, \mathbf{x}_2)^{\mathrm{tr}}\right)/n - p(r) \right| \le \epsilon \end{array} \right\} \Longrightarrow \left| \frac{N\left(r, (\mathbf{y}, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} - \frac{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} \right| \le 2\epsilon.$$
Let
$$\mathbf{z}_j = \begin{bmatrix} \mathbf{x}_{1,j} \\ \mathbf{x}_{2,j} \end{bmatrix} \quad \text{and} \quad \mathbf{z}_j' = \begin{bmatrix} y^{\left(\sum_{k=0}^{j-1} c_k + 1\right)}, y^{\left(\sum_{k=0}^{j-1} c_k + 2\right)}, \cdots, y^{\left(\sum_{k=0}^{j} c_k\right)} \\ x_2^{\left(\sum_{k=0}^{j-1} c_k + 1\right)}, x_2^{\left(\sum_{k=0}^{j-1} c_k + 2\right)}, \cdots, x_2^{\left(\sum_{k=0}^{j} c_k\right)} \end{bmatrix} \in \begin{bmatrix} a_j + I \\ R_2 \end{bmatrix}^{c_j}.$$
We have that $\mathbf{z}_j'$ is a strongly $2\epsilon$-typical sequence of length $c_j$ with respect to the random variable $Z_j \sim p_j = \mathrm{emp}(\mathbf{z}_j)$ (the empirical distribution of $\mathbf{z}_j$). The sample space of $Z_j$ is $(a_j + I) \times R_2$. Therefore, the number of all possible $\mathbf{z}_j'$'s (namely, all elements $\begin{bmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \end{bmatrix} \in \mathcal{T}_{2\epsilon}(c_j, Z_j)$ such that $\mathbf{w}_2 = \mathbf{x}_{2,j}$) is upper bounded by $2^{c_j[H(p_j) - H(p_{j,2}) + 2\epsilon]}$, where $p_{j,2}$ is the marginal of $p_j$ with respect to the second coordinate, by [Yeu08, Theorem 6.10]. Consequently,
$$|D(\mathbf{x}_1, I|\mathbf{x}_2)| \le 2^{\sum_{j=1}^{m} c_j [H(p_j) - H(p_{j,2}) + 2\epsilon]}. \quad (2.1.6)$$
Direct computation yields
$$\frac{1}{n} \sum_{j=1}^{m} c_j H(p_j) = \frac{1}{n} \sum_{j=1}^{m} c_j \sum_{r \in (a_j + I) \times R_2} \frac{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{c_j} \log \frac{c_j}{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}$$
$$= \sum_{r \in R} \frac{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} \log \frac{n}{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)} - \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j}$$
and
$$\frac{1}{n} \sum_{j=1}^{m} c_j H(p_{j,2}) = \sum_{j=1}^{m} \frac{c_j}{n} \sum_{r_2 \in R_2} \frac{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{c_j} \log \frac{c_j}{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}$$
$$= \sum_{j=1}^{m} \sum_{r_2 \in R_2} \frac{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} \log \frac{n}{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)} - \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j}.$$
Since the entropy $H$ is a continuous function, there exists some small $0 < \epsilon < \eta/4$, such that
$$\left| \sum_{r \in R} \frac{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} \log \frac{n}{N\left(r, (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)} - H(X_1, X_2) \right| < \eta/8,$$
$$\left| \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j} - H(Y_{R_1/I}) \right| < \eta/8 \quad \text{and}$$
$$\left| \sum_{j=1}^{m} \sum_{r_2 \in R_2} \frac{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)}{n} \log \frac{n}{\sum_{r_1 \in a_j + I} N\left((r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{\mathrm{tr}}\right)} - H(X_2, Y_{R_1/I}) \right| < \eta/8.$$
Therefore,
$$\frac{1}{n} \sum_{j=1}^{m} c_j H(p_j) < H(X_1, X_2) - H(Y_{R_1/I}) + \eta/4 \quad (2.1.7)$$
$$\frac{1}{n} \sum_{j=1}^{m} c_j H(p_{j,2}) > H(X_2, Y_{R_1/I}) - H(Y_{R_1/I}) - \eta/4, \quad (2.1.8)$$
where (2.1.7) and (2.1.8) are guaranteed for small $0 < \epsilon < \eta/4$. Substituting (2.1.7) and (2.1.8) into (2.1.6), (2.1.5) follows.
Second Proof. Define the mapping Γ : R1 → R1 /I by
Γ : x1 7→ x1 + I, ∀ x1 ∈ R1 .
(1)
(2)
(n)
Assume that x1 = x1 , x1 , · · · , x1 , and let
(1)
(2)
(n)
y = Γ x1 , Γ x1 , · · · , Γ x1
.
By definition, ∀ (y, x2 )tr ∈ D (x1 , I|x2 ), where y = y (1) , y (2) , · · · , y (n) ,
Γ y (1) , Γ y (2) , · · · , Γ y (n)
= y.
Obviously, (y, y, x2 )tr is a function of (y, x2 )tr . Thus,
(y, y, x2 )tr ∈T (n, (X1 , YR1 /I , X2 ))
by [Yeu08, Theorem 6.8]. Therefore, for a fixed (y, x2 )tr ∈ T , the number of strongly
-typical sequences y such that (y, y, x2 )tr is strongly -typical is strictly upper
bounded by 2n[H(X1 |YR1 /I ,X2 )+η] if n is large enough and is small. Since
|D (x1 , I|x2 )| = (y, y, x2 )tr ∈ T y − x1 ∈ In ,
we conclude that |D (x1 , I|x2 )| < 2n[H(X1 |YR1 /I ,X2 )+η] .
Remark 2.5. The second proof was suggested by an anonymous reviewer for our
paper [HS13b]. The mechanisms behind the first proof and the second one are in
fact very different. However, this is not quite clear for i.i.d. scenarios. For noni.i.d. scenarios, the results proved by these two approaches diverse. Although the
technique from the first proof is more complicated, it provides results with its own
advantages. More details on the differences are given in Chapter 5.
26
Linear Coding
2.2
2.2.1
Proof of the Achievability Theorems
Proof of Theorem 2.1.1
As mentioned, Xi can be seen as a subset of Ri for a fixed Φ = (Φ1 , · · · , Φs ). In
this section, we assume that Xi has sample space Ri , which makes sense since Φi
is injective.
nRi
Let R = (R1 , R2 , · · · , Rs ) and ki =
, ∀ i ∈ S, where n is the length
log |Ri |
P
Ri log |Ii |
> r (T, IT ) , (this implies
of the data sequences. If R ∈ RΦ , then i∈T
log |Ri |
1P
that
ki log |Ii | − r (T, IT ) > 2η for some small constant η > 0 and large
n i∈T
enough n), ∀ ∅ =
6 T ⊆ S, ∀ 0 6= Ii ≤l Ri . We claim that R is achievable by linear
coding over R1 , R2 , · · · , Rs .
Encoding:
For every i ∈ S, randomly generate a ki × n matrix Ai based on a uniform
distribution, i.e. independently choose each entry of Ai uniformly at random from
Ri . Define a linear encoder φi : Rni → Rki i such that
φi : x 7→ Ai x, ∀ x ∈ Rni .
Obviously the coding rate of this encoder is
nRi
1
log |Ri |
1
ki
n
log |φi (Ri )| ≤ log |Ri | =
≤ Ri .
n
n
n
log |Ri |
Decoding:
Subject to observing yi ∈ Rki i (i ∈ S) from the ith encoder, the decoder claims
Qs
tr
that x = (x1 , x2 , · · · , xs ) ∈ i=1 Rni is the array of the encoded data sequences,
if and only if:
1. x ∈ T ; and
tr
2. ∀ x0 = (x10 , x20 , · · · , xs0 ) ∈ T , if x0 6= x, then φj (xj0 ) 6= yj , for some j.
Error:
Assume that Xi ∈ Rni (i ∈ S) is the original data sequence generated by the
ith source. It is readily seen that an error occurs if and only if one of the following
events occurs:
tr
E1 : X = (X1 , X2 , · · · , Xs ) ∈
/ T ;
tr
E2 : There exists X 6= (x10 , x20 , · · · , xs0 ) ∈ T , such that φi (xi0 ) = φi (Xi ), ∀ i ∈ S.
Error Probability:
By the joint asymptotic equipartition principle (AEP) [Yeu08, Theorem 6.9],
Pr {E1 } → 0, n → ∞.
2.2. Proof of the Achievability Theorems
27
Additionally, for ∅ =
6 T ⊆ S, let
tr
D (X; T ) = (x10 , x20 , · · · , xs0 ) ∈ T xi0 6= Xi , ∀ i ∈ T and xi0 = Xi , ∀ i ∈ T c .
We have
[
D (X; T ) =
[D (XT , I|XT c ) \ {X}] ,
(2.2.1)
06=I≤l RT
Q
Q
where XT = i∈T Xi and XT c = i∈T c Xi , since I goes over all possible nontrivial left ideals. Consequently,
Y
X
Pr {φi (xi0 ) = φi (Xi )|E1c }
Pr {E2 |E1c } =
tr
=
X
X
(x01 ,··· ,x0s ) ∈T \{X} i∈S
Y
Pr {φi (xi0 ) = φi (Xi )|E1c }
(2.2.2)
∅6=T ⊆S (x0 ,··· ,x0 )tr i∈T
1
s
∈D (X;T )
≤
X
X
X
∅6=T ⊆S 06=I≤l RT
<
X
∅6=T ⊆S 06=
i∈T
Pr {φi (xi0 ) = φi (Xi )|E1c }
(2.2.3)
tr
i∈T
(x01 ,··· ,x0s )
∈D (XT ,I|XT c )\{X}
X
Q
Y
2n[r(T,IT )+η] − 1
Y
|Ii |−ki
(2.2.4)
i∈T
Ii
≤l RT
< (2s − 1) 2|RS | − 2 ×
−n
max
06=
2
Q∅6=T ⊆S,
i∈T
1 P
n
i∈T
ki log |Ii |−[r(T,IT )+η]
, (2.2.5)
Ii ≤l RT
where
(2.2.2) is from the fact that T \ {X} =
`
∅6=T ⊆S
D (X; T ) (disjoint union);
(2.2.3) follows from (2.2.1) by Boole’s inequality [Boo10, Fré35];
(2.2.4) is from Lemma 2.1.1 and Lemma 2.1.2, as well as the fact that every left
ideal of RT is a Cartesian product of some left ideals Ii of Ri , i ∈ T (see
Proposition 1.1.3). At the same time, is required to be sufficiently small;
(2.2.5) is due to the facts that the number of non-empty subsets of S is 2s − 1 and
the number of non-trivial left ideals of the finite ring RT is less than 2|RS | −1,
which is the number of non-empty subsets of RS .
Thus, Pr {E2 |E1c } → 0, when n → ∞, from (2.2.5), since for sufficiently large n
1P
ki log |Ii | − [r (T, I) + η] > η > 0.
and small ,
n i∈T
Therefore, Pr {E1 ∪ E2 } = Pr {E1 } + Pr {E1c } Pr { E2 | E1c } → 0 as → 0 and
n → ∞.
28
Linear Coding
2.2.2
Proof of Theorem 2.1.2
The proof follows from almost the same argument as in proving Theorem 2.1.1, except that the performance analysis only focuses on sequences (ai,1 , ai,2 , · · · , ai,n ) ∈
Rni (1 ≤ i ≤ s) such that
ki
Y
(j)
(j)
(j)
ai,j = Φ1,i xi
, Φ2,i xi
, · · · , Φki ,i xi
∈
Rl,i
l=1
(j)
for some xi ∈ Xi . Let Xi , Yi be any two such sequences satisfying Xi − Yi ∈ Ini
for some Ii ≤l Ri . Based
Qki on the special structure of Xi and Yi , it is easy to verify
that Ii 6= 0 ⇔ Ii = l=1
Il,i and 0 6= Il,i ≤l Rl,i , for all 1 ≤ l ≤ ki (This causes the
difference between (2.1.1) and (2.1.2)). In addition, it is obvious that RΦ ⊆ RΦ,prod
by their definitions.
2.2.3
Proof of Theorem 2.1.3
The proof is similar to that for Theorem 2.1.1, except that it only focuses on
sequences (ai,1 , ai,2 , · · · , ai,n ) ∈ MnL,Ri ,mi (1 ≤ i ≤ s) such that ai,j ∈ ML,Ri ,mi
(
a, u ≥ v;
satisfies [ai,j ]u,v =
for some a ∈ Ri . Let Xi , Yi be any two such
0, otherwise,
sequences such that Xi − Yi ∈ Ini for some Ii ≤l ML,Ri ,mi . It is easily seen that
Ii 6= 0 if and only if Ii ∈
/ O(ML,Ri ,mi ) (This causes the difference between (2.1.1)
and (2.1.3)). In addition, it is obvious that RΦ ⊆ RΦ,m by their definitions.
2.3
Optimality
Obviously, Theorem 2.1.1 specializes to its field counterpart if all rings considered
are fields, as summarized in the following theorem.
Theorem 2.3.1. Region (2.1.1) is the Slepian–Wolf region if Ri contains no proper
non-trivial left ideal, equivalently1 , Ri is a field, for all i ∈ S. As a consequence,
region (2.1.4) is the Slepian–Wolf region.
Proof. In Theorem 2.1.1, random variable YRT /IT admits a sample space of cardinality 1 for all ∅ =
6 T ⊆ S, since the only non-trivial left ideal of Ri is itself for all
feasible i. Thus, 0 = H(YRT /IT ) ≥ H(YRT /IT |XT c ) ≥ 0. Consequently,
X
n
o
RΦ = (R1 , R2 , · · · , Rs ) ∈ Rs Ri > H(XT |XT c ), ∀ ∅ =
6 T ⊆S ,
i∈T
which is the Slepian–Wolf region R[X1 , X2 , · · · , Xs ]. Therefore, region (2.1.4) is
also the Slepian–Wolf region.
1 Equivalency
does not necessarily hold for rngs.
2.3. Optimality
29
If Ri is a field, then obviously it has no proper non-trivial left (right) ideal.
Conversely, ∀ 0 6= a ∈ Ri , hail = Ri implies that ∃ 0 6= b ∈ Ri , such that ba = 1.
Similarly, ∃ 0 6= c ∈ Ri , such that cb = 1. Moreover, c = c · 1 = cba = 1 · a = a.
Hence, ab = cb = 1. b is the inverse of a. By Wedderburn’s little theorem, Theorem
1.1.1, Ri is a field.
One important question to address is whether linear coding over finite nonfield rings can be equally optimal for data compression. Hereby, we claim that,
for any Slepian–Wolf scenario, there always exist linear encoders over some finite
non-field rings which achieve the data compression limit. Therefore, optimality of
linear coding over finite non-field rings for data compression is established in the
sense of existence.
2.3.1
Existence Theorem I: Single Source
For any single source scenario, the assertion that there always exists a finite ring
R1 , such that Rl is in fact the Slepian–Wolf region
R[X1 ] = {R1 ∈ R|R1 > H(X1 )},
is equivalent to the existence of a finite ring R1 and an injection Φ1 : X1 → R1 ,
such that
log |R1 | (2.3.1)
H(X1 ) − H(YR1 /I1 ) = H(X1 ),
max
06=I1 ≤l R1 log |I1 |
where YR1 /I1 = Φ1 (X1 ) + I1 .
Theorem 2.3.2. Let R1 be a finite ring of order |R1 | ≥
p|X1 |. If R1 contains one
and only one proper non-trivial left ideal I0 and |I0 | = |R1 |, then region (2.1.4)
coincides with the Slepian–Wolf region, i.e. there exists an injection Φ1 : X1 → R1 ,
such that (2.3.1) holds.
Remark 2.6. Examples of such a non-field ring R1 in the above theorem include
("
#
)
x 0 ML,p =
x, y ∈ Zp
y x (ML,p is a ring with respect to matrix addition and multiplication) and Zp2 , where
p is any prime. For any single source scenario, one can always choose R1 to be
either ML,p or Zp2 . Consequently, optimality is attained.
Proof of Theorem 2.3.2. Notice that the random variable YR1 /I0 depends on the
injection Φ1 , so does its entropy H(YR1 /I0 ). Obviously H(YR1 /R1 ) = 0, since the
sample space of the random variable YR1 /R1 contains only one element. Therefore,
log |R1 | H(X1 ) − H(YR1 /R1 ) = H(X1 ).
log |R1 |
30
Linear Coding
Consequently, (2.3.1) is equivalent to
log |R1 | H(X1 ) − H(YR1 /I0 ) ≤ H(X1 )
log |I0 |
⇔H(X1 ) ≤ 2H(YR1 /I0 ),
(2.3.2)
since |I0 | = |R1 |. By Lemma 2.A.1, there exists injection Φ̃1 : X1 → R1 such
that (2.3.2) holds if Φ1 = Φ̃1 . The statement follows.
p
Up to isomorphism, there are exactly 4 distinct rings of order p2 for a given
prime p. They include 3 non-field rings, Zp × Zp , ML,p and Zp2 , in addition to
the field Fp2 . It has been proved that, using linear encoders over the last three,
optimality can always be achieved in the single source scenario. Actually, the same
holds true for all multiple sources scenarios.
2.3.2
Existence Theorem II: Multiple Sources
Theorem 2.3.3. Let R1 , R2 , · · · , Rs be s finite rings with |Ri | ≥ |Xi |. If Ri is
isomorphic to either
1. a field, i.e. Ri contains no proper non-trivial left (right) ideal; or
2. a
pring containing one and only one proper non-trivial left ideal I0i and |I0i | =
|Ri |,
for all feasible i, then (2.1.4) coincides with the Slepian–Wolf region R[X1 , X2 , · · · ,
Xs ].
Remark 2.7. It is obvious that Theorem 2.3.3 includes Theorem 2.3.2 as a special
case. In fact, its proof resembles the one of Theorem 2.3.2. Examples of Ri ’s include
all finite fields, ML,p and Zp2 , where p is a prime. However, Theorem 2.3.3 does not
guarantee that all rates, except the vertexes, in the polytope of the Slepian–Wolf
region are “directly” achievable for the multiple sources case. A time sharing scheme
is required in our current proof. Nevertheless, all rates are “directly” achievable if
all Ri ’s are fields or if s = 1. This is partially the reason that the two theorems are
stated separately.
Remark 2.8. Theorem 2.3.3 also includes Theorem 2.3.1 as a special case. However, Theorem 2.3.1 admits a simpler proof compared to the one for Theorem 2.3.3.
Proof of Theorem 2.3.3. It suffices to prove that, for any R = (R1 , R2 , · · · , Rs ) ∈
Rs satisfies
Ri > H(Xi |Xi−1 , Xi−2 , · · · , X1 ), ∀ 1 ≤ i ≤ s,
R ∈ RΦ for some set of injections Φ = (Φ1 , Φ2 , · · · , Φs ), where Φi : Xi → Ri . Let
Φ̃ = (Φ̃1 , Φ̃2 , · · · , Φ̃s ) be the set of injections, where, if
2.3. Optimality
31
(i) Ri is a field, Φ̃i is any injection;
(ii) Ri satisfies 2, Φ̃i is the injection such that
H(Xi |Xi−1 , Xi−2 , · · · , X1 ) ≤2H(YRi /I0i |Xi−1 , Xi−2 , · · · , X1 ),
when Φi = Φ̃i . The existence of Φ̃i is guaranteed by Lemma 2.A.1.
If Φ = Φ̃, then
log |Ii |
H(Xi |Xi−1 , Xi−2 , · · · , X1 )
log |Ri |
≥H(Xi |Xi−1 , Xi−2 , · · · , X1 ) − H(YRi /Ii |Xi−1 , Xi−2 , · · · , X1 )
=H(Xi |YRi /Ii , Xi−1 , Xi−2 , · · · , X1 ),
for all 1 ≤ i ≤ s and 0 6= Ii ≤l Ri . As a consequence,
X Ri log |Ii |
i∈T
X log |Ii |
H(Xi |Xi−1 , Xi−2 , · · · , X1 )
log |Ri |
log |Ri |
i∈T
X
≥
H(Xi |YRi /Ii , Xi−1 , Xi−2 , · · · , X1 )
>
i∈T
X
≥
H(Xi |YRT /IT , XT c , Xi−1 , Xi−2 , · · · , X1 )
i∈T
≥H XT YRT /IT , XT c
=H (XT |XT c ) − H YRT /IT |XT c ,
for all ∅ =
6 T ⊆ {1, 2, · · · , s}. Thus, R ∈ RΦ̃ .
By Theorem 2.3.1, Theorem 2.3.2 and Theorem 2.3.3, we draw the conclusion
that
Corollary 2.3.1. For any Slepian–Wolf scenario, there always exists a sequence
of linear encoders over some finite rings (fields or non-field rings) which achieves
the data compression limit, the Slepian–Wolf region.
In fact, LCoR can be optimal even for rings beyond those stated in the above
theorems (see Example 2.1.1). We classify some of these scenarios in the remaining
parts of this section.
2.3.3
Product Rings
Theorem 2.3.4. Let Rl,1 , Rl,2 , · · · , Rl,s (l = 1, 2) be a set of finite rings of equal
size, and Ri = R1,i × R2,i for all feasible i. If the coding rate R ∈ Rs is achievable
with linear encoders over Rl,1 , Rl,2 , · · · , Rl,s (l = 1, 2), then R is achievable with
linear encoders over R1 , R2 , · · · , Rs .
32
Linear Coding
Proof. By definition, R is a convex combination of coding rates which are achieved
by different linear encoding schemes over Rl,1 , Rl,2 , · · · , Rl,s (l = 1, 2), respectively. To be more precise,
Rs and positive numbers
Pm there exist R1 , R2 , · · · , Rm
P∈
m
w1 , w2 , · · · , wm with j=1 wj = 1, such that R = j=1 wj Rj . Moreover, there
exist injections Φl = (Φl,1 , Φl,2 , · · · , Φl,s ) (l = 1, 2), where Φl,i : Xi → Rl,i , such
that
X
Ri log |Il,i |
s
>
Rj ∈ RΦl = (R1 , R2 , · · · , Rs ) ∈ R log |Rl,i |
i∈T
H(XT |XT c ) − H(YRl,T /Il,T |XT c ), ∀ ∅ =
6 T ⊆ S, ∀ 0 6= Il,i ≤l Rl,i ,
(2.3.3)
Q
Q
where Rl,T = i∈T Rl,i , Il,T = i∈T Il,i and YRl,T /Il,T = Φl (XT ) + Il,T is a random variable with sample space Rl,T /Il,T . To show that R is achievable with linear
encoders over R1 , R2 , · · · , Rs , it suffices to prove that Rj is achievable with linear
encoders over R1 , R2 , · · · , Rs for all feasible j. Let Rj = (Rj,1 , Rj,2 , · · · , Rj,s ). For
all ∅ =
6 T ⊆ S and 0 6= Ii = I1,i × I2,i ≤l Ri with 0 6= Il,i ≤l Rl,i (l = 1, 2), we
have
X Rj,i log |Ii |
i∈T
log |Ri |
=
X Rj,i log |I1,i |
log |R1,i |
i∈T
X Rj,i log |I2,i | c2
c1
+
,
c1 + c2
log |R2,i | c1 + c2
i∈T
where cl = log |Rl,1 |. By (2.3.3), it can be easily seen that
X Rj,i log |Ii |
i∈T
log |Ri |
Meanwhile, let RT =
Φ1,s × Φ2,s ) (Note:
>H(XT |X
Tc
2
X
1
)−
cl H(YRl,T /Il,T |XT c ).
c1 + c2
l=1
Q
i∈T
Ri , IT =
Q
i∈T
Ii , Φ = (Φ1,1 × Φ2,1 , Φ1,2 × Φ2,2 , · · · ,
Φ1,i × Φ2,i : xi 7→ (Φ1,i (xi ), Φ2,i (xi )) ∈ Ri
for all xi ∈ Xi .) and YRT /IT = Φ(XT ) + IT . It can be verified that YRl,T /Il,T
(l = 1, 2) is a function of YRT /IT , hence, H(YRT /IT |XT c ) ≥ H(YRl,T /Il,T |XT c ).
Consequently,
X Rj,i log |Ii |
i∈T
log |Ri |
> H(XT |XT c ) − H(YRT /IT |XT c ),
which implies that Rj ∈ RΦ,prod by Theorem 2.1.2. We therefore conclude that Rj
is achievable with linear encoders over R1 , R2 , · · · , Rs for all feasible j, so is R.
Obviously, R1 , R2 , · · · , Rs in Theorem 2.3.4 are of the same size. Inductively,
one can verify the following without any difficulty.
2.3. Optimality
33
Theorem 2.3.5. Let L be any finite index
Q set, Rl,1 , Rl,2 , · · · , Rl,s (l ∈ L ) be a
set of finite rings of equal size, and Ri = l∈L Rl,i for all feasible i. If the coding
rate R ∈ Rs is achievable with linear encoders over Rl,1 , Rl,2 , · · · , Rl,s (l ∈ L ),
then R is achievable with linear encoders over R1 , R2 , · · · , Rs .
Remark 2.9. There are delicate issues to the situation Theorem 2.3.5 (Theorem
2.3.4) illustrates. Let Xi (1 ≤ i ≤ s) be the set of all symbols generated by the
ith source. The hypothesis of Theorem 2.3.5 (Theorem 2.3.4) implicitly implies the
alphabet constraint |Xi | ≤ |Rl,i | for all feasible i and l.
Let R1 , R2 , · · · , Rs be s finite rings each of which is isomorphic to either
1. a ring
p R containing one and only one proper non-trivial left ideal whose order
is |R|, e.g. ML,p and Zp2 (p is a prime); or
2. a ring of a finite product of finite field(s) and/or ring(s) satisfying 1, e.g.
Qm
Qm0
Qm00
ML,p × j=1 Zpj (p and pj ’s are prime) and i=1 ML,pi × j=1 Fqj (m0 and
m00 are non-negative, pi ’s are prime and qj ’s are powers of primes).
Theorem 2.3.3 and Theorem 2.3.5 ensure that linear encoders over ring R1 , R2 , · · · ,
Rs are always optimal in any applicable (subject to the condition specified in
the corresponding theorem) Slepian–Wolf coding scenario. As a very special case,
Zp × Zp , where p is a prime, is always optimal in any (single source or multiple
sources) scenario with alphabet size less than or equal to p. However, using a field
or product rings is not necessary. As shown in Theorem 2.3.2, neither ML,p nor
Zp2 is (isomorphic to) a product of rings nor a field. It is also not required to have
a restriction on the alphabet size (see Theorem 2.3.3), even for product rings (see
Example 2.1.1 for a case of Z2 × Z3 ).
2.3.4
Trivial Case: Uniform Distributions
The following theorem is trivial, however we include it for completeness.
Theorem 2.3.6. Regardless which set of rings R1 , R2 , · · · , Rs is chosen, as long as
|Ri | = |Xi | for all feasible i, region (2.1.1) is the Slepian–Wolf region if (X1 , X2 , · · · ,
Xs ) ∼ p is a uniform distribution.
Proof. If p is uniform, then, for any ∅ =
6 T ⊆ S and 0 6= IT ≤l RT , YRT /IT is
uniformly distributed on RT /IT . Moreover, XT and XT c are independent, so are
YRT /IT and XT c . Therefore, H(XT |XT c ) = H(XT ) = log |RT | and H(YRT /IT |XT c )
|RT |
. Consequently,
= H(YRT /IT ) = log
|IT |
r(T, IT ) = H(XT |XT c ) − H(YRT /IT |XT c ) = log |IT |.
Region (2.1.1) is the Slepian–Wolf region.
34
Linear Coding
Remark 2.10. When p is uniform, it is obvious that the uncoded strategy (all
encoders are one-to-one mappings) is optimal in the Slepian–Wolf source coding
problem. However, optimality stated in Theorem 2.3.6 does not come from deliberately fixing the linear encoding mappings, but generating them randomly.
So far, we have only shown that there exist linear encoders over finite non-field
rings that are equally good as their field counterparts. In Chapter 3, Problem 2.1
is considered with an arbitrary g. It will be demonstrated that linear coding over
finite non-field rings can strictly outperform its field counterpart for encoding some
discrete functions, and there are infinitely many such functions.
2.A. Appendix
2.A
2.A.1
35
Appendix
A Supporting Lemma
Lemma 2.A.1. Let R be a finite ring, X and Y be two correlated discrete random
variables, and X be the sample space of X with |X
p | ≤ |R|. If R contains one and
only one proper non-trivial left ideal I and |I| = |R|, then there exists injection
Φ̃ : X → R such that
H(X|Y ) ≤ 2H(Φ̃ (X) + I|Y ).
(2.A.1)
Proof. Let
Φ̃ ∈ arg max H(Φ (X) + I|Y ),
Φ∈M
where M is the set of all possible Φ’s (maximum can always be reached because
|R|!
|M | =
is finite, but it is not uniquely attained by Φ̃ in general).
(|R| − |X |)!
Assume that Y is the sample space (not necessarily finite) of Y . Let q = |I|,
I = {r1 , r2 , · · · , rq } and R/I = {a1 + I, a2 + I, · · · , aq + I}. We have that
H(X|Y ) = −
H(Φ̃ (X) + I|Y ) = −
q
X X
y∈Y i,j=1
q
XX
pi,j,y log
pi,y log
y∈Y i=1
pi,j,y
and
py
pi,y
,
py
where
pi,j,y = Pr Φ̃(X) = ai + rj , Y = y ,
py =
pi,y =
q
X
pi,j,y ,
i,j=1
q
X
pi,j,y .
j=1
(Note: Pr Φ̃(X) = r = 0 if r ∈ R \ Φ̃(X ). In addition, every element in R can
be uniquely expressed as ai + rj .) Therefore, (2.A.1) is equivalent to
q
XX
pi,j,y
pi,y
≤ −2
pi,y log
p
py
y
y∈Y i=1
y∈Y i,j=1
q
X X
pi,y
pi,1,y pi,2,y
pi,q,y
⇔
py
H
,
,··· ,
py
pi,y pi,y
pi,y
i=1
y∈Y
X
p1,y p2,y
pq,y
≤
py H
,
,··· ,
,
py py
py
−
q
X X
pi,j,y log
y∈Y
(2.A.2)
36
Linear Coding
where H (v1 , v2 , · · · , vq ) = −
pp. 49]. Let
A=
X
Pq
j=1
vj log vj , by the grouping rule for entropy [CT06,
q
q
q
X
X
pi,1,y X pi,2,y
pi,q,y
,
,··· ,
p
p
py
y
y
i=1
i=1
i=1
py H
y∈Y
!
.
The concavity of the function H implies that
X
py
y∈Y
q
X
pi,y
H
py
i=1
pi,1,y pi,2,y
pi,q,y
,
,··· ,
pi,y pi,y
pi,y
≤ A.
(2.A.3)
At the same time,
X
y∈Y
py H
pq,y
p1,y p2,y
,
,··· ,
py py
py
= max H(Φ(X) + I|Y )
Φ∈M
by the definition of Φ̃. We now claim that
A ≤ max H(Φ(X) + I|Y ).
Φ∈M
Suppose otherwise, i.e. A >
P
y∈Y
py H
(2.A.4)
pq,y
p1,y p2,y
,
,··· ,
. Let Φ0 : X → R
py py
py
be defined as
Φ0 : x 7→ aj + ri ⇔ Φ̃(x) = ai + rj .
We have that
!
q
q
q
X
X
X
p
p
p
i,2,y
i,q,y
i,1,y
,
,··· ,
=A
H(Φ0 (X) + I|Y ) =
py H
py i=1 py
py
i=1
i=1
y∈Y
X
p1,y p2,y
pq,y
>
py H
,
,··· ,
= max H(Φ(X) + I|Y ).
Φ∈M
py py
py
X
y∈Y
It is absurd that H(Φ0 (X) + I|Y ) > maxΦ∈M H(Φ(X) + I|Y )! Therefore, (2.A.2)
is valid by (2.A.3) and (2.A.4), so is (2.A.1).
Chapter 3
Encoding Functions of Correlated
Sources
or an arbitrary discrete function g, Problem 2.1 remains open in general, and
R[X1 , X2 , · · · , Xs ] ⊆ R[g] obviously. Making use of Elias’ theorem on binary
linear codes [Eli55], Körner–Marton [KM79] shows that R[⊕2 ] (“⊕2 ” is the
modulo-two sum) contains the region
R⊕2 = (R1 , R2 ) ∈ R2 | R1 , R2 > H(X1 ⊕2 X2 ) .
F
This region is not contained in the Slepian–Wolf region for certain distributions. In
other words, R[⊕2 ] ) R[X1 , X2 ]. Combining the standard random coding technique
and Elias’ result, [AH83] shows that R[⊕2 ] can be strictly larger than the convex
hull of the union R[X1 , X2 ] ∪ R⊕2 .
However, the functions considered in these works are relatively simple. As general as it can be, their work only considers functions defined on some finite field.
This is because (part of) the encoding technique used is the linear coding technique
over finite fields from [Eli55, Csi82]. Unfortunately, we will see later that this is
a suboptimal solution. Instead, we will propose replacing the linear encoders over
finite fields with the more generalized version, linear encoders over finite rings. We
will see that in many examples, the later strictly outperform the first in various
aspects.
3.1
A Polynomial Approach
The first question arising here is how do we handle the arbitrary function g in
Problem 2.1. This brings us back to Lemma 1.2.3 and Lemma 1.2.4. As commented
in Section 1.2, any function defined on a finite domain is equivalent to a restriction
of some polynomial function and some nomographic function. Conceptually, this is
a nice observation. At least, there is well-defined polynomial structure associated
with the function considered. Moreover, if Problem 2.1 were concluded for all the
polynomial functions or all the nomographic functions, then it is concluded for all
functions defined on finite domains.
37
38
Encoding Functions of Correlated Sources
Thus, from now on we will consider Problem 2.1 with this polynomial approach,
namely, only polynomial functions are considered. We will prove the claim that
LCoR dominates LCoF in terms of achieving better coding rates based on this
approach.
3.2
Source Coding for Computing
We begin with establishing the following theorem which can be recognized as a
generalization of Körner–Marton [KM79].
Theorem 3.2.1. Let R be a finite ring, and
ĝ = h ◦ k, where k(x1 , x2 , · · · , xs ) =
s
X
ki (xi )
(3.2.1)
i=1
and h, ki ’s are functions mapping R to R. Then
n
o
log |R| H(X) − H(YR/I )
(3.2.2)
Rĝ = (R1 , R2 , · · · , Rs ) ∈ Rs Ri > max
06=I≤l R log |I|
⊆R[ĝ],
where X = k(X1 , X2 , · · · , Xs ) and YR/I = X + I.
Proof. By Theorem 2.1.1, ∀ > 0, there exists a large enough n, an m × n matrix A ∈ Rm×n and a decoder ψ, such that Pr {X n 6= ψ (AX n )} < , if m >
n(H(X) − H(YR/I ))
max06=I≤l R
. Let φi = A ◦ ~ki (1 ≤ i ≤ s) be the encoder
log |I|
of the ith source.
Upon receiving φi (Xin ) from the ith source, the decoder claims
Ps
n
that ~h X̂ , where X̂ n = ψ [
φi (X n )], is the function, namely ĝ, subject to
i
i=1
computation. The probability of decoding error is
n h
i
o
Pr ~h ~k (X1n , X2n , · · · , Xsn ) 6= ~h X̂ n
n
o
≤ Pr X n 6= X̂ n
(
" s
#)
X
n
n
= Pr X 6= ψ
φi (Xi )
i=1
(
"
= Pr X n 6= ψ
s
X
#)
A~ki (Xin )
i=1
(
"
n
= Pr X 6= ψ A
s
X
#)
~ki (X n )
i
i=1
n
h
io
= Pr X n 6= ψ A~k (X1n , X2n , · · · , Xsn )
= Pr {X n 6= ψ (AX n )} < .
3.2. Source Coding for Computing
39
Therefore, all (R1 , R2 , · · · , Rs ) ∈ Rs , where
Ri =
m log |R|
log |R| > max
H(X) − H(YR/I ) ,
06=I≤l R log |I|
n
is achievable, i.e. Rĝ ⊆ R[ĝ].
Corollary 3.2.1. In Theorem 3.2.1, let X = k(X1 , X2 , · · · , Xs ) ∼ pX . We have
Rĝ = { (R1 , R2 , · · · , Rs ) ∈ Rs | Ri > H(X)} ⊆ R[ĝ],
if either of the following conditions holds:
1. R is isomorphic to a finite field;
2. R is isomorphic to apring containing one and only one proper non-trivial left
ideal I0 with |I0 | = |R|, and
H(X) ≤ 2H(X + I0 ).
Proof. If either 1 or 2 holds, then it is guaranteed that
log |R| H(X) − H(YR/I ) = H(X)
06=I≤l R log |I|
max
in Theorem 3.2.1. The statement follows.
Remark 3.1. By Lemma 3.A.1, examples of non-field rings satisfying 2 in Corollary
3.2.1 include
(1) Z4 with pX (0) = p1 , pX (1) = p2 , pX (3) = p3 and pX (2) = p4 satisfying
(
0 ≤ max{p2 , p3 } <
6 min{p1 , p4 } ≤ 1
(3.2.3)
0 ≤ max{p1 , p4 } <
6 min{p2 , p3 } ≤ 1;
(2) ML,2 with
"
pX
pX
#!
"
#!
0
1 0
= p1 , pX
= p2 ,
0
0 1
"
#!
"
#!
1 0
0 0
= p3 and pX
= p4
1 1
1 0
0
0
satisfying (3.2.3).
Interested readers can figure out even more explicit examples deduced from Lemma
2.A.1. Besides, if R is isomorphic to Z2 and ĝ is the modulo-two sum, then Corollary
3.2.1 recovers the theorem of Körner–Marton [KM79].
40
Encoding Functions of Correlated Sources
However, Rĝ given by (3.2.2) is sometimes strictly smaller than R[g]. This was
first shown by Ahlswede–Han [AH83] for the case of g being the modulo-two sum.
Their approach combines the linear coding technique over binary field with the
standard random coding technique. In the following, we generalize the result of
Ahlswede–Han [AH83, Theorem 10] to the settings, where g is arbitrary, and, at
the same time, LCoF is replaced by its generalized version, LCoR.
Consider function ĝ admitting
s
X
ĝ(x1 , x2 , · · · , xs ) = h k0 (x1 , x2 , · · · , xs0 ),
kj (xj ) , 0 ≤ s0 < s,
(3.2.4)
j=s0 +1
where k0 : Rs0 → R and h, kj ’s are functions mapping R to R. By Lemma 1.2.4, a
discrete function with a finite domain is always equivalent to a restriction of some
function of format (3.2.4). We call ĝ from (3.2.4) a pseudo nomographic function
over ring R.
Theorem 3.2.2. Let S0 = {1, 2, · · · , s0 } ⊆ S = {1, 2, · · · , s}. If ĝ is of format
(3.2.4), and R = (R1 , R2 , · · · , Rs ) ∈ Rs satisfying
X
log |R| H(X|VS ) − H(YR/I |VS )
06=I≤l R log |I|
Rj > |T \ S0 | max
j∈T
+ I(YT ; VT |VT c ), ∀ ∅ =
6 T ⊆ S,
(3.2.5)
where ∀ j ∈ S0 , Vj = Yj = Xj ; ∀ j ∈ S \ S0 , Yj = kj (Xj ), Vj ’s are discrete random
variables such that
p(y1 , y2 , · · · , ys , v1 , v2 , · · · , vs ) = p(y1 , y2 , · · · , ys )
s
Y
p(vj |yj ),
(3.2.6)
j=s0 +1
and X =
Ps
j=s0 +1
Yj , YR/I = X + I, then R ∈ R[ĝ].
Proof. Choose δ > 6 > 0, such that Rj = Rj0 + Rj00 , ∀ j ∈ S,
I(YT ; VT |VT c ) + 2 |T | δ, ∀ ∅ =
6 T ⊆ S, and Rj00 > r + 2δ, where
r = max
06=I≤l R
P
j∈T
Rj0 >
log |R| H(X|VS ) − H(YR/I |VS ) ,
log |I|
∀ j ∈ S \ S0 .
Encoding: Fix the joint distribution p which satisfies (3.2.6). For all j ∈ S0 , let
Vj, = T (n, Xj ). For all j ∈ S \ S0 , generate randomly 2n[I(Yj ;Vj )+δ] strongly typical sequences according to distribution pVjn and let Vj, be the set of these
generated sequences. Define mapping φ0j : Rn → Vj, as follows:
(
x, if x ∈ T ;
n
0
1. If j ∈ S0 , then, ∀ x ∈ R , φj (x) =
where x0 ∈ Vj, is fixed.
x0 , otherwise,
3.2. Source Coding for Computing
41
2. If j ∈ S \ S0 , then for every x ∈ Rn , let Lx = {v ∈ Vj, |(~kj (x), v) ∈ T }. If
x ∈ T and Lx 6= ∅, then φ0j (x) is set to be some element in Lx ; otherwise
φ0j (x) is some fixed v0 ∈ Vj, .
0
Define mapping ηj : Vj, → [1, 2nRj ] by randomly choosing the value for each
v ∈ Vj, according to a uniform distribution.
nRj00
n[r + δ]
. When n is big enough, we have k >
.
Let k = minj∈S\S0
log |R|
log |R|
Randomly generate a k × n matrix M ∈ Rk×n , and let θj : Rn → Rk (j ∈ S \ S0 )
be the function θj : x 7→ M~kj (x), ∀ x ∈ Rn .
Define the encoder φj as the follows
(
ηj ◦ φ0j ,
j ∈ S0 ;
φj =
0
(ηj ◦ φj , θj ), otherwise.
Decoding: Upon observing (a1 , a2 , · · · , as0 , (as0 +1 , bs0 +1 ), · · · , (as , bs )) at the decoder, the decoder claims that
h i
~h ~k0 V̂ n , V̂ n , · · · , V̂ n , X̂ n
1
2
s0
is the function of the generated data, if and only if there exists one and only one
s
Y
V̂ = V̂1n , V̂2n , · · · , V̂sn ∈
Vj, ,
j=1
such that aj = ηj (V̂jn ), ∀ j ∈ S, and X̂ n is the only element in the set
s
n
o
X
LV̂ = x ∈ Rn (x, V̂) ∈ T , Mx =
bj .
j=t+1
Error: Assume that Xjn is the data generated by the jth source and let X n =
Ps
n
~
j=s0 +1 kj Xj . An error happens if and only if one of the following events happens.
E1 : (X1n , X2n , · · · , Xsn , Y1n , Y2n , · · · , Ysn , X n ) ∈
/ T ;
E2 : There exists some j0 ∈ S \ S0 , such that LXjn = ∅;
0
E3 :
(Y1n , Y2n , · · · , Ysn , X n , V)
φ0j (Xjn ), ∀ j ∈ S;
∈
/ T , where V = (V1n , V2n , · · · , Vsn ) and Vjn =
E4 : There exists V0 = (v10 , v20 , · · · , vs0 ) ∈ T ∩
ηj (vj0 ) = ηj Vjn , ∀ j ∈ S;
Qs
j=1
Vj, , V0 6= V, such that
42
Encoding Functions of Correlated Sources
E5 : X n ∈
/ LV or |LV | > 1, i.e. there exists X0n ∈ Rn , X0n 6= X n , such that
MX0n = MX n and (X0n , V) ∈ T .
o P
nS
5
5
E
≤ l=1 Pr { El | El,c }, where E1,c = ∅ and
Error Probability: Let γ = Pr
l
l=1
Tl−1 c
El,c = τ =1 Eτ for 1 < l ≤ 5. In the following, we show that γ → 0, n → ∞.
(a). By the jointnAEP [Yeu08,
Theorem 6.9], Pr{E1 } → 0, n → ∞.
o
(b). Let E2,j = LXjn = ∅ , ∀j ∈ S \ S0 . Then
X
Pr{E2 |E2,c } ≤
Pr {E2,j |E2,c } .
(3.2.7)
j∈S\S0
For any j ∈ S \ S0 , because the sequence v ∈ Vj, and Yjn = ~kj (Xjn ) are drawn
independently, we have
Pr{(Yjn , v) ∈ T } ≥(1 − )2−n[I(Yj ;Vj )+3]
=(1 − )2−n[I(Yj ;Vj )+δ/2]+n(δ/2−3)
>2−n[I(Yj ;Vj )+δ/2]
when n is big enough. Thus,
n
o
Pr {E2,j |E2,c } = Pr LXjn = ∅ | E2,c
n
o
Y
=
Pr ~kj (Xjn ), v ∈
/ T
v∈Vj,
n
o2n[I(Yj ;Vj )+δ]
< 1 − 2−n[I(Yj ;Vj )+δ/2]
(3.2.8)
→ 0, n → ∞.
where (3.2.8) holds true for all big enough n and the limit follow from the fact that
a
(1 − 1/a) → e−1 , a → ∞. Therefore, Pr{E2 |E2,c } → 0, n → ∞ by (3.2.7).
(c). By (3.2.6), it is obvious that VJ1 − YJ1 − YJ2 − VJ2 forms a Markov chain
for any two disjoint nonempty sets J1 , J2 ( S. Thus, if (Yjn , Vjn ) ∈ T for all j ∈ S
and (Y1n , Y2n , · · · , Ysn ) ∈ T , then (Y1n , Y2n , · · · , Ysn , V) ∈ T . In the meantime, X −
(Y1 , Y2 , · · · , Ys ) − (V1 , V2 , · · · , Vs ) is also a Markov chain. Hence, (Y1n , Y2n , · · · , Ysn ,
X n , V) ∈ T if (Y1n , Y2n , · · · , Ysn , X n ) ∈ T . Therefore, Pr{E3 |E3,c } = 0.
(d). For all ∅ =
6 J ⊆ S, let J = {j1 , j2 , · · · , j|j| } and
s
n
o
Y
ΓJ = V0 = (v10 , v20 , · · · , vs0 ) ∈
Vj, vj0 = Vjn if and only if j ∈ S \ J .
j=1
3.2. Source Coding for Computing
=
n
Q
By definition, |ΓJ | =
Pr{E4 |E4,c }
X X
43
j∈J |Vj, | − 1 = 2
P
j∈J
I(Yj ;Vj )+|J|δ
− 1 and
Pr ηj (vj0 ) = ηj (Vjn ), ∀ j ∈ J, V0 ∈ T |E4,c
∅6=J⊆S V0 ∈ΓJ
=
X
X
Pr ηj (vj0 ) = ηj (Vjn ), ∀ j ∈ J × Pr {V0 ∈ T |E4,c }
(3.2.9)
∅6=J⊆S V0 ∈ΓJ
<
X
−n
X
2
P
j∈J
Rj0
−n
×2
P|J|
i=1
I(Vji ;VJ c ,Vj1 ,··· ,Vji−1 )−|J|δ
(3.2.10)
∅6=J⊆S V0 ∈ΓJ
<
X
n
2
P
j∈J
I(Yj ;Vj )+|j|δ
−n
×2
P
j∈J
Rj0
−n
×2
P|j|
i=1
I(Vji ;VJ c ,Vj1 ,··· ,Vji−1 )−|j|δ
∅6=J⊆S
P
−n
≤C max 2
j∈J
Rj0 −I(YJ ;VJ |VJ c )−2|j|δ
(3.2.11)
∅6=J⊆N
→ 0, n → ∞,
where C = 2s − 1. Equality (3.2.9) holds because the processes of choosing ηj ’s and
generating V0 are done independently. (3.2.10) follows from Lemma 3.A.3 and the
definitions of ηj ’s. (3.2.11) is from Lemma 3.A.4.
(e). Let E5,1 = {LV = ∅} and E5,2 = {|LV | > 1}. We have Pr{E5,1 |E5,c } = 0,
because E5,c contains the event that (X n , V) ∈ LV and V is unique. Therefore,
Pr {E5 |E5,c } = Pr {E5,2 |E5,c }
X
=
Pr {MX0n = MX n }
(X0n ,V)∈T \(X n ,V)
<
X
06=I≤l R D
Choose a small η > 0 such that η <
X
Pr {MX0n = MX n }
(X n ,I|V)\(X n ,V)
δ
. Then
2 log |R|
Pr {E5 |E5,c } < 2|R| − 2 max 2n[H(X|VS )−H(YR/I |VS )+η] × 2−k log|I|
06=I≤l R
|R|
= 2 − 2 max 2−n[k log|I|/n−H(X|VS )+H(YR/I |VS )−η]
06=I≤l R
|R|
< 2 − 2 max 2−n[δ log|I|/ log|R|−η]
06=I≤l R
|R|
< 2 − 2 2−nδ/2 log|R|
(3.2.12)
(3.2.13)
→ 0, n → ∞,
where (3.2.12) is from Lemma 2.1.1 and Lemma 2.1.2 (for all large enough n and
small enough ) and (3.2.13) is because |I| ≥ 2 for all I 6= 0.
44
Encoding Functions of Correlated Sources
To summarize, by (a)–(e), we have γ → 0, n → ∞. The theorem is established.
Remark 3.2. The achievable region given by (3.2.5) always contains the Slepian–
Wolf region. Furthermore, it is in general larger than the Rĝ from (3.2.2). If ĝ is the
modulo-two sum, namely s0 = 0 and h, kj ’s are identity functions for all s0 < j ≤ s,
then (3.2.5) resumes the region of Ahlswede–Han [AH83, Theorem 10].
3.3
Non-field Rings versus Fields I
Given some finite ring R, let ĝ be of format (3.2.1), a nomographic presentation of
g. We say that the region Rĝ given by (3.2.2) is achievable for computing g in the
sense of Körner–Marton. From Theorem 3.2.2, we know that Rĝ might not be the
largest achievable region one can obtain for computing g. However, Rĝ still captures
the ability of linear coding over R when used for computing g. In other words, Rĝ
is the region purely achieved with linear coding over R for computing g. On the
other hand, regions from Theorem 3.2.2 are achieved by combining the linear coding
and the standard random coding techniques. Therefore, it is reasonable to compare
LCoR with LCoF in the sense of Körner–Marton.
We are now to show that linear coding over finite rings, non-field rings in particular, strictly outperforms its field counterpart, LCoF, in the following example.
Example 3.3.1. Let g : {α0 , α1 }3 → {β0 , β1 , β2 , β3 } (Figure 3.1) be a function
such that
g : (α0 , α0 , α0 ) 7→ β0 ; g : (α0 , α0 , α1 ) 7→ β3 ;
g : (α0 , α1 , α0 ) 7→ β2 ;
g : (α0 , α1 , α1 ) 7→ β1 ;
g : (α1 , α0 , α0 ) 7→ β1 ;
g : (α1 , α0 , α1 ) 7→ β0 ;
g : (α1 , α1 , α0 ) 7→ β3 ;
g : (α1 , α1 , α1 ) 7→ β2 .
(3.3.1)
Define µ : {α0 , α1 } → Z4 and ν : {β0 , β1 , β2 , β3 } → Z4 by
µ : αj 7→ j, ∀ j ∈ {0, 1}, and
ν : βj 7→ j,
∀ j ∈ {0, 1, 2, 3},
(3.3.2)
respectively. Obviously, g is equivalent to x + 2y + 3z ∈ Z4 [3] (Figure 3.2) via
µ1 = µ2 = µ3 = µ and ν. However, by Proposition 3.3.1, there exists no ĝ ∈ F4 [3] of
format (3.2.1) so that g is equivalent to any restriction of ĝ. Although, Lemma 1.2.4
ensures that there always exists a bigger field Fq such that g admits a presentation
ĝ ∈ Fq [3] of format (3.2.1), the size q must be strictly bigger than 4. For instance,
let
X ĥ(x) =
a 1 − (x − a)4 − 1 − (x − 4)4 ∈ Z5 [1].
a∈Z5
Then, g has presentation ĥ(x + 2y + 4z) ∈ Z5 [3] (Figure 3.3) via µ1 = µ2 = µ3 =
µ : {α0 , α1 } → Z5 and ν : {β0 , β1 , β2 , β3 } → Z5 defined (symbolic-wise) by (3.3.2).
3.3. Non-field Rings versus Fields I
45
β2
β1
β0
β3
β3
β2
β1
y z
x
β0
Figure 3.1: g : {α0 , α1 }3 → {β0 , β1 , β2 , β3 }
2
2
1
0
1
0
3
3
3
2
1
y z
x
0
Figure 3.2: x + 2y + 3z ∈ Z4 [3]
2
3 = ĥ(4)
y z
1
x
0
Figure 3.3: ĥ(x + 2y + 4z) ∈ Z5 [3]
Proposition 3.3.1. There exists no polynomial function ĝ ∈ F4 [3] of format
(3.2.1), such that a restriction of ĝ is equivalent to the function g defined by (3.3.1).
Proof. Suppose ν ◦ g = ĝ ◦ (µ1 , µ2 , µ3 ), where µ1 , µ1 , µ3 : {α0 , α1 } → F4 , ν :
{β0 , · · · , β3 } → F4 are injections, and ĝ = h ◦ (k1 + k2 + k3 ) with h, ki ∈ F4 [1]for
all feasible i. We claim that ĝ and h are both surjective, since g {α0 , α1 }3 =
|{β0 , β1 , β2 , β3 }| = 4 = |F4 | . In particular, h is bijective. Therefore, h−1 ◦ ν ◦ g =
k1 ◦ µ1 + k2 ◦ µ2 + k3 ◦ µ3 , i.e. g admits a presentation k1 (x) + k2 (y) + k3 (z) ∈ F4 [3].
A contradiction to Lemma 3.A.2.
As a consequence of Proposition 3.3.1, in the sense of Körner–Marton, in order
to use LCoF to encode function g, the alphabet sizes of the three encoders need to
46
Encoding Functions of Correlated Sources
be at least 5. However, LCoR offers a solution in which the alphabet sizes are 4,
strictly smaller than using LCoF. Most importantly, the region achieved with linear
coding over any finite field Fq , is always a subset of the one achieved with linear
coding over Z4 . This is proved in the following proposition.
Proposition 3.3.2. Let g be the function defined by (3.3.1), {α0 , α1 }3 be the sample space of (X1 , X2 , X3 ) ∼ p and pX be the distribution of X = g(X1 , X2 , X3 ).
If pX (β0 ) = p1 , pX (β1 ) = p2 , pX (β3 ) = p3 and pX (β2 ) = p4 satisfy (3.2.3), then,
in the sense of Körner–Marton, the region R1 achieved with linear coding over Z4
contains the one, that is R2 , obtained with linear coding over any finite field Fq for
computing g. Moreover, if supp(p) is the whole domain of g, then R1 ) R2 .
Proof. Let ĝ = h ◦ k ∈ Fq [3] be a polynomial presentation of g with format (3.2.1).
By Corollary 3.2.1 and Remark 3.1, we have
R1 = (R1 , R2 , R3 ) ∈ R3 Ri > H(X1 + 2X2 + 3X3 ) ,
R2 = (R1 , R2 , R3 ) ∈ R3 Ri > H(k(X1 , X2 , X3 )) .
Assume that ν ◦ g = h ◦ k ◦ (µ1 , µ2 , µ3 ), where µ1 , µ1 , µ3 : {α0 , α1 } → Fq and
ν : {β0 , · · · , β3 } → Fq are injections. Obviously, g(X1 , X2 , X3 ) is a function of
k(X1 , X2 , X3 ). Hence,
H(k(X1 , X2 , X3 )) ≥ H(g(X1 , X2 , X3 )).
(3.3.3)
On the other hand, H(X1 + 2X2 + 3X3 ) = H(g(X1 , X2 , X3 )). Therefore,
H(k(X1 , X2 , X3 )) ≥ H(X1 + 2X2 + 3X3 ),
and R1 ⊇ R2 . In addition, we claim that h|S , where S = k
Q
(3.3.4)
3
µ
{α
,
α
}
,
j
0
1
j=1
is not injective. Otherwise, h : S → S 0 , where S 0 = h(S ), is bijective, hence,
−1
(h|S 0 ) ◦ν◦g = k◦(µ1 , µ2 , µ3 ) = k1 ◦µ1 +k2 ◦µ2 +k3 ◦µ3 . A contradiction to Lemma
3.A.2. Consequently, |S | > |S 0 | = |ν ({β0 , · · · , β3 })| = 4. If supp(p) = {α0 , α1 }3 ,
then (3.3.3) as well as (3.3.4) hold strictly, thus, R1 ) R2 .
A more intuitive comparison (which is not as conclusive as Proposition 3.3.2)
can be identified from the presentations of g given in Figure 3.2 and Figure 3.3.
According to Corollary 3.2.1, linear encoders over field Z5 achieve
RZ5 = (R1 , R2 , R3 ) ∈ R3 Ri > H(X1 + 2X2 + 4X3 ) .
The one achieved by linear encoders over ring Z4 is
RZ4 = (R1 , R2 , R3 ) ∈ R3 Ri > H(X1 + 2X2 + 3X3 ) .
Clearly, H(X1 + 2X2 + 3X3 ) ≤ H(X1 + 2X2 + 4X3 ), thus, RZ4 contains RZ5 .
Furthermore, as long as
0 < Pr (α0 , α0 , α1 ) , Pr (α1 , α1 , α0 ) < 1,
3.3. Non-field Rings versus Fields I
(X1 , X2 , X3 )
(α0 , α0 , α0 )
(α1 , α0 , α1 )
(α1 , α0 , α0 )
(α0 , α1 , α1 )
47
p
1/90
1/90
42/90
42/90
(X1 , X2 , X3 )
(α0 , α1 , α0 )
(α1 , α1 , α1 )
(α0 , α0 , α1 )
(α1 , α1 , α0 )
p
1/90
1/90
1/90
1/90
Table 1
RZ4 is strictly larger than RZ5 , since H(X1 + 2X2 + 3X3 ) < H(X1 + 2X2 + 4X3 ).
To be specific, assume that (X1 , X2 , X3 ) ∼ p satisfies Table 1, we have
R[X1 , X2 , X3 ] (RZ5 = (R1 , R2 , R3 ) ∈ R3 Ri > 0.4812
(RZ = (R1 , R2 , R3 ) ∈ R3 Ri > 0.4590 .
4
Based on Proposition 3.3.1 and Proposition 3.3.2, we conclude that LCoR dominates LCoF, in terms of achieving better coding rates with smaller alphabet sizes
of the encoders for computing g. As a direct conclusion, we have:
Theorem 3.3.1. In the sense of Körner–Marton, LCoF is not optimal.
Remark 3.3. The key property underlying the proof of Proposition 3.3.2 is that
the characteristic of a finite field must be a prime while the characteristic of a finite
ring can be any positive integer larger than or equal to 2. This implies that it is
possible to construct infinitely many discrete functions for which using LCoF always
leads to a suboptimal achievable
Ps region compared to linear coding over finite nonfield rings. Examples include i=1 xi ∈ Z2p [s] for s ≥ 2 and prime p > 2 (note: the
characteristic of Z2p is 2p which is not a prime). One can always find an explicit
distribution of sources for which linear coding over Z2p strictly dominates linear
coding over each and every finite field.
48
Encoding Functions of Correlated Sources
3.A
Appendix
3.A.1
Suppporting Lemmata
Lemma 3.A.1. If
(
and
P4
−
j=1
4
X
0 ≤ max{p2 , p3 } <
6 min{p1 , p4 } ≤ 1
0 ≤ max{p1 , p4 } <
6 min{p2 , p3 } ≤ 1
pj = 1, then
pj log pj ≤ −2 (p2 + p3 ) log (p2 + p3 ) + (p1 + p4 ) log (p1 + p4 ) .
(3.A.1)
j=1
Proof [DA12]. Without loss of generality, we assume that 0 ≤ max{p4 , p3 } ≤
min{p2 , p1 } ≤ 1 which implies that p1 + p2 − 1/2 ≥ |p1 + p4 − 1/2|. Let H2 (c) =
−c log c − (1 − c) log(1 − c), 0 ≤ c ≤ 1, be the binary entropy function. By the
grouping rule for entropy [CT06, pp. 49], (3.A.1) equals to
p1
p1 + p4
p4
p1 + p4
(p1 + p4 )
log
+
log
p1 + p4
p1
p1 + p4
p4
p2
p2 + p3
p3
p2 + p3
+(p2 + p3 )
log
+
log
p2 + p3
p2
p2 + p3
p3
≤ − (p2 + p3 ) log (p2 + p3 ) − (p1 + p4 ) log (p1 + p4 )
⇔
A :=(p1 + p4 )H2
p1
p1 + p4
+ (p2 + p3 )H2
p2
p2 + p3
≤H2 (p1 + p4 ).
Since H2 is a concave function and
P4
j=1
pj = 1, then
A ≤ H2 (p1 + p2 ) .
Moreover, p1 + p2 − 1/2 ≥ |p1 + p4 − 1/2| guarantees that
H2 (p1 + p2 ) ≤ H2 (p1 + p4 ) ,
because H2 (c) = H2 (1 − c), ∀ 0 ≤ c ≤ 1, and H2 (c0 ) ≤ H2 (c00 ) if 0 ≤ c0 ≤ c00 ≤ 1/2.
Therefore, A ≤ H2 (p1 + p4 ) and (3.A.1) holds.
Lemma 3.A.2. No matter which finite field Fq is chosen, g given by (3.3.1) admits
no presentation k1 (x) + k2 (y) + k3 (z), where ki ∈ Fq [1] for all feasible i.
3.A. Appendix
49
Proof. Suppose otherwise, i.e. k1 ◦ µ1 + k2 ◦ µ2 + k3 ◦ µ3 = ν ◦ g for some injections
µ1 , µ1 , µ3 : {α0 , α1 } → Fq and ν : {β0 , · · · , β3 } → Fq . By (3.3.1), we have
ν(β1 ) =(k1 ◦ µ1 )(α1 ) + (k2 ◦ µ2 )(α0 ) + (k3 ◦ µ3 )(α0 )
=(k1 ◦ µ1 )(α0 ) + (k2 ◦ µ2 )(α1 ) + (k3 ◦ µ3 )(α1 )
ν(β3 ) =(k1 ◦ µ1 )(α1 ) + (k2 ◦ µ2 )(α1 ) + (k3 ◦ µ3 )(α0 )
=(k1 ◦ µ1 )(α0 ) + (k2 ◦ µ2 )(α0 ) + (k3 ◦ µ3 )(α1 )
=⇒ ν(β1 ) − ν(β3 ) = τ = −τ
=⇒ τ + τ = 0,
(3.A.2)
where τ = k2 (µ2 (α0 )) − k2 (µ2 (α1 )). Since µ2 is injective, (3.A.2) implies that either τ = 0 or Char(Fq ) = 2 by Proposition 1.1.2. Noticeable that k2 (µ2 (α0 )) 6=
k2 (µ2 (α1 )), i.e. τ 6= 0, otherwise, ν(β1 ) = ν(β3 ) which contradicts the assumption
that ν is injective. Thus, Char(Fq ) = 2. Let ρ = (k3 ◦ µ3 )(α0 ) − (k3 ◦ µ3 )(α1 ).
Obviously, ρ 6= 0 because of the same reason that τ 6= 0, and ρ + ρ = 0 since
Char(Fq ) = 2. Therefore,
ν(β0 ) =(k1 ◦ µ1 )(α0 ) + (k2 ◦ µ2 )(α0 ) + (k3 ◦ µ3 )(α0 )
=(k1 ◦ µ1 )(α0 ) + (k2 ◦ µ2 )(α0 ) + (k3 ◦ µ3 )(α1 ) + ρ
=ν(β3 ) + ρ
=(k1 ◦ µ1 )(α1 ) + (k2 ◦ µ2 )(α1 ) + (k3 ◦ µ3 )(α0 ) + ρ
=(k1 ◦ µ1 )(α1 ) + (k2 ◦ µ2 )(α1 ) + (k3 ◦ µ3 )(α1 ) + ρ + ρ
=ν(β2 ) + 0 = ν(β2 ).
This contradicts the assumption that ν is injective.
Remark 3.4. As a special case, this lemma implies that no matter which finite
field Fq is chosen, g defined by (3.3.1) has no polynomial presentation that is linear
over Fq . In contrast, g admits presentation x + 2y + 3z ∈ Z4 [3] which is a linear
function over Z4 .
Lemma 3.A.3. Let (X1 , X2 , · · · , Xl , Y ) ∼ q. For any > 0 and positive integer
n, choose a sequence X̃jn (1 ≤ j ≤ l) randomly from T (n, Xj ) based on a uniform
distribution. If y ∈ Y n is an -typical sequence with respect to Y , then
Pl
−n
I(Xj ;Y,X1 ,X2 ,··· ,Xj−1 )−3l
j=1
Pr (X̃1n , X̃2n , · · · , X̃ln , Y n ) ∈ T |Y n = y ≤ 2
.
Proof. Let Fj be the event {(X̃1n , X̃2n , · · · , X̃jn , Y n ) ∈ T }, 1 ≤ j ≤ l, and F0 = ∅.
50
Encoding Functions of Correlated Sources
We have
l
Y
Pr (X̃1n , X̃2n , · · · , X̃ln , Y n ) ∈ T |Y n = y =
Pr {Fj |Y n = y, Fj−1 }
j=1
≤
l
Y
2−n[I(Xj ;Y,X1 ,X2 ,··· ,Xj−1 )−3]
j=1
=2
−n
Pl
j=1
I(Xj ;Y,X1 ,X2 ,··· ,Xj−1 )−3l
since X̃1n , X̃2n , · · · , X̃ln , y are generated independent.
Lemma 3.A.4. If (Y1 , V1 , Y2 , V2 , · · · , Ys , Vs ) ∼ q, and
q(y1 , v1 , y2 , v2 , · · · , ys , vs ) = q(y1 , y2 , · · · , ys )
s
Y
q(vi |yi ),
i=1
then, ∀ J = {j1 , j2 , · · · , j|j| } ⊆ {1, 2, · · · , s},
I(YJ ; VJ |VJ c ) =
|j|
X
i=1
I(Yji ; Vji ) − I(Vji ; VJ c , Vj1 , · · · , Vji−1 ).
,
Chapter 4
Stochastic Complements and
Supremus Typicality
s seen in the last two chapters, Lemma 2.1.2 is a very important foundation
for most of the conclusions drawn. Tracing along the arguments, we shall see
that Lemma 2.1.2 is the base for proving the achievability theorems regarding LCoR, Theorem 2.1.1 and its variations. In turn, LCoR is applied to Problem
2.1, Source Coding for Computing, and it is demonstrated that LCoR outperforms
LCoF in various aspects. Therefore, if we want to re-establish some results, say
Theorem 2.1.1 and Theorem 3.2.1, drawn previously for non-i.i.d. sources, then
naturally a correspondence of Lemma 2.1.2 ought to be reproved first.
Usually, generalising achievability results from the i.i.d. case to other stationary
(or a.m.s.) ergodic scenarios is easy. That can be done by extending the typicality
argument (from Shannon [SW49]) to the generalised scenario. Unfortunately, this
process is not as straightforward as it usually is for our particular problem. To be
more precise, it is possible to obtain expressions in characterizing the achievable
coding rates. However, these expressions are often hard to analyse or evaluate.
To overcome such a drawback, we will introduce a new typicality concept, called
Supremus typicality. Built on this, some results of LCoR from the previous two
chapters are re-established, and they become easier to analyse and evaluate. In
addition, we will see that the classical definition of typicality does not characterize
the stochastic properties of the (non-i.i.d.) sources well enough. This is essentially
the reason causing the insufficiency of the classical typical sequence mentioned
before.
In order to clearly present the idea of Supremus typicality in a simpler setting,
we will only focus on the Markov source scenarios in this and the next chapters.
The discussions of the more universal settings (e.g. a.m.s. sources) are deployed
after introducing some mathematical tools on ergodic theory in chapter 6.
A
51
52
4.1
4.1.1
Stochastic Complements and Supremus Typicality
Markov Chains and Stochastic Complements
Index Oriented Matrix Operations
Let X , Y and Z be three countable sets with or without orders defined, e.g. X =
{(0, 0), (0, 1), (1, 1), (1, 0)} and Y = {α, β}×N+ . In many places hereafter, we write
[pi,j ]i∈X ,j∈Y ([pi ]i∈X ) for a “matrix” (“vector”) whose “(i, j)th” (“ith”) entry is
pi,j (pi ) ∈ R. Matrices p0i,j i∈X ,j∈Y and [qj,k ]j∈Y ,k∈Z are similarly defined. Let
P = [pi,j ]i∈X ,j∈Y . For subsets A ⊆ X and B ⊆ Y , PA,B is designated for the
“submatrix” [pi,j ]i∈A,j∈B . We will use “index oriented” operations, namely
"
#
X
[pi ]i∈X [pi,j ]i∈X ,j∈Y =
pi pi,j
;
i∈X
j∈Y
[pi,j ]i∈X ,j∈Y + p0i,j i∈X ,j∈Y = pi,j + p0i,j i∈X ,j∈Y ;


X
[pi,j ]i∈X ,j∈Y [qj,k ]j∈Y ,k∈Z = 
pi,j qj,k 
j∈Y
.
i∈X ,k∈Z
In addition, a matrix PA,A = [pi,j ]i,j∈A is said to be an identity matrix if and only
if pi,j = δi,j (Kronecker delta), ∀ i, j ∈ A. We often indicate an identity matrix
with 1 whose size is known from the context, while designating 0 as a zero matrix
(all of whose entries are 0) of size known from the context. For any matrix PA,A ,
its inverse (if exists) is some matrix QA,A suchPthat QA,A PA,A = PA,A QA,A = 1.
Let [pi ]i∈X be non-negative
and unitary, i.e. i∈X pi = 1, and [pi,j ]i∈X ,j∈Y be
P
non-negative and j∈Y pi,j = 1 (such a matrix is termed a stochastic matrix). For
discrete random variables X and Y with sample spaces X and Y , respectively,
X ∼ [pi ]i∈X and (X, Y ) ∼ [pi ]i∈X [pi,j ]i∈X ,j∈Y state for
Pr {X = i} = pi and Pr {X = i, Y = j} = pi pi,j ,
for all i ∈ X and j ∈ Y , respectively.
4.1.2
Markov Chains and Strong Markov Typical Sequences
Definition 4.1.1.
A
(discrete) Markov chain is defined to be a discrete stochastic
process M = X (n) with state space X such that, ∀ n ∈ N+ ,
n
o
n
o
Pr X (n+1) X (n) , X (n−1) , · · · , X (1) = Pr X (n+1) X (n) .
M is said to be finite-state if X is finite.
Definition 4.1.2. A Markov chain M = X (n) is said to be homogeneous (time
homogeneous) if and only if
n
o
n
o
Pr X (n+1) X (n) = Pr X (2) X (1) , ∀ n ∈ N+ .
4.1. Markov Chains and Stochastic Complements
53
If not specified, we assume that all Markov chains considered throughout this
and the next chapters are finite-state and homogeneous. However, they are not
necessarily stationary [CT06, pp. 71], or their initial distributions are unknown.
Definition 4.1.3. Given a Markov chain M = X (n) with state space X , the
transitionmatrix ofM is defined
to be the stochastic matrix P = [pi,j ]i,j∈X , where
pi,j = Pr X (2) = j X (1) = i . Moreover, M is said to be irreducible if and only if
P is irreducible, namely, there exists no ∅ =
6 A ( X such that PA,Ac = 0.
Definition 4.1.4. A state j of a Markov chain M = X (n) is said to be recurrent
if Pr T < ∞| X (0) = j = 1, where T = inf{n > 0|X (n) = j}. If in addition the
conditional expectation E{T |X (0) = j} < ∞, then j is said to be positive recurrent.
M is said to be positive recurrent if all states are positive recurrent.
Definition 4.1.5. A Markov chain (not necessarily finite-state) is said to be ergodic
if and only if it is irreducible, positive recurrent and aperoidic.
Theorem 4.1.1 (Theorem 1.7.7 of [Nor98]). An irreducible Markov chain M with
state space X is positive recurrent, if and only if it admits a non-negative unitary
vector π = [pj ]j∈X , such that πP = π, where P is the transition matrix of M .
Moreover, π is unique and is called the invariant (stationary) distribution.
Theorem 4.1.2 (Theorem 2.31 of [BB05]). A finite-state irreducible Markov chain
is positive recurrent.
Clearly, every irreducible Markov chain considered in this and the next chapters
admits a unique invariant distribution1 (which is not necessarily the initial distribution), since it is assumed to be simultaneously finite-state and homogeneous (unless
otherwise specified).
Definition
4.1.6 (Strong Markov Typicality (cf. [DLS81, Csi98])). Let M =
(n) X
be an irreducible Markov chain with state space X , and P = [pi,j ]i,j∈X
and π = [pj ]j∈X be its transition matrix and invariant distribution, respectively.
For any > 0, a sequence x ∈ X n of length n (≥ 2) is said to be strong Markov
-typical with respect to P if
N (i, j; x)
N (i; x)
− pi < , ∀ i, j ∈ X ,
N (i; x) − pi,j < and n
where N (i, j; x) is the number of occurrences of sub-sequence [i, j] in x and
X
N (i; x) =
N (i, j; x)
j∈X
The set of all strong Markov -typical sequences with respect to P in X n is denoted
by T (n, P) or T for simplicity.
1 This
can also be proved with the Perron–Frobenius Theorem [Per07, Fro12].
54
Stochastic Complements and Supremus Typicality
Let P and π be some stochastic matrix and non-negative unitary vector. We
define H(π) and H(P|π) to be H(X) and H(Y |X), respectively, for jointly discrete
random variables (X, Y ) such that X ∼ π and (X, Y ) ∼ πP.
Proposition 4.1.1 (AEP of Strong Markov Typicality2 ). Let M = X (n) be an
irreducible Markov chain with state space X , and P = [pi,j ]i,j∈X and π = [pj ]j∈X
be its transition matrix and invariant distribution, respectively. For any η > 0,
there
0 > 0 and N0 ∈ N+ , such that, ∀ 0 > > 0, ∀ n > N0 and ∀ x =
(1) exist
(2)
x , x , · · · , x(n) ∈ T (n, P),
1. exp2 [−n (H(P|π) + η)] < Pr X (1) , X (2) , · · · , X (n) = x
< exp2 [−n (H(P|π) − η)];
2. Pr {X ∈
/ T (n, P)} < η, where X = X (1) , X (2) , · · · , X (n) ; and
3. |T (n, P)| < exp2 [n (H(P|π) + η)].
4.1.3
Stochastic Complements
Given a Markov chain M = X (n) with state space X
A of X , let
 (n)

∈A ;
inf n > 0|X
TA,l = inf n > TA,l−1 |X (n) ∈ A ;


sup n < TA,l+1 |X (n) ∈ A ;
and a non-empty subset
l = 1,
l > 1,
l < 1.
It is well-known that MA = X (TA,l ) is Markov by the strong Markov property
[Nor98, Theorem 1.4.2]. In particular, if M is irreducible, so is MA . To be more
precise, if M is irreducible, and write its invariant distribution and transition matrix
as π = [pi ]i∈X and
"
#
PA,A PA,Ac
P=
,
PAc ,A PAc ,Ac
respectively, then
SA = PA,A + PA,Ac (1 − PAc ,Ac )
−1
PAc ,A ,
is the transition matrix of MA [Mey89, Theorem 2.1 and Section 3].
#
"
pi
πA = P
j∈A pj
i∈A
2 Similar
statements in the literature (cf. [DLS81, Csi98]) assume that the Markov chain is
stationary ergodic. The result is easy to generalise to irreducible Markov chain. To be rigorous,
we include a proof of the irreducible case in Section 4.A.1.
4.2. Supremus Typical Sequences
55
is an invariant distribution of SA , i.e. πA SA = πA [Mey89, Theorem 2.2]. Since MA
inherits irreducibility from M [Mey89, Theorem 2.3], πA is unique. The matrix SA
is termed the stochastic complement of PA,A in P, while MA is named a reduced
Markov chain (or reduced process) of M . It has state space A obviously.
4.2
Supremus Typical Sequences
We will define Supremus typical sequence in this section. This new concept is
stronger in the sense of characterizing the stochastic behaviours of random processes/sources. Although this concept is only defined for and applied to Markov
processes/sources in this chapter, the idea can be generalised to other random
processes/sources, e.g. a.m.s. processes/sources [GK80]. Nevertheless, some background on ergodic theory is required. Thus, we leave the investigation on the more
universal settings to chapter 6.
Definition 4.2.1 (Supremus Typicality). Following the notation
defined in Section
4.1.3, given > 0 and a sequence x = x(1) , x(2) , · · · , x(n) ∈ X n of length n
(≥ 2 |X |), let xA be the subsequence of x formed by all those x(l) ’s that belong
to A in the original ordering. x is said to be Supremus -typical with respect to
P, if and only if xA is strong Markov -typical with respect to SA for any feasible
non-empty subset A of X .
In Definition 4.2.1, the set of all Supremus -typical sequences with respect to P
in X n is denoted as S (n, P) or S for simplicity. xA is called a reduced subsequence
(with respect to A) of x. It follows immediately form the definition that
Proposition 4.2.1. Every reduced subsequence of a Supremus -typical sequence
is Supremus -typical.
However, the above proposition does not hold for strong Markov -typical sequences. Namely, a reduced subsequence of a strong Markov -typical sequence is
not necessarily strong Markov -typical.
Example 4.2.1. Let {α, β, γ} be the state space of an i.i.d. process with a uniform
distribution, i.e.


1/3 1/3 1/3


P = 1/3 1/3 1/3 ,
1/3 1/3 1/3
and x = (α, β, γ, α, β, γ, α, β, γ). It is easy to verify that x is a strong Markov
5/12-typical sequence. However, the reduced subsequence
x{α,γ} = (α, γ, α, γ, α, γ)
56
Stochastic Complements and Supremus Typicality
"
is no longer a strong Markov 5/12-typical sequence, because S{α,γ}
0.5
=
0.5
0.5
0.5
#
and
the number of subsequence (α, α)’s in x{α,γ}
5
− 0.5 = |0 − 0.5| >
.
6
12
Proposition 4.2.2 (AEP of Supremus Typicality). Let M = X (n) be an irreducible Markov chain with state space X , and P = [pi,j ]i,j∈X and π = [pj ]j∈X
be its transition matrix and invariant distribution, respectively. For any η > 0,
+
there exist
(1) 0(2)> 0 and
N0 ∈ N , such that, ∀ 0 > > 0, ∀ n > N0 and
(n)
∀ x = x ,x ,··· ,x
∈ S (n, P),
1. exp2 [−n (H(P|π) + η)] < Pr X (1) , X (2) , · · · , X (n) = x
< exp2 [−n (H(P|π) − η)];
2. Pr {X ∈
/ S (n, P)} < η, where X = X (1) , X (2) , · · · , X (n) ; and
3. |S (n, P)| < exp2 [n (H(P|π) + η)].
Proof. Note that T (n, P) ⊇ S (n, P). Thus, 1 and 3 are inherited from the AEP of
strong Markov typicality. In addition, 2 can be proved without any difficulty since
any reduced Markov chain of M is irreducible and the number of reduced Markov
chains of M is, 2|X | − 1, finite.
Remark 4.1. It is known that Shannon’s (weak/strong) typical sequences [SW49]
are defined to be those sequences “representing” the stochastic behaviour of the
whole random process. To be more precise, a non (weak/strong) typical sequence is
unlikely to be produced by the random procedure (Proposition 4.1.1). However, the
study of induced transformations3 in ergodic theory suggests that (weak/strong)
typical sequences that are not Supremus typical form also a low probability set (see
Theorem 6.3.4). When the random procedure propagates, it is highly likely that all
reduced subsequences of the generated sequence also admit empirical distributions
“close enough” to the genuine distributions of corresponding reduced processes as
proved by Proposition 4.2.2. Therefore, Supremus typical sequences “represent” the
random process better. This difference has been seen from Proposition 4.2.1 and
Example 4.2.1, and will be seen again in comparing the two typicality lemmata,
Lemma 4.2.1 and Lemma 4.2.2, given later.
The following two typicality lemmata are the ring versions, tailored for our discussion, of the two given in Section 4.A.2, respectively. From these two lemmata, we will start to see the impact that the differences between classical typicality and Supremus typicality bring to the analytic results.

³See Chapter 6 for the correspondence between an induced transformation and a reduced process of a random process (a dynamical system).
Lemma 4.2.1. Let $R$ be a finite ring and $M = \left\{X^{(n)}\right\}$ be an irreducible Markov chain whose state space, transition matrix and invariant distribution are $R$, $\mathbf{P}$ and $\pi = [p_j]_{j\in R}$, respectively. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$, $\forall\, \mathbf{x} \in S_\epsilon(n, \mathbf{P})$ and $\forall\, I \leq_l R$,
$$ |S_\epsilon(\mathbf{x}, I)| < \exp_2\left\{ n\left[ \sum_{A \in R/I} \sum_{j \in A} p_j H(S_A|\pi_A) + \eta \right] \right\} \tag{4.2.1} $$
$$ \phantom{|S_\epsilon(\mathbf{x}, I)|} = \exp_2\left\{ n\left[ H(S_{R/I}|\pi) + \eta \right] \right\}, \tag{4.2.2} $$
where
$$ S_\epsilon(\mathbf{x}, I) = \left\{\, \mathbf{y} \in S_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in I^n \,\right\}, $$
$S_A$ is the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$, $\pi_A = \left[ \frac{p_i}{\sum_{j\in A} p_j} \right]_{i \in A}$ is the invariant distribution of $S_A$ and $S_{R/I} = \operatorname{diag}\left( \{S_A\}_{A \in R/I} \right)$.
Remark 4.2. By definition, for any $\mathbf{y} \in S_\epsilon(\mathbf{x}, I)$ in Lemma 4.2.1, $\mathbf{y}$ and $\mathbf{x}$ follow the same sequential pattern, i.e. the $i$th coordinates of both sequences are from the same coset of $I$. If $I = R$, then $S_\epsilon(\mathbf{x}, I)$ is the whole set of Supremus typical sequences. It is well known that evaluating the cardinality of the set of all (weak/strong) typical sequences is of great importance to the achievability part of the source coding theorem [SW73]. We will see in the next section that determining the number of (weak/strong/Supremus) typical sequences of a certain sequential pattern is also very important to the achievability result for linear coding over finite rings.
Proof of Lemma 4.2.1. Assume that $\mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right]$ and let $\mathbf{x}_A$ be the subsequence of $\mathbf{x}$ formed by all those $x^{(l)}$'s that belong to $A \in R/I$ in the original ordering. For any $\mathbf{y} = \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in S_\epsilon(\mathbf{x}, I)$, obviously $y^{(l)} \in A$ if and only if $x^{(l)} \in A$ for all $A \in R/I$ and $1 \le l \le n$. Let $\mathbf{x}_A = \left[x^{(n_1)}, x^{(n_2)}, \cdots, x^{(n_{m_A})}\right]$ (note: $\sum_{A\in R/I} m_A = n$ and $\left| \frac{m_A}{n} - \sum_{j\in A} p_j \right| < |A|\epsilon + \frac{1}{n}$). By Proposition 4.2.1, $\mathbf{y}_A = \left[y^{(n_1)}, y^{(n_2)}, \cdots, y^{(n_{m_A})}\right] \in A^{m_A}$ is a Supremus $\epsilon$-typical sequence of length $m_A$ with respect to $S_A$, since $\mathbf{y}$ is Supremus $\epsilon$-typical. Additionally, by Proposition 4.2.2, there exist $\epsilon_A > 0$ and a positive integer $M_A$ such that the number of Supremus $\epsilon$-typical sequences of length $m_A$ is upper bounded by $\exp_2\{m_A[H(S_A|\pi_A) + \eta/2]\}$ if $0 < \epsilon < \epsilon_A$ and $m_A > M_A$. Therefore, if $0 < \epsilon < \min_{A\in R/I} \epsilon_A$ and
$$ n > M = \max_{A \in R/I} \frac{1 + M_A}{\sum_{j\in A} p_j - |A|\epsilon} $$
(this guarantees that $m_A > M_A$ for all $A \in R/I$), then
$$ |S_\epsilon(\mathbf{x}, I)| \le \exp_2\left\{ \sum_{A\in R/I} m_A [H(S_A|\pi_A) + \eta/2] \right\} = \exp_2\left\{ n \left[ \sum_{A\in R/I} \frac{m_A}{n} H(S_A|\pi_A) + \eta/2 \right] \right\}. $$
Furthermore, choose $0 < \epsilon_0 \le \min_{A\in R/I} \epsilon_A$ and $N_0 \ge M$ such that $\frac{m_A}{n} < \sum_{j\in A} p_j + \frac{\eta}{2 \sum_{A\in R/I} H(S_A|\pi_A)}$ for all $0 < \epsilon < \epsilon_0$, $n > N_0$ and $A \in R/I$. We then have
$$ |S_\epsilon(\mathbf{x}, I)| < \exp_2\left\{ n\left[ \sum_{A\in R/I} \sum_{j\in A} p_j H(S_A|\pi_A) + \eta \right] \right\}, $$
and (4.2.1) is established. Direct calculation yields (4.2.2).
At this point, one might argue to replace $S_\epsilon(\mathbf{x}, I)$ in Lemma 4.2.1 with $T_\epsilon(\mathbf{x}, I) = \{\, \mathbf{y} \in T_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in I^n \,\}$, the set of strong Markov $\epsilon$-typical sequences having the same sequential pattern as those from $S_\epsilon(\mathbf{x}, I)$, in order to keep the argument inside the classical typicality framework. Actually, this change makes Lemma 4.2.1 a Markovian generalisation of Lemma 2.1.2. Unfortunately, a reduced subsequence of a sequence from $T_\epsilon(\mathbf{x}, I)$ is not necessarily strong Markov $\epsilon$-typical anymore (Proposition 4.2.1 fails). Thus, the same proof built on classical typicality does not follow. Another alternative is to consider weak typicality (defined below). However, even though a corresponding bound (see Lemma 4.2.2 below) can be obtained, this bound is often very hard to evaluate, as seen later.
Definition 4.2.2 (Modified Weak Typicality). Given an irreducible Markov chain $\left\{X^{(n)}\right\}$ with transition matrix $\mathbf{P}$ and a finite state space $\mathcal{X}$. For any $\epsilon > 0$, a sequence $\left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in \mathcal{X}^n$ of length $n$ is said to be weak $\epsilon$-typical with respect to $\mathbf{P}$, if and only if
$$ \left| -\frac{1}{n} \log \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} - H_{\Gamma,X} \right| < \epsilon $$
for all feasible functions $\Gamma$, where $H_{\Gamma,X}$ is the entropy rate of $\left\{\Gamma\left(X^{(n)}\right)\right\}$.

In Definition 4.2.2, the set of all weak $\epsilon$-typical sequences with respect to $\mathbf{P}$ in $\mathcal{X}^n$ is denoted as $T^H_\epsilon(n, \mathbf{P})$ or $T^H_\epsilon$ for simplicity.
Proposition 4.2.3 (AEP of Modified Weak Typicality⁴). In Definition 4.2.2, for any $\eta > 0$, there exists $N_0 > 0$ such that $\Pr\left\{\left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \notin T^H_\epsilon(n, \mathbf{P})\right\} < \eta$ for all $n > N_0$.

⁴The proof of this proposition is presented in Section 6.A.1, when the required background is given.
Lemma 4.2.2. In Lemma 4.2.1, let $T^H_\epsilon(\mathbf{x}, I) = \left\{\, \mathbf{y} \in T^H_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in I^n \,\right\}$. It follows that
$$ \left| T^H_\epsilon(\mathbf{x}, I) \right| < \exp_2\left\{ n\left[ H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) + 2\epsilon \right] \right\}, \tag{4.2.3} $$
where $Y_{R/I}^{(m)} = X^{(m)} + I$ is a random variable with sample space $R/I$.
Proof. Assume that $\mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right]$ and let $\bar{\mathbf{y}} = \left[x^{(1)} + I, x^{(2)} + I, \cdots, x^{(n)} + I\right]$. For any $\left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in T^H_\epsilon(\mathbf{x}, I)$, obviously $y^{(l)} \in A$ if and only if $x^{(l)} \in A$ for all $A \in R/I$ and $1 \le l \le n$. Thus, $\bar{\mathbf{y}} = \left[y^{(1)} + I, y^{(2)} + I, \cdots, y^{(n)} + I\right]$. As a consequence,
$$\begin{aligned} \Pr\left\{ X^{(l)} + I = x^{(l)} + I, \forall\, 1 \le l \le n \right\} &\ge \sum_{\left[y^{(1)},\cdots,y^{(n)}\right] \in T^H_\epsilon(\mathbf{x},I)} \Pr\left\{ X^{(l)} = y^{(l)}, \forall\, 1 \le l \le n \right\} \\ &> \sum_{\left[y^{(1)},\cdots,y^{(n)}\right] \in T^H_\epsilon(\mathbf{x},I)} \exp_2\{-n[H(\mathbf{P}|\pi) + \epsilon]\} \quad \left( \text{since } \left[y^{(1)},\cdots,y^{(n)}\right] \in T^H_\epsilon \right) \\ &= \left| T^H_\epsilon(\mathbf{x}, I) \right| \exp_2\{-n[H(\mathbf{P}|\pi) + \epsilon]\} \end{aligned} \tag{4.2.4} $$
(note: $\lim_{m\to\infty} \frac{1}{m} H\left(X^{(m)}, X^{(m-1)}, \cdots, X^{(1)}\right) = H(\mathbf{P}|\pi)$ since $M$ is irreducible Markov). On the other hand,
$$ \Pr\left\{ X^{(l)} + I = x^{(l)} + I, \forall\, 1 \le l \le n \right\} < \exp_2\left\{ -n\left[ \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) - \epsilon \right] \right\} \tag{4.2.5} $$
by Definition 4.2.2. Therefore,
$$ \left| T^H_\epsilon(\mathbf{x}, I) \right| < \exp_2\left\{ n\left[ H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) + 2\epsilon \right] \right\} $$
by (4.2.4) and (4.2.5).
Remark 4.3. If $R$ in Lemma 4.2.1 is a field, then both (4.2.2) and (4.2.3) are equivalent to
$$ \exp_2[n(H(\mathbf{P}|\pi) + \eta)]. \tag{4.2.6} $$
Or, if $M$ in Lemma 4.2.1 is i.i.d., then both (4.2.2) and (4.2.3) are equivalent to
$$ \exp_2\left\{ n\left[ H\left(X^{(1)}\right) - H\left(Y_{R/I}^{(1)}\right) + \eta \right] \right\}, \tag{4.2.7} $$
which is a special case of the generalised conditional typicality lemma, Lemma 2.1.2.
Remark 4.4. In Lemma 4.2.2, if $\mathbf{P} = c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$ with all rows of $\mathbf{U}$ being identical and $0 \le c_1 \le 1$, then $M' = \left\{Y_{R/I}^{(n)}\right\}$ is Markov by Lemma 4.A.3. As a conclusion,
$$ \left| T^H_\epsilon(\mathbf{x}, I) \right| < \exp_2\left\{ n\left[ H(\mathbf{P}|\pi) - \lim_{m\to\infty} H\left( Y_{R/I}^{(m)} \,\middle|\, Y_{R/I}^{(m-1)} \right) + 2\epsilon \right] \right\} = \exp_2\{n[H(\mathbf{P}|\pi) - H(\mathbf{P}'|\pi') + 2\epsilon]\}, $$
where $\mathbf{P}'$ and $\pi'$ are the transition matrix and the invariant distribution of $M'$, both of which can be easily calculated from $\mathbf{P}$.
From Remark 4.3 and Remark 4.4, we have seen that the two bounds (4.2.2) and (4.2.3) coincide, and both can be easily calculated, in some special scenarios. Unfortunately, in general settings (when the initial distribution of $M$ is not known, or $\mathbf{P} \neq c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$ for any $\mathbf{U}$ of identical rows and any $c_1$), (4.2.3) becomes almost inaccessible, because there is no efficient way to evaluate the entropy rate of $\left\{Y_{R/I}^{(n)}\right\}$. On the other hand, (4.2.2) is always as straightforward as calculating the conditional entropy.
Example 4.2.2. Let $M$ be an irreducible Markov chain with state space $\mathbb{Z}_4 = \{0, 1, 2, 3\}$. Its transition matrix $\mathbf{P} = [p_{i,j}]_{i,j\in\mathbb{Z}_4}$ is given as follows.
$$ \mathbf{P} = \begin{array}{c|cccc} & 0 & 1 & 2 & 3 \\ \hline 0 & .2597 & .2093 & .2713 & .2597 \\ 1 & .1208 & .0872 & .6711 & .1208 \\ 2 & .0184 & .2627 & .4101 & .3088 \\ 3 & .0985 & .1823 & .2315 & .4877 \end{array} \tag{4.2.8} $$
Let $I = \{0, 2\}$. Notice that the initial distribution is unknown, and neither does $\mathbf{P} = c_1\mathbf{U} + (1-c_1)\mathbf{1}$ hold for any $\mathbf{U}$ of identical rows and any $c_1$. Thus, the upper bound on $\left|T^H_\epsilon(\mathbf{x}, I)\right|$ from (4.2.3) is not very meaningful for calculation, since the entropy rate is not explicitly known. In contrast, we have that
$$ |S_\epsilon(\mathbf{x}, I)| < 2^{n(0.8791 + \eta)} $$
by (4.2.2).
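The bound (4.2.2) is indeed straightforward to compute. Below is a small Python sketch (ours, not from the thesis) that reproduces the exponent $H(S_{R/I}|\pi)$ for the matrix (4.2.8) with $I = \{0, 2\}$; it assumes the standard stochastic-complement formula $S_A = \mathbf{P}_{A,A} + \mathbf{P}_{A,A^c}(\mathbf{I} - \mathbf{P}_{A^c,A^c})^{-1}\mathbf{P}_{A^c,A}$, and the helper names are our own:

```python
import numpy as np

def stochastic_complement(P, A):
    """S_A = P_AA + P_AB (I - P_BB)^{-1} P_BA, with B the complement of A."""
    B = [i for i in range(P.shape[0]) if i not in A]
    PAA, PAB = P[np.ix_(A, A)], P[np.ix_(A, B)]
    PBA, PBB = P[np.ix_(B, A)], P[np.ix_(B, B)]
    return PAA + PAB @ np.linalg.inv(np.eye(len(B)) - PBB) @ PBA

def invariant(P):
    """Left eigenvector of P for eigenvalue 1, normalised to a distribution."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

def cond_entropy(P, pi):
    """H(P|pi) = -sum_i pi_i sum_j p_ij log2 p_ij (in bits)."""
    logs = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return float(-(pi[:, None] * P * logs).sum())

P = np.array([[.2597, .2093, .2713, .2597],
              [.1208, .0872, .6711, .1208],
              [.0184, .2627, .4101, .3088],
              [.0985, .1823, .2315, .4877]])
pi = invariant(P)

bound = 0.0
for A in ([0, 2], [1, 3]):            # the cosets of I = {0, 2} in Z4
    S_A = stochastic_complement(P, A)
    pi_A = pi[A] / pi[A].sum()
    bound += pi[A].sum() * cond_entropy(S_A, pi_A)
print(bound)   # H(S_{R/I}|pi); the thesis reports approximately 0.8791 bits
```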
The above is partially the reason why we forsake the traditional (weak/strong) typical sequence argument of Shannon [SW49] and introduce an argument based on Supremus typicality.
4.A Appendix

4.A.1 Proof of the AEP of Strong Markov Typicality
1. Let $\Pr\left\{X^{(1)} = x^{(1)}\right\} = c$. By definition,
$$\begin{aligned} \Pr\left\{\left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] = \mathbf{x}\right\} &= \Pr\left\{X^{(1)} = x^{(1)}\right\}\prod_{i,j\in\mathcal{X}} p_{i,j}^{N(i,j;\mathbf{x})} \\ &= c\exp_2\left[\sum_{i,j\in\mathcal{X}} N(i,j;\mathbf{x})\log p_{i,j}\right] \\ &= c\exp_2\left[-n\sum_{i,j\in\mathcal{X}} -\frac{N(i;\mathbf{x})}{n}\,\frac{N(i,j;\mathbf{x})}{N(i;\mathbf{x})}\log p_{i,j}\right] \\ &= c\exp_2\left\{-n\sum_{i,j\in\mathcal{X}}\left[-\left(\frac{N(i;\mathbf{x})}{n}\,\frac{N(i,j;\mathbf{x})}{N(i;\mathbf{x})} - p_i p_{i,j}\right)\log p_{i,j} - p_i p_{i,j}\log p_{i,j}\right]\right\}. \end{aligned}$$
In addition, there exist a small enough $\epsilon_0 > 0$ and an $N_0 \in \mathbb{N}^+$ such that
$$ \left|\frac{N(i;\mathbf{x})}{n}\,\frac{N(i,j;\mathbf{x})}{N(i;\mathbf{x})} - p_i p_{i,j}\right| < \frac{-\eta}{2|\mathcal{X}|^2\min_{p_{i,j}\neq 0}\log p_{i,j}} \quad\text{and}\quad \left|-\frac{\log c}{n}\right| < \eta/2 $$
for all $\epsilon_0 > \epsilon > 0$ and $n > N_0$. Consequently,
$$\begin{aligned} \Pr\left\{\left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] = \mathbf{x}\right\} &> c\exp_2\left\{-n\sum_{i,j\in\mathcal{X}}\left[\frac{\eta\log p_{i,j}}{2|\mathcal{X}|^2\min_{p_{i,j}\neq 0}\log p_{i,j}} - p_i p_{i,j}\log p_{i,j}\right]\right\} \\ &\ge c\exp_2\left[-n\left(\frac{\eta}{2} - \sum_{i,j\in\mathcal{X}} p_i p_{i,j}\log p_{i,j}\right)\right] \\ &= \exp_2\left[-n\left(-\frac{\log c}{n} + \frac{\eta}{2} + H(\mathbf{P}|\pi)\right)\right] \\ &> \exp_2\left[-n\left(\eta + H(\mathbf{P}|\pi)\right)\right]. \end{aligned}$$
Similarly,
$$\begin{aligned} \Pr\left\{\left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] = \mathbf{x}\right\} &< c\exp_2\left\{-n\sum_{i,j\in\mathcal{X}}\left[\frac{-\eta\log p_{i,j}}{2|\mathcal{X}|^2\min_{p_{i,j}\neq 0}\log p_{i,j}} - p_i p_{i,j}\log p_{i,j}\right]\right\} \\ &\le c\exp_2\left[-n\left(-\frac{\eta}{2} - \sum_{i,j\in\mathcal{X}} p_i p_{i,j}\log p_{i,j}\right)\right] \\ &\le \exp_2\left[-n\left(-\frac{\eta}{2} + H(\mathbf{P}|\pi)\right)\right] \\ &< \exp_2\left[-n\left(-\eta + H(\mathbf{P}|\pi)\right)\right]. \end{aligned}$$
2. By Boole's inequality [Boo10, Fré35],
$$\begin{aligned} \Pr\{X \notin T_\epsilon(n, \mathbf{P})\} &= \Pr\left\{ \bigcup_{i,j\in\mathcal{X}} \left\{\left| \frac{N(i,j;X)}{N(i;X)} - p_{i,j} \right| \ge \epsilon\right\} \cup \bigcup_{i\in\mathcal{X}} \left\{\left| \frac{N(i;X)}{n} - p_i \right| \ge \epsilon\right\} \right\} \\ &\le \sum_{i,j\in\mathcal{X}} \Pr\left\{ \left| \frac{N(i,j;X)}{N(i;X)} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} + \sum_{i\in\mathcal{X}} \Pr\left\{ \left| \frac{N(i;X)}{n} - p_i \right| \ge \epsilon \right\}, \end{aligned}$$
where $E = \bigcap_{i\in\mathcal{X}} \left\{ \left| \frac{N(i;X)}{n} - p_i \right| < \epsilon \text{ for all feasible } i \right\}$.

By the Ergodic Theorem of Markov chains [Nor98, Theorem 1.10.2],
$$ \Pr\left\{ \left| \frac{N(i;X)}{n} - p_i \right| \ge \epsilon \right\} \to 0 \text{ as } n \to \infty $$
for any $\epsilon > 0$. Thus, there is an integer $N_0'$, such that for all $n > N_0'$,
$$ \Pr\left\{ \left| \frac{N(i;X)}{n} - p_i \right| \ge \epsilon \right\} < \frac{\eta}{2|\mathcal{X}|}. $$
On the other hand, for $\min_{i\in\mathcal{X}} p_i/2 > \epsilon > 0$ (note: $p_i > 0$, $\forall\, i \in \mathcal{X}$, because $\mathbf{P}$ is irreducible), $N(i;\mathbf{x}) \to \infty$ as $n \to \infty$, conditional on $E$. Therefore, by the Strong Law of Large Numbers [Nor98, Theorem 1.10.1],
$$ \Pr\left\{ \left| \frac{N(i,j;X)}{N(i;X)} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} \to 0, \text{ as } n \to \infty. $$
Hence, there exists $N_0''$ such that, for all $n > N_0''$,
$$ \Pr\left\{ \left| \frac{N(i,j;X)}{N(i;X)} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} < \frac{\eta}{2|\mathcal{X}|^2}. $$
Let $N_0 = \max\{N_0', N_0''\}$ and $\epsilon_0 = \min_{i\in\mathcal{X}} p_i/2 > 0$. We have $\Pr\{X \notin T_\epsilon(n, \mathbf{P})\} < \eta$ for all $\epsilon_0 > \epsilon > 0$ and $n > N_0$.
3. Finally, let $\epsilon_0$ and $N_0$ be defined as in 1. $|T_\epsilon(n, \mathbf{P})| < \exp_2[n(H(\mathbf{P}|\pi) + \eta)]$ follows since
$$ 1 \ge \sum_{\mathbf{x}\in T_\epsilon(n,\mathbf{P})} \Pr\{X = \mathbf{x}\} > |T_\epsilon(n, \mathbf{P})| \exp_2[-n(H(\mathbf{P}|\pi) + \eta)], $$
if $\epsilon_0 > \epsilon > 0$ and $n > N_0$.

Let $\epsilon_0$ be the smallest one chosen above and $N_0$ be the biggest one chosen. The statement is proved.
4.A.2 Typicality Lemmata of Supremus Typical Sequences

Given a set $\mathcal{X}$, a partition $\coprod_{k\in K} A_k$ of $\mathcal{X}$ is a disjoint union of subsets of $\mathcal{X}$, i.e. $A_{k'} \cap A_{k''} \neq \emptyset \Leftrightarrow k' = k''$, $\bigcup_{k\in K} A_k = \mathcal{X}$ and the $A_k$'s are non-empty. Obviously, $\coprod_{A\in R/I} A$ is a partition of a ring $R$ given the left (right) ideal $I$.
Lemma 4.A.1. Given an irreducible Markov chain $M = \left\{X^{(n)}\right\}$ with finite state space $\mathcal{X}$, transition matrix $\mathbf{P}$ and invariant distribution $\pi = [p_j]_{j\in\mathcal{X}}$, let $\coprod_{k=1}^m A_k$ be any partition of $\mathcal{X}$. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$ and $\forall\, \mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in S_\epsilon(n, \mathbf{P})$,
$$ |S_\epsilon(\mathbf{x})| < \exp_2\left\{ n\left[ \sum_{k=1}^m \sum_{j\in A_k} p_j H(S_k|\pi_k) + \eta \right] \right\} \tag{4.A.1} $$
$$ \phantom{|S_\epsilon(\mathbf{x})|} = \exp_2\{n[H(S|\pi) + \eta]\}, \tag{4.A.2} $$
where
$$ S_\epsilon(\mathbf{x}) = \left\{ \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in S_\epsilon(n, \mathbf{P}) \,\middle|\, y^{(l)} \in A_k \Leftrightarrow x^{(l)} \in A_k, \forall\, 1 \le l \le n, \forall\, 1 \le k \le m \right\}, $$
$S_k$ is the stochastic complement of $\mathbf{P}_{A_k,A_k}$ in $\mathbf{P}$, $\pi_k = \frac{[p_i]_{i\in A_k}}{\sum_{j\in A_k} p_j}$ is the invariant distribution of $S_k$ and $S = \operatorname{diag}\left( \{S_k\}_{1\le k\le m} \right)$.
Proof. Let $\mathbf{x}_{A_k} = \left[x^{(n_1)}, x^{(n_2)}, \cdots, x^{(n_{m_k})}\right]$ be the subsequence of $\mathbf{x}$ formed by all those $x^{(l)}$'s that belong to $A_k$ in the original ordering. Obviously, $\sum_{k=1}^m m_k = n$ and $\left| \frac{m_k}{n} - \sum_{j\in A_k} p_j \right| < |A_k|\epsilon + \frac{1}{n}$. For any $\mathbf{y} = \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in S_\epsilon(\mathbf{x})$,
$$ \mathbf{y}_{A_k} = \left[y^{(n_1)}, y^{(n_2)}, \cdots, y^{(n_{m_k})}\right] \in A_k^{m_k} $$
is a Supremus $\epsilon$-typical sequence of length $m_k$ with respect to $S_k$ by Proposition 4.2.1, since $\mathbf{y}$ is Supremus $\epsilon$-typical. Additionally, by Proposition 4.2.2, there exist $\epsilon_k > 0$ and a positive integer $M_k$ such that the number of Supremus $\epsilon$-typical sequences of length $m_k$ is upper bounded by $\exp_2\{m_k[H(S_k|\pi_k) + \eta/2]\}$ if $0 < \epsilon < \epsilon_k$ and $m_k > M_k$. Therefore, if $0 < \epsilon < \min_{1\le k\le m} \epsilon_k$ and
$$ n > M = \max_{1\le k\le m} \frac{1 + M_k}{\sum_{j\in A_k} p_j - |A_k|\epsilon} $$
(this guarantees that $m_k > M_k$ for all $1 \le k \le m$), then
$$ |S_\epsilon(\mathbf{x})| \le \exp_2\left\{ \sum_{k=1}^m m_k[H(S_k|\pi_k) + \eta/2] \right\} = \exp_2\left\{ n\left[ \sum_{k=1}^m \frac{m_k}{n} H(S_k|\pi_k) + \eta/2 \right] \right\}. $$
Furthermore, choose $0 < \epsilon_0 \le \min_{1\le k\le m} \epsilon_k$ and $N_0 \ge M$ such that $\frac{m_k}{n} < \sum_{j\in A_k} p_j + \frac{\eta}{2\sum_{k=1}^m H(S_k|\pi_k)}$ for all $0 < \epsilon < \epsilon_0$, $n > N_0$ and $1 \le k \le m$. We then have
$$ |S_\epsilon(\mathbf{x})| < \exp_2\left\{ n\left[ \sum_{k=1}^m \sum_{j\in A_k} p_j H(S_k|\pi_k) + \eta \right] \right\}, $$
and (4.A.1) is established. Direct calculation yields (4.A.2).
By definition, $S_\epsilon(\mathbf{x})$ in Lemma 4.A.1 contains the Supremus $\epsilon$-typical sequences that have the same sequential pattern as $\mathbf{x}$ regarding the partition $\coprod_{k=1}^m A_k$. Similarly, let $T^H_\epsilon(\mathbf{x})$ be the set of weak $\epsilon$-typical sequences with the same sequential pattern as $\mathbf{x}$ regarding the partition $\coprod_{k=1}^m A_k$, namely
$$ T^H_\epsilon(\mathbf{x}) = \left\{ \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in T^H_\epsilon(n, \mathbf{P}) \,\middle|\, y^{(l)} \in A_k \Leftrightarrow x^{(l)} \in A_k, \forall\, 1 \le l \le n, \forall\, 1 \le k \le m \right\}. $$
We have that
Lemma 4.A.2. In Lemma 4.A.1, define $\Gamma(x) = l \Leftrightarrow x \in A_l$. We have that
$$ \left| T^H_\epsilon(\mathbf{x}) \right| < \exp_2\left\{ n\left[ H(\mathbf{P}|\pi) - \lim_{w\to\infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) + 2\epsilon \right] \right\}, $$
where $Y^{(w)} = \Gamma\left(X^{(w)}\right)$.
Proof. Let $\bar{\mathbf{y}} = \left[\Gamma\left(x^{(1)}\right), \Gamma\left(x^{(2)}\right), \cdots, \Gamma\left(x^{(n)}\right)\right]$. By definition, $\left[\Gamma\left(y^{(1)}\right), \Gamma\left(y^{(2)}\right), \cdots, \Gamma\left(y^{(n)}\right)\right] = \bar{\mathbf{y}}$ for any $\left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in T^H_\epsilon(\mathbf{x})$. As a consequence,
$$\begin{aligned} \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} &\ge \sum_{\left[y^{(1)},\cdots,y^{(n)}\right]\in T^H_\epsilon(\mathbf{x})} \Pr\left\{ X^{(l)} = y^{(l)}, \forall\, 1 \le l \le n \right\} \\ &> \sum_{\left[y^{(1)},\cdots,y^{(n)}\right]\in T^H_\epsilon(\mathbf{x})} \exp_2\{-n[H(\mathbf{P}|\pi) + \epsilon]\} \quad \left( \text{since } \left[y^{(1)},\cdots,y^{(n)}\right] \in T^H_\epsilon \right) \\ &= \left| T^H_\epsilon(\mathbf{x}) \right| \exp_2\{-n[H(\mathbf{P}|\pi) + \epsilon]\} \end{aligned} \tag{4.A.3} $$
(note: $\lim_{m\to\infty} \frac{1}{m} H\left(X^{(m)}, X^{(m-1)}, \cdots, X^{(1)}\right) = H(\mathbf{P}|\pi)$ since $M$ is irreducible Markov). On the other hand,
$$ \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} < \exp_2\left\{ -n\left[ \lim_{w\to\infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) - \epsilon \right] \right\} \tag{4.A.4} $$
by Definition 4.2.2. Therefore,
$$ \left| T^H_\epsilon(\mathbf{x}) \right| < \exp_2\left\{ n\left[ H(\mathbf{P}|\pi) - \lim_{w\to\infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) + 2\epsilon \right] \right\} $$
by (4.A.3) and (4.A.4).
Remark 4.5. Given a left ideal $I$ of a finite ring $R$, $R/I$ gives rise to a partition of $R$. Let $\mathcal{X} = R$, $m = |R/I|$ and $A_k$ ($1 \le k \le m$) be an element (which is a set) of $R/I$. One then obtains Lemma 4.2.1 and Lemma 4.2.2 immediately. In fact, Lemma 4.A.1 and Lemma 4.A.2 can easily be tailored to corresponding versions for other algebraic structures, e.g. groups, rngs, vector spaces, modules, algebras, etc., in a similar fashion.
4.A.3 A Supporting Lemma

Lemma 4.A.3. Let $\left\{X^{(n)}\right\}$ be a Markov chain with countable state space $\mathcal{X}$ and transition matrix $\mathbf{P}_0$. If $\mathbf{P}_0 = c_1\mathbf{U} + (1 - c_1)\mathbf{1}$, where $\mathbf{1}$ denotes the identity matrix, $\mathbf{U}$ is a matrix all of whose rows are identical to some countably infinite unitary vector and $0 \le c_1 \le 1$, then $\left\{\Gamma\left(X^{(n)}\right)\right\}$ is Markov for all feasible functions $\Gamma$.
Proof. Let $Y^{(n)} = \Gamma\left(X^{(n)}\right)$, and assume that $[u_x]_{x\in\mathcal{X}}$ is the first row of $\mathbf{U}$. For any $a, b \in \Gamma(\mathcal{X})$,
$$\begin{aligned} \Pr&\left\{Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a\right\} \\ &= \sum_{x\in\Gamma^{-1}(a)} \Pr\left\{X^{(n)} = x, Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a\right\} \\ &= \sum_{x\in\Gamma^{-1}(a)} \Pr\left\{Y^{(n+1)} = b \,\middle|\, X^{(n)} = x, Y^{(n)} = a\right\}\Pr\left\{X^{(n)} = x \,\middle|\, Y^{(n)} = a\right\} \\ &= \sum_{x\in\Gamma^{-1}(a)} \Pr\left\{Y^{(n+1)} = b \,\middle|\, X^{(n)} = x\right\}\Pr\left\{X^{(n)} = x \,\middle|\, Y^{(n)} = a\right\} \\ &= \begin{cases}\displaystyle\sum_{x\in\Gamma^{-1}(a)} \Bigl[\sum_{x'\in\Gamma^{-1}(b)} c_1 u_{x'}\Bigr]\Pr\left\{X^{(n)} = x \,\middle|\, Y^{(n)} = a\right\}; & a\neq b \\ \displaystyle\sum_{x\in\Gamma^{-1}(a)} \Bigl[1 - c_1 + \sum_{x'\in\Gamma^{-1}(b)} c_1 u_{x'}\Bigr]\Pr\left\{X^{(n)} = x \,\middle|\, Y^{(n)} = a\right\}; & a = b\end{cases} \\ &= \begin{cases}\displaystyle c_1\sum_{x'\in\Gamma^{-1}(b)} u_{x'}; & a\neq b \\ \displaystyle 1 - c_1 + c_1\sum_{x'\in\Gamma^{-1}(b)} u_{x'}; & a = b\end{cases} \\ &= \sum_{x'\in\Gamma^{-1}(b)} \Pr\left\{X^{(n+1)} = x' \,\middle|\, X^{(n)} = x\right\} \quad \forall\, x\in\Gamma^{-1}(a), \end{aligned}$$
where the last expression does not depend on the choice of $x \in \Gamma^{-1}(a)$. Therefore, it equals
$$\begin{aligned} &\phantom{=}\ \sum_{x\in\Gamma^{-1}(a)}\sum_{x'\in\Gamma^{-1}(b)} \Pr\left\{X^{(n+1)} = x' \,\middle|\, X^{(n)} = x, Y^{(n)} = a, Y^{(n-1)},\cdots\right\}\Pr\left\{X^{(n)} = x \,\middle|\, Y^{(n)} = a, Y^{(n-1)},\cdots\right\} \\ &= \sum_{x\in\Gamma^{-1}(a)}\sum_{x'\in\Gamma^{-1}(b)} \Pr\left\{X^{(n+1)} = x', X^{(n)} = x \,\middle|\, Y^{(n)} = a, Y^{(n-1)},\cdots\right\} \\ &= \Pr\left\{Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a, Y^{(n-1)},\cdots\right\}. \end{aligned}$$
Therefore, $\left\{\Gamma\left(X^{(n)}\right)\right\}$ is Markov.
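For finite state spaces, the conclusion of Lemma 4.A.3 can be checked numerically through the classical strong-lumpability criterion: the row sums of $\mathbf{P}_0$ into each class $\Gamma^{-1}(b)$ must be constant on each class $\Gamma^{-1}(a)$. A small sketch (ours, not from the thesis; the lumping map is a hypothetical example):

```python
import numpy as np

# Numeric sanity check of Lemma 4.A.3 on a finite state space: for
# P0 = c1*U + (1-c1)*I with identical rows in U, the lumped chain
# Gamma(X) is Markov.  We verify the strong-lumpability condition.
rng = np.random.default_rng(0)

u = rng.random(5); u /= u.sum()           # the common row of U
c1 = 0.7
P0 = c1 * np.tile(u, (5, 1)) + (1 - c1) * np.eye(5)

Gamma = [0, 0, 1, 1, 1]                   # a hypothetical lumping map
for b in set(Gamma):
    cols = [x for x in range(5) if Gamma[x] == b]
    block = P0[:, cols].sum(axis=1)       # Pr{Gamma(X^(n+1)) = b | X^(n) = x}
    for a in set(Gamma):
        rows = [x for x in range(5) if Gamma[x] == a]
        # within each class the lumped transition probability is constant
        print(a, b, np.allclose(block[rows], block[rows][0]))   # all True
```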
Remark 4.6. Lemma 4.A.3 is inspired by [BR58, Theorem 3]. However, $\left\{X^{(n)}\right\}$ in this lemma is not necessarily stationary or finite-state.
Chapter 5

Irreducible Markov Sources

Equipped with the foundation laid down by Proposition 4.2.2, Lemma 4.2.1 and Lemma 4.2.2, we resume our discussion of the Markov source network problem. First, recall Problem 2.1 and redefine it for more universal settings as follows.
Problem 5.1 (Source Coding for Computing a Function of Sources with or without Memory). Let $t \in S = \{1, 2, \cdots, s\}$ be a source that randomly generates discrete data $\cdots, X_t^{(1)}, X_t^{(2)}, \cdots, X_t^{(n)}, \cdots$, where $X_t^{(n)}$ has a finite sample space $\mathcal{X}_t$ for all $n \in \mathbb{N}^+$. Given a discrete function $g : \mathcal{X}_S \to \mathcal{Y}$, what is the biggest region $\mathcal{R}[g] \subset \mathbb{R}^s$ satisfying: $\forall\, (R_1, R_2, \cdots, R_s) \in \mathcal{R}[g]$ and $\forall\, \epsilon > 0$, $\exists\, N_0 \in \mathbb{N}^+$, such that, $\forall\, n > N_0$, there exist $s$ encoders $\phi_t : \mathcal{X}_t^n \to \left[1, 2^{nR_t}\right]$, $t \in S$, and one decoder $\psi : \prod_{t\in S} \left[1, 2^{nR_t}\right] \to \mathcal{Y}^n$ with
$$ \Pr\{Y^n \neq \psi[\phi_1(X_1^n), \cdots, \phi_s(X_s^n)]\} < \epsilon, $$
where $Y^{(j)} = g\left(X_S^{(j)}\right)$ for all $1 \le j \le n$?
The difference between Problem 2.1 and Problem 5.1 lies in the restriction on the sources. Problem 2.1 is obviously a special case where the sources are i.i.d.. A more general case, where $\left\{X_S^{(n)}\right\}$ is asymptotically mean stationary, is discussed in Chapter 7.
In this chapter, we will investigate the situation where $g$ is irreducible Markovian¹. We will extend the results on LCoR from the previous chapter to the Markovian settings based on the Supremus typicality argument. Once again, it is shown that LCoR dominates its field counterpart in various aspects. Moreover, it is seen that our approach even provides solutions to some particular situations where $\left\{X_S^{(n)}\right\}$ is a non-ergodic stationary source.

¹A Markovian function is defined to be a Markov process that is a function of another arbitrary process [BR58].
5.1 Linear Coding over Finite Rings for Irreducible Markov Sources
As a special case, Problem 5.1 with $s = 1$, $g$ being an identity function and $M = \left\{X_1^{(n)}\right\} = \left\{Y^{(n)}\right\}$ being irreducible Markov resumes the Markov source compression problem. It is known from [Cov75] that the achievable coding rate region for compressing $M$ is $\{R \in \mathbb{R} \mid R > H(\mathbf{P}|\pi)\}$, where $\mathbf{P}$ and $\pi$ are the transition matrix and invariant distribution of $M$, respectively. Unfortunately, the structures of the encoders used in [Cov75] are also unclear (as are their Slepian–Wolf correspondences), which limits their application (to Problem 5.1), as we will see in later sections. On the other hand, we have seen in Chapter 3 that the linear coding technique is of great use when applied to this problem. Thus, it is important to reproduce the achievability theorem, Theorem 2.1.1, in the Markovian settings first.
Theorem 5.1.1. Assume that $s = 1$, $\mathcal{X}_1 = \mathcal{Y}$ is some finite ring $R$ and $g$ is an identity function in Problem 5.1, and additionally $\left\{X_1^{(n)}\right\} = \left\{Y^{(n)}\right\}$ is irreducible Markov with transition matrix $\mathbf{P}$ and invariant distribution $\pi$. We have that any rate
$$ R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{R/I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) \right\}, \tag{5.1.1} $$
where $S_{R/I} = \operatorname{diag}\left( \{S_A\}_{A\in R/I} \right)$ with $S_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{R/I}^{(i)} = X_1^{(i)} + I$, is achievable with linear coding over $R$. To be more precise, for any $\epsilon > 0$, there is an $N_0 \in \mathbb{N}^+$ such that there exist a linear encoder $\phi : R^n \to R^k$ and a decoder $\psi : R^k \to R^n$ for all $n > N_0$ with $\Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon$, provided that
$$ k > \max_{0\neq I\leq_l R} \frac{n}{\log|I|} \min\left\{ H(S_{R/I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) \right\}. $$
Proof. Part One: Let $r_{R/I} = H(S_{R/I}|\pi)$ and $R_0 = \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} r_{R/I}$. For any $R > R_0$ and $n \in \mathbb{N}^+$, let $k = \left\lfloor \frac{nR}{\log|R|} \right\rfloor$. Obviously, for any $0 < \eta < \min_{0\neq I\leq_l R} \frac{\log|I|}{\log|R|} \cdot \frac{R - R_0}{2}$, if $n > \frac{2\log|I|}{\eta}$, then
$$ \frac{\log|I|}{\log|R|} R_0 - \frac{k}{n}\log|I| < \frac{\log|I|}{\log|R|} R - 2\eta - \frac{k}{n}\log|I| \le \frac{\log|I|}{n} - 2\eta < -3\eta/2. $$
Let $N_0' = \max_{0\neq I\leq_l R} \frac{2\log|I|}{\eta}$. We have that, for all $n > N_0'$,
$$ r_{R/I} + \eta - \frac{k}{n}\log|I| \le \frac{\log|I|}{\log|R|} R_0 + \eta - \frac{k}{n}\log|I| < -\eta/2. \tag{5.1.2} $$
The following proves that $R$ is achievable with linear coding over $R$.

Encoding: Choose some $n \in \mathbb{N}^+$ and generate a $k \times n$ matrix $\mathbf{A}$ over $R$ uniformly at random (independently choose each entry of $\mathbf{A}$ from $R$ uniformly at random). Let the encoder be the linear mapping
$$ \phi : \mathbf{x} \mapsto \mathbf{A}\mathbf{x}, \quad \forall\, \mathbf{x} \in R^n. $$
Notice that the coding rate is
$$ \frac{1}{n}\log|\phi(R^n)| \le \frac{1}{n}\log\left|R^k\right| = \frac{k\log|R|}{n} \le R. $$

Decoding: Choose an $\epsilon > 0$. Assume that $\mathbf{z} \in R^k$ is the output of the encoder. The decoder claims that $\mathbf{x} \in R^n$ is the original data sequence if and only if

1. $\mathbf{x} \in S_\epsilon(n, \mathbf{P})$; and
2. $\forall\, \mathbf{x}' \in S_\epsilon(n, \mathbf{P})$, if $\mathbf{x}' \neq \mathbf{x}$, then $\phi(\mathbf{x}') \neq \mathbf{z}$.

In other words, the decoder $\psi$ maps $\mathbf{z}$ to $\mathbf{x}$.

Error: Assume that $X \in R^n$ is the original data sequence generated. An error occurs if and only if

$E_1$: $X \notin S_\epsilon(n, \mathbf{P})$; or
$E_2$: there exists $\mathbf{x}' \in S_\epsilon(n, \mathbf{P})$ with $\mathbf{x}' \neq X$ such that $\phi(\mathbf{x}') = \phi(X)$.
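A toy sketch of this random linear encoder may help fix ideas (ours, for illustration only; the ring $\mathbb{Z}_4$ and the block sizes are arbitrary choices). The last line checks the linearity property that governs the collision events in $E_2$:

```python
import numpy as np

# Sketch of the random linear encoder in the proof: phi(x) = A x over Z4,
# with the k x n matrix A drawn uniformly at random.
rng = np.random.default_rng(1)
q, n, k = 4, 12, 6                      # ring Z_q, source block, code length

A = rng.integers(0, q, size=(k, n))     # uniformly random matrix over Z4
x = rng.integers(0, q, size=n)          # a source block in R^n

z = A @ x % q                           # the encoder output z = phi(x) in R^k

# The decoder of the proof searches the Supremus typical set for the unique
# preimage of z; here we only illustrate the linearity used in the analysis:
# phi(x') - phi(x) = phi(x' - x), so collisions are governed by ker(phi).
x2 = rng.integers(0, q, size=n)
assert ((A @ x2 - A @ x) % q == A @ ((x2 - x) % q) % q).all()
```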
Error Probability: We claim that there exist $N_0 \in \mathbb{N}^+$ and $\epsilon_0 > 0$ such that, if $n > N_0$ and $\epsilon_0 > \epsilon > 0$, then $\Pr\{\psi(\phi(X)) \neq X\} = \Pr\{E_1 \cup E_2\} < \eta$. First of all, by the AEP of Supremus typicality (Proposition 4.2.2), there exist $N_0'' \in \mathbb{N}^+$ and $\epsilon_0'' > 0$ such that $\Pr\{E_1\} < \eta/2$ if $n > N_0''$ and $\epsilon_0'' > \epsilon > 0$. Secondly, let $E_1^c$ be the complement of $E_1$. We have
$$\begin{aligned} \Pr\{E_2 \mid E_1^c\} &= \sum_{\mathbf{x}'\in S_\epsilon\setminus\{X\}} \Pr\{\phi(\mathbf{x}') = \phi(X) \mid E_1^c\} \\ &\le \sum_{0\neq I\leq_l R}\ \sum_{\mathbf{x}'\in S_\epsilon(X,I)\setminus\{X\}} \Pr\{\phi(\mathbf{x}') = \phi(X) \mid E_1^c\} & (5.1.3) \\ &< \sum_{0\neq I\leq_l R} \exp_2\left[ n(r_{R/I} + \eta) \right] |I|^{-k} & (5.1.4) \\ &\le \left( 2^{|R|} - 2 \right) \max_{0\neq I\leq_l R} \exp_2\left[ n\left( r_{R/I} + \eta - \frac{k}{n}\log|I| \right) \right] & (5.1.5) \\ &< \left( 2^{|R|} - 2 \right) \exp_2(-n\eta/2), & (5.1.6) \end{aligned}$$
where

(5.1.3) follows from the fact that $S_\epsilon(n, \mathbf{P}) = \bigcup_{0\neq I\leq_l R} S_\epsilon(X, I)$;

(5.1.4) is from Lemma 4.2.1 and Lemma 2.1.1, and it requires that $\epsilon$ be smaller than some $\epsilon_0''' > 0$ and $n$ be larger than some $N_0''' \in \mathbb{N}^+$;

(5.1.5) is due to the fact that the number of non-trivial left ideals of $R$ is bounded by $2^{|R|} - 2$;

(5.1.6) is from (5.1.2), and it requires that $n > N_0'$.

Let $N_0 = \max\left\{ N_0', N_0'', N_0''', \frac{2}{\eta}\log\frac{2^{|R|}-2}{\eta} \right\}$ and $\epsilon_0 = \min\{\epsilon_0'', \epsilon_0'''\}$. We have that
$$ \Pr\{E_2 \mid E_1^c\} < \eta/2 \quad\text{and}\quad \Pr\{E_1\} < \eta/2 $$
if $n > N_0$ and $\epsilon_0 > \epsilon > 0$. Hence, $\Pr\{E_1 \cup E_2\} \le \Pr\{E_2 \mid E_1^c\} + \Pr\{E_1\} < \eta$. This says that $R$ is achievable with linear coding over $R$.
Part Two: If we define $r_{R/I}$ to be
$$ H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) $$
and replace $S_\epsilon(n, \mathbf{P})$ with $T^H_\epsilon(n, \mathbf{P})$, then the conclusion that
$$ R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} r_{R/I} $$
is achievable with linear coding over $R$ follows from a similar proof based on the AEP of modified weak typicality (Proposition 4.2.3) and Lemma 4.2.2.

Finally, the theorem is established by a time-sharing argument.
Remark 5.1. In Part One of the proof of Theorem 5.1.1, we use the Supremus typicality encoding–decoding technique, in contrast to the classical (weak) typical sequence argument. Technically speaking, if one uses a classical (weak) typical sequence argument, Lemma 4.2.1 will not apply. Consequently, the classical argument will only achieve the inner bound
$$ R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \left[ H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) \right] \tag{5.1.7} $$
of (5.1.1). Similarly, the inner bound
$$ R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} H(S_{R/I}|\pi) \tag{5.1.8} $$
is achieved if applying only Lemma 4.2.1 (but not Lemma 4.2.2). Obviously, (5.1.1) is the union of these two inner bounds. However, as we have mentioned before, (5.1.7) is hard to access in general, due to engaging with the entropy rate. Thus, based on (5.1.7), it is often hard to draw an optimality conclusion regarding compressing a Markov source, as seen below.
Example 5.1.1. Let $M$ be an irreducible Markov chain with state space $\mathbb{Z}_4 = \{0, 1, 2, 3\}$ and transition matrix $\mathbf{P} = [p_{i,j}]_{i,j\in\mathbb{Z}_4}$ defined by (4.2.8). With a simple calculation, (5.1.8) says that
$$ R > \max\{1.8629, 1.7582\} = H(\mathbf{P}|\pi), \tag{5.1.9} $$
where $\pi$ is the invariant distribution of $M$, is achievable with linear coding over $\mathbb{Z}_4$. Optimality is attained, i.e. (5.1.1) and (5.1.8) coincide with the optimal achievable region (cf. [Cov75]). On the contrary, the achievable rate (5.1.7) drawn from the classical typicality argument does not lead to the same optimality conclusion, because there is no efficient method to evaluate the entropy rate in (5.1.7): neither is the initial distribution known, nor does $\mathbf{P} = c_1\mathbf{U} + (1-c_1)\mathbf{1}$ hold for any $\mathbf{U}$ with identical rows and any $c_1$ (see Remark 4.4).
Generally speaking, $\mathcal{X}$ or $\mathcal{Y}$ is not necessarily associated with any algebraic structure. In order to apply the linear encoder, we usually assume that $\mathcal{Y}$ in Problem 5.1 is mapped into a finite ring $R$ of order at least $|\mathcal{Y}|$ by some injection $\Phi : \mathcal{Y} \to R$, and denote the set of all possible injections by $\mathcal{I}(\mathcal{Y}, R)$.
Theorem 5.1.2. Assume that $s = 1$, $g$ is an identity function and $\left\{X_1^{(n)}\right\} = \left\{Y^{(n)}\right\}$ is irreducible Markov with transition matrix $\mathbf{P}$ and invariant distribution $\pi$ in Problem 5.1. For a finite ring $R$ of order at least $|\mathcal{Y}|$ and $\forall\, \Phi \in \mathcal{I}(\mathcal{Y}, R)$, let
$$ r_\Phi = \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{\Phi,I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) \right\}, $$
where $S_{\Phi,I} = \operatorname{diag}\left( \left\{S_{\Phi^{-1}(A)}\right\}_{A\in R/I} \right)$ with $S_{\Phi^{-1}(A)}$ being the stochastic complement of $\mathbf{P}_{\Phi^{-1}(A),\Phi^{-1}(A)}$ in $\mathbf{P}$ and $Y_{R/I}^{(m)} = \Phi\left(X_1^{(m)}\right) + I$, and define $\mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > r_\Phi\}$. We have that
$$ \bigcup_{\Phi\in\mathcal{I}(\mathcal{Y},R)} \mathcal{R}_\Phi \tag{5.1.10} $$
is achievable with linear coding over $R$.
Proof. The result follows immediately from Theorem 5.1.1 by a time-sharing argument.
Remark 5.2. In Theorem 5.1.2, assume that $\mathcal{Y}$ is some finite ring itself, and let $\tau$ be the identity mapping in $\mathcal{I}(\mathcal{Y}, \mathcal{Y})$. It could happen that $\mathcal{R}_\tau \subsetneq \mathcal{R}_\Phi$ for some $\Phi \in \mathcal{I}(\mathcal{Y}, \mathcal{Y})$. This implies that the region given by (5.1.1) could be strictly smaller than (5.1.10). Therefore, a "reordering" of the elements in the ring $\mathcal{Y}$ is required when seeking better linear encoders.
Remark 5.3. By Lemma 4.A.3, if, in Theorem 5.1.1, $\mathbf{P} = c_1\mathbf{U} + (1-c_1)\mathbf{1}$ with $\mathbf{U}$ of identical rows and $0 \le c_1 \le 1$, then
$$ R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{R/I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} H\left( Y_{R/I}^{(m)} \,\middle|\, Y_{R/I}^{(m-1)} \right) \right\} $$
is achievable with linear coding over $R$. Similarly, if $\mathbf{P} = c_1\mathbf{U} + (1-c_1)\mathbf{1}$ in Theorem 5.1.2, then, for all $\Phi \in \mathcal{I}(\mathcal{Y}, R)$,
$$ \mathcal{R}_\Phi = \left\{ R \in \mathbb{R} \,\middle|\, R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{\Phi,I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} H\left( Y_{R/I}^{(m)} \,\middle|\, Y_{R/I}^{(m-1)} \right) \right\} \right\}. $$
Although the achievable regions presented in the above theorems are rather involved, they coincide with the optimal one in many situations, i.e. (5.1.10) (or (5.1.1)) is identical to $\{R \in \mathbb{R} \mid R > H(\mathbf{P}|\pi)\}$. This has been demonstrated in Example 5.1.1 above, and more is shown in the following.
Corollary 5.1.1. In Theorem 5.1.1 (Theorem 5.1.2), if $R$ is a finite field, then
$$ R > H(\mathbf{P}|\pi) $$
$\left( \mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > H(\mathbf{P}|\pi)\}, \forall\, \Phi \in \mathcal{I}(\mathcal{Y}, R) \right)$ is achievable with linear coding over $R$.

Proof. If $R$ is a finite field, then $R$ is the only non-trivial left ideal of itself. The statement follows, since $S_{R/R} = \mathbf{P}$ ($S_{\Phi,R} = \mathbf{P}$) and $H\left( Y_{R/R}^{(m)} \right) = 0$ for all feasible $m$.
Corollary 5.1.2. In Theorem 5.1.2, if $\mathbf{P}$ describes an i.i.d. process, i.e. the row vectors of $\mathbf{P}$ are identical to $\pi = [p_j]_{j\in\mathcal{Y}}$, then
$$ \mathcal{R}_\Phi = \left\{ R \in \mathbb{R} \,\middle|\, R > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} [H(\pi) - H(\pi_{\Phi,I})] \right\}, \quad \forall\, \Phi \in \mathcal{I}(\mathcal{Y}, R), $$
where $\pi_{\Phi,I} = \left[ \sum_{j\in\Phi^{-1}(A)} p_j \right]_{A\in R/I}$, is achievable with linear coding over $R$. In particular, if

1. $R$ is a field with $|R| \ge |\mathcal{Y}|$; or
2. $R$, with $|R| \ge |\mathcal{Y}|$, contains one and only one proper non-trivial left ideal $I_0$ and $|I_0| = \sqrt{|R|}$; or
3. $R$ is a product ring of several rings satisfying condition 1 or 2,

then $\bigcup_{\Phi\in\mathcal{I}(\mathcal{Y},R)} \mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > H(\pi)\}$.

Proof. The first half of the statement follows from Theorem 5.1.2 by direct calculation. The second half is from Theorem 2.3.3 and Theorem 2.3.5.
Remark 5.4. Concrete examples of the finite rings from Corollary 5.1.2 include, but are not limited to:

1. $\mathbb{Z}_p$, where $p \ge |\mathcal{Y}|$ is a prime, as a finite field;
2. $\mathbb{Z}_{p^2}$ and $M_{L,p} = \left\{ \begin{bmatrix} x & 0 \\ y & x \end{bmatrix} \,\middle|\, x, y \in \mathbb{Z}_p \right\}$, where $p \ge |\mathcal{Y}|$ is a prime;
3. $M_{L,p_1} \times \mathbb{Z}_{p_2}$, where $p_1 \ge |\mathcal{Y}|$ and $p_2 \ge |\mathcal{Y}|$ are primes.

Since there always exists a prime $p$ with $p^2 \ge |\mathcal{Y}|$ in Theorem 5.1.2, Corollary 5.1.2 guarantees that there always exist optimal linear encoders over some non-field ring, say $\mathbb{Z}_{p^2}$ or $M_{L,p}$, if the source is i.i.d..
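As a small illustration (ours, not from the thesis), arithmetic in $M_{L,p}$ can be simulated by storing each element as the pair $(x, y)$. The sketch below exhibits a zero divisor, so $M_{L,p}$ is indeed not a field, and checks that $I_0 = \{(0, y) : y \in \mathbb{Z}_p\}$ is an ideal of size $\sqrt{|M_{L,p}|}$, as condition 2 of Corollary 5.1.2 requires:

```python
# Arithmetic in the non-field ring M_{L,p} of lower-triangular matrices
# [[x, 0], [y, x]] over Z_p, here with p = 2.
p = 2

def mul(a, b):
    """Multiply two M_{L,p} elements, each stored as a pair (x, y)."""
    (x1, y1), (x2, y2) = a, b
    return ((x1 * x2) % p, (y1 * x2 + x1 * y2) % p)

# (0, 1) corresponds to [[0, 0], [1, 0]]: a nonzero nilpotent element,
# so M_{L,p} has zero divisors and cannot be a field.
print(mul((0, 1), (0, 1)))   # (0, 0): the zero element

# I0 = {(0, y) : y in Z_p} is the unique proper non-trivial ideal,
# and |I0| = p, which is the square root of |M_{L,p}| = p^2.
I0 = [(0, y) for y in range(p)]
ring = [(x, y) for x in range(p) for y in range(p)]
print(all(mul(r, i) in I0 and mul(i, r) in I0 for r in ring for i in I0))
```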
Corollary 5.1.2 can be generalised to the multiple-sources scenario in a memoryless setting (see Theorem 2.3.3 and Theorem 2.3.5). More precisely, the Slepian–Wolf region is always achieved with linear coding over some non-field ring. Unfortunately, a corresponding existence conclusion for the (single or multivariate [FCC+02]) Markov source(s) scenario is neither proved nor disproved. Nevertheless, Example 5.1.1, Corollary 5.1.2 and conclusions from Chapter 2 do affirmatively support such an assertion to their own extents.

Even though it is unproved that linear coding over non-field rings is optimal for the special case of Problem 5.1 considered in this section, it will be seen in later sections that linear coding over non-field rings strictly outperforms its field counterpart in other settings of this problem.
5.2 Source Coding for Computing Markovian Functions

We now move on to a more general setting of Problem 5.1, where both $s$ and $g$ are arbitrary. Generally speaking, $\mathcal{R}[g]$ is unknown when $g$ is not an identity function (e.g. the binary sum), and it is larger (strictly, in many cases) than the Slepian–Wolf region. However, not much is known for the case of sources with memory. Let
$$ \mathcal{R}_s = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{t\in T} R_t > \lim_{n\to\infty} \frac{1}{n}\left[ H\left( X_S^{(n)}, X_S^{(n-1)}, \cdots, X_S^{(1)} \right) - H\left( X_{T^c}^{(n)}, X_{T^c}^{(n-1)}, \cdots, X_{T^c}^{(1)} \right) \right], \forall\, \emptyset \neq T \subseteq S \right\}^2, \tag{5.2.1} $$
where $T^c = S \setminus T$. By [Cov75], if the process $\cdots, X_S^{(1)}, X_S^{(2)}, \cdots, X_S^{(n)}, \cdots$ is jointly ergodic³ (stationary ergodic), then $\mathcal{R}_s = \mathcal{R}[g]$ for an identity function $g$. Naturally, $\mathcal{R}_s$ is an inner bound for $\mathcal{R}[g]$ in the case of an arbitrary $g$. But $\mathcal{R}_s$ is not always tight (optimal), i.e. $\mathcal{R}_s \subsetneq \mathcal{R}[g]$, as we will demonstrate later in Example 5.2.1 below. Even for the special scenario of correlated i.i.d. sources, i.e. $\cdots, X_S^{(1)}, X_S^{(2)}, \cdots, X_S^{(n)}, \cdots$ is i.i.d., $\mathcal{R}_s$, which is then the Slepian–Wolf region, is not tight (optimal) in general. Unfortunately, little is mentioned in the existing literature regarding the situation where $\cdots, X_S^{(1)}, X_S^{(2)}, \cdots, X_S^{(n)}, \cdots$ is not memoryless, nor the case where $\cdots, Y^{(1)}, Y^{(2)}, \cdots, Y^{(n)}, \cdots$ is homogeneous Markov (which does not necessarily imply that $\cdots, X_S^{(1)}, X_S^{(2)}, \cdots, X_S^{(n)}, \cdots$ is jointly ergodic or homogeneous Markov; see Example 5.2.2).
In this section, we will address Problem 5.1 by assuming that $g$ admits a Markovian polynomial (nomographic) presentation. We will show that the linear coding approach is strictly better than the one from [Cov75], and that it even offers solutions when [Cov75] does not apply. Furthermore, in Section 5.3, we will once again demonstrate that LCoR has a strict upper hand compared to its field counterpart in terms of achieving better coding rates, even in non-i.i.d. settings.
Example 5.2.1. Consider three sources 1, 2 and 3 generating random data $X_1^{(i)}$, $X_2^{(i)}$ and $X_3^{(i)}$ (at time $i \in \mathbb{N}^+$) whose sample spaces are $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}_3 = \{0, 1\} \subsetneq \mathbb{Z}_4$, respectively. Let $g : \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3 \to \mathbb{Z}_4$ be defined as
$$ g : (x_1, x_2, x_3) \mapsto x_1 + 2x_2 + 3x_3, \tag{5.2.2} $$
and assume that $\left\{X^{(n)}\right\}$, where $X^{(i)} = \left( X_1^{(i)}, X_2^{(i)}, X_3^{(i)} \right)$, forms a Markov chain
²Assume that the limits exist.
³Jointly ergodic as defined by Cover [Cov75] is equivalent to stationary ergodic, a condition supporting the Shannon–McMillan–Breiman Theorem. Stationary ergodic is a special case of a.m.s. ergodic [GK80]. The latter is a sufficient and necessary condition for the Point-wise Ergodic Theorem to hold [GK80, Theorem 1]. The Shannon–McMillan–Breiman Theorem holds under this universal condition as well [GK80].
with transition matrix
$$ \begin{array}{c|cccccccc} & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,0) & (1,0,1) & (1,1,0) & (1,1,1) \\ \hline (0,0,0) & .1397 & .4060 & .0097 & .0097 & .0097 & .0097 & .4060 & .0097 \\ (0,0,1) & .0097 & .5360 & .0097 & .0097 & .0097 & .0097 & .4060 & .0097 \\ (0,1,0) & .0097 & .4060 & .1397 & .0097 & .0097 & .0097 & .4060 & .0097 \\ (0,1,1) & .0097 & .4060 & .0097 & .1397 & .0097 & .0097 & .4060 & .0097 \\ (1,0,0) & .0097 & .4060 & .0097 & .0097 & .1397 & .0097 & .4060 & .0097 \\ (1,0,1) & .0097 & .4060 & .0097 & .0097 & .0097 & .1397 & .4060 & .0097 \\ (1,1,0) & .0097 & .4060 & .0097 & .0097 & .0097 & .0097 & .5360 & .0097 \\ (1,1,1) & .0097 & .4060 & .0097 & .0097 & .0097 & .0097 & .4060 & .1397 \end{array} $$
In order to recover $g$ at the decoder, one solution is to apply Cover's method [Cov75] to first decode the original data and then compute $g$. This results in an achievable region
$$ \mathcal{R}_3 = \left\{ (R_1, R_2, R_3) \in \mathbb{R}^3 \,\middle|\, \sum_{t\in T} R_t > \lim_{m\to\infty} \left[ H\left( X^{(m)} \,\middle|\, X^{(m-1)} \right) - H\left( X_{T^c}^{(m)} \,\middle|\, X_{T^c}^{(m-1)} \right) \right], \forall\, \emptyset \neq T \subseteq \{1, 2, 3\} \right\}. $$
However, $\mathcal{R}_3$ is not optimal, i.e. coding rates beyond this region can be achieved. Observe that $\left\{Y^{(n)}\right\}$, where $Y^{(i)} = g\left(X^{(i)}\right)$, is an irreducible Markov chain with transition matrix
$$ \begin{array}{c|cccc} & 0 & 3 & 2 & 1 \\ \hline 0 & .1493 & .8120 & .0193 & .0193 \\ 3 & .0193 & .9420 & .0193 & .0193 \\ 2 & .0193 & .8120 & .1493 & .0193 \\ 1 & .0193 & .8120 & .0193 & .1493 \end{array} \tag{5.2.3} $$
By Theorem 5.1.1, for any $\epsilon > 0$, there is an $N_0 \in \mathbb{N}^+$, such that for all $n > N_0$ there exist a linear encoder $\phi : \mathbb{Z}_4^n \to \mathbb{Z}_4^k$ and a decoder $\psi : \mathbb{Z}_4^k \to \mathbb{Z}_4^n$, such that $\Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon$, as long as
$$ k > \frac{n}{2} \times \max\{0.3664, 0.3226\} = 0.1832n. $$
Further notice that $\phi(Y^n) = \vec{g}\left( Z_1^k, Z_2^k, Z_3^k \right)$, where $Z_t^k = \phi(X_t^n)$ ($t = 1, 2, 3$) and
$$ \vec{g}\left( Z_1^k, Z_2^k, Z_3^k \right) = \begin{bmatrix} g\left( Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)} \right) \\ g\left( Z_1^{(2)}, Z_2^{(2)}, Z_3^{(2)} \right) \\ \vdots \\ g\left( Z_1^{(k)}, Z_2^{(k)}, Z_3^{(k)} \right) \end{bmatrix}, $$
since $g$ is linear. Thus, by the approach of Körner–Marton [KM79], we can use $\phi$ as the encoder for each source.
Upon observing $Z_1^k$, $Z_2^k$ and $Z_3^k$, the decoder claims that $\psi\left( \vec{g}\left( Z_1^k, Z_2^k, Z_3^k \right) \right)$ is the desired data $\vec{g}(X_1^n, X_2^n, X_3^n)$. Obviously,
$$ \Pr\{\psi(\vec{g}[\phi(X_1^n), \phi(X_2^n), \phi(X_3^n)]) \neq Y^n\} = \Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon, $$
as long as $k > 0.1832n$. As a consequence, the region
$$ \mathcal{R}_{\mathbb{Z}_4} = \left\{ (R_1, R_2, R_3) \in \mathbb{R}^3 \,\middle|\, R_i > \frac{2k}{n} = 0.4422 \right\} \tag{5.2.4} $$
is achieved. Since
$$ 0.4422 + 0.4422 + 0.4422 < \lim_{m\to\infty} H\left( X^{(m)} \,\middle|\, X^{(m-1)} \right) = 1.4236, $$
we have that $\mathcal{R}_{\mathbb{Z}_4} \not\subseteq \mathcal{R}_3$. In conclusion, $\mathcal{R}_3$ is suboptimal for computing $g$.
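The interchange of $\phi$ and $g$ exploited above is plain modular linearity, and can be checked in a few lines (a toy sketch of ours with an arbitrary random matrix over $\mathbb{Z}_4$; not from the thesis):

```python
import numpy as np

# Check of the identity behind the Koerner-Marton step: for the linear
# function g(x1,x2,x3) = x1 + 2*x2 + 3*x3 over Z4, encoding the function
# value equals the function of the encodings.
rng = np.random.default_rng(2)
q, n, k = 4, 10, 4

A = rng.integers(0, q, size=(k, n))          # one common linear encoder
x1, x2, x3 = (rng.integers(0, 2, size=n) for _ in range(3))  # binary sources

g = lambda a, b, c: (a + 2 * b + 3 * c) % q  # the function of Example 5.2.1
y = g(x1, x2, x3)                            # Y^n, computed symbol-wise

lhs = A @ y % q                              # phi(Y^n)
rhs = g(A @ x1 % q, A @ x2 % q, A @ x3 % q)  # g applied to the encodings
print(np.array_equal(lhs, rhs))              # True: phi(Y^n) = g(Z1,Z2,Z3)
```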
Theorem 5.2.1. In Problem 5.1, assume that $g$ admits a presentation
$$ g = h \circ k, \quad\text{where } k(x_1, x_2, \cdots, x_s) = \sum_{i=1}^s k_i(x_i), \tag{5.2.5} $$
$h$ and the $k_i$'s are functions associated with a finite ring $R$ ($k_i : \mathcal{X}_i \to R$ and $h : R \to \mathcal{Y}$), and $k$ is irreducible Markovian. Let $\mathbf{P}$ and $\pi$ be the transition matrix and invariant distribution of $\left\{ Z^{(n)} = \sum_{t\in S} k_t\left( X_t^{(n)} \right) \right\}$, respectively. We have
$$ \mathcal{R} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > R_0\} \subseteq \mathcal{R}[g], $$
where
$$ R_0 = \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{R/I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) \right\}, $$
$S_{R/I} = \operatorname{diag}\left( \{S_A\}_{A\in R/I} \right)$ with $S_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{R/I}^{(m)} = Z^{(m)} + I$. Moreover, if $R$ is a field, then
$$ \mathcal{R} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}|\pi)\}. \tag{5.2.6} $$
Proof. By Theorem 5.1.1, for any $\epsilon > 0$, there exists an $N_0 \in \mathbb{N}^+$ such that, for all $n > N_0$, there exist a linear encoder $\phi_0 : R^n \to R^k$ and a decoder $\psi_0 : R^k \to R^n$ such that
$$ \Pr\{\psi_0(\phi_0(Z^n)) \neq Z^n\} < \epsilon, $$
provided that $k > \frac{nR_0}{\log|R|}$. Choose $\phi_t = \phi_0 \circ \vec{k}_t$ ($t \in S$) as the encoder for the $t$th source and $\psi = \psi_0 \circ \gamma$, where $\gamma : R^s \to R$ is defined as $\gamma(x_1, x_2, \cdots, x_s) = \sum_{t\in S} x_t$, as the decoder. We have that
$$ \Pr\{\psi(\phi_1(X_1^n), \phi_2(X_2^n), \cdots, \phi_s(X_s^n)) \neq Z^n\} = \Pr\left\{ \psi_0\left( \gamma\left( \left[ \phi_0\left( \vec{k}_t(X_t^n) \right) \right]_{t\in S} \right) \right) \neq Z^n \right\} = \Pr\left\{ \psi_0\left( \phi_0\left( \gamma\left( \left[ \vec{k}_t(X_t^n) \right]_{t\in S} \right) \right) \right) \neq Z^n \right\} = \Pr\{\psi_0(\phi_0(Z^n)) \neq Z^n\} < \epsilon. $$
Therefore, $(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$, where $R_i = \frac{k\log|R|}{n} > R_0$, is achievable for computing $g$. As a conclusion, $\mathcal{R} \subseteq \mathcal{R}[g]$. If furthermore $R$ is a field, then $R$ is the only non-trivial left ideal of itself, and (5.2.6) follows.
Example 5.2.2. Define $\mathbf{P}_\alpha$ and $\mathbf{P}_\beta$ to be
$$ \begin{array}{c|cccccccc} & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,1) & (1,1,0) & (1,1,1) & (1,0,0) \\ \hline (0,0,0) & .2597 & .2093 & .2713 & .2597 & 0 & 0 & 0 & 0 \\ (0,0,1) & .1208 & .0872 & .6711 & .1208 & 0 & 0 & 0 & 0 \\ (0,1,0) & .0184 & .2627 & .4101 & .3088 & 0 & 0 & 0 & 0 \\ (0,1,1) & .0985 & .1823 & .2315 & .4877 & 0 & 0 & 0 & 0 \\ (1,0,1) & .12985 & .10465 & .13565 & .12985 & .12985 & .10465 & .13565 & .12985 \\ (1,1,0) & .0604 & .0436 & .33555 & .0604 & .0604 & .0436 & .33555 & .0604 \\ (1,1,1) & .0092 & .13135 & .20505 & .1544 & .0092 & .13135 & .20505 & .1544 \\ (1,0,0) & .04925 & .09115 & .11575 & .24385 & .04925 & .09115 & .11575 & .24385 \end{array} $$
and
$$ \begin{array}{c|cccccccc} & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,1) & (1,1,0) & (1,1,1) & (1,0,0) \\ \hline (0,0,0) & 0 & 0 & 0 & 0 & .2597 & .2093 & .2713 & .2597 \\ (0,0,1) & 0 & 0 & 0 & 0 & .1208 & .0872 & .6711 & .1208 \\ (0,1,0) & 0 & 0 & 0 & 0 & .0184 & .2627 & .4101 & .3088 \\ (0,1,1) & 0 & 0 & 0 & 0 & .0985 & .1823 & .2315 & .4877 \\ (1,0,1) & .2597 & .2093 & .2713 & .2597 & 0 & 0 & 0 & 0 \\ (1,1,0) & .1208 & .0872 & .6711 & .1208 & 0 & 0 & 0 & 0 \\ (1,1,1) & .0184 & .2627 & .4101 & .3088 & 0 & 0 & 0 & 0 \\ (1,0,0) & .0985 & .1823 & .2315 & .4877 & 0 & 0 & 0 & 0 \end{array} $$
respectively. Let $M = \left\{X^{(n)}\right\}$ be a non-homogeneous Markov chain whose transition matrix from time $n$ to time $n+1$ is
$$ \mathbf{P}^{(n)} = \begin{cases} \mathbf{P}_\alpha; & n \text{ is even}, \\ \mathbf{P}_\beta; & \text{otherwise}. \end{cases} $$
Consider Example 5.2.1 with the original homogeneous Markov chain $\left\{X^{(n)}\right\}$ replaced by $M$ defined above. It is easy to verify that there exists no invariant distribution $\pi'$ such that $\pi'\mathbf{P}^{(n)} = \pi'$ for all feasible $n$. This implies that $M$ is not jointly ergodic (stationary ergodic), nor a.m.s. ergodic; otherwise, $M$ would always possess an invariant distribution induced from the stationary mean measure of the a.m.s. dynamical system describing $M$ [Gra09, Theorem 7.1 and Theorem 8.1]. As a consequence, [Cov75] does not apply. However, $g$ is Markovian although $M$ is not even homogeneous. In exact terms, $\left\{ g\left( X^{(n)} \right) \right\}$ is homogeneous irreducible Markov with transition matrix $\mathbf{P}$ given by (4.2.8). Consequently, Theorem 5.2.1 offers a solution which achieves
$$ \mathcal{R} = \{(R_1, R_2, R_3) \mid R_i > H(\mathbf{P}|\pi) = 1.8629\}, $$
where $\pi$ is the unique eigenvector satisfying $\pi\mathbf{P} = \pi$. Once again, the optimal coding rate $H(\mathbf{P}|\pi)$ for compressing $\left\{ g\left( X^{(n)} \right) \right\}$ is derived from the Supremus typicality argument, rather than the classical typicality argument.
For an arbitrary $g$, Lemma 1.2.4 promises that there always exist some finite ring $R$ and functions $k_t : \mathcal{X}_t \to R$ ($t \in S$) and $h : R \to \mathcal{Y}$ such that
$$ g = h\left( \sum_{t\in S} k_t \right). $$
However, $k = \sum_{t\in S} k_t$ is not necessarily Markovian, unless the process $M = \left\{X^{(n)}\right\}$ is Markov with transition matrix $c_1\mathbf{U} + (1-c_1)\mathbf{1}$, where the stochastic matrix $\mathbf{U}$ has identical rows. In that case, $k$ is always Markovian, as claimed by Lemma 4.A.3.
Corollary 5.2.1. In Problem 5.1, assume that $\left\{X_S^{(n)}\right\}$ forms an irreducible Markov chain with transition matrix $\mathbf{P}_0 = c_1\mathbf{U} + (1-c_1)\mathbf{1}$, where all rows of $\mathbf{U}$ are identical to some unitary vector and $0 \le c_1 \le 1$. Then there exist some finite ring $R$ and functions $k_t : \mathcal{X}_t \to R$ ($t \in S$) and $h : R \to \mathcal{Y}$ such that
$$ g(x_1, x_2, \cdots, x_s) = h\left( \sum_{t=1}^s k_t(x_t) \right) \tag{5.2.7} $$
and $M = \left\{ Z^{(n)} = \sum_{t=1}^s k_t\left( X_t^{(n)} \right) \right\}$ is irreducible Markov. Furthermore, let $\pi$ and $\mathbf{P}$ be the invariant distribution and the transition matrix of $M$, respectively, and
$$ R_0 = \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} \min\left\{ H(S_{R/I}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m\to\infty} H\left( Y_{R/I}^{(m)} \,\middle|\, Y_{R/I}^{(m-1)} \right) \right\}, $$
where $S_{R/I} = \operatorname{diag}\left( \{S_A\}_{A\in R/I} \right)$ with $S_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{R/I}^{(m)} = Z^{(m)} + I$. We have that
$$ \mathcal{R}_R = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > R_0\} \subseteq \mathcal{R}[g]. $$
Proof. The existence of the $k_t$'s and $h$ is from Lemma 1.2.4, and Lemma 4.A.3 ensures that $M$ is Markov. In addition, $\left\{X_S^{(n)}\right\}$ is irreducible, so is $M$. Finally,
$$ \lim_{m\to\infty} \frac{1}{m} H\left( Y_{R/I}^{(m)}, Y_{R/I}^{(m-1)}, \cdots, Y_{R/I}^{(1)} \right) = \lim_{m\to\infty} H\left( Y_{R/I}^{(m)} \,\middle|\, Y_{R/I}^{(m-1)} \right), $$
since $\left\{ Y_{R/I}^{(n)} \right\}$ is Markov by Lemma 4.A.3. Henceforth, $\mathcal{R}_R \subseteq \mathcal{R}[g]$ by Theorem 5.2.1.
Remark 5.5. For the function $g$ in Corollary 5.2.1, it is often the case that there exist more than one finite ring $R$, or more than one set of functions $k_t$'s and $h$, satisfying the corresponding requirements. For example, the polynomial function $x + 2y + 3z \in \mathbb{Z}_4[3]$ also admits the polynomial presentation $\hat{h}(x + 2y + 4z) \in \mathbb{Z}_5[3]$, where $\hat{h}(u) = \sum_{a\in\mathbb{Z}_5} a\left[ 1 - (u - a)^4 \right] - \left[ 1 - (u - 4)^4 \right] \in \mathbb{Z}_5[1]$. As a conclusion, a possibly better inner bound of $\mathcal{R}[g]$ is
$$ \bigcup_R \bigcup_{\mathcal{P}_R(g)} \left( \mathcal{R}_s \cup \mathcal{R}_R \right), \tag{5.2.8} $$
where $\mathcal{P}_R(g)$ denotes all the polynomial presentations of format (5.2.7) of $g$ over the ring $R$.
Corollary 5.2.2. In Corollary 5.2.1, let $\pi = [p_j]_{j\in R}$. If $c_1 = 1$, namely $\left\{X_S^{(n)}\right\}$ and $M$ are i.i.d., then
$$ \mathcal{R}_R = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, R_i > \max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|} [H(\pi) - H(\pi_I)] \right\} \subseteq \mathcal{R}[g], $$
where $\pi_I = \left[ \sum_{j\in A} p_j \right]_{A\in R/I}$.
Remark 5.6. In Corollary 5.2.2, it may hold under many circumstances that $\max_{0\neq I\leq_l R} \frac{\log|R|}{\log|I|}[H(\pi) - H(\pi_I)] = H(\pi)$, i.e.
$$ \mathcal{R}_R = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\pi)\}, \tag{5.2.9} $$
for example when $R$ is a field. However, $R$ being a field is definitely not necessary. For more details, please refer to Section 2.3.
Corollary 5.2.3. In Corollary 5.2.1, $R$ can always be chosen to be a field. Consequently,
$$ \mathcal{R}_R = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}|\pi)\} \subseteq \mathcal{R}[g]. \tag{5.2.10} $$
Remark 5.7. Although $R$ in Corollary 5.2.1 can always be chosen to be a field, the region $\mathcal{R}_R$ is not necessarily larger than when $R$ is chosen to be a non-field ring. On the contrary, in many cases $\mathcal{R}_R$ is strictly larger when $R$ is a non-field ring than when it is a field. This is because the induced $\mathbf{P}$, as well as $\pi$, varies with the choice.
As mentioned, in Theorem 5.2.1, Corollary 5.2.1 and Corollary 5.2.2, there may be more than one choice of a finite ring $R$ satisfying the corresponding requirements. Among those choices, $R$ can be either a field or a non-field ring. Surprisingly, it is seen in (infinitely) many examples that using a non-field ring outperforms using a field. In many cases, it is proved that the achievable region obtained with linear coding over some non-field ring is strictly larger than any that is achieved with its field counterpart, regardless of which field is considered. Section 3.3 has demonstrated this in the setting of correlated i.i.d. sources. In the next section, this will be once again demonstrated in the setting of sources with memory.
5.3 Non-field Rings versus Fields II
Clearly, our previous discussion regarding linear coding is mainly based on general finite rings, which can be either fields or non-field rings, each bringing their own advantages. In the setting where $g$ is the identity function in Problem 5.1, linear coding over a finite field is always optimal in the sense of achieving $\mathcal{R}[g]$ if the sources are jointly ergodic (stationary ergodic) [Cov75]. An equivalent conclusive result is not yet proved for linear coding over non-field rings. Nevertheless, it is proved that there always exists more than one (up to isomorphism) non-field ring over which linear coding achieves the Slepian–Wolf region if the sources considered are i.i.d. (Section 2.3). Furthermore, many examples, say Example 5.1.1, show that a non-field ring can be equally optimal when considering irreducible Markov sources. All in all, there is still no conclusive support that linear coding over a field is preferable in terms of achieving the optimal region $\mathcal{R}[g]$ when $g$ is an identity function.
On the contrary, there are many drawbacks of using finite fields compared to using non-field rings (e.g. modulo integer rings):

1. Finite field arithmetic is complicated to implement, since it usually involves the polynomial long division algorithm;
2. The alphabet size(s) of the encoder(s) is (are) usually larger than required (Section 3.3);
3. In many specific circumstances of Problem 5.1, linear coding over any finite field is proved to be less optimal than its non-field ring counterpart in terms of achieving a larger achievable region (see Section 5.3 and Example 5.3.1);
4. The characteristic of a finite field has to be a prime. This constraint creates shortages in the polynomial presentations of discrete functions (see Lemma 5.A.2). These shortages confine the performance of the polynomial approach (if restricted to fields) and lead to results like Proposition 5.3.1. On the other hand, the characteristic can be any positive integer for a finite non-field ring;
5. A field (finite or not) contains no zero divisor. This also impairs the performance of the polynomial approach (if restricted to fields).
Example 5.3.1. Consider the situation illustrated in Example 5.2.1. One alternative is to treat $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}_3 = \{0, 1\}$ as a subset of the finite field $\mathbb{Z}_5$; the function $g$ can then be presented as
$$ g(x_1, x_2, x_3) = \hat{h}(x_1 + 2x_2 + 4x_3), $$
where $\hat{h} : \mathbb{Z}_5 \to \mathbb{Z}_4$ is given by $\hat{h}(z) = \begin{cases} z; & z \neq 4, \\ 3; & z = 4 \end{cases}$ (symbol-wise). By Corollary 5.2.3, linear coding over $\mathbb{Z}_5$ achieves the region
$$ \mathcal{R}_{\mathbb{Z}_5} = \left\{ (r_1, r_2, r_3) \in \mathbb{R}^3 \,\middle|\, r_i > H\left(\mathbf{P}_{\mathbb{Z}_5}\middle|\pi_{\mathbb{Z}_5}\right) = 0.4623 \right\}. $$
Obviously, $\mathcal{R}_{\mathbb{Z}_5} \subsetneq \mathcal{R}_{\mathbb{Z}_4} \subseteq \mathcal{R}[g]$. In conclusion, using linear coding over the field $\mathbb{Z}_5$ is less optimal compared with using the non-field ring $\mathbb{Z}_4$. In fact, the region $\mathcal{R}_F$ achieved by linear coding over any finite field $F$ is always strictly smaller than $\mathcal{R}_{\mathbb{Z}_4}$.
Proposition 5.3.1. In Example 5.2.1, $\mathcal{R}_F$, the achievable region obtained with linear coding over any finite field $F$ in the sense of Corollary 5.2.1, is properly contained in $\mathcal{R}_{\mathbb{Z}_4}$, i.e. $\mathcal{R}_F \subsetneq \mathcal{R}_{\mathbb{Z}_4}$.
Proof. Assume that
$$ g(x_1, x_2, x_3) = h(k_1(x_1) + k_2(x_2) + k_3(x_3)) $$
with $k_t : \{0, 1\} \to F$ ($1 \le t \le 3$) and $h : F \to \mathbb{Z}_4$. Let
$$ M_1 = \left\{ Y^{(n)} \right\} \text{ with } Y^{(n)} = g\left( X_1^{(n)}, X_2^{(n)}, X_3^{(n)} \right), \qquad M_2 = \left\{ Z^{(n)} \right\} \text{ with } Z^{(n)} = k_1\left( X_1^{(n)} \right) + k_2\left( X_2^{(n)} \right) + k_3\left( X_3^{(n)} \right), $$
and let $\mathbf{P}_l$ and $\pi_l$ be the transition matrix and the invariant distribution of $M_l$, respectively, for $l = 1, 2$. By Corollary 5.2.1 (also Corollary 5.2.3), linear coding over $F$ achieves the region
$$ \mathcal{R}_F = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}_2|\pi_2)\}, $$
while linear coding over $\mathbb{Z}_4$ achieves
$$ \mathcal{R}_{\mathbb{Z}_4} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, R_i > \max_{0\neq I\leq_l \mathbb{Z}_4} \frac{\log|\mathbb{Z}_4|}{\log|I|} H\left(S_{\mathbb{Z}_4/I}\middle|\pi_1\right) = H(\mathbf{P}_1|\pi_1) \right\}. $$
Moreover,
$$ H(\mathbf{P}_1|\pi_1) < H(\mathbf{P}_2|\pi_2) $$
by Lemma 5.A.1, because Lemma 5.A.2 shows that $h|_{\mathcal{S}}$, where $\mathcal{S} = k_1(\{0,1\}) + k_2(\{0,1\}) + k_3(\{0,1\})$, can never be injective. Therefore, $\mathcal{R}_F \subsetneq \mathcal{R}_{\mathbb{Z}_4}$.
Remark 5.8. There are infinitely many functions like $g$ defined in Example 5.2.1 such that the achievable region obtained with linear coding over any finite field in the sense of Corollary 5.2.1 is strictly suboptimal compared to the one achieved with linear coding over some non-field ring. These functions include $\sum_{t=1}^s x_t \in \mathbb{Z}_{2p}[s]$ for any $s \ge 2$ and any prime $p > 2$. One can always find a concrete example in which linear coding over $\mathbb{Z}_{2p}$ dominates. The reason for this is partially that these functions are defined on rings (e.g. $\mathbb{Z}_{2p}$) of non-prime characteristic, while a finite field must be of prime characteristic, resulting in conclusions like Proposition 5.3.1.
As a direct consequence of Proposition 5.3.1, we have

Theorem 5.3.1. In the sense of (5.2.8), linear coding over finite fields is not optimal.
5.A Appendix

5.A.1 Supporting Lemmata
Lemma 5.A.1. Let $\mathcal{Z}$ be a countable set, and let $\pi = [p(z)]_{z\in\mathcal{Z}}$ and $\mathbf{P} = [p(z_1, z_2)]_{z_1,z_2\in\mathcal{Z}}$ be a non-negative unitary vector and a stochastic matrix, respectively. For any function $h : \mathcal{Z} \to \mathcal{Y}$, if for all $y_1, y_2 \in \mathcal{Y}$
$$ \frac{p(z_1, y_2)}{p(z_1)} = c_{y_1,y_2}, \quad \forall\, z_1 \in h^{-1}(y_1), \tag{5.A.1} $$
where $c_{y_1,y_2}$ is a constant, then
$$ H\left( h\left( Z^{(2)} \right) \,\middle|\, h\left( Z^{(1)} \right) \right) \le H(\mathbf{P}|\pi), \tag{5.A.2} $$
where $\left( Z^{(1)}, Z^{(2)} \right) \sim \pi\mathbf{P}$. Moreover, (5.A.2) holds with equality if and only if
$$ p(z_1, h(z_2)) = p(z_1, z_2), \quad \forall\, z_1, z_2 \in \mathcal{Z} \text{ with } p(z_1, z_2) > 0. \tag{5.A.3} $$
Proof. By definition,
$$\begin{aligned} H\left( h\left( Z^{(2)} \right) \,\middle|\, h\left( Z^{(1)} \right) \right) &= -\sum_{y_1,y_2\in\mathcal{Y}} p(y_1, y_2) \log \frac{p(y_1, y_2)}{p(y_1)} \\ &= -\sum_{y_1,y_2\in\mathcal{Y}} \sum_{z_1\in h^{-1}(y_1)} p(z_1, y_2) \log \frac{\sum_{z_1'\in h^{-1}(y_1)} p(z_1', y_2)}{\sum_{z_1''\in h^{-1}(y_1)} p(z_1'')} \\ &\overset{(a)}{=} -\sum_{y_1,y_2\in\mathcal{Y}} \sum_{z_1\in h^{-1}(y_1)} p(z_1, y_2) \log \frac{p(z_1, y_2)}{p(z_1)} \\ &= -\sum_{y_1,y_2\in\mathcal{Y}} \sum_{\substack{z_2\in h^{-1}(y_2), \\ z_1\in h^{-1}(y_1)}} p(z_1, z_2) \log \frac{\sum_{z_2'\in h^{-1}(y_2)} p(z_1, z_2')}{p(z_1)} \\ &\overset{(b)}{\le} -\sum_{y_1,y_2\in\mathcal{Y}} \sum_{\substack{z_2\in h^{-1}(y_2), \\ z_1\in h^{-1}(y_1)}} p(z_1, z_2) \log \frac{p(z_1, z_2)}{p(z_1)} \\ &= -\sum_{z_1,z_2\in\mathcal{Z}} p(z_1, z_2) \log \frac{p(z_1, z_2)}{p(z_1)} = H(\mathbf{P}|\pi), \end{aligned}$$
where (a) is from (5.A.1). In addition, equality holds, i.e. (b) holds with equality, if and only if (5.A.3) is satisfied.
Remark 5.9. $\mathbf{P}$ in the above lemma can be interpreted as the transition matrix of some Markov process. However, $\pi$ is not necessarily the corresponding invariant distribution, nor is it necessary that such a Markov process be irreducible. In the meantime, (5.A.2) can be seen as a "data processing inequality". In addition, (5.A.1) is sufficient but not necessary for (5.A.2), even though it is sufficient and necessary for (a) in the above proof.
Lemma 5.A.2. For $g$ given by (5.2.2) and any finite field $F$, if there exist functions $k_t : \{0, 1\} \to F$ and $h : F \to \mathbb{Z}_4$ such that
$$ g(x_1, x_2, \cdots, x_s) = h\left( \sum_{t=1}^s k_t(x_t) \right), $$
then $h|_{\mathcal{S}}$, where $\mathcal{S} = k_1(\{0,1\}) + k_2(\{0,1\}) + k_3(\{0,1\})$, is not injective.

Proof. Suppose otherwise, i.e. $h|_{\mathcal{S}}$ is injective. Let $h' : h(\mathcal{S}) \to \mathcal{S}$ be the inverse mapping of $h : \mathcal{S} \to h(\mathcal{S})$. Obviously, $h'$ is bijective. By (5.2.2), we have
$$\begin{aligned} h'[g(1, 0, 0)] &= k_1(1) + k_2(0) + k_3(0) \\ &= h'[g(0, 1, 1)] = k_1(0) + k_2(1) + k_3(1) \\ &\neq h'[g(1, 1, 0)] = k_1(1) + k_2(1) + k_3(0) \\ &= h'[g(0, 0, 1)] = k_1(0) + k_2(0) + k_3(1). \end{aligned}$$
Let $\tau = h'[g(1,0,0)] - h'[g(1,1,0)] = h'[g(0,1,1)] - h'[g(0,0,1)] \in F$. We have that
$$ \tau = k_2(0) - k_2(1) = k_2(1) - k_2(0) = -\tau \implies \tau + \tau = 0. \tag{5.A.4} $$
(5.A.4) implies that either $\tau = 0$ or $\operatorname{Char}(F) = 2$, by Proposition 1.1.2. Notice that $k_2(0) \neq k_2(1)$, i.e. $\tau \neq 0$, by the definition of $g$. Thus, $\operatorname{Char}(F) = 2$. Let $\rho = k_3(0) - k_3(1)$. Obviously, $\rho \neq 0$ by the definition of $g$, and $\rho + \rho = 0$ since $\operatorname{Char}(F) = 2$. Consequently,
$$\begin{aligned} h'[g(0, 0, 0)] &= k_1(0) + k_2(0) + k_3(0) = k_1(0) + k_2(0) + k_3(1) + \rho \\ &= h'[g(0, 0, 1)] + \rho = h'[g(1, 1, 0)] + \rho \\ &= k_1(1) + k_2(1) + k_3(0) + \rho = k_1(1) + k_2(1) + k_3(1) + \rho + \rho \\ &= h'[g(1, 1, 1)]. \end{aligned}$$
Therefore, $g(0, 0, 0) = g(1, 1, 1)$ since $h'$ is bijective. This is absurd!
Chapter 6

Extended Shannon–McMillan–Breiman Theorem
By Proposition 4.2.1, we have seen that Supremus typicality is a recursive property, while classical typicality is not, by Example 4.2.1. This is because we have integrated a recursive feature, namely that a reduced process of an irreducible Markov process is irreducible Markov [Mey89], into the definition of the corresponding Supremus typical sequences. Although this makes the set of Supremus typical sequences a smaller (more restricted) set compared to the set of classical typical sequences, the AEP, Proposition 4.2.2, still holds. Consequently, we conclude that non-Supremus typical sequences, which might or might not be classically typical, are negligible in a stochastic sense. This is the spirit behind the arguments of the achievability theorems from Chapter 5. Moreover, this enriched structure of Supremus typical sequences provides us with refined properties (see the comparison between Lemma 4.2.1 and Lemma 4.2.2, etc.) to obtain more accessible conclusions.

However, we have postponed the consideration of the more universal case, the asymptotically mean stationary (a.m.s.) process, owing to some missing ergodic-theoretic background that is to be introduced in this chapter. This chapter proves that an induced transformation with respect to a finite-measure set of a recurrent a.m.s. dynamical system with a σ-finite measure is a.m.s.. This is the correspondence of the recursive feature of irreducible Markov processes that we are looking for. Since the Shannon–McMillan–Breiman (SMB) Theorem and the Shannon–McMillan Theorem hold for any finite-state a.m.s. ergodic process [GK80], we conclude that the SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. ergodic random process. We term this recursive property the Extended SMB Theorem.
6.1 Asymptotically Mean Stationary Dynamical Systems and Random Processes

6.1.1 Asymptotically Mean Stationary Dynamical Systems
Asymptotically Mean Stationary Dynamical Systems
A dynamical system (Ω, F , µ, T ) with a finite measure, e.g. probability measure, is
said to be asymptotically mean stationary1 (a.m.s.) [GK80] if the limit
n−1
1X
µ T −i B
n→∞ n
i=0
lim
exists for all B ∈ F . As proved in [GK80], a system being a.m.s. is a necessary and
sufficient condition to ensure that
n−1
1X
fTi
n i=0
converges µ-almost everywhere (µ-a.e.) on Ω for every bounded F -measurable realvalued function, say f . Let
n−1
1X
µ T −i B , ∀ B ∈ F ,
n→∞ n
i=0
µ(B) = lim
n−1
1X
f T i (x), ∀ x ∈ Ω.
n→∞ n
i=0
and f (x) = lim
Then, by the Vitali–Hahn–Saks Theorem, it is easily seen that µ is a finite measure
on (Ω, F ), and f is F -measurable. Moreover, (Ω, F , µ, T ) is invariant, in other
words, T is a measure preserving transformation on (Ω, F , µ), i.e.
µ(B) = µ T −1 B , ∀ B ∈ F ,
and f is T -invariant a.e., i.e.
f = f T a.e.,
with respect to both µ and µ. In fact, f is simply the conditional expectation
Eµ (f |I ), where I ⊆ F is the σ-algebra of T -invariant sets (B ∈ F is said to be
T -invariant if B = T −1 B). Therefore, if (Ω, F , µ, T ) is ergodic, i.e.
T −1 B = B =⇒ µ(B) = 0 or µ(Ω − B) = 0, ∀ B ∈ F ,
1 Perhaps it is better to replace “stationary” with “invariant,” because a stationary measure
defined in [GK80] is usually called an invariant measure in the language of ergodic theory. However,
in order to be consistent, we will follow existing literature and use the terminology “asymptotically
mean stationary,” while the reader can read it as “asymptotically mean invariant” if preferred.
6.1. Asymptotically Mean Stationary Dynamical Systems and Random Processes 87
then f = Eµ (f |I ) equals to a constant a.e. with respect to both µ and µ.
We emphasize that the definition (cited from [GK80]) of the a.m.s. property
given above is only valid for finite measures. In order to address dynamical systems
with non-finite measures, in particular those with σ-finite measures, we generalise
the definition as follows.
Definition 6.1.1. A dynamical system $(\Omega, \mathcal{F}, \mu, T)$ is said to be asymptotically mean stationary (a.m.s.) if there exists a measure $\bar{\mu}$ on $(\Omega, \mathcal{F})$ satisfying:

1. For any $B \in \mathcal{F}$ of finite measure, i.e. $\mu(B) < \infty$,
$$ \bar{\mu}(B) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left( T^{-i} B \right); $$
2. For any $T$-invariant set $B \in \mathcal{F}$, $\bar{\mu}(B) = \mu(B)$.

Such a measure $\bar{\mu}$ is named the invariant mean² of $\mu$.

²In [GK80], the term "stationary mean" is used instead of "invariant mean" for a finite measure $\mu$.
The following proposition clearly explains why the terminologies "asymptotically mean stationary" and "invariant mean" are suggested.

Proposition 6.1.1. Let $(\Omega, \mathcal{F}, \mu, T)$ be a.m.s. and $\bar{\mu}$ be an invariant mean of $\mu$. If $\bar{\mu}$ is σ-finite, then $(\Omega, \mathcal{F}, \bar{\mu}, T)$ is invariant.
Proof. For any $B \in \mathcal{F}$, if $\bar{\mu}(B) < \infty$, obviously $\bar{\mu}(B) = \bar{\mu}\left(T^{-n}B\right)$ for any positive integer $n$. If $\bar{\mu}(B) = \infty$, then there exists a countable partition $\{B_i : i \in \mathbb{N}^+\}$ of $B$, with $\bar{\mu}(B_i) < \infty$, since $\bar{\mu}$ is σ-finite. Moreover, $\left\{T^{-1}B_i\right\}$ is a countable partition of $T^{-1}B$, and $\bar{\mu}(B_i) = \bar{\mu}\left(T^{-1}B_i\right) < \infty$ for all feasible $i$. As a consequence,
$$ \bar{\mu}\left( T^{-1}B \right) = \sum_{i=1}^\infty \bar{\mu}\left( T^{-1}B_i \right) = \sum_{i=1}^\infty \bar{\mu}(B_i) = \bar{\mu}(B). $$
Hence, $(\Omega, \mathcal{F}, \bar{\mu}, T)$ is invariant.
Remark 6.1. Obviously, an invariant system $(\Omega, \mathcal{F}, m, T)$ is a.m.s. with $m$ being the invariant mean of itself. Actually, if $\mu$ in Definition 6.1.1 is finite, then the second requirement in the definition is redundant, because the fact that $\bar{\mu}(B) = \mu(B)$ for any $T$-invariant set $B$ can be deduced from the first requirement. Therefore, Definition 6.1.1 covers the original definition from [GK80] as a special case. However, for a non-finite measure, the second condition is crucial.
Example 6.1.1. Let $\mathbb{R}^+ = (0, +\infty)$, $\mathcal{B}$ be the Borel σ-algebra on $\mathbb{R}^+$, $\mu$ be the Lebesgue measure on $(\mathbb{R}^+, \mathcal{B})$, and $T(x) = x^2$, $\forall\, x \in \mathbb{R}^+$. Define the set function $\lambda : \mathcal{B} \to \mathbb{R}$ by:

1. For all $B \in \mathcal{B}$ with $\mu(B) < \infty$, $\lambda(B) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left( T^{-i} B \right)$;
2. For all $B \in \mathcal{B}$ with $\mu(B) = \infty$, $\lambda(B) = \sum_{i=1}^\infty \lambda(B_i)$, where $\mu(B_i) < \infty$ and the $B_i$'s form a countable partition of $B$.

It is easy to verify that $\lambda$ is well defined. In exact terms, for any measurable set $B$ with $\mu(B) = \infty$ and any two countable partitions $\{B_i'\}$ and $\{B_i''\}$ of $B$, where $\mu(B_i') < \infty$ and $\mu(B_i'') < \infty$,
$$ \sum_{i=1}^\infty \lambda(B_i') = \sum_{i=1}^\infty \lambda(B_i''). $$
In addition, one can also prove that $\lambda$ is a finite, hence σ-finite, measure over $(\mathbb{R}^+, \mathcal{B})$, since $\lambda(\mathbb{R}^+) = 1$. However, $\lambda$ is not an invariant mean of $\mu$, because $[1, +\infty)$ is a $T$-invariant set while
$$ \mu([1, +\infty)) = \infty \neq 0 = \lambda([1, +\infty)). $$
From this one sees that $(\mathbb{R}^+, \mathcal{B}, \mu, T)$ is not a.m.s.. To prove this by contradiction, suppose $\bar{\mu}$ is an invariant mean of $\mu$; then
$$ \bar{\mu}([1, +\infty)) = \sum_{j=1}^\infty \bar{\mu}([j, j+1)) = \sum_{j=1}^\infty \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left( T^{-i}[j, j+1) \right) = \sum_{j=1}^\infty 0 = 0 \neq \infty = \mu([1, +\infty)). $$
Definition 6.1.2. Given two dynamical systems $(\Omega_1, \mathcal{F}_1, \mu_1, T_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2, T_2)$, a mapping $\phi : \Omega_1 \to \Omega_2$ is said to be a homomorphism if

1. $\phi$ is measurable;
2. $\mu_1\left( \phi^{-1}(B_2) \right) = \mu_2(B_2)$, $\forall\, B_2 \in \mathcal{F}_2$;
3. $\phi T_1 = T_2\phi$ $\mu_1$-a.e..

$(\Omega_2, \mathcal{F}_2, \mu_2, T_2)$ is then called a factor of $(\Omega_1, \mathcal{F}_1, \mu_1, T_1)$. Furthermore, $\phi$ is said to be an isomorphism if there exists a homomorphism $\psi : \Omega_2 \to \Omega_1$ such that $\omega_1 = \psi(\phi(\omega_1))$ $\mu_1$-a.e. and $\omega_2 = \phi(\psi(\omega_2))$ $\mu_2$-a.e..
Proposition 6.1.2. In Definition 6.1.2, if $B_2$ is $T_2$-invariant, then $\phi^{-1}(B_2)$ is $T_1$-invariant.

Proof. $\phi^{-1}(B_2) = \phi^{-1}\left( T_2^{-1}B_2 \right) = T_1^{-1}\phi^{-1}(B_2)$.
Theorem 6.1.1. If a dynamical system is a.m.s. (invariant or ergodic), then all
of its factors are a.m.s. (invariant or ergodic).
Proof. Let (Ω2 , F2 , µ2 , T2 ) be a factor of (Ω1 , F1 , µ1 , T1 ) and φ : Ω1 → Ω2 is a
homomorphism.
a.m.s.: If (Ω1, F1, µ1, T1) is a.m.s. with invariant mean µ̄1, then, for any B2 ∈ F2 of finite measure,

lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ2(T2^{-i}B2) = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ1(φ^{-1}(T2^{-i}B2))
= lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ1(T1^{-i}(φ^{-1}(B2)))
= µ̄1(φ^{-1}(B2))
= µ̄2(B2),

where µ̄2 = µ̄1 ∘ φ^{-1}. Moreover, if B2 ∈ F2 is T2-invariant, then φ^{-1}(B2) is T1-invariant by Proposition 6.1.2. Thus,

µ̄2(B2) = µ̄1(φ^{-1}(B2)) = µ1(φ^{-1}(B2)) = µ2(B2).

Therefore, (Ω2, F2, µ2, T2) is a.m.s..
invariant: If (Ω1, F1, µ1, T1) is invariant, then, ∀ B2 ∈ F2,

µ2(B2) = µ1(φ^{-1}(B2)) = µ1(T1^{-1}φ^{-1}(B2)) = µ1(φ^{-1}(T2^{-1}B2)) = µ2(T2^{-1}B2).

Therefore, (Ω2, F2, µ2, T2) is invariant.

ergodic: For any B2 ∈ F2 that is T2-invariant, φ^{-1}(B2) is T1-invariant by Proposition 6.1.2. If (Ω1, F1, µ1, T1) is ergodic, then either µ2(B2) = µ1(φ^{-1}(B2)) = 0 or µ2(Ω2 − B2) = µ1(φ^{-1}(Ω2 − B2)) = µ1(Ω1 − φ^{-1}(B2)) = 0. Hence, (Ω2, F2, µ2, T2) is ergodic.
6.1.2 Random Processes
Given a dynamical system (Ω1, F1, µ1, T1) with a probability measure µ1 and a measurable function f1 : Ω1 → X1 (assume that X1 is countable with σ-algebra P(X1), the power set of X1, for simplicity), we can define a random process

M1 = { Xi = f1(T1^i) }_{i=0}^∞,

such that

Pr{ Xj = xj, ∀ j ∈ I } = µ1( ∩_{j∈I} T1^{-j}(f1^{-1}(xj)) ), ∀ xj ∈ X1 and ∀ I ⊆ N.
Actually, a random process can always be defined by a dynamical system according to the Kolmogorov Extension Theorem [Gra09, Theorem 3.3]. Let M2 = {Xi}_{i=0}^∞ be a random process with a countable state space X2 and Ω2 = ∏_{i=0}^∞ X2. Define F2 to be the σ-algebra generated by

G2 = ∪_{n=0}^∞ { {x0} × {x1} × ··· × {xn} × ∏_{i=n+1}^∞ X2 | xi ∈ X2, ∀ 0 ≤ i ≤ n }.

The Kolmogorov Extension Theorem [Gra09, Theorem 3.3] states that there exists a unique probability measure µ2 defining the measure space (Ω2, F2, µ2), such that

Pr{ Xj = xj, ∀ j ∈ I } = µ2( ∩_{j∈I} ( ∏_{i=0}^{j-1} X2 × {xj} × ∏_{i=j+1}^∞ X2 ) ),

for any xj ∈ X2 and index set I ⊆ N. Let T2 : Ω2 → Ω2 be the left shift, namely

T2((x0, x1, ···, xn, ···)) = (x1, x2, ···, x_{n+1}, ···),

and define f2 : Ω2 → X2 as f2 : (x0, x1, ···, xn, ···) ↦ x0; we have that

M2 = { f2(T2^i) }_{i=0}^∞.

This says that the process M2 defines a dynamical system (Ω2, F2, µ2, T2) "consistent" with itself.
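For readers who prefer a computational view, the following Python sketch (our illustration, not part of the formal development; the two-state transition matrix is chosen arbitrarily) realises this construction: sample paths play the role of points ω ∈ Ω2, the left shift T2 drops the first coordinate, and f2 reads off the zeroth coordinate, so that f2(T2^i(ω)) recovers Xi.

```python
import random

def sample_path(P, mu0, n, rng):
    """Sample x_0, ..., x_{n-1} from a Markov chain with transition
    matrix P (list of rows) and initial distribution mu0."""
    x = [rng.choices(range(len(mu0)), weights=mu0)[0]]
    for _ in range(n - 1):
        x.append(rng.choices(range(len(P)), weights=P[x[-1]])[0])
    return tuple(x)

T2 = lambda omega: omega[1:]      # left shift: drops the first coordinate
f2 = lambda omega: omega[0]       # reads off the zeroth coordinate

rng = random.Random(0)
omega = sample_path([[0.9, 0.1], [0.4, 0.6]], [0.5, 0.5], 10, rng)

# f2(T2^i(omega)) reproduces the i-th sample of the process.
orbit = [omega]
for _ in range(len(omega) - 1):
    orbit.append(T2(orbit[-1]))
assert [f2(w) for w in orbit] == list(omega)
```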
Definition 6.1.3. Following the notation defined above, the random process M1 is said to be a.m.s. (stationary or ergodic) if (Ω1, F1, µ1, T1) is a.m.s. (invariant or ergodic).
Proposition 6.1.3. A function of an a.m.s. (stationary or ergodic) random process
is a.m.s. (stationary or ergodic).
Proposition 6.1.4. Assuming that M2 is finite-state Markov, if M2 is irreducible,
then (Ω2 , F2 , µ2 , T2 ) is a.m.s. and ergodic.
Proof. If M2 is irreducible, then its invariant distribution gives rise to the invariant mean of µ2. Thus, (Ω2, F2, µ2, T2) is a.m.s.. The ergodicity part follows from [Gra09, Lemma 7.15] (or [Gra09, Corollary 8.4]).
Remark 6.2. The converse of Proposition 6.1.4 is not true, even when (Ω2, F2, µ2, T2) is also ergodic. For example, let M2 be a Markov process with transition matrix

[ 1    0  ]
[ 0.5  0.5 ].

One can verify that the corresponding system (Ω2, F2, µ2, T2) defined as above is a.m.s. and ergodic, although M2 is obviously not irreducible. However, M2 always admits a unique invariant distribution.
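Remark 6.2 is easy to probe numerically. In the sketch below (ours), the Cesàro averages of the marginals µ0 P^i converge to (1, 0) from any initial distribution µ0; this is both the unique invariant distribution of the reducible chain and the marginal induced by the invariant mean.

```python
import numpy as np

P = np.array([[1.0, 0.0],
              [0.5, 0.5]])        # reducible: state 0 is absorbing

mu = np.array([0.1, 0.9])          # an arbitrary initial distribution
cesaro = np.zeros(2)
n = 2000
for i in range(n):
    cesaro += mu                   # accumulate the marginals mu0 P^i
    mu = mu @ P
cesaro /= n

print(cesaro)                      # ~ (1, 0): the invariant distribution
assert np.allclose(cesaro, [1.0, 0.0], atol=1e-2)
```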
Proposition 6.1.5. Assuming that M1 is finite-state Markov, if (Ω1 , F1 , µ1 , T1 )
is a.m.s. and ergodic, then M1 admits a unique invariant distribution.
Proof. Obviously, the invariant distribution is induced from the invariant mean of
µ1 . It is unique because the invariant mean of µ1 is unique.
When M1 = M2 , it is easy to show that (Ω2 , F2 , µ2 , T2 ) becomes a factor of
(Ω1 , F1 , µ1 , T1 ), although (Ω1 , F1 , µ1 , T1 ) and (Ω2 , F2 , µ2 , T2 ) are not necessarily
isomorphic. Consequently,
Proposition 6.1.6. If M1 = M2 is a.m.s. (or stationary), then (Ω2 , F2 , µ2 , T2 )
is a.m.s. (or invariant).
Proof. The conclusion follows from Theorem 6.1.1, since (Ω2 , F2 , µ2 , T2 ) is a factor
of the a.m.s. (or invariant) dynamical system, say (Ω1 , F1 , µ1 , T1 ), defining M1 .
In fact, many properties of (not necessarily discrete) random processes are better described, and easier to analyse, via the underlying dynamical systems. Interested readers are referred to [Aar97, Gra09] for a systematic development of the theory.
6.2 Induced Transformations of A.M.S. Systems

6.2.1 Induced Transformations
For an invariant system (Ω, F, m, T) with a finite measure m, Poincaré's Recurrence Theorem guarantees that

m( B − ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}B ) = 0, ∀ B ∈ F.   (6.2.1)

As a consequence, for any A ∈ F (m(A) > 0), one can define a new transformation TA on (A0, A, m|_A), where A0 = A ∩ ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}A and A = { A0 ∩ B | B ∈ F }, such that

TA(x) = T^{ψ_A^{(1)}(x)}(x), ∀ x ∈ A0,

where

ψ_A^{(1)}(x) = min{ i ∈ N+ | T^i(x) ∈ A0 }

is the first return time function. Consequently, (A0, A, m|_A, TA) forms a new dynamical system. Such a transformation TA is called an induced transformation of (Ω, F, m, T) (with respect to A) [Kak43].
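For intuition, the sketch below (ours) computes the induced transformation on A = [0, 1/2) for the standard example of an irrational rotation, which is invariant and recurrent; the empirical equidistribution of the TA-orbit over A is consistent with TA preserving normalized Lebesgue measure on A.

```python
import math

ALPHA = math.sqrt(2) - 1                 # irrational rotation number
T = lambda x: (x + ALPHA) % 1.0          # Lebesgue-preserving rotation

def T_A(x, a=0.5):
    """Induced transformation on A = [0, a): iterate T until first return."""
    y = T(x)
    while y >= a:                        # the return time is a.s. finite here
        y = T(y)
    return y

# Empirical check: the T_A-orbit of a point equidistributes over A,
# consistent with T_A preserving (normalized) Lebesgue measure on A.
x, hits = 0.1, 0
N = 100_000
for _ in range(N):
    x = T_A(x)
    if x < 0.25:                         # B = [0, 1/4), half of A's measure
        hits += 1
print(hits / N)                          # ~ 0.5
```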
On the other hand, for an arbitrary a.m.s. dynamical system (Ω, F, µ, T), the situation (of defining the concept of induced transformation) becomes delicate, because (6.2.1) is not necessarily valid even for a finite measure µ, unless µ ≪ µ̄
[Gra09, Theorem 7.4]. Thus, there could be some A ∈ F of positive measure, such
that TA is not defined on any non-empty subset of A. To avoid a situation of this
sort, we shall focus on dynamical systems for which (6.2.1) holds.
Definition 6.2.1. A dynamical system (Ω, F, µ, T) is said to be recurrent (conservative) if

µ( B − ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}B ) = 0, ∀ B ∈ F.
Definition 6.2.2. In the setting of Definition 6.1.3, the random process M1 is said to be recurrent if (Ω1, F1, µ1, T1) is recurrent.
Proposition 6.2.1. A function of a recurrent random process is recurrent.
By Poincaré’s Recurrence Theorem,
Proposition 6.2.2. All stationary random processes are recurrent.
Remark 6.3. There are several equivalent definitions of recurrence (conservativeness). Please refer to [Gra09, Chapter 7.4] and [Aar97] for more details. The physical interpretation of recurrence (conservativeness) is that an event of positive probability is expected to repeat itself infinitely often during the lifetime of the dynamical system. Because of this physical meaning, recurrence is often assumed for ergodic systems in the literature [Aar97].
It is well-known that, for a recurrent invariant system (Ω, F , m, T ) with m being
σ-finite, (A0 , A , m|A , TA ) with 0 < m(A) < ∞ is invariant. Unfortunately, the
available proof of this result relies heavily on the invariance assumption. In other
words, for more general systems, e.g. a.m.s. systems, the case is not yet settled.
Thus, the sole purpose of this section is to prove that, if (Ω, F, µ, T) is a recurrent a.m.s. dynamical system with µ being σ-finite, then (A0, A, µ|_A, TA) is also a.m.s. for every A ∈ F with 0 < µ(A) < ∞. At the same time, a connection between the invariant mean of µ|_A and µ is established (see Theorem 6.2.1 and Theorem 6.2.4).
As a direct conclusion of this assertion, we have that the Shannon–McMillan–
Breiman (SMB) Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. random
process (see Section 6.3).
6.2.2 Finite Measure µ

We first prove the assertion for dynamical systems equipped with finite measures. To facilitate our discussion, we designate 1_A as the indicator function of a set A ⊆ Ω. To be precise, 1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∈ Ω − A.
Theorem 6.2.1. For a recurrent a.m.s. dynamical system (Ω, F, µ, T) with a finite measure µ and any A ∈ F with µ(A) > 0, (A0, A, µ|_A, TA) is a.m.s.. Moreover, the invariant mean µ̄|_A of µ|_A admits

µ̄|_A(B) = ∫_A (1̄_B / 1̄_A) dµ, ∀ B ∈ A.
Remark 6.4. The integral in Theorem 6.2.1 implicitly implies that 1̄_A ≠ 0 µ-a.e. on A, as we will prove later (see Lemma 6.2.2). Besides, as mentioned, 1̄_A = E_µ̄(1_A | I) and 1̄_B = E_µ̄(1_B | I), where I is the σ-algebra of T-invariant sets, µ-a.e. and µ̄-a.e. on Ω. Therefore,

µ̄|_A(B) = ∫_A ( E_µ̄(1_B | I) / E_µ̄(1_A | I) ) dµ, ∀ B ∈ A.
To prove Theorem 6.2.1, a couple of supporting lemmas are required.
Lemma 6.2.1. Let (Ω, F, µ, T) (µ is not necessarily finite) be an arbitrary dynamical system. For any A ⊆ Ω and x ∈ Ω for which the limit lim_{n→∞} (1/n) Σ_{i=0}^{n-1} 1_A(T^i(x)) exists, let

O = { ω ∈ Ω | 1̄_A(ω) = 0 }.

We have that the limit lim_{n→∞} (1/n) Σ_{i=0}^{n-1} 1_{A−O}(T^i(x)) exists and 1̄_{A−O}(x) = 1̄_A(x).
n i=0
Proof. By definition,
n−1
n−1
n−1
1X
1X
1X
1A T i (x) =
1A−O T i (x) +
1A∩O T i (x).
n i=0
n i=0
n i=0
1 Pn−1
1A∩O T i (x) constantly equals to 0.
n i=0
Otherwise, T i0 (x) ∈ A ∩ O for some i0 . Let k = min{i ∈ N|T i (x) ∈ A ∩ O} and
y = T k (x). Then, for all n > k, we have
If T i (x) ∈
/ A ∩ O for all i ∈ N, then
n−1
n−1
1X
1X
i
1A∩O T (x) =
1A∩O T i (x)
n i=0
n
i=k
=
≤
1
n
1
n
n−k−1
X
1A∩O T i (y)
i=0
n−k−1
X
1A T i (y)
i=0
→0, n → ∞,
since y ∈ O. Therefore,
1 Pn−1
1A−O T i (x) → 1A (x), n → ∞.
n i=0
Lemma 6.2.2. In Theorem 6.2.1, we have that

1̄_A ≠ 0 a.e. on A,

with respect to both µ and its invariant mean µ̄.

Proof. Let O = { ω ∈ Ω | 1̄_A(ω) = 0 }. We get

µ̄(A) = ∫_Ω 1̄_A dµ   (6.2.2)
     = ∫_Ω 1̄_{A−O} dµ   (6.2.3)
     = µ̄(A − O),   (6.2.4)

where (6.2.2) and (6.2.4) are due to the fact that (Ω, F, µ, T) is a.m.s. [Gra09, Corollary 7.9], and (6.2.3) follows from Lemma 6.2.1. Consequently, µ̄(A ∩ O) = 0. Since (Ω, F, µ, T) is a.m.s. and recurrent, we have that µ ≪ µ̄ by [Gra09, Theorem 7.4]. Therefore, µ(A ∩ O) = 0.
Proof of Theorem 6.2.1. For any x ∈ A0 and positive integer n, let

ψ_A^{(n)}(x) = Σ_{i=0}^{n-1} ψ_A^{(1)}(T_A^i(x)).

It is easy to see that ψ_A^{(n)} (the nth return time function) is well-defined since the system is recurrent. For any B ∈ A, we have that

(1/n) Σ_{i=0}^{n-1} µ(T_A^{-i}B) = ∫_{A0} (1/n) Σ_{i=0}^{n-1} 1_B(T_A^i(ω)) dµ(ω)
= ∫_A (1/n) Σ_{i=0}^{n-1} 1_B(T_A^i(ω)) dµ(ω)   (6.2.5)
= ∫_A (1/n) Σ_{i=0}^{ψ_A^{(n)}(ω)−1} 1_B(T^i(ω)) dµ(ω)
= ∫_A (ψ_A^{(n)}(ω)/n) · (1/ψ_A^{(n)}(ω)) Σ_{i=0}^{ψ_A^{(n)}(ω)−1} 1_B(T^i(ω)) dµ(ω),

where (6.2.5) follows because µ(A − A0) = 0 since the system is recurrent. Due to the fact that (Ω, F, µ, T) is a.m.s., it follows that

n/ψ_A^{(n)}(ω) = (1/ψ_A^{(n)}(ω)) Σ_{i=0}^{ψ_A^{(n)}(ω)−1} 1_A(T^i(ω)) → 1̄_A(ω) µ-a.e. and
(1/ψ_A^{(n)}(ω)) Σ_{i=0}^{ψ_A^{(n)}(ω)−1} 1_B(T^i(ω)) → 1̄_B(ω) µ-a.e.

as n → ∞. Let O = { ω ∈ Ω | 1̄_A(ω) = 0 }. We conclude that

lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ(T_A^{-i}B)
= lim_{n→∞} ∫_{A−O} (ψ_A^{(n)}(ω)/n) · (1/ψ_A^{(n)}(ω)) Σ_{i=0}^{ψ_A^{(n)}(ω)−1} 1_B(T^i(ω)) dµ(ω)   (6.2.6)
= ∫_{A−O} (1̄_B/1̄_A) dµ = ∫_A (1̄_B/1̄_A) dµ,   (6.2.7)

where (6.2.6) is due to the fact that µ(A ∩ O) = 0 by Lemma 6.2.2, and (6.2.7) follows from the Dominated Convergence Theorem [Rud86]. The theorem is established.
Corollary 6.2.1. If (Ω, F, µ, T) in Theorem 6.2.1 is ergodic, then

µ̄|_A(B) = µ(A)µ̄(B)/µ̄(A), ∀ B ∈ A.

Proof. If (Ω, F, µ, T) is ergodic, then 1̄_A = µ̄(A) and 1̄_B = µ̄(B) a.e. with respect to both µ and µ̄. The statement follows.
Remark 6.5. By Corollary 6.2.1, the system (A0, A, (1/µ(A)) µ|_A, TA) is a.m.s. and ergodic, and (1/µ(A)) µ|_A is a probability measure on (A0, A) whose invariant mean is (1/µ(A)) µ̄|_A.
For dynamical systems with finite measures, it is indeed quite natural to believe
that an induced transformation of a recurrent a.m.s. system is also a.m.s., hinted
by the fact that an induced transformation of an invariant system is invariant.
However, as seen from the above, the proof for the case of a.m.s. systems does not
follow naturally from the one for the invariant case [Aar97]. After all, the system
is no longer invariant.
6.2.3 σ-finite Measure µ
In the previous section, the assumption that µ is finite is important; it comes into play in many places in our argument. This assumption supports the use of the Dominated Convergence Theorem in the proof of Theorem 6.2.1, and it is also a requirement to guarantee convergence (µ-a.e.) of the sample mean of a bounded measurable real-valued function. Consequently, if instead µ is not finite, our method of proving Theorem 6.2.1 is not applicable. In this section, we will therefore prove our assertion for the case of a σ-finite measure based on a different approach, which involves the ratio ergodic theorem of [Hop70].

For convenience, we define S_n(f) to be the finite sum Σ_{i=0}^{n-1} f(T^i), for a given transformation T, non-negative integer n and real-valued function f.
Theorem 6.2.2 (Ratio Ergodic Theorem for Invariant Systems³). Let (Ω, F, m, T) be an invariant dynamical system with m being σ-finite. For any f, g ∈ L¹(m) such that g ≥ 0 and ∫_Ω g dm > 0, there exists a function h(f, g) : Ω → R such that

lim_{n→∞} S_n(f)/S_n(g) = h(f, g) m-a.e. on D = { ω ∈ Ω | sup_n S_n(g)(ω) = ∞ }.

Moreover, h(f, g) is T-invariant m-a.e. on D, it is I-measurable, where I ⊆ D ∩ F is the σ-algebra of T-invariant sets, and

∫_I f dm = ∫_I h(f, g) g dm, ∀ I ∈ I.
To our knowledge, the first⁴ general ergodic theorem for a.m.s. systems is the generalisation of Birkhoff's ergodic theorem [Bir31] presented in [GK80]. Coincidentally, there is a version of Hopf's ratio ergodic theorem for a.m.s. systems.
Theorem 6.2.3 (Ratio Ergodic Theorem for A.M.S. Systems). Given an a.m.s. dynamical system (Ω, F, µ, T) with µ being σ-finite, let µ̄ be the invariant mean of µ. For any f, g ∈ L¹(µ̄) such that g ≥ 0 and ∫_Ω g dµ̄ > 0, there exists a function h(f, g) : Ω → R such that

lim_{n→∞} S_n(f)/S_n(g) = h(f, g);  h(f, g) = h(f, g)T

a.e. on D = { ω ∈ Ω | sup_n S_n(g)(ω) = ∞ } with respect to both µ and µ̄. Moreover, if (Ω, F, µ, T) is ergodic, then either µ(D) = µ̄(D) = 0 or

h(f, g) = ∫_Ω f dµ̄ / ∫_Ω g dµ̄, µ-a.e. and µ̄-a.e. on Ω.

³ Hopf's ratio ergodic theorem for invariant systems is often presented differently in the literature, in each instance with different and delicate details. Readers are kindly referred to the related literature ([Hop70, Ste36, KK97, Zwe04], etc.) for more information.
⁴ An earlier ergodic theorem from [Hur44] works for systems that are not necessarily invariant. However, that result relies on some additional constraints which, to the best of our knowledge, hinder an extension to a.m.s. systems.
Proof. By Theorem 6.2.2,

lim_{n→∞} S_n(f)/S_n(g) = h(f, g) µ̄-a.e. on D,

for some function h(f, g) : Ω → R. Let

h* = limsup_{n→∞} S_n(f)/S_n(g),  h_* = liminf_{n→∞} S_n(f)/S_n(g),

and define D_l^u = { x ∈ D | h*(x) ≥ u, h_*(x) ≤ l } for all l, u ∈ Q. Obviously, D_l^u is T-invariant. Thus,

µ(D_l^u) = µ̄(D_l^u) = 0, ∀ l < u,

because h* = h_* µ̄-a.e. on D by Theorem 6.2.2. Consequently,

µ({ x ∈ D | h*(x) > h_*(x) }) = µ( ∪_{l<u} D_l^u ) ≤ Σ_{l<u} µ(D_l^u) = 0.

At every point x ∈ D where the limit lim_{n→∞} S_n(f)(x)/S_n(g)(x) exists, it is obvious that

lim_{n→∞} S_n(f)(x)/S_n(g)(x) = lim_{n→∞} S_n(f)(Tx)/S_n(g)(Tx).

Therefore, h(f, g) = h(f, g)T a.e. on D with respect to both µ and µ̄. The last statement is valid due to ergodicity.
Remark 6.6. In Theorem 6.2.3, ∫_Ω g dµ̄ > 0 can be replaced by ∫_Ω g dµ > 0 if the system is recurrent. This is because ∫_Ω g dµ > 0 ⟹ ∫_Ω g dµ̄ > 0 by Lemma 6.2.3.
Lemma 6.2.3. Given an a.m.s. dynamical system (Ω, F, µ, T) with µ being σ-finite, let µ̄ be the invariant mean of µ. If (Ω, F, µ, T) is recurrent, then µ ≪ µ̄.

Proof. For any B ∈ F such that µ̄(B) = 0, let B∞ = ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}B. We have that

0 = Σ_{j=0}^∞ µ̄(B) = Σ_{j=0}^∞ µ̄(T^{-j}B) ≥ µ̄( ∪_{j=0}^∞ T^{-j}B ) ≥ µ̄(B∞) ≥ 0.

Therefore, µ̄(B∞) = µ(B∞) = 0 since B∞ is T-invariant. Thus, µ(B) = µ(B − B∞). Moreover, µ(B − B∞) = 0 by the definition of recurrence. As a conclusion, µ(B) = 0.
Remark 6.7. Whenever µ is finite, the converse of Lemma 6.2.3 is also valid
[Gra09, Theorem 7.4]. However, it is not necessarily true for a non-finite measure
µ.
Theorem 6.2.4. For a recurrent a.m.s. dynamical system (Ω, F, µ, T) with µ being σ-finite and any A ∈ F with 0 < µ(A) < ∞, (A0, A, µ|_A, TA) is a.m.s.. In particular, the invariant mean µ̄|_A of µ|_A satisfies

µ̄|_A(B) = ∫_A h(1_B, 1_A) dµ, ∀ B ∈ A,

where h(1_B, 1_A) : Ω → R satisfies

h(1_B, 1_A) = lim_{n→∞} S_n(1_B)/S_n(1_A) a.e. on D = { ω ∈ Ω | sup_n S_n(1_A)(ω) = ∞ }   (6.2.8)

with respect to both µ and µ̄.
Proof. First of all,

∫_Ω 1_A dµ = µ(A) > 0 ⟹ ∫_Ω 1_A dµ̄ > 0

by Lemma 6.2.3. Furthermore, since µ̄(B) ≤ µ̄(A) < ∞ for any B ∈ A, we have that

∫_Ω 1_B dµ̄ = µ̄(B) = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ(T^{-i}B) < ∞ and
∫_Ω 1_A dµ̄ = µ̄(A) = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ(T^{-i}A) < ∞

by definition. Therefore, there exists a function h(1_B, 1_A) : Ω → R satisfying (6.2.8), based on Theorem 6.2.3. Moreover, we have that

(1/n) Σ_{i=0}^{n-1} µ(T_A^{-i}B) = ∫_{A0} (1/n) Σ_{i=0}^{n-1} 1_B(T_A^i(ω)) dµ(ω)
= ∫_{A0} ( S_{k_n}(1_B) / S_{k_n}(1_A) ) dµ(ω)   (where k_n(ω) = ψ_A^{(n)}(ω)).

Obviously, 0 ≤ h(1_B, 1_A) ≤ 1 µ-a.e. and µ̄-a.e. on D because 1_B ≤ 1_A, and A0 ⊆ D by the definitions of A0 and D. Since µ(A0) = µ(A) < ∞, the Dominated Convergence Theorem [Rud86] ensures that

µ̄|_A(B) = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} µ(T_A^{-i}B) = ∫_{A0} h(1_B, 1_A) dµ = ∫_A h(1_B, 1_A) dµ.

The statement is proved.
Remark 6.8. In the proof of Theorem 6.2.4, the condition µ(A) < ∞ cannot be dropped, since it ensures that 1_A ∈ L¹(µ̄), i.e. µ̄(A) < ∞.
Corollary 6.2.2. In Theorem 6.2.4, if (Ω, F, µ, T) is ergodic, then

µ̄|_A(B) = µ(A)µ̄(B)/µ̄(A), ∀ B ∈ A.

Proof. Since µ(D) ≥ µ(A0) = µ(A) > 0 and (Ω, F, µ, T) is ergodic, we have that µ(Ω − D) = 0 and

h(1_B, 1_A) = ∫_Ω 1_B dµ̄ / ∫_Ω 1_A dµ̄ = µ̄(B)/µ̄(A) µ-a.e., ∀ B ∈ A,

by Theorem 6.2.3. The conclusion follows.
6.3 Extended Shannon–McMillan–Breiman Theorem
Let (Ω, F, µ, T) be a dynamical system with µ being a probability measure, and let X be a random variable (a measurable function) with a finite sample space X defined on (Ω, F, µ). From Section 6.1.2, the random process

{Xi}_{i=0}^∞ = {X(T^i)}_{i=0}^∞

has distribution

p(x0, x1, ···, xn) = µ( ∩_{j=0}^n T^{-j}(X^{-1}(xj)) ), ∀ n ∈ N.
Theorem 6.3.1 (Shannon–McMillan Theorem [Sha48, McM53]). If {Xi}_{i=0}^∞ is stationary ergodic, i.e. (Ω, F, µ, T) is invariant ergodic, then

−(1/n) log p(X0, X1, ···, X_{n−1}) → h in L¹(µ),

where h = lim_{n→∞} (1/n) E(−log p(X0, X1, ···, X_{n−1})) = lim_{n→∞} (1/n) H(X0, X1, ···, X_{n−1}).
Theorem 6.3.2 (Shannon–McMillan–Breiman (SMB) Theorem [Bre57]). If {Xi}_{i=0}^∞ is stationary ergodic, i.e. (Ω, F, µ, T) is invariant ergodic, then

−(1/n) log p(X0, X1, ···, X_{n−1}) → h a.e.,

where h = lim_{n→∞} (1/n) E(−log p(X0, X1, ···, X_{n−1})) = lim_{n→∞} (1/n) H(X0, X1, ···, X_{n−1}).
Remark 6.9. The constant h in the above theorems is called the entropy rate of the process {Xi}_{i=0}^∞. It can be proved that h = lim_{n→∞} H(Xn | X0, X1, ···, X_{n−1}). In fact, the stationarity (invariance) condition can be further relaxed. Being a.m.s. is already sufficient.
Theorem 6.3.3 ([GK80, Corollary 4]). The SMB Theorem and the Shannon–McMillan Theorem hold for any a.m.s. process with finite state space.
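A numerical illustration of Theorem 6.3.3 (our sketch, with an arbitrarily chosen irreducible chain): starting the chain from a deterministic state makes the process a.m.s. but not stationary, and −(1/n) log p(X0, ···, X_{n−1}) still approaches the entropy rate h.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])        # stationary distribution: pi P = pi

# Entropy rate of the chain (base-2 logarithms).
h = -sum(pi[x] * P[x, y] * np.log2(P[x, y]) for x in range(2) for y in range(2))

n = 200_000
x = 1                               # deterministic start: a.m.s., not stationary
log_p = np.log2(1.0)                # Pr{X_0 = 1} = 1 under this start
for _ in range(n - 1):
    y = rng.choice(2, p=P[x])
    log_p += np.log2(P[x, y])
    x = y

print(-log_p / n, h)                # the two numbers should be close
```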
In addition to being a.m.s., assume that (Ω, F, µ, T) is also recurrent. Given a subset Y ⊆ X of positive probability, i.e. Pr{X ∈ Y} > 0, the reduced process {Yj}_{j=0}^∞ with sub-state space Y is defined to be

{Yj}_{j=0}^∞ = {X_{i_j}}_{j=0}^∞,

where i_j = min{ i ≥ 0 | Xi ∈ Y } for j = 0 and i_j = min{ i > i_{j−1} | Xi ∈ Y } for j > 0. It is of interest to know whether the SMB Theorem (the Shannon–McMillan Theorem) holds also for {Yj}_{j=0}^∞. Let A = X^{-1}(Y) and A0 = A ∩ ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}A. It is easily seen that

{Yj}_{j=0}^∞ = {X(T_A^j)}_{j=0}^∞

is essentially a random process defined on (A0, A0 ∩ F, (1/µ(A)) µ|_{A0∩F}, TA), which is a.m.s. by Theorem 6.2.1 (by Theorem 6.2.4 as well) and ergodic by [Aar97, Proposition 1.5.2]. As a conclusion, the SMB Theorem (the Shannon–McMillan Theorem) holds for the reduced process {Yj}_{j=0}^∞.
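Operationally, a reduced process is obtained from a sample path by discarding all samples outside Y, in order, as in this small sketch (ours); by the discussion above, SMB-type averaging applies to the subsequence so obtained.

```python
def reduce_path(xs, Y):
    """Reduced (sub-state-space) sequence: keep the samples in Y, in order."""
    return [x for x in xs if x in Y]

xs = ['a', 'b', 'a', 'c', 'b', 'b', 'a', 'c']
print(reduce_path(xs, {'b', 'c'}))   # ['b', 'c', 'b', 'b', 'c']
```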
Theorem 6.3.4 (Extended SMB Theorem). Given a recurrent a.m.s. ergodic dynamical system (Ω, F, µ, T) with probability measure µ, a family {B1, B2, ···, Bn, ···} ⊆ F and a measurable function X : Ω → X (X is finite), the SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all the processes

{ X(T_{B_i}^j) }_{j=0}^∞, i ∈ N+.

Proof. The statement follows from Theorem 6.2.1 (Theorem 6.2.4 as well) and Theorem 6.3.3.
Corollary 6.3.1. The SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. ergodic random process.

Proof. The statement follows from Theorem 6.3.4 by letting Y1, Y2, ···, Y_{2^{|X|}−1} be all the non-empty subsets of X, Bi = X^{-1}(Yi), ∀ 1 ≤ i ≤ 2^{|X|} − 1, and Bi = Ω, ∀ i > 2^{|X|} − 1.
6.A Appendix
6.A.1 Proof of Proposition 4.2.3
By Proposition 6.1.4, we have that {X^{(n)}} is a.m.s. ergodic, since it is irreducible. Moreover, for any function Γ, {Γ(X^{(n)})} is a.m.s. and ergodic by Proposition 6.1.3. Thus, the SMB Theorem holds for {Γ(X^{(n)})} by Theorem 6.3.3, i.e.

−(1/n) log Pr{ [Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})] } → H_{Γ,X} with probability 1.

In addition, we say that two functions Γ′ : X → D′ and Γ″ : X → D″ belong to the same class if Γ′ = πΓ″ for some bijection π : Γ″(X) → Γ′(X). Obviously, there are P classes of functions defined on X, where P is the number of all partitions⁵ of X. In the meanwhile, given any two functions Γ′ and Γ″ from the same class, it is obvious that H_{Γ′,X} = H_{Γ″,X} and

Pr{ Γ′(X^{(l)}) = Γ′(x^{(l)}), ∀ 1 ≤ l ≤ n } = Pr{ Γ″(X^{(l)}) = Γ″(x^{(l)}), ∀ 1 ≤ l ≤ n }

for any [x^{(1)}, x^{(2)}, ···, x^{(n)}] ∈ X^n. Let F be a set containing exactly one function from each of the P classes of functions defined on X. By definition, a sequence [x^{(1)}, x^{(2)}, ···, x^{(n)}] is contained in T_{H,ε}(n, P) if and only if

| −(1/n) log Pr{ Γ(X^{(l)}) = Γ(x^{(l)}), ∀ 1 ≤ l ≤ n } − H_{Γ,X} | < ε,

for all the P functions Γ ∈ F. Therefore,

Pr{ [X^{(1)}, X^{(2)}, ···, X^{(n)}] ∉ T_{H,ε}(n, P) }
= Pr{ ∪_Γ { | −(1/n) log Pr{[Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})]} − H_{Γ,X} | > ε } }
= Pr{ ∪_{Γ∈F} { | −(1/n) log Pr{[Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})]} − H_{Γ,X} | > ε } }
≤ Σ_{Γ∈F} Pr{ | −(1/n) log Pr{[Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})]} − H_{Γ,X} | > ε }
→ P × 0 = 0,

as n → ∞.

⁵ A partition of a set is a disjoint union of non-empty subsets of this set.
Chapter 7

Asymptotically Mean Stationary Ergodic Sources
Consider a finite-state Markov process M. If it is a.m.s. ergodic¹ (while not necessarily irreducible), then it admits a unique invariant distribution by Proposition 6.1.5. This invariant distribution is induced from the invariant mean of the a.m.s. ergodic dynamical system defining M. Moreover, all reduced (Markov) processes of M inherit the a.m.s. property (by Theorem 6.2.1) and ergodicity (by [Aar97, Proposition 1.5.2]) from M. Thus, every reduced (Markov) process of M admits a unique invariant distribution by Proposition 6.1.5.

On the other hand, recall from Chapter 4 and Chapter 5 that the irreducibility condition is important because it guarantees recursively that each and every reduced process of an irreducible Markov process admits an invariant distribution. This provides the theoretical support to establish the AEP of Supremus typical sequences and all related results. However, the a.m.s. ergodic condition is already sufficient to make such a recursive claim on the invariant distributions of all the reduced processes. Therefore, results from these two chapters can be easily extended to the a.m.s. ergodic case with the same arguments.

As a matter of fact, irreducibility is only a special realization of the recursive phenomenon characterized by the a.m.s. ergodic concept (Proposition 6.1.4). Hence, it is convincing to bring what we have established to the a.m.s. ergodic setting.

¹ We follow Definition 6.2.2 in this chapter in defining the term "ergodic". Be reminded that, for a Markov chain, the term "ergodic" has another definition given by Definition 4.1.5. They are not equivalent.
7.1 Supremus Typicality in the Weak Sense
Let {X^{(n)}} be a random process with a finite state space X. Define {X_Y^{(k)}} to be the reduced process {X^{(n_k)}}, where

n_k = min{ n ≥ 0 | X^{(n)} ∈ Y } for k = 0;  n_k = min{ n > n_{k−1} | X^{(n)} ∈ Y } for k > 0.

By the Kolmogorov Extension Theorem [Gra09, Theorem 3.3], {X^{(n)}} = {X(T^n)} for some dynamical system (Ω, F, µ, T) and measurable function X : Ω → X (see Section 6.1.2). Assume that (Ω, F, µ, T) is recurrent a.m.s. ergodic, and let A = X^{-1}(Y) and A0 = A ∩ ∩_{i=0}^∞ ∪_{j=i}^∞ T^{-j}A. It is easily seen that {X_Y^{(k)}} is essentially the random process {X(T_A^k)} defined on the system

(A0, A0 ∩ F, (1/µ(A)) µ|_{A0∩F}, TA),

which is also recurrent (by Definition 6.2.1), a.m.s. (by Theorem 6.2.1) and ergodic (by [Aar97, Proposition 1.5.2]). In the meantime, the SMB Theorem holds simultaneously for all the reduced processes {X_Y^{(k)}} by Corollary 6.3.1. In addition, let µ̄ be the invariant mean of µ, and let I_Y be the indicator function with respect to Y, i.e. I_Y(x) = 1 if x ∈ Y and I_Y(x) = 0 if x ∉ Y. The Point-wise Ergodic Theorem [Gra09, Theorem 8.1] states that, with probability 1,

lim_{|X|→∞} |X_Y|/|X| = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} I_Y(X(T^i)) = E_µ̄(I_Y(X)) = µ̄(X^{-1}(Y)) = µ̄(A).

This says that, given a sequence X generated by {X^{(n)}}, with high probability the reduced subsequence X_Y with respect to Y has probability close to

exp2(−|X_Y| H_Y) ≈ exp2[−n p(Y) H_Y],

where p(Y) = E_µ̄(I_Y(X)) and H_Y is the entropy rate of {X_Y^{(k)}}, when the length n of X is big enough. This motivates the following definition of Supremus typicality.
Definition 7.1.1 (Supremus Typicality in the Weak Sense). Let {X^{(n)}} be a recurrent a.m.s. ergodic process with a finite state space X. A sequence x ∈ X^n is said to be Supremus ε-typical with respect to {X^{(n)}} for some ε > 0, if ∀ ∅ ≠ Y ⊆ X,

p(Y) − ε < |x_Y|/n < p(Y) + ε;
|x_Y|(H_Y − ε) < −log p_Y(x_Y) < |x_Y|(H_Y + ε),

where p_Y and H_Y are the joint distribution and entropy rate of the reduced process {X_Y^{(k)}} of {X^{(n)}} with sub-state space Y, respectively. The set of all Supremus ε-typical sequences with respect to {X^{(n)}} in X^n is denoted by S_ε(n, {X^{(n)}}).
Obviously, Supremus typical sequences form a subset of classical typical sequences defined as follows.

Definition 7.1.2 (Typicality in the Weak Sense). Let {X^{(n)}} be an a.m.s. ergodic process with a finite state space X. A sequence x ∈ X^n is said to be ε-typical with respect to {X^{(n)}} for some ε > 0, if

n(H_X − ε) < −log p_X(x) < n(H_X + ε),

where p_X and H_X are the joint distribution and entropy rate of the process {X^{(n)}}. The set of all ε-typical sequences with respect to {X^{(n)}} in X^n is denoted by T_ε(n, {X^{(n)}}).
From the definitions, it is seen that Supremus typicality is a more restricted
concept. In other words, it features more characteristics of the original random
process. For example, Proposition 4.2.1 is also valid.
Proposition 7.1.1. Every reduced subsequence of a Supremus ε-typical sequence in the weak sense is Supremus ε-typical in the weak sense.
Unfortunately, the following example says otherwise for classical ones.
Example 7.1.1. Let {X^{(n)}} be an i.i.d. process with state space X = {α, β, γ} and distribution

p_X(α) = 997/1000;  p_X(β) = 2/1000;  p_X(γ) = 1/1000.

It is easy to verify that x = [α, α, ···, α, β, γ] ∈ X^1000 is 0.1-typical in the weak sense, i.e. x ∈ T_{0.1}(1000, {X^{(n)}}), because

| −(1/1000) log p_X(x) − H_X | = | −(1/1000) log(997/1000) + (1/1000) log(2/1000) | < 0.01 < 0.1.

However, for the reduced subsequence x_Y = [β, γ] ∈ Y² (Y = {β, γ}),

| −(1/2) log p_Y(x_Y) − H_Y | = | (1/6) log(2/3) − (1/6) log(1/3) | = 1/6 > 0.15 > 0.1.

Thus, x_Y is not Supremus 0.1-typical with respect to {X_Y^{(k)}} in the weak sense.
In the following, we present more properties, both classic and new, embraced by the new concept.
Proposition 7.1.2 (AEP of Weak Supremus Typicality). In Definition 7.1.1,

1. |S_ε(n, {X^{(n)}})| < exp2[n(H_X + ε)]; and

2. ∀ η > 0, there exists some positive integer N0 such that

Pr{ [X^{(1)}, X^{(2)}, ···, X^{(n)}] ∉ S_ε(n, {X^{(n)}}) } < η
and |S_ε(n, {X^{(n)}})| > (1 − η) exp2[n(H_X − ε)],

for all n > N0.
Proof.
1. First of all,

1 ≥ Σ_{x∈S_ε(n,{X^{(n)}})} p_X(x) > Σ_{x∈S_ε(n,{X^{(n)}})} exp2[−n(H_X + ε)] = |S_ε(n, {X^{(n)}})| exp2[−n(H_X + ε)].

Therefore, |S_ε(n, {X^{(n)}})| < exp2[n(H_X + ε)].

2. Let X = [X^{(1)}, X^{(2)}, ···, X^{(n)}]. We have that

{ X ∉ S_ε(n, {X^{(n)}}) } = ∪_{∅≠Y⊆X} { X_Y ∉ S_ε(|X_Y|, {X_Y^{(k)}}) }.

In the meanwhile, the Point-wise Ergodic Theorem [Gra09, Theorem 8.1] and Corollary 6.3.1 guarantee that, with probability 1,

|X_Y|/n → p(Y);  −(1/|X_Y|) log p_Y(X_Y) → H_Y

simultaneously for all ∅ ≠ Y ⊆ X. This implies that, for some positive integer N0,

Pr{ X ∉ S_ε(n, {X^{(n)}}) } = Pr{ ∪_{∅≠Y⊆X} { X_Y ∉ S_ε(|X_Y|, {X_Y^{(k)}}) } } < η,

∀ n > N0. Furthermore,

1 − η < Pr{ X ∈ S_ε(n, {X^{(n)}}) } < Σ_{x∈S_ε(n,{X^{(n)}})} exp2[−n(H_X − ε)] = |S_ε(n, {X^{(n)}})| exp2[−n(H_X − ε)].

Consequently, |S_ε(n, {X^{(n)}})| > (1 − η) exp2[n(H_X − ε)].

The statement is proved.
Remark 7.1. According to the AEP of Weak Supremus Typicality, stochastically speaking, the classical typical sequences that are not Supremus typical are as negligible as non-typical sequences.
Lemma 7.1.1. In Definition 7.1.1, for every partition² ∐_{j=1}^m Y_j of X and x = [x^{(1)}, x^{(2)}, ···, x^{(n)}] ∈ S_ε(n, {X^{(n)}}), the size of

S_ε( x, ∐_{j=1}^m Y_j ) = { [y^{(1)}, y^{(2)}, ···, y^{(n)}] ∈ S_ε(n, {X^{(n)}}) | y^{(l)} ∈ Y_j ⇔ x^{(l)} ∈ Y_j, ∀ 1 ≤ l ≤ n, ∀ 1 ≤ j ≤ m }

is strictly smaller than

exp2[ Σ_{j=1}^m |x_{Y_j}|(H_{Y_j} + ε) ] < exp2[ n( Σ_{j=1}^m p(Y_j)H_{Y_j} + (|X| + 1)ε ) ].
Proof. By Proposition 7.1.1, for any y ∈ S_ε(x, ∐_{j=1}^m Y_j), the reduced subsequence y_{Y_j} (1 ≤ j ≤ m) resides in S_ε(|x_{Y_j}|, {X_{Y_j}^{(k)}}). The number of all possible y_{Y_j}'s is upper bounded by

|S_ε(|x_{Y_j}|, {X_{Y_j}^{(k)}})| < exp2[ |x_{Y_j}|(H_{Y_j} + ε) ]

according to Proposition 7.1.2. Therefore,

|S_ε( x, ∐_{j=1}^m Y_j )| ≤ ∏_{j=1}^m |S_ε(|x_{Y_j}|, {X_{Y_j}^{(k)}})|
< ∏_{j=1}^m exp2[ |x_{Y_j}|(H_{Y_j} + ε) ]
= exp2[ Σ_{j=1}^m |x_{Y_j}|(H_{Y_j} + ε) ]
= exp2[ n( Σ_{j=1}^m (|x_{Y_j}|/n)(H_{Y_j} + ε) ) ]
< exp2[ n( Σ_{j=1}^m (p(Y_j) + ε)(H_{Y_j} + ε) ) ]
≤ exp2[ n( Σ_{j=1}^m p(Y_j)H_{Y_j} + (|X| + 1)ε ) ].

The statement is established.

² A partition of a set is a disjoint union of non-empty subsets of this set.
7.2 Hyper Supremus Typicality in the Weak Sense
Definition 7.2.1 (Hyper Supremus Typicality in the Weak Sense). Let {X^{(n)}} be a recurrent a.m.s. ergodic process with a finite state space X. A sequence [x^{(1)}, x^{(2)}, ···, x^{(n)}] ∈ X^n is said to be Hyper Supremus ε-typical with respect to {X^{(n)}} for some ε > 0, if [Γ(x^{(1)}), Γ(x^{(2)}), ···, Γ(x^{(n)})] is Supremus ε-typical with respect to {Γ(X^{(n)})} for all feasible functions Γ. The set of all Hyper Supremus ε-typical sequences with respect to {X^{(n)}} in X^n is denoted by H_ε(n, {X^{(n)}}).
One motivation for Definition 7.2.1 is to extend the definition of Supremus typicality from the "single process" case to the "joint processes" case, as is done for joint typicality [Cov75] in the classical sense. Given two processes {X_1^{(n)}} and {X_2^{(n)}} with state spaces X_1 and X_2, respectively, two sequences x_1 = [x_1^{(1)}, x_1^{(2)}, ···, x_1^{(n)}] ∈ X_1^n and x_2 = [x_2^{(1)}, x_2^{(2)}, ···, x_2^{(n)}] ∈ X_2^n are said to be jointly typical in the classical sense [Cov75] if both of them are classical typical and x = [x^{(1)}, x^{(2)}, ···, x^{(n)}], with x^{(i)} = [x_1^{(i)}, x_2^{(i)}]^tr, is classical typical. Here, x_1, as well as x_2, is just a function of x. Therefore, if x is Hyper Supremus typical, then both x_1 and x_2 are necessarily Supremus typical; hence, they are jointly typical in the classical sense.
Proposition 7.2.1. A function of a Hyper Supremus ε-typical sequence is Hyper Supremus ε-typical. A reduced subsequence of a Hyper Supremus ε-typical sequence is Hyper Supremus ε-typical.
It is well-known that a function of a classical typical sequence in the weak sense is not necessarily classical typical (unless typicality is defined in the strong sense for i.i.d. settings [Yeu08, Chapter 6.3]). Nevertheless, Proposition 7.2.1 states differently for Hyper Supremus typical sequences. In this regard, we see that Definition 7.2.1 embraces more features beyond characterizing the "joint effects."
Proposition 7.2.2 (AEP of Weak Hyper Supremus Typicality). In Definition 7.2.1,

1. |H_ε(n, {X^{(n)}})| < exp2[n(H_X + ε)]; and

2. ∀ η > 0, there exists some positive integer N0 such that

Pr{ [X^{(1)}, X^{(2)}, ···, X^{(n)}] ∉ H_ε(n, {X^{(n)}}) } < η
and |H_ε(n, {X^{(n)}})| > (1 − η) exp2[n(H_X − ε)],

for all n > N0.
Proof.
1. |H_ε(n, {X^{(n)}})| ≤ |S_ε(n, {X^{(n)}})| < exp2[n(H_X + ε)].

2. We say that two functions Γ′ : X → D′ and Γ″ : X → D″ belong to the same class if Γ′ = πΓ″ for some bijection π : Γ″(X) → Γ′(X). Obviously, there are P classes of functions defined on X, where P is the number of all partitions of X. For any two functions Γ′ and Γ″ from the same class, [Γ′(x^{(1)}), Γ′(x^{(2)}), ···, Γ′(x^{(n)})] is Supremus ε-typical if and only if [Γ″(x^{(1)}), Γ″(x^{(2)}), ···, Γ″(x^{(n)})] is Supremus ε-typical. On the other hand, fixing a function, say Γ, {Γ(X^{(n)})} is recurrent a.m.s. ergodic by Proposition 6.1.3 and Proposition 6.2.1. Therefore, there exists some N_Γ > 0 such that

Pr{ [Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})] ∉ S_ε(n, {Γ(X^{(n)})}) } < η/P,

for all n > N_Γ, as claimed by Proposition 7.1.2. Let F be the set containing exactly one function from each of the P classes of functions defined on X. We have that

Pr{ [X^{(1)}, X^{(2)}, ···, X^{(n)}] ∉ H_ε(n, {X^{(n)}}) }
= Pr{ ∪_Γ { [Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})] ∉ S_ε(n, {Γ(X^{(n)})}) } }
= Pr{ ∪_{Γ∈F} { [Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})] ∉ S_ε(n, {Γ(X^{(n)})}) } }
≤ Σ_{Γ∈F} Pr{ [Γ(X^{(1)}), Γ(X^{(2)}), ···, Γ(X^{(n)})] ∉ S_ε(n, {Γ(X^{(n)})}) }
< (η/P) × |F| = η

for all n > N0 = max_{Γ∈F}{N_Γ}. In addition,

1 − η < Pr{ [X^{(1)}, X^{(2)}, ···, X^{(n)}] ∈ H_ε(n, {X^{(n)}}) }
< Σ_{x∈H_ε(n,{X^{(n)}})} exp2[−n(H_X − ε)]
= |H_ε(n, {X^{(n)}})| exp2[−n(H_X − ε)].

Consequently, |H_ε(n, {X^{(n)}})| > (1 − η) exp2[n(H_X − ε)].

The statement is proved.
Lemma 7.2.1. In Definition 7.2.1, for every partition ∐_{j=1}^m Y_j of X and x = [x^{(1)}, x^{(2)}, ···, x^{(n)}] ∈ H_ε(n, {X^{(n)}}), the size of

H_ε( x, ∐_{j=1}^m Y_j ) = { [y^{(1)}, y^{(2)}, ···, y^{(n)}] ∈ H_ε(n, {X^{(n)}}) | y^{(l)} ∈ Y_j ⇔ x^{(l)} ∈ Y_j, ∀ 1 ≤ l ≤ n, ∀ 1 ≤ j ≤ m }

is strictly smaller than

exp2[ Σ_{j=1}^m |x_{Y_j}|(H_{Y_j} + ε) ] < exp2[ n( Σ_{j=1}^m p(Y_j)H_{Y_j} + (|X| + 1)ε ) ].

Proof. Notice that H_ε(x, ∐_{j=1}^m Y_j) ⊆ S_ε(x, ∐_{j=1}^m Y_j). The proof follows from Lemma 7.1.1.
NOTATION: We have defined H_X and H_Y to be the entropy rates of the random process {X^{(n)}} and its reduced process {X_Y^{(k)}}, respectively. However, given an arbitrary function Γ : X → Y, even though {Γ(X^{(n)})} has state space Y as well, the entropy rate of {Γ(X^{(n)})} is not necessarily equal to H_Y. This is because the distributions and the underlying dynamical systems defining {X_Y^{(k)}} and {Γ(X^{(n)})} are different. To avoid confusion, we denote by H_{Γ,X} the entropy rate of {Γ(X^{(n)})}.
Lemma 7.2.2. In Lemma 7.2.1, define Γ(x) = l ⇔ x ∈ Y_l. We have that

|H_ε( x, ∐_{j=1}^m Y_j )| < exp2[ n(H_X − H_{Γ,X} + 2ε) ].

Proof. Let Z_1^{(n)} = Γ(X^{(n)}) and Z_2^{(n)} = [X^{(n)}, Γ(X^{(n)})]^tr (n ∈ N). By Proposition 6.1.3, {Z_1^{(n)}} and {Z_2^{(n)}} are recurrent a.m.s. ergodic since they are functions of {X^{(n)}}. Define z_1 to be [z_1^{(1)}, z_1^{(2)}, ···, z_1^{(n)}], where z_1^{(i)} = Γ(x^{(i)}). For any y = [y^{(1)}, y^{(2)}, ···, y^{(n)}] ∈ H_ε(x, ∐_{j=1}^m Y_j), we have that

[Γ(y^{(1)}), Γ(y^{(2)}), ···, Γ(y^{(n)})] = z_1

by definition. Therefore,

Pr{ X = y | Z_1 = z_1 } = p_X(y)/p_{Z_1}(z_1) = p_{Z_2}(z_2)/p_{Z_1}(z_1),

where z_2 = [z_2^{(1)}, z_2^{(2)}, ···, z_2^{(n)}] and z_2^{(i)} = [y^{(i)}, z_1^{(i)}]^tr. On the other hand, both z_1 and z_2 are Hyper Supremus ε-typical since they are functions of the Hyper Supremus ε-typical sequence y. Hence,

Pr{ X = y | Z_1 = z_1 } = p_X(y)/p_{Z_1}(z_1)
> exp2[−n(H_X + ε)] / exp2[−n(H_{Z_1} − ε)]   (by Proposition 7.2.2)
= exp2[−n(H_X + ε)] / exp2[−n( lim_{l→∞} (1/l) H(Z_1^{(1)}, Z_1^{(2)}, ···, Z_1^{(l)}) − ε )]
= exp2[−n(H_X − H_{Γ,X} + 2ε)].

Consequently,

1 ≥ Σ_{y∈H_ε(x,∐_{j=1}^m Y_j)} Pr{ X = y | Z_1 = z_1 } > |H_ε( x, ∐_{j=1}^m Y_j )| exp2[−n(H_X − H_{Γ,X} + 2ε)].

The statement follows.
Remark 7.2. At this point, we possess the right background to unveil the essential differences between the mechanisms of the First Proof and the Second Proof of Lemma 2.1.2. The First Proof resembles the argument given to prove Lemma 7.2.1: it rests on the property of the reduced subsequences, which are modelled by corresponding reduced processes. On the other hand, the Second Proof, like the proof of Lemma 7.2.2, is based on the property characterized by the "joint effect" of the original process and one of its functions. It is a coincidence that both proofs lead to the same conclusion in Lemma 2.1.2. For more universal settings, as given in Lemma 7.2.1 and Lemma 7.2.2, the results diverge. The effect of the differences is reflected by the notations H_{Y_j} and H_{Γ,X} in the respective upper bounds. Actually, the same analytical differences can also be found by inspecting Lemma 4.2.1 and Lemma 4.2.2 (as well as Lemma 4.A.1 and Lemma 4.A.2).
Given an index set S = {1, 2, ···, s}, define the projective function π_T (∅ ≠ T ⊆ S) to be the mapping that maps an S-indexed array, say [a_1, a_2, ···, a_s]^tr, to the T-indexed array [a_{i_1}, a_{i_2}, ···, a_{i_{|T|}}]^tr, where i_j ∈ T.
Lemma 7.2.3. In Definition 7.2.1, assume that X^{(n)} = [X_1^{(n)}, X_2^{(n)}]^tr (n ∈ N) and x = [x_1, x_2]^tr = [[x_1^{(1)}, x_1^{(2)}, ···, x_1^{(n)}], [x_2^{(1)}, x_2^{(2)}, ···, x_2^{(n)}]]^tr ∈ H_ε(n, {X^{(n)}}). For any partition ∐_{j=1}^m Y_j of X_1, the size of

H_ε( x_1, ∐_{j=1}^m Y_j | x_2 ) = { [[y^{(1)}, y^{(2)}, ···, y^{(n)}], [x_2^{(1)}, x_2^{(2)}, ···, x_2^{(n)}]]^tr ∈ H_ε(n, {X^{(n)}}) | y^{(l)} ∈ Y_j ⇔ x_1^{(l)} ∈ Y_j, ∀ 1 ≤ l ≤ n, ∀ 1 ≤ j ≤ m }

is strictly smaller than

exp2[ Σ_{j=1}^m |x_{1,Y_j}|(H_{Z_j} − H_{π_2,Z_j} + 2ε) ] < exp2[ n( Σ_{j=1}^m p(Z_j)(H_{Z_j} − H_{π_2,Z_j}) + (|X| + 2)ε ) ],

where Z_j = [Y_j, X_2]^tr.
Remark 7.3. Assume that X := X_1 × X_2 = R_1 × R_2 (R_1 and R_2 are finite rings). For any left ideal I ≤_l R_1, we have that R_1/I = {J_1, J_2, ···, J_m}, where m = |R_1|/|I| and the J_j's are disjoint cosets. Thus, R_1/I defines a partition of X_1 = R_1. From this, we can define H_ε(x_1, R_1/I | x_2) to be H_ε(x_1, ∐_{j=1}^m J_j | x_2) for all x_1 ∈ R_1^n and x_2 ∈ R_2^n. To be specific, if x_1 = [x_1^{(1)}, x_1^{(2)}, ···, x_1^{(n)}] and x_2 = [x_2^{(1)}, x_2^{(2)}, ···, x_2^{(n)}], then H_ε(x_1, R_1/I | x_2) contains all Hyper Supremus ε-typical sequences, say [[y_1^{(1)}, y_1^{(2)}, ···, y_1^{(n)}], [x_2^{(1)}, x_2^{(2)}, ···, x_2^{(n)}]]^tr, such that y_1^{(i)} ∈ x_1^{(i)} + I for all 1 ≤ i ≤ n.
Proof of Lemma 7.2.3. For any x = [x_1, x_2]^tr ∈ H_ε(n, {X^{(n)}}), x_{Z_j} is Hyper Supremus ε-typical of length |x_{1,Y_j}| by Proposition 7.2.1. Consider the partition ∐_{x_2∈X_2} [Y_j, {x_2}]^tr of Z_j. We have that

{ z_{Z_j} | z ∈ H_ε(x_1, ∐_{j=1}^m Y_j | x_2) } ⊆ H_ε( x_{Z_j}, ∐_{x_2∈X_2} [Y_j, {x_2}]^tr ).

Thus, by Lemma 7.2.2,

|{ z_{Z_j} | z ∈ H_ε(x_1, ∐_{j=1}^m Y_j | x_2) }| ≤ |H_ε( x_{Z_j}, ∐_{x_2∈X_2} [Y_j, {x_2}]^tr )| < exp2[ |x_{1,Y_j}|(H_{Z_j} − H_{π_2,Z_j} + 2ε) ],

since the reduced process {X_{Z_j}^{(n)}} is recurrent a.m.s. ergodic. Consequently,

|H_ε( x_1, ∐_{j=1}^m Y_j | x_2 )| ≤ ∏_{j=1}^m |{ z_{Z_j} | z ∈ H_ε(x_1, ∐_{j=1}^m Y_j | x_2) }|
< ∏_{j=1}^m exp2[ |x_{1,Y_j}|(H_{Z_j} − H_{π_2,Z_j} + 2ε) ]
= exp2[ Σ_{j=1}^m |x_{1,Y_j}|(H_{Z_j} − H_{π_2,Z_j} + 2ε) ]
= exp2[ n( Σ_{j=1}^m (|x_{Z_j}|/n)(H_{Z_j} − H_{π_2,Z_j} + 2ε) ) ]
< exp2[ n( Σ_{j=1}^m (p(Z_j) + ε)(H_{Z_j} − H_{π_2,Z_j} + 2ε) ) ]
< exp2[ n( Σ_{j=1}^m p(Z_j)(H_{Z_j} − H_{π_2,Z_j}) + (|X| + 2)ε ) ].

The statement is proved.
Lemma 7.2.4. In Lemma 7.2.3, let Γ : X → ∪_{j=1}^m ∪_{x_2∈X_2} [Y_j, {x_2}]^tr be given as

Γ : [x_1, x_2]^tr ↦ [Y_j, {x_2}]^tr if x_1 ∈ Y_j.

We have that

|H_ε( x_1, ∐_{j=1}^m Y_j | x_2 )| < exp2[ n(H_X − H_{Γ,X} + 2ε) ].
Proof. Obviously, ∐_{j=1}^m ∐_{x_2∈X_2} [Y_j, {x_2}]^tr is a partition of X. In addition,

H_ε( x_1, ∐_{j=1}^m Y_j | x_2 ) = H_ε( [x_1, x_2]^tr, ∐_{j=1}^m ∐_{x_2∈X_2} [Y_j, {x_2}]^tr )

by definition. Thus, the statement follows from Lemma 7.2.2.
For special cases, e.g. m = 1 or {X^{(n)}} being i.i.d., the exponents of the two bounds given in Lemma 7.2.1 and Lemma 7.2.2, as well as the two given in Lemma 7.2.3 and Lemma 7.2.4, are equal (up to a difference of several ε's). In general, we cannot determine which one is tighter at this point. However, the first one is often more accessible and easier to evaluate. For instance, if {X^{(n)}} is Markov, then the reduced process {X_Y^{(k)}} is also Markov for all ∅ ≠ Y ⊆ X. As a consequence, the upper bound given by Lemma 7.2.1 is easily evaluated, because p is simply the invariant distribution and the entropy rates H_{Y_j} can be obtained easily from the transition matrix and p (see Chapter 4). In contrast, the bound given by Lemma 7.2.2 is significantly more complicated, mainly because {Γ(X^{(n)})} is not necessarily Markov, which makes it significantly harder to evaluate the entropy rate H_{Γ,X}.
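For the Markov case just mentioned, one convenient way to evaluate H_Y (our sketch; we use the classical censored-chain formula for the watched process, Q = P_YY + P_YZ (I − P_ZZ)^{-1} P_ZY with Z = X − Y, which may differ in form from the Chapter 4 derivation) is:

```python
import numpy as np

def reduced_entropy_rate(P, Y):
    """Entropy rate (bits) of the process watched on the state subset Y,
    via the censored-chain formula Q = P_YY + P_YZ (I - P_ZZ)^{-1} P_ZY."""
    Y = sorted(Y)
    Z = [i for i in range(len(P)) if i not in Y]
    Q = P[np.ix_(Y, Y)] + P[np.ix_(Y, Z)] @ np.linalg.inv(
        np.eye(len(Z)) - P[np.ix_(Z, Z)]) @ P[np.ix_(Z, Y)]
    # Invariant distribution of Q (left eigenvector for eigenvalue 1).
    w, v = np.linalg.eig(Q.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi = pi / pi.sum()
    return -sum(pi[i] * Q[i, j] * np.log2(Q[i, j])
                for i in range(len(Y)) for j in range(len(Y)) if Q[i, j] > 0)

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
print(reduced_entropy_rate(P, [0, 1]))   # H_Y for Y = {0, 1}
```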
7.3 Linear Coding over Finite Rings for A.M.S. Sources
Let R_i (i ∈ S = {1, 2, ···, s}) be a finite ring and let {X^{(n)}}, where X^{(n)} = [X_1^{(n)}, X_2^{(n)}, ···, X_s^{(n)}]^tr, be a random process with state space R = ∏_{i=1}^s R_i. In this section, we are to establish the following achievability theorems of LCoR.
Theorem 7.3.1. (R_1, R_2, ···, R_s) satisfying, ∀ ∅ ≠ T ⊆ S and ∀ 0 ≠ I_i ≤_l R_i,

Σ_{i∈T} (R_i log|I_i|) / log|R_i| > min{ Σ_{Z∈(R_T/I_T)×R_{T^c}} p(Z)(H_Z − H_{π_{T^c},Z}), H_R − H_{Γ_{I_T},R} },

where Γ_{I_T} : (r_T, r_{T^c})^tr ↦ (r_T + I_T, r_{T^c})^tr, is achievable by linear coding over R_1, R_2, ···, R_s.
Corollary 7.3.1. Let

r = max_{0≠I≤_l R} (log|R| / log|I|) min{ Σ_{Z∈R/I} p(Z)H_Z, H_R − h_I },

where h_I is the entropy rate of {X^{(n)} + I}. R > r is achievable by linear coding over the ring R.
Proof of Theorem 7.3.1. Designate

r(T, I_T) = min{ Σ_{Z∈(R_T/I_T)×R_{T^c}} p(Z)(H_Z − H_{π_{T^c},Z}), H_R − H_{Γ_{I_T},R} },

and let k_i = ⌊nR_i / log|R_i|⌋, where n is the length of the data sequences. By definition,

(1/n) Σ_{i∈T} k_i log|I_i| − r(T, I_T) > 2η   (7.3.1)

for some small constant η > 0 and large enough n, ∀ ∅ ≠ T ⊆ S, ∀ 0 ≠ I_i ≤_l R_i. We claim that (R_1, R_2, ···, R_s) is achievable by linear coding over R_1, R_2, ···, R_s based on the following proof.
Encoding:
For every i ∈ S, randomly generate a k_i × n matrix A_i based on a uniform distribution, i.e. independently choose each entry of A_i uniformly at random from R_i. Define a linear encoder φ_i : R_i^n → R_i^{k_i} such that

φ_i : x ↦ A_i x, ∀ x ∈ R_i^n.

Obviously the coding rate of this encoder is

(1/n) log|φ_i(R_i^n)| ≤ (k_i/n) log|R_i| = (1/n) ⌊nR_i/log|R_i|⌋ log|R_i| ≤ R_i.
Decoding:
Upon observing y_i ∈ R_i^{k_i} (i ∈ S) from the ith encoder, the decoder claims that x = (x_1, x_2, ···, x_s)^tr ∈ ∏_{i=1}^s R_i^n is the array of the encoded data sequences if and only if:

1. x ∈ H_ε(n, {X^{(n)}}); and
2. ∀ x′ = (x_1′, x_2′, ···, x_s′)^tr ∈ H_ε(n, {X^{(n)}}), if x′ ≠ x, then φ_j(x_j′) ≠ y_j for some j.
Error:
Assume that X_i ∈ R_i^n (i ∈ S) is the original data sequence generated by the ith source. It is readily seen that an error occurs if and only if one of the following events occurs:

E1: X = (X_1, X_2, ···, X_s)^tr ∉ H_ε(n, {X^{(n)}});
E2: there exists X ≠ (x_1′, x_2′, ···, x_s′)^tr ∈ H_ε(n, {X^{(n)}}) such that φ_i(x_i′) = φ_i(X_i), ∀ i ∈ S.
Error Probability:
By the AEP of Weak Hyper Supremus Typicality (Proposition 7.2.2), Pr{E1} → 0 as n → ∞. Meanwhile, for ∅ ≠ T ⊆ S and 0 ≠ I ≤_l R_T, let

D(X; T) = { (x_1′, x_2′, ···, x_s′)^tr ∈ H_ε(n, {X^{(n)}}) | x_i′ ≠ X_i, ∀ i ∈ T, and x_i′ = X_i, ∀ i ∈ T^c }

and D(X_T, I | X_{T^c}) = H_ε(X_T, R_T/I | X_{T^c}) \ {X}, where X_T = ∏_{i∈T} X_i and X_{T^c} = ∏_{i∈T^c} X_i. We have

D(X; T) = ∪_{0≠I≤_l R_T} D(X_T, I | X_{T^c}),   (7.3.2)

since I goes over all possible non-trivial left ideals. In addition,

|D(X_T, I | X_{T^c})| = |H_ε(X_T, R_T/I | X_{T^c})| − 1
< min{ exp2[ n( Σ_{Z∈(R_T/I_T)×R_{T^c}} p(Z)(H_Z − H_{π_{T^c},Z}) + (|R| + 2)ε ) ], exp2[ n( H_R − H_{Γ_{I_T},R} + 2ε ) ] } − 1
≤ exp2[ n( r(T, I_T) + (|R| + 2)ε ) ] − 1   (7.3.3)

by Lemma 7.2.3 and Lemma 7.2.4. Consequently,

Pr{ E2 | E1^c } = Σ_{(x_1′,···,x_s′)^tr ∈ H_ε(n,{X^{(n)}})\{X}} ∏_{i∈S} Pr{ φ_i(x_i′) = φ_i(X_i) | E1^c }
= Σ_{∅≠T⊆S} Σ_{(x_1′,···,x_s′)^tr ∈ D(X;T)} ∏_{i∈T} Pr{ φ_i(x_i′) = φ_i(X_i) | E1^c }   (7.3.4)
≤ Σ_{∅≠T⊆S} Σ_{0≠I≤_l R_T} Σ_{(x_1′,···,x_s′)^tr ∈ D(X_T,I|X_{T^c})} ∏_{i∈T} Pr{ φ_i(x_i′) = φ_i(X_i) | E1^c }   (7.3.5)
< Σ_{∅≠T⊆S} Σ_{0≠∏_{i∈T}I_i≤_l R_T} [ exp2[ n( r(T, I_T) + η ) ] − 1 ] ∏_{i∈T} |I_i|^{−k_i}   (7.3.6)
< (2^s − 1)(2^{|R|} − 2) × max_{∅≠T⊆S, 0≠∏_{i∈T}I_i≤_l R_T} exp2[ −n( (1/n) Σ_{i∈T} k_i log|I_i| − (r(T, I_T) + η) ) ]   (7.3.7)
< (2^s − 1)(2^{|R|} − 2) × exp2[−nη],   (7.3.8)

where

(7.3.4) is from the fact that H_ε(n, {X^{(n)}}) \ {X} = ∐_{∅≠T⊆S} D(X; T) (disjoint union);

(7.3.5) follows from (7.3.2) by Boole's inequality [Boo10, Fré35];

(7.3.6) is from (7.3.3) and Lemma 2.1.1, as well as the fact that every left ideal of R_T is a Cartesian product of some left ideals I_i of R_i, i ∈ T (see Proposition 1.1.3). At the same time, ε is required to be smaller than η/(|R| + 2);

(7.3.7) is due to the facts that the number of non-empty subsets of S is 2^s − 1 and the number of non-trivial left ideals of the finite ring R_T is less than 2^{|R|} − 1, which is the number of non-empty subsets of R;

(7.3.8) is from (7.3.1).

Thus, for all ε ≤ η/(|R| + 2), Pr{E2 | E1^c} → 0 as n → ∞ by (7.3.8), since, for sufficiently large n, (1/n) Σ_{i∈T} k_i log|I_i| − [r(T, I_T) + η] > η > 0. Therefore, Pr{E1 ∪ E2} = Pr{E1} + Pr{E1^c} Pr{E2 | E1^c} → 0 as ε → 0 and n → ∞.
Theorem 7.3.2. In Problem 5.1, let ĝ = h(Σ_{i=1}^s k_i) be a polynomial presentation of g over the ring R, and let X^{(n)} = Σ_{i=1}^s k_i(X_i^{(n)}) (note: this defines a random process {X^{(n)}} with state space R). If {X_S^{(n)}} is recurrent a.m.s. ergodic, then (R_1, R_2, ···, R_s) satisfying

R_i > max_{0≠I≤_l R} (log|R| / log|I|) min{ Σ_{Z∈R/I} p(Z)H_Z, H_R − h_I },

where h_I is the entropy rate of {X^{(n)} + I}, is achievable for encoding g.
Proof. By Proposition 6.1.3 and Proposition 6.2.1, {X^{(n)}} is recurrent a.m.s. ergodic since {X_S^{(n)}} is recurrent a.m.s. ergodic. Therefore, ∀ ε > 0, there exist a large enough n, an m × n matrix A ∈ R^{m×n} and a decoder ψ such that Pr{X^n ≠ ψ(AX^n)} < ε, provided that

m > max_{0≠I≤_l R} (n / log|I|) min{ Σ_{Z∈R/I} p(Z)H_Z, H_R − h_I },

by Corollary 7.3.1. Let φ_i = A ∘ ~k_i (1 ≤ i ≤ s) be the encoder of the ith source. Upon receiving φ_i(X_i^n) from the ith source, the decoder claims that ~h(X̂^n), where X̂^n = ψ[Σ_{i=1}^s φ_i(X_i^n)], is the function, namely ĝ, subject to computation. The probability of decoding error is

Pr{ ~h[~k(X_1^n, X_2^n, ···, X_s^n)] ≠ ~h(X̂^n) }
≤ Pr{ X^n ≠ X̂^n }
= Pr{ X^n ≠ ψ[ Σ_{i=1}^s φ_i(X_i^n) ] }
= Pr{ X^n ≠ ψ[ Σ_{i=1}^s A~k_i(X_i^n) ] }
= Pr{ X^n ≠ ψ[ A Σ_{i=1}^s ~k_i(X_i^n) ] }
= Pr{ X^n ≠ ψ[ A~k(X_1^n, X_2^n, ···, X_s^n) ] }
= Pr{ X^n ≠ ψ(AX^n) } < ε.

Therefore, every (R_1, R_2, ···, R_s) ∈ R^s with

R_i = (m log|R|)/n > max_{0≠I≤_l R} (log|R| / log|I|) min{ Σ_{Z∈R/I} p(Z)H_Z, H_R − h_I }

is achievable, i.e. (R_1, R_2, ···, R_s) ∈ R[ĝ] ⊆ R[g].
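The crucial step in this chain of equalities is the interchange Σ_i A~k_i(X_i^n) = A Σ_i ~k_i(X_i^n), i.e. linearity of A over the ring. The sketch below (ours) checks this mechanism in the classical Körner–Marton special case R = Z_2 with g the modulo-2 sum; the decoder ψ itself is not implemented, only the syndrome identity it relies on.

```python
import numpy as np

rng = np.random.default_rng(3)
q, n, k = 2, 12, 5                       # R = Z_2, Korner-Marton setting
A = rng.integers(0, q, size=(k, n))      # one common random linear encoder

x1 = rng.integers(0, q, size=n)          # source 1's sequence
x2 = rng.integers(0, q, size=n)          # source 2's sequence

# Each source sends only its own syndrome; summing them at the decoder
# yields the syndrome of the modulo-2 sum, thanks to linearity over Z_2:
lhs = (A @ x1 + A @ x2) % q
rhs = (A @ ((x1 + x2) % q)) % q
assert np.array_equal(lhs, rhs)
# A decoder psi for the (typically low-entropy) sum x1 (+) x2 can then
# recover it from this single length-k syndrome, as in the proof above.
```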
Chapter 8

Conclusion

8.1 Summary
This thesis first presented a coding theorem of linear coding over finite rings (LCoR) for correlated i.i.d. data compression. This theorem covers the corresponding achievability theorems of Elias [Eli55] and Csiszár [Csi82] for linear coding over finite fields as special cases. In addition, it was shown that, for any set of finite correlated discrete memoryless sources, there always exists a sequence of linear encoders over some finite non-field rings which achieves the data compression limit, the Slepian–Wolf region. Hence, the optimality problem regarding linear coding over finite non-field rings for data compression is closed with positive confirmation with respect to existence.
As an application, we addressed the problem of encoding functions of sources
where the decoder is interested in recovering a discrete function of the data generated and independently encoded by several correlated i.i.d. random sources. We
proposed linear coding over finite rings as an alternative solution to this problem.
Results of Körner–Marton [KM79] and Ahlswede–Han [AH83, Theorem 10] on encoding the binary sum were generalised to the encoding of certain polynomial functions over rings. Since a discrete function with a finite domain always admits such a polynomial presentation, we concluded that both generalisations universally apply to encoding all discrete functions of finite domains. Based on these,
we demonstrated that linear coding over finite rings strictly outperforms its field
counterpart in terms of achieving better coding rates and reducing the required
alphabet sizes of the encoders for encoding many discrete functions.
In order to generalise the above results to Markov source and a.m.s. source settings, we introduced the concept of Supremus typicality. It was shown that Supremus typicality is stronger, in terms of characterising the ergodic behaviours of random sequences, than classical Shannon typicality. Moreover, it possesses better properties that give rise to results that are more accessible and easier to analyse compared to corresponding ones derived from its classical counterpart. Built on the properties established for Supremus typicality (e.g. the AEP and the Extended SMB Theorem), we generalised our results on LCoR to non-i.i.d. (Markov
and a.m.s.) settings. It was seen in many examples that linear coding over non-field rings is equally optimal as its field counterpart for compressing irreducible Markov sources (though this is not a complete proof as in the i.i.d. case). In addition, it was once again proved that linear encoders over non-field rings strictly outperform their field counterparts for encoding many functions. To be more precise, it was proved that the set of coding rates achieved by linear encoders over certain non-field rings is strictly larger than the one achieved by all the field versions.

As mentioned, the idea of the Supremus typical sequence is a very important element in the establishment of our results on non-i.i.d. sources. Its advantages were seen by comparing corresponding results derived from the classical Shannon typical sequence and the Supremus typical sequence arguments, respectively. Yet, fundamentally, their differences stem from the SMB Theorem and the Extended SMB Theorem. Empirically speaking, classical Shannon typical sequences feature the SMB Theorem, while Supremus typical sequences are characterised by the Extended SMB Theorem. The Extended SMB Theorem specifies not only the ergodic behaviours of the "global" random process, as the SMB Theorem does, but also the behaviours of all the reduced ("local") processes. From this viewpoint, we can see that Supremus typicality describes the "typical behaviours" of a randomly generated sequence better. It refines the idea of classical typicality.
8.2 Future Research Directions
1. We have proved that, for some classes of non-field rings, linear coding is optimal
in achieving the best coding rates for correlated i.i.d. data compression. However,
the statement is not yet proved to hold for all non-field rings. It could be that the
achievability theorem we obtained does not present an optimal achievable coding
rate region in general. Given that LCoR brings in many advantages in applications,
it is interesting to have a conclusive answer to this problem.
2. An efficient method to construct an optimal or asymptotically optimal linear coding scheme over a finite ring is required and very important in practical applications. Even though our analysis for the ring scenarios is more complicated than that for the field cases, linear encoders working over some finite rings are in general considerably easier to implement in practice. This is because the implementation of finite field arithmetic can be quite demanding. Normally, a finite field is given by its polynomial representation, and operations are carried out via polynomial operations (addition and multiplication) followed by the polynomial long division algorithm. In contrast, implementing the arithmetic of many finite rings is a straightforward task. For instance, the arithmetic of the ring Z_q of integers modulo q, for any positive integer q, is simply integer arithmetic modulo q, and the arithmetic of matrix rings is matrix addition and multiplication. A small contrast is sketched below.
3. It is also very interesting to consider coding schemes based on other algebraic structures, e.g. groups [CF09], rngs, modules, and algebras. Actually, for linear coding over finite rngs, many of our results on LCoR carry over to their rng counterparts, since
essentially a rng is “a ring without the multiplicative identity.” It will be intriguing
should it turn out that the rng version outperforms the ring version in the function
encoding problem or other problems, in the same manner that the ring version
outperforms its field counterpart. It will also be interesting to see whether the idea
of using rngs, as well as other algebraic structures, provides more understanding of
related problems.
4. Although we have seen that linear encoders over non-field rings outperform their field counterparts in various aspects of the function encoding problem, the problem of characterising the achievable coding rate region for encoding a function of
sources is generally open. This problem is linked to other unsolved network information theory problems as well. Hence, an approach to tackle the function encoding
problem could potentially provide significant insight into other problems.
5. For decades, Shannon's argument on typicality of sequences has not changed much. It has been successfully applied to prove most results in information theory. Yet, changes can be helpful sometimes. From a careful investigation of our analysis used to establish the achievability theorems of LCoR, one can see that the ring linear encoder is still chosen randomly, as is done for the field linear encoder. Compared to the analysis used to prove the achievability theorem of LCoF, the major difference lies in the analysis of the stochastic properties of the random source data. For non-i.i.d. scenarios, such a difference is obviously seen from the introduction of the concept of Supremus typicality. The AEP of Supremus typicality states that the Shannon typical sequences which are not Supremus typical are negligible. Such a "refinement" of the concept of the typical sequence works in our particular problems. Thus, apart from searching for different coding schemes, we propose to look deeper into the stochastic behaviours of the systems (sources and channels) as well when dealing with other problems.
Bibliography
[Aar97]
J. Aaronson, An Introduction to Infinite Ergodic Theory. Providence, R.I.: American Mathematical Society, 1997.
[AF92]
F. W. Anderson and K. R. Fuller, Rings and Categories of Modules,
2nd ed. Springer-Verlag, 1992.
[AH83]
R. Ahlswede and T. S. Han, “On source coding with side information via
a multiple-access channel and related problems in multi-user information
theory,” IEEE Transactions on Information Theory, vol. 29, no. 3, pp.
396–411, May 1983.
[BB05]
L. Breuer and D. Baum, An Introduction to Queueing Theory: and
Matrix-Analytic Methods, 2005th ed. Springer, Dec. 2005.
[Bir31]
G. D. Birkhoff, “Proof of the ergodic theorem,” Proceedings of the
National Academy of Sciences of the United States of America, vol. 17,
no. 12, pp. 656–660, Dec. 1931.
[Boo10]
G. Boole, An investigation of the laws of thought on which are founded,
the mathematical theories of logic and probabilities. [S.l.]: Watchmaker,
2010.
[BR58]
C. J. Burke and M. Rosenblatt, “A markovian function of a markov
chain,” The Annals of Mathematical Statistics, vol. 29, no. 4, pp.
1112–1122, Dec. 1958.
[Bre57]
L. Breiman, “The individual ergodic theorem of information theory,”
The Annals of Mathematical Statistics, vol. 28, no. 3, pp. 809–811, Sep.
1957.
[Buc82]
R. C. Buck, “Nomographic functions are nowhere dense,” Proceedings
of the American Mathematical Society, vol. 85, no. 2, pp. 195–199, Jun.
1982.
[CF09]
G. Como and F. Fagnani, “The capacity of finite abelian group codes
over symmetric memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2037–2054, May 2009.
[Cov75]
T. M. Cover, "A proof of the data compression theorem of Slepian and Wolf for ergodic sources," IEEE Transactions on Information Theory,
vol. 21, no. 2, pp. 226–228, Mar. 1975.
[Csi82]
I. Csiszár, “Linear codes for sources and source networks: Error exponents, universal coding,” IEEE Transactions on Information Theory,
vol. 28, no. 4, pp. 585–592, Jul. 1982.
[Csi98]
——, “The method of types,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505–2523, 1998.
[CT06]
T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.
Wiley-Interscience, Jul. 2006.
[DA12] J. Du and M. Andersson, private communication, May 2012.
[DF03] D. S. Dummit and R. M. Foote, Abstract Algebra, 3rd ed. Wiley, 2003.
[DLS81] L. D. Davisson, G. Longo, and A. Sgarro, “The error exponent for the noiseless encoding of finite ergodic Markov sources,” IEEE Transactions on Information Theory, vol. 27, no. 4, pp. 431–438, Jul. 1981.
[Eli55] P. Elias, “Coding for noisy channels,” IRE Convention Record, vol. 3, pp. 37–46, Mar. 1955.
[FCC+02] E. Fung, W. K. Ching, S. Chu, M. Ng, and W. Zang, “Multivariate Markov chain models,” in 2002 IEEE International Conference on Systems, Man and Cybernetics, vol. 3, Oct. 2002.
[Fré35] M. Fréchet, “Généralisation du théorème des probabilités totales,” Fundamenta Mathematicae, vol. 25, no. 1, pp. 379–387, 1935.
[Fro12] G. Frobenius, “Über Matrizen aus nicht negativen Elementen,” Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften, pp. 456–477, 1912.
[Gal68] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[GK80] R. M. Gray and J. C. Kieffer, “Asymptotically mean stationary measures,” The Annals of Probability, vol. 8, no. 5, pp. 962–973, Oct. 1980.
[Gra09] R. M. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed. Springer, Aug. 2009.
[Hop70] E. Hopf, Ergodentheorie (Ergebnisse der Mathematik und ihrer Grenzgebiete, Zweiter Band), reprint of the 1937 Berlin first edition. Springer, Jan. 1970.
[HS] S. Huang and M. Skoglund, “On linear coding over finite rings and applications to computing,” IEEE Transactions on Information Theory, conditionally accepted for publication (submitted October 2012). [Online]. Available: http://people.kth.se/~sheng11
[HS12a] ——, “Computing polynomial functions of correlated sources: Inner bounds,” in International Symposium on Information Theory and its Applications, Oct. 2012, pp. 160–164.
[HS12b] ——, “Linear source coding over rings and applications,” in IEEE Swedish Communication Technologies Workshop, Oct. 2012, pp. 1–6.
[HS12c] ——, On Existence of Optimal Linear Encoders over Non-field Rings for Data Compression, KTH Royal Institute of Technology, Dec. 2012. [Online]. Available: http://people.kth.se/~sheng11
[HS12d] ——, “Polynomials and computing functions of correlated sources,” in IEEE International Symposium on Information Theory, Jul. 2012, pp. 771–775.
[HS13a] ——, “Encoding irreducible Markovian functions of sources: An application of Supremus typicality,” submitted to IEEE Transactions on Information Theory, May 2013. [Online]. Available: http://people.kth.se/~sheng11
[HS13b] ——, “On achievability of linear source coding over finite rings,” in 2013 IEEE International Symposium on Information Theory Proceedings (ISIT), 2013, pp. 1984–1988.
[HS13c] ——, “On existence of optimal linear encoders over non-field rings for data compression with application to computing,” in 2013 IEEE Information Theory Workshop (ITW), 2013.
[HS14a] ——, “Induced transformations of recurrent a.m.s. dynamical systems,” Stochastics and Dynamics, 2014.
[HS14b] ——, “Supremus typicality,” in 2014 IEEE International Symposium on Information Theory Proceedings (ISIT), 2014, pp. 2644–2648.
[Hun80] T. W. Hungerford, Algebra (Graduate Texts in Mathematics). Springer, Dec. 1980.
[Hur44] W. Hurewicz, “Ergodic theorem without invariant measure,” Annals of Mathematics, vol. 45, no. 1, pp. 192–206, Jan. 1944.
[Kak43] S. Kakutani, “Induced measure preserving transformations,” Proceedings of the Imperial Academy, vol. 19, no. 10, pp. 635–641, 1943.
[KK97] T. Kamae and M. Keane, “A simple proof of the ratio ergodic theorem,” Osaka Journal of Mathematics, vol. 34, no. 3, pp. 653–657, 1997.
[KM79] J. Körner and K. Marton, “How to encode the modulo-two sum of binary sources,” IEEE Transactions on Information Theory, vol. 25, no. 2, pp. 219–221, Mar. 1979.
[Lam01] T.-Y. Lam, A First Course in Noncommutative Rings, 2nd ed. Springer, Jun. 2001.
[LN97] R. Lidl and H. Niederreiter, Finite Fields, 2nd ed. New York: Cambridge University Press, 1997.
[McM53] B. McMillan, “The basic theorems of information theory,” The Annals of Mathematical Statistics, vol. 24, no. 2, pp. 196–219, Jun. 1953.
[Mey89] C. D. Meyer, “Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems,” SIAM Review, vol. 31, no. 2, pp. 240–272, Jun. 1989.
[MS84] G. Mullen and H. Stevens, “Polynomial functions (mod m),” Acta Mathematica Hungarica, vol. 44, no. 3–4, pp. 237–241, Sep. 1984.
[Nor98] J. R. Norris, Markov Chains. Cambridge University Press, Jul. 1998.
[Per07] O. Perron, “Zur Theorie der Matrices,” Mathematische Annalen, vol. 64, no. 2, pp. 248–263, Jun. 1907.
[Rot10] J. J. Rotman, Advanced Modern Algebra, 2nd ed. American Mathematical Society, Aug. 2010.
[Rud86] W. Rudin, Real and Complex Analysis, 3rd ed. McGraw-Hill Science/Engineering/Math, May 1986.
[Sha48] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[Ste36] W. Stepanoff, “Sur une extension du théorème ergodique,” Compositio Mathematica, vol. 3, pp. 239–253, 1936.
[SW49] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana: University of Illinois Press, 1949.
[SW73] D. Slepian and J. K. Wolf, “Noiseless coding of correlated information sources,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, Jul. 1973.
[Yeu08] R. W. Yeung, Information Theory and Network Coding, 1st ed. Springer Publishing Company, Incorporated, Sep. 2008.
[Zwe04] R. Zweimüller, “Hopf’s ratio ergodic theorem by inducing,” Colloquium Mathematicum, vol. 101, no. 2, pp. 289–292, 2004.