1. Introduction
These are my lecture notes for the class. They also include suggested problems from the
textbook (and possibly from other sources). The notes are far from complete, and should not
be taken as an exhaustive list of what you are expected to know. Instead, they highlight the
pieces that seemed most important to me. I mention that this document doesn’t contain very
much on applied subjects, real examples and how to use this material. This isn’t because
that part of the course is unimportant; it is just not the core technical material that exams
tend to focus on.
Finally, this document is not carefully edited and has not been read by anybody else. When
my notes disagree with the textbook, trust the textbook! More importantly, do not rely
solely on these notes! Of course, I (and your classmates) also appreciate emails
that point out errors or typos.
2. Lecture 1: Chapter 1 (Jan. 12)
2.1. Summary.
• Welcome and Administrative Details.
• Review: Axioms of Probability and Elementary Calculations
2.2. Lecture.
2.2.1. Administrative Details.
• If you don’t know anybody in this class, please sit in the front-right and say hello to
your neighbours!
• In this course, we review basic probability and move on to some applications. By
default, this will be focused on Markov chain models in engineering and economics.
However, there is some flexibility. Please let me know (preferably by email) if there
are particular topics or applications that you would like to cover.
• Textbook: Probability, Markov Chains, Queues and Simulations by William J. Stewart.
• Website: go to aix1.uottawa.ca/∼asmi28. The syllabus is there, and this is my
primary means of communicating with the class.
• Office hours: 585 KED, office 201G. Monday and Wednesday from 1 PM to 2:30 PM.
• Evaluation will be based on 5-6 homework sets (25%), a midterm on March 2 (25%)
and a final exam (50%).
• The first homework set is posted on the website, and is due in 2.5 weeks. All
homework is due by the start of class on its due date. Like the first few
weeks of class, the material in this set is meant to be a review.
• I strongly suggest that you start on this homework set quite soon! The homework,
as well as the first week of class, will be devoted to review. If you are having trouble
with the homework, and would like to have a longer review period, let me know what
you are finding difficult by this Sunday. If enough students are interested in this, we
will have additional review sessions.
• You are encouraged to ask for worked solutions to relevant questions that are not on
the homework; I am happy to give them and add them to this document.
• This is not an introductory math class. We expect answers that are readable by
human beings, including those not in this course! For example, you should:
– Use sentences.
– Not use the equality sign to mean ‘the next step in my argument is...’
– Define your notation if there might be some confusion.
• An advertisement: the probability group is running a reading course this term on
Fridays (details will be on my website). If you, or a friend, are interested in diving a
little deeper into probability, please attend!
2.2.2. Sets and Sample Spaces. Recall the basic definitions:
• Sample space: a set. Often the set of all possible outcomes of an experiment. Ex.
When rolling a die, the associated sample space might be Ω = {1, 2, 3, 4, 5, 6}.
• Event: a subset of a sample space. Ex. The event that a die roll is even is S =
{2, 4, 6} ⊂ Ω.
• When discussing events, we need to understand set operations and the algebra of set
operations.
• Recall the basic set operations, and try them all with sets A = {1, 2, 3}, B = {2, 4, 6}:
– Set union (denoted A ∪ B).
– Set intersection (denoted A ∩ B).
– Set difference (denoted A − B).
– Set complement (denoted Ac or sometimes A′).
• There are many, many ‘rules’ for the algebra of set operations. If you forget some,
you can normally recover them from Venn diagrams. We recall a few important rules:
– A ∩ Ac = ∅.
– A ∪ Ac = Ω.
– A ∩ (B ∩ C) = (A ∩ B) ∩ C.
– A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
– A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
– (A ∩ B)c = Ac ∪ Bc.
– (A ∪ B)c = Ac ∩ Bc.
• Recall that events A, B are mutually exclusive if A ∩ B = ∅, while A1, . . . , An are
exhaustive if ∪_{i=1}^n Ai = Ω.
2.2.3. Axioms of Probability. A probability P is a map from some subsets of a sample space
Ω to [0, 1] that satisfies:
• P[Ω] = 1.
• For any countable sequence {Ai}i∈N of pairwise mutually exclusive events,
P[∪i∈N Ai] = Σ_{i∈N} P[Ai].
These have many consequences, which you should recall. Some of the most important are:
• P[Ac ] = 1 − P[A].
• For any A, B, we have P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
2.2.4. Conditional Probability and Independence. We define the conditional probability of A
given B by
P[A|B] = P[A ∩ B]/P[B].
This is the definition, but it basically does what we expect:
Example 1. Let Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, let P[S] = |S|/10 for any S ⊂ Ω, let
A = {1, 2, 3, 5, 7, 9, 10} and B = {2, 4, 6, 8, 10}. Then
P[A|B] = P[{2, 10}]/P[{2, 4, 6, 8, 10}] = 2/5.
Recall that two events A, B are independent if P[A ∩ B] = P[A]P[B].
Example 2 (Dice and Sums). We roll two fair dice. Let E1 be the event that the sum is 7,
and E2 be the event that the sum is 10. Let A be the event that the first die rolled comes up
4. We note:
P[A ∩ E1] = P[E1|A]P[A] = (1/6)(1/6) = 1/36 = P[A]P[E1].
P[A ∩ E2] = P[E2|A]P[A] = (1/6)(1/6) = 1/36 > P[A]P[E2] = (1/6)(1/12).
Thus, A, E1 are independent but A, E2 are not.
2.2.5. Law of Total Probability. Recall that a collection of sets {Bi}_{i=1}^n is called a partition
of Ω if:
• Bi, Bj are mutually exclusive for all i ≠ j.
• P[∪_{i=1}^n Bi] = 1.
Recall that, if {Bi}_{i=1}^n are a partition and A is any set,
P[A] = Σ_{i=1}^n P[A|Bi] P[Bi].
Who cares? We have:
Example 3. 100 people at a company have the opportunity to get a flu shot on Oct. 10.
Say that 80 do. It is known that the probability of getting the flu within a year given that
you have a flu shot is 3%; the probability of getting the flu without a flu shot is 6%. A worker
is chosen at random on Oct. 9 of the following year; what is the probability that the worker
had the flu in the intervening time?
Let F denote the event that the selected worker had the flu and S denote the event that the
worker had a flu shot. Then
P[F ] = P[F |S]P[S] + P[F |S c ]P[S c ]
= (0.03)(0.8) + (0.06)(0.2)
= 0.036.
Note, the answer is between 3% and 6% - it is nice to remember this sort of ‘sanity check’
when doing calculations.
2.2.6. Bayes’ Rule. Recall that, if {Bi}_{i=1}^n are a partition and A is any set,
P[B1|A] = P[A|B1] P[B1] / Σ_{i=1}^n P[A|Bi] P[Bi].
Who cares? We have:
Example 4. Consider the same situation as the above subsection. A worker with the flu is
chosen at random; what is the probability that the worker had a flu shot?
We calculate:
P[S|F] = P[F|S]P[S] / (P[F|S]P[S] + P[F|S^c]P[S^c]) = 0.024/0.036 = 2/3.
This is surprising the first time you see it: even though flu shots make you less likely to get
the flu, most people who have the flu also had a flu shot. I do hope that this isn’t surprising
to anyone in this class!
2.2.7. Harder Example. Let's go over some interesting examples that every student of probability should see at some point! How many people have seen these?
Example 5 (Hat-Check Problem). n people go to dinner and leave their hats at the front
of the restaurant. Upon leaving, each person grabs a hat at random. What is the probability
that anybody gets their own hat?
For 1 ≤ i ≤ n, let Ai denote the event that the i’th person has received their own hat. We
then note that
P[A1] = 1/n.
Furthermore,
P[A1 ∩ A2] = P[A1] P[A2|A1] = (1/n)(1/(n − 1)),
and more generally,
P[A1 ∩ A2 ∩ . . . ∩ Aj] = Π_{i=1}^j 1/(n − i + 1) = (n − j)!/n!.
Thus, by the principle of inclusion-exclusion,
P[∪_{i=1}^n Ai] = Σ_{i=1}^n P[Ai] − Σ_{1≤i<j≤n} P[Ai ∩ Aj] + Σ_{1≤i<j<k≤n} P[Ai ∩ Aj ∩ Ak] − . . .
= Σ_{∅≠I⊂{1,2,...,n}} (−1)^{|I|+1} P[∩_{i∈I} Ai]
= Σ_{i=1}^n (−1)^{i+1} [n!/(i!(n − i)!)] [(n − i)!/n!]
= Σ_{i=1}^n (−1)^{i+1} 1/i!.
This is the answer, but it is traditional to go a little further. Recall from calculus that
e^x = Σ_{i=0}^∞ x^i/i!.
Thus, for n big,
1 − P[∪_{i=1}^n Ai] ≈ e^{−1} ≈ 0.368.
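A quick sanity check, using R (which we will introduce properly in Lecture 6). This is my own illustrative sketch, not part of the notes; the function name and the choice n = 10 are arbitrary.

hatcheck <- function(n) {
  # Hand the n hats back in a uniformly random order and check whether
  # at least one person receives their own hat.
  any(sample(n) == 1:n)
}
mean(replicate(10000, hatcheck(10)))   # estimated P[somebody gets their own hat]
1 - exp(-1)                            # the limit, about 0.632

The simulated frequency should be close to 1 − e^{−1} ≈ 0.632 even for moderate n, matching the calculation above.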
3. Lecture 2: Chapter 2; Start of Chapter 3 (Jan. 15)
3.1. Summary.
• A few administrative details.
• Review of combinatorics.
• Definitions related to random variables.
3.2. Lecture. For a finite sample space Ω = {1, 2, . . . , n}, we often have P[A] = |A|/n. That
is, all outcomes are equally likely. In this case, calculating the probability of an event is
equivalent to counting the number of ways the event can occur. Ex. card games, dice games,
drawing balls from urns.
3.2.1. Permutations. We call an ordering of items in a set a permutation. The number of
permutations of a set S is
|S|! = |S|(|S| − 1)(|S| − 2) . . . (2)(1).
Example 6. Let S = {1, 2, 3, 4}. Then some permutations of S are 1234, 1324, 4321, etc.
There are |S|! = (4)(3)(2)(1) = 24 permutations of S.
We sometimes consider permutations for a set with repeated elements. Say that a set S
has ni copies of the i’th item, for 1 ≤ i ≤ k. Then the number of permutations of S is
(n1 + . . . + nk)!/(n1! n2! . . . nk!).
Remark 3.1. Our way of talking about ‘sets with repeated elements’ is very far from what
you might find in other math textbooks. This is not relevant for this course, but you should
be aware of this if you do outside reading.
Example 7. Let S = {1, 1, 2, 2, 2}. Then some distinct permutations of S are 11222, 12122.
There are 5!/(2!3!) = 10 distinct permutations of S.
There are many, many objects that ‘look like’ permutations, and almost all of them show
up in probability. I will briefly survey the textbook’s notation for them, but please be aware
that:
• Other sources may give the same objects slightly different names, and the names may
be inconsistent. You have to understand what is going on.
• The list in this textbook is not a complete list of permutation-like objects that show
up! It is helpful, though not essential, to learn the general principle behind them (see 3.2.7 below).
3.2.2. Permutations with Replacement. Say that we have k boxes, and we can choose one of
ni objects from each box. The total number of ways to make the k choices is n1 n2 . . . nk .
Example 8. Say I have 12 pairs of pants and 25 shirts. Then I have (25)(12) = 300 possible
outfits.
3.2.3. Permutations without Replacement. I have a box with n distinguishable objects, choose
k objects from the box, and put them in order. The number of ways to do this is
P(n, k) = n!/(n − k)!.
Example 9. Consider length-8 passwords built from characters a, b, . . . , y, z, 0, 1, . . . , 8, 9
without any repeated characters. There are 36!/28! ≈ 1.2 × 10^12 such passwords.
3.2.4. Combinations without Replacement. I have n distinguishable objects and wish to put
0 ≤ k ≤ n of these objects into a box. The number of ways to do this is
C(n, k) = P(n, k)/k! = n!/(k!(n − k)!).
This number is called a binomial coefficient. Hopefully we all remember this from the
binomial distribution!
Example 10. I need to bring 3 textbooks with me on a trip, from the 12 that I should be
reading. The number of ways to do this is 12!/(3!9!) = 220.
Assume that I have m boxes and wish to place ki objects into the i’th box, for 1 ≤ i ≤ m.
Assume that Σ_{i=1}^m ki = n. The number of ways to do this is
n!/(k1! k2! . . . km!).
These numbers are called the multinomial coefficients.
3.2.5. Combinations with Replacement. I have a box with n distinguishable objects, and
choose k of them with replacement, and ignore the order in which the objects were chosen.
The number of ways to do this is (n + k − 1)!/((n − 1)! k!).
Example 11. I roll a die 3 times, recording the results. The number of possible sets of
results (again, ignoring order) is 8!/(5!3!) = 56.
3.2.6. Remembering Which is Which. It is often hard to remember what is going on with
permutations and combinations. For some people, putting the following into a table helps.
Permutations:
• With replacement: n^k.
• Without replacement: n!/(n − k)!.
Combinations:
• With replacement: (n + k − 1)!/((n − 1)! k!).
• Without replacement: n!/(k!(n − k)!).
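As a quick numeric check of these four formulas (my own sketch, not from the textbook), R’s factorial and choose functions compute them directly; n = 6 and k = 3 are arbitrary choices.

n <- 6; k <- 3
n^k                               # permutations with replacement
factorial(n) / factorial(n - k)   # permutations without replacement, P(n, k)
choose(n + k - 1, k)              # combinations with replacement, (n+k-1)!/((n-1)!k!)
choose(n, k)                      # combinations without replacement, n!/(k!(n-k)!)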
3.2.7. Principles. When looking at permutations and combinations without replacement,
there is a nice unifying way to find all of the relevant formulas. Fix a set S of size n, possibly
with duplicates. First, we add an extra label to duplicate objects, so that all of the objects
are distinguishable. Then, put all objects in order; there are n! ways to do this. Finally, fix
a particular permutation, and divide by the number of ways to ‘reorder’ this permutation
that we don’t care about.
Example 12. Let S = {A, A, B, B, B}. We wish to look at the number of ways to order
these 5 items. In our steps:
(1) We consider instead the set S′ = {A1, A2, B1, B2, B3}.
(2) Note that there are 5! = 120 permutations of S′.
(3) One permutation of S′ is A1 A2 B1 B2 B3. This is indistinguishable from A2 A1 B1 B2 B3,
A1 A2 B3 B2 B1, etc. In total, there are 2! ways to rearrange the two copies of A and
3! ways to rearrange the three copies of B. Thus, we end up with 5!/(2!3!) permutations
of S.
This principle is useful for two reasons:
(1) It is eventually easier to use this than to memorize lots of formulas.
(2) This can be used to do lots of calculations that can’t be done with the formulas in
the textbook!
3.2.8. Harder Example. Let's go over some interesting examples that every student of probability should see at some point! How many people have seen these?
Example 13 (Birthday Problem). Assume that people are born uniformly at random on the
365 days of the year. I have 32 people in a group. What is the chance that no two have the
same birthday? If the year had a large number of days n, approximately how many people
would need to be in a group for there to be a 90 percent chance of having two people with the
same birthday?
We tackle the general problem first. Assume that there are n days in a year, and for 1 ≤ j ≤ n
let Aj,n be the event that the first j people in our group all have distinct birthdays. We note
that
P[Aj+1,n] = P[Aj,n] (1 − j/n),
and so
P[Aj,n] = Π_{i=1}^{j−1} (1 − i/n).
Thus, for the first problem,
P[A32,365] = Π_{i=1}^{31} (1 − i/365) ≈ 0.247.
For the second problem, fix 0 < c < ∞. Then
P[A_{c√n,n}] = Π_{i=1}^{c√n} (1 − i/n) ≈ e^{−Σ_{i=1}^{c√n} i/n} ≈ e^{−c^2/2}.
Since we want to get a probability of 0.1, we choose
0.1 = e^{−c^2/2}
c^2/2 = −log(0.1)
c^2 ≈ 4.61, so c ≈ 2.15.
IMPORTANT QUESTION: Why did we choose j = c√n? How could we find out that
this was the right thing to do?
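To see where these numbers come from, here is a short R check (my own sketch, not part of the notes): the exact product for 32 people, and the constant c coming from the e^{−c^2/2} approximation.

prod(1 - (1:31)/365)           # exact P[A_{32,365}], about 0.247
cval <- sqrt(-2 * log(0.1))    # solves 0.1 = exp(-cval^2/2); about 2.15
cval * sqrt(365)               # approximate group size for a 90 percent chance of a match, about 41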
We now begin our review of random variables.
3.2.9. Types of Random Variables. We begin by fixing a sample space Ω and a probability
function P. At its most general,
Definition 3.2 (Random Variable). A random variable X is a function from Ω to some
other set.
Remark 3.3. Those of you who have seen more mathematical theory will know that this
isn’t quite right - random variables can’t quite be any function. However, this is close enough
for our purposes.
In this course, we generally have:
Definition 3.4 (Random Variable). A random variable X is a function from Ω to the set
R of real numbers.
All of the random variables in this course will come in one of two flavours. For a random
variable X : Ω → R, denote the range of X by X(Ω) = {X(ω) : ω ∈ Ω}. We then have:
Definition 3.5 (Discrete Random Variable). If X(Ω) is either finite or countable, then X
is called a discrete random variable.
Definition 3.6 (Continuous Random Variable). If X(Ω) is an interval in R, then X is
called a continuous random variable.
Example 14 (Discrete, Continuous, and Other Random Variables). Let Ω = [0, 10] and
define functions X1(ω) = ω, X2(ω) = ⌊ω⌋ (the floor of ω). Define X3(ω) to be X1(ω) for ω < 5 and X2(ω)
otherwise.
Then X1 is continuous, X2 is discrete and X3 is neither.
To relate random variables back to probability, we make the following important definition:
P[X = x] = P[{ω ∈ Ω : X(ω) = x}].
Note that the RHS is already defined, and the LHS didn’t have any clear meaning! When
actually doing probability calculations, we almost always deal with random variables and
ignore the underlying sample space Ω. You certainly did this, possibly without thinking
about it, in your last probability course.
Example 15. Let Ω be the collection of possible 5-card hands from a deck of 52 cards. Define
X1 to be 1 if the majority of cards in the hand are red, and 0 otherwise. Define X2 to be
the sum of the 5 cards, using blackjack conventions. Define X3 to be the numerical ranking,
from 1 to 52, of the highest card in the hand, using poker conventions. Define X4 to be the
number of 5-card hands that this hand would beat in poker.
We note that Ω is complicated and very large (we know from chapter 2 that it has 52!/(5!47!) =
2598960 elements), though we can write it down in a fairly compact way. X1 is much simpler
than Ω; it takes only 2 values, and it is easy to check that P[X1 = 1] = P[X1 = 0] = 1/2. This
is the main reason to use random variables: they are easier to deal with than the entire sample
space. X2 and X3 are slightly more complicated than X1; both are easy to calculate by looking
at a hand and have much smaller ranges (6 to 50 or 1 to 52) than the size of Ω. Finally,
the range of X4 has the same size as Ω; we don’t gain very much in this case.
Remark 3.7. We often define random variables without defining Ω. For example, I might
say ‘let the random variable X be the result of a fair dice roll.’ This is all perfectly OK, but
a bit disconcerting - random variables are functions, but we seem to be omitting the domain
of the function. I will be doing this throughout the rest of this course.
4. Lecture 3: End of Chapter 3, Start of Chapter 4 (Jan. 19)
4.1. Summary.
• A few administrative details.
• Basics of random variables and distributions. We do not cover this in a great deal of
detail; please read over the rest of chapter 3 and make sure that it is familiar to you.
• Introduction to joint distributions. This marks the beginning of material that will
be new to many of you.
4.2. Lecture. Today, we continue our review of random variables.
4.2.1. Probability Distribution Functions. Recall:
Definition 4.1 (Probability Mass Function). The probability mass function (or PMF) of a
discrete random variable X is defined by
pX (x) = P[X = x].
This has the important formula
P[X ∈ A] = Σ_{x∈A} pX(x).
Also recall the related:
Definition 4.2 (Cumulative Distribution Function). The cumulative distribution function
(or CDF) of any random variable X is defined by
FX (x) = P[X ≤ x].
This has the important related formula
P[a < X ≤ b] = FX (b) − FX (a).
Remark 4.3. Note that we use P[X ≤ x], not P[X < x].
Finally,
Definition 4.4 (Probability Density Function). The probability density function (or PDF)
of a continuous random variable X is defined by
pX(x) = (d/dx) FX(x),
when it exists.
Example 16. Question: X has PDF pX (x) = αx−2 for 2 ≤ x ≤ 10. What is α? What is
the CDF?
Answer: We note
1 = ∫_2^{10} αx^{−2} dx = α(1/2 − 1/10),
so α = 5/2. We then have
FX(x) = (5/2) ∫_2^x y^{−2} dy = (5/2)(1/2 − 1/x)
for 2 ≤ x ≤ 10.
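A quick numerical check of this example in R (my own sketch, not part of the notes), using the built-in integrate function:

alpha <- 1 / integrate(function(x) x^-2, lower = 2, upper = 10)$value
alpha                                                              # 1/0.4 = 2.5 = 5/2
integrate(function(x) alpha * x^-2, lower = 2, upper = 10)$value   # integrates to 1, as it should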
4.2.2. Conditioned Random Variables. We can condition random variables in the same way
that we condition events. Recall, for events A and B,
P[A|B] = P[A ∩ B]/P[B].
Similarly, if X is a discrete random variable, we can write
P[X = x|B] = P[{X = x} ∩ B]/P[B].
This leads to the definition of a conditioned PMF for discrete random variables:
pX|B(x) = P[{X = x} ∩ B]/P[B].
Example 17. Consider rolling two fair dice, let X be their sum, and let B be the event that
the first die rolls a 4. We then have
pX|B(x) = P[{X = x} ∩ B]/P[B] = 6 P[{X = x} ∩ B].
If 5 ≤ x ≤ 10, P[{X = x} ∩ B] = 1/36; otherwise, P[{X = x} ∩ B] = 0. Thus,
pX|B(x) = 1/6
for 5 ≤ x ≤ 10.
Example 18. Consider rolling two fair dice. Let Y, Z be their results, let X = Y + Z be
their sum, and let B be the event that Y is even. Then
P[B] = 1/2.
We note that, if X ∈ {3, 4} and B holds, we must have Y = 2. Thus, for x ∈ {3, 4},
pX|B(x) = P[{X = x} ∩ B]/P[B] = P[X = x|Y = 2] P[Y = 2|B] P[B]/P[B] = (1/6)(1/3) = 1/18.
Similarly, if X ∈ {5, 6} and B holds, we must have Y ∈ {2, 4}. Thus, for x ∈ {5, 6},
pX|B(x) = P[{X = x} ∩ B]/P[B] = P[X = x|Y ∈ {2, 4}] P[Y ∈ {2, 4}|B] P[B]/P[B] = (1/6)(2/3) = 1/9.
Similar calculations can be done for 7 ≤ x ≤ 12.
If the event B is of the form B = ∪_{x∈I} {X = x} for some set I, the conditional PMF has a
particularly nice formula:
pX|B(x) = pX(x)/P[B]
for all x ∈ I, and
pX|B(x) = 0
for x ∉ I. This formula even works for continuous random variables.
Example 19. Let X be a continuous random variable with pdf given by fX (x) = e−x for
x ≥ 0, and let B = {X > 1}. We wish to calculate fX|B . We first calculate
P[B] = ∫_{x=1}^∞ e^{−x} dx = e^{−1}.
We then note that B is of the nice form required by the expression above, and so
fX|B(x) = fX(x)/P[B] = e^{1−x}
for x > 1.
As an aside, this distribution is pretty special: it is basically the same as the one we started
with, but shifted. This is a special case of the memoryless property, which you probably saw in a
previous probability course, and which is going to be very important in this course.
4.2.3. Distribution functions, events and formulas. Since distribution functions are written
as the probabilities of events, all of the formulas we know for events (Venn diagrams, Bayes’
rule, etc) have related formulas for random variables. For example, let {Bi}_{i=1}^n be a partition
of a sample space Ω, and let X be a random variable on Ω. Then
pX(x) = Σ_{i=1}^n pX|Bi(x) P[Bi].
This is similar to our rule
P[A] = Σ_{i=1}^n P[A|Bi] P[Bi]
for events, and indeed you can prove one from the other.
4.2.4. Joint Distributions. We often want to think about two (or more) random variables
defined on the same sample space. Since PDFs are easier to think about than the functions
themselves, we introduce joint PDFs to deal with this.
Remark 4.5. We have already been dealing with joint distributions, without naming them.
Consider rolling two dice, and let X1 , X2 be their results. We certainly know how to deal with
these together! Even more, we are pretty comfortable with random variables Y = X1 + X2
and Z = max(X1 , X2 ). We are just formalizing this.
4.2.5. Joint CDF.
Definition 4.6 (Joint CDF). The joint CDF of two random variables X, Y is
FX,Y (x, y) = P[X ≤ x, Y ≤ y].
In general, the joint CDF of n random variables X1 , X2 , . . . , Xn is
FX1 ,...,Xn (x1 , . . . , xn ) = P[X1 ≤ x1 , . . . , Xn ≤ xn ].
For X, Y discrete, this is computed just like any event:
Example 20. Let X1, X2 be chosen independently and uniformly from {1, 2, 3}, let Y = X1 + X2 and let
Z = max(X1, X2). We note that 2 ≤ Y ≤ 6, 1 ≤ Z ≤ 3, and compute the joint PMF pY,Z(y, z)
(summing these values gives the joint CDF FY,Z):
pY,Z(2, 1) = 1/9,  pY,Z(2, 2) = 0,    pY,Z(2, 3) = 0,
pY,Z(3, 1) = 0,    pY,Z(3, 2) = 2/9,  pY,Z(3, 3) = 0,
pY,Z(4, 1) = 0,    pY,Z(4, 2) = 1/9,  pY,Z(4, 3) = 2/9,
pY,Z(5, 1) = 0,    pY,Z(5, 2) = 0,    pY,Z(5, 3) = 2/9,
pY,Z(6, 1) = 0,    pY,Z(6, 2) = 0,    pY,Z(6, 3) = 1/9.
This is tedious but easy!
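Since the sample space only has 9 points, the whole table can also be produced by brute-force enumeration. Here is one way to do it in R (my own sketch, not from the textbook):

grid <- expand.grid(x1 = 1:3, x2 = 1:3)   # the 9 equally likely outcomes
y <- grid$x1 + grid$x2                    # Y = X1 + X2
z <- pmax(grid$x1, grid$x2)               # Z = max(X1, X2)
table(y, z) / nrow(grid)                  # the joint PMF p_{Y,Z}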
If X, Y are continuous, we need to remember our higher-dimensional calculus to do general
calculations; more on that soon. If FX,Y is given to us, we have some formulas that are similar
to the one dimensional case.
Recall:
P[a < X ≤ b] = FX (b) − FX (a).
In two dimensions,
P[a < X ≤ b, c < Y ≤ d] = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c).
This equation, and similar equations, can be found by looking at pictures. We can also find
the CDF for individual random variables, called the marginal CDF :
lim_{y→∞} FX,Y(x, y) = FX(x),   lim_{x→∞} FX,Y(x, y) = FY(y).
We next have a critical definition:
Definition 4.7 (Independent Random Variables). A pair of random variables X, Y are
independent if
FX,Y (x, y) = FX (x)FY (y)
for all x, y. In general, random variables X1, X2, . . . , Xn are independent if
FX1,...,Xn(x1, . . . , xn) = Π_{i=1}^n FXi(xi)
for all x1, . . . , xn.
Note that, if X1, . . . , Xn are independent, then Xi, Xj are independent... but the opposite
direction is not true!
Example 21. Fix n ≥ 2. Let X1, . . . , Xn−1 be independently and uniformly chosen from {0, 1} and let
Xn = 1 if an odd number of X1, . . . , Xn−1 equal 1, and Xn = 0 otherwise.
We then have that {Xi }i∈I are independent for any I ⊂ {1, 2, . . . , n} with |I| ≤ n − 1, but
{X1 , . . . , Xn } are not independent!
NOTE: More complicated variations on this example show up in computer science and
information theory. We have effectively split 1 ‘bit’ of information (whether Σ_{i=1}^n Xi is odd
or not) across n people, in such a way that no n − 1 of them working together can recover
anything about it.
4.2.6. Joint PMF.
Definition 4.8 (Discrete PMF). For a pair of random variables X, Y , we define
pX,Y (x, y) = P[X = x, Y = y].
Similarly, for many random variables X1 , . . . , Xn , we have
pX1 ,...,Xn (x1 , . . . , xn ) = P[X1 = x1 , . . . , Xn = xn ].
Just like
P[X ∈ A] = Σ_{x∈A} pX(x),
we have for joint PMFs the formula
P[(X, Y) ∈ A] = Σ_{(x,y)∈A} pX,Y(x, y).
Just like we can relate the joint CDF to the marginal CDF, we can relate the joint PMF to
the marginal PMF:
pX(x) = Σ_y pX,Y(x, y)
pY(y) = Σ_x pX,Y(x, y).
5. Lecture 4: End of Chapter 4, Start of Chapter 5 (Jan. 22)
5.1. Summary.
• A few administrative details.
• Remainder of discussion of joint distributions.
• Review of expectations.
• Properties and formulas associated with expectations.
5.2. Lecture. Today, we continue our review of random variables and start a discussion of
expectations.
5.2.1. Joint PDF. Recall the relationship between the CDF and PDF for a single continuous
random variable:
FX(x) = ∫_{−∞}^x fX(t) dt.
We have a similar definition for the joint PDF:
Definition 5.1 (Joint PDF). Let X, Y be two continuous random variables with joint CDF
FX,Y . Then the joint pdf fX,Y is a function that satisfies
FX,Y(x, y) = ∫_{−∞}^y ∫_{−∞}^x fX,Y(s, t) ds dt.
Remark 5.2. There is some fuzziness here - it isn’t at all clear that there is exactly one
such function. In fact, there generally isn’t! In this class, we generally give you one PDF,
and you will do calculations using this formula.
For a joint PDF fX,Y , we have the useful formula:
P[X ∈ A, Y ∈ B] = ∫_{s∈A} ∫_{t∈B} fX,Y(s, t) dt ds.
Note that using this in general requires vector calculus! As a quick review:
Example 22. Let X, Y have joint PDF
fX,Y(x, y) = c(x + y)
for 0 ≤ x ≤ y ≤ 5. We will calculate P[X + Y > 1]. First,
1 = ∫_0^5 ∫_0^y c(x + y) dx dy = (3c/2) ∫_0^5 y^2 dy = 125c/2.
Thus, c = 2/125.
We then calculate
P[X + Y > 1] = 1 − ∫_{x=0}^{1/2} ∫_{y=x}^{1−x} (2/125)(x + y) dy dx
= 1 − (1/125) ∫_0^{1/2} (1 − 4x^2) dx
= 374/375.
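This answer can be checked numerically with nested one-dimensional integrals in R (my own sketch, not part of the notes):

# P[X + Y <= 1]: integrate (2/125)(x + y) over x in [0, 1/2], y in [x, 1 - x]
inner <- function(xs) {
  sapply(xs, function(x) integrate(function(y) (2/125) * (x + y), lower = x, upper = 1 - x)$value)
}
1 - integrate(inner, lower = 0, upper = 0.5)$value   # about 374/375 = 0.99733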
Definition 5.3 (Uniform Random Variable). A continuous (respectively discrete) random
variable is uniform if its pdf (respectively pmf ) takes on exactly one nonzero value.
Remark 5.4. As this isn’t a vector calculus course, I’m going to avoid doing too many ‘hard’
multivariable integral questions. However, you need to know some for this course, and to use
the material in this course you will need to know more. I suggest that you read section 4.4,
including the section on uniform random variables (which I am largely skipping). If any of
it looks intimidating, do a practice problem or two.
Just as we could calculate the marginal PMF from the joint PMF in the discrete case,
we can calculate the marginal PDF from the joint PDF in the continuous case. Recall, for
PMFs, we had:
pX(x) = Σ_y pX,Y(x, y)
pY(y) = Σ_x pX,Y(x, y).
For PDFs, we have:
fX(x) = ∫ fX,Y(x, y) dy
fY(y) = ∫ fX,Y(x, y) dx.
Just as we defined two discrete random variables X, Y to be independent if
pX,Y (x, y) = pX (x)pY (y),
we define two continuous random variables X, Y to be independent if
fX,Y (x, y) = fX (x)fY (y).
5.2.2. Conditional Distribution Functions for Discrete Random Variables. Last class, we
defined conditional distribution functions for single random variables. Recall, for a discrete
random variable X and an event B, we had
pX|B(x) = P[{X = x} ∩ B]/P[B].
Similarly, for a pair of discrete random variables X, Y , we write
pX|Y(x|y) = P[X = x, Y = y]/P[Y = y] = pX,Y(x, y)/pY(y).
Remark 5.5. This is a definition, not a calculation! However, it lines up with many calculations for conditional events that we have already seen. For example, we can check that:
pX|Y(x|y) pY(y) = pX,Y(x, y),
just like
P[A|B]P[B] = P[A ∩ B].
Similarly, X and Y are independent if
pX|Y(x|y) = pX(x)
for all y, just like A and B are independent if
P[A|B] = P[A].
Let's do some calculations:
Example 23. We consider X, Y with joint PMF:
pX,Y(x, y) = c(x + y)
for x, y ∈ {1, 2, 3}. Let's calculate pX|Y(x|y). The natural thing to do is:
1 = Σ_{x=1}^3 Σ_{y=1}^3 c(x + y) = c Σ_{x=1}^3 (3x + 6) = c(18 + 18),
so c = 1/36.
Next, we have
pY(y) = Σ_{x=1}^3 pX,Y(x, y) = (1/36) Σ_{x=1}^3 (x + y) = (1/36)(6 + 3y).
Thus,
pX|Y(x|y) = pX,Y(x, y)/pY(y) = (x + y)/(6 + 3y).
NOTE: I wasted time there. We didn’t need to calculate c - it goes away in the end. It is
generally a good thing to be lazy in math class!
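The same conditional PMF can be computed numerically, which is a nice way to check the algebra. A sketch of my own (not from the notes):

joint <- outer(1:3, 1:3, "+")    # proportional to p_{X,Y}(x, y) = c(x + y)
joint <- joint / sum(joint)      # normalizing finds c = 1/36 implicitly
pY <- colSums(joint)             # marginal p_Y(y), y = 1, 2, 3
sweep(joint, 2, pY, "/")         # column y holds p_{X|Y}(x|y) = (x + y)/(6 + 3y)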
5.2.3. Conditional Distribution Functions for Continuous Random Variables. We try to do
the same thing for continuous random variables. There are some immediate issues; we
originally had
pX|Y(x|y) = P[X = x, Y = y]/P[Y = y],
but the denominator is 0 for continuous random variables. However, we also had a formula:
pX|Y(x|y) = pX,Y(x, y)/pY(y),
which suggests defining:
fX|Y(x|y) = fX,Y(x, y)/fY(y).
That definition at least works. It also turns out to make sense. As before, we get many
analogues to existing formulas. We list some important formulas. First, we certainly have:
fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy = ∫_{−∞}^∞ fY(y) fX|Y(x|y) dy.
From this, we get a continuous version of Bayes’ rule:
fY|X(y|x) = fY(y) fX|Y(x|y) / ∫_{−∞}^∞ fY(y) fX|Y(x|y) dy.
Example 24. Let X, Y have joint pdf
fX,Y(x, y) = 6x^2 y
on 0 ≤ x, y ≤ 1. We wish to calculate P[X ≤ 0.2|Y = 0.5]. To calculate the conditional
distribution of X, we need the marginal distribution of Y:
fY(y) = ∫_0^1 6x^2 y dx = 2y.
Thus,
fX|Y(x|y) = fX,Y(x, y)/fY(y) = 6x^2 y/(2y) = 3x^2.
Thus,
P[X ≤ 0.2|Y = 0.5] = ∫_0^{0.2} fX|Y(x|0.5) dx = ∫_0^{0.2} 3x^2 dx = 5^{−3} = 0.008.
NOTE: We could have observed that fX,Y(x, y) = f1(x)f2(y), which automatically implies
that X, Y are independent. Thus, we didn’t need to actually do much of this calculation at
all!
5.2.4. Convolutions and Sums. Let X, Y be two independent discrete random variables with PMFs pX(x),
pY(y). Let Z = X + Y. What is pZ(z)? Well,
pZ(z) = P[Z = z] = Σ_x P[X = x, Y = z − x] = Σ_x pX(x) pY(z − x).
This is called the convolution formula. Basically the same thing holds for independent continuous random
variables:
fZ(z) = ∫_{−∞}^∞ fX(x) fY(z − x) dx.
We will be using this quite a bit later in the course; I mention it here mostly to follow
with the textbook.
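As a concrete illustration of the convolution formula (my own sketch, not from the textbook), here is the PMF of the sum of two fair dice computed in R:

faces <- 1:6
pX <- pY <- rep(1/6, 6)
joint <- outer(pX, pY)             # p_X(x) p_Y(y), using independence
sums <- outer(faces, faces, "+")   # value of Z = X + Y for each pair of faces
tapply(joint, sums, sum)           # p_Z(z) = sum_x p_X(x) p_Y(z - x), for z = 2, ..., 12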
5.2.5. Expectations. You certainly know the expectation from previous probability classes,
where it was the most frequently-used among many notions of ‘centrality,’ such as the median
or the mode. In this class, it will really be the only such measure that we use. Recall the
definition:
Definition 5.6 (Expectation). For a discrete random variable X, the expectation is:
E[X] = Σ_x x pX(x).
For a continuous random variable Y, the expectation is:
E[Y] = ∫_{−∞}^∞ y fY(y) dy.
We know how to do integrals, but there are many nice formulas that you might not know.
For example:
Lemma 5.7 (Integration by Parts Formula). Let X ≥ 0 be a discrete random variable taking values in {0, 1, 2, . . .}. Then
E[X] = Σ_{x=1}^∞ P[X ≥ x].
Remark 5.8. Yes, this is called ‘integration by parts’ in reference to the formula
∫ u dv = uv − ∫ v du
from calculus. Those of you who like calculus can play around with this and figure out why,
but it is beyond the scope of this course.
Since functions of random variables are still random variables (yes, I really did mean that
sentence!), we know how to take expectations of functions of random variables. For example,
if X is a discrete random variable, and h is a function, then we can write:
E[h(X)] = Σ_x h(x) pX(x).
Similarly for continuous random variables. However, there are a bunch of ‘special’ functions
of random variables whose expectations have names.
Definition 5.9 (Moments). For k ∈ N, the expected values E[X k ] are called the moments
of X. When k = 1, the moment is called the mean or average.
We also have
Definition 5.10 (Central Moments). For k ∈ N, the expected values E[(X − E[X])k ] are
called the central moments of X. When k = 2, this is called the variance.
Remark 5.11. The mean and variance are the really important ones here!
Remark 5.12. You’ve all seen expectations, so no examples here!
5.2.6. Expectations and Joint Distributions. We can’t really take expectations of a pair of
random variables X, Y . When we talk about expectations in the context of joint distributions, we are really talking about the expectation of some function Z = h(X, Y ). But this
Z is just a random variable, so we take the expectation the same way we always do:
E[Z] = Σ_x Σ_y h(x, y) pX,Y(x, y)
in the discrete case, and
E[Z] = ∫∫ h(x, y) fX,Y(x, y) dx dy
in the continuous case.
Example 25. Let X, Y have joint PDF
fX,Y(x, y) = (3/44)(x^2 + xy + y^2)
for 0 ≤ x, y ≤ 2. Let Z = XY. Then
E[Z] = (3/44) ∫_0^2 ∫_0^2 (xy)(x^2 + xy + y^2) dx dy
= (3/44) ∫_0^2 ∫_0^2 (x^3 y + x^2 y^2 + x y^3) dx dy
= (3/44) ∫_0^2 (4y + (8/3)y^2 + 2y^3) dy
= (3/44)(8 + 64/9 + 8)
= 52/33.
NOTE: This is plausible - the answer is between 0 and 4.
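This double integral can be checked numerically in R (my own sketch, not part of the notes):

inner <- function(ys) {
  sapply(ys, function(y) integrate(function(x) x * y * (3/44) * (x^2 + x*y + y^2), 0, 2)$value)
}
integrate(inner, 0, 2)$value   # about 1.576
52/33                          # the exact answer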
More important than this definition, expectations of joint distributions have many important properties. The most important, often referred to as linearity of expectation, is
that
E[X1 + . . . + Xn ] = E[X1 ] + . . . + E[Xn ].
This holds regardless of independence. If X, Y are independent, we also have
E[XY ] = E[X]E[Y ].
For independent X1, . . . , Xn, this implies an important formula for variance:
Var[X1 + . . . + Xn] = Var[X1] + . . . + Var[Xn].
Generally, if X, Y are not independent, we define
Definition 5.13 (Covariance and Correlation). The covariance of two random variables
X, Y is given by
Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])]
and their correlation is given by
Corr[X, Y] = Cov[X, Y] / √(Var[X] Var[Y]).
There are many important formulas here as well:
Var[X + Y] = Var[X] + 2 Cov[X, Y] + Var[Y]
Cov[cX, Y] = c Cov[X, Y]
Cov[X + c, Y] = Cov[X, Y]
All of these are possible to derive in one or two lines from the definitions you know. However,
they show up often enough that they are probably worth memorizing.
6. Lecture 5: End of Chapter 5 (Jan. 26)
6.1. Summary.
• A few administrative details.
• A few last properties of expectations.
• Introduction to generating functions.
6.2. Lecture.
6.2.1. Homework Note. Several people have asked me what I mean by a ‘probabilistic’ proof
of an identity. This is very poorly defined, but (roughly speaking) I want a proof that doesn’t
involve manipulating formulas. Here is a simple proof in that line, for the formula:
Σ_{i=3}^{n−2} C(i − 1, 2) C(n − i, 2) = C(n, 5).
To prove this, I give a procedure for choosing 5 objects out of n. First, I choose the middle
object. Then, if the middle object is in position i, I choose 2 objects from the first i − 1 positions and
2 objects from the last n − i positions. The number of ways to do this procedure is clearly
Σ_{i=3}^{n−2} C(i − 1, 2) C(n − i, 2).
However, this procedure also gives me every way to choose 5 objects out of n exactly once,
so the number of ways to do this procedure must also equal C(n, 5).
Thus, the two must be equal.
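If you want to convince yourself numerically before trusting the counting argument, the identity is easy to check in R for any particular n (my own sketch; n = 12 is arbitrary):

n <- 12
i <- 3:(n - 2)
sum(choose(i - 1, 2) * choose(n - i, 2))   # the sum over i
choose(n, 5)                               # should match (792 when n = 12)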
6.2.2. Introduction. Today, we finish discussing expectations for collections of random variables and get started on generating functions. We then begin to review various special
distributions.
6.2.3. Conditioning and Expectations. Let X, Y be two random variables with joint PMF
(respectively PDF) pX,Y (respectively fX,Y ). We define the conditional expectation of X
given Y in the obvious way:
E[X|Y = y] = Σ_x x pX|Y(x|y)
E[X|Y = y] = ∫_{−∞}^∞ x fX|Y(x|y) dx.
Note that we have seen this before! After all, P[·|Y = y] is an honest-to-goodness probability
distribution. We are just taking expectations with respect to it. So, again, there is in fact
no new content here! However, it does lead to two really important formulas. Both of them
work for both discrete and continuous random variables. The first is:
E[X] = E[E[X|Y ]].
For now, we just give a silly application:
Example 26. The average height of a Canadian male is 175.1 cm and the average height of
a Canadian female is 162.3 cm. 49.7% of Canadians are men. What is the average height
of Canadians?
Choose a Canadian at random. Let X be their height and let Y be their gender. Then
E[X] = E[E[X|Y ]]
= (175.1)(0.497) + (162.3)(0.503)
= 168.7.
Another important formula is:
V ar[X] = E[V ar[X|Y ]] + V ar[E[X|Y ]].
We give a simple application:
Example 27 (Pooled Samples). I am interested in insuring a group of students. I pick a
student at random, and let X be the amount that I have to pay out and let Y = 1 if the
student is on the varsity football team and 0 otherwise. Due to earlier research, we know:
E[X|Y = 0] = 248
E[X|Y = 1] = 880
V ar[X|Y = 0] = 110
V ar[X|Y = 1] = 110
P[Y = 0] = 0.92.
We then calculate:
E[V ar[X|Y ]] = (0.92)(110) + (0.08)(110)
= 110.
Also,
V ar[E[X|Y ]] = E[E[X|Y ]2 ] − E[E[X|Y ]]2
= (0.92)(248)2 + (0.08)(880)2 − ((0.92)(248) + (0.08)(880))2
= 29398.
We conclude:
V ar[X] = E[V ar[X|Y ]] + V ar[E[X|Y ]]
= 29508.
NOTE: We needed quite a lot of information to be able to do this sort of calculation.
ANOTHER NOTE: This formula gives us a beautiful, simple, inequality:
Var[X] ≥ E[Var[X|Y]] ≥ min_y Var[X|Y = y].
Before continuing: I strongly encourage you to read the list of formulas and inequalities
on pp.100 of the textbook. All of them should be ‘obvious,’ and are a nice source of ‘tricky’
exam questions.
6.2.4. Generating Functions. Generating functions are some particularly important expectations. First:
Definition 6.1 (Probability Generating Function). Let X ≥ 0 be a discrete random variable
with PMF pX . The probability generating function is given by
GX(z) = E[z^X] = Σ_{i=0}^∞ p_i z^i.
This function has many important properties. The most important, which will certainly
be used, is:
Theorem 28. The distribution of X is uniquely determined by its generating function. That
is, if GX(z) = GY(z) for all z, then we necessarily have pX = pY.
It is also very useful for computations:
Theorem 29. We have
GX′(1) = E[X].
More generally, for any k ≥ 1,
GX^(k)(1) = E[X(X − 1) . . . (X − k + 1)].
These expectations are known as factorial moments, though I won’t use that word much.
There is also a continuous analogue to all of this:
Definition 6.2 (Moment Generating Function). Let X ≥ 0 be a random variable. For
θ ∈ R, the moment generating function is given by
MX (θ) = E[eθX ].
Remark 6.3. The probability generating function and the moment generating function are very close... but are parameterized slightly differently. This is annoying and makes them a little harder to memorize. I’m
Remark 6.4. The moment generating function uniquely determines the PMF of a discrete
random variable. The moment generating function ‘almost’ uniquely determines the PDF
of a continuous random variable as well, when it exists. We won’t really discuss what this
‘almost’ entails.
Theorem 30. We have, for k ≥ 1,
MX^(k)(0) = E[X^k].
Although these formulas are nice, we mostly care about the MGF because it behaves very
nicely under common operations. The most interesting is probably:
Theorem 31. Let X1, . . . , Xn be a collection of independent random variables. Let Y =
X1 + . . . + Xn. Then
MY(θ) = Π_{i=1}^n MXi(θ).
We also have
McY(θ) = MY(cθ).
Let's use this!
Example 32. Let X1, . . . , Xn be i.i.d. random variables with MGF
MXi(θ) = e^{θ^2}.
Ok, these are normal random variables - but we’re not talking about that yet. Let
X = (1/√n) Σ_{i=1}^n Xi. Then
MX(θ) = M_{(1/√n) Σ Xi}(θ)
= M_{Σ Xi}(θ/√n)
= Π_{i=1}^n MXi(θ/√n)
= Π_{i=1}^n e^{θ^2/n}
= e^{θ^2}
= MX1(θ).
Thus, by our earlier theorem, X has the same distribution as X1. NOTE: Even though we are just checking
equality... this actually seems quite difficult to do without using a generating function!
We have two more generating functions! The most general is:
Definition 6.5 (Laplace Transform). Let X be a random variable (nonnegative or not!).
For any s ∈ C, the Laplace transform is defined by
LX (s) = E[e−sX ].
Finally, we have
Definition 6.6 (Characteristic Function). Let X be a random variable (nonnegative or not!).
For any θ ∈ R, the characteristic function is defined by
CX(θ) = E[e^{iθX}],
where i = √−1.
Remark 6.7. Although it isn’t the easiest to understand, the characteristic function is probably the most important of the generating functions. It has two particularly important properties:
• It generally exists (we will see soon that many distributions don’t have MGFs).
• It has a nice inversion formula, useful in many areas. Those of you who have seen
Fourier analysis will know this.
Example 33 (Calculations with Generating Functions). Let X have moment-generating
function
MX(z) = 1/(2 − z).
What is E[X^n]?
We calculate
MX′(z) = 1/(2 − z)^2
MX′′(z) = 2/(2 − z)^3
. . .
MX^(k)(z) = k!/(2 − z)^{k+1},
and so
MX^(k)(0) = k!/2^{k+1},
and so E[X^n] = n!/2^{n+1}.
6.2.5. Maximums and Minimums. Remember: If X1 , . . . , Xn are independent, and Y =
max(X1 , . . . , Xn ), then
FY(y) = Π_{i=1}^n FXi(y).
Section 5.5 of the textbook has more on this.
7. Lecture 6: R Programming, Start of Chapter 6 (Jan. 29)
7.1. Summary.
• A few administrative details.
• Homework 1 is due by the start of class today! Homework 2 is on the website, and
is due on February 12 by the start of class.
• Introduction to R programming.
7.2. Lecture. Today, we get a lightning introduction to the R programming language and
to simulation, and then start Chapter 6.
7.2.1. Introduction to R. R is a programming language that is popular for doing work in
probability and statistics. Its main advantages are:
• It has a large and active user base, who can (and will!) answer questions.
• It has large libraries for doing almost any calculations you will see in any probability
and statistics course.
• It is similar enough to ‘modern’ languages like C to be familiar to most users, and
code from those languages can be integrated into it.
• You can do command-line programming.
• The big one: it is free.
The R environment can be obtained at http://www.r-project.org/ under the downloads
page. Please do that immediately, so that you have some time to practice!
There are many great R resources for general programming. Some include:
• The official R tutorial (available at http://www.r-project.org/ ).
• There are three things that you can type into the console to get information about a
function such as ‘sum’: ?sum, ??sum, and sum
• StackOverflow and CrossValidated (otherwise known as StatsOverflow) have an enormous number of basic questions about R. Everything that comes up in this course
will likely have a question on that website.
I will very briefly go over some R syntax in class, including:
• Creating variables (e.g. x = 5), vectors (e.g. y = c(1,2,3)) and matrices (e.g. z =
matrix(1:10, nrow=5)).
• Operations (+,-,*,/, &,|).
• Flow control (if, for).
• Functions (foo <- function(x){bar}).
• Simulating random variables (rbinom(nSamples,n,p), rnorm(nSamples, mean, sd) - note that
R wants the standard deviation, not the variance - and in general rBLAH).
• A small number of important special functions (mean, sum, sample(nItems,nSamples)).
• A small number of ways to plot data (the most common is plot(x) or plot(x,y) to
make scatter plots, and hist(x) to create histograms of univariate data).
We will use R mostly to do ‘simulations.’ That is, we will give you a story about a bunch
of distributions, and you will use R to generate made-up data to try to understand the story.
If you haven’t done this before, it isn’t clear what this is for, or why you can do this on a
computer. We’ll go through a worked example that might clear up some of this.
Example 34 (Birthday Problem, Redux). We know that, in fact, births aren’t distributed
evenly throughout the year. For example, in a typical week in Canada, roughly 20 percent
more people will be born on a given weekday than on the weekend. Say that we have a large
number of people (more than 365) lined up in a hallway, and we go through the line asking
people their birthdays. How many people do we need to go through before we find a repeated
birthday?
We try to understand this problem by simulating the process. We make a function that simulates the result of a single line, with the weekday to weekend birth ratio given by a free
parameter. For simplicity, assume that a year has exactly 52 weeks.
birthday <- function(p) {
  # Give each of 366 people a birthday: a uniformly chosen week, plus a day
  # of the week where each weekday is (1+p) times as likely as a weekend day.
  Weeks = sample(52, 366, replace=TRUE, rep(1,52))
  Days = sample(7, 366, replace=TRUE, c(1, 1, 1+p, 1+p, 1+p, 1+p, 1+p))
  BirthDate = 7*(Weeks-1) + Days
  # Return the position of the first person whose birthday repeats an earlier one.
  for(i in 2:365) {
    for(j in 1:(i-1)) {
      if( BirthDate[i] == BirthDate[j]) {
        return(i)
      }
    }
  }
}
To simulate this process 100 times, we could write:
SimulatedData = rep(0,100)
for(i in 1:100) { SimulatedData[i] = birthday(0.2) }
And then we might plot using hist(SimulatedData).
So, this lets us simulate data. What do we actually use it for? In practice, we often use it
to do Monte Carlo integration. This is a big and very popular subject, but it is easy to say
the basics:
Example 35 (Monte Carlo Integration). Fix some random variable X with PMF pX . We
want to calculate some expectation E[f (X)] with respect to pX ; for example, we might be
interested in
P[X < 1] = E[1_{X<1}]
or
E[X^2].
Sometimes, these integrals are hard to calculate. Instead, we might simulate X1, . . . , Xn and
estimate
E[f(X)] ≈ (1/n) Σ_{i=1}^n f(Xi).
Since we all know the central limit theorem, we know that
E[f(X)] ≈ (1/n) Σ_{i=1}^n f(Xi) ± C/√n.
Example 36 (Finishing the Birthday Problem). How does this relate to the birthday problem? We let X be the number of people called in before we found a repeated birthday, and
want to find E[X], P[X < 20] and some number T so that P[X < T ] ≈ 0.5. I don’t know
much about pX in this case, so these seem hard to find! Let's instead use our simulated data
from above. We calculate these three terms in R:
mean(SimulatedData)                                  # estimate of E[X]
mean(SimulatedData < 20)                             # estimate of P[X < 20]
sort(SimulatedData)[floor(length(SimulatedData)/2)]  # sample median: a T with P[X < T] about 0.5
Easy!
Ok, that is just about it for a brief introduction to R. R will show up mostly in assignments,
where there might be a little bit of extra information that could be useful. You will generally
be responsible for figuring out the rest by yourself, but please feel free to write to me, come
to office hours, etc.
8. Lecture 7: Chapter 6 (Feb. 2)
8.1. Summary.
• A few administrative details.
• Some more discrete random variables
8.2. Lecture. We begin chapter 6 of the textbook: discrete distribution functions. Most
textbook questions about chapters 6 and 7 follow essentially the same format:
• We give you the parameters for some (named) distribution function,
• You plug these values into the PMF or PDF.
You should be very comfortable with this type of question; we’ll do some in class. Another
(slightly less common) class of questions has the format:
• We tell you how to construct a random variable,
• You prove that this random variable is really one of our named random variables in
disguise.
We already did this for the normal distribution when we started using generating functions,
and there are more examples in homework 2. Another (still less common) class of questions involves doing ‘composite’ calculations. These look like the first type of calculation,
but they involve several different types of random variables. Again, an example of this is
in the homework, and I’ll highlight an example next lecture. If you are comfortable with
these three types of questions, you will be largely set for exams. The homework also has
some ‘tricky’ questions, but these would only show up on exams if they came with many hints.
8.2.1. Uniform Distribution. One of the more popular distributions is the uniform distribution. A discrete (respectively continuous)
uniform distribution is one with PMF (respectively PDF) that is constant on its support.
There isn’t much to say about them - calculations here are fairly easy!
8.2.2. Discrete Distribution Functions. Today, we’ll focus on discrete distributions that come
from independent trials:
• Bernoulli, Binomial and Multinomial.
• Geometric and Negative Binomial.
There are a lot of names here. I suspect that you’ve heard all of them, but have trouble
remembering which is which. As such, we start with the underlying process, then recall
which name refers to which questions. Note that all of our calculations are ‘really’ about
sampling with replacement, and the formulas reflect that. The hypergeometric distribution
deals with a related process, but this time without replacement.
Definition 8.1 (Bernoulli Process). A Bernoulli Process is a sequence of i.i.d. random
variables {Xi }i∈N so that
P[Xi = 1] = 1 − P[Xi = 0] = p ∈ [0, 1].
Example 37. The canonical example, when p = 1/2, is an infinite sequence of coin flips, with
Xi = 1 if the i’th flip comes up heads. Note that this coin-flipping is really sampling with
replacement from a set of two objects!
Dice rolls give examples for other values of p - e.g. p = 2/3 can be obtained by setting Xi = 1
if the i’th die comes up 1, 2, 3 or 4.
Remark 8.2. We’ve snuck a definition in here. We know what it means for any finite collection of random variables to be independent, but not what it means for an infinite sequence
to be independent.
We define an infinite sequence like this to be independent if any finite subset of them is
independent.
8.2.3. Binomial and Multinomial Distribution. Let {Xi }i∈N be a Bernoulli process. A binomial random variable with parameters (p, n) has the distribution of:
X = Σ_{i=1}^n Xi.
We list some formulas for this. You should derive some of them yourself - none are terribly
difficult, all are fair game for exams.
pX(x) = C(n, x) p^x (1 − p)^{n−x}
E[X] = np
Var[X] = np(1 − p).
For practice, we calculate the probability generating function:
Example 38. Recall that
GX(z) = E[z^X] = Σ_{x=0}^n C(n, x) p^x (1 − p)^{n−x} z^x.
Ok, this seems to be a problem. What should we do instead?
One answer: remember the formula
GX(z) = GX1+...+Xn(z) = Π_{i=1}^n GXi(z) = ((1 − p)z^0 + pz^1)^n = ((1 − p) + pz)^n.
So, pretty easy in the end.
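A quick numerical sanity check of this identity in R (my own sketch; the values of n, p and z are arbitrary):

n <- 10; p <- 0.3; z <- 0.7
sum(dbinom(0:n, n, p) * z^(0:n))   # E[z^X] computed directly from the PMF
((1 - p) + p * z)^n                # the closed form; the two agree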
We don’t discuss the multinomial distribution in class, as it is similar to the binomial
distribution and should be review. Please refer to section 6.7 of the textbook for review.
8.2.4. Geometric and Negative Binomial Distributions. The binomial distribution counts the
number of successes up to some specific time n. The negative binomial distribution counts
the number of trials needed to get a specific number of successes k. That is, the binomial
distribution involves fixing the amount of time and counting successes; the negative binomial
distribution involves fixing the number of successes and counting time. More formally, let
{Xi}i∈N be a Bernoulli process and define
T = min{n : Σ_{i=1}^n Xi = k}.
Then T has negative binomial distribution with parameters (p, k). The geometric random
variable is a special name for a negative binomial distribution with k = 1. In this class,
we will be talking about geometric random variables (and their continuous cousins, the
exponential random variables) more than the negative binomial distribution itself. Some
formulas for the negative binomial distribution include:
pT(n) = C(n − 1, k − 1) p^k (1 − p)^{n−k}
E[T] = k/p
Var[T] = k(1 − p)/p^2.
Remark 8.3. Again, k = 1 is the important case here. One reason is that the geometric
distribution (like its continuous cousin, the exponential) has the memoryless property, which we will get back to in the near future.
Example 39. At my local bus stop, the number of minutes that I must wait for a bus follows
a geometric distribution with mean 12. I have early meetings every day this week, and want
to make sure that with 95% probability, I am on the bus by 7 AM. What time should I show
up to the bus station?
Since the mean is 12, we have p = 1/12. Let X1, . . . , X5 be the amounts of time that I spend
waiting for the bus, and assume that I get to the bus station m minutes before 7 AM. Then
I want:
0.95 ≈ P[max(X1, . . . , X5) ≤ m] = P[X1 ≤ m]^5 = (1 − (1 − 1/12)^m)^5.
Thus,
0.99 ≈ 1 − (1 − 1/12)^m
0.01 ≈ (1 − 1/12)^m
m = log(0.01)/log(11/12) ≈ 53.
Is this plausible? For the model: it is clear that buses don’t arrive according to a discrete
distribution... but counting by the minute is surely ‘plausible.’ For the end result: the average
waiting time is 12 minutes, so you need to be about 36 minutes early to be 95-percent sure on
even a single day. Going from 36 to 53 minutes early doesn’t seem completely implausible.
8.2.5. Poisson Distribution. The Poisson distribution is very important in applied probability, and in particular shows up quite often in queuing theory. Roughly speaking, a Poisson
distribution shows up when you are counting the number of successes of a very large number
of experiments, each of which has a very low success rate. For example, it might be used to
model the number of callers to a Microsoft help desk within a given 1-minute period. In this
case, the number of experiments is the number of people using Microsoft products (a very
large number!), but the probability of success is the probability that any given person will
call given any particular minute (this is small).
There is a natural question here: why use a Poisson random variable, rather than a binomial? There are many reasonable answers, but for now I’ll give the simplest: there are
many random variables besides the binomial that ‘look like’ the Poisson distribution, and
the Poisson is easiest to work with. More mathematically, the Poisson is ‘universal’ - if you
are running many trials, and all of them are likely to fail, and they aren’t too dependent,
the number of successes will be (approximately) Poisson.
Definition 8.4 (Poisson Distribution). A Poisson random variable X associated with rate
r and time interval T has PMF
pX(x) = e^{−λ} λ^x/x!,
where λ = rT .
Example 40. Assume that a help desk receives, on average, 7 calls per hour. Let X be the
number of calls that it receives between 9 AM and 9:10 AM, and assume that X has Poisson
distribution. Then its parameter is λ = (7)(1/6) ≈ 1.17. The probability that the help desk
receives no calls at all during this period is
P[X = 0] = e^{−7/6} ≈ 0.311.
We mentioned that the Poisson distribution is often used as an approximation to the
binomial. More formally,
Definition 8.5. Let X be a Binomial random variable with parameters n, p. Then the
Poisson approximation to X is a Poisson random variable with parameter λ = np.
Example 41. In a certain widget factory, one percent of widgets are defective. I look at 200
widgets, chosen at random. Using the Poisson approximation, what is the probability that at
least 4 of them are defective?
Let X be Poisson with mean λ = (200)(0.01) = 2. Then
P[X ≥ 4] = 1 − P[X = 0] − P[X = 1] − P[X = 2] − P[X = 3] ≈ 0.143.
Also, E[X] = 2 and Var[X] = 2.
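In R (my own sketch, not part of the notes), both the Poisson approximation and the exact binomial answer are one-liners:

1 - ppois(3, lambda = 2)                 # Poisson approximation, about 0.143
1 - pbinom(3, size = 200, prob = 0.01)   # exact Binomial(200, 0.01) answer, about 0.142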
There is more on the Poisson approximation in the homework. Let's do two more complicated calculations to prepare - what I called ‘composite’ questions last class.
Example 42. I send somebody to look for defective widgets at the same widget factory.
While I want the person to sample 200 widgets, they are unreliable; instead, they sample
Θ widgets, where Θ is a geometric random variable with mean 200. Let X be the number
of defective widgets that they find, and calculate both E[X] and V ar[X] under the Poisson
approximation.
First,
E[X] = E[E[X|Θ]] = E[Θ/100] = 2.
NOTE: This is the same expectation as we would have if exactly 200 had been sampled!
However,
Var[X] = Var[E[X|Θ]] + E[Var[X|Θ]] = Var[Θ/100] + E[Θ/100] = 4(1 − 1/200) + 2 ≈ 6.
NOTE: So, the variance is quite a lot larger in this situation!
Another ‘composite’ example.
Example 43. I buy 12 lightbulbs and plug them in. Assume each lightbulb’s lifetime, in
days, follows a geometric distribution with mean 50. Also assume that the lifetimes are
independent. I will buy new lightbulbs once 7 have failed. What is the probability that I have
bought new lightbulbs after 75 days?
Let X1, . . . , X12 be the lifetimes of the lightbulbs, let Yi = 1_{Xi>75}, let Y = Σ_{i=1}^{12} Yi and let
Z = 1 if I have bought new lightbulbs and 0 otherwise. We note that
E[Yi] = P[Xi > 75] = (1 − 1/50)^75 ≈ 0.220.
Since Y is Binomial(12, 0.220), we have
P[Z = 1] = P[Y ≤ 5] ≈ 0.97.
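The last two numbers can be checked directly with R’s binomial functions (my own sketch, not part of the notes):

p_survive <- (1 - 1/50)^75               # P[a given bulb lasts more than 75 days], about 0.220
pbinom(5, size = 12, prob = p_survive)   # P[at most 5 bulbs survive] = P[Z = 1], about 0.97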
9. Lecture 8: Start of Chapter 7 (Feb. 5)
9.1. Summary.
• A few administrative details.
• Some more continuous random variables.
9.2. Lecture. We begin chapter 7, a collection of continuous distribution functions. As with
the discrete distribution functions, there are really three types of popular questions:
• We give parameters for a continuous PDF; you plug them into a formula.
• We construct several random variables, and ask you to prove some relationship between them (e.g. that a sum of i.i.d. normal random variables is normal). This often
involves generating functions; it sometimes involves conditional distributions.
• We give parameters for several continuous PDFs; you solve several type-1 questions
in a row.
As with the discrete random variables, we’ll do all three types in class, and there are
examples of all three types in the homework. You should, of course, do more examples
yourself!
9.2.1. Exponential Distribution. The exponential distribution is a close cousin to the geometric distribution.
Definition 9.1 (Exponential Distribution). The PDF of an exponential distribution with
parameter λ is
fX (x) = λe−λx
for x ≥ 0.
We give some related formulas:
F_X(x) = 1 − e^{−λx}
E[X] = 1/λ
Var[X] = 1/λ^2.
The exponential distribution is very important to this course for two closely-related reasons.
The first is the memoryless property:
Theorem 44 (Memoryless Property). Let X be a random variable with exponential distribution and let s, t be two constants. Then
P[X > s + t|X > t] = P[X > s].
Proof.
P[X > s + t|X > t] = P[X > s + t] / P[X > t]
                   = e^{−λ(s+t)} / e^{−λt}
                   = e^{−λs}
                   = P[X > s].
Remark 9.2. The exponential distribution is the only continuous distribution with this property. The geometric distribution, which we have already seen, is the only discrete distribution
with this property.
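A quick numerical illustration of the memoryless property (my own sketch, using R's pexp with lower.tail = FALSE for survival probabilities):
rate <- 1/12; s <- 5; t <- 7
pexp(s + t, rate, lower.tail = FALSE) / pexp(t, rate, lower.tail = FALSE)
pexp(s, rate, lower.tail = FALSE)     # the two numbers agree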
Remark 9.3. This is very surprising! Let's look at another random variable, with CDF
F_X(x) = 1 − e^{−x^2}
for x ≥ 0. Then
P[X > s + t|X > t] = e^{−(s+t)^2} / e^{−t^2} = e^{−s^2 − 2ts} = e^{−2ts} P[X > s].
Example 45. I am at a bus stop with 5 types of buses. The wait for each bus is exponentially
distributed, with mean waiting times of 12,14,19,22 and 40 minutes. What is the distribution
of the waiting time for the first bus?
Let Xi be the time until a bus of type i arrives and let X = min(X1 , . . . , X5 ). Note that we
have many options for calculating the distribution of X. However, the CDF seems to be by
far the easiest, so we choose that. NOTE: Choosing what to calculate is by far the hardest
part of this question - it shouldn’t be swept under the rug!
Then
P[X ≤ s] = 1 − P[min(X_1, ..., X_5) > s]
         = 1 − ∏_{i=1}^5 P[X_i > s]
         = 1 − e^{−s/12} e^{−s/14} e^{−s/19} e^{−s/22} e^{−s/40}
         = 1 − e^{−0.278 s}.
NOTE: This is the CDF of an exponential distribution. Please look at pp. 140 of the textbook
for more discussion of this example, and more practice on manipulating the exponential
distribution.
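The rate-adding behaviour is easy to check numerically; here is a small sketch (my own, not from the textbook) comparing the analytic rate with a simulation of the minimum.
means <- c(12, 14, 19, 22, 40)
sum(1/means)                                        # about 0.278, the rate of the minimum
set.seed(1)
waits <- replicate(100000, min(rexp(5, rate = 1/means)))
mean(waits)                                         # close to 1/0.278, about 3.6 minutes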
The other important property of the exponential distribution is its relationship to the
Poisson distribution. The relationship is:
Theorem 46. Let {Xi }i∈N be an infinite sequence of i.i.d. exponential random variables
with mean λ and fix T > 0. Then
Y = max{n : ∑_{i=1}^n X_i ≤ T}
has Poisson distribution with mean T/λ.
That is a lot to process! Let's make it concrete.
Example 47. Say we know that the time between calls at a help desk follows an exponential
distribution, with mean λ. Then this theorem says that the number of calls to the help desk
over a time period of length T has Poisson distribution with mean T/λ.
Note that this does more-or-less go in the opposite direction, though the above theorem
doesn’t imply this (and indeed we would have to be more careful to make this rigorous):
Example 48. Say we know that the number of calls to a help desk follows a Poisson distribution, with rate λ. Then the time between calls has an exponential distribution with mean 1/λ.
We do some calculations:
Example 49. Assume that the waiting time for a bus is exponentially distributed with mean
12 minutes. Let A be the event that no buses show up for the first 20 minutes and let B the
event that more than 2 buses show up in the first 20 minutes. Calculate P[A], P[B].
The first is straightforward:
P[A] = e^{−20/12} ≈ 0.189.
For the second, we need our theorem. Let X be a Poisson random variable with mean 20/12. Then
P[B] = 1 − P[X ≤ 2] = 1 − P[X = 0] − P[X = 1] − P[X = 2] ≈ 0.234.
NOTE: The ppois function in R is very helpful for calculating these very quickly! In this
case, the above was evaluated with
1-ppois(2, 20/12)
9.2.2. Normal Distribution. This is probably the most important distribution in probability
and statistics, and was likely the emphasis of your first course on the subject. It will be less
critical in this course, but is still interesting. The main interest in the normal distribution
is that it has a very important ‘universal’ property: if you add up a lot of little random
variables, and none of them are large, and they aren’t too dependent, the sum ‘looks like’ a
normal distribution. This is the content of the famous central limit theorem.
Recall:
Definition 9.4 (Normal Distribution). The normal distribution with mean µ and variance
σ 2 has PDF
f_X(x) = (1/(σ √(2π))) e^{−(x−µ)^2/(2σ^2)}
for −∞ < x < ∞. NOTE: It isn’t so obvious that this is actually a distribution function.
Checking the normalizing constant is a difficult calculus exercise.
Since this function is so difficult to integrate, most probability calculations involving the
normal distribution are done via table look-ups. In order to use these tables, you need:
Definition 9.5 (Standard Normal Distribution). The normal distribution with µ = 0, σ 2 = 1
is known as the standard normal distribution.
Note that:
Theorem 50. If X has a normal distribution with mean µ and variance σ 2 , then
Z = (X − µ)/σ
has standard normal distribution.
Proof. This is a short exercise, either directly or with generating functions. So, check!
Almost all lookup tables are written in terms of the standard normal distribution, so we
will do an example of that here. Note that, if you use R, it will do the transformation for
you, and so this is a little outdated.
Example 51. Assume that the test scores in this class are normally distributed, with mean
µ = 85 and variance σ 2 = 16. Let X be the score of a student chosen at random. What is
P[X > 80]?
P[X > 80] = P[(X − 85)/4 > (80 − 85)/4]
          = P[Z > −1.25]
          ≈ 0.894.
We could look this up in a table. Alternatively, we could use R: the command is
1 - pnorm(-1.25)
Note that we needed a negative sign! Alternatively, if we are using R, we can do the entire
question at once:
pnorm(-80, -85, 4)
Just like we had a Poisson approximation to the binomial distribution, there is a normal
approximation to the binomial distribution:
Definition 9.6 (Normal Approximation). Let X_1, ..., X_n be n i.i.d. Bernoulli random variables with E[X_i] = p, let S = ∑_{i=1}^n X_i, let
S* = (S − np) / √(np(1 − p)),
and let Z be a standard normal random variable. The normal approximation to S is
P[a ≤ S* ≤ b] ≈ P[a ≤ Z ≤ b].
Remark 9.7. When do we use a normal approximation, and when do we use a Poisson
approximation? Roughly, we use the normal approximation when n is very large and np is
also very large. We use the Poisson approximation when n is very large but p is not large.
This isn’t the focus of the current course, so I won’t say much more about it here.
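As a rough illustration of Remark 9.7 (my own sketch, not from the textbook), here are the two approximations next to the exact binomial answer in each regime:
# n large, p small: approximations to P[X >= 8] for X ~ Binomial(200, 0.01)
1 - pbinom(7, 200, 0.01)                       # exact
1 - ppois(7, 200 * 0.01)                       # Poisson approximation (tracks the exact value well)
1 - pnorm((7.5 - 2) / sqrt(200*0.01*0.99))     # normal approximation (poor in this tail)
# n large, np large: approximations to P[X > 270] for X ~ Binomial(500, 0.5)
1 - pbinom(270, 500, 0.5)                      # exact
1 - pnorm((270.5 - 250) / sqrt(500 * 0.25))    # normal approximation (very close)
1 - ppois(270, 250)                            # Poisson approximation (noticeably off)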
9.2.3. Gamma Distribution. We introduce the Gamma distribution now as a place-holder,
and will see how it shows up ‘in the real world’ next lecture. To give a small preview, we
often use a Poisson distribution to model the number of ‘successes’ of a process that occurs
over time - for example, the number of buses that arrive at a bus stop. We saw two classes
ago that the time between successes of a Poisson process had exponential distribution. We
will see that the time between successes of a more general type of process has Gamma distribution.
Before defining the Gamma distribution, we need:
Definition 9.8 (Gamma Function). The gamma function is:
Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.
This has a number of nice properties:
Γ(1) = 1
Γ(0.5) = √π
Γ(α) = (α − 1)Γ(α − 1),   α > 1
Γ(n) = (n − 1)!,   n ∈ N.
The first is easy and the third and fourth come from integration by parts. The second is
a little tricky!
Definition 9.9 (Gamma Distribution). The Gamma distribution with parameters (α, β) has PDF:
f_X(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β}
for x > 0.
This is quite a confusing density! A few comments:
• β is called the ‘scale’ parameter. Changing it doesn’t change the shape of the distribution, it just scales it vertically and horizontally.
• α is called the ‘shape’ parameter. It radically changes the shape of the distribution,
but is hard to understand otherwise.
• α = 1 gives the exponential distribution, which we’ve seen.
• α = n2 , β = 2 gives the χ2 distribution with n degrees of freedom, which is important
in statistics.
• The distribution seems impossible to integrate in general. However, when α is an
integer, there is a magical formula for the CDF:
F_X(x) = 1 − ∑_{i=0}^{α−1} ((x/β)^i / i!) e^{−x/β}.
The sum is exactly a Poisson CDF (with mean x/β), evaluated at α − 1.
• The distribution has:
E[X] = αβ
V ar[X] = αβ 2 .
Remark 9.10. I don’t expect you to memorize these identities, besides the expectation and
variance. However, it is helpful to know that they exist.
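For instance, the integer-α CDF identity above can be checked in R against pgamma (my own sketch; note that pgamma's scale argument matches β here):
alpha <- 3; beta <- 2; x <- 5
pgamma(x, shape = alpha, scale = beta)    # Gamma CDF
1 - ppois(alpha - 1, lambda = x / beta)   # Poisson form of the same quantity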
9.2.4. Weibull Distribution. The Weibull distribution is another generalization of the exponential distribution. Like the Gamma distribution, the Weibull distribution is quite flexible.
Unlike the Gamma distribution, it is fairly easy to do calculations involving the Weibull
distribution.
The main interest of the Weibull distribution is that, like the Poisson distribution and
the normal distribution, the Weibull distribution is ‘universal.’ Recall that the number of
successes of many independent, low-probability trials is (approximately) Poisson, while the
sum of many independent, small random variables is (approximately) normal. There is a
similar result for the Weibull: the maximum of many small, independent random variables
is approximately Weibull.
Definition 9.11 (Weibull Distribution). The Weibull distribution with parameters β, η has
PDF:
f_X(x) = (β/η) (x/η)^{β−1} e^{−(x/η)^β},
for x ≥ 0 and β, η > 0.
Remark 9.12. You will sometimes see the Weibull distribution written with a different
number of parameters. I follow the textbook’s notation.
The Weibull has some nice properties:
• When β = 1, this is the exponential distribution.
• For once, we can write down the CDF:
F_X(x) = 1 − e^{−(x/η)^β}.
This is nice enough that it is worth remembering! It is just like the exponential
random variable!
• The moments are:
E[X] = η Γ(1 + β^{−1})
Var[X] = η^2 (Γ(1 + 2β^{−1}) − Γ(1 + β^{−1})^2).
A typical question might be:
Example 52. X has Weibull distribution with parameters η = 2 and β = 4. What is
P[X > 1]?
We calculate:
P[X > 1] = 1 − P[X ≤ 1] = e^{−(1/2)^4} ≈ 0.939.
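The same number can be obtained from R's pweibull, whose shape and scale arguments correspond to β and η respectively (my own check):
pweibull(1, shape = 4, scale = 2, lower.tail = FALSE)   # about 0.939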
As I mentioned, the Weibull distribution has some ‘universality’ property. This universality is a little more complicated than the central limit theorem, and I won’t give a precise
statement - it is the subject of extreme value theory. However, we can easily check that
Weibull distributions are stable:
Example 53. Assume X1 , . . . , Xn are independent Weibull random variables with parameters η, β. Then
P[min(X_1, ..., X_n) ≥ s] = P[X_1 ≥ s]^n = e^{−n(s/η)^β} = e^{−(s/(η n^{−1/β}))^β}.
Note that this is, itself, the survival function of a Weibull random variable, with parameters β and η n^{−1/β}.
10. Lecture 9: More of Chapter 7 (Feb. 9)
10.1. Summary.
• A few administrative details.
• Remember: Homework 2 is due at the start of next class!
• Today, we discuss reliability modeling and introduce phase-type distributions.
10.2. Lecture.
10.2.1. Reliability Modeling. The Weibull distribution is often used in reliability modeling,
and so the textbook introduces the two together. However, in principle, they are different
topics: the Weibull distribution is just some special distribution, while the ideas related to
reliability modeling can be used with any distribution. For now, we introduce reliability
modeling, then relate it back to the Weibull distribution.
There is no new mathematics in the reliability modeling that we’re discussing; the new
content is in subject matter rather than mathematics. The point of reliability modeling
is to study random variables that represent the lifetime of some object; frequently, this is
either the lifetime of a human or the lifetime of a complicated manufactured system. The
mathematical content consists of some formulas and definitions that provide some insight
into these applications. The first are quite simple:
Definition 10.1 (Reliability). The reliability of a random variable is
RX (t) = 1 − FX (t).
The survivor function is exactly the same function:
SX (t) = 1 − FX (t).
In the context of survival times, these are often more convenient to discuss than the CDF.
The prototypical example is:
Example 54. We consider a truck with 8 wheels. Let Xi be the time that the i’th wheel
lasts, and let X = min(X_1, ..., X_8). Assume that X_i is Weibull with parameters η, β. Then
P[X > s] = R_X(s) = ∏_{i=1}^8 R_{X_i}(s) = e^{−8(s/η)^β} = e^{−(s/(η 8^{−1/β}))^β}.
Remark 10.2. Note that we can make some heuristic comments here! Note that:
• If we take the minimum of n Weibull random variables with the same parameters β, η, the new random variable has parameters β, η n^{−1/β}.
• Recall that a Weibull distribution has expectation E[X] = η Γ(1 + β^{−1}).
• Combining these, we see that the expected value of the minimum of n Weibull variables looks like
E[X] = n^{−1/β} η Γ(1 + β^{−1}).
Thus, if β is small, the expectation drops very quickly with n; if β is large, the expectation is insensitive to n. This gives a bit of interpretability to β, which looks quite mysterious: the larger the value of β, the less the lifetime 'fluctuates.' It even gives an exact tradeoff between the number of systems n and the 'reliability' β, if you happen to have control over both.
Another typical example is:
Example 55. I have 8 batteries with me, and my flashlight will fail once all 8 have broken.
Let Xi be the failure time of the i’th battery, and let X = max(X1 , . . . , X8 ). Assume Xi is
exponentially distributed with mean µ. Then
F_X(s) = ∏_{i=1}^8 F_{X_i}(s) = (1 − e^{−s/µ})^8.
Remark 10.3. It turns out that this maximum is harder to deal with than the minimum,
and there is only a ‘nice’ formula for the exponential distribution. When the number of
components is n, the maximum satisfies:
E[max(X_1, ..., X_n)] = E[X_1](1 + 1/2 + ... + 1/n) ≈ E[X_1] log(n).
Although this formula was computed for the exponential distribution, it turns out that the
general principle that the maximum ‘looks like log(n)’ holds for many distributions.
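A small simulation (my own sketch) illustrates the harmonic-sum formula and the log(n) heuristic:
n <- 50; mu <- 10                          # 50 exponential components with mean 10
mu * sum(1 / (1:n))                        # exact expected maximum, mu * H_n
mu * log(n)                                # the rougher log(n) heuristic
set.seed(1)
mean(replicate(20000, max(rexp(n, rate = 1/mu))))   # simulation agrees with the exact value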
The next definition is less familiar:
Definition 10.4 (Hazard Rate). For a continuous random variable X, the hazard rate is
given by:
h_X(t) = f_X(t) / R_X(t).
As the textbook points out, this is a measure of the chance that a device is ‘about to
fail’ if it has survived until time t. More formally, the following calculation holds for ‘nice’
PDF’s:
lim_{h→0} (1/h) P[X ≤ t + h|X > t] = lim_{h→0} (1/h) P[t < X ≤ t + h] / P[t < X]
                                   = lim_{h→0} (1/h) ∫_t^{t+h} f_X(s) ds / R_X(t)
                                   = f_X(t) / R_X(t).
Example 56. If X has exponential distribution with mean 1/λ,
h_X(t) = λ e^{−λt} / e^{−λt} = λ.
The fact that the hazard rate is constant is essentially a restatement of the memoryless
property - the probability that a device is ‘about to fail’ given that it has survived until time
t doesn’t depend on t.
The book has an interesting calculation that shows how hazard rates apply to minimums
of random variables:
Example 57 (7.16 from textbook). We consider a collection of i.i.d. continuous random
variables X1 , . . . , Xn and let X = max(X1 , . . . , Xn ). In the jargon of reliability modeling, we
are considering n objects ‘in parallel,’ just as with the battery example above. We recall that
FX (t) = FX1 (t)n .
Since F_{X_1}(t)^n ≤ F_{X_1}(t), this tells us something rather obvious: the more objects you
put in parallel, the longer the circuit will last. Let's see what the hazard function looks like:
h_X(t) = f_X(t) / (1 − F_X(t))
       = n F_{X_1}(t)^{n−1} f_{X_1}(t) / (1 − F_{X_1}(t)^n)
       = (f_{X_1}(t) / (1 − F_{X_1}(t))) · (n F_{X_1}(t)^{n−1} / ∑_{i=0}^{n−1} F_{X_1}(t)^i)
       = h_{X_1}(t) · (n F_{X_1}(t)^{n−1} / ∑_{i=0}^{n−1} F_{X_1}(t)^i).
As t goes to infinity, all of the terms in the parentheses go to 1. Thus,
lim_{t→∞} h_X(t) / h_{X_1}(t) = 1.
This tells us: conditional on having survived until a very large time t, the hazard rate for
n objects in parallel is very close to the hazard rate for a single object, regardless of the
details of the hazard function (though, if you look at the calculation carefully, it is always at
least a little smaller).
Why is this obvious? We can consider: what is the probability that two or more objects
have survived for a large time t, conditional on at least one having survived? The answer is
that this probability certainly goes to 0. Thus, for very large times t, conditioning on at
least one object having survived is very close to conditioning on exactly one object having
survived.
Finally, we give one last definition and a few easy formulae:
Definition 10.5 (Cumulative Hazard). The cumulative hazard of a random variable is given
by
H_X(t) = ∫_0^t h_X(s) ds.
We relate some of the new objects we have:
h_X(t) = −(d/dt) log(R_X(t))
R_X(t) = e^{−∫_0^t h_X(s) ds} = e^{−H_X(t)}.
This last formula is very important for the standard textbook problems, many of which give
you a hazard function and ask you to calculate probabilities. For example,
Example 58 (7.5.8 from textbook). The hazard rate of a random variable X is given by
h(t) = (1/5) e^{t/4},
t > 0. What are the cumulative hazard function and reliability function for X? What is
P[X > 2]?
We directly calculate
H(t) = ∫_0^t h(s) ds = (1/5) ∫_0^t e^{s/4} ds = (4/5)(e^{t/4} − 1).
Thus,
R(t) = e^{−H(t)} = e^{−(4/5)(e^{t/4} − 1)},
and so
P[X > 2] = R(2) ≈ 0.595.
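These numbers can be double-checked numerically in R (my own sketch, using integrate for the cumulative hazard):
h  <- function(t) exp(t/4) / 5
H2 <- integrate(h, lower = 0, upper = 2)$value   # cumulative hazard at t = 2
exp(-H2)                                         # reliability R(2)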
10.2.2. Phase-Type Distributions: The Goal. Today, we will be talking about phase-type
distributions. These are important for a few reasons. The first is that they give our first
introduction to Markov chains and Queues, which are central to this course. However, that
is not the emphasis in the textbook. Instead, phase type distributions are motivated as a
computational tool. That is, one reason to study the phase type distribution is as a way to
approximate complicated integrals. This is quite different from anything that we’ve talked
about in this course, which has mostly focused on distributions as models. Since the textbook doesn’t spend much time discussing this viewpoint in this chapter, it is worth taking
a detour to talk about computation and probability.
We’ll talk about this in the context of an important problem in applied probability that
is (largely) out of the scope of this course. Let’s say you are trying to price an option - for
example, you want to charge a client a fee F_0, in exchange for which the client has a right
to buy a stock for a certain amount e^{S_0} at a certain time T in the future. In this case, your
client would gain money if e^{S_T} > e^{S_0} + F_0; otherwise, you would gain money. The 'classic'
approach to this problem is, roughly, to assume that the increments
S_{t+1} − S_t
are i.i.d. normal random variables. If you make this careful and do the appropriate transformations, this leads to the Black-Scholes formula. In the real world, however, S_{t+1} − S_t is not
normal. It may even depend a great deal on t - for example, there might be a lawsuit pending, and there might be uncertainty as to when the lawsuit is to be resolved. One approach
is to try to build a very complicated model, like
S_{t+1} − S_t = A max(B, C, e^D log(E)) + 1/(1 + Q),
where each of the random variables A, B, . . . might themselves be complicated - some might
correspond to court cases, or patents, or weather, or anything else. While this model might
be realistic, it is very hard to calculate things like P[St+1 −St < 1] - that is some huge integral!
Phase-type distributions are built to get around this issue. The central observation is that
it is very easy to do complicated calculations with the exponential distribution. Thus, if
we could replace each ‘part’ of the above expression with a ‘bunch’ of exponential random
variables, we would be able to combine them, and thus actually be able to do the calculation. It is generally not possible to replace a complicated distribution with a bunch of
exponential random variables. However, it turns out to be possible to get an arbitrarily
good approximation using them. Thus, we replace each ‘part’ of a complicated distribution
with a phase-type distribution that approximates it, and then we do calculations with those
phase-type distributions.
The study of phase-type distributions started before we had modern computers - one of
the main theorems in the subject dates back to 1955. This leads to a natural question:
is this approach still useful, now that we have fast computers? The answer turns out to
be yes. Even though we can do more complicated integrals now, it turns out that doing
high-dimensional integrals is very, very hard, even on a computer. The study of how to
approximate complicated distributions with simpler and more tractable distributions is still
very much alive, and there are good reasons to believe that it will remain an important
subject.
10.2.3. Erlang Distribution. So, we’re going to build a big collection of random variables
out of exponential distribution, with the goal of being able to approximate complicated
distributions. To do this, we need two things that we didn’t need in the rest of chapters 6
and 7:
• Some way(s) to check that the random variables we’re constructing are ‘really’ different, and
• Some way to keep track of all of our indices.
We will measure the first by looking at the relationships between the moments of our
distributions, and at their Laplace transforms. We will eventually do the second by building
a bunch of matrices and vectors, but initially we will do this using a bunch of pictures. The
pictures aren’t necessary, but they are quite useful for most of us - and they become even
more helpful as we continue our study of Markov chains in the rest of the course.
The first picture is:
This picture represents the following process: somebody enters a queue on the left, waits
in the circle for a random amount of time that is exponential with mean λ, and then leaves
on the right. The associated phase-type random variable is the amount of time that they
wait, which is of course exponential with mean λ. So far, we have one parameter’s worth of
random variables, and we can get:
E[X] = λ
Var[X] = λ^2
c_v(X) ≡ √(Var[X]) / E[X] = 1
L(s) = 1/(1 + λs),
where c_v(X) is called the coefficient of variation. Once we've drawn this, the obvious next
picture is:
Here, somebody comes in from the left, waits an amount of time X_1 ∼ exp(λ), then waits an amount of time X_2 ∼ exp(λ), then exits to the right. So, the phase-type distribution is given by:
X = X_1 + X_2.
This is called an Erlang-2 distribution. We can calculate for this:
E[X] = 2λ
Var[X] = 2λ^2
c_v(X) = 1/√2
L(s) = 1/(1 + λs)^2.
So, we get something new! In particular, we get a second value for the coefficient of variation.
Once we have two things in a line, the obvious thing to do is put many things in a line:
Here, we fix r and let X_1, ..., X_r be a sequence of r i.i.d. exponential random variables, each with mean λ. We set X = ∑_{i=1}^r X_i. The distribution of X is called Erlang-r.
We can calculate for this:
E[X] = rλ
Var[X] = rλ^2
c_v(X) = 1/√r
L(s) = 1/(1 + λs)^r.
So, we get many new things... but we certainly don't get everything. Our coefficients of variation all look like 1/√r for r an integer. Similarly, all of the Laplace transforms are reciprocals of powers of a single linear polynomial - we certainly don't get all Laplace transforms, or anything like all of them. So, we continue to define more general phase-type distributions.
Remark 10.6. The textbook has explicit CDF’s for the Erlang-r distributions. You should
be able to use these formulas, but we won’t discuss how to derive them.
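In practice the Erlang-r CDF is most easily evaluated with pgamma; here is a sketch (my own, using the 'mean λ' convention of this section, so pgamma's scale parameter is λ):
r <- 3; lambda <- 2; x <- 7
pgamma(x, shape = r, scale = lambda)    # Erlang-r CDF at x
set.seed(1)
mean(replicate(50000, sum(rexp(r, rate = 1/lambda))) <= x)   # simulation check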
11. Lecture 10: Remainder of Chapter 7 (Feb. 12)
11.1. Summary.
• A few administrative details.
• Remember: Homework 2 is due at the start of class!
• Today, we finish our discussion of phase-type distributions.
11.1.1. Hypoexponential Distribution. So far, our diagrams are sequences of circles, each with
the same number in them. We’re going to make the diagrams more complicated soon, but
first we just change the number inside each circle:
We do the obvious thing: let X_i ∼ exp(λ_i), and let X = ∑_{i=1}^r X_i. This is called a hypoexponential distribution. We can calculate for this:
E[X] = ∑_{i=1}^r λ_i
Var[X] = ∑_{i=1}^r λ_i^2
c_v(X)^2 = ∑_{i=1}^r λ_i^2 / (∑_{i=1}^r λ_i)^2
L(s) = 1 / ∏_{i=1}^r (1 + λ_i s).
So, we get a little more than before. In particular, we could always get any expectation, but now we seem to be able to get just about any possible coefficient of variation that is less than 1... or at least there aren't any such values that are obviously impossible to obtain.
11.1.2. Hyperexponential Distributions. For the hyperexponential distribution, we use graphs
that aren’t lines! It actually isn’t so clear what that should mean, so first we describe
something a bit easier:
Definition 11.1 (Mixture). Consider a collection of PDFs f_1, f_2, ..., f_n and fix 0 ≤ α_1, ..., α_n satisfying ∑_{i=1}^n α_i = 1. Then
f(s) = ∑_{i=1}^n α_i f_i(s)
is also a PDF. It is referred to as a mixture of the densities f1 , . . . , fn . It is sometimes just
called a mixture distribution as shorthand.
Remark 11.2. Mixture distributions are generally easiest to understand algorithmically. To
sample X from a mixture distribution (e.g. on a computer), do the following:
• Sample a number 1 ≤ i ≤ n according to the PMF fY (i) = αi .
• Sample X from fi .
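The two-step recipe above translates directly into R; here is a minimal sketch (my own) for sampling from a hyperexponential distribution with branch probabilities alpha and branch means lambda:
rhyperexp <- function(n, alpha, lambda) {
  branch <- sample(seq_along(alpha), size = n, replace = TRUE, prob = alpha)  # step 1: pick a branch
  rexp(n, rate = 1 / lambda[branch])                                          # step 2: exponential wait on that branch
}
x <- rhyperexp(100000, alpha = c(0.3, 0.7), lambda = c(1, 10))
mean(x)   # should be close to 0.3*1 + 0.7*10 = 7.3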
The hyperexponential distributions are just mixtures of exponential distributions. The
associated diagram is:
To read this diagram: we go in on the left. With probability α_1, we take the top branch and wait X_1 ∼ exp(λ_1) before exiting to the right; with probability α_2 = 1 − α_1, we take the bottom branch and wait X_2 ∼ exp(λ_2) before exiting to the right. In general, the diagram has r branches (draw on board); we take branch i with probability α_i and then wait an amount of time X_i ∼ exp(λ_i). We make similar calculations to before:
E[X] = ∑_{i=1}^r α_i λ_i
E[X^2] = 2 ∑_{i=1}^r α_i λ_i^2
c_v(X)^2 = 2 ∑_{i=1}^r α_i λ_i^2 / (∑_{i=1}^r α_i λ_i)^2 − 1
L(s) = ∑_{i=1}^r α_i / (1 + λ_i s).
It isn't obvious on sight, but c_v(X)^2 ≥ 1 - this is new ground! See p. 166 of the textbook for a proof.
The obvious next step looks like:
That’s basically true, though there is another named distribution that is a little less general
than this, called the Coxian or extended Erlang distribution:
This can be written in the form of the previous figure (see figure 7.14 of the textbook).
Continuing our standard calculations, it is easy to check:
E[X] = ∑_{i=1}^r λ_i ∏_{j=1}^{i−1} α_j.
Unfortunately, the others are more annoying to write down. If you are interested, none of
them are ‘hard’ to write down, and many of these calculations are on pp. 167 of the textbook.
However, they are all fairly tedious, and they are all complicated enough to be difficult to
interpret. In a sense, that is a good thing: remember that we’re trying to approximate
general random variables using phase-type random variables, so these calculations had better
get messy!
11.1.3. General Phase-Type Distributions. Finally, we write down general phase-type distributions. The pictures associated with these general distributions have:
• Some collection of arrows entering from the left.
• Some collection of circles with parameters inside, and with arrows between them.
• Some collection of arrows exiting to the right.
Note that, unlike the pictures we’ve seen so far, we do not insist that all of the arrows
between circles move to the right. Here’s a typical example:
To interpret this sort of diagram, we run the following process:
• At time 0, pick one of the arrows entering from the left, according to some distribution
on those arrows.
• Every time you enter a circle with the number λ in it, wait time exp(λ). Then choose
one of the outgoing arrows, according to the numbers on those arrows.
• When you finally exit on the right-hand side, record the final time.
At this point, we’ll move from the ‘picture’ representation of phase-type distributions to a
more algebraic representation. The parameters of a phase-type distribution are:
• The number n of 'circles,' or states, in the diagram.
• The rates µ_i, the inverse means of the holding times: the holding time in state i is exponential with mean 1/µ_i.
• A distribution {σ_i}_{i=1}^n of entrance probabilities. The initial entry is given by sampling from this distribution. This is denoted by the length-n vector σ.
• The routing probabilities {r_ij}_{1≤i,j≤n}, which satisfy ∑_j r_ij ≤ 1 for all 1 ≤ i ≤ n. The number r_ij gives the probability of going next to state j after waiting in state i. This is denoted by the n by n matrix R.
• The terminal probabilities {η_i}_{i=1}^n, which satisfy η_i + ∑_j r_ij = 1 and max_i η_i > 0. The number η_i gives the probability of immediately exiting to the right after leaving state i. This is denoted by the length-n vector η.
There is an equivalent description, which basically merges {rij }1≤i,j≤n and {ηi }ni=1 into
one object. Heuristically, we just add a new state, called the graveyard state, all the way at
the right. Rather than counting time until we ‘exit’ the diagram, we count the time until
we get to the graveyard state. This doesn’t change any of the actual distributions,
it just slightly changes the notation and diagrams! To compare our ‘original’ diagrams to a
diagram with a graveyard:
Writing the graveyard as a ∆ is a common convention, which I will follow. We also often merge some information, defining r_{i∆} = η_i, r_{∆i} = 0, and
q_ij = µ_i r_ij,   j ≠ i,
q_ii = −∑_{j≠i} q_ij.
We write the associated matrix as Q. This changes the algebraic representation a little bit:
• The number n of states is as before.
• The distribution {σi }ni=1 of entrance probabilities is as before.
• The routing and terminal probabilities {rij }1≤i,j≤n and {ηi }ni=1 , as well as the inverse
mean holding times {µi }ni=1 , get smooshed into one object, the transition matrix Q.
Example 59. We draw the diagram for n = 5, σ_i ∝ i, r_ij = (1/2) 1_{|i−j|=2} for i ≠ 5,
r_{5j} = (1/4) 1_{|5−j|=2}, and η_i = (1/2) 1_{i=5}. This looks like a 5-sided star, with one arrow to the right:
We follow the textbook's example by defining the n by n matrix S and the n by 1 matrix S^0 via:
Q = [ S   S^0 ]
    [ 0    0  ]
Define e to be the column vector of all 1's. There turn out to be reasonably nice formulas for these general diagrams! Let X be the amount of time it takes to get to the graveyard state:
F_X(x) = 1 − σ e^{Sx} e
E[X^j] = (−1)^j j! σ S^{−j} e.
So, doing these calculations involves inverting matrices. This can be tedious, and there
are more places to make mistakes... but it isn’t so difficult. Here is a standard (simple)
question:
Example 60. Consider a network with
σ = (0.2, 0.2, 0.3, 0.3)
and



Q = [ −2    1     0    0    1   ]
    [  1   −3     1    0    1   ]
    [  0    1   −2.5   1   0.5  ]
    [  1    0     0   −2    1   ]
    [  0    0     0    0    0   ].
We then have

S^{−1} = [ −0.65  −0.3   −0.25  −0.325 ]
         [ −0.25  −0.5   −0.25  −0.125 ]
         [ −0.1   −0.2   −0.5   −0.05  ]
         [ −0.05  −0.1   −0.25  −0.525 ].
NOTE: You need to be able to invert a small matrix, since a small example of this is likely
to be on an exam. However, when doing problems at home, feel free to use the R command
solve(S)
We then have
E[X] = −σ S^{−1} e = 0.95.
NOTE: Again, matrix multiplication can be done in R, though you will need to be able to
do this on exams.
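For completeness, here is what that calculation looks like in R, using the matrices exactly as printed above (a sketch of my own; matrices are entered row by row with matrix(..., byrow = TRUE)):
sigma <- c(0.2, 0.2, 0.3, 0.3)
S <- matrix(c(-2,  1,    0,  0,
               1, -3,    1,  0,
               0,  1, -2.5,  1,
               1,  0,    0, -2), nrow = 4, byrow = TRUE)
e    <- rep(1, 4)
Sinv <- solve(S)                            # numerical inverse of S
EX   <- -sigma %*% Sinv %*% e               # E[X]   = -sigma S^{-1} e
EX2  <- 2 * sigma %*% Sinv %*% Sinv %*% e   # E[X^2] =  2 sigma S^{-2} e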
11.1.4. Fitting Phase-Type Distributions. Section 7.6.7 of the textbook discusses matching
phase-type distributions to data. This isn’t the emphasis of this course, so I just mention
that it exists! The coverage in the textbook is short; just enough to say that fitting is
possible, but without any discussion as to when it is a good idea.
12. Lecture 11: Chapter 8 (Feb. 23)
12.1. Summary.
• A few administrative details.
• Today, we discuss theorems.
12.2. Lecture. This isn’t a very theoretical course, but today will be devoted to theory.
That means approximating things, taking limits, and so on.
12.2.1. Markov’s Inequality. We know that, if a ≤ X ≤ b,
a ≤ E[X] ≤ b
as well. This says that, if we know about the values X can take, we know something about its
expectation. Markov’s inequality provides a more interesting bound in the other direction:
if we know something about the expected value of X, we learn something about the values
that X is likely to take:
Theorem 61 (Markov’s Inequality). Let X be a random variable and h a nondecreasing,
nonnegative function. Then
P[X ≥ t] ≤ E[h(X)] / h(t).
How do we use this? Here is a silly application:
Example 62. Let X be the height, in inches, of a randomly chosen Canadian. This is
obviously nonnegative, and E[X] ≈ 69. Then
P[X > 144] ≤ 69/144 ≈ 0.48.
This is obviously right, but obviously not very useful. This is because we aren't using very much information; the expectation doesn't tell us a vast amount about possible values far from the expectation.
We've seen that Markov's inequality can be quite loose. Is it always this bad? The answer
is no: the expectation simply doesn’t say very much.
Example 63. Fix r ≥ 1, and let X be r with probability 1/r and 0 otherwise. Then
E[X] = r (1/r) + 0 (1 − 1/r) = 1.
We then note that
P[X ≥ r] = 1/r = E[X]/r.
Thus, Markov's inequality is actually an equality in this case. In math jargon, we say that the inequality is tight. This means that you can't improve the inequality without adding some assumptions.
There are many special cases of Markov’s inequality (that is, choices of the function h)
that have their own names. The first is
Theorem 64 (Chebyshev's Inequality). Let X be a random variable with mean µ and variance σ^2. Then
P[|X − µ| > s] ≤ σ^2 / s^2.
This is just a special case of Markov's inequality. Heuristically, this is much better because X^2 grows much more quickly than X when X is large. Thus, if E[X^2] is not much larger than E[X]^2, X must be small most of the time. Let's look at heights:
Example 65. Human height X has mean 69 inches and variance 16. Thus, we have
P[X > 144] ≤ P[|X − 69| > 75] ≤ 16/5625 ≈ 0.00284.
This is still an overestimate - it is certainly not true that 2 in a thousand people are over 12
feet tall. But it is much more reasonable than the previous estimate!
Again, Chebyshev's inequality is sharp by itself. The last 'famous' application of Markov's inequality involves the choice h(x) = e^{θx} for some θ > 0.
Theorem 66 (Chernoff Bound). Let X be a random variable with moment generating function M_X(θ) = E[e^{θX}]. Then, for any θ > 0,
P[X > s] ≤ e^{−θs} M_X(θ).
This one is a little harder to understand, since most of us don’t have a good intuitive
understanding of how big a moment generating function should be. The basic idea, though,
is similar to the idea behind using Chebyshev’s inequality over the standard Markov’s inequality: just as X 2 grows more quickly than X, eθX grows more quickly than X 2 . Thus,
we can expect Chernoff’s inequality to give us better bounds for s ‘moderately large.’
Example 67. Assume that human height X has E[e^X] ≤ 8 × 10^{46} (NOTE: we should be a little bit sceptical about using this estimate. It is based on the tallest recorded human, but that is probably not really good enough to write down this sort of bound. We'll continue for now, as doing inference about the tails of distributions is far outside the scope of this course). Then
P[X > 144] ≤ e^{−144} × 8 × 10^{46} ≈ 2 × 10^{−16}.
This is much, much smaller than what we had in the previous examples, even though 10^{46} is a huge number. (NOTE: It is hard to evaluate this number. There have been about 10^{12} people in the world, and nobody has ever been confirmed at a height of close to 12 feet. Again, this is meant just as probability, not statistics).
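The three bounds for the height example can be compared side by side in R (my own sketch, using the numbers quoted above):
69 / 144           # Markov bound on P[X > 144]
16 / 75^2          # Chebyshev bound
exp(-144) * 8e46   # Chernoff-type bound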
12.2.2. Law of Large Numbers. We know that, if you flip a coin many, many, times, the
percentage of heads will eventually be close to 1/2. The law of large numbers is a way to turn
this into math. Here is a first way to do so:
Theorem 68 (Weak Law of Large Numbers). Let {X_i}_{i∈N} be a sequence of i.i.d. random variables with E[X_1^2] < ∞ and define S_n = (1/n) ∑_{i=1}^n X_i. Then for any ε > 0,
lim_{n→∞} P[|S_n − E[X_1]| > ε] = 0.
Proof. Fix ε > 0. By Chebyshev's inequality,
P[|S_n − E[X_1]| > ε] ≤ Var[S_n]/ε^2 = Var[X_1]/(ε^2 n) → 0.
Remark 12.1. We don’t quite need E[X 2 ] < ∞ for the conclusion to be true.
Ok, so this is called the weak law of large numbers. To say where this word comes from,
we talk a little bit about what we want the law of large numbers to say, and what the weak
law of large numbers actually says. We’ll turn this into math soon, but the problem is easy
to understand. When considering coin-flipping, the WLLN says something like: fix ε = 0.1;
then for n very large, the probability that the proportion of heads is more than 0.6 or less
than 0.4 is very small. We want to be able to say that the sequence Sn actually converges
to 1/2, but the WLLN doesn't tell us this. For example, it can't exclude the possibility that
Sn careens back and forth between 0 and 1, just spending an increasing percentage of its
time near 0.5. Here’s an illustration of what we would like Sn to look like as n increases,
next to a picture that the WLLN doesn’t exclude:
Ok, the second picture is obviously crazy - it doesn’t happen. So, there should be a better
LLN. Let’s make this careful:
Definition 12.2 (Weak Convergence of Random Variables). Let {X_i}_{i∈N} be a sequence of random variables and let c ∈ R. We say that {X_i}_{i∈N} converges to c weakly if, for all ε > 0,
lim_{n→∞} P[|X_n − c| > ε] = 0.
Thus, the WLLN can be restated as:
Theorem 69 (WLLN, Redux). Let {X_i}_{i∈N} be a sequence of i.i.d. random variables with E[X_1^2] < ∞ and define S_n = (1/n) ∑_{i=1}^n X_i. Then {S_n}_{n∈N} converges weakly to E[X_1].
We contrast weak convergence with another form of convergence:
Definition 12.3 (Almost Sure Convergence). Let {X_i}_{i∈N} be a sequence of random variables and let c ∈ R. We say that {X_i}_{i∈N} converges to c almost surely if
P[lim_{n→∞} X_n = c] = 1.
Remark 12.4. The phrase ‘almost surely’ is some math jargon that we don’t investigate in
this course. The definition above will be good enough for us, even if the phrase is a little
mysterious.
Remark 12.5. Note: this really does make sense, based on our naive notion of a limit! The X_n's are a sequence of functions from a sample space Ω, and so we can check whether lim_{n→∞} X_n(ω) = c for 'almost all' ω ∈ Ω.
How are these definitions related? Unsurprisingly, weak convergence is less desirable than
almost sure convergence. In particular, the latter implies the former:
Theorem 70. If {Xn }n∈N converges to c almost surely, it also converges to c weakly.
But the former does not imply the latter!
Example 71 (Weak but not A.S. Convergence). Define the sequence of independent random variables {X_n}_{n∈N} by having X_n = 1/2 with probability 1 − 2/n, equal to 0 with probability 1/n, and equal to 1 with probability 1/n. It is easy to check that, for any 0 < ε < 1/2,
lim_{n→∞} P[|X_n − 1/2| > ε] = lim_{n→∞} 2/n = 0,
so the sequence converges weakly to 1/2. On the other hand, if it converged a.s. to 1/2, there would have to be some step τ so that X_t = 1/2 for all t > τ. It turns out that this happens with probability 0!
There are some related constructions. Define {τ_n}_{n∈N} to be a sequence of geometric random variables, with means 2^n, define b_n = ∑_{i=1}^n τ_i, and define B = {b_i}_{i∈N}. We then set Y_n = 1_{n∈B}. It is easy to check that Y_n converges to 0 weakly, but not almost surely.
These examples are exactly about the ‘bad picture’ I was trying to avoid. So it should be
no surprise that:
Theorem 72 (Strong Law of Large Numbers). Let {X_i}_{i∈N} be a sequence of i.i.d. random variables with E[X_1^2] < ∞ and define S_n = (1/n) ∑_{i=1}^n X_i. Then {S_n}_{n∈N} converges to E[X_1] almost surely.
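A picture is worth a lot here; the following R sketch (my own) plots a running average of coin flips, which settles down to 1/2 as the SLLN predicts:
set.seed(1)
n <- 10000
flips <- rbinom(n, size = 1, prob = 0.5)
running.mean <- cumsum(flips) / seq_len(n)
plot(running.mean, type = "l", ylim = c(0, 1),
     xlab = "n", ylab = "proportion of heads")
abline(h = 0.5, lty = 2)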
12.2.3. Central Limit Theorem. This is probably the most famous theorem in statistics. The
idea is: the law of large numbers tell us that sample averages converge to ‘true’ means; the
central limit theorem tells us how quickly this happens. We’ll see that taking that claim too
seriously can be a little misleading, but this is the goal.
Theorem 73 (Central Limit Theorem). Let {X_i}_{i∈N} be a sequence of i.i.d. random variables with E[X_1^2] < ∞ and denote by Φ(x) the CDF of the standard Normal random variable. Then, for all x,
lim_{n→∞} P[ ∑_{i=1}^n (X_i − E[X_i]) / √(n Var[X_1]) ≤ x ] = Φ(x).
Lots of comments on this!
Remark 12.6. The formulation in the textbook is a little bit dangerous. It mentions that
you don’t need the Xi to be i.i.d. for something that looks like the CLT to hold, but doesn’t
really say what replaces this requirement. It is easy to check that you need some condition - for example, the conclusion can't hold for X_i ∼ exp(2^{2^i}). The short answer is that, for n
large, the sum can’t be dominated by a small number of terms. To see a slightly more specific
answer, see the Lindeberg CLT. A huge amount of probability theory is devoted to proving
things that look like the CLT, so there are very, very many more specific answers.
Remark 12.7. The LLN said that |S_n − E[X_1]| → 0. The CLT refines this, suggesting that |S_n − E[X_1]| ≈ σ/√n → 0. Note that the CLT implies the weak LLN, but does not imply the strong LLN. There are stronger theorems that incorporate both.
Remark 12.8. The CLT is a limit theorem - it doesn’t say anything for finite values of n,
and it doesn't guarantee convergence for all x simultaneously. For example, you cannot use the CLT to calculate P[∑_{i=1}^n X_i > 10]; you can't even get a rigorous estimate! Nonetheless,
it is popular to pretend that the CLT is really an equality for n ‘reasonably large.’ This
is often fairly safe. If you are interested in justifying this sort of thing, a first step is the
Berry-Esseen theorem.
Remark 12.9. We won’t quite prove the CLT in this class, but we’ll give an outline of the
'standard' proof. Let M_X(θ) be the MGF of X_1 − E[X_1], and let M_{S_n}(θ) be the MGF of S_n = (1/√(Var[X_1] n)) ∑_{i=1}^n (X_i − E[X_i]). By our rules for MGFs, we have:
M_{S_n}(θ) = (M_X(θ / √(Var[X_1] n)))^n
           = (M_X(0) + (θ/√(Var[X_1] n)) M_X'(0) + (θ^2/(2 Var[X_1] n)) M_X''(0) + O(n^{−3/2}))^n
           = (1 + θ^2/(2n) + O(n^{−3/2}))^n
           → e^{θ^2/2}.
But this last object is exactly the MGF of the normal distribution! So, our MGF’s converge
to the right thing. The above calculation really does work; so why isn’t it a proof of the CLT?
We know that, if two random variables have the same MGF, they must have the same distribution. Here, we have only shown that the MGF’s of a sequence of random variables converge
to an MGF. To prove the CLT, we need to prove that convergence of the sequence of MGFs
implies convergence of the sequence of random variables. As we have seen already with
the LLN, the convergence of random variables can be a bit subtle, but everything turns out
to work in this case.
Remark 12.10. As you have probably seen, not every random variable has E[X^2] < ∞, and so the CLT doesn't apply to every random variable. It turns out that, for all 1/2 ≤ α ≤ 1, there are stable distributions F so that i.i.d. sequences {X_i}_{i∈N} ∼ F satisfy
n^{−α} ∑_{i=1}^n X_i ∼ X_1.
For α = 1/2, this stable distribution is the normal distribution. For α = 1, the stable distribution is called the Cauchy distribution; it has PDF
f_X(x) = 1/(π(x^2 + 1)).
The standard application of the CLT in statistics classes is as follows:
Example 74. I flip a fair coin 500 times, and record the number of heads X. What is
(approximately) P[X > 270]?
Denote by Z a standard normal random variable. Then
P[X > 270] = P[(X − 250)/√(500/4) > 20/√(500/4)]
           = P[(X − 250)/√(500/4) > 1.79]
           ≈ P[Z > 1.79] ≈ 0.037.
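The approximation can be compared to the exact binomial probability in R (my own check; the two answers differ slightly, partly because the plain CLT calculation ignores the continuity correction):
1 - pnorm(20 / sqrt(500/4))   # normal approximation, about 0.037
1 - pbinom(270, 500, 0.5)     # exact binomial probability, about 0.033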
13. Lecture 12: Midterm Exam Review (Feb. 26)
13.1. Summary.
• A few administrative details.
• Midterm is next class! Today is devoted to review.
13.2. Lecture. Today is devoted to review for the midterm next class, on March 2. We
don’t have time to go over every question that might show up. Instead, I will describe the
different types of questions that might show up. Midterm overview:
• The midterm covers chapters 1 to 8 of the textbook (i.e. everything we have covered
up to and including last class). There are no programming questions. It is closed
book, with a ‘cheat sheet.’ The cheat sheet is on my website.
• The midterm has 2 'long-answer' questions and 8 multiple choice questions. The long-answer questions are computational, are involved but not at all tricky, and focus on
the later parts of the course (e.g. chapters 6/7). The 8 multiple choice questions
cover the 8 chapters evenly. All are fairly short; most are straightforward, but a few
are ‘tricky.’
• My feeling is that (some of) the homework is more difficult than the textbook questions. The computational questions on the midterm are meant to be of similar difficulty to the textbook questions; the tricky questions require little work but might
require a first step that isn’t obvious.
• In terms of material covered: I suggest that you make sure that you are very comfortable with the solutions to all three homework sets and to the examples in this
lecture. I also suggest that you do several practice problems from the textbook and
check your answers. Doing these things should get you a decent mark. The last
few marks (e.g. 1-3 parts of questions) might be a little trickier, though again some
version of the tricks have been mentioned in class or in homework at some point.
• Some general study advice:
– Look over the cheat sheet, and make sure that it makes sense to you. There is
one typo for the expected value of Weibull distribution. I may not have time
to correct this before the exam, and will not be able to correct it if you are not
writing the exam in class.
– It is probably not worthwhile to memorize formulas related to distributions. It
is probably worthwhile to memorize formulas related to fundamental objects
(e.g. linearity of expectation, product formula for generating functions) and
conditional distributions (e.g. Bayes’ rule, law of total variance, etc). The
difference is that it is hard to think about conditional distributions unless you
largely have the formulas memorized.
– For the computational questions, doing (lots of) practice should be enough. The
one ‘funny’ thing that pops up would be what I call ‘backwards’ versions of
questions. For example, a ‘standard’ question is: X is Poisson with E[X] = 3;
what is P[X = 0]? The answer is to note that a Poisson random variable with
mean 3 has λ = 3, and so P[X = 0] = e−3 ≈ 0.05. A ‘backwards’ version of the
question might be: X is Poisson and P[X = 0] = 0.05; what is E[X]? In this
case, we write 0.05 = e−λ , and so λ ≈ 3, and so E[X] = λ ≈ 3. These questions
are both referring to the same random variable, and involve doing essentially the
same calculations. You just need to be aware that these questions are fair game.
– For the tricky questions, my main advice is to not be afraid! For most of these
questions, there is a calculation you would like to do but are unable to do. The
right choice is almost always to write down the calculation that you want to do,
and then make a simple observation. Here is a prototypical example, from Ch. 1.
Let {Bi }4i=1 be a partition of the sample space Ω, and assume that P[A|Bi ] = 10i .
Is it possible that P[A] ≥ 12 ?
P
We want to write P[A] = 4i=1 P[A|Bi ]P[Bi ], but we don’t know P[Bi ]. Fortunately, we don’t need it to answer the question - we are averaging a bunch of
numbers that are less than 12 , so the average must be less than 12 . This principle
should get you through all of the tricky questions!
– There may be 1-2 questions that are really checking to see if you know a definition
(e.g. what a mixture of distributions is), an idea we’ve been using (e.g. the
diagrams for phase-type distributions), or a technique (e.g. using a Gaussian or
Poisson approximation). These are, of course, things that aren’t on the cheat
sheet.
Now we’ll do practice problems from each chapter. Since you all probably know the first
few chapters well, I start at the end:
13.2.1. Chapter 8. There are really only three types of questions here:
• Plug something into Markov's inequality. E.g. E[X^2] = 12; is it possible that P[X > 200] > 0.5? No, since P[|X| > 200] = P[X^2 ≥ 40,000] ≤ 12/40,000 < 0.5.
• Do you understand types of convergence?
• Use the normal approximation suggested by the central limit theorem; then do a
‘chapter 7’-type question.
13.2.2. Chapter 7. There are a small number of significant question types:
• We give you a parameter and ask you to calculate an expectation or probability. Eg.
X has exponential distribution with mean 2; calculate E[X 3 ] and P[X < 1]. These
questions basically involve plugging into a formula; you need to practice, but there
isn’t much to say.
• A backwards version of this question: we give you an expectation or a probability
and ask you to calculate a parameter. This is only possible either for one-parameter
families of distribution, or if we give some of the parameters. E.g. X has normal
distribution with mean 0, and P[−9.1 ≤ X ≤ 9.1] = 0.95. What is V ar[X]? To
answer this, look up in a table that P[−1.96 ≤ Z ≤ 1.96] = 0.95 for the standard normal distribution; thus, 9.1/√(Var[X]) = 1.96, so Var[X] = (9.1/1.96)^2 ≈ 21.6. For other distributions, you might have
a formula for the probability in the setup of the question rather than using a table
lookup.
• Compound questions, for which one essentially must do two of these questions together.
• Qualitative questions. For example, can a hypoexponential distribution ever have
larger variance than an exponential distribution with the same mean? To answer this
question, let X be hypoexponential with means λ1 , . . . , λr and let Y be exponential
with mean λ = ∑_{i=1}^r λ_i (so the means agree). Then
Var[Y] = λ^2 = E[Y]^2,
Var[X] = ∑_{i=1}^r λ_i^2 < (∑_{i=1}^r λ_i)^2 = E[X]^2,
so the answer is no.
NOTE: I consider this to be a fairly tricky question!
The first of these question types is the most common. For these questions, the hardest
part is often recognizing what distribution to use. For this reason, I suggest reading over
many questions, even those that you will not complete.
Example 75. X has exponential distribution with mean 1. What is P[X > 4]?
P[X > 4] = e−4 ≈ 0.018.
It is worthwhile to pay special attention to calculations involving the phase-type distributions. They are new for most of you, the calculations are more difficult, and they are very
closely related to what we will study later in the term. These three things also mean that
they are likely to show up on exams! You are also expected to be familiar with the way
we have been representing phase-type distributions, including both the diagrams and the
various matrices and vectors. Let’s look at a ‘typical’ question:
Example 76. When people call a helpline at a bank, they must wait an initial amount of time
that is exponential with mean 10 minutes to get through the initial menu. Twenty percent of
callers must then talk to a representative; this takes an amount of time that is exponential
with mean 40 minutes. Ten percent of callers who talked to a representative must then talk
to a manager; this takes an amount of time that is exponential with mean 80 minutes. Let S
be the total amount of time taken. Draw the phase-type diagram associated with S and also
calculate E[S].
Let X1 , X2 and X3 be exponential distributions with means 10, 40 and 80 respectively. Let
Y2 and Y3 be Bernoulli random variables with means 0.2 and 0.1 respectively. To calculate
E[S], we have
E[S] = E[X1 + Y2 X2 + Y2 Y3 X3 ]
= E[X1 ] + E[X2 ]E[Y2 ] + E[Y2 ]E[Y3 ]E[X3 ]
= 10 + (0.2)(40) + (0.2)(0.1)(80) = 19.6.
Diagram goes here in class. NOTE: Calculating variance is also fair game! I omit it only
for time.
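A Monte Carlo sketch of this example (my own, not something you would be asked to produce on an exam) both checks E[S] and gives the variance I skipped:
set.seed(1)
n  <- 100000
x1 <- rexp(n, rate = 1/10)
y2 <- rbinom(n, 1, 0.2); x2 <- rexp(n, rate = 1/40)
y3 <- rbinom(n, 1, 0.1); x3 <- rexp(n, rate = 1/80)
s  <- x1 + y2 * x2 + y2 * y3 * x3
mean(s)   # close to 19.6
var(s)    # the variance calculation I omitted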
13.3. Chapter 6. Although there are many distributions here, the questions associated with
chapter 6 are almost identical to those associated with chapter 7. My only tips are:
• Try to remember which is which among binomial, negative binomial, hypergeometric,
and Poisson.
• The Poisson approximation is introduced in chapter 6.
Example 77. X has Poisson distribution, and P[X = 0] = 0.1. What is E[X]?
Looking at the PMF, we have
0.1 = P[X = 0] = e−λ ,
so λ ≈ 2.3. Thus, E[X] ≈ 2.3.
13.3.1. Chapter 5. There are a surprisingly large number of reasonable questions for chapter
5. Some straightforward questions:
• A distribution and a function are given, and you are asked to calculate the expectation
or conditional expectation. This is ‘just’ calculus; it might be difficult, and you should
practice, but there is no probability content.
• As above, but with a generating function.
Some harder computational questions involve conditioning. You should review question 3 in
homework set 2 for a simple version of this. Here is another:
Example 78. Say N is exponential with mean 5, and X is binomial with p = 1/2 and n = N. What is Var[X]?
Use the formula
Var[X] = E[Var[X|N]] + Var[E[X|N]]
       = E[N/4] + Var[N/2]
       = 5/4 + 25/4.
This chapter also involves some questions that aren’t very computational:
• Proving relationships using generating functions. See question 2 of homework 2.
• Proving inequalities. Sometimes these are very simple (e.g. min(X) ≤ E[X] ≤
max(X), or V ar[X] ≥ E[V ar[X|Y ]]). Sometimes they require more thought (e.g.
question 1 of homework 2). Any on an exam will be straightforward - so, if you find
yourself doing a long calculation, know that there is a way to avoid it!
• Reading coefficients off of generating functions (see homework set 3).
13.3.2. Chapter 4. It is possible to ask some very difficult questions based on material from
chapter 4. However, the basic questions involve giving you a few pieces of information and
asking you to use some of the very many formulas that show up in this chapter. The following
is about as difficult as such a ‘standard’ question gets:
Example 79. f_{X,Y}(x, y) = (2x + 5y)/100, 0 ≤ x ≤ 5, 0 ≤ y ≤ 2. Find f_{X|Y}(x|y).
To do this question, the obvious first step is to write:
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).
You don't know f_Y(y), so you calculate it:
f_Y(y) = ∫_0^5 (2x + 5y)/100 dx = (1/100)(25 + 25y) = (1 + y)/4.
Plugging in,
f_{X|Y}(x|y) = (2x + 5y)/(25(1 + y)).
There are really three things to make sure you know how to do for questions in this chapter:
• Be familiar with the formulas (so that, e.g., the previous example is second nature
to you).
• Be familiar with calculus (I won’t ask anything mean here).
• Be able to read ‘word problems’ and recognize distributions. These will often be
either distributions from chapter 6/7 or counting problems from chapter 2.
Any ‘tricky’ question in chapter 4 will be of the type discussed at the start of class - some
piece of information needed to complete a calculation will be missing, but enough will be
there that you can answer the question.
13.3.3. Chapter 3. There are very few questions related to chapter 3:
• Standard definition questions (e.g. a CDF is given to you in chart form, and you are asked to find P[X ≤ 12]). Include such an example in class.
• Tricky definition questions (e.g. recognizing that ∑_x p_X(x) = 1, or that 0 ≤ ∑_{x∈A} p_X(x) ≤ 1). For example, we might say: P[A] = 0.7, P[B] = 0.5. Is it possible that A, B are mutually exclusive?
• Questions that are ‘really’ chapter-1 questions (e.g. using Bayes’ rule).
13.3.4. Chapter 2. The biggest difficulty in chapter 2 is recognizing what situation you are
in! The following example is one that many people find hard:
Example 80. A box contains 5 white balls and 3 black balls. Two different balls are chosen
at random. What is the probability that they are the same colour?
P[same colour] = P[WW] + P[BB]
               = [C(5,2) + C(3,2)] / C(8,2)
               = 13/28 ≈ 0.464.
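In R, the counting can be checked directly with choose (my own aside):
(choose(5, 2) + choose(3, 2)) / choose(8, 2)   # 13/28, about 0.464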
13.3.5. Chapter 1. The questions here are essentially the same as those in chapter 3. Here
is a typical question:
Example 81. Let X1 , X2 , X3 be independent Bernoulli random variables with success probabilities 0.9, 0.8, 0.5. Let X = X1 + X2 + X3 and Y = (X1 , X2 , X3 ). What is P[X = 1]?
P[X = 1] = P[{Y = (1, 0, 0)} ∪ {Y = (0, 1, 0)} ∪ {Y = (0, 0, 1)}]
= P[X1 = 1, X2 = 0, X3 = 0] + P[X1 = 0, X2 = 1, X3 = 0] + P[X1 = 0, X2 = 0, X3 = 1]
= P[X1 = 1]P[X2 = 0]P[X3 = 0] + P[X1 = 0]P[X2 = 1]P[X3 = 0] + P[X1 = 0]P[X2 = 0]P[X3 = 1]
= (0.9)(0.2)(0.5) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.5) = 0.14.
Note that we are following a fairly standard machine here, one which you follow for essentially all ‘standard’ questions. The machine looks like:
(1) Write the event you’re interested in as a complicated intersection/union of other
events.
(2) Get rid of conditional probabilities via P[A|B] = P[A ∩ B]/P[B], and more complicated ones via Bayes' rule.
(3) Get rid of unions via P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
(4) Get rid of intersections by independence or mutual exclusion assumptions.
(5) If everything is elementary, you’re done! If you have intersections of events that
aren’t independent or mutually exclusive, go back to step 1.
Another typical question is:
Example 82. Box 1 has a fair die in it, and box 2 has a die with 6’s printed on all faces.
A die is picked at random from one of the boxes and rolled; it lands with a 6 face up. What
is the probability that it was picked from the second box?
P[B = 2|F = 6] = P[F = 6|B = 2]P[B = 2] / (P[F = 6|B = 2]P[B = 2] + P[F = 6|B = 1]P[B = 1])
               = (1/2) / (1/2 + 1/12)
               = 6/7 ≈ 0.86.
Tricky questions involve being unable to use this machine. I like the following example
from the textbook:
Example 83. We have
P[A|B] = P[B] = P[A ∪ B] = 1/2.
Are A, B independent?
This looks scary - it isn’t clear what the first step should be. It is possible to answer this
question by thinking ahead and deciding what you want to calculate, but that might be too
much to expect on an exam. Let’s see if we can come up with an algorithm for this type of
question. Like all tricky questions, we first write down what we want:
P[A ∩ B] =? P[A]P[B].
One of these terms, P[B] = 1/2, we already know. Let's look at the others. First,
P[A ∩ B] = ???
We don’t have many formulas here. Some prospects are:
P[A ∩ B] = P[A] + P[B] − P[A ∪ B]
P[A ∩ B] = P[B|A]P[A]
P[A ∩ B] = P[A|B]P[B].
The first two look hopeless, but the last looks good! We have
P[A ∩ B] = P[A|B]P[B] = 1/4.
Finally, we try to calculate P[A]. I only know two formulas with P[A] by itself:
P[B|A] = P[A ∩ B]/P[A]
P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
The first we can't use, but the second involves only terms we've calculated. We find:
1/2 = P[A] + 1/2 − 1/4,
so
P[A] = 1/4.
Thus, we have:
P[A ∩ B] = 1/4
P[A]P[B] = (1/4)(1/2) = 1/8 ≠ P[A ∩ B],
so A and B are not independent.
14. Lecture 13: Midterm Exam (Mar. 2)
Good luck!
15. Lecture 14: Start of Chapter 9: Markov Chains (Mar. 5)
15.1. Summary.
• A few administrative details.
• Brief midterm recap.
• We start studying Markov chains.
15.2. Lecture.
15.2.1. Midterm Recap. The midterm was graded out of 36. The mean was 25.27, the median
was 29. In terms of individual questions, the number of people getting each multiple-choice
question wrong was:
1: 6, 2: 16, 3: 10, 4: 7, 5: 13, 6: 5, 7: 9, 8: 6.
This struck me as reasonably uniform. Question 2 seemed to be the most difficult. This
was expected (although a nearly-identical question was in Homework 3). Question 5 was
the next-most difficult; it was intended to be less straightforward. All other questions were,
basically, done well.
The midterm basically worked as I had hoped: roughly 75 percent of the questions were
straightforward and roughly 25 percent were somewhat tricky. The result was an exam
that most students did fairly well on. Since it worked fairly well, I intend to have a similar
structure for the final exam and to run review in a fairly similar way. If that worked for you:
great! If not, I hope to hear from you on what we can do differently.
15.2.2. Introduction. Today, we begin studying Markov chains, one of the main new subjects
in this course. The first 2-3 weeks of this will be fairly abstract, so we take a moment to say
what they are about.
Concretely, most Markov chains we encounter will look like the drawings of phase-type
distributions we’ve been making. There will be a bunch of states with arrows between them,
and we imagine following the arrows in some random way. Rather than merely studying the
exit time, however, we’ll study many other properties of this process and consider variants
and extensions of this idea.
In general, Markov chains describe a system that evolves over time, with the important
‘memoryless’ property that the next state depends only on the current state, never the distant
past. They are described mathematically by square matrices with nonnegative entries (or
sequences of matrices, or matrix-like objects), and there is a lot of overlap between the
theory of Markov chains, applied linear algebra, and partial differential equations. Quite
often, the ‘same’ object will be interpreted through all of these lenses. We give a few places
that Markov chains pop up in the sciences and computation:
(1) Markov chains as models.
• Statistical physics and material science: we would like to model how molecules
bounce around in a box, or otherwise interact. The states of the Markov chain
include all information about all of the particles being tracked - for example,
a list of their coordinates. The ‘randomness’ here corresponds to details of the
system that aren’t included in the model - e.g. the exact orientation of molecules.
I don’t think it is obvious a priori why these models should have the memoryless
property, but it turns out to be a good idea. These models are extremely popular
and can be used to make detailed calculations.
• Queues: Consider the number of people in a queue - for example, waiting on the
phone for technical support. In many situations, the evolution of this number
turns out to be quite closely modeled by a Markov chain.
• Language processing: the sequences of characters, or pairs of characters, or
words, etc. can be modeled as a Markov chain. This idea, and sophisticated
variants, have been very useful ways to understand speech and text. Unlike the
previous examples, these models are ludicrous - it is easy to tell apart real text
and a simple Markov chain model. Nonetheless, as you will see in homework,
the simple models are very useful. They are also easier to work with than many
otherwise-comparable models.
(2) Markov chains as computational tools.
• The most famous use of Markov chains in computation is as a way to combine
rankings into one global ranking. This is the basis of the famous PageRank
algorithm developed by/for Google. The idea is to build a Markov chain out
of linked websites, and analyze the chain’s behaviour. The same approach has
been used successfully in many other settings.
• In many situations, it turns out to be much easier to design a Markov chain to
sample from a distribution than it is to sample from the distribution directly.
The prototypical example here is again statistical physics: the easiest way to
simulate the ‘typical’ arrangement of a bunch of molecules is just to simulate
the system for a very long time. The same idea turns out to be useful in many
surprising places, and Markov chains are an important tool for doing simulations.
We now get started with a sequence of definitions:
15.2.3. Definitions.
Definition 15.1 (Stochastic Process). Fix some set T , known as an index set. A stochastic
process is a collection of random variables {Xt }t∈T . If T ⊂ N, {Xt }t∈T is known as a discrete-time stochastic process. If T = (a, b) ⊂ R, it is known as a continuous-time stochastic
process.
The values (or range) of the random variables are often called states. NOTE: Unlike
previous random variables, we may not identify states with elements of R. That is fine!
Definition 15.2 (Stationary Process). A stochastic process {Xt}t∈T is stationary if, for all
n ∈ N, all x1, . . . , xn ∈ R, all t1, . . . , tn ∈ T, and all α ∈ R so that t1 + α, . . . , tn + α ∈ T, we have
P[Xt1 ≤ x1, . . . , Xtn ≤ xn] = P[Xt1+α ≤ x1, . . . , Xtn+α ≤ xn].
Stationary processes ‘look the same’ at all time points. This doesn’t mean that they are
iid:
Example 84. Let X0 be Bernoulli with success probability 1/2. Then set
Xt+1 = 1 − Xt .
We have that {Xt }t∈N is stationary, but there is only one ‘bit’ of randomness.
Remark 15.3. Despite what the book claims here, we are certainly interested in nonstationary processes.
We don’t yet have the notation to describe homogeneous and inhomogeneous stochastic
processes. Roughly, the former correspond to processes that evolve according to a fixed
diagram, the latter according to a different diagram at every time.
Definition 15.4 (Discrete-Time Markov Chain (DTMC)). A DTMC is a stochastic process
{Xt }t∈T with index set T = N that satisfies
P[Xt+1 = xt+1 |Xt = xt , . . . , X1 = x1 ] = P[Xt+1 = xt+1 |Xt = xt ]
for all t ∈ N and all x1 , . . . , xt+1 .
We will generally assume that our Markov chains have domains that are subsets of N and
write:
Definition 15.5 (Transition Matrix). Define
pij (n) = P[Xn+1 = j|Xn = i]
to be the single-step transition probabilities of a Markov chain {Xt }t∈N . Denote by P (n) the
matrix with entries pij (n), and call this the transition probability matrix. NOTE: This may
not really be a ‘square matrix,’ as we allow it to have infinitely many rows and columns. We
will spend quite a bit of time thinking about this sort of infinity.
We notice that:
• For all i, j, n, 0 ≤ pij(n) ≤ 1.
• For all i, n, Σj pij(n) = 1.
Any matrix with these properties is called a Markov matrix or transition matrix, and
defines the transition probabilities of some Markov chain.
Definition 15.6 (Time-Homogeneous Markov Chain). A Markov chain is time-homogeneous
if its transition matrix P (n) does not depend on n. Otherwise, it is called time-inhomogeneous.
Example 85 (Simple Markov Chain Calculations). Consider the Markov chain with transition matrix
P =
  0.7 0.2 0.1
  0.5 0.1 0.4
  0   0.5 0.5
The associated picture is the usual transition diagram (three states, with one arrow for each nonzero pij); it is not reproduced in these notes.
Let’s consider a Markov chain {Xt }t∈N started at X1 = 1, evolving according to the transition matrix P . We can read off immediately:
P[X2 = 1|X1 = 1] = P [1, 1] = 0.7.
We can also easily calculate the probabilities of any sample path as follows:
P[(X1 , X2 , X3 , X4 ) = (1, 2, 3, 3)]
= P[X4 = 3|(X1 , X2 , X3 ) = (1, 2, 3)]P[X3 = 3|(X1 , X2 ) = (1, 2)]P[X2 = 2|X1 = 1]
= P[X4 = 3|X3 = 3]P[X3 = 3|X2 = 2]P[X2 = 2|X1 = 1]
= P [3, 3]P [2, 3]P [1, 2] = (0.5)(0.4)(0.2) = 0.04.
Let's go further. What about just calculating P[X3 = 1]? We make the following calculation,
which will be ubiquitous:
P[X3 = 1] = P[X3 = 1, X2 = 1|X1 = 1] + P[X3 = 1, X2 = 2|X1 = 1]
+ P[X3 = 1, X2 = 3|X1 = 1]
= P[X3 = 1|X2 = 1, X1 = 1]P[X2 = 1|X1 = 1]
+ P[X3 = 1|X2 = 2, X1 = 1]P[X2 = 2|X1 = 1]
+ P[X3 = 1|X2 = 3, X1 = 1]P[X2 = 3|X1 = 1]
= P[X3 = 1|X2 = 1]P[X2 = 1|X1 = 1] + P[X3 = 1|X2 = 2]P[X2 = 2|X1 = 1]
+ P[X3 = 1|X2 = 3]P[X2 = 3|X1 = 1]
= P [1, 1]P [1, 1] + P [2, 1]P [1, 2] + P [3, 1]P [1, 3]
= (0.7)(0.7) + (0.2)(0.5) + (0)(0.1) = 0.59.
In general, we will often find ourselves writing, for s < t < u:
P[Xu = i|Xs = j] = Σk P[Xu = i|Xt = k] P[Xt = k|Xs = j].
What about P[X4 = 1] or P[X5 = 1]? We could obviously continue as above, working out
these probabilities one index at a time. However, we will see that we can be much more
efficient.
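As a quick check in R (my own aside; note that I enter P row-by-row with byrow = TRUE so the matrix matches the display above):
P <- matrix(c(0.7, 0.2, 0.1,
              0.5, 0.1, 0.4,
              0,   0.5, 0.5), nrow = 3, byrow = TRUE)
P2 <- P %*% P
P2[1, 1]            # P[X3 = 1 | X1 = 1] = 0.59, as computed above
(P2 %*% P2)[1, 1]   # P[X5 = 1 | X1 = 1], without doing it one index at a time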
We won’t see nearly as many inhomogeneous Markov chains, but here is one:
Example 86. Define
P(n) =
  1 − 1/n   1/n
  1/n       1 − 1/n
Assume X1 = 1. Then
P[(X1, X2, X3, X4, X5) = (1, 2, 1, 1, 1)] = p12(1) p21(2) p11(3) p11(4) = (1)(1/2)(2/3)(3/4) = 1/4.
Example 87 (Finite-Memory Processes). Let {Xt }t∈N be a stochastic process with range Ω.
Assume that, for some k ∈ N, it satisfies:
P[Xt+1 = xt+1 |Xt = xt , . . . , X1 = x1 ] = P[Xt+1 = xt+1 |Xt = xt , . . . , Xt−k+1 = xt−k+1 ].
If k = 1, this would be a Markov chain. For k > 1, it is not quite a Markov chain. These
processes are often called k-dependent processes, and they can easily be converted into Markov
chains in the following way. Define:
Yt = (Xt , Xt+1 , . . . , Xt+k−1 ).
Then {Yt }t∈N really is a Markov chain. This is a very general trick, and it allows us to use
our Markov chain techniques to study things that don’t (initially) look like Markov chains.
Note: this is a simple but important trick, and it is easy to put on an exam.
15.2.4. Sojourn Times and Embedded Chains. So far, our Markov chains have been discrete-time, even though the phase-type distributions involved waiting times that were continuous.
We give some definitions that will be used later on to bridge some of that gap:
Definition 15.7 (Sojourn Time). Consider a Markov chain with Xt = i, Xt−1 ≠ i. Then
define the sojourn time or holding time of state i to be the random variable:
R = R(i, t) ≡ max{s ∈ N : Xt = Xt+1 = · · · = Xt+s−1 = i}.
We note that, if Xt is time-homogeneous, we can write
P[R = s] = P[Xt = i, . . . , Xt+s−1 = i; Xt+s ≠ i] = pii^{s−1} (1 − pii).
This is the PDF of a geometric random variable with parameter p = pii.
And then:
Definition 15.8 (Embedded Chain). Let {Xt }t∈N be a time-homogeneous DTMC. Then
define τ0 = 0, and
τs+1 = min{t > τs : Xt ≠ Xτs}.
Finally, let
Y s = X τs .
This is the embedded chain for {Xt }t∈N . Some observations:
• If pii < 1 for all i, this definition makes sense. Otherwise, the embedded chain might
not be defined for all indices.
• If the embedded chain is defined, it is a Markov chain. Actually, we can do better.
Let P be the transition matrix for Xt . Then the transition matrix for Yt is given by:
p̂ii = 0
p̂ij = pij / (1 − pii) for j ≠ i.
16. Lecture 15: More Chapter 9: Linear Algebra Review and
Classification of States (Mar. 9)
16.1. Summary.
• A few administrative details.
• We continue building the theory of Markov chains.
16.2. Lecture. Recall that a time-homogeneous discrete-time Markov chain is a stochastic process {Xt }t≥0 , whose law is determined completely by the distribution of X0 and the
transition ‘matrix’ P . Recall that the embedded chain has transition matrix Pˆ given by:
p̂ij = pij / (1 − pii)
p̂ii = 0.
Remark 16.1. Implicitly, we will generally talk about Markov chains that are discrete-time
and which have a finite number of states {1, 2, . . . , n}. The textbook refers to these as finite
Markov chains.
Example 88 (Simple Markov Chain Calculations, Redux). Recall our previous Markov chain:
P =
  0.7 0.2 0.1
  0.5 0.1 0.4
  0   0.5 0.5
The transition matrix for the embedded chain is:
P̂ =
  0   2/3 1/3
  5/9 0   4/9
  0   1   0
We can do all of our favorite calculations here.
16.2.1. Stationary Distributions and the Chapman-Kolmogorov Equations. So far, we can
calculate the probability that a Markov chain follows a given path for a short period of time,
and also calculate the probability that it is in a specific state at a specific time. However, for
most of our applications, we are interested in the long-run behaviour of the Markov chain.
For example, we might want to know:
• How much time, on average, does the chain spend in state i?
• Does the chain return to state i infinitely often?
• Does the chain ever get stuck in some state?
These questions are hard to answer based on the types of calculations that we know how to
do, and so we begin to develop a theory to answer them.
Let P be the transition matrix for a time-homogeneous DTMC Xt. We recall that
P[X3 = i|X1 = j] = Σk P[X3 = i|X2 = k, X1 = j] P[X2 = k|X1 = j]
= Σk P[X3 = i|X2 = k] P[X2 = k|X1 = j]
= Σk pki pjk.
But this formula looks pretty familiar. It is the j, i entry of P^2! By induction on t, it is easy
to check that
P[Xt = i|X1 = j] = ej† P^{t−1} ei,    (16.1)
where ej is the vector with a ‘1’ in entry j and a ‘0’ everywhere else. We will often write
pij^(m) = ei† P^m ej
for the m-step transition probabilities of a Markov chain. We note that the associated matrix,
P^m, really is the transition matrix for a Markov chain. Indeed, it is the transition matrix
for the Markov chain Yt ≡ Xmt.
Equation (16.1) tells us that
limt→∞ P[Xt = i] = limt→∞ ej† P^{t−1} ei.    (16.2)
What does this look like? We can calculate P t directly, but it is a bit of a nightmare for
large t. Instead, we’ll review the way that you were taught to calculate powers of matrices in
linear algebra class, and then we’ll notice that we don’t quite need to calculate everything.
Example 89 (Review of Powers). We consider the matrix
P =
  0.7 0.2 0.1
  0.5 0.1 0.4
  0   0.5 0.5
and try to calculate limt→∞ P t if that exists. We have a machine for doing this calculation:
(1) First, we find the eigenvectors and eigenvalues of P . The usual way to do this is:
(a) Calculate the characteristic polynomial of P , given by f (λ) = det(P − λId).
(b) Find the roots λ1 , λ2 , λ3 of f .
(c) Find vi satisfying the equation P vi = λi vi for 1 ≤ i ≤ 3. This step changes a
little if there is multiplicity in the roots of f , but we ignore that for now.
For now, we just use R. Recall that our commands are:
P = matrix(c(0.7,0.2,0.1,0.5,0.1,0.4,0,0.5,0.5), nrow = 3)
eigen(P)
R will spit out:
$values
[1] 1.0000000 0.5405125 -0.2405125
$vectors
           [,1]       [,2]       [,3]
[1,] -0.7407611 -0.7992229  0.4337809
[2,] -0.4444566  0.2549321 -0.8159527
[3,] -0.5037175  0.5442907  0.3821718
This means that the eigenvalues are λ1 = 1,λ2 ≈ 0.5405 and λ3 ≈ −0.24. The
eigenvectors are v1 = (0.741, 0.444, 0.504) and so on.
(2) Next, we write
P = Q D Q^{−1},
where Q = [v1 v2 v3] and
D =
  λ1 0  0
  0  λ2 0
  0  0  λ3
(3) Our formula for P now has a magical property, which was the whole reason for doing
this:
P^t = (Q D Q^{−1})^t = Q D^t Q^{−1}.
Unlike P, which is complicated, it is easy to take powers of D. They are:
D^t =
  λ1^t 0    0
  0    λ2^t 0
  0    0    λ3^t
(4) Finally, we take limits:
limt→∞ P^t = Q (limt→∞ D^t) Q^{−1}.
But these limits are either 0 or 1 if they exist! In our case, we have
D =
  1 0    0
  0 0.54 0
  0 0    −0.24
so
limt→∞ D^t =
  1 0 0
  0 0 0
  0 0 0
We will see that this isn’t an accident - any ‘reasonable’ Markov chain will have
this as the limiting matrix, even in high dimensions.
(5) We conclude:
limt→∞ P[Xt = i] = limt→∞ ej† P^{t−1} ei = v1[i] / Σj v1[j].
In this case,
limt→∞ P[Xt = 1] = 0.439
limt→∞ P[Xt = 2] = 0.263
limt→∞ P[Xt = 3] = 0.298.
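As a brute-force sanity check (my own aside, not part of the eigenvalue machine), we can simply raise P to a large power in R and watch every row converge to the same limiting distribution:
P <- matrix(c(0.7, 0.2, 0.1,
              0.5, 0.1, 0.4,
              0,   0.5, 0.5), nrow = 3, byrow = TRUE)
Pt <- diag(3)
for (t in 1:200) Pt <- Pt %*% P   # computes P^200 by repeated multiplication
round(Pt, 3)                      # every row is approximately (0.439, 0.263, 0.298)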
So, we have seen:
• It is straightforward to calculate the probability P[(X1 = x1 , . . . , Xt = xt )] of a path.
• It is annoying, but possible, to calculate the probability P[Xt = x] directly for t small.
• For t large, we can use linear algebra to calculate the probability P[Xt = x].
16.2.2. Chapman-Kolmogorov Equations. We recall:
Definition 16.2 (m-step transition matrix). For a time-homogeneous DTMC, define
pij^(m) = P[Xn+m = j|Xn = i].
We denote the associated matrix by P^(m) = P^m. This is analogous to the usual single-step
transition matrix pij = pij^(1).
We then have the following expression, called the Chapman-Kolmogorov Equations:
pij^(m) = Σk pik^(ℓ) pkj^(m−ℓ).
This holds for all 0 < ℓ < m. In matrix form, this looks much nicer:
P^m = P^ℓ P^{m−ℓ}.
We can use these formulae to calculate transition probabilities. Assume that our Markov
chain starts out according to the distribution π = π (1) . That is, we write
P[X1 = i] = π (1) (i).
Then define
π (m) (i) = P[Xm = i].
We then have, for time-homogeneous DTMC’s, that
π (m) = π (1) P m−1 .
For time-inhomogeneous DTMC’s, we have the slightly uglier formula:
π (m) = π (1) P (1)P (2) . . . P (m − 1).
We then have, as before,
limn→∞ π^(n) = π^(1) limn→∞ P^n.
Remark 16.3. This limit may not exist! We’ve already seen a few hints as to when it exists
- it depends on the eigenvalues and eigenvectors of P . We’ve also seen that, at least in one
example, it actually doesn’t depend on π (1) . This should make some sense after it has been
pointed out: you are jumping around a plot, and eventually you ‘forget’ where you started.
We’ll make all of this more precise soon.
Remark 16.4. The book stops keeping track of time-inhomogeneous Markov chains about
here, restricting their attention to time-homogeneous Markov chains for the most part. I’ll
do the same, and we’ll assume that chains are time-homogeneous unless otherwise stated.
16.2.3. Classification of States. The Markov chains we saw when studying phase-type distributions all had the following property: we started on the left-hand side of the graph, bounced
around for a while, and then entered a final graveyard state and stayed there forever. We’ll
introduce some vocabulary based on this, and discuss the different types of behaviours that
Markov chains can have. First, a picture (a transition diagram on six states, not reproduced in these notes):
We make some simple observations about a Markov chain starting in state 1:
(1) The Markov chain will leave state 1 immediately. States that can only be occupied
for a fixed, finite number of steps are called ephemeral. Note: this word is not used
very often.
(2) The Markov chain will bounce between states 2 and 3 for some period of time, and
there is no deterministic upper bound on the number of steps that the chain can spend
in states 2 and 3. However, it will eventually leave them, and will never come back.
States that can be occupied for any number of steps, but which are occupied in total
for only finitely many steps, are called transient. Note: any ephemeral state is also
transient, but the reverse is not true.
(3) The Markov chain may eventually reach states 4 and 5. If it does, it will bounce
between them forever, reaching both infinitely many times. States that may be reached
infinitely many times are called recurrent. Furthermore, the time in between visits to
state 4 is always exactly 2. States for which these return times are always (multiples of)
some number k ≠ 1 are called periodic with period k. More generally, recurrent
states with return times that have finite mean are called positive recurrent; other
recurrent states are called null recurrent. Note: since our stochastic process is a
time-homogeneous Markov chain, the return time I've just described is an honest-to-goodness random variable with a specific distribution. At some point, you should make
sure that you understand why this is true, though we will also discuss it more soon.
Note: in Markov chains with finite state spaces, all recurrent chains are positive
recurrent. For infinite Markov chains, the two ideas may not coincide.
(4) If the Markov chain ever reaches state 6, it stays there forever. Such states are called
absorbing states. This is the easiest state to understand: a state i is absorbing if and
only if pii = 1.
All of these definitions are really ways to understand return times, so let us begin by making
that notion more precise:
Definition 16.5 (Return Time). Assume X0 = i and define the return time τ by
τ = min{t > 0 : Xt = i}.
Following the textbook, we define
fii^(n) = P[τ = n].
Note: In general, we have
P[τ = n] = fii^(n) ≤ pii^(n) = P[Xn = i].
The RHS is the probability that Xn = i; the LHS is the probability that Xn = i and Xs ≠ i
for all 0 < s < n.
We now try to calculate these return time probabilities, using similar tricks to our computation of pij^(n) with the Chapman-Kolmogorov equations. We break up the probability pii^(k) in
terms of the first return time to i. This looks like:
pii^(0) = 1
pii^(1) = fii^(1) = fii^(1) pii^(0)
pii^(2) = fii^(2) + fii^(1) pii^(1)
pii^(3) = fii^(3) + fii^(2) pii^(1) + fii^(1) pii^(2),
and in general
pii^(k) = Σ_{j=1}^{k} fii^(j) pii^(k−j).
We can use all of these equations together to compute
fii^(k) = pii^(k) − Σ_{j=1}^{k−1} fii^(j) pii^(k−j).
We then define
fii = Σ_{k=1}^{∞} fii^(k).
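These recursions are easy to run numerically. The following R sketch (my own illustration, applied to state 1 of the chain from Example 85) computes the f11^(k) from the p11^(k) and checks that they sum to (nearly) 1:
P <- matrix(c(0.7, 0.2, 0.1,
              0.5, 0.1, 0.4,
              0,   0.5, 0.5), nrow = 3, byrow = TRUE)
K <- 50
p11 <- numeric(K); Pt <- diag(3)
for (k in 1:K) { Pt <- Pt %*% P; p11[k] <- Pt[1, 1] }   # p11[k] = p_11^(k)
f11 <- numeric(K)
for (k in 1:K) {
  conv <- if (k > 1) sum(f11[1:(k - 1)] * p11[(k - 1):1]) else 0
  f11[k] <- p11[k] - conv    # f_11^(k) = p_11^(k) - sum over j < k of f_11^(j) p_11^(k-j)
}
sum(f11)   # close to 1: state 1 is recurrent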
Remark 16.6. Warning! pii^(k) and fii^(k) are very similar, but pii and fii are very different!
Definition 16.7 (Recurrent State). A state i is recurrent if fii = 1.
Remark 16.8. Recall that
fii = Σ_{k=1}^{∞} fii^(k) = Σ_{k=1}^{∞} P[τ = k] = P[τ < ∞].
Thus, i is recurrent if and only if the probability of return to that state is one.
We now relate the condition fii = 1 to other functions we understand. We note that, if
P[τ < ∞] = 1, we must have that in fact Xt returns to i infinitely often.
Remark 16.9. This is a special property of time-homogeneous Markov chains! The idea is
that, whenever the chain returns to i, you imagine that you are starting the chain all over
again. This ‘restarting,’ or ‘memoryless,’ property is exactly what makes Markov chains
special.
Thus, we have
E[Σ_{t=0}^{∞} 1{Xt=i}] = E[|{t : Xt = i}|] = ∞.
But we can rewrite this!
∞ = E[Σ_{t=0}^{∞} 1{Xt=i}] = Σ_{t=0}^{∞} pii^(t).
We have:
Theorem 90. If a state i is recurrent, then Σ_{t=0}^{∞} pii^(t) = ∞.
We have the related definition:
Definition 16.10 (Transient State). A state i is transient if it is not recurrent.
We note that the pdf of the number of returns to i that a Markov chain ever has is
f(n) = (1 − fii) fii^{n−1}.
This is exactly a geometric distribution with mean 1/(1 − fii). In particular, when i is transient (so fii < 1), the mean number of
returns is finite! This lets us upgrade our previous theorem:
Theorem 91. A state i is recurrent if and only if Σ_{t=0}^{∞} pii^(t) = ∞.
Remark 16.11. We now have two different but equivalent ways of checking if a state is
recurrent.
17. Lecture 16: More Chapter 9: Classification and Special Matrices.
(Mar. 12)
17.1. Summary.
• A few administrative details.
• More on Markov chains
17.2. Lecture.
17.2.1. General note: I will not spend quite as much time repeating notation associated with
Markov chains as I have been. Please ask if there are any questions as to whether a theorem
applies, or what a piece of notation means. In general, however, there is quite a bit of
notation associated with Markov chains and you will have to memorize most of it for the
lectures (and questions) to make sense.
17.2.2. More classification. Last class, we discussed recurrent and transient states. We now
introduce a slightly strange distinction between types of recurrent states. Define:
Definition 17.1 (Mean Recurrence Time). For a recurrent state i, the mean recurrence
time is
Mii ≡ Σ_{n=1}^{∞} n fii^(n).
and
Definition 17.2 (Null and Positive Recurrence). A recurrent state is:
• Null recurrent if Mii = ∞.
• Positive recurrent if Mii < ∞.
All of the recurrent states we have seen so far were positive recurrent. Are there any null
recurrent states? Before we check, I mention the related idea of passage times. For a chain
Xt started at X0 = i, define (temporarily)
τ = inf{t > 0 : Xt = j},
and let
fij^(n) = P[τ = n].
By the same sorts of calculations we did last class, we have
pij^(n) = Σ_{ℓ=1}^{n} fij^(ℓ) pjj^(n−ℓ)
fij^(n) = pij^(n) − Σ_{ℓ=1}^{n−1} fij^(ℓ) pjj^(n−ℓ).
We recall that P is the matrix with entries pij, and define F to be the matrix with entries
fij. When fij ≡ Σ_{n=1}^{∞} fij^(n) = 1, define the mean first passage time to be
Mij = Σ_{n=1}^{∞} n fij^(n).
The above tells us that Mij satisfies the equation
Mij = 1 + Σ_{k≠j} pik Mkj.
This is quite similar to the Chapman-Kolmogorov equation in spirit: both say that we can
decompose the first time that something happens in terms of intermediate events.
Section 9.4 of the textbook gives a discussion on how to change this into linear algebra. I
just mention that we can rewrite the equation as linear algebra as
M = E + P(M − Diag(M)),
where E = ee† and Diag(M)ij = Mij 1{i=j} (that is, the matrix that has the same diagonal
entries as M and 0's everywhere else).
We now find an example of a chain with states that are null recurrent but not positive
recurrent.
Example 92. Define a transition matrix by
pi,i+1 = pi,i−1 = 1/2
for i ∈ Z. We want to show that 0 is null recurrent.
First, recurrence. We have:
p0,0^(2n) = 2^{−2n} (2n choose n) ≈ 1/√(πn),
where the “≈” here is based on Stirling's formula (which would certainly be provided in an
exam). Thus,
Σn p0,0^(n) ≈ (1/√π) Σn 1/√n = ∞.
By our theorem giving equivalent conditions for a state to be recurrent, this implies that 0 is
recurrent for this Markov chain. On the other hand, I claim that M00 = ∞. To see this, let
Mij be the expected time to get from state i to state j for the first time. By symmetry, we
have all of the following relationships:
M0,0 = 1 + (1/2) M1,0 + (1/2) M−1,0
M0,1 = 1 + (1/2) M−1,1
M−1,1 = 2 M0,1.
Putting together the last two, we have
M−1,1 = 2 + M−1,1.
This can't be true for any finite value of M−1,1, so M−1,1 = ∞ and hence M0,1 = ∞. By symmetry
M1,0 = M−1,0 = M0,1, so plugging into the first expression, we have M0,0 = ∞ as well.
The following theorem tells us that there are no ‘really simple’ null-recurrent states:
Theorem 93. Let Xt be a finite time-homogeneous DTMC. Then
• No state is null-recurrent.
• Not all states are transient.
Proof. We sketch the arguments. For the first: if a state j is recurrent, then for all states i
there exist ni < ∞, pi > 0 so that pij^(ni) > pi. Thus, the mean recurrence time is at most
maxi ni · maxi (1/pi).
For the second: if all states were transient, the Markov chain would only spend a finite
amount of time in each state. But then the Markov chain would have nowhere to go... We give one last classification of states:
Definition 17.3 (Periodic). Let I = {n : pii^(n) > 0} and let p be the greatest common divisor
of I. Then i is called periodic with period p. If p = 1, it is also called aperiodic.
Remark 17.4. Please read section 9.4. It is fairly important, and we have spent a lot of
time on it, but we have skipped several details.
Example 94 (Based on 9.4.3 of textbook). Consider the matrix
P =
  0.1 0.4 0.5
  0.5 0.2 0.3
  0.4 0.4 0.2
We calculate the mean first passage times to state 1. By our recurrence, we have:
M21 = 1 + (0.2)M21 + (0.3)M31
M31 = 1 + (0.4)M21 + (0.2)M31 .
Rewriting, we have the linear equations:
0.8M21 − 0.3M31 = 1
−0.4M21 + 0.8M31 = 1.
I get
M21 ≈ 2.12
M31 ≈ 2.31.
We can then calculate
M11 = 1 + 0.4M21 + 0.5M31 ≈ 3.
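In R, this is just a small linear solve (my own check of the arithmetic above):
A <- matrix(c( 0.8, -0.3,
              -0.4,  0.8), nrow = 2, byrow = TRUE)   # coefficient matrix for (M21, M31)
M <- solve(A, c(1, 1))          # M21 is approximately 2.115, M31 approximately 2.308
1 + 0.4 * M[1] + 0.5 * M[2]     # M11, approximately 3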
17.2.3. Irreducibility. So far, we have discussed classifications of individual states. Here, we
discuss classifications of collections of states.
Definition 17.5 (Closed). A collection of states S is called closed if pij^(n) = 0 for all i ∈ S,
j ∉ S, and n ≥ 1. If |S| = 1, the single state is called absorbing. A set of states that is not closed
is called open.
We then have
Definition 17.6 (Irreducibility). If a Markov chain has a proper closed subset of states, it
is called reducible. Otherwise, it is called irreducible.
We have some equivalent definitions:
Definition 17.7 (Reachability). A state j is reachable or accessible from i if supn pij^(n) > 0.
We denote this i ↦ j.
A chain is irreducible if and only if all states are reachable from all other states.
Definition 17.8 (Communicating). We say i, j are communicating if i ↦ j and j ↦ i.
This can be written i ↔ j.
An important abstract relationship:
Definition 17.9 (Equivalence Relation). Fix a set S. A subset A ⊂ S 2 is an equivalence
relation on S if
(x, y) ∈ A ⇒ (y, x) ∈ A
(x, x) ∈ A
(x, y) ∈ A, (y, z) ∈ A ⇒ (x, z) ∈ A.
Remark 17.10. Let A be an equivalence relation on S. Then there exists a partition P =
P1, . . . , Pk of S that satisfies
x, y ∈ Pi for some i ⇔ (x, y) ∈ A.
Remark 17.11. Let S be the collection of states in a Markov chain. We define A ⊂ S 2 by
(i, j) ∈ A ⇔ (i ↔ j). This is an equivalence relation.
We briefly give some related, less important, definitions:
Definition 17.12 (Return states, etc). A state that communicates with itself is a return
state. The set of states that communicate with i is denoted by C(i), the class of i.
Remark 17.13. Section 9.5 has some further results on infinite Markov chains, and these
are probably worth reading.
17.2.4. Special Matrices. We recall that we have seen two special matrices already:
(1) The transition matrix P with entries pij.
(2) The reachability matrix F with entries fij.
In this section, we define some related matrices. The first is:
Definition 17.14 (Potential Matrix). The potential matrix, with entries rij, is
R = Σ_{n=0}^{∞} P^n.
Remark 17.15. We have
rij = E[Σ_{n=0}^{∞} 1{Xn=j} | X0 = i].
In particular, it is often the case that many entries of R are infinite, while other entries can
be 0. We will start out by studying all of the entries that ‘might’ not be 0 or infinity, then
come back and classify the remaining entries.
The sub-matrix of R whose rows and columns correspond to transient elements of the state space is called the
fundamental matrix and is denoted by S. These are the only entries of R that may be
neither 0 nor infinity. We often order the states of a Markov chain so that states
i, j with rij < ∞ come first. That is, we will write our transition matrices P in the form:
P =
  T U
  0 V
where T represents all transitions between transient states, V represents all transitions
between recurrent states, and U represents all transitions from transient to recurrent states.
Remark 17.16. This ordering doesn’t quite make sense if the Markov chain is infinite. In
that case, we keep the definition of T, U, V despite the fact that we can’t really ‘order’ the
states in a nice way and the fact that we can’t write
P =
  T U
  0 V
In this notation, we have that
R = Σ_{n=0}^{∞} P^n = Σ_{n=0}^{∞}
  T^n  U(n)
  0    V^n
where U (n) is a complicated collection of matrices that we will ignore for now. Thus, we
can write the fundamental matrix S, which is the upper left-hand part of R, as:
S = Σn T^n.
There is a nice formula for S:
S = (Id − T)^{−1}.
Remark 17.17. It isn’t so obvious that this formula makes sense. However, it is ‘really’ the
same thing as the following familiar Taylor series from calculus:
Σ_{n=0}^{∞} x^n = 1/(1 − x).
The textbook has more details.
Let’s use this idea:
Example 95. We consider a matrix P of the form
P =
  T U
  0 V
with
T =
  0.2 0.4 0.4
  0.3 0.3 0.3
  0.1 0.4 0.4
Then, using Gaussian elimination or the R command solve, we find
(Id − T)^{−1} =
  3.75  5   5
  2.625 5.5 4.5
  2.375 4.5 5.5
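In full, the R computation looks like this (a sketch of my own; I call the matrix Tmat because T is reserved as a shorthand for TRUE in R):
Tmat <- matrix(c(0.2, 0.4, 0.4,
                 0.3, 0.3, 0.3,
                 0.1, 0.4, 0.4), nrow = 3, byrow = TRUE)
S <- solve(diag(3) - Tmat)   # fundamental matrix S = (Id - T)^{-1}
S                            # matches the matrix displayed above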
We now look at the value of rij when we don’t have i, j transient:
• i recurrent: we have rij = ∞ if j ∈ C(i) and rij = 0 otherwise.
• i transient, j recurrent: we have rij = ∞ if maxn>0 pij^(n) > 0 and rij = 0 otherwise.
There are no real calculations to be done for these elements of the matrix.
We note that the entry sij of S gives the mean number of times that a Markov chain
enters state j when started from state i. Section 9.6 of the textbook gives similar formulas
for several related quantities, such as the variance of this number. We don't go over this in
class, but please look at this section. I expect you to be aware that the formulas are available.
We note that the textbook also discusses the relationship between the first-passage matrix
F and the potential matrix R. Let H denote the entries of F that correspond to transient
pairs of vertices (just like S denotes the entries of R that correspond to transient pairs of
vertices). The most important relationship is:
H = (S − Id)(Diag(S))^{−1},
where for any matrix M the entries of Diag(M ) are Diag(M )ij = Mij 1i=j .
Section 9.6 of the textbook has many more details. They are worth understanding if you
want to understand the theory, but I will not ask about any of the details in the exam or
expect you to remember them in class.
18. Lecture 17: More Chapter 9: Random Walks and Distributions (Mar.
16)
18.1. Summary.
• A few administrative details.
• More on Markov chains
18.2. Lecture. So far, we have focused on finite time-homogeneous DTMC’s, but almost all
of our theorems have also applied to infinite time-homogeneous DTMC’s. Today, we begin
to look at the infinite chains a little more seriously.
18.2.1. Random Walk Problems. We will look at ‘nice’ infinite Markov chains. The following
two theorems allow us to distinguish between positive recurrent, null recurrent and transient
chains. We give no proofs, or even arguments:
Theorem 96 (9.7.1 of Textbook). Let P be the transition matrix of an irreducible Markov
chain. Then all of the states of the chain are positive recurrent if and only if the system of
linear equations
z = zP
has a solution with Σj zj = 1. Furthermore, if such a solution exists, it is the unique solution
to these equations and satisfies zj > 0 for all j.
Theorem 97 (9.7.2 of Textbook). Let P be the transition matrix of an irreducible Markov
chain and let P′ be the matrix obtained by deleting the first row and column of P. Then all
of the states of the chain are recurrent if and only if all solutions y of the system of linear
equations
y = P′ y
satisfy at least one of
(1) y ≡ 0, or
(2) there exists i so that yi ∉ [0, 1].
These theorems apply to all Markov chains, but we will begin by applying them to a special
class of chains:
Definition 18.1 (Birth and Death Chain). A DTMC is called a birth-and-death chain if the
transition probabilities pij satisfy
pij = 0
for all |i − j| > 1.
We will study one particular birth-and-death chain in this section; the textbook has another.
Example 98 (Reflecting Random Walk). We define a family of birth-and-death chains with
parameter 0 < p < 1 as follows:
P =
  0    1    0    0    0    ...
  1−p  0    p    0    0    ...
  0    1−p  0    p    0    ...
  0    0    1−p  0    p    ...
  0    0    0    1−p  0    ...
  ...  ...  ...  ...  ...  ...
That is, at any state besides the first, we add 1 with probability p and subtract 1 with probability
1 − p. We would like to understand what this chain looks like. It is clear that there are
basically three things that can happen:
(1) p > 1/2: In this case, we ‘normally’ add 1. The law of large numbers (and a bit of
thought) is enough to tell us that limt→∞ Xt = ∞.
(2) p < 1/2: In this case, we ‘normally’ subtract 1 and expect to ‘bounce off of the origin’
quite frequently. None of our old theorems tell us exactly what to expect the Markov
chain to look like, but we might think that the chain spends a lot of time close to 0.
(3) p = 1/2: In this case, we wander back and forth, sometimes ‘bouncing off of the origin.’
The central limit theorem tells us what happens without this bouncing... but the long-term behaviour is quite mysterious! Of course, we saw some of the answers last
class.
We look for solutions to the linear equations suggested by our theorems. For the first, we
have
z1 = z0 / (1 − p)
zi = p zi−1 + (1 − p) zi+1, i > 1.
We claim that this has the solution zi = (p/(1 − p))^i z0 for any z0, but don't prove it (see the
textbook for details). Summing, we have
Σ_{i=0}^{∞} zi = z0 (1 − p)/(1 − 2p), for p < 1/2,
Σ_{i=0}^{∞} zi = ∞, for p ≥ 1/2.
Choosing z0 = (1 − 2p)/(1 − p) for p < 1/2, we get a solution with Σi zi = 1; for p ≥ 1/2, our calculations
show that there is no such solution.
Conclusion: All states are positive recurrent if and only if p < 1/2.
We still need to see if the states are null recurrent or transient when p ≥ 1/2. We define
P′ =
  0    p    0    0    ...
  1−p  0    p    0    ...
  0    1−p  0    p    ...
  0    0    1−p  0    ...
  ...  ...  ...  ...  ...
Then our solutions to y = P′ y look like
y1 = p y2
yi = (1 − p) yi−1 + p yi+1, i > 1.
A rather more complicated calculation shows that, when p = 1/2, the only solution is
yi = iy1 .
Thus, by our earlier theorem, we have that all states are null recurrent in this case.
When p > 1/2, we have a solution
yi = 1 − ((1 − p)/p)^i.
Thus, by the same theorem, we conclude that all states are transient in this case.
Conclusion:
(1) p > 1/2: All states are transient.
(2) p < 1/2: All states are positive recurrent.
(3) p = 1/2: All states are null recurrent.
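A quick simulation sketch in R (my own illustration, not from the textbook) makes the p < 1/2 case concrete: reflecting the walk at the first state, the empirical occupation frequencies settle down, as positive recurrence predicts; rerunning with p ≥ 1/2 instead shows the walk drifting away or returning only very slowly.
set.seed(1)
p <- 0.3; nsteps <- 1e5
x <- numeric(nsteps); x[1] <- 1
for (t in 1:(nsteps - 1)) {
  if (x[t] == 1) x[t + 1] <- 2          # from the first state we always move up
  else x[t + 1] <- x[t] + sample(c(1, -1), 1, prob = c(p, 1 - p))
}
table(x)[1:5] / nsteps                  # empirical frequencies of the five lowest states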
Section 9.7 of the textbook has another lovely example, called the Gambler’s ruin, that is
worth reading.
18.2.2. Limits and Stationary Distributions. We saw already, two classes ago, how to calculate
limn→∞ P[Xn = i],
at least for ‘nice’ finite Markov chains. In this section, we look at this problem in a more
systematic way.
First, some definitions:
πi (n) ≡ P[Xn = i].
In matrix notation, we write:
π(n) = π(0)P n .
We then have
Definition 18.2 (Stationary Distribution). Let P be the transition matrix of a time-homogeneous
DTMC and let z be a solution to
z = zP
that satisfies Σj zj = 1. Then z is called a stationary distribution of the Markov chain.
Remark 18.3. The theorem from the start of class said that there is at most one solution
z if P is also irreducible. If P is not irreducible, there may be many different solutions.
Regardless of whether or not P is irreducible, we have already seen examples showing that
there may be no solutions at all.
Remark 18.4. If z is a solution and we set π(0) = z, then
π(1) = π(0)P = π(0),
and in general π(n) = π(0). Thus, π(n) doesn't move - it is ‘stationary.’
Remark 18.5. This means that, at least sometimes, the stationary distribution is the limiting distribution... and when this occurs, we don’t have to diagonalize the whole matrix. It is
enough to solve a single linear equation. This is much easier! Depending on how you learned
to solve linear equations, note that taking transposes might be helpful:
z† = P †z†.
Example 99. Set
P =
  0.3 0.2 0.5
  0.5 0.1 0.4
  0   0.5 0.5
Then we can use R to solve z = zP with the commands
P = matrix(c(0.3,0.2,0.5,0.5,0.1,0.4,0,0.5,0.5),nrow=3)
v = eigen(P)$vectors[,1]
v/sum(v)
The result is z ≈ (0.22, 0.31, 0.47). You should get some practice solving these equations
by hand, using Gaussian elimination/row reduction.
We now have a related definition:
Definition 18.6 (Limiting Distribution). Let P be the transition matrix of a time-homogeneous
DTMC. If
limn→∞ P^n
exists, then
π ≡ limn→∞ π(n)
exists and is called a limiting distribution of the Markov chain.
Remark 18.7. This is a little misleading. The limiting distribution is not necessarily a
distribution at all! Consider the Markov chain Xt = t. This has no stationary distribution.
However, the vector π(i) = 0 for all i is a ‘limiting distribution’ according to our definition,
despite the fact that it isn’t a distribution function.
Theorem 100. When all states of a Markov chain are positive recurrent and aperiodic, then
there exists a unique limiting distribution which is also a stationary distribution.
Example 101. We have already seen chains with unique limiting distributions and nonunique limiting distributions, even with state space {0, 1}.
(1) Let {Xs}s∈N be an i.i.d. Bernoulli(0.5) sequence. This is also a Markov chain; it has
unique limiting (and stationary) distribution π(0) = π(1) = 1/2.
(2) Let Xs+1 = 1 − Xs. This has no limiting distribution, since P^n doesn't have a limit.
However, the distribution π(0) = π(1) = 1/2 is a stationary distribution.
(3) Let Xs+1 = Xs. This has infinitely many limiting distributions, which are also stationary distributions. For any q ∈ [0, 1], you can check that πq(1) = q, πq(0) = 1 − q
is a stationary measure for the chain.
We skip ahead a little, to pp. 240 of the textbook, and give some examples showing the
relationship between stationary distributions and limiting distributions, as well as uniqueness
of each:
Example 102 (Irreducible chains that are null recurrent or transient). Our previous theorem tells us that any irreducible chain for which all states are null recurrent or transient
cannot have a stationary distribution. If these chains have a limiting distribution, it must be
identically 0.
Example 103 (Irreducible chains that are positive recurrent). These chains always have at
least one stationary distribution, given by
π(i) = 1/Mii.
As we can see by looking at examples we already know, this may not be unique, and may not
be a limiting distribution, if the chain is periodic.
The last example gives us all ‘nice’ DTMCs; the other classes mostly indicate that you
should really be analyzing several different chains:
Example 104 (Irreducible chains that are aperiodic). These chains always have a unique
limiting distribution. This limiting distribution π either satisfies:
• π ≡ 0, in which case all states are transient or null recurrent.
• π(i) > 0 for all i, in which case all states are positive recurrent and the limiting
distribution is a stationary distribution.
This class of chains is very important. Pages 242-243 of the textbook give some formulas
associated with this stationary distribution. The formulas are consequences of things we have
already written down, but are worth looking at.
Example 105 (Irreducible chains that are periodic). These chains cannot have limiting
distributions, since P n does not have a limit. If the states are all positive recurrent, they will
have stationary distributions. However, they will not have a unique stationary distribution.
The textbook discusses these chains in greater detail. I will not emphasize them, however,
and mention only the observation that if all states are periodic with period k, then P k is an
aperiodic irreducible chain. This is often the chain that you ‘really’ want to analyze.
18.2.3. Reversibility. All of our categorizations so far have been related to ‘big’ problems that
stochastic processes might have:
• Discrete-time, time-homogeneous Markov chains are a very nice class of Markov
chains, where we can start to develop some nice formulas.
• Recurrence/transience, periodicity, reducibility all detect macroscopic problems (e.g.
states that you can ignore after some short initial period, chains that ‘look like’ disjoint unions of smaller chains, etc).
Reversibility is rather more subtle. The idea is as follows. We consider a Markov chain
{Xs }s∈Z , and then define the ‘time-reversed’ chain Ys = X−s . We would like to be able
to say if {Ys }s∈Z is the ‘same’ as {Xs }s∈Z . This property of being the ‘same’ is known as
reversibility.
Some observations and pointers:
• Our notion of ‘the same’ should be at the level of distribution. It should also ignore
the ‘starting’ point X0 , since that starting point is sort of arbitrary.
• This is relevant to modelling: we expect any system from physics to be reversible.
• This turns out to be mathematically relevant as well - reversible Markov chains have
some very nice properties. However, that is much less obvious.
Assume that our chain has a stationary distribution π, and that X0 ∼ π. Then:
P[Xs = j|Xs+1 = i] = P[Xs = j, Xs+1 = i] / P[Xs+1 = i] = pji π(j) / π(i).
The textbook defines this number to be rij, and the associated matrix to be R. This is not
the potential matrix, which was also written as R and which we studied in section 9.6. We
have:
Definition 18.8 (Reversibility). A time-homogeneous DTMC is reversible if
R = P.
This is equivalent to
π(i)pij = π(j)pji ,
sometimes called the detailed balance equation.
We give some results related to reversibility:
Theorem 106. All birth-and-death chains are reversible.
Theorem 107. All irreducible, positive-recurrent chains with transition matrices P that
satisfy P = P † are reversible, and have stationary distribution that is uniform.
Example 108. We find the reversal of a Markov chain. Let
P =
  0.7 0.2 0.1
  0.5 0.1 0.4
  0.2 0.5 0.3
We can calculate its stationary distribution by solving the equation
v = P† v,
which gives us π ≈ (0.54, 0.24, 0.22). Recall that the time reversal is rij = pji π(j)/π(i). Thus,
R ≈
  0.7  0.22 0.08
  0.45 0.1  0.45
  0.25 0.45 0.3
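The whole computation fits in a few lines of R (a sketch of my own; the outer() call just builds the matrix of ratios π(j)/π(i)):
P <- matrix(c(0.7, 0.2, 0.1,
              0.5, 0.1, 0.4,
              0.2, 0.5, 0.3), nrow = 3, byrow = TRUE)
v <- eigen(t(P))$vectors[, 1]
pivec <- Re(v / sum(v))              # stationary distribution, about (0.54, 0.24, 0.22)
R <- t(P) * outer(1 / pivec, pivec)  # r_ij = p_ji * pi(j) / pi(i)
round(R, 2); rowSums(R)              # each row of the reversal sums to 1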
There are many reasons to be interested in reversibility. One of my favorites, from statistics and computing, is the Metropolis-Hastings algorithm:
Example 109 (Metropolis-Hastings Algorithm). Let P be the transition matrix of a reversible Markov chain with stationary distribution π > 0. Let µ > 0 be a distribution of
interest. Define
aij = min(1, µ(j)pji / (µ(i)pij)).
Then define the adjusted transition matrix Q by
qij = pij aij, i ≠ j
qii = 1 − Σ_{j≠i} qij.
Then Q is a reversible Markov chain with stationary distribution µ.
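Here is a minimal R sketch of the construction, under the (hypothetical) assumption that the proposal chain is the uniform chain on three states and the target is µ = (0.5, 0.3, 0.2):
mu <- c(0.5, 0.3, 0.2)                        # hypothetical target distribution
P  <- matrix(1/3, 3, 3)                       # proposal: uniform chain (symmetric, hence reversible)
A  <- pmin(1, t(P) * outer(1 / mu, mu) / P)   # a_ij = min(1, mu(j) p_ji / (mu(i) p_ij))
Q  <- P * A
diag(Q) <- 0
diag(Q) <- 1 - rowSums(Q)                     # fill in the diagonal so each row sums to 1
mu %*% Q                                      # returns mu: mu is stationary for Q
Changing mu and re-running gives the same conclusion for any strictly positive target.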
Remark 18.9. This example turns out to be very useful. We will see why in the homework
set.
19. Lecture 18: Closing Chapter 9: Continuous-Time and Special Processes
(Mar. 19)
19.1. Summary.
• A few administrative details.
• Yet more on Markov chains
19.2. Lecture. Today we discuss continuous-time Markov chains. These are slightly complicated objects in full generality, but we will end up only looking at continuous-time Markov chains
that are ‘really’ discrete-time Markov chains in a light disguise. We give first the following
non-rigorous definition:
Definition 19.1 (Almost-a-definition for continuous time Markov chains). The processes we
studied when defining phase-type distributions are continuous-time, time-homogeneous, finite
Markov chains. All of the continuous-time Markov chains we study will be of this form.
In particular, the exponential distribution will be quite important. This isn’t obvious from
the following, more precise, definition:
Definition 19.2 (Continuous-Time Markov Chain). A continuous time, discrete space Markov
chain is a stochastic process {Xt }t∈R that satisfies
P[X(tn+1 ) = xn+1 |X(tn ) = xn , . . . , X(t1 ) = x1 ] = P[X(tn+1 ) = xn+1 |X(tn ) = xn ]
for all t1 < t2 < . . . < tn+1 and all x1 , . . . , xn+1 .
We then have
Definition 19.3 (Time-Homogeneous Continuous-Time Markov Chain). A continuous time,
discrete space Markov chain is time-homogeneous if
P[X(s + t) = j|X(s) = i] ≡ pij (s, s + t)
does not depend on s.
When a chain is time-homogeneous, which will be the case for all chains of interest, we
will write pij (τ ) = pij (s, s + τ ). We note that these transition probabilities still satisfy
Σj pij(τ) = 1
for all i and all τ .
Although we have transition probabilities, they are less fundamental to the study of continuous-time Markov chains. Instead, we consider:
Definition 19.4 (Instantaneous Transition Rate). For states i ≠ j, define
qij = limh→0 (pij(h) − pij(0)) / h.
For state i, then define
qii = − Σ_{j≠i} qij.
Definition 19.5 (Infinitesimal Generator). The infinitesimal generator of a Markov chain
is the matrix Q with entries qij .
Remark 19.6. It isn’t so clear that pij (t) should have a derivative at 0 (e.g. that qij actually
exists). In fact, we already know that this derivative can fail to exist: i.i.d. sequences satisfy
our definition for continuous-time Markov chains, and derivatives generally don’t exist for
this type of Markov chain. We’ll ignore this difficulty and assume that qij exists and is finite
unless we explicitly say otherwise.
Note that the textbook glosses over this problem completely, but it seems important to bring
up: we have essentially looked at all possible DTMCs, but we are not looking at all possible
continuous-time Markov chains (CTMCs).
We’ll now see why exponential distributions show up everywhere for Markov chains with
infinitesimal generators. Fix i, assume X0 = i and define the holding time
Ti = inf{t > 0 : Xt ≠ i}.
Theorem 110. Ti has exponential distribution.
Proof. Fix a, b > 0. By the Markov property,
P[Ti > a + b|Ti > a] = P[Ti > a + b|Xa = i]
= P[Ti > b].
But this says exactly that Ti has the memoryless property. The exponential distribution is the
only continuous distribution with the memoryless property, and so Ti must have exponential
distribution.
We can do better. Not only does Ti have exponential distribution; we can even decompose
it in terms of other exponential distributions!
To be explicit, for all j ≠ i, define Tij to be an exponential random variable with mean 1/qij. Then
Ti has the same distribution as infj(Tij) and P[XTi = j] = P[Tij = infk Tik].
Thus, we have a correspondence between the continuous-time Markov chains we study here
and discrete-time Markov chains:
Definition 19.7 (Skeleton Chain). Let τ0 = 0. Then define
τi+1 = inf{t > τi : Xt ≠ Xτi}.
The skeleton of Xt is the discrete-time Markov chain Yi = Xτi .
We have:
Definition 19.8 (Correspondence). Fix a state space {1, 2, . . . , n} and let Q be an infinitesimal transition matrix. Then the skeleton of the associated chain has transition matrix
pij = qij / Σ_{k≠i} qik
for j ≠ i and pii = 0.
Remark 19.9. This correspondence means that we can do almost every calculation we might
be interested in by just looking at the associated discrete-time chain, and then making adjustments to take into account the holding times.
Warning: it is very tempting to just do all of the calculations for the discrete-time
chain, and apply them without any adjustments to the continuous-time chain. Somebody will
do this on an exam, but it doesn't work. To see this, consider the family of infinitesimal
generators
QC =
  −C  C
  1   −1
where 0 < C < ∞. We calculate that the skeleton of this chain has transition matrix
PC =
  0 1
  1 0
for any value of C. Thus, any answers you get based only on the skeleton of the chain will
not depend on C. This can be deeply misleading. For example, we will calculate soon that
for any starting point X0,
limt→∞ P[Xt = 2] = C/(1 + C),
which depends a great deal on C! However, the skeleton chain has unique stationary distribution π(1) = π(2) = 1/2.
Again, somebody will make this mistake on the exam. Don't be that person! (REVISIT
this example in the section on stationary distributions!)
We now try to figure out how to use the correspondence properly. For qualitative features
of the Markov chain, we really don’t need to adjust very much. For example, we say that
a continuous-time chain is irreducible if and only if the skeleton is irreducible. The same
carrying-over happens for all of our other classifications, with the exception of periodicity:
we always say that a continuous-time chain is aperiodic.
We do get continuous-time versions of all of our formulas. We won’t go over all of these
in class, but we will illustrate some of them today and when doing problems next class. You
should be comfortable using other analogous formulas.
We give one example to start. The continuous-time version of the Chapman-Kolmogorov
equation is:
pij(τ) = Σk pik(s) pkj(τ − s)
for all i, j and all 0 < s < τ. Again, in matrix form, this is
P(τ) = P(s) P(τ − s).
However, since we have continuous-time chains, we can take derivatives of this expression:
(d/dt) P(t) = Q P(t) = P(t) Q.
This has a rather nice solution:
P(t) = e^{Qt} ≡ Σ_{n=0}^{∞} Q^n t^n / n!.
Remark 19.10. This last formula is similar to the much more trivial solution P(t) = P(1)^t
we had for discrete-time Markov chains.
19.2.1. Stationary, limiting and reversible distributions. Stationary and limiting distributions could be defined exactly as in the discrete-time case. However, as we have seen, there is some subtlety
here: we don’t want to define the stationary or limiting distribution as coming from the
skeleton of the chain.
The good news is that the theory of limiting and stationary distributions are very similar to
the discrete-time setting. The textbook has all of the details, but I highlight the one important
difference and two most important similarities:
• Although the skeleton of a chain might be periodic, no continuous-time chains are
periodic.
• Other than the fact that all continuous-time chains are aperiodic, all of the existence/uniqueness theorems concerning discrete-time chains apply as written.
• Even the formula for a stationary distribution in terms of mean recurrence time still
holds!
Although we can define stationary and limiting distributions exactly as we did for DTMC’s,
it is convenient to use some closely related definitions:
Definition 19.11 (Stationary Distribution). Consider a continuous-time Markov chain
(CTMC) with generator Q. If the equation
zQ = 0
has a solution with Σi zi = 1, then we call z a stationary distribution of Q.
Remark 19.12. This is just another way of saying our old definition. Recall that, in discrete
time, we wanted z to solve
z = zP.
However, if z is a stationary distribution for a CTMC, we have for all t > 0:
zP(t) = z Σ_{n=0}^{∞} (t^n Q^n / n!) = z Id + zQ Σ_{n=0}^{∞} (t^{n+1} Q^n / (n + 1)!) = z.
Thus, our new definition agrees with our old one.
Since there are no periodic continuous-time Markov chains, the distinction between limiting
and stationary distributions is less interesting. The following theorem, which is identical to
one that we already saw for discrete-time Markov chains, gives us conditions under which
the limiting and stationary distributions are the same:
Theorem 111. For any finite, irreducible, continuous-time Markov chain, there exists a
unique limiting distribution and it is equal to the unique stationary distribution.
We revisit our old example:
Example 112. Recall the CTMC with generator:
QC =
  −C  C
  1   −1
We then note that any stationary distribution satisfies:
−C z1 + z2 = 0
C z1 − z2 = 0
and
z1 + z2 = 1.
Putting these together, we have
z1 + C z1 = 1,
so
(z1, z2) = (1/(1 + C), C/(1 + C)).
This is what we had before.
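We can also check the limit numerically. The R sketch below (my own aside) builds e^{Qt} from the truncated series above and confirms that both rows approach (1/(1+C), C/(1+C)), here with the hypothetical value C = 3:
mexp <- function(A, nterms = 100) {          # truncated series for the matrix exponential
  S <- diag(nrow(A)); term <- diag(nrow(A))
  for (n in 1:nterms) { term <- term %*% A / n; S <- S + term }
  S
}
C <- 3
Q <- matrix(c(-C,  C,
               1, -1), nrow = 2, byrow = TRUE)
round(mexp(Q * 5), 4)    # both rows are approximately (0.25, 0.75) = (1/(1+C), C/(1+C))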
19.2.2. Reversibility. Reversibility is similar in continuous time:
Definition 19.13 (Reversible CTMC). A CTMC is reversible if
πi qij = πj qji .
We have
Theorem 113. A CTMC is reversible if and only if its skeleton is reversible.
19.2.3. Not-Quite-Markov Processes. We’ve finished our general introduction to Markov chains,
bringing us up to section 9.11 of the textbook. The next two sections of Chapter 9 discuss
things that aren’t quite Markov chains, despite the chapter heading.
These sections are actually building towards our last major topic: queuing theory. Just
like we began with stochastic processes and then came up with definitions of nicer and nicer
stochastic processes before getting to reversible, irreducible, aperiodic Markov chains, we begin
with some very general classes of stochastic processes and will eventually get to some very
nice Markov-chain-like objects called queues.
Our first definition is quite general, and seems a little silly at first. We generalize our
notion of a skeleton:
Definition 19.14 (Skeleton Process). Let {Xt }t∈R be a continuous-time stochastic process.
Let τ0 = 0. Then define the transition times
τi+1 = inf{t > τi : Xt ≠ Xτi}.
The skeleton of Xt is the discrete-time stochastic process Yi = Xτi .
We then have:
Definition 19.15 (Semi-Markov Process). A continuous-time stochastic process {Xt}t∈R is
called a semi-Markov process if its skeleton is a DTMC.
Example 114. To construct a big class of semi-Markov process, fix any transition matrix P
for a DTMC and any CDF for a nonnegative random variable F . Let {Yt }t∈N be a DTMC
evolving according to P , and let {τi }i∈N be an i.i.d. sequence of random variables distributed
according to F . Define
Mt = max{i ≥ 0 : Σ_{ℓ=1}^{i} τℓ < t}.
Then
Xt = YMt
is a semi-Markov process. This doesn’t represent all semi-Markov processes, but it gives a
way to build many of them.
Definition 19.16. A stochastic process {Nt }t≥0 is a counting process if
• N (0) ≥ 0.
• N (s) ∈ Z.
• N (s) ≥ N (t) whenever s ≥ t.
We then look at some special semi-Markov processes:
Definition 19.17. A counting process {Nt }t≥0 with skeleton {Mn }n∈N and transition times
{τi }i∈N is a renewal process if the sequence of transition times is i.i.d.
Remark 19.18. This is the definition from the textbook, and so I follow it in these notes.
However, you will see slightly different definitions in other books.
Under this definition, renewal processes may or may not be a semi-Markov process. However, essentially all of the renewal processes that we care about will be semi-Markov.
Example 115 (Nice renewal processes).
• Let X1, X2, . . . be a sequence of Bernoulli(p) random variables and define
N(t) = Σ_{0≤i<t} 1{Xi=1}.
We note that the transition times have geometric distribution with mean 1/p. Thus,
N(t) is a renewal process and also a semi-Markov process. It is called the binomial
process.
• Let {Xi}i∈N be any sequence of i.i.d. nonnegative random variables (e.g. a sequence
of Gamma random variables). Then
Mt = max{i ≥ 0 : Σ_{ℓ=1}^{i} Xℓ < t}    (19.1)
is a renewal process, and also a semi-Markov process. Note: the binomial process
is a special case of this construction, with Xi being geometric random variables with
mean 1/p.
Note: Although the textbook doesn’t say so explicitly, it is only interested in renewal
processes of this form. Be aware of the fact that many of its calculations don’t work
for all of the processes encompassed by its definition.
Example 116 (Bad renewal process). Consider a process {Xn}n∈N that satisfies:
X0 = 0,
P[X1 = 1] = P[X1 = 2] = 1/2,
Xn = Xn−1 + X1 for all n > 1.
Then let Nt = X⌊t⌋. This is a counting process (it always increases) and a renewal process (the time between changes is always exactly 1, which is certainly an i.i.d. sequence).
However, it is not a semi-Markov process: the value of Xn does not become independent of
X1 .
In many books, this would not be considered a renewal process at all (though it is still a
counting process).
The most important renewal process is:
Definition 19.19 (Poisson Process). Let X1, X2, . . . be i.i.d. random variables with exponential distribution and mean 1/λ. Then
N(t) = max{i ≥ 0 : X1 + · · · + Xi < t}
is called a Poisson process.
Remark 19.20. The Poisson process is very nice! For example,
• It is actually a Markov chain, not just a semi-Markov process.
• We can do lots of special calculations with it! For example, let Sn = X1 + · · · + Xn. Then
Sn has an Erlang distribution, and
P[N(t) = n] = P[Sn ≤ t] − P[Sn+1 ≤ t] = e^{−λt} (λt)^n / n!.
We could try to do the same calculations for other renewal processes, but we generally
don’t have an explicit formula for the CDF of Sn . Note: that last step is really
the only obstacle - we aren’t using anything fancy about the renewal process being a
Markov chain.
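As a quick numerical sanity check (a sketch; λ = 2, t = 1.5 and n = 3 are arbitrary), the Erlang-CDF difference and the Poisson PMF agree in R:

lambda <- 2; t <- 1.5; n <- 3
# S_n is Erlang(n, lambda), i.e. Gamma with shape n and rate lambda.
pgamma(t, shape = n, rate = lambda) - pgamma(t, shape = n + 1, rate = lambda)
dpois(n, lambda * t)                      # both print about 0.224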
20. Lecture 19: Review of Chapter 9 (Mar. 23)
20.1. Summary.
• A few administrative details.
• Last bit of chapter 9: renewal processes.
• We review the types of questions that can be asked about this material.
20.2. Lecture. Recall: renewal processes look like:
Example 117 (Nice Renewal Processes). Let {Xi }i∈N be any sequence of i.i.d. nonnegative
random variables. Then
Mt = max{i ≥ 0 : X1 + · · · + Xi < t}
is a renewal process.
For general renewal processes, we often content ourselves with moments rather than full
CDFs:
Definition 20.1. We call
M (t) = E[N (t)]
the renewal function of a renewal process {N (t)}t≥0 .
For nice enough renewal processes, the renewal function actually defines the entire process,
in the same way that the transition matrix defined a DTMC:
Theorem 118. Let N(t) be a renewal process of the sort defined in equation (19.1). Define
M̃(s) = ∫_0^∞ e^{−st} M(t) dt,
F̃X(s) = ∫_0^∞ e^{−st} FX(t) dt.
The fundamental renewal equation is:
F̃X(s) = M̃(s)/(1 + M̃(s)),    equivalently    M̃(s) = F̃X(s)/(1 − F̃X(s)).
Since the renewal processes of this form are characterized completely by FX, this implies that
they are also characterized completely by M̃(s).
Remark 20.2. What are some consequences of the fact that a renewal process is ‘characterized completely’ by its renewal function? One is that somebody can ask a question by saying
something like “let N(t) be the renewal process with renewal function M(t) = Foo,” just
like we often began questions with “let X be the normal random variable with mean Foo and
variance Bar.”
That is: when we describe a renewal process to you, we can just tell you the renewal
function.
Remark 20.3. At first glance, this should be very surprising! After all, the random variables
X define the renewal process, and they certainly aren’t characterized completely by their
means!!
When it exists, we define:
Definition 20.4. The derivative m(t) = (d/dt) M(t) of the renewal function is called the renewal
density.
Taking derivatives of the fundamental renewal equation gives:
m(t) = fX(t) + ∫_0^t m(t − s) fX(s) ds.
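For the Poisson process this equation is easy to verify numerically, since m(t) = λ and fX is an exponential density. A rough R sketch (λ = 2 and t = 1.7 are arbitrary):

lambda <- 2; t <- 1.7
m <- function(u) rep(lambda, length(u))    # renewal density of the Poisson process
f <- function(s) dexp(s, rate = lambda)
f(t) + integrate(function(s) m(t - s) * f(s), 0, t)$value   # returns lambda = 2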
20.2.1. Chapter 9 Review. (Most important:) most exam questions will be similar to the
computational questions in the back of chapter 9 of the textbook. As we covered essentially
all of chapter 9, you should make sure that you are comfortable with essentially all of these
questions. The end of this section of the lecture notes has one example of most ‘types’ of
questions.
Today, I’ll go over the main subjects in chapter 9, discuss the types of questions that
we should be able to answer about this material, and give some examples. I will begin by
emphasizing things that are not sample questions in the textbook. This isn’t because they are
more important; it is because we won’t have time to go over everything, and I expect you to
read the textbook problems as well.
The main subjects we covered in chapter 9 were:
• A fairly small number of objects that we use in computation: the transition matrix
P , the expected transition times Mij , potential and fundamental matrices R and S,
the limiting and stationary distributions z, and the infinitesimal generator Q of a
continuous-time chain. A large percentage of the Markov chain questions will be
about computing these objects.
• A fairly large number of definitions. From my point of view, these basically fall into
three categories:
(1) The basics: things like what a stochastic process is, what a Markov chain is, etc.
(2) Things to compute: these are the things above, like the expected transition times.
(3) Classification: these are the words like transient, irreducible, closed state, aperiodic. Several people have said that they find these definitions hard to remember.
I have one main tip here. All of these definitions are really about describing a
problem with how a Markov chain is behaving. For example, when a Markov
chain is reducible, it means that a whole bunch of states that are on the graph
associated with the Markov chain won’t actually get visited after some initial
period.
So, to memorize these definitions, I suggest memorizing alongside them an example of the problems that they are meant to avoid. For example, when memorizing
what ‘periodic’ means, one might try to associate:
– The (rather opaque) definition: a state i is periodic if the set {n : p_ii^(n) > 0}
has greatest common divisor larger than 1 (and aperiodic if the gcd is 1).
– The picture: a Markov chain with d > 1 states arranged in a circle, and
all arrows pointing clockwise.
– The obstacle: a Markov chain that is periodic has no limiting distribution
(for the previous picture, this is because P^{kd+ℓ} = P^ℓ for all k, while P, P^2, . . . , P^d
are all distinct) but often has a unique stationary distribution (in this case,
the uniform distribution). This is viewed as surprising - it is more ‘usual’
to have a unique limiting distribution that is equal to the unique stationary
distribution.
If trying to remember what ‘reducible’ means, one might try to associate:
– The (again rather opaque) definition: a Markov chain is irreducible if, for
all states i, j, we have sup_n p_ij^(n) > 0. Otherwise, it is reducible.
– The picture: Two ‘nice’ Markov chains next to each other, with no arrows
between them.
– The obstacle: a Markov chain that is reducible will not have one stationary
distribution (‘nice’ reducible chains will have infinitely many, but some
might have 0).
Associating the definition, example and obstacle has a few advantages. The
definition will make more sense, and might be easier to remember. It will also
let you quickly answer various tricky questions quite easily. For example:
Example 119. Give an example of a Markov chain with at least two different
stationary distributions and no limiting distribution. Provide the two stationary
distributions.
To answer this, we recall that periodic chains often have no limiting distributions,
while reducible chains often have many stationary distributions. So we write
down the simplest chain that is both:
P = [ 0 1 0 0 ]
    [ 1 0 0 0 ]
    [ 0 0 0 1 ]
    [ 0 0 1 0 ].
Two stationary distributions are z1 = (0.5, 0.5, 0, 0) and z2 = (0, 0, 0.5, 0.5).
Example 120. Let
P = [ 0.6 0.4 0   0   ]
    [ 0.3 0.7 0   0   ]
    [ 0   0   0.5 0.5 ]
    [ 0   0   0.8 0.2 ].
Create a new matrix P′ that differs from P in exactly two elements in the first
and third rows, so that P′ has a unique stationary distribution.
This question is trickier. The first thing you must discover is: why doesn’t P
already have a unique stationary distribution? If you draw the picture for this
transition matrix, you will quickly see that it is reducible. Reducible chains often
don’t have unique stationary distributions, so the obvious fix is to make the chain
irreducible:
P′ = [ 0.5 0.4 0.1 0   ]
     [ 0.3 0.7 0   0   ]
     [ 0   0.1 0.4 0.5 ]
     [ 0   0   0.8 0.2 ].
(Both of these stationary-distribution claims are easy to verify numerically; see
the short R check after this list.)
Remembering this can also make multiple-choice questions much easier. For
example, you can answer a question like ‘which of the following Markov chains
does not have a unique stationary distribution’ without having to do all of the
linear algebra to actually compute stationary distributions.
• Several theorems. I will generally not ask you to state a theorem, but will certainly
ask some related questions. The examples above related to stationary and limiting
distributions were very closely related to some theorems that were stated in class, but
didn’t actually require the theorems in order to be solved.
• Some computational strategies and formulae. The most important of these is the
Chapman-Kolmogorov equation.
• Some miscellaneous bits. Most of these are not important to the course (e.g. Metropolis-Hastings chains; the details of the calculations we did for reflecting random walk). A
few, such as the renewal equation, are very important.
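As promised above, here is a quick R check (my own sketch, not from the textbook) that the matrices in Examples 119 and 120 behave as claimed.

P <- matrix(c(0, 1, 0, 0,
              1, 0, 0, 0,
              0, 0, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
z1 <- c(0.5, 0.5, 0, 0); z2 <- c(0, 0, 0.5, 0.5)
all.equal(as.vector(z1 %*% P), z1)         # TRUE: z1 is stationary
all.equal(as.vector(z2 %*% P), z2)         # TRUE: z2 is stationary
Pp <- matrix(c(0.5, 0.4, 0.1, 0,
               0.3, 0.7, 0,   0,
               0,   0.1, 0.4, 0.5,
               0,   0,   0.8, 0.2), nrow = 4, byrow = TRUE)
A <- cbind(Pp - diag(4), rep(1, 4))
qr.solve(t(A), c(rep(0, 4), 1))            # the unique stationary distribution of P'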
20.2.2. Important Calculations. All of these questions are very similar to questions in the
textbook. I have included the number of the associated textbook question.
Example 121 (Path Probability (9.2.3)). Let
P = [ 0   0   1   0   0   ]
    [ 0   0   0   1   0   ]
    [ 0.2 0   0   0   0.8 ]
    [ 0   0.6 0   0   0.4 ]
    [ 0   1   0   0   0   ].
What is the probability of being in states 1, 2, . . . , 5 after starting in state 1?
We draw:
Example 122 (Classification of States (9.4.2)). Classify the states and subsets of the Markov
chain drawn here:
Example 123 (Potential Matrix (9.6.1)). Compute the potential matrix of
P = [ 0   0.3 0.7 0   0   ]
    [ 0.4 0   0   0.6 0   ]
    [ 0.2 0   0   0.8 0   ]
    [ 0   0   0.5 0   0.5 ]
    [ 0   0   0   0   1   ]
and use this to compute the expected number of visits:
• To state 2, started at state 1.
• To state 1, started at state 1.
• To state 4, started at state 3.
We begin by classifying states. Drawing the diagram for this chain, it is easy to check that
states 1,2,3,4 are transient while state 5 is recurrent. Thus,
T = [ 0   0.3 0.7 0   ]
    [ 0.4 0   0   0.6 ]
    [ 0.2 0   0   0.8 ]
    [ 0   0   0.5 0   ].
We then have S = (Id − T)^{−1}. Calculating this, we have
S ≈ [ 1.62 0.49 2.14 2 ]
    [ 0.81 1.24 1.57 2 ]
    [ 0.54 0.16 2.38 2 ]
    [ 0.27 0.08 1.19 2 ]
Note: it is striking that the last column has all 2’s. Where does this come from? Well,
the only way to exit the group of transient states is to go through state 4. Also, every time
you go to state 4, you exit the group of transient states with probability 0.5. Thus, regardless
of where you start, the number of times through state 4 before exiting the group of transient
states is actually a geometric random variable with parameter 0.5. In particular, this number has
mean 2.
Asking for the expected number of trips through a state like this in a much larger Markov
chain would make a (slightly mean) tricky question. Of course, now that I’ve pointed it out,
I would feel less guilty about asking it...
Continuing, we can read off from the matrix S that the three answers are:
• 0.49.
• 1.62.
• 2.
NOTE: I might ask you to compute the expected number of visits from i to j without
telling you to compute the potential matrix. This is harder for many people, as you have to
remember that you should calculate the potential matrix!
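A short R sketch of this computation (building T from the transient states and inverting Id − T):

T <- matrix(c(0,   0.3, 0.7, 0,
              0.4, 0,   0,   0.6,
              0.2, 0,   0,   0.8,
              0,   0,   0.5, 0), nrow = 4, byrow = TRUE)
S <- solve(diag(4) - T)                    # potential matrix on the transient states
round(S, 2)
S[1, 2]; S[1, 1]; S[3, 4]                  # the three expected numbers of visits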
Example 124 (Stationary Distributions via Return Times (9.8.4)). Consider the transition
matrix:
P = [ 0.6 0.3 0.1 ]
    [ 0.1 0.7 0.2 ]
    [ 0.1 0.1 0.8 ]
(1) Does this matrix have a unique stationary distribution? What about a unique limiting
distribution?
(2) Assume that you started in state 1. What is limt→∞ P[Xt = 3]?
(3) What is the mean recurrence time of state 2?
This question is really asking two things:
(1) Can you calculate the stationary distribution?
(2) Do you remember some of the theorems we have about limiting distributions?
Let’s do the computational part first, as almost everybody finds it easier. We solve
z = zP
z1 + z2 + z3 = 1
to find z = (0.2, 0.35, 0.45). So, we have found a stationary distribution. We now use this
to actually answer the question:
(1) Yes. P is the transition matrix of an irreducible aperiodic finite DTMC, and so there
is a unique stationary distribution that is also the limiting distribution.
(2) Since the limiting distribution is equal to the stationary distribution, this probability
is z3 = 0.45.
(3) For an irreducible aperiodic finite DTMC, zi = 1/Mii. Thus, M22 = 1/0.35 ≈ 2.86.
Note: This question might look like a question about computation. However, for almost
everybody, solving z = zP will be the easiest part. Make sure that you understand the rest
of the solution!
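For completeness, a hedged R sketch of the computational part (solving z = zP with the normalization constraint):

P <- matrix(c(0.6, 0.3, 0.1,
              0.1, 0.7, 0.2,
              0.1, 0.1, 0.8), nrow = 3, byrow = TRUE)
A <- cbind(P - diag(3), rep(1, 3))         # encode z(P - Id) = 0 and sum(z) = 1
z <- qr.solve(t(A), c(0, 0, 0, 1))
z                                          # (0.20, 0.35, 0.45)
1 / z[2]                                   # mean recurrence time of state 2, about 2.86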
Example 125 (Reversed Chain (9.9.1)). For each transition matrix below, find the transition
matrix of the reversed chain:
P1 = [ 0.6 0.3 0.1 ]
     [ 0.1 0.7 0.2 ]
     [ 0.1 0.1 0.8 ]
P2 = [ 0 0 1 ]
     [ 1 0 0 ]
     [ 0 1 0 ]
We first calculate the stationary measures π^(1), π^(2) of P1, P2. They are:
π^(1) = (0.2, 0.35, 0.45)
π^(2) = (1/3, 1/3, 1/3).
We then use the formula r_ij = π_j p_ji / π_i for the reversed chain and find:
R1 ≈ [ 0.6  0.175 0.225 ]
     [ 0.17 0.7   0.13  ]
     [ 0.04 0.16  0.8   ]
R2 = [ 0 1 0 ]
     [ 0 0 1 ]
     [ 1 0 0 ]
Note: The second is interesting to draw. It changes a Markov chain that goes around a
circle counter-clockwise into a Markov chain that goes around the same circle in the opposite
direction.
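A small R sketch of the reversal computation (using the stationary distribution found above):

P1  <- matrix(c(0.6, 0.3, 0.1,
                0.1, 0.7, 0.2,
                0.1, 0.1, 0.8), nrow = 3, byrow = TRUE)
pi1 <- c(0.2, 0.35, 0.45)
# r_ij = pi_j * p_ji / pi_i, i.e. R1 = diag(1/pi) %*% t(P1) %*% diag(pi)
R1  <- diag(1 / pi1) %*% t(P1) %*% diag(pi1)
round(R1, 3)
rowSums(R1)                                # sanity check: each row sums to 1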
Example 126 (Embedded Chain (9.10.3)). Construct the embedded chain of the continuous-time Markov chain with infinitesimal generator
Q = [ -4  2  2 ]
    [  1 -7  6 ]
    [  3  3 -6 ]
Recall that the embedded chain has transition matrix:
p_ij = q_ij / Σ_{k≠i} q_ik,    i ≠ j,
p_ii = 0.
Thus,
P = [ 0    1/2  1/2 ]
    [ 1/7  0    6/7 ]
    [ 1/2  1/2  0   ]
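An R sketch of this step (a rough check, not the textbook's code):

Q <- matrix(c(-4, 2,  2,
               1, -7, 6,
               3,  3, -6), nrow = 3, byrow = TRUE)
P <- Q
diag(P) <- 0                               # p_ii = 0
P <- sweep(P, 1, rowSums(P), "/")          # divide row i by sum_{k != i} q_ik
round(P, 3)                                # rows (0, .5, .5), (1/7, 0, 6/7), (.5, .5, 0)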
Example 127 (Stationary Distributions for Continuous-Time Chains (9.10.7)). Calculate
the stationary distribution of the continuous-time Markov chain with infinitesimal generator
Q = [ -4  2  2 ]
    [  1 -7  6 ]
    [  3  3 -6 ]
We recall that the stationary distribution satisfies zQ = 0. We calculate this directly, and
find z ≈ (0.35, 0.26, 0.38).
Note: the following example was not in the lectures, but might be helpful for people who
haven’t seen much discussion of infinities, infinite sums, and so on.
Example 128. We have already seen infinities in this course a few times. For example, we
know that for all 0 < p < 1, Σ_{j=0}^∞ p(1 − p)^j = 1. However, they will be more important for
Markov chains, so we take a moment to look at when infinities are ‘ok’ and when they’re
trouble. There are basically three types of sums:
Σ_{j=0}^∞ 2^{−j} = 2
Σ_{j=1}^∞ j^{−1} = ∞
Σ_{j=0}^∞ (−2)^{−j} = 2/3
Σ_{j=2}^∞ (−1)^j j^{−1} = 1 − log(2)
Σ_{j=0}^∞ (−1)^j = ???
The first two are infinite sums of positive terms. These are pretty safe: either they converge
to a number or they go to infinity. If you are dealing with only sums of positive terms that
you know converge, you can treat them almost like numbers.
The next term involves an infinite sum that has some negative signs. However, if you
remove the negative signs, the remaining sum still converges. These are absolutely convergent
sums, and they are also fairly safe. However, they don’t play well with diverging sums.
The last two terms involve infinite sums with some negative signs, for which the corresponding absolute sum diverges. These sums are quite dangerous, and don’t interact well
with each other or other sums.
In this class, we are going to be fairly careful to only look at infinite sums of the first two
types.
21. Lecture 20: Start of Chapter 11: Introduction to Queues (Mar. 26)
21.1. Summary.
• A few administrative details.
• Introduction to Queues.
21.2. Lecture. Queueing theory studies the way that queues evolve over time. In the next
few classes, we will only have the time to say a little bit about queueing theory - the main
examples, a few definitions, and a few calculations - and so we won’t be analyzing particularly
complicated systems. Once beyond this class, queueing theory is often more concerned with
the optimization of queueing systems than the modelling of specific systems. That is: although
we will mostly be doing computations about the properties of specific systems, this is just a
first step towards the more interesting task of optimizing your system.
The definitions in chapter 9 of the textbook were a little bit fuzzy, and I tried to clarify
them. The definitions in chapter 11 are even fuzzier, but this extra fuzziness seems useful in
describing the subject. I will still try to make it clear when a definition is ‘precise’ and when
it is only a rough description.
First, some jargon that gets used to describe queues. We imagine an infinite sequence of
potential customers. As they arrive at the office, they form one or more queues in front of
one or more servers. Each customer eventually gets to one of the servers, and then some
amount of time passes as they are served. The random process that we study is generally the
total number of people in line at a given time. This may or may not be a Markov chain.
Next, a rough definition of the simplest possible queue. This queue describes, roughly, what
happens if a bunch of people are lining up at the sole checkout counter in a small grocery
store:
Definition 21.1 (M/M/1 Queue (rough description)). An M/M/1 queue N (t) describes
the evolution of the number of people waiting in a single line. It is assumed that the time
in-between arrivals of people has exponential distribution. It is also assumed that only the
front person in the line is being served at any time, and that the amount of time it takes for
a person to be served is also exponential.
This is a nice queueing model, but it certainly isn’t the only one. In a large grocery store,
there are often several lines, and people jump between them. This will influence the number of
people who are waiting. More complicated queues are possible, and indeed are quite common.
In the literature, there is a standard notation for describing queues as a sequence of up to
6 letters or numbers in a row, separated by slashes. I’ll give a rough version of the notation
here, and say why the first example is called M/M/1:
Definition 21.2 (Kendall’s Notation).
• A: The first element of the list describes the
inter-arrival time. The letter M in M/M/1 queue stands for Markov. This indicates
the fact that the number of customers who have arrived by time t is a Markov chain.
We already know, at least roughly, that only the exponential distribution can have this
property.
• B: the second element of the list describes the service time once somebody is at the
front of the queue. The second M in M/M/1 queue also stands for Markov, and again
indicates an exponential distribution.
• C: The number of servers. This is self-explanatory!
• X: The system capacity. This is the maximum possible size of the queue - any extra
customers who arrive are assumed to wander off (possibly to buy groceries elsewhere).
When this isn’t mentioned (as in M/M/1 queue) it is assumed to be infinity.
• Y: The size of the customer population. Again, when it isn’t mentioned it is assumed
to be infinity.
• Z: The queue scheduling discipline. This refers to the way that the line works. Most
grocery lines are first-come, first-served (FCFS). That is, you start at the back of
the line, move forward, and get served when you get to the front. Other possibilities
include last-come, first-served (so you get served from the back of the line) or round-robin (like FCFS, but you get booted back into line part-way through being served if
you are taking too long).
Even this notation obviously doesn’t cover everything (for example, it doesn’t cover the
fact that people don’t switch lines perfectly when going to a grocery store), but it is a good
start. With this overview, we’ll go back and look at some pieces of the M/M/1 queue more
carefully, with breaks to discuss other general principles as they show up.
First, the arrival process. Recall:
Definition 21.3 (Poisson Process). Let X1 , X2 , . . . be an i.i.d. sequence of exponential
random variables, and let
N(t) = max{i ≥ 0 : X1 + · · · + Xi < t}.
Then N (t) is called a Poisson process.
We know that N (t) is a counting process, a renewal process and a Markov process. It has
some other rather nice properties. First, two more definitions:
Definition 21.4 (Independent Increments). A counting process M (t) is said to have independent increments if M (b) − M (a) is independent of M (d) − M (c) whenever a ≤ b ≤ c ≤ d.
Definition 21.5 (Stationary Increments). A counting process M (t) is said to have stationary
increments if the distribution of M (b) − M (a) depends only on b − a.
We have:
• The increments of a Poisson process have Poisson distribution. That is,
P[N(a + b) − N(a) = n] = e^{−λb} (λb)^n / n!,
where λ = E[N(1)].
• Any continuous-time counting process N (t) that has independent, stationary increments must in fact be a copy of the Poisson process.
• There are two other equivalent definitions in the textbook that I skip.
The Poisson process has a number of special properties that make it especially nice in
the context of queues. The first is the superposition/decomposition property:
Theorem 129 (Superposition Property). Let N1 (t), N2 (t), . . . , Nk (t) be independent Poisson
processes. Then
N(t) = Σ_{i=1}^k Ni(t)
is also a Poisson process.
Remark 21.6. Here is an interpretation. If you have k lines at a grocery store, each of
which gets filled up like a Poisson process, you can also model the total number of people
waiting in all lines as a Poisson process.
We also have the reverse:
Theorem 130 (Decomposition Property). Let N(t) be a Poisson process with E[N(1)] = 1/λ.
Then for any k ∈ N and any λ1, . . . , λk satisfying
1/λ = Σ_{i=1}^k 1/λi,
there exist independent Poisson processes N1(t), . . . , Nk(t) with E[Ni(1)] = 1/λi and
N(t) = Σ_{i=1}^k Ni(t).
Remark 21.7. We don’t prove these two results. However, it is a nice exercise to do so
(and would make a fair exam question). The main idea in both cases is to do calculations
with generating functions to check that N (t) is Poisson with the right parameters for all t.
Example 131 (Similar to 11.1.1 of textbook). Let N(t) be a Poisson process with mean
E[N(1)] = 2. Calculate:
(1) P[N(5) = 0].
(2) E[inf{t : N(t) = 3}].
Both of these are easy if we remember our usual representation of the Poisson process:
(1) N(5) has Poisson distribution with mean E[N(5)] = 5 E[N(1)] = 10. Thus,
P[N(5) = 0] = (10^0/0!) e^{−10} ≈ 0.00005.
(2) This looks funnier, but will turn out to be almost as easy. Let X1, X2, . . . be the inter-arrival times of the Poisson process. Then inf{t : N(t) = 3} = X1 + X2 + X3. But
X1, X2, X3 are independent exponential random variables with mean 1/2, so X1 + X2 + X3
has Erlang distribution. Thus,
E[inf{t : N(t) = 3}] = E[X1 + X2 + X3] = 3/2.
Remark 21.8. You should also be able to calculate the variance Var[inf{t : N (t) = 3}]! It
isn’t any harder if you remember the formulas for Erlang random variables.
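A quick R sanity check of these numbers (and of the variance mentioned in the remark); this is just a sketch:

lambda <- 2
dpois(0, lambda * 5)                       # P[N(5) = 0], about 4.5e-05
3 / lambda                                 # E[time of the 3rd arrival] = 3/2
3 / lambda^2                               # Var[time of the 3rd arrival] = 0.75
sim <- replicate(1e5, sum(rexp(3, rate = lambda)))
mean(sim); var(sim)                        # simulation cross-check: roughly 1.5 and 0.75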
The next property is quite a bit more complicated. At first glance it might seem ‘obvious,’
but in fact it is quite special. We need some notation:
• Let an (t) be the probability that a person arriving at time t sees n people in line. Note
this is a distribution on the number of people n (e.g. N ∪ {0}), not on the time t (e.g.
R).
Unfortunately, we don’t quite have the notation to define an(t) carefully. It can be
written as
an(t) = P[ X1 + · · · + Xn = t | t ∈ {X1 + · · · + Xk : k ∈ N} ],
but the event we’re conditioning on has probability 0, and we haven’t really talked
about how to condition on general events with probability 0.
• Let pn (t) be the probability that there are n people in line at time t. This is, of course,
a Poisson distribution.
We then have:
Theorem 132 (PASTA Property). an (t) = pn (t).
Remark 21.9. The PASTA property also holds for the M/M/1 queue in full generality.
That is, if an (t) is the distribution of the length of the queue at time t, and pn (t) is the
stationary distribution of the length of the queue, then an (t) = pn (t).
Why is this special?
(1) The PASTA property doesn’t hold for any arrival process that isn’t Poisson, and
doesn’t hold for any queue that doesn’t start M/M. For example, assume that people
arrive in a line exactly once per minute and that the service time is exactly 30 seconds.
Then p0(t) = p1(t) = 1/2, but a0(t) = 1.
(2) Some very similar-sounding facts are false for the Poisson process. For example,
assume that the arrival of buses follows a Poisson process with mean inter-arrival
time of 20 minutes. The same calculation that gave us the PASTA property tells us
that the expected time until the next bus is also 20 minutes, and also that the expected
time since the last bus is 20 minutes. Thus, when you arrive at the bus station, you
expect to arrive during an inter-arrival period of length 40 minutes.
In particular, if people arrive at a rate of exactly 1 per minute, you expect to be
boarding with 40 other people, even though the average number of people boarding a
bus is only 20 people!
This can be a little surprising at first glance. If 20 people board the bus on average,
how can it be that the expected number of people boarding the bus with me is 40?
The answer is that more people arrive during the longer inter-arrival times than
the shorter ones. For the Poisson process, this means that you exactly double the
expected inter-arrival time. For other arrival processes, something more complicated
will happen in general.
22. Lecture 21: More Chapter 11: Properties and Birth-Death Chains
(Mar. 30)
22.1. Summary.
• A few administrative details.
• Studying specific queues.
22.2. Lecture. Recall, last class we saw:
• A lot of jargon related to queues (arrival process, service times, Kendall’s notation,
etc.)
• The simplest queue: M/M/1.
• Some special properties of the Poisson process.
• One special property of the M/M/1 queue.
Today, we’ll continue to do calculations related to the M/M/1 queue.
Remark 22.1. Most of the calculations we do today are about the stationary distribution of
the M/M/1 queue, or about the dynamics if the M/M/1 queue is started at its stationary
measure.
Let’s remember what this means. When you run a Markov chain {Xt }t≥0 , you have to
choose its starting point X0 . In most of our discussion, we have been choosing this starting
point deterministically - we often assume that the chain starts in ‘state 1,’ and we might
assume that queues start with no people in line. However, we are also allowed to treat X0
as a random variable with some distribution. If the initial distribution of X0 is equal to a
stationary distribution of the Markov chain, we say that we have started at stationarity, or
started at the stationary distribution, or one of several similar phrases.
It is important that you recognize this. Most of the calculations we do today assume that
the queue is started at stationarity, and will not be accurate for the queue at some finite
time t when it is started at length N (0) = 0.
Before doing calculations, the textbook takes a break at this point to draw three types of
pictures of queueing processes. I won’t draw them on the board, as they are quite intricate and
it is worthwhile to look at them for more than a few seconds. Please look at the diagrams
on pp. 397-398 of the textbook to understand what they are plotting. They aren’t necessary
to do any of the problems, but I suspect that you will find them helpful, in much the same
way that we used a diagram to understand example 131 last class.
The textbook then takes another break to describe what sorts of questions are generally
asked in queueing theory. This may be interesting, but we will skip it.
The next important result holds for queues in a great deal of generality, not just for M/M/1
queues.
22.2.1. Little’s Law. For a queue process N(t), let a(t) denote the total number of arrivals
by time t and let d(t) be the total number of departures by time t, so that N(t) = a(t) − d(t).
Define g(t) = ∫_0^t N(s) ds.
Next, define:
λt = a(t)/t
Rt = g(t)/a(t)
Lt = g(t)/t
and also
λ = lim_{t→∞} λt,  R = lim_{t→∞} Rt,  L = lim_{t→∞} Lt
when the limits exist. We then have:
Theorem 133 (Little’s Law). L = λR.
Remark 22.2. It is just a matter of algebra that Lt = λt Rt , so this shouldn’t be so surprising!
Remark 22.3. This theorem is saying that the average number of people in the queue, L, is
equal to the average arrival rate, λ, multiplied by the average amount of time it takes to get
service, R.
Remark 22.4. Again, Little’s law is about limits. Its conclusions may not be useful for
finite time intervals.
Example 134 (Similar to 11.1.5). At a given grocery store checkout, the average inter-arrival
time is 4 minutes and the average amount of time spent in line is 8 minutes. What
is the average number of people in line?
L = λR = (60/4)(8/60) = 2.
Remark 22.5. Questions about Little’s law are generally phrased as in this example.
22.2.2. M/M/1 queues as birth-and-death processes. Recall the definition of a (continuous-time) birth-and-death chain:
Definition 22.6 (birth-and-death chain). A continuous-time Markov chain is called a birth-and-death chain if:
• Its state space is the nonnegative integers {0, 1, 2, . . .}.
• The matrix qij satisfies qij = 0 whenever |i − j| > 1.
Remark 22.7. The M/M/1 queue is a continuous-time Markov chain, with off-diagonal
entries of the infinitesimal generator given by
qi,i+1 = λ
qi,i−1 = µ · 1_{i>0}
qi,j = 0,    |i − j| > 1.
Thus, it is a birth-and-death chain. Indeed, it is just about the simplest possible birth-and-death chain.
Remark 22.8. We have already (almost) seen this process before! Note that the transition
matrix of the embedded chain is:
pi,i+1 = λ/(λ + µ)
pi,i−1 = (µ/(λ + µ)) · 1_{i>0}
pi,j = 0,    |i − j| > 1 or i = j.
Writing p = λ/(λ + µ), this is exactly the random walk with reflection that we studied in chapter 9.
So, we will continue studying it now, but most of the answers we get should be no surprise.
We have essentially already calculated the stationary distribution pn. Let ρ = λ/µ. We have:
• When ρ < 1, pn = ρ^n (1 − ρ).
• When ρ > 1, all states are transient and so there is no stationary distribution. This
is obvious from our description of the process: on average, people are arriving faster
than they are processed, so the size of the queue goes to infinity.
• When ρ = 1, we know from earlier classes that all states of the embedded chain are
null-recurrent. This means that, again, there is no stationary distribution. This is
much less obvious than what happens when ρ > 1. Here, the average rate of arrival is
equal to the average service rate, and so the queue has no reason to grow or shrink.
However, the random fluctuations in the length of the queue tend to grow as time
passes.
If you are interested in the subject, this is worth simulating in R: I don’t think it
is obvious from the math that there won’t be a stationary distribution, but it should
be very easy to convince yourself of this fact by looking at the simulation.
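Since the notes suggest simulating this, here is one possible R sketch (my own): it runs the embedded jump chain of an M/M/1 queue with ρ = 1 and plots the queue length, which wanders off to large values instead of settling down.

set.seed(2)
lambda <- 1; mu <- 1                       # rho = 1
n.steps <- 1e5
x <- numeric(n.steps); x[1] <- 0
for (k in 2:n.steps) {
  up <- runif(1) < lambda / (lambda + mu)  # is the next event an arrival?
  x[k] <- if (x[k - 1] == 0) 1 else x[k - 1] + ifelse(up, 1, -1)
}
plot(x, type = "l", xlab = "jump number", ylab = "queue length")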
Remark 22.9. Notice that pn is the PMF of a geometric random variable. If you are
studying for exams, this means that the stationary distribution of an M/M/1 queue is yet
another way to state easy problems about geometric random variables. For example, the
following.
Example 135. Consider a line at a grocery store, which we model as an M/M/1 queue.
Assume that people arrive at an average rate of 15 per hour, and that the average service
time is 2 minutes. What is the probability of finding 2 or more people in the line, under the
stationary distribution of this process?
Note: we know that the stationary distribution is really just a geometric random variable.
So, the only part of this question that is new is the work required to find the parameter!
We have λ = 15 customer arrivals per hour and µ = 60/2 = 30 customers served per hour,
and so ρ = λ/µ = 1/2. Let X be the number of people in line. Then
P[X ≥ 2] = 1 − P[X = 0] − P[X = 1] = 1 − 1/2 − 1/4 = 1/4.
We now consider some ‘performance measures’ for these queues. The most obvious is the
expected number of people in the queue. The most important is often the expected response
time - that is, the expected amount of time that a customer will wait. I’ll only give a few of
these for now; we will do a more comprehensive review when looking at the M/M/c queue
next class.
We first calculate the expected queue length L = E[N]. When ρ ≥ 1, it is infinity. When
ρ < 1, we use a little trick for calculating infinite sums:
L = Σ_{n=0}^∞ n pn
  = Σ_{n=0}^∞ n (1 − ρ) ρ^n
  = (1 − ρ) ρ Σ_{n=0}^∞ n ρ^{n−1}
  = (1 − ρ) ρ (d/dρ) Σ_{n=0}^∞ ρ^n
  = (1 − ρ) ρ (d/dρ) (1/(1 − ρ))
  = ρ/(1 − ρ).
Writing this in terms of µ and λ,
E[N] = λ/(µ − λ).
This trick of taking derivatives of infinite sums doesn’t always give the right answer, but
is safe in this situation. You can learn about when this trick works in an introduction to
analysis.
The same trick gives the variance of N:
Var[N] = ρ/(1 − ρ)^2.
Note that, for ρ very close to 1, these numbers are enormous!
The average response time might seem harder to calculate. However, Little’s law makes it
easy. Recall that the average occupancy E[N] is equal to the average arrival rate λ multiplied
by the average response time E[R], so
E[R] = E[N]/λ = 1/(µ − λ).
Again, this goes to infinity as µ − λ goes to 0.
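A quick numerical illustration of these formulas (a sketch; λ = 1 and µ = 1.25 are arbitrary, giving ρ = 0.8):

lambda <- 1; mu <- 1.25; rho <- lambda / mu
c(L = rho / (1 - rho),                     # expected queue length, = 4
  VarN = rho / (1 - rho)^2,                # variance of the queue length, = 20
  ER = 1 / (mu - lambda))                  # expected response time, = 4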
The textbook calculates several other performance measures for queues, such as the average
queue length conditional on the queue length being nonzero. We skip most of them, but
highlight two definitions:
Definition 22.10. The throughput of a queue is the rate at which customers are processed.
For the M/M/1 queue, this is min(λ, µ). The utilization of a queue is the percentage of the
time that the server is being used. For the M/M/1 queue, this is min(1, λ/µ).
Remark 22.11. We highlight two statistics that seem most important. The average response
time, 1/(µ − λ), measures how annoying it is to wait in a given line. The utilization, λ/µ, measures
how efficiently the server in the queue is being used. There is obviously a big tradeoff here -
having low waiting times on average means having long idle times for the server.
This isn’t surprising, but it is nice when the probability model picks up something that we
all know from the real world.
We can do better than using Little’s law to calculate the expected response time E[R] under
the stationary measure. We can actually calculate the whole distribution. The calculation is
a little involved, so we just give the conclusion:
P[R > t] = e−(µ−λ)t .
Thus, R actually has exponential distribution with mean 1/(µ − λ).
We do a standard problem on the M/M/1 queue before moving on to general birth-and-death chains:
Example 136 (Similar to 11.2.4 of textbook). Computer help desks are being set up on
campus. It is known that every student being served by a help desk generates on average 1
query every 500 hours. It takes a worker at a help desk, on average, 12 minutes to respond
to a query. If we want the average response time to be less than 20 minutes, how many
students can each help desk serve?
The first step is recognizing what equation we want to use. We are interested in the average
response time, so we should use the formula:
E[R] = 1/(µ − λ).
We want E[R] ≤ 20 minutes = 1/3 hour, and we know that µ = 60/12 = 5 queries per hour. What is λ, and how
is it related to the number of students being served? Let n be the number of students served.
Then the number of queries per hour is λ = n/500. Putting together these pieces of information,
we need
1/3 ≥ E[R] = 1/(5 − n/500).
Solving,
5 − n/500 ≥ 3,
so
n ≤ 500(5 − 3) = 1000.
Remark 22.12. There are obviously many, many ways to ‘dress up’ this type of question.
The calculation will always be easy in the end, but you must be able to figure out what to
plug into the formula, and this is where the mistakes will be.
Remark 22.13. Notice that n depends relatively little on our rather strict requirement that
the response time should be less than 20 minutes. This is typical: waiting times are quite short
until the utilization ρ is very close to 1, and then they explode.
We now move on to general birth-and-death chains. In keeping with our notation for
the M/M/1 queue, the off-diagonal entries of the infinitesimal generators of these chains are
written
qi,i+1 = λi
qi,i−1 = µi · 1_{i>0}
qi,j = 0,    |i − j| > 1.
So, the M/M/1 queue is a special case, with λi = λ, µi = µ for all i. The mathematics of
these chains is very similar to the mathematics of the M/M/1 queue. The applications are a
little less obvious: why would the arrival and service rate depend on the number of customers
in the queue? At least one plausible reason is that people are less likely to join a queue that
is extremely long, and so λi might decrease as i increases. We will see in the next section
an example where µi changes as well.
Back to the mathematics, we look again for the stationary distribution pn. From the
definition of the stationary distribution, this must satisfy
0 = −(λn + µn) pn + λ_{n−1} p_{n−1} + µ_{n+1} p_{n+1},    n ≥ 1,
0 = −λ0 p0 + µ1 p1.
Thus,
p1 = (λ0/µ1) p0,
p_{n+1} = ((λn + µn)/µ_{n+1}) pn − (λ_{n−1}/µ_{n+1}) p_{n−1},    n ≥ 1.
How do we solve this? A few observations:
• Since we are looking for a distribution, we know that Σ_n pn = 1. Thus, we have a
‘free parameter’ - we can find pn as a function of p0, and then find p0 at the end.
This makes life much easier!
• We can get the first few values directly.
We find:
p0 = p0
p1 = (λ0/µ1) p0
p2 = ((λ1 + µ1)/µ2) p1 − (λ0/µ2) p0 = ((λ1 + µ1)λ0/(µ2µ1)) p0 − (λ0/µ2) p0 = (λ1λ0/(µ2µ1)) p0
p3 = (λ2λ1λ0/(µ3µ2µ1)) p0
p4 = (λ3λ2λ1λ0/(µ4µ3µ2µ1)) p0.
It isn’t clear how to get a formula out of these calculations, but it certainly leads to a
natural guess:
pn = p0 Π_{i=1}^n (λ_{i−1}/µ_i).
The most natural tool for going from a guess to a proof is induction. Let’s do this carefully:
Theorem 137. For all n ≥ 1,
pn = p0 Π_{i=1}^n (λ_{i−1}/µ_i).    (22.1)
Proof. We know that equation (22.1) holds for n = 1, 2, 3, 4. To use the inductive framework,
we assume that it holds for n ≤ m and show that it also holds for n = m + 1. To do this,
we write:
p_{m+1} = ((λm + µm)/µ_{m+1}) pm − (λ_{m−1}/µ_{m+1}) p_{m−1}
        = p0 ((λm + µm)/µ_{m+1}) Π_{i=1}^m (λ_{i−1}/µ_i) − p0 (λ_{m−1}/µ_{m+1}) Π_{i=1}^{m−1} (λ_{i−1}/µ_i)
        = p0 (Π_{i=1}^{m−1} (λ_{i−1}/µ_i)) ((λm + µm)λ_{m−1}/(µ_{m+1}µm) − λ_{m−1}/µ_{m+1})
        = p0 Π_{i=1}^{m+1} (λ_{i−1}/µ_i).
We used the assumption that equation (22.1) holds for n ≤ m in the second line of this
calculation; the rest is just algebra.
Following the definition of the textbook, let ρi = λ_{i−1}/µ_i and let ζn = Π_{i=1}^n ρi. We then have:
• If Σ_n ζn = ∞, there is no stationary distribution.
• If Σ_n ζn = ζ < ∞, there is a stationary distribution. It is given by
pn = ζn/ζ.
Note that this agrees with our earlier result for the M/M/1 queue. It also gives us the
following ‘reasonable’ way to define queues that remain finite:
Theorem 138. Consider a birth-and-death chain that satisfies:
• µi = µ for all i.
• λi+1 ≤ λi for all i.
Then the associated chain has a stationary distribution if and only if limi→∞ λi < µ.
Remark 22.14. The limit in the above theorem always exists.
The textbook gives a finer analysis of the limits of these birth-and-death chains. For example, it distinguishes between chains with transient states and chains with null-recurrent
states.
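Before moving on, here is a small R sketch of the recipe above for a truncated birth-and-death chain (the rates λi = 1/(i + 1) and µi = 1 are arbitrary illustrative choices):

n.max  <- 50
lambda <- function(i) 1 / (i + 1)          # illustrative, decreasing arrival rates
mu     <- function(i) 1                    # illustrative service rates
rho    <- sapply(1:n.max, function(i) lambda(i - 1) / mu(i))
zeta   <- c(1, cumprod(rho))               # zeta_0 = 1, zeta_n = prod of the rho_i
p      <- zeta / sum(zeta)                 # approximate stationary probabilities
round(p[1:5], 4)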
23. Lecture 22: More Chapter 11: Multiserver Systems (Apr. 2)
23.1. Summary.
• A few administrative details.
• Studying more queues.
• Note: this is Easter weekend.
23.2. Lecture. Today, we go from the M/M/1 queue to the M/M/c queue. Recall from
Kendall’s notation that this means:
• We have an infinite pool of potential customers, arriving in a single line. The interarrival times are a series of iid exponential random variables.
• We have a collection of c servers. Once service has started, the service time is an
exponential random variable.
Remark 23.1. There is some asymmetry between the number of lines and the number of
servers. Imagine that we were to make up a type of queue with c lines as well as c servers,
and assume that the arrival process for each line is a Poisson process. By the superposition
property, this is equivalent to a queueing process in which all customers arrive in a single
line, as long as you assume that customers can move from a non-empty line to an empty
line when they want to. This last assumption is generally true.
On the other hand, an M/M/c queue is not equivalent to any M/M/1 queue. You can
check this with the mathematics, but there is a very simple ‘intuitive’ reason for this that we
understand from grocery stores: one customer can hop between lines before being served, but
one customer cannot be served by more than one server. To be a little more concrete, assume
that customers arrive very rarely (say once a month) and the service time is, on average,
two minutes. This service time cannot be decreased by adding more servers - it will always
be two minutes.
An M/M/c queue is still a birth-and-death process. Assume that customers arrive at rate
λ and that each server processes customers at rate µ. Using the notation for general birth-and-death processes from last class, the M/M/c queue has parameters:
λn = λ,    for all n,
µi = iµ,    1 ≤ i ≤ c,
µi = cµ,    i > c.
Using the formulas for the stationary distribution of a general birth-and-death process, any
stationary distribution for an M/M/c queue must satisfy:
pn = p0 (λ^n/µ^n) (1/n!),    1 ≤ n ≤ c,
pn = p0 (λ^n/µ^n) (c^{c−n}/c!),    n > c.
This is rather complicated. The formula for the solution to this stationary distribution is
given in the textbook on pp. 420 and 421. I don’t write it here, as I don’t think it is more
informative than the above. I do point out that:
• If λ < cµ, a stationary distribution exists.
• If λ ≥ cµ, no stationary distribution exists.
We now move on to performance measures. Unfortunately, the performance measures we
calculated for the M/M/1 queue are quite difficult to calculate directly for the M/M/c queue! To calculate them, we
introduce several related quantities. Since we’ve seen so many performance measures, we’ll
also review the old ones:
• N : the number of customers in the system at steady state. Also define L = E[N ].
• Nq : the number of customers waiting in the queue at steady state. For M/M/c, we
have |N − Nq | ≤ c. Also define Lq = E[Nq ].
• R: the response time. From Little’s law, E[R] = E[N]/λ. That is, the response time
is the average number of people in the queue divided by the arrival rate. We write
W = E[R].
• Wq: the average waiting time. Rewriting Little’s law, Lq = λWq.
• L′q: the effective queue length, defined by L′q = E[Nq | Nq ≠ 0].
Some of these are more interesting than others. Regardless of their practical use, it is
very nice to know all of these definitions and relationships when doing computations. In
particular, it will be easier to calculate Lq than L for the M/M/c queue, so we start there:
Lq = Σ_{n=c}^∞ (n − c) pn = Σ_{n=c}^∞ (n/(c^{n−c} c!)) (cρ)^n p0 − Σ_{n=c}^∞ (c/(c^{n−c} c!)) (cρ)^n p0.
We can see why this is easier: pn has two formulas, one for n ≥ c and one for n ≤ c. Both
formulas show up when calculating L, but only one is needed when calculating Lq. Carrying
through this calculation, we find
Lq = ((ρc)^{c+1}/(c!(1 − ρ)^2 c)) p0.
The textbook points out that we already have some simple formulas that allow us to find
all of the other simple performance measures once we know Lq . It is worth emphasizing this:
we only had to do the ‘hard work’ of actually computing an expectation once. The relations
are:
• Lq = λWq (this is essentially Little’s law), so we can find Wq .
• W = Wq + µ1 . This observation doesn’t have a name, and is very simple but quite
useful. The observation is that the mean waiting time (W ) is equal to the mean about
of time that you wait before meeting a server (Wq ) plus the mean amount of time that
you wait after meeting a server ( µ1 ). This relationship more-or-less holds for other
queuing models as well, and is helpful to remember. This lets us find W .
• L = λW (this is Little’s law), so we can find L.
The formulas you get out of this are:
Wq = ((λ/µ)^c µ / ((c − 1)!(cµ − λ)^2)) p0
W  = ((λ/µ)^c µ / ((c − 1)!(cµ − λ)^2)) p0 + 1/µ
L  = ((λ/µ)^c λµ / ((c − 1)!(cµ − λ)^2)) p0 + λ/µ.
We now do some useful calculations:
Example 139 (Erlang-C formula). We model a supermarket as an M/M/c queue, with
λ = 8, µ = 3 and c = 5. What is the probability that there is an empty check-out when you
arrive? Assume that the queue is started at the stationary distribution.
We could do this while plugging in the numbers, but it will be easier to do the algebra and
then plug in at the end. Write A for the event that the check-out is empty when you arrive.
We have
P[A] = p0 + p1 + · · · + p_{c−1}
     = 1 − pc − p_{c+1} − · · ·
     = 1 − p0 Σ_{n=c}^∞ c^c ρ^n / c!
     = 1 − p0 (cρ)^c / ((1 − ρ) c!).
This is called the Erlang-C formula. Plugging in, we have in this case
P[A] ≈ 0.838.
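A short R sketch of this plug-in step (computing p0 for the M/M/c queue and then P[A]):

lambda <- 8; mu <- 3; c.serv <- 5
rho <- lambda / (c.serv * mu)
a   <- lambda / mu                         # equals c * rho
p0  <- 1 / (sum(a^(0:(c.serv - 1)) / factorial(0:(c.serv - 1))) +
            a^c.serv / (factorial(c.serv) * (1 - rho)))
1 - p0 * a^c.serv / ((1 - rho) * factorial(c.serv))   # P[A], about 0.838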
Next we try to understand: does being allowed to switch lines make a difference to the
efficiency of a queue?
Example 140 ((A tiny bit of) the Power of 2 Choices - Similar to example 11.11 of textbook).
Fix k > 1. We consider the following two collections of queues:
(1) There are k independent M/M/1 queues. Each has arrival rate λ/k and service rate µ.
(2) There is 1 M/M/k queue with arrival rate λ and service rate µ.
Note that, in both situations, customers arrive at total rate λ, and there are k servers with
rate µ. The only difference is that, in the former situation, there are k separate lines, while
in the latter, customers are allowed to switch lines at any time.
Let N1, N2 denote the number of customers in these two queues at stationarity. Similarly
let R1, R2 be the average response times. Let ρ = λ/(kµ). Then
E[N1] = kρ/(1 − ρ)
E[R1] = (1/λ) E[N1] = k/(kµ − λ)
E[N2] = kρ + ((kρ)^k (λ/µ) / ((k − 1)!(k − kρ)^2)) p0
E[R2] = (1/λ) E[N2].
When k = 2, we have
E[N2] = 2ρ / ((1 − ρ)(1 + ρ))
E[R2] = (1/µ) (1/(1 − ρ^2)).
What difference does this make? We have
E[N2] = (1/(1 + ρ)) E[N1]
E[R2] = (1/(1 + ρ)) E[R1].
So, being able to swap lines always makes queues more efficient - sometimes almost twice
as efficient.
Example 141 (Similar to 11.4.3 of textbook). Consider a help desk that receives, on average,
12 calls per hour. It takes a help desk employee 10 minutes to respond to a call, on average.
How many employees are needed if the average response time must be less than 2 hours? How
many extra employees are needed if the average response time must be less than 12 minutes?
We wish to use the formula
W = ((λ/µ)^c µ / ((c − 1)!(cµ − λ)^2)) p0 + 1/µ
and solve the inequality W ≤ 2 for c. This is going to be fairly terrible to do by hand - this
is a complicated formula. Instead, we plot in R, using:
RespTime <- function(c, mu, lambda) {
  rho  <- lambda / (c * mu)
  # 1/p0 for the M/M/c queue:
  pinv <- ((c * rho)^c) / ((1 - rho) * factorial(c))
  for (n in 0:(c - 1)) {
    pinv <- pinv + ((c * rho)^n) / factorial(n)
  }
  # W = (lambda/mu)^c * mu / ((c-1)! (c*mu - lambda)^2) * p0 + 1/mu
  W <- (1 / pinv) * (mu * (lambda / mu)^c) /
    (factorial(c - 1) * (c * mu - lambda)^2) + 1 / mu
  return(W)
}
# Compute W for c = 3, ..., 10 servers, with mu = 6 and lambda = 12 per hour.
res <- 3:10
for (i in 3:10) {
  res[i - 2] <- RespTime(i, 6, 12)
}
plot(res)
which(res < 2)       # entries with W below 2 hours (entry k corresponds to c = k + 2)
which(res < 1/5)     # entries with W below 12 minutes (1/5 of an hour)
We find that 3 servers are required to get a ‘finite’ expected service time, 5 are required to
get a response time below 2 hours, and 6 are enough to get a service time below 12 minutes.
NOTE: On an exam, we would not make you calculate W for so many values of c. We
would ask a simpler question; for example, we might ask if 25 servers would be enough for a
desired response time. We might also pre-compute some of the values, such as p0, since the
calculations are fairly messy.
Having studied the M/M/1 queue and the M/M/c queue, the obvious next step is the
M/M/∞ queue. That is, we consider a line as the number of servers goes to infinity. This
queue can be understood quite well without using the Markov chain formalism that we have
developed so far. For example:
• Since there are infinitely many servers, nobody ever waits in line.
• Since nobody ever waits in line, the mean time spent per customer must be 1/µ.
• From these two observations and Little’s law, the mean number of customers present
in the queue must be λ/µ.
However, the Markov chain formalism can still be somewhat helpful. An M/M/∞ queue
is a birth-and-death chain with
λn = λ,    n ∈ N,
µn = nµ,    n ∈ N.
When is there a stationary distribution? Recall that, for M/M/c queues, we had a stationary distribution whenever λ < cµ. By analogy, we expect a stationary distribution to
exist whenever λ < ∞·µ - that is, we expect there to always be a stationary distribution. This
turns out to be correct. A stationary distribution must satisfy:
pn = (λ^n/(µ^n n!)) p0.
We have
Σ_{n=0}^∞ λ^n/(µ^n n!) = e^{λ/µ} < ∞
for any choice of µ, λ. Thus, a stationary distribution always exists, and we have
pn = (λ^n/(µ^n n!)) e^{−λ/µ}.
We note that this is the PMF of a Poisson distribution with parameter λ/µ. This last fact
was probably not very obvious without using the Markov chain formalism.
Example 142 (Similar to 11.4.4 of textbook). At a certain restaurant, customers arrive at
an average rate of 5 per hour. They stay, on average, for 30 minutes. The restaurant has
5 tables. Calculate the average number of customers present at any time using the M/M/5
model. Then calculate the same number, using the M/M/∞ approximation to the M/M/5
model.
For the M/M/5 model, we use the equation:
L = E[N] = ((λ/µ)^c λµ / ((c − 1)!(cµ − λ)^2)) p0 + λ/µ
         = ((5/2)^5 · 10 / (4!(10 − 5)^2)) p0 + 5/2
         ≈ 2.63.
For the M/M/∞ model, we calculate directly that
L = λ/µ = 2.5.
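A quick R check of these two numbers (a sketch, reusing the M/M/c formulas from earlier with c = 5):

lambda <- 5; mu <- 2; c.serv <- 5
rho <- lambda / (c.serv * mu); a <- lambda / mu
p0  <- 1 / (sum(a^(0:(c.serv - 1)) / factorial(0:(c.serv - 1))) +
            a^c.serv / (factorial(c.serv) * (1 - rho)))
a^c.serv * lambda * mu / (factorial(c.serv - 1) * (c.serv * mu - lambda)^2) * p0 + a
# about 2.63 for M/M/5, versus lambda/mu = 2.5 for the M/M/infinity model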
Remark 23.2. The M/M/∞ approximation is often a good approximation to the M/M/c
model when c ≫ λ/µ. Often c = 3λ/µ is good enough; as you can see in this example, even
c = 2λ/µ is acceptable.
24. Lecture 23: End of Chapter 11: More Special Systems (Apr. 9)
24.1. Summary.
• A few administrative details.
• Last bits of queueing theory.
24.2. Lecture. We finally add a fourth letter to our queueing notation: we look at the
M/M/1/K queue. Recall that the fourth letter indicates the maximum possible length of a
line, and that the M/M/1 queue could also be written as the M/M/1/∞ queue.
In the notation of birth-and-death chains, the M/M/1/K queue has state space {0, 1, 2, . . . , K}
rather than {0, 1, 2, . . .} and has transition rates
λi = λ,    i ∈ {0, 1, . . . , K − 1},
λK = 0,
µi = µ,    i ∈ {1, 2, . . . , K},
µ0 = 0.
We now look at the stationary distribution. We recall that the M/M/1 queue had a stationary distribution if and only if ρ = λ/µ < 1. We can tell immediately that the M/M/1/K
Recall from chapter 9 that a finite, irreducible continuous-time Markov chain always has a
unique stationary distribution. The M/M/1/K is such a Markov chain, and so must have a
unique stationary distribution. The only remaining task is to calculate it. We use the general
formula for the stationary distribution of a birth-and-death chain to write
pn = (λ^n/µ^n) p0,    n ∈ {0, 1, . . . , K}.
When ρ ≠ 1, we have:
p0^{−1} = Σ_{n=0}^K ρ^n = (1 − ρ^{K+1})/(1 − ρ).
When ρ = 1, we have
p0^{−1} = Σ_{n=0}^K ρ^n = K + 1.
Thus, we find for n ∈ {0, 1, . . . , K} that
pn = 1/(K + 1),    ρ = 1,
pn = ρ^n (1 − ρ)/(1 − ρ^{K+1}),    ρ ≠ 1.
Remark 24.1. We have p_{n+1}/p_n = ρ, regardless of the value of ρ. Let X be the number of
customers in a steady-state M/M/1/K queue with parameter ρ, and let Y be the number of
customers in a steady-state M/M/1/K queue with parameter ρ^{−1}. This implies that
P[X = n] = P[Y = K − n].
So, there is some symmetry between queues with ρ > 1 and queues with ρ < 1.
Remark 24.2. We have already seen that the M/M/∞ queue can be a good approximation
to the M/M/c queue when c is very large. This makes some sense: if it is rare for more than
3 check-out lines to be in use simultaneously, it hardly makes a difference if 50, or 500, or
infinitely many are available. Indeed, M/M/∞ queues generally have about λ/µ servers in use,
and the approximation became good when c ≫ λ/µ.
The same reasoning suggests that the M/M/1 = M/M/1/∞ queue could be a good approximation to the M/M/1/K queue when K is large: if there are rarely more than 10 people
in line, it hardly makes a difference if the store has the capacity for a line of length 200, or
2000, or infinitely long. This idea suggests that the approximation should be good when K is
much larger than the typical line length.
This is basically correct. The one warning is that the M/M/1/K queue always has a
stationary distribution for any value of ρ, but the M/M/1 queue requires ρ < 1. Thus, we
can only expect the approximation to be good when ρ < 1.
Most performance measures are similar for the M/M/1/K queue as they were for the
M/M/1 queue. The expected system size is
L = Σ_{n=0}^K n pn.
When ρ ≠ 1, this is
L = ρ(1 − (K + 1)ρ^K + Kρ^{K+1}) / ((1 − ρ)(1 − ρ^{K+1})).
When ρ = 1, this is
L = Σ_{n=0}^K n/(K + 1) = K/2.
To find the average system response time W , we would like to use Little’s law. However,
Little’s law must be modified for M/M/1/K queues. Rather than the nominal rate of
customer arrivals λ, we must use:
Definition 24.3 (Effective rate of customer arrivals). The effective rate of customer arrivals
for an M/M/1/K queue is
λ′ = λ(1 − pK).
Remark 24.4. Recall that pK is the probability that the queue is full, under the stationary
distribution. Thus, λ′ is the rate at which customers arrive and actually enter the queue.
Thus, Little’s law for the M/M/1/K queue says:
L = λ′W.
Remark 24.5. For the M/M/1 queue, we effectively have “K = ∞,” and so λ′ = λ. Thus,
this is a generalization of our previous version of Little’s law. You only need to memorize
this one!
Similarly, we must modify our definitions for throughput and utilization.
Definition 24.6 (Throughput). Recall that the throughput of a system is the rate at which
customers are served. We define the throughput to be
X = λ(1 − pK),
which is exactly λ′.
Definition 24.7 (Utilization). The utilization of a server is the percentage of the time that
the server is being used. This is given by
U = 1 − p0 = (λ/µ)(1 − pK).
Remark 24.8. For the M/M/1 queue, we defined
X = λ,
U = λ/µ.
Thus, these definitions are again generalizations of the old ones. You don’t have to remember
the old ones separately.
Example 143 (Similar to 11.5.1). A barbershop has a single barber and 5 chairs. Assume it
takes on average 15 minutes to give a haircut, that 3 people arrive per hour, and that people
will leave the shop if they arrive when all chairs are full.
• What percentage of arriving customers will eventually get a haircut?
• What is the effective arrival rate?
• What percentage of the time is the barber occupied?
• What is the average response time?
We begin this type of question by doing some initial calculations: λ = 3, µ = 4 and
ρ = 0.75. Now for the specific questions:
• Let pn be the stationary distribution of the queue. We are interested in
1 − p5 = 1 − (1 − ρ)ρ^5/(1 − ρ^6) = 1 − 0.25(0.75)^5/(1 − (0.75)^6) ≈ 0.928.
• The effective arrival rate is given by
λ′ = λ(1 − p5) ≈ 2.78.
• The occupation rate is
1 − p0 = 1 − (1 − ρ)/(1 − ρ^6) = 1 − 0.25/(1 − (0.75)^6) ≈ 0.696.
• Using Little’s law, the average response time is
W = (1/λ′) L = (1/2.78) · (0.75(1 − 6(0.75^5) + 5(0.75^6))/((0.25)(1 − 0.75^6)))
≈ 0.612.
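These plug-in steps are easy to script; here is a rough R sketch for the barbershop numbers:

lambda <- 3; mu <- 4; K <- 5; rho <- lambda / mu
p  <- rho^(0:K); p <- p / sum(p)           # stationary distribution p_0, ..., p_K
1 - p[K + 1]                               # fraction of arrivals served, about 0.928
lp <- lambda * (1 - p[K + 1]); lp          # effective arrival rate, about 2.78
1 - p[1]                                   # barber occupied, about 0.696
L  <- sum((0:K) * p)
L / lp                                     # average response time W, about 0.612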
24.2.1. M/M/c/K Queues. We bring together our two generalizations of M/M/1 queues, and
study M/M/c/K queues. We will assume that 1 ≤ c ≤ K. These queues are birth-and-death
chains, with rates on {0, 1, . . . , K} given by
λn = λ,    0 ≤ n ≤ K − 1,
λn = 0,    n ≥ K,
and
µ0 = 0,
µn = nµ,    1 ≤ n ≤ c,
µn = cµ,    c ≤ n ≤ K.
As with the M/M/1/K queue, the M/M/c/K queue always has a stationary distribution.
Solving the equations as before, we have
pn = (1/n!) (λ^n/µ^n) p0,    0 ≤ n ≤ c,
pn = (1/(c^{n−c} c!)) (λ^n/µ^n) p0,    c ≤ n ≤ K,
where again p0 has a rather complicated formula that is in the textbook.
As with the M/M/c queue, it is difficult to compute the average length of the queue L =
E[N] directly. Instead, one begins by computing Lq, then using the (modified!) version of
Little’s formula to compute Wq, then recalling W = Wq + 1/µ to calculate W, and finally using
the (modified!) version of Little’s formula to compute L.
Example 144 (Similar to 11.6.2 from textbook). We consider the same barbershop. This
time, we assume that there are 2 barbers. Assume it takes on average 15 minutes to give a
haircut, that 3 people arrive per hour, and that people will leave the shop if they arrive when
all chairs are full.
• What percentage of arriving customers will eventually get a haircut?
• What is the effective arrival rate?
• What percentage of the time is at least one barber occupied?
We have again λ = 3, µ = 4. We also calculate
p−1
0
c−1
K
X
1 λn X 1 λn
=
+
n! µn n=c cn−c c! µn
n=0
≈ 2.19.
• This is
λn
p0
cn−c c! µn
≈ 0.993.
1
1 − p5 = 1 −
So, as expected, more people get served if you have more servers.
• The effective arrival rate is
λ′ = λ(1 − p5) ≈ 2.98.
• The occupation rate is
1 − p0 ≈ 0.54.
So, much lower than when there was only one server.
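A quick standalone numerical check of these numbers (a sketch only, using the formulas above with λ = 3, µ = 4, c = 2, K = 5):

from math import factorial

lam, mu, c, K = 3, 4, 2, 5
r = lam / mu
w = [r**n / factorial(n) if n <= c else r**n / (c**(n - c) * factorial(c))
     for n in range(K + 1)]
print(sum(w))               # 1/p0, approximately 2.19
p0 = 1 / sum(w)
p5 = p0 * w[K]
print(1 - p5)               # fraction of arrivals served, approximately 0.993
print(lam * (1 - p5))       # effective arrival rate, approximately 2.98
print(1 - p0)               # P(at least one barber is busy), approximately 0.54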
24.2.2. M/M/c//M queues. We add a 5th letter to our notation, studying what the textbook calls an M/M/c//M queue. This is shorthand for the M/M/c/∞/M queue. It has the
potential to be slightly confusing notation: the last M is a number, the total population of
customers that can ever show up, but the first two M’s both stand for Markov. I’ll try to use
M/M/c/∞/N instead. Note that this is equivalent to an M/M/c/N/N queue - if there are
only N potential customers, there are certainly not more than N in the queue at any time.
The prototypical example of an M/M/c/∞/N queue is a collection of N machines being
used, which sometimes break, and a collection of c teams who repair the machines.
Modelling this queue as a birth-and-death process, the rates are given by
λn = λ(N − n), 0 ≤ n ≤ N,
and
µ0 = 0,
µn = nµ, 1 ≤ n ≤ c,
µn = cµ, c ≤ n ≤ N.
Solving the steady-state equations, we find
pn = (N!/(n!(N − n)!))(λ^n/µ^n) p0, 0 ≤ n ≤ c,
pn = (N!/((N − n)! c^{n−c} c!))(λ^n/µ^n) p0, c ≤ n ≤ N,
where again the formula for p0 is fairly complicated and is in the textbook.
The textbook gives some other, terrible, formulas for measures of efficiency. These are not
worth memorizing. We do point out that the effective arrival rate has a nice formula. The
arrival rate is λ(N − n) when n people are currently in the queue, and so
λ′ = Σ_{n=0}^{N} λ(N − n) pn.
But this is
λ′ = λN Σ_{n=0}^{N} pn − λ Σ_{n=0}^{N} n pn = λN − λL.
We can use this to calculate W by Little’s Law.
Example 145. A certain office has 80 computers. It is known that, on average, 1 computer
will break per day, and that each computer breaks down at a rate of once per 50 days. What
is the average response time?
We have λ′ = 1, λ = 1/50, N = 80. Thus, by the formula immediately before this example,
L = N − λ′/λ = 80 − 50 = 30.
By Little’s law,
W = L/λ′ = 30/1 = 30.
Note: It is clear that λ = 1/50, since λ measures the rate at which each individual computer
breaks. How do we find that λ′ = 1?
We claim that the number of computers that will break per day is in fact the effective
arrival rate! To see this, note that the effective arrival rate is always equal to the rate at
which customers are added to the queue. Since all arriving customers are added to the queue
in the M/M/c/∞/N model, the effective arrival rate is equal to the average arrival rate.
In particular, even though the formula for effective arrival rate is different from the case of
the M/M/1 queue, the effective arrival rate still describes the number of customers arriving
per unit time. This is different from the case of the M/M/1/K queue, where the effective
arrival rate was equal to the ‘nominal’ arrival rate λ multiplied by the probability that an
arriving customer is actually added to the queue (i.e. the probability that the queue is not
full).
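To see the identity λ′ = λ(N − L) numerically, here is a small Python sketch for a machine-repair system. Example 145 does not specify c or µ, so the values below are invented purely for illustration and will not reproduce the numbers above.

from math import factorial

def machine_repair_dist(lam, mu, c, N):
    """Stationary distribution of the M/M/c/inf/N (machine-repair) queue."""
    r = lam / mu
    w = [factorial(N) / (factorial(n) * factorial(N - n)) * r**n if n <= c
         else factorial(N) / (factorial(N - n) * c**(n - c) * factorial(c)) * r**n
         for n in range(N + 1)]
    p0 = 1 / sum(w)
    return [p0 * wn for wn in w]

lam, mu, c, N = 1 / 50, 1.0, 2, 80      # lam and N as in Example 145; c and mu are made up
p = machine_repair_dist(lam, mu, c, N)
L = sum(n * p[n] for n in range(N + 1))
lam_eff = sum(lam * (N - n) * p[n] for n in range(N + 1))   # definition of lambda'
print(lam_eff, lam * (N - L))           # the two expressions agree
print(L / lam_eff)                      # average response time W, by Little's law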
25. Lecture 24: Final Exam Review (Apr. 13)
As with the midterm review, we go backwards, from chapter 11 to chapter 1 (skipping
chapter 10). This will have to be a little faster, unfortunately: we’ll only have time to
sketch types of solutions and large classes of questions. You should also review the notes
for the midterm review! I emphasize:
As with the midterm review, this review will focus more on ‘tricky’ questions
than the actual exam. This is because I expect you to do the ‘less-tricky’
practice problems from the textbook, and to check your answers.
25.1. Exam Basics.
• The exam will be on April 20, 2015 from 7 PM to 10 PM. The exam will take place
in DMS 1130.
• Simple calculators will be allowed. The calculators allowed during final examinations
of the Faculty of Science are: the Texas TI-30X, TI-30XA, TI-30SLR scientific and
non-programmable. Note that I don’t set this policy - I am just warning you to be
careful! More information is available at
http://www.uottawa.ca/cgi-bin/cgiwrap/regist/print.cgi?E/academic/info/regist/crs/
• The exam will be closed-book. There will be a cheat sheet attached to the exam, just
like there was a cheat sheet attached to the midterm. This cheat sheet is available on
my website. There may be lengthy formulas on the exam that are not on the cheat
sheet, if necessary (for example, we might have the formula for the probability p0 that
an M/M/c/K queue is empty).
• The exam will cover what we have discussed in the classroom, as well as things that I
have suggested you read from the textbook and the material in the homework. It will
not cover any programming material.
• In the textbook, this corresponds roughly to chapters 1-9 and 11. The notable differences are:
– Chapters 1,3,4,5: I will probably ask at least one question that is ‘trickier’ than
those in the textbook. The midterm review has examples of ‘tricky’ questions.
– Chapter 2: Not emphasized.
– Chapter 7: There will be no questions about the Gamma distribution, I will not
refer to Coxian distributions by that name, and there will be no questions about
‘fitting’ phase-type distributions to data.
– Chapter 8: There may be questions about the normal or exponential approximation to the binomial, even though it isn’t in the textbook. There will not be
questions on the difference between stronger and weaker forms of convergence.
– Chapter 9: There will be no questions about reversibility.
– Chapter 11: There will be no questions requiring you to use a graphical representation of a queue (though it may be helpful!). There will be no questions about
Kendall’s notation itself, though you must recognize the types of queues that we
discussed in class (i.e. you must know what an M/M/c/K queue is). There will
be no questions about what the textbook calls ‘transient behaviour’ - e.g. section
11.2.3 for the M/M/1 queue, and analogous sections for the other queues. There
will be no questions about state-dependent queues.
– Overall: There will be no questions about programming.
• The exam will have 4 long-answer questions and 10 multiple-choice questions. The
questions will be similar in spirit and difficulty to the midterm. The multiple-choice
questions will be fairly uniform over the course. The long-answer questions will be
mostly (though not exclusively) on material covered since the midterm.
25.2. Source of Review Material. You should study from:
• The examples and questions in the textbook.
• The questions from previous exams/midterms in this course.
• The review sections of the lecture notes (there is a review from the midterm, a mini-review of chapter 9, and this review for the final exam).
• The homework problems and solutions.
Now, types of problems:
25.3. Chapter 11. There are quite a few worked examples in the lecture notes for chapter
11, and I expect you to do well on the exam if you can do all of those questions and understand
the remarks. That being said, here are the types of questions I see:
• Detailed questions about the Poisson distribution. We spent a good deal of time
discussing the Poisson distribution, and I can ask:
– Straightforward computations (e.g. calculate P[N(4) − N(2) = 5]; a small numerical sketch appears after this list).
– Use of special properties (e.g. any continuous-time counting process N (t) that
has independent, stationary increments must in fact be a copy of the Poisson
process; the superposition property).
– Relationship between Poisson and Exponential distributions (e.g. inf{t : N (t) =
5} has distribution X1 + . . . + X5 , where Xi are iid exponential).
All of these ‘special properties’ questions are really 2-part questions. The first part
is recognizing that there is a special property to use. The second part is a ‘standard’ question from the first half of the course. For example, if you recognize that a
continuous-time counting process with independent, stationary increments is a Poisson process in the first part of the question, you will then have to do some calculation
with that Poisson process in the second half.
• Many straightforward ‘plug-in’ questions are possible. This includes the renewal equation, Little’s law, calculations involving the stationary distribution of a queue, and
performance statistics such as expected response time and utilization. Most of the
work in these questions is understanding what parameters go where in which equations and remembering the definitions - the ‘actual math’ is generally pretty easy.
• We know some facts about birth-and-death chains in general. You should be familiar
with these facts and formulas.
• There may be less-straightforward calculations involving Little’s law L = λW and
other important performance statistics. See, for example, the convoluted calculation
of L and W for the M/M/c queue.
• There may be ‘tricky’ questions that revolve around the existence of a stationary
distribution. Any such question can be answered using our general condition for the
existence of a stationary distribution for birth-and-death chains, but most of the time
your intuition should be a very good guide.
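Here is the sort of quick numerical sketch I have in mind for the straightforward Poisson computations mentioned above; the rate λ = 2 is an arbitrary assumption for illustration.

from math import exp, factorial

lam = 2.0      # assumed rate of the Poisson process (illustrative only)
t = 4 - 2      # by stationary increments, N(4) - N(2) is Poisson with mean lam * (4 - 2)
k = 5
print(exp(-lam * t) * (lam * t) ** k / factorial(k))   # P[N(4) - N(2) = 5]

# Poisson-exponential relationship: inf{t : N(t) = 5} = X1 + ... + X5 with Xi iid Exp(lam),
# so its mean is 5 / lam.
print(5 / lam)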
There won’t be questions directly about definitions (e.g. Kendall’s notation, what a counting process is, etc), though I expect you to more-or-less understand them and to be able to
use them.
See the examples in class notes and the textbook for typical non-trick questions.
25.4. Chapter 9. This chapter was very long! There are some worked examples in the
lecture notes, but fewer than chapter 11. Here are some of the main question types, in order
roughly corresponding to the lecture notes:
• Initial explicit calculations. For example, I give you a transition matrix P and say
X1 = 1; I then ask you for P[X3 = 3]. The ‘standard’ answer is to write something
like:
P[X3 = 3|X1 = 1] = Σ_k P[X3 = 3|X2 = k] P[X2 = k|X1 = 1]
and then to grind out the sum. You should also be able to calculate P[X4 = 3] by
doing a longer, but very similar, calculation!
• Special properties: we already discussed the fact that the Poisson distribution has
some special properties, and that these properties let you do some calculations that
look difficult at first glance. The same is true for Markov chains. For example, we
saw that sojourn times have geometric or exponential distribution, we had some nice
properties for the embedded chain of a Markov chain, and so on. As with the ‘special
property’ questions for the Poisson distribution, these questions often effectively have
two parts: recognizing the property, then doing the calculation. These questions can
be a little tricky.
• Explicit calculations of transition probabilities using special formulas. For example,
we saw the Chapman-Kolmogorov equations, we used linear algebra to compute things
like P[X200 = 2|X1 = 1], and so on. These questions are generally identical to the
first type of question (where we were trying to calculate things like P[X3 = 2|X1 = 1]),
but one needs to do a slightly more involved calculation. I promise that any question
of this sort will have minimal computational burden (there will be no diagonalizing of
10 by 10 matrices).
As discussed in earlier classes, there is some room for trickiness here. Here is a
question that is trickier than any question on the exam, but might have a similar
flavour to one of them. Define

P =
0.1 0.4 0.3 0.2  0   0   0
0.5 0.3 0.0 0.2  0   0   0
0.2 0.2 0.4 0.2  0   0   0
0    0    0    0    1   0   0
0    0    0    0    0   1   0
0    0    0    0    0   0   1
0    0    0    0    0   0   1
and let Xt be a Markov chain evolving according to the transition matrix P . What
is P[X200 = 7|X1 = 1]?
At first glance, this question looks terrible! Answering this the way we have answered questions in class involves finding all of the eigenvalues for a 7 by 7 matrix.
So we should do something clever. Notice that the chain has two pieces: there is a
mess involving states {1, 2, 3}, and then a line going from state 4 to 5 to 6 to 7. Let
τ = inf{t : Xt = 4}. Draw picture on board! By this observation,
P[X200 = 7|X1 = 1] = P[τ ≤ 197].
But then notice that the transition probability from any of the states {1, 2, 3} to state 4 is always
0.2. Thus, τ is a geometric random variable with mean 1/0.2 = 5, and so we can
calculate P[τ ≤ 197] explicitly! In this case, the number is essentially 1: the chain fails to reach
state 4 within its first 196 transitions with probability 0.8^196, which is astronomically small.
(A short numerical check of this example appears at the end of this list.)
There was a very similar (though slightly easier) phenomenon in example 123 from
the lecture notes. To prepare for this type of question, you should understand that
example as well, and how it relates to this example.
• Classification of states: we have a large number of words to describe states of Markov
chains, and the Markov chain as a whole. You must memorize them all! As I suggested in the short review of chapter 9, it is often easier to remember the classifications if you remember ‘prototypical examples.’ I am guaranteeing that there will be a
classification question on the exam.
• First important statistic: mean recurrence time. Remember that we have many formulas for this. One is a nice set of matrix equations. Another, which we used more
often in class, was to set up simple recurrences. A third was to recall that the stationary distribution often satisfies πi = 1/Mii. The first is quite general but hard to use.
The second is quite general, though it requires a bit more setup. The third is by far
the easiest to use when it applies - but it doesn’t always apply! You should at least be
familiar with the second and third methods.
• Second important statistic: mean occupation time (that is, Σ_{t=0}^{∞} P[Xt = i]). Remember that we calculate this by taking the ‘transient part’ T of the transition matrix, then
looking at the entries of (Id − T)^{−1}.
• Random walk problems: the calculations here are quite difficult, and you won’t be expected to solve the complicated recurrences on an exam. From my point of view, the
point of reviewing this section is that you should be able to make use of related formulas that are given to you (as you did in a homework assignment) to do calculations
and to classify the Markov chain.
Let’s make this concrete. I might give you a transition matrix P and say that the
only solutions z to the equation zP = z satisfy
zn+1 = z0/(n + 1)
for n ∈ {0, 1, . . .}. Does the associated matrix have a stationary distribution? To
answer this, we note that the stationary distribution would have to satisfy
1 = Σ_{n=0}^{∞} zn = z0 Σ_{n=0}^{∞} 1/(n + 1) = ∞.
This can’t be the case, so the chain has no stationary distribution. On an exam, the
non-convergence would likely be even more obvious.
• Stationary distributions: You should be able to calculate stationary and limiting distributions. We have three approaches:
(1) Diagonalize the entire matrix and take powers. This is the first thing we did.
It is never a good idea for calculating a stationary distribution, but can be a
good way to calculate a limiting distribution if there is not a unique stationary
distribution.
(2) Solve zP = z or zQ = 0. These are great ways to find the stationary distribution(s), and can be used to calculate the limiting distribution if they are equal.
(3) Use zi = 1/Mii when it applies.
In addition to calculating the stationary distribution, you should be very familiar with
the theorems (and especially the standard examples we discussed in class) that explain
when stationary and limiting distributions exist, when they are unique, and when they
are equal. This was one of the main topics of the chapter 9 review lecture, so I won’t
give more examples here.
• Continuous-time Markov chains: almost everything here is very similar to discrete-time Markov chains. I emphasize:
– The only ‘new’ formula is for the embedded chain.
– Many of the ‘old’ formulas look different. For example, the stationary distribution satisfies zQ = 0 rather than zP = z.
– Some, but not all, calculations can be done using the embedded chain. Understand the calculations for which the embedded chain is enough (e.g. classifying
states of a finite Markov chain) and those for which it is not (e.g. computing a
stationary distribution).
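Finally, here is the promised numerical check of the ‘tricky’ 7-state example above (a sketch of my own, using numpy); it compares the brute-force matrix power with the geometric-time shortcut.

import numpy as np

P = np.array([
    [0.1, 0.4, 0.3, 0.2, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0],
    [0.2, 0.2, 0.4, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

# Brute force: P[X_200 = 7 | X_1 = 1] is the (1, 7) entry of P^199 (199 transitions).
brute = np.linalg.matrix_power(P, 199)[0, 6]

# Shortcut: from each of the states {1, 2, 3} the chain moves to state 4 with probability 0.2,
# so it misses state 4 during its first 196 transitions with probability 0.8**196.
shortcut = 1 - 0.8 ** 196

print(brute, shortcut)   # both are essentially 1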
25.5. Chapter 8. We saw:
• Markov’s inequality and variants (most questions involve just plugging in).
• Law of large numbers.
• Central limit theorem and the idea of a ‘normal approximation’ to a sum of random
variables.
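As a concrete instance of the normal approximation mentioned above, here is a tiny sketch comparing the exact binomial probability with the CLT-based estimate; the parameters n = 100, p = 0.3, k = 35 are arbitrary.

from math import comb, erf, sqrt

n, p, k = 100, 0.3, 35
# Exact tail: P[X <= k] for X ~ Binomial(n, p).
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
# Normal approximation with continuity correction: X is approximately N(np, np(1 - p)).
mean, sd = n * p, sqrt(n * p * (1 - p))
z = (k + 0.5 - mean) / sd
approx = 0.5 * (1 + erf(z / sqrt(2)))
print(exact, approx)   # the two values should be close (roughly 0.88 each)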
25.6. Chapter 7. We saw:
• Many explicit families of continuous random variables. Most questions look like ‘X
is exponential with inverse-mean λ = 4, what is P[X < 2]?’ Some questions try to
disguise themselves a little, e.g. by saying Var[X] = 12 and asking for E[X].
• We saw phase-type distributions. This is very similar to the above, but with more
complicated formulas.
• We saw reliability theory.
25.7. Chapter 6. Like chapter 7, but with discrete random variables.
25.8. Chapter 5. Lots of computations that should be second-nature at this point:
• Expectations and conditional expectations (this is ‘just calculus’).
• Generating functions show up, both as items to calculate and as tools for proving that
distributions are the same.
• On the midterm, this chapter had the most emphasis on ‘tricky’ questions that were
about inequalities rather than equalities. For example, noting that if X ∈ [a, b], then
E[X] ≥ a. In general, as with the midterm, most early-course tricky questions are
about trying to do a calculation with some information missing and then subbing in
an obvious inequality at that stage.
25.9. Chapter 4. Computational questions are similar to chapter 5. No generating functions.
25.10. Chapter 3. Questions in this chapter are really about remembering definitions.
25.11. Chapter 2. Not emphasized on the exam, except insofar as it is used in other chapters.
25.12. Chapter 1. Lots of computation that should be straightforward at this point. As was
the case in your first courses in probability, much of the trickiness comes from trying to apply
Bayes rule - remember that it requires you to come up with a partition, and not all questions
come with an ‘obvious’ partition. After chapter 5, this is the most likely place to see ‘tricky’
questions from the early material.
26. Final Exam: April 20, 2015
Good luck!
27. Miscellaneous
Last updated on April 9, 2015.