
MATH 387: HANDOUT ON BASIC PROBABILITY CONCEPTS
WINFRIED JUST, OHIO UNIVERSITY
Abstract. This note gives a summary of the basic concepts of probability
theory, as well as some practice problems.
1. Sample spaces and events
The sample space Ω is the set of all elementary outcomes e of an “experiment.”
For the time being we will assume that the sample space is finite or countably
infinite (which means it can be indexed by the set of natural numbers). The probability function P assigns numbers from the interval [0, 1], called probabilities, to
elementary outcomes. This function has the property that
(1)    ∑_{e∈Ω} P (e) = 1.
An event E is a subset of Ω. The probability of an event E is given by
(2)    P (E) = ∑_{e∈E} P (e).
If Ω is finite and all elementary outcomes in Ω are equally likely, then (2) reduces
to
(3)    P (E) = #(E)/#(Ω),
where #(E) denotes the number of elements of E.
An important example of a finite sample space for which usually all elementary
outcomes are assumed equally likely is the space of all permutations (i.e., ordered
arrangements) of r objects out of a given set of n objects. The size of this space is
given by
(4)    P^n_r = n!/(n − r)! = n(n − 1) · · · (n − r + 1).
Another important example of such spaces is the space of all combinations (i.e., unordered arrangements) of r objects out of a given set of n objects.
The size of this space is given by
(5)    C^n_r = n!/((n − r)! r!) = n(n − 1) · · · (n − r + 1)/r!.
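As a quick numerical check of formulas (4) and (5), the following MatLab commands (a sketch that is not part of the original text; the values n = 10 and r = 3 are arbitrary examples) compute P^n_r and C^n_r from the factorial formulas and compare the latter with the built-in function nchoosek:
n = 10; r = 3;                                   % arbitrary example values
P = factorial(n)/factorial(n-r)                  % permutations, formula (4)
C = factorial(n)/(factorial(n-r)*factorial(r))   % combinations, formula (5)
nchoosek(n,r)                                    % built-in check; should equal C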
The complement of an event E, denoted by E′ (or Ē, or E^c), is the set E′ = Ω \ E
of all elementary outcomes that are not in E. Intuitively, the complement of E
occurs if E does not occur. Its probability is given by
(6)    P (E′) = 1 − P (E).
The union of two events A and B is the set A ∪ B of elementary outcomes that
are members of A or of B. Intuitively, A ∪ B occurs if at least one of A or B occurs.
The intersection of two events A and B is the set A ∩ B of elementary outcomes
that are members of both A and B. Intuitively, A ∩ B occurs if both A and B
occur. Two events A and B are mutually exclusive if A ∩ B = ∅. The probability
of the union of two events is given by the formula:
(7)
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
If A and B are mutually exclusive, then P (A ∩ B) = 0 and (7) simplifies to:
(8)
P (A ∪ B) = P (A) + P (B).
More generally, events A1 , . . . , An are pairwise mutually exclusive if Ai ∩ Aj = ∅
for all 1 ≤ i < j ≤ n. If, in addition, we have A1 ∪ · · · ∪ An = Ω, then we call
the family {A1, . . . , An} a partition of the sample space. For families of pairwise
mutually exclusive events the following generalization of (8) holds:
(9)
P (A1 ∪ · · · ∪ An) = P (A1) + · · · + P (An ).
For events A1 , . . ., An that are not necessarily pairwise mutually exclusive, the
following generalization of (7) holds.
(10)    P (A1 ∪ · · · ∪ An ) = ∑_{i=1}^{n} P (Ai ) − ∑_{1≤i<j≤n} P (Ai ∩ Aj ) + ∑_{1≤i<j<k≤n} P (Ai ∩ Aj ∩ Ak ) − · · · + (−1)^{n+1} P (A1 ∩ · · · ∩ An ).
Equation (10) is called the inclusion-exclusion principle.
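As an illustration (this brute-force check is a sketch added here, with example events chosen arbitrarily), formula (7) can be verified in MatLab on the sample space of a single fair die:
Omega = 1:6;  P = ones(1,6)/6;         % fair die: each elementary outcome has probability 1/6
A = [2 4 6];  B = [4 5 6];             % example events: "even" and "at least 4"
PAuB = sum(P(union(A,B)))              % P(A ∪ B) computed directly from (2)
sum(P(A)) + sum(P(B)) - sum(P(intersect(A,B)))   % right-hand side of formula (7); should agree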
Suppose P (A) > 0. The conditional probability of B given A is the probability
that B occurs if it is already known that A occurred. It is denoted by P (B|A) and
given by the formula
(11)    P (B|A) = P (A ∩ B)/P (A).
Note that if P (A) = 0, then P (B|A) is undefined.
It follows from (11) that the probability of the intersection of A and B is given
by
(12)
P (A ∩ B) = P (A)P (B|A).
Two events A and B are independent if the information that one of them occurred
does not alter our expectation that the other one occurred. In other words, if
P (A) > 0 then A and B are independent if, and only if, P (B|A) = P (B). If P (A) =
0, then A never occurs and we could never obtain a piece of information that
says “A occurred,” so the events A and B are again independent. In either case we
have
(13)
P (A ∩ B) = P (A)P (B),
and equation (13) is in fact considered the “official” definition of independence of
two events.
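The following small simulation (a sketch, not part of the handout; the events are chosen for illustration) estimates both sides of (13) for two fair dice, with A = "the first die shows an even number" and B = "the sum of the dice equals 7." Since these events are independent, the two estimates should be close:
N = 1e6;                                  % number of simulated rolls of two dice
d1 = randi(6, N, 1);  d2 = randi(6, N, 1);
A = mod(d1, 2) == 0;                      % first die even
B = (d1 + d2) == 7;                       % sum equals 7
[mean(A & B), mean(A)*mean(B)]            % estimates of P(A ∩ B) and P(A)P(B)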
The events in a family A = {A1 , . . . , An} of events are said to be pairwise
independent if P (Ai ∩ Aj ) = P (Ai)P (Aj ) for all 1 ≤ i < j ≤ n. We say that the
events in A are independent if for every increasing sequence of indices 1 ≤ i1 <
· · · < ik ≤ n we have
(14)    P (Ai1 ∩ · · · ∩ Aik ) = ∏_{j=1}^{k} P (Aij ).
Independence clearly implies pairwise independence, but pairwise independence
does not imply independence when n > 2.
Conditional probability is a very important tool for constructing probability
functions on sample spaces of sequences. Suppose our sample space consists of
letter sequences s = (s1 , . . . , sn ) of length n. For simplicity, let us write (a1 , . . . , ak )
for the event s1 = a1 & s2 = a2 & . . . & sk = ak and ak for the event sk = ak . Then
(15)    P (a1 , . . . , an ) = P (a1 )P (a2 |a1 ) · · · P (ak |a1 . . . ak−1 ) · · · P (an |a1 . . . an−1 ).
If the events a1 , . . . , an are independent, then (15) reduces to
(16)    P (a1 , . . . , an ) = P (a1 )P (a2 ) · · · P (ak ) · · · P (an ).
Formula (15) underlies the procedure of calculating probabilities by using so-called decision trees. Sometimes it is easier to calculate P (B|A) and P (B|A′) than
P (B) itself. We can then compute P (B) from the formula
(17)    P (B) = P (B|A)P (A) + P (B|A′)P (A′).
More generally, if events A1, . . . , An form a partition of Ω, then the probability
of an event B can be calculated as
(18)
P (B) = P (B|A1 )P (A1) + · · · + P (B|An)P (An ).
Formula (18) and its special case (17) are called the formula for the total probability (of B).
Now let us return to equation (12). It implies that
(19)
P (A ∩ B) = P (B|A)P (A) = P (A|B)P (B).
If we divide by P (B) we obtain
(20)    P (A|B) = P (B|A)P (A)/P (B).
Equation (20) is the most general form of Bayes Theorem or Bayes Rule. It
allows us to compute P (A|B), the posterior probability of A after obtaining the
information that B occurred, from the prior probability P (A) and the conditional
probability P (B|A). In applications of Bayes Theorem the probability of B is
usually calculated using one of the formulas for total probability, either (17), in
which case Bayes Rule takes the form
(21)    P (A|B) = P (B|A)P (A) / (P (B|A)P (A) + P (B|A′)P (A′)),
or (18), in which case Bayes Rule takes the form
(22)    P (A1 |B) = P (B|A1 )P (A1 ) / (P (B|A1 )P (A1 ) + · · · + P (B|An )P (An )).
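As a small worked illustration of (17) and (21) in MatLab (the numbers are made up for this sketch and do not come from the handout), suppose P (A) = 0.01, P (B|A) = 0.95, and P (B|A′) = 0.1:
PA  = 0.01;  PAc = 1 - PA;       % prior probabilities P(A) and P(A')
PBgA  = 0.95;                    % P(B|A)
PBgAc = 0.10;                    % P(B|A')
PB   = PBgA*PA + PBgAc*PAc       % total probability of B, formula (17)
PAgB = PBgA*PA / PB              % posterior probability P(A|B), formula (21)
With these example numbers the posterior P (A|B) is only about 0.088 even though P (B|A) is large, because the prior P (A) is small.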
2. Practice problems for Section 1
This section gives some practice problems for the material in Section 1. These
questions form a logical sequence and should be attempted in the order given. The
exceptions are (9)-(13), which depend only on (1), and (14), which depends only
on (1)-(4).
(1) Suppose you want to model the space of all sequences of length n of letters
from the alphabet {a, c, g, t}. Describe a suitable sample space Ω and determine how many elements it has. If all such sequences are considered equally
likely, what would be the probability of each individual sequence? What
would be the probability that the sequence contains c at loci i1 , i2, . . . , ik
and other letters at all other loci? How would the result change if we have
only two letters? Only three letters?
(2) Is the probability function you defined in Problem (1) adequate if you want
to model a genome with cg-content 60%? If not, how would you define
the probability of an individual sequence if you assume that all loci are
independent? What would be the probability that the sequence contains
c at loci i1 , i2 , . . ., ik and other letters at all other loci? How would the
result change if the alphabet has only two letters {a, t}? Only three letters
{g, a, t}?
(3) In the model of Problem (1), find the probability that a sequence contains
exactly k occurrences of the letter c. How would you express the probability
that the sequence contains at most k occurrences of the letter c? Hint:
Consider a two-step approach: First choose the loci i1 , i2 , . . . , ik at which c
occurs, think in how many ways this can be done, and then use the result
from the third-to-last sentence of Problem (1).
(4) In the model of Problem (2), find the probability that a sequence contains
exactly k occurrences of the letter c. How would you express the probability
that the sequence contains at most k occurrences of the letter c? Hint:
Consider a two-step approach: First choose the loci i1 , i2 , . . . , ik at which c
occurs, think in how many ways this can be done, and then use the result
from the third-to-last sentence of Problem (2).
(5) Let E be the event that a sequence of n nucleotides contains kc occurrences
of the letter c, kg occurrences of the letter g, ka occurrences of the letter a,
and kt occurrences of the letter t, where kc + kg + ka + kt = n. Let C(kc )
be the event that c occurs exactly kc times; define C(kg ), C(ka ), C(kt )
analogously. Note that E = C(kc) ∩ C(kg ) ∩ C(ka) ∩ C(kt), and show that
E = C(kc) ∩ C(kg ) ∩ C(ka ).
(6) Let E be defined as in the previous problem. Show that
P (E) = P (C(kc ))P (C(kg )|C(kc ))P (C(ka )|C(kc ) ∩ C(kg )).
(7) Find a formula for the probability of event E as in Problem (5) in the model
of Problem (1). Hint: You already found a formula for P (C(kc)). Note that
P (C(kg )|C(kc )) is the same as the probability P (C(kg )) in a sequence space
where there are only n − kc nucleotides from the alphabet {g, a, t}, and use
a similar trick for finding P (C(ka )|C(kc ) ∩ C(kg )).
(8) Find a formula for the probability of event E as in Problem (5) in the model
of Problem (2). Hint: Argue as in the previous problem.
(9) Consider again the sample space of Problem (1) with all four nucleotides
equally likely in locus 1, but now assume that the loci are not independent.
Specifically, assume that a c is more likely to be followed by a g, but all other
probabilities are the same, and there are no other dependencies between
loci. Let si denote the letter encountered in locus i and assume for all
1 ≤ i < n we have:
P (si+1 = g|si = c) = 0.4,
P (si+1 = a|si = c) = P (si+1 = c|si = c) = P (si+1 = t|si = c) = 0.2,
P (si+1 = x|si = d) = 0.25,
where x stands for any nucleotide and d stands for any nucleotide other
than c. Let n = 5, and compute the probability of the sequence ccgca
under these assumptions. How does it compare with the probabilities of
the same sequence in the models of Problems (1) and (2)? Why do these
probabilities differ in the way they do?
(10) In the model of Problem (9), find P (s2 = g) and P (s2 = c). Hint: Use
the formula for the total probability.
(11) In the model of Problem (9), find P (s1 = c|s2 = g) and P (s2 = c|s3 = g).
Hint: Use Bayes formula.
(12) In the model of Problem (9), find P (s3 = c) and P (s3 = g).
(13) Write a short MatLab code that computes P (si = c) and P (si = g) for
i = 1, 2, . . ., n. Run it for n = 20. What pattern do you observe for the
sequences of these probabilities? How would you explain these patterns?
(14) Suppose you have a sequence of n nucleotides of a bacterium that was
randomly chosen from a culture that contains 10% bacteria C. minimus,
20% bacteria C. medianis, and 70% C. maximus. It is known that the
cg-content of the genome of C. minimus is 40%, the cg-content for C. medianis
is 50%, and the cg-content for C. maximus is 60%. A quick test tells you
that among the n nucleotides in the sequence of unknown origin exactly k
of them are c’s. Based on this information, how would you calculate the
probability that the sequence comes from any of these sources? Use MatLab
to find the respective probabilities for each organism if n = 10, k = 2 and
n = 50, k = 10. Note that in each case the proportion k/n of c’s is the same.
Why do you get such different probabilities? Hint: Use Bayes rule. The
MatLab code for the binomial coefficient C^n_k is nchoosek(n,k).
3. Discrete random variables
A random variable, or r.v., is any function X : Ω → R defined on a sample space
Ω that takes on real values.
For example, consider the experiment of rolling two dice. The number shown by
the first die, the number shown by the second die, the sum of these two numbers,
as well as their difference are all examples of random variables. Note that each of
the above random variables can take only a finite number of possible values. They
are examples of discrete random variables.
Now consider the experiment of rolling a die infinitely often. Then the number
of the first roll in which six comes up is also a random variable. The set of its
possible values is infinite, but since there is always a “next larger” possible value,
this random variable is still considered discrete.
Now suppose our sample space consists of all OU students. Then the Social
Security number of each student, ZIP code of primary residence, height, weight
are all random variables. The first two are discrete, the last two continuous since
they can potentially take all real values from an interval. One may also consider a
student’s gender or major to be so-called categorical random variables. Since one
can always assign arbitrary numerical codes to categories (think of major codes), we
will consider categorical random variables simply as examples of discrete random
variables.
Two r.v.s X and Y are independent if the values of X don’t give us any additional
information about the likelihood of values of Y on the same outcome; independence
of several r.v.s is defined similarly to the definition of independence of events.
For example, the random variables “Social Security number,” “height,” and “major code” for OU students can reasonably be assumed independent, while “Social
Security number” and “ZIP code” are likely dependent, and “height” and “weight”
are the opposite of independent: they are strongly correlated.
The distribution of a random variable measures how likely it is for this random
variable to take on values from a given interval. In the case of a discrete random
variable X that can take on only values x0 , x1, . . . we can represent the distribution
simply by listing for each xi the probability P (X = xi) that X will take the value xi .
For example, let X be a random variable that can take the value 1 (“success”)
with probability p and the value 0 (“failure”) with probability 1 − p = q. This is
called a Bernoulli r.v. Its distribution is given by P (1) = p, P (0) = q. Now let us
repeat the experiment n times and assume that the “trials” are independent. Let
Xi be the Bernoulli r.v. that takes the value 1 if we have a success in the i-th trial
and 0 if we have a failure in the i-th trial. Then we can define a random variable
X = ∑_{i=1}^{n} Xi . The r.v. X counts the total number of successes in these n trials.
It is called a binomial r.v. The distribution of X is given by
(23)    P (X = k) = C^n_k p^k q^{n−k} .
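Formula (23) can be evaluated directly in MatLab with the function nchoosek that is also mentioned in the practice problems above; the parameter values in this sketch are arbitrary examples, and the optional comparison with binopdf applies only if the Statistics Toolbox is available:
n = 10;  p = 0.3;  q = 1 - p;  k = 4;      % arbitrary example values
P_k = nchoosek(n,k) * p^k * q^(n-k)        % P(X = k) from formula (23)
% binopdf(k,n,p)                           % same value, if the Statistics Toolbox is installed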
We also could imagine repeating the experiment infinitely often and define a
random variable Y that returns the number of the first trial for which a success
occurs. This kind of r.v. is called a geometric r.v. Its distribution is given by
(24)    P (Y = k) = p q^{k−1} .
The mean value or expected value E(X) of a random variable gives us a notion
of “average” value. It is also often denoted by µ instead of E(X). For a discrete
r.v. X, it can be computed by the formula
(25)    E(X) = ∑_i xi P (xi ),
where i ranges over all indices for possible values xi of X.
For example, for a Bernoulli variable with success probability p we get
(26)
E(X) = 1 · p + 0 · q = p.
It turns out that if X = ∑_k Xk , then E(X) = ∑_k E(Xk ), regardless of whether
or not the r.v.s Xk are independent. We can now calculate the expected value of
a binomial r.v. X that counts the number of successes in n trials with success
probability p in each trial as:
(27)    E(X) = ∑_{k=1}^{n} E(Xk ) = ∑_{k=1}^{n} p = np.
For a geometric r.v. Y with distribution (24) we have E(Y ) = 1/p.
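The claim E(Y ) = 1/p can be checked by simulation; the MatLab sketch below (with an arbitrary example value of p) repeatedly counts the number of trials until the first success and compares the empirical mean with 1/p:
p = 0.2;  N = 1e5;                 % example success probability and number of repetitions
Y = zeros(N,1);
for j = 1:N
    k = 1;
    while rand(1) >= p             % keep performing trials until the first success
        k = k + 1;
    end
    Y(j) = k;                      % number of the trial with the first success
end
[mean(Y), 1/p]                     % empirical mean vs. the theoretical value 1/p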
Using formula (23) for large n becomes cumbersome; it is better to work with
approximations in this case. If n is very large and p is very small, but np is of
moderate size, then we can use the approximation of X by a r.v. Y with Poisson
distribution with parameter λ = np. The distribution of such Y is given by
(28)    P (Y = k) = e^{−λ} λ^k /k!.
The expected value for a r.v. Y with Poisson distribution with parameter λ is
E(Y ) = λ.
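The quality of the Poisson approximation can be inspected directly in MatLab; the sketch below uses the example values n = 1000 and p = 0.002 (so λ = 2) and compares formulas (23) and (28) for small k:
n = 1000;  p = 0.002;  lambda = n*p;                          % example values
k = 0:5;
binom   = arrayfun(@(j) nchoosek(n,j)*p^j*(1-p)^(n-j), k);    % formula (23)
poisson = exp(-lambda) * lambda.^k ./ factorial(k);           % formula (28)
[binom; poisson]                                              % the two rows should be close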
The variance of a random variable X is a measure of dispersion, that is, a
measure of how much the values of X tend to differ from its mean µ. The variance
is usually denoted by V ar(X) or σ^2 if X is implied by the context. Its square root
σ is also a measure of dispersion and is called the standard deviation. There are
two formulas that are commonly used for the variance of a discrete r.v.:
(29)    V ar(X) = E((X − µ)^2 ) = ∑_i (xi − µ)^2 P (xi ).
(30)    V ar(X) = E(X^2 ) − µ^2 = (∑_i xi^2 P (xi )) − µ^2 .
In both formulas, xi ranges over all values that X can possibly take. While
formula (29) makes it easier to see what V ar(X) is, formula (30) is usually more
convenient for actual computations.
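Formulas (25) and (30) translate directly into MatLab. In the sketch below, x lists the possible values of a discrete r.v. and Px the corresponding probabilities; the numbers are an arbitrary example:
x  = [0 1 2 3];                 % possible values (example)
Px = [0.1 0.4 0.3 0.2];         % their probabilities; these must sum to 1
mu = sum(x .* Px)               % expected value, formula (25)
v  = sum(x.^2 .* Px) - mu^2     % variance, formula (30)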
Consider a Bernoulli r.v. X with success probability p. We have seen above that
µ = E(X) = p. Using formula (30) we obtain for this type of r.v.:
(31)    V ar(X) = (0^2 · q + 1^2 · p) − p^2 = p − p^2 = p(1 − p) = pq.
Exercise 1. Consider the following two experiments: In the first experiment, flip
a fair coin twice and let X be the number of times it comes up heads. Compute
E(X) and then V ar(X) using (30) above. In the second experiment assume you
have two biased coins; one will come up heads with probability 0.4, the other will
come up heads with probability 0.6. You don’t know which is which, randomly pick
one of the coins, and flip it twice. Let Y be the number of times it comes up heads.
Compute E(Y ) and then V ar(Y ) using (30) above.
Now assume X = ∑_k Xk . Recall that E(X) = ∑_k E(Xk ), regardless of whether
or not the r.v.s Xk are independent. If the r.v.s Xk are independent, then we still
have V ar(X) = ∑_k V ar(Xk ), but this formula usually becomes false if the r.v.s
Xk are dependent.
In the case of a binomial r.v. X that counts the number of successes in n trials
with success probability p in each trial we can represent X as a sum of n independent
Bernoulli variables and we get the formula V ar(X) = npq = np(1 − p).
If n is large and p is small with λ = np of moderate size, then we get np(1 − p) =
np − np^2 ≈ np = λ. Accordingly, the variance of a random variable Y with the
Poisson distribution given by (28) is λ. This should be expected, since the Poisson
distribution is an approximation to the corresponding binomial distribution.
The variance of a random variable Y that has the geometric distribution of
formula (24) is given by V ar(Y ) = q/p^2 .
Exercise 2. Let Y be a r.v. with the geometric distribution of (24). Find a
formula for P (Y = k1 + k2|Y > k1 ). Why are geometric random variables called
“memoryless?”
Exercise 3. Write a MatLab function that takes as input integers
n ≥ k ≥ m ≥ 0 and a probability p and outputs P (m ≤ X ≤ k) where X is a
binomial r.v. that has a distribution given by (23).
Exercise 4. Assume that the number X of Athens residents that need an ambulance
during any given hour has approximately a Poisson distribution with parameter λ =
2. Write a little MatLab program that computes, for any given positive integer k,
the probability that during any given hour there will be no more than k ambulance
calls from Athens residents. Use your program to determine the smallest k for
which the probability that there will be more than k ambulance calls during a given
hour is no more than 0.001. Assume that you are in charge of deciding the number
of ambulances needed in Athens and each ambulance trip takes an hour. Assume
furthermore that if an ambulance call cannot be answered, a loss of one human
life will result with probability 0.05 and that it costs 120,000 per year to maintain
an ambulance (you need to pay for the ambulance and at least three drivers to be
present during 8-hour shifts). If the decision on how many ambulances to maintain
on call in Athens is based on the probability 0.001 mentioned above, what value is
implicitly placed on the cost of a human life? (Think about this in terms of how things
would change by maintaining one more ambulance.)
Exercise 5. Suppose you have a way of adjusting the probabilities of a coin at will
and you perform the following experiment: In the first flip, choose the probability
p1 that it comes up heads as p1 = 0.5. For coin flip number i + 1, let Xi be the
number of times heads comes up in the first i trials and let the probability that it
comes up heads in trial number i + 1 be pi+1 = (pi + Xi /i)/2.
(a) Write a MatLab code that takes as input a positive integer n and outputs the
probability distribution of Xn in the above experiment and its mean value E(Xn ).
What output do you get for n = 10? How does this compare with the binomial
distribution for n = 10 and p = 0.5?
(b) Extend your MatLab code so that it outputs also the variance V ar(Xn ).
(c) How will the picture change if we adjust the probability pi+1 according to the
formula pi+1 = (pi + (i − Xi )/i)/2 instead?
4. Continuous random variables
Example 1. Suppose X is a binomial r.v. with distribution (23). Find the most
likely value that X can take. What happens to the probability of this value as
n → ∞?
For the r.v. X of Example 1, let Y = X/n be the proportion of successes. For very
large n, we may (roughly) treat Y as a continuous r.v. that takes arbitrary values
from the interval [0, 1] with each individual value having probability zero. For this
type of r.v. we can no longer compute the probability of an event E as
P (E) = ∑_{e∈E} P (e).
Instead, we will need a continuous function g on the interval I of possible values
for our r.v., called a probability density function, and define
(32)    P (E) = ∫_E g(x) dx.
The simplest continuous distribution is the uniform distribution over an interval
I = [a, b], where g(x) = 1/(b − a) for all x ∈ [a, b].
Example 2. Suppose X is uniformly distributed over an interval [a, b] and
a ≤ c < d ≤ b. Find P (X ∈ [c, d]).
The uniform distribution is an important tool for generating random numbers
in MatLab. The command
>> rand(1)
gives you a random number from the interval [0, 1] with the uniform distribution. You can use this to define a random value of a Bernoulli r.v. with success
probability p by entering (if, for example, p = 0.3)
>> p = 0.3;
>> rand(1) < p
The command
>> p = 0.3;
>> rand(10) < p
now gives you a 10 × 10 matrix of random variates with this Bernoulli distribution; in order to get ten random variates for a binomial distribution with parameters
n = 10 and p = 0.3 enter
>> p = 0.3;
>> sum(rand(10) < p)
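More generally (this is a sketch extrapolating from the commands above, not a command given in the text), sum(rand(n,m) < p) produces m independent variates of a binomial r.v. with parameters n and p, since each column of rand(n,m) < p records n independent Bernoulli trials:
>> p = 0.3; n = 20; m = 1000;
>> X = sum(rand(n,m) < p);      % 1-by-m vector of binomial(n,p) variates
>> mean(X)                      % should be close to n*p = 6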
Perhaps the most important continuous distribution is the normal distribution.
To be more precise, for every real number µ and positive real number σ there is one
normal distribution N (µ, σ) with mean µ and standard deviation σ. Its probability
density function is given by the formula
(33)    g(x) = (1/(√(2π) σ)) e^{−(x−µ)^2 /(2σ^2 )} .
The constant 1/(√(2π) σ) is needed to make sure that ∫_{−∞}^{∞} g(x) dx = 1.
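One can check this numerically, and also compute probabilities of intervals from the density (33), with MatLab's numerical integrator integral; the sketch below uses the arbitrary example values µ = 0 and σ = 2:
mu = 0;  sigma = 2;                                          % example parameters
g = @(x) exp(-(x-mu).^2/(2*sigma^2)) / (sqrt(2*pi)*sigma);   % density from formula (33)
integral(g, -Inf, Inf)                                       % should be (very close to) 1
integral(g, mu - sigma, mu + sigma)                          % P(mu - sigma < X < mu + sigma), about 0.68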
The standard normal distribution N (0, 1) has mean µ = 0 and standard deviation
σ = 1. Its density function is given by
(34)    g(x) = (1/√(2π)) e^{−x^2 /2} .
To see why the distribution N (0, 1) is so important, consider a random variable
X = ∑_{i=1}^{n} Xi , where the r.v.’s Xi are independent and have identical distributions.
Let µ = ∑_{i=1}^{n} E(Xi ) be the mean of X and let σ^2 = ∑_{i=1}^{n} V ar(Xi ) be its variance.
Define Z = (X − µ)/σ. The new r.v. Z is called the z-score of X; it measures by
how many standard deviations a given value of X differs from the mean of X. The
Central Limit Theorem or CLT states that for sufficiently large n the distribution
of Z will be well approximated by N (0, 1). Since tables or software for calculating
the probability that a r.v. with the standard normal distribution takes on a value
from a given interval are readily available, this gives us a convenient tool for answering (at least approximately) the same questions for X. For example, if Z has
distribution N (0, 1), then in MatLab you can enter
>> normcdf(a)
to calculate the probability P (Z ≤ a). For practice, you may want to do the
following practice exercises:
Exercise 6. Suppose X has distribution N (0, 1). Find the following probabilities:
(a) P (X ≤ 0)
(b) P (X > 2)
(c) P (|X| > 3)
(d) P (−1 < X ≤ 0.5)
(e) P (1 < X ≤ 2.5)
For a normally distributed r.v. the probabilities that it takes a specific
value are exactly zero, but if X is approximately normally distributed but takes
integer values, this is no longer the case. To correct for this effect, one uses a
so-called continuity correction. For example, to approximate the probability that
X > k, one finds the z-score z = (k + 0.5 − µ)/σ for k + 0.5, where µ and σ denote the
mean and standard deviation of X, and uses the normal approximation to find the
probability that X > k + 0.5. To approximate the probability of X ≥ k one uses
the z-score z = (k − 0.5 − µ)/σ instead.
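For example (a sketch with made-up numbers, different from those in Exercise 7 below): if X is binomial with n = 100 and p = 0.5, so that µ = 50 and σ = 5, the normal approximation with continuity correction to P (X > 55) can be computed as
>> n = 100; p = 0.5;
>> mu = n*p; sigma = sqrt(n*p*(1-p));       % mu = 50, sigma = 5
>> z = (55 + 0.5 - mu)/sigma;               % z-score of k + 0.5 with k = 55
>> 1 - normcdf(z)                           % approximates P(X > 55); about 0.136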
Exercise 7. Suppose X has a binomial distribution with parameters n = 400 and
p = 0.25. Use the CLT to approximate the following probabilities:
(a) P (X ≤ 100)
(b) P (X > 140)
(d) P (90 < X ≤ 115)
(e) P (110 < X ≤ 120)
An important notion in statistics is the so-called p-value. To define it, one formulates
a null hypothesis (usually an assumption that a given r.v. X has a
particularly simple distribution or that two r.v.s are independent), performs an
experiment, and calculates the probability that X takes a value that is at least as
extreme as the observed value under the assumption that the null hypothesis is true.
This probability is the p-value. If the p-value turns out lower than a previously
specified significance level α, one can feel entitled to reject the null hypothesis. In
science, one usually works with α = 0.05, but α = 0.01 or α = 0.001 are also
sometimes used. The proper interpretation of the phrase “at least as extreme as
the observed value” usually depends on the context. Suppose you flip a coin 10,000
times and heads comes up 5,100 times. Your “null hypothesis” is the assumption
that the coin is unbiased, in which case the number X of heads is a binomial
variable with parameters n = 10, 000 and p = 0.5. The observed number exceeds
the mean by 100. What is the probability of obtaining this value or at least as
“extreme” ones? This depends. If we are playing in a casino and heads favor the
house, the null hypothesis is really: “The coin is not biased in a way that would
favor the house” and the proper interpretation is to consider all values for X that
are ≥ 5, 100 as “at least as extreme as the observed one.” If, however, we have no
prior conception of why the coin might be biased one way or the other, we need to
consider all values of X that are ≥ 5, 100 together with all values that are ≤ 4, 900 as
“at least as extreme as the observed one.”
Exercise 8. Use the technique of the previous exercise to calculate the p-values
corresponding to each of the two interpretations. Will the significance level of 0.05
allow us to reject the null hypothesis?
While widely used in science, the p-value may be misleading in some bioinformatics problems. Let us return to a previous exercise and expand on it.
Exercise 9. Suppose you sequence loci s1 , . . . , s3n of a DNA strand. Assume that
the strand is completely random, with all four nucleotides equally likely and no
dependencies between the loci. Let X count the number of i’s between 1 and n such
that (s3i−2, s3i−1, s3i) is one of the sequences (tga), (taa), (tag).
(a) Find a formula for P (X = k).
(b) Find a formula for P (X = 0) if your sequence represents a fragment of a coding
region that is 900 nucleotides long (with the last three positions representing a STOP
codon), if you know that s1 represents the first position of a codon in the correct
reading frame, but if you don’t know which of the 300 first positions it is, with all
300 first positions being equally likely.
(c) Let Y be the r.v. that returns the smallest i such that (s3i−2, s3i−1, s3i) is one of
the sequences (tga), (taa), (tag). Then Y returns the number of the trial on which
the first “success” occurs. Assuming that we sequence random DNA, what is the
distribution of Y ? What is the expected value of Y ?
(d) Assume that Y as in (c) takes the value 65 (that is, the first time you encounter
a triplet that looks like a STOP codon is at positions (193, 194, 195)). If your null
hypothesis is that the genome sequence is random, what p-value does this observation
correspond to? Can we reject the null hypothesis at significance level 0.05?
(e) Consider an idealized bacterium b. idealis by making the following slightly false
assumptions:
• Every nucleotide belongs to a coding region.
• All coding regions are exactly 900 nucleotides long.
• All six reading frames are equally likely.
• A sequence that is read in the wrong reading frame is completely random,
with no dependencies between different loci and all nucleotides equally likely.
Assume again that Y as in (c) takes the value 65 (that is, the first time you
encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)).
What is the probability that you have been sequencing from a coding region in the
correct reading frame? Hint: Let C be the event “correct reading frame,” let I be the
event “incorrect reading frame.” The a priori probabilities are P (C) = 1/6, P (I) = 5/6
(why?). Let S be the observed occurrence of the first STOP codon in position 65.
Use your work in (a) or (c) to calculate P (S|I); use (b) to calculate P (S|C). Then
use Bayes Theorem to calculate the desired probability P (C|S).
(f) Would you draw the same or opposite conclusions from an approach based on
the p-value and from an approach based on Bayes Theorem? In each case, would
you consider the available evidence compelling or rather weak?
(g) Consider an idealized amoebum a. idealis by making the following slightly false
assumptions:
• Only 0.01% of all nucleotides belong to a coding region.
• All coding regions are contiguous (no introns) and exactly 900 nucleotides
long.
• All six reading frames are equally likely for the coding regions.
• A sequence that is read in the wrong reading frame or that belongs to an
intergenic region is completely random, with no dependencies between different loci and all nucleotides equally likely.
Assume again that Y as in (c) takes the value 65 (that is, the first time you
encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)).
What is the probability that you have been sequencing from a coding region and in
the correct reading frame? Hint: The new twist here is that much of the genome
will no longer be coding in any of the six reading frames.
(h) Would you draw the same or opposite conclusions from an approach based on
the p-value and from an approach based on Bayes Theorem? In each case, would
you consider the available evidence compelling or rather weak?
Here is an additional practice problem:
Exercise 10. Suppose you perform independent Bernoulli trials with success probability p in each individual trial until you achieve a total of m successes. Let Y be the
number of the trial in which the m-th success occurs.
(a) Find the set of all possible values of Y .
(b) Find the probability distribution of Y . Hint: You will need to use your knowledge
of both the binomial and the geometric distributions.
(c) Find the expected value and variance of Y . Hint: It is somewhat difficult to do
this from the formula you found in (b). Instead, notice that Y = Y1 + Y2 + · · ·+ Ym ,
where Yi is the waiting time for success number i counting from the number of the
trial where success number i − 1 occurred.