MATH 387: HANDOUT ON BASIC PROBABILITY CONCEPTS

WINFRIED JUST, OHIO UNIVERSITY

Abstract. This note gives a summary of the basic concepts of probability theory, as well as some practice problems.

1. Sample spaces and events

The sample space Ω is the set of all elementary outcomes e of an "experiment." For the time being we will assume that the sample space is finite or countably infinite (which means it can be indexed by the set of natural numbers). The probability function P assigns numbers from the interval [0, 1], called probabilities, to elementary outcomes. This function has the property that

(1)    ∑_{e∈Ω} P(e) = 1.

An event E is a subset of Ω. The probability of an event E is given by

(2)    P(E) = ∑_{e∈E} P(e).

If Ω is finite and all elementary outcomes in Ω are equally likely, then (2) reduces to

(3)    P(E) = #(E)/#(Ω),

where #(E) denotes the number of elements of E.

An important example of a finite sample space for which all elementary outcomes are usually assumed equally likely is the space of all permutations (i.e., ordered arrangements) of r objects out of a given set of n objects. The size of this space is given by

(4)    P^n_r = n!/(n − r)! = n(n − 1)···(n − r + 1).

Another important example of such spaces is the space of all combinations (i.e., unordered arrangements) of r objects out of a given set of n objects. The size of this space is given by

(5)    C^n_r = n!/((n − r)! r!) = n(n − 1)···(n − r + 1)/r!.

The complement of an event E, denoted by E′ (or Ē, or E^c), is the set E′ = Ω\E of all elementary outcomes that are not in E. Intuitively, the complement of E occurs if E does not occur. Its probability is given by

(6)    P(E′) = 1 − P(E).

The union of two events A and B is the set A ∪ B of elementary outcomes that are members of A or of B. Intuitively, A ∪ B occurs if at least one of A or B occurs. The intersection of two events A and B is the set A ∩ B of elementary outcomes that are members of both A and B. Intuitively, A ∩ B occurs if both A and B occur. Two events A and B are mutually exclusive if A ∩ B = ∅. The probability of the union of two events is given by the formula

(7)    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

If A and B are mutually exclusive, then P(A ∩ B) = 0 and (7) simplifies to

(8)    P(A ∪ B) = P(A) + P(B).

More generally, events A1, ..., An are pairwise mutually exclusive if Ai ∩ Aj = ∅ for all 1 ≤ i < j ≤ n. If, in addition, we have A1 ∪ ··· ∪ An = Ω, then we call the family {A1, ..., An} a partition of the sample space. For families of pairwise mutually exclusive events the following generalization of (8) holds:

(9)    P(A1 ∪ ··· ∪ An) = P(A1) + ··· + P(An).

For events A1, ..., An that are not necessarily pairwise mutually exclusive, the following generalization of (7) holds:

(10)    P(A1 ∪ ··· ∪ An) = ∑_{i=1}^{n} P(Ai) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − ··· + (−1)^(n+1) P(A1 ∩ ··· ∩ An).

Equation (10) is called the inclusion-exclusion principle.

Suppose P(A) > 0. The conditional probability of B given A is the probability that B occurs if it is already known that A occurred. It is denoted by P(B|A) and given by the formula

(11)    P(B|A) = P(A ∩ B)/P(A).

Note that if P(A) = 0, then P(B|A) is undefined. It follows from (11) that the probability of the intersection of A and B is given by

(12)    P(A ∩ B) = P(A)P(B|A).
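To make (11) and (12) concrete, here is a minimal MatLab sketch (it is not part of the original handout, and the two events are chosen purely for illustration) that computes these quantities by enumerating the 36 equally likely outcomes of rolling two dice:

% Enumerate the 36 equally likely outcomes of rolling two dice.
[d1, d2] = meshgrid(1:6, 1:6);
d1 = d1(:);  d2 = d2(:);            % column vectors with one row per outcome

% Illustrative events (chosen for this sketch, not taken from the handout):
A = (d1 + d2 >= 9);                 % A: the sum of the two dice is at least 9
B = (d1 == 6);                      % B: the first die shows a six

PA  = mean(A);                      % P(A) by formula (3): #(A)/#(Omega)
PAB = mean(A & B);                  % P(A and B)

P_B_given_A = PAB / PA;             % conditional probability, formula (11)
product     = PA * P_B_given_A;     % formula (12) recovers P(A and B)

fprintf('P(A) = %.4f, P(B|A) = %.4f, P(A&B) = %.4f, P(A)*P(B|A) = %.4f\n', ...
    PA, P_B_given_A, PAB, product);

The last two printed numbers agree, as formula (12) predicts.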
Two events A and B are independent if the information that one of them occurred does not alter our expectation that the other one occurred. In other words, if P(A) > 0, then A and B are independent if, and only if, P(B|A) = P(B). If P(A) = 0, then A never occurs, and we could never obtain a piece of information that says "A occurred," so the events A and B are again independent. In either case we have

(13)    P(A ∩ B) = P(A)P(B),

and equation (13) is in fact considered the "official" definition of independence of two events.

The events in a family A = {A1, ..., An} of events are said to be pairwise independent if P(Ai ∩ Aj) = P(Ai)P(Aj) for all 1 ≤ i < j ≤ n. We say that the events in A are independent if for every increasing sequence of indices 1 ≤ i1 < ··· < ik ≤ n we have

(14)    P(Ai1 ∩ ··· ∩ Aik) = P(Ai1) P(Ai2) ··· P(Aik).

Independence clearly implies pairwise independence, but pairwise independence does not imply independence when n > 2.

Conditional probability is a very important tool for constructing probability functions on sample spaces of sequences. Suppose our sample space consists of letter sequences ~s = (s1, ..., sn) of length n. For simplicity, let us write (a1, ..., ak) for the event s1 = a1 & s2 = a2 & ... & sk = ak, and ak for the event sk = ak. Then

(15)    P(a1, ..., an) = P(a1) P(a2|a1) ··· P(ak | a1, ..., a_{k−1}) ··· P(an | a1, ..., a_{n−1}).

If the events a1, ..., an are independent, then (15) reduces to

(16)    P(a1, ..., an) = P(a1) P(a2) ··· P(ak) ··· P(an).

Formula (16) underlies the procedure of calculating probabilities by using so-called decision trees.

Sometimes it is easier to calculate P(B|A) and P(B|A′) than P(B) itself. We can then compute P(B) from the formula

(17)    P(B) = P(B|A)P(A) + P(B|A′)P(A′).

More generally, if events A1, ..., An form a partition of Ω, then the probability of an event B can be calculated as

(18)    P(B) = P(B|A1)P(A1) + ··· + P(B|An)P(An).

Formula (18) and its special case (17) are called the formula for the total probability (of B).

Now let us return to equation (12). It implies that

(19)    P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

If we divide by P(B), we obtain

(20)    P(A|B) = P(B|A)P(A)/P(B).

Equation (20) is the most general form of Bayes Theorem or Bayes Rule. It allows us to compute P(A|B), the posterior probability of A after obtaining the information that B occurred, from the prior probability P(A) and the conditional probability P(B|A). In applications of Bayes Theorem the probability of B is usually calculated using one of the formulas for total probability: either (17), in which case Bayes Rule takes the form

(21)    P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|A′)P(A′)),

or (18), in which case Bayes Rule takes the form

(22)    P(A1|B) = P(B|A1)P(A1) / (P(B|A1)P(A1) + ··· + P(B|An)P(An)),

and analogously for A2, ..., An.
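As an illustration of (17) and (21), here is a minimal MatLab sketch (not part of the original handout; the numbers are purely illustrative) that computes a posterior probability for a rare event A from a piece of evidence B:

% Illustrative numbers (not from the handout): a rare condition A and a
% piece of evidence B that is much more likely when A holds.
P_A    = 0.01;                      % prior probability P(A)
P_Ac   = 1 - P_A;                   % P(A'), by formula (6)
P_B_A  = 0.95;                      % P(B|A)
P_B_Ac = 0.10;                      % P(B|A')

P_B   = P_B_A*P_A + P_B_Ac*P_Ac;    % total probability, formula (17)
P_A_B = P_B_A*P_A / P_B;            % Bayes Rule in the form (21)

fprintf('P(B) = %.4f, posterior P(A|B) = %.4f\n', P_B, P_A_B);

Although P(B|A) = 0.95 is large, the posterior P(A|B) comes out below 0.1, because the prior P(A) is so small; the same interplay between prior and conditional probabilities reappears in Problem (14) and in Exercise 9 below.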
2. Practice problems for Section 1

This section gives some practice problems for the material in Section 1. These questions form a logical sequence and should be attempted in the order given. The exceptions are Problems (9)-(13), which depend only on Problem (1), and Problem (14), which depends only on Problems (1)-(4).

(1) Suppose you want to model the space of all sequences of length n of letters from the alphabet {a, c, g, t}. Describe a suitable sample space Ω and determine how many elements it has. If all such sequences are considered equally likely, what would be the probability of each individual sequence? What would be the probability that the sequence contains c at loci i1, i2, ..., ik and other letters at all other loci? How would the result change if we have only two letters? Only three letters?

(2) Is the probability function you defined in Problem (1) adequate if you want to model a genome with cg-content 60%? If not, how would you define the probability of an individual sequence if you assume that all loci are independent? What would be the probability that the sequence contains c at loci i1, i2, ..., ik and other letters at all other loci? How would the result change if the alphabet has only two letters {a, t}? Only three letters {g, a, t}?

(3) In the model of Problem (1), find the probability that a sequence contains exactly k occurrences of the letter c. How would you express the probability that the sequence contains at most k occurrences of the letter c? Hint: Consider a two-step approach: First choose the loci i1, i2, ..., ik at which c occurs, think about in how many ways this can be done, and then use the result of the third-to-last sentence of Problem (1).

(4) In the model of Problem (2), find the probability that a sequence contains exactly k occurrences of the letter c. How would you express the probability that the sequence contains at most k occurrences of the letter c? Hint: Consider a two-step approach: First choose the loci i1, i2, ..., ik at which c occurs, think about in how many ways this can be done, and then use the result of the third-to-last sentence of Problem (2).

(5) Let E be the event that a sequence of n nucleotides contains kc occurrences of the letter c, kg occurrences of the letter g, ka occurrences of the letter a, and kt occurrences of the letter t, where kc + kg + ka + kt = n. Let C(kc) be the event that c occurs exactly kc times; define C(kg), C(ka), C(kt) analogously. Note that E = C(kc) ∩ C(kg) ∩ C(ka) ∩ C(kt), and show that E = C(kc) ∩ C(kg) ∩ C(ka).

(6) Let E be defined as in the previous problem. Show that P(E) = P(C(kc)) P(C(kg)|C(kc)) P(C(ka)|C(kc) ∩ C(kg)).

(7) Find a formula for the probability of the event E of Problem (5) in the model of Problem (1). Hint: You already found a formula for P(C(kc)). Note that P(C(kg)|C(kc)) is the same as the probability P(C(kg)) in a sequence space where there are only n − kc nucleotides from the alphabet {g, a, t}, and use a similar trick for finding P(C(ka)|C(kc) ∩ C(kg)).

(8) Find a formula for the probability of the event E of Problem (5) in the model of Problem (2). Hint: Argue as in the previous problem.

(9) Consider again the sample space of Problem (1) with all four nucleotides equally likely at locus 1, but now assume that the loci are not independent. Specifically, assume that a c is more likely to be followed by a g, but all other probabilities are the same, and there are no other dependencies between loci. Let s_i denote the letter encountered at locus i and assume that for all 1 ≤ i < n we have:
P(s_{i+1} = g | s_i = c) = 0.4,
P(s_{i+1} = a | s_i = c) = P(s_{i+1} = c | s_i = c) = P(s_{i+1} = t | s_i = c) = 0.2,
P(s_{i+1} = x | s_i = d) = 0.25,
where x stands for any nucleotide and d stands for any nucleotide other than c. Let n = 5, and compute the probability of the sequence ccgca under these assumptions. How does it compare with the probabilities of the same sequence in the models of Problems (1) and (2)? Why do these probabilities differ in the way they do?

(10) In the model of Problem (9), find P(s2 = g) and P(s2 = c). Hint: Use the formula for the total probability.

(11) In the model of Problem (9), find P(s1 = c | s2 = g) and P(s2 = c | s3 = g). Hint: Use Bayes formula.
(12) In the model of Problem (9), find P(s3 = c) and P(s3 = g).

(13) Write a short MatLab code that computes P(si = c) and P(si = g) for i = 1, 2, ..., n in the model of Problem (9). Run it for n = 20. What pattern do you observe for the sequences of these probabilities? How would you explain these patterns?

(14) Suppose you have a sequence of n nucleotides of a bacterium that was randomly chosen from a culture that contains 10% bacteria C. minimus, 20% bacteria C. medianis, and 70% C. maximus. It is known that the cg-content of the genome of C. minimus is 40%, the cg-content for C. medianis is 50%, and the cg-content for C. maximus is 60%. A quick test tells you that among the n nucleotides in the sequence of unknown origin exactly k of them are c's. Based on this information, how would you calculate the probability that the sequence comes from each of these sources? Use MatLab to find the respective probabilities for each organism if n = 10, k = 2 and if n = 50, k = 10. Note that in each case the proportion k/n of c's is the same. Why do you get such different probabilities? Hint: Use Bayes rule. The MatLab command for the binomial coefficient C^n_k is nchoosek(n,k).

3. Discrete random variables

A random variable, or r.v., is any function X : Ω → R defined on a sample space Ω that takes on real values. For example, consider the experiment of rolling two dice. The number shown by the first die, the number shown by the second die, the sum of these two numbers, as well as their difference, are all examples of random variables. Note that each of the above random variables can take only a finite number of possible values. They are examples of discrete random variables. Now consider the experiment of rolling a die infinitely often. Then the number of the first roll in which six comes up is also a random variable. The set of its possible values is infinite, but since there is always a "next larger" possible value, this random variable is still considered discrete.

Now suppose our sample space consists of all OU students. Then the Social Security number of each student, the ZIP code of the primary residence, height, and weight are all random variables. The first two are discrete, the last two continuous, since they can potentially take all real values from an interval. One may also consider a student's gender or major to be so-called categorical random variables. Since one can always assign arbitrary numerical codes to categories (think of major codes), we will consider categorical random variables simply as examples of discrete random variables.

Two r.v.s X and Y are independent if the values of X don't give us any additional information about the likelihood of values of Y on the same outcome; independence of several r.v.s is defined similarly to the independence of events. For example, the random variables "Social Security number," "height," and "major code" for OU students can reasonably be assumed independent, while "Social Security number" and "ZIP code" are likely dependent, and "height" and "weight" are the opposite of independent: they are strongly correlated.

The distribution of a random variable measures how likely it is for this random variable to take on values from a given interval. In the case of a discrete random variable X that can take on only values x0, x1, ..., we can represent the distribution simply by listing for each xi the probability P(X = xi) that X will take the value xi.
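For instance, for the sum of two fair dice mentioned above, such a list can be produced with a short MatLab sketch (added here for illustration; it is not part of the original handout):

% Distribution of X = sum of two fair dice, represented by listing P(X = x)
% for every possible value x.  (A minimal sketch added for illustration.)
[d1, d2] = meshgrid(1:6, 1:6);
S = d1(:) + d2(:);                          % all 36 equally likely sums
values = unique(S)';                        % possible values 2, 3, ..., 12
probs  = zeros(size(values));
for j = 1:numel(values)
    probs(j) = sum(S == values(j)) / numel(S);   % formula (3)
end
disp([values; probs]);                      % row 1: values x_i, row 2: P(X = x_i)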
For example, let X be a random variable that can take the value 1 ("success") with probability p and the value 0 ("failure") with probability 1 − p = q. This is called a Bernoulli r.v. Its distribution is given by P(1) = p, P(0) = q.

Now let us repeat the experiment n times and assume that the "trials" are independent. Let Xi be the Bernoulli r.v. that takes the value 1 if we have a success in the i-th trial and 0 if we have a failure in the i-th trial. Then we can define a random variable X = ∑_{i=1}^{n} Xi. The r.v. X counts the total number of successes in these n trials. It is called a binomial r.v. The distribution of X is given by

(23)    P(X = k) = C^n_k p^k q^(n−k),

where C^n_k is the number of combinations from (5).

We could also imagine repeating the experiment infinitely often and define a random variable Y that returns the number of the first trial in which a success occurs. This kind of r.v. is called a geometric r.v. Its distribution is given by

(24)    P(Y = k) = p q^(k−1).

The mean value or expected value E(X) of a random variable gives us a notion of "average" value. It is also often denoted by µ instead of E(X). For a discrete r.v. X, it can be computed by the formula

(25)    E(X) = ∑_i xi P(xi),

where i ranges over all indices for possible values xi of X. For example, for a Bernoulli variable with success probability p we get

(26)    E(X) = 1·p + 0·q = p.

It turns out that if X = ∑_k Xk, then E(X) = ∑_k E(Xk), regardless of whether or not the r.v.s Xk are independent. We can now calculate the expected value of a binomial r.v. X that counts the number of successes in n trials with success probability p in each trial as

(27)    E(X) = ∑_{k=1}^{n} E(Xk) = ∑_{k=1}^{n} p = np.

For a geometric r.v. Y with distribution (24) we have E(Y) = 1/p.

Using formula (23) for large n becomes cumbersome; it is better to work with approximations in this case. If n is very large and p is very small, but np is of moderate size, then we can use the approximation of X by a r.v. Y with a Poisson distribution with parameter λ = np. The distribution of such a Y is given by

(28)    P(Y = k) = e^(−λ) λ^k / k!.

The expected value of a r.v. Y with a Poisson distribution with parameter λ is E(Y) = λ.

The variance of a random variable X is a measure of dispersion, that is, a measure of how much the values of X tend to differ from its mean µ. The variance is usually denoted by Var(X), or by σ² if X is implied by the context. Its square root σ is also a measure of dispersion and is called the standard deviation. There are two formulas that are commonly used for the variance of a discrete r.v.:

(29)    Var(X) = E((X − µ)²) = ∑_i (xi − µ)² P(xi),

(30)    Var(X) = E(X²) − µ² = (∑_i xi² P(xi)) − µ².

In both formulas, xi ranges over all values that X can possibly take. While formula (29) makes it easier to see what Var(X) is, formula (30) is usually more convenient for actual computations.

Consider a Bernoulli r.v. X with success probability p. We have seen above that µ = E(X) = p. Using formula (30) we obtain for this type of r.v.:

(31)    Var(X) = (0²·q + 1²·p) − p² = p − p² = p(1 − p) = pq.
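Formulas (25), (29) and (30) are easy to evaluate in MatLab once the values and their probabilities are listed. The following sketch (an illustration added to the handout text; the fair-die distribution is just an arbitrary example) computes the mean and both versions of the variance:

% Mean and variance of a discrete r.v. from its distribution, using
% formulas (25), (29) and (30).  The fair-die distribution below is just
% an arbitrary illustrative choice.
x = 1:6;                          % possible values x_i
p = ones(1, 6) / 6;               % probabilities P(x_i)

mu    = sum(x .* p);              % formula (25): E(X)
var29 = sum((x - mu).^2 .* p);    % formula (29)
var30 = sum(x.^2 .* p) - mu^2;    % formula (30)

fprintf('E(X) = %.4f, Var(X) by (29) = %.4f, by (30) = %.4f\n', mu, var29, var30);

Both variance formulas return the same number, as they must.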
Exercise 1. Consider the following two experiments: In the first experiment, flip a fair coin twice and let X be the number of times it comes up heads. Compute E(X) and then Var(X) using (30) above. In the second experiment, assume you have two biased coins; one will come up heads with probability 0.4, the other will come up heads with probability 0.6. You don't know which is which; randomly pick one of the coins and flip it twice. Let Y be the number of times it comes up heads. Compute E(Y) and then Var(Y) using (30) above.

Now assume X = ∑_k Xk. Recall that E(X) = ∑_k E(Xk), regardless of whether or not the r.v.s Xk are independent. If the r.v.s Xk are independent, then we also have Var(X) = ∑_k Var(Xk), but this formula is usually false if the r.v.s Xk are dependent. In the case of a binomial r.v. X that counts the number of successes in n trials with success probability p in each trial, we can represent X as a sum of n independent Bernoulli variables and we get the formula Var(X) = npq = np(1 − p). If n is large and p is small with λ = np of moderate size, then np(1 − p) = np − np² ≈ np = λ. Accordingly, the variance of a random variable Y with the Poisson distribution given by (28) is λ. This should be expected, since the Poisson distribution is an approximation to the corresponding binomial distribution. The variance of a random variable Y that has the geometric distribution of formula (24) is given by Var(Y) = q/p².

Exercise 2. Let Y be a r.v. with the geometric distribution of (24). Find a formula for P(Y = k1 + k2 | Y > k1). Why are geometric random variables called "memoryless"?

Exercise 3. Write a MatLab program for a function that takes as input integers n ≥ k ≥ m ≥ 0 and a probability p and outputs P(m ≤ X ≤ k), where X is a binomial r.v. with the distribution given by (23).

Exercise 4. Assume that the number X of Athens residents that need an ambulance during any given hour has approximately a Poisson distribution with parameter λ = 2. Write a little MatLab program that computes, for any given positive integer k, the probability that during any given hour there will be no more than k ambulance calls from Athens residents. Use your program to determine the smallest k for which the probability that there will be more than k ambulance calls during a given hour is no more than 0.001. Assume that you are in charge of deciding the number of ambulances needed in Athens and that each ambulance trip takes an hour. Assume furthermore that if an ambulance call cannot be answered, a loss of one human life will result with probability 0.05, and that it costs $120,000 per year to maintain an ambulance (you need to pay for the ambulance and at least three drivers to be present during 8-hour shifts). If the decision on how many ambulances to maintain on call in Athens is based on the probability 0.001 mentioned above, what value is implicitly placed on the cost of a human life? (Think about this in terms of how things would change by maintaining one more ambulance.)

Exercise 5. Suppose you have a way of adjusting the probabilities of a coin at will and you perform the following experiment: In the first flip, choose the probability p1 that it comes up heads as p1 = 0.5. For coin flip number i + 1, let Xi be the number of times heads comes up in the first i trials and let the probability that it comes up heads in trial number i + 1 be p_{i+1} = (p_i + Xi/i)/2.
(a) Write a MatLab code that takes as input a positive integer n and outputs the probability distribution of Xn in the above experiment and its mean value E(Xn). What output do you get for n = 10? How does this compare with the binomial distribution for n = 10 and p = 0.5?
(b) Extend your MatLab code so that it also outputs the variance Var(Xn).
(c) How will the picture change if we adjust the probability p_{i+1} according to the formula p_{i+1} = (p_i + (i − Xi)/i)/2 instead?
4. Continuous random variables

Example 1. Suppose X is a binomial r.v. with distribution (23). Find the most likely value that X can take. What happens to the probability of this value as n → ∞?

For the r.v. X of Example 1, let Y = X/n be the proportion of successes. For very large n, we may (roughly) treat Y as a continuous r.v. that takes arbitrary values from the interval [0, 1], with each individual value having probability zero. For this type of r.v. we can no longer compute the probability of an event E as P(E) = ∑_{e∈E} P(e). Instead, we need a continuous function g on the interval I of possible values of our r.v., called a probability density function, and define

(32)    P(E) = ∫_E g(x) dx.

The simplest continuous distribution is the uniform distribution over an interval I = [a, b], where g(x) = 1/(b − a) for all x ∈ [a, b].

Example 2. Suppose X is uniformly distributed over an interval [a, b] and a ≤ c < d ≤ b. Find P(X ∈ [c, d]).

The uniform distribution is an important tool for generating random numbers in MatLab. The command
>> rand(1)
gives you a random number from the interval [0, 1] with the uniform distribution. You can use this to define a random value of a Bernoulli r.v. with success probability p by entering (if, for example, p = 0.3)
>> p = 0.3;
>> rand(1) < p
The command
>> p = 0.3;
>> rand(10) < p
now gives you a 10 × 10 matrix of random variates with this Bernoulli distribution; in order to get ten random variates for a binomial distribution with parameters n = 10 and p = 0.3, enter
>> p = 0.3;
>> sum(rand(10) < p)

Perhaps the most important continuous distribution is the normal distribution. To be more precise, for every real number µ and positive real number σ there is one normal distribution N(µ, σ) with mean µ and standard deviation σ. Its probability density function is given by the formula

(33)    g(x) = (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²)).

The constant 1/(√(2π) σ) is needed to make sure that ∫_{−∞}^{∞} g(x) dx = 1.

The standard normal distribution N(0, 1) has mean µ = 0 and standard deviation σ = 1. Its density function is given by

(34)    g(x) = (1/√(2π)) e^(−x²/2).

To see why the distribution N(0, 1) is so important, consider a random variable X = ∑_{i=1}^{n} Xi, where the r.v.s Xi are independent and have identical distributions. Let µ = ∑_{i=1}^{n} E(Xi) be the mean of X and let σ² = ∑_{i=1}^{n} Var(Xi) be its variance. Define Z = (X − µ)/σ. The new r.v. Z is called the z-score of X; it measures by how many standard deviations a given value of X differs from the mean of X. The Central Limit Theorem or CLT states that for sufficiently large n the distribution of Z will be well approximated by N(0, 1). Since tables or software for calculating the probability that a r.v. with the standard normal distribution takes on a value from a given interval are readily available, this gives us a convenient tool for answering (at least approximately) the same questions for X. For example, if Z has distribution N(0, 1), then in MatLab you can enter
>> normcdf(a)
to calculate the probability P(Z ≤ a).
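The same command also gives interval probabilities after standardizing. The following minimal MatLab sketch (not part of the original handout; the values of µ, σ, a and b are made up for illustration) computes P(a < X ≤ b) for a normal r.v. X by converting to z-scores:

% Interval probabilities for a normal r.v. via normcdf.  (A minimal sketch
% added for illustration; the values of mu, sigma, a and b are made up.)
mu = 10;  sigma = 2;              % X has distribution N(10, 2)
a = 9;  b = 13;                   % we want P(a < X <= b)

za = (a - mu) / sigma;            % z-scores, as in the definition of Z above
zb = (b - mu) / sigma;
p  = normcdf(zb) - normcdf(za);   % P(a < X <= b) = P(za < Z <= zb)

fprintf('P(%g < X <= %g) = %.4f\n', a, b, p);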
For practice, you may want to do the following exercises:

Exercise 6. Suppose X has distribution N(0, 1). Find the following probabilities:
(a) P(X ≤ 0)
(b) P(X > 2)
(c) P(|X| > 3)
(d) P(−1 < X ≤ 0.5)
(e) P(1 < X ≤ 2.5)

For a normally distributed r.v. the probability that it takes any specific value is exactly zero, but if X is approximately normally distributed and takes integer values, this is no longer the case. To correct for this effect, one uses a so-called continuity correction. For example, to approximate the probability that X > k, one finds the z-score z = (k + 0.5 − µ)/σ for k + 0.5, where µ and σ denote the mean and standard deviation of X, and uses the normal approximation to find the probability that X > k + 0.5. To approximate the probability that X ≥ k, one uses the z-score z = (k − 0.5 − µ)/σ instead.

Exercise 7. Suppose X has a binomial distribution with parameters n = 400 and p = 0.25. Use the CLT to approximate the following probabilities:
(a) P(X ≤ 100)
(b) P(X > 140)
(d) P(90 < X ≤ 115)
(e) P(110 < X ≤ 120)

An important notion in statistics is the so-called p-value. Its definition uses the concept of a null hypothesis (usually an assumption that a given r.v. has a particularly simple distribution or that two r.v.s are independent). One performs an experiment and calculates the probability that the relevant r.v. X takes a value at least as extreme as the observed value, under the assumption that the null hypothesis is true. This probability is the p-value. If the p-value turns out to be lower than a previously specified significance level α, one can feel entitled to reject the null hypothesis. In science, one usually works with α = 0.05, but α = 0.01 or α = 0.001 are also sometimes used.

The proper interpretation of the phrase "at least as extreme as the observed value" usually depends on the context. Suppose you flip a coin 10,000 times and heads comes up 5,100 times. Your "null hypothesis" is the assumption that the coin is unbiased, in which case the number X of heads is a binomial variable with parameters n = 10,000 and p = 0.5. The observed number exceeds the mean by 100. What is the probability of obtaining this value or at least as "extreme" ones? This depends. If we are playing in a casino and heads favor the house, the null hypothesis is really: "The coin is not biased in a way that would favor the house," and the proper interpretation is to consider all values of X that are ≥ 5,100 as "at least as extreme as the observed one." If, however, we have no prior conception of why the coin might be biased one way or the other, we need to consider all values of X that are ≥ 5,100 together with all values that are ≤ 4,900 as "at least as extreme as the observed one."

Exercise 8. Use the technique of the previous exercise to calculate the p-values corresponding to the one and the other interpretation. Will the significance level of 0.05 allow us to reject the null hypothesis?

While widely used in science, the p-value may be misleading in some bioinformatics problems. Let us return to a previous exercise and expand on it.

Exercise 9. Suppose you sequence loci s1, ..., s_{3n} of a DNA strand. Assume that the strand is completely random, with all four nucleotides equally likely and no dependencies between the loci. Let X count the number of i's between 1 and n such that (s_{3i−2}, s_{3i−1}, s_{3i}) is one of the sequences (tga), (taa), (tag).
(a) Find a formula for P(X = k).
(b) Find a formula for P(X = 0) if your sequence represents a fragment of a coding region that is 900 nucleotides long (with the last three positions representing a STOP codon), if you know that s1 represents the first position of a codon in the correct reading frame, but you don't know which of the 300 first positions it is, with all 300 first positions being equally likely.
(c) Let Y be the r.v. that returns the smallest i such that (s_{3i−2}, s_{3i−1}, s_{3i}) is one of the sequences (tga), (taa), (tag).
Then Y returns the number of the trial on which the first "success" occurs. Assuming that we sequence random DNA, what is the distribution of Y? What is the expected value of Y?
(d) Assume that Y as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)). If your null hypothesis is that the genome sequence is random, to what p-value does this observation correspond? Can we reject the null hypothesis at significance level 0.05?
(e) Consider an idealized bacterium b. idealis by making the following slightly false assumptions:
• Every nucleotide belongs to a coding region.
• All coding regions are exactly 900 nucleotides long.
• All six reading frames are equally likely.
• A sequence that is read in the wrong reading frame is completely random, with no dependencies between different loci and all nucleotides equally likely.
Assume again that Y as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)). What is the probability that you have been sequencing from a coding region in the correct reading frame? Hint: Let C be the event "correct reading frame" and let I be the event "incorrect reading frame." The a priori probabilities are P(C) = 1/6, P(I) = 5/6 (why?). Let S be the observed occurrence of the first STOP codon at position 65. Use your work in (a) or (c) to calculate P(S|I); use (b) to calculate P(S|C). Then use Bayes Theorem to calculate the desired probability P(C|S).
(f) Would you draw the same or opposite conclusions from an approach based on the p-value and from an approach based on Bayes Theorem? In each case, would you consider the available evidence compelling or rather weak?
(g) Consider an idealized amoeba a. idealis by making the following slightly false assumptions:
• Only 0.01% of all nucleotides belong to a coding region.
• All coding regions are contiguous (no introns) and exactly 900 nucleotides long.
• All six reading frames are equally likely for the coding regions.
• A sequence that is read in the wrong reading frame or that belongs to an intergenic region is completely random, with no dependencies between different loci and all nucleotides equally likely.
Assume again that Y as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)). What is the probability that you have been sequencing from a coding region and in the correct reading frame? Hint: The new twist here is that much of the genome will no longer be coding in any of the six reading frames.
(h) Would you draw the same or opposite conclusions from an approach based on the p-value and from an approach based on Bayes Theorem? In each case, would you consider the available evidence compelling or rather weak?

Here is an additional practice problem:

Exercise 10. Suppose you perform independent Bernoulli trials with success probability p in an individual trial until you achieve a total of m successes. Let Y be the number of the trial in which the m-th success occurs.
(a) Find the set of all possible values of Y.
(b) Find the probability distribution of Y. Hint: You will need to use your knowledge of both the binomial and the geometric distributions.
(c) Find the expected value and variance of Y. Hint: It is somewhat difficult to do this from the formula you found in (b).
Instead, notice that Y = Y1 + Y2 + ··· + Ym, where Yi is the waiting time for success number i, counting from the trial in which success number i − 1 occurred. Each Yi has the geometric distribution (24), and the r.v.s Yi are independent, so the rules for the expected value and the variance of a sum of independent r.v.s apply.