1. Elements of Probability

1.1. Sample Space and Events

Consider an experiment whose outcome is not known in advance.

• Sample space S: the set of all possible outcomes.
  Flipping a coin: S = {H, T}
  Rolling a die: S = {1, 2, 3, 4, 5, 6}
  Running a race among 7 horses numbered 1 through 7: S = {all orderings of (1, 2, 3, 4, 5, 6, 7)}

• Event: any subset A of the sample space is known as an event.
  Event of getting an H: A = {H}
  Event of getting an even number when rolling a die: A = {2, 4, 6}
  Event that horse number 5 comes first: A = {all outcomes in S starting with 5}

• For any two events A and B, we define the new event A ∪ B, the union of A and B, to consist of all outcomes that are in A, in B, or in both.

• We can also define the intersection of A and B, written AB, to consist of all outcomes that are in both A and B.

• For any event A we define the event A^c, referred to as the complement of A, to consist of all outcomes in the sample space S that are not in A.

• Note that S^c does not contain any outcomes and thus cannot occur. We call S^c the null set and designate it by φ.

• If AB = φ, we say that A and B are mutually exclusive.

1.2. Axioms of Probability

Axiom 1: 0 ≤ P(A) ≤ 1
Axiom 2: P(S) = 1
Axiom 3: For any sequence of mutually exclusive events A_1, A_2, . . .,
    P(∪_{i=1}^{n} A_i) = Σ_{i=1}^{n} P(A_i),   n = 1, 2, . . . , ∞

1.3. Usual Definition

Suppose that an experiment, whose sample space is S, is repeatedly performed under exactly the same conditions. For each event A of the sample space S, define n(A) to be the number of times in the first n repetitions of the experiment that the event A occurs. Then the probability of the event A is defined by
    P(A) = lim_{n→∞} n(A)/n

For some experiments it is natural to assume that all outcomes in the sample space are equally likely to occur. Consider an experiment whose sample space S is a finite set, say S = {1, 2, . . . , N}. Then it is often natural to assume that
    P({1}) = P({2}) = · · · = P({N}),
which implies from Axioms 2 and 3 that
    P({i}) = 1/N,   i = 1, 2, . . . , N.
From this, it follows from Axiom 3 that for any event E
    P(E) = (number of points in E) / (number of points in S)

1.4. Some Simple Propositions

• P(A^c) = 1 − P(A)
• If E ⊂ F, then P(E) ≤ P(F)
• P(E ∪ F) = P(E) + P(F) − P(EF)
• Inclusion-exclusion:
    P(E_1 ∪ E_2 ∪ · · · ∪ E_n) = Σ_i P(E_i) − Σ_{i_1<i_2} P(E_{i_1} E_{i_2}) + · · ·
        + (−1)^{r+1} Σ_{i_1<i_2<···<i_r} P(E_{i_1} E_{i_2} · · · E_{i_r}) + · · · + (−1)^{n+1} P(E_1 E_2 · · · E_n),
  where the sum Σ_{i_1<i_2<···<i_r} P(E_{i_1} E_{i_2} · · · E_{i_r}) is taken over all of the (n choose r) possible subsets of size r of the set {1, 2, . . . , n}.

1.5. Conditional Probability and Independence

Consider flipping a coin twice, and suppose we are interested in the probability that two heads are obtained given that a head landed on the first flip. We call this the conditional probability that A occurs given that B has occurred, and denote it by P(A|B). Here
    P(A|B) = (number of outcomes in both A and B) / (number of outcomes in B) = 1/2

In general, the conditional probability can be computed as
    Pr{A|B} = Pr{AB} / Pr{B}

Example: An urn contains 10 white, 5 yellow, and 10 black marbles. A marble is chosen at random from the urn, and it is noted that it is not one of the black marbles. What is the probability that it is yellow?

Note that P(·|F) is itself a probability.

Total Probability Theorem: If S = ∪_{i=1}^{n} B_i and B_i ∩ B_j = φ for i ≠ j, then
    Pr{A} = Σ_{i=1}^{n} Pr{A|B_i} Pr{B_i}

Example: The credit department of a bank studied the profile of its tax-loan clients. There are three types of clients: 90% of them are excellent, 9% are good, and the remaining 1% are so-so. The default probability of a client is 0.001% if it is an excellent customer, 0.01% if it is good, and 1% if it is so-so. What is the overall default probability of the clients of this bank?
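As a quick check on the arithmetic, here is a minimal Python sketch of the total probability computation for the bank example; the dictionary names are illustrative, and the proportions and default rates are the ones stated above.

```python
# Total Probability Theorem: Pr{default} = sum_i Pr{default | type_i} * Pr{type_i}
prior = {"excellent": 0.90, "good": 0.09, "so-so": 0.01}                  # client mix
p_default_given = {"excellent": 0.00001, "good": 0.0001, "so-so": 0.01}   # 0.001%, 0.01%, 1%

p_default = sum(p_default_given[t] * prior[t] for t in prior)
print(p_default)   # 0.000118, i.e. about 0.0118%
```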
Bayes' Theorem: If S = ∪_{i=1}^{n} B_i and B_i ∩ B_j = φ for i ≠ j, then
    Pr{B_i|A} = Pr{A|B_i} Pr{B_i} / Σ_{j=1}^{n} Pr{A|B_j} Pr{B_j},   i = 1, 2, . . . , n

Given that a client defaulted on the loan, what is the probability that it is an excellent customer? A good customer? A so-so customer?

Independence

Events A and B are independent if
    Pr{A|B} = Pr{A},
or equivalently,
    Pr{AB} = Pr{A} Pr{B}

Example: A card is selected at random from an ordinary deck of 52 playing cards. Let E be the event that the selected card is an ace and F the event that it is a spade. Are E and F independent?

If E and F are independent, then so are E and F^c.

The three events E, F and G are said to be independent if
    P(EFG) = P(E)P(F)P(G),
    P(EF) = P(E)P(F),
    P(EG) = P(E)P(G),
    P(FG) = P(F)P(G)

The events E_1 and E_2 are conditionally independent given F if
    P(E_1 E_2 | F) = P(E_1|F) P(E_2|F)

Example: An individual tried by a 3-judge panel is declared guilty if at least 2 judges cast votes of guilty. Suppose that when the defendant is, in fact, guilty, each judge will independently vote guilty with probability 0.7, whereas when the defendant is, in fact, innocent, this probability drops to 0.2. If 70 percent of defendants are guilty, compute the conditional probability that judge number 3 votes guilty given that:
(a) judges 1 and 2 vote guilty;
(b) judges 1 and 2 cast 1 guilty and 1 innocent vote;
(c) judges 1 and 2 both cast innocent votes.
Let E_i, i = 1, 2, 3, denote the event that judge i casts a guilty vote. Are these events independent? Are they conditionally independent?
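A minimal Python sketch of the judge-panel example, conditioning on the defendant's status and using the stated probabilities (0.7, 0.2, and a 70% prior of guilt); the function and variable names are purely illustrative.

```python
p_guilty = 0.7                     # prior probability that the defendant is guilty
p_vote = {"G": 0.7, "I": 0.2}      # P(a judge votes guilty | defendant guilty / innocent)

def joint(v1, v2, v3):
    """P(judges 1, 2, 3 cast votes v1, v2, v3), each v being 1 (guilty) or 0 (not guilty)."""
    total = 0.0
    for status, p_status in (("G", p_guilty), ("I", 1 - p_guilty)):
        p = p_vote[status]
        total += p_status * ((p if v1 else 1 - p) *
                             (p if v2 else 1 - p) *
                             (p if v3 else 1 - p))
    return total

def p3_given(v1, v2):
    """P(judge 3 votes guilty | judges 1 and 2 cast votes v1, v2)."""
    return joint(v1, v2, 1) / (joint(v1, v2, 0) + joint(v1, v2, 1))

print(p3_given(1, 1))   # (a) both vote guilty      -> about 0.683
print(p3_given(1, 0))   # (b) one guilty, one not   -> about 0.577 (same for (0, 1) by symmetry)
print(p3_given(0, 0))   # (c) both vote not guilty  -> about 0.324

# The votes are NOT unconditionally independent (mixing over the defendant's status
# induces dependence), but they ARE conditionally independent given that status.
p1 = sum(joint(1, v2, v3) for v2 in (0, 1) for v3 in (0, 1))   # P(E1) = 0.55
p12 = sum(joint(1, 1, v3) for v3 in (0, 1))                    # P(E1 E2) = 0.355
print(p12, p1 * p1)     # 0.355 vs 0.3025: unequal, hence not independent
```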
1.6. Random Variables

A random variable is a numerical description of the outcome of an experiment.
  Flipping a coin: let X = 1 if H faces up, 0 otherwise
  The number of heads faced up in 10 flips of a coin
  Rolling a die and letting X be the number faced up
  Picking a student in the class and letting X be the height of the student
  In a queue, the number of customers arriving in one hour
  The waiting time of a customer in the queue

A random variable can be discrete, continuous, or a mixture of the two:
  A discrete random variable can take only a countable number of values
  A continuous random variable can take an uncountable number of values

To describe the behavior of a random variable we can use

(a) The probability distribution function
    F(x) = Pr(X ≤ x)
Properties:
  i. F is a nondecreasing function; that is, if a < b, then F(a) ≤ F(b)
  ii. lim_{x→−∞} F(x) = 0
  iii. lim_{x→∞} F(x) = 1
  iv. F is right continuous. That is, for any b and any decreasing sequence b_n, n ≥ 1, that converges to b, lim_{n→∞} F(b_n) = F(b).

(b) The probability mass function or probability density function

If X is discrete, then the probability mass function f(x) is
    f(x) = Pr(X = x)
Suppose X can take the values x_1, x_2, . . . , x_n, . . .; then
    Σ_{i=1}^{∞} f(x_i) = 1   and   F(a) = Σ_{i: x_i ≤ a} f(x_i)

If X is continuous, then the probability density function is the function f such that
    Pr(X ∈ C) = ∫_C f(x) dx
Because of Axiom 2,
    ∫_{−∞}^{∞} f(x) dx = 1
The relationship between the cumulative distribution F(·) and the probability density function f(·) is expressed by
    F(a) = Pr{X ∈ (−∞, a]} = ∫_{−∞}^{a} f(x) dx
Differentiating both sides yields
    (d/da) F(a) = f(a)
An intuitive interpretation:
    Pr(a − ε/2 ≤ X ≤ a + ε/2) ≈ ε f(a)   when ε is small

Examples:
  Let X be the number of heads faced up in 5 flips of a fair coin. Then Pr(X = i) = (5 choose i)(1/2)^5, i = 0, 1, . . . , 5.
  Let f(x) = e^{−x} for x > 0 and 0 otherwise. Then f(x) is a density function.

The distribution of a function of a random variable

Suppose a random variable X has density function f_X(x) and cdf F_X(x). Now let Y = w(X), where w(·) is continuous and either increasing or decreasing for a < x < b. Suppose also that a < x < b if and only if α < y < β, and let X = w^{−1}(Y) be the inverse function for α < y < β. Then the cdf of Y is
    F_Y(y) = F_X(w^{−1}(y)),   α < y < β
and the density function of Y is
    f_Y(y) = f_X(w^{−1}(y)) |dx/dy|,   α < y < β

Sometimes we are interested not only in an individual variable but in two or more variables. To specify the relationship between two random variables, we define the joint cumulative distribution function of X and Y by
    F(x, y) = Pr{X ≤ x, Y ≤ y}
The distribution of X can be obtained from the joint distribution:
    F_X(x) = Pr{X ≤ x} = Pr{X ≤ x, Y < ∞} = Pr{lim_{y→∞} {X ≤ x, Y ≤ y}} = lim_{y→∞} Pr{X ≤ x, Y ≤ y} = F(x, ∞)
Similarly we can obtain the distribution of Y. All joint probability statements about X and Y can be answered in terms of F(x, y).
Example: Pr{X > a, Y > b}

If X and Y are both discrete, then we can define the joint probability mass function by
    f(x, y) = Pr{X = x, Y = y}
If X takes the values x_1, x_2, . . . , x_n and Y takes the values y_1, y_2, . . . , y_m, the joint probability mass function can easily be expressed in tabular form.

Example: Consider an experiment in which we flip a fair coin and toss a die independently. Let X = 1 if a head comes up in the coin flip, 0 otherwise, and let Y be the number faced up in the toss of the die. Then the joint probability mass function of X and Y is given by (each entry equals (1/2) × (1/6) = 1/12):

              X = 0    X = 1    Pr{Y = j}
    Y = 1      1/12     1/12      1/6
    Y = 2      1/12     1/12      1/6
    Y = 3      1/12     1/12      1/6
    Y = 4      1/12     1/12      1/6
    Y = 5      1/12     1/12      1/6
    Y = 6      1/12     1/12      1/6
    Pr{X = i}   1/2      1/2       1

The (marginal) probability mass function of X can be obtained by
    Pr(X = x_i) = f(x_i) = Σ_{j=1}^{m} f(x_i, y_j)   (why?)

Similarly, if X and Y are both continuous, then the joint probability density function f(x, y) is the function such that
    Pr{X ∈ C, Y ∈ D} = ∫∫_{{(x,y): x ∈ C, y ∈ D}} f(x, y) dx dy
Since
    F(a, b) = Pr{X ∈ (−∞, a], Y ∈ (−∞, b]} = ∫_{−∞}^{a} ∫_{−∞}^{b} f(x, y) dy dx,
we have
    f(a, b) = ∂²F(a, b) / ∂x ∂y
whenever the derivative exists. If X and Y are jointly continuous, they are individually continuous, and the probability density function of X is
    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
and the density function of Y is
    f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx

Conditional Distributions

Discrete case: Recall the definition of the conditional probability of E given F:
    Pr(E|F) = Pr(EF) / Pr(F)
If X and Y are discrete random variables, define the conditional probability mass function of X given Y = y by
    f(x|y) = Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y) = f(x, y) / f_Y(y)
for all values of y such that f_Y(y) > 0. Similarly, define the conditional cumulative distribution function of X given Y = y by
    F_{X|Y}(x|y) = Pr(X ≤ x | Y = y) = Σ_{a ≤ x} f(a|y),
where f_Y(y) > 0.
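As a small illustration of these definitions, here is a Python sketch that stores the coin-and-die joint table above and recovers the marginal and conditional probability mass functions from it; the dictionary layout is an arbitrary choice made for the example.

```python
from fractions import Fraction

# Joint pmf f(x, y) for the independent coin flip (X = 0, 1) and die toss (Y = 1, ..., 6)
joint = {(x, y): Fraction(1, 2) * Fraction(1, 6) for x in (0, 1) for y in range(1, 7)}

# Marginal pmfs: f_X(x) = sum_y f(x, y)   and   f_Y(y) = sum_x f(x, y)
f_X = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
f_Y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in range(1, 7)}
print(f_X)        # both values are Fraction(1, 2), i.e. 1/2
print(f_Y[3])     # Fraction(1, 6)

# Conditional pmf f(x | y) = f(x, y) / f_Y(y); here it equals f_X(x), since X and Y are independent
f_X_given_Y3 = {x: joint[(x, 3)] / f_Y[3] for x in (0, 1)}
print(f_X_given_Y3)   # both values are Fraction(1, 2)
```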
Example: Suppose that f(x, y), the joint probability mass function of X and Y, is given by
    f(0, 0) = 0.45,  f(0, 1) = 0.05,  f(1, 0) = 0.05,  f(1, 1) = 0.45
Find the marginal distribution of X and the conditional distribution of X given Y = 0 and Y = 1.

Continuous case: If X and Y have the joint probability density function f(x, y), define the conditional density function of X given Y = y by
    f(x|y) = f(x, y) / f_Y(y)
This is consistent with the discrete case:
    f(x|y) dx = Pr(x ≤ X < x + dx | y ≤ Y < y + dy)
              = Pr(x ≤ X < x + dx, y ≤ Y < y + dy) / Pr(y ≤ Y < y + dy)
              = f(x, y) dx dy / (f_Y(y) dy)
The conditional cumulative distribution function of X given Y = y is
    F_{X|Y}(a|y) = Pr(X ≤ a | Y = y) = ∫_{−∞}^{a} f(x|y) dx
Two random variables are independent if
    f(x, y) = f_X(x) f_Y(y)

Examples:

Suppose X, Y have the joint density function f(x, y) = 1, 0 < x < 1, 0 < y < 1. We can compute the marginal density function of X by integrating y out:
    f_X(x) = ∫_0^1 f(x, y) dy = ∫_0^1 1 dy = y |_{y=0}^{1} = 1,   0 < x < 1
Similarly, the marginal density of Y is f_Y(y) = 1, 0 < y < 1. Since f(x, y) = f_X(x) f_Y(y), X and Y are independent.
Now suppose we want to find Pr(X > Y). Note that X > Y corresponds to the region {(x, y) : 0 < y < x < 1}. Therefore,
    Pr(X > Y) = ∫∫_{{(x,y): 0 < y < x < 1}} f(x, y) dx dy = ∫_0^1 ∫_0^x 1 dy dx = ∫_0^1 y |_{y=0}^{x} dx = ∫_0^1 x dx = x²/2 |_0^1 = 1/2

Let X and Y have the joint pdf
    f_{X,Y}(x, y) = 2e^{−(x+y)},   0 < x < y < ∞
Then the marginal density of X is
    f_X(x) = ∫_x^∞ f_{X,Y}(x, y) dy = ∫_x^∞ 2e^{−(x+y)} dy = 2e^{−x} ∫_x^∞ e^{−y} dy = 2e^{−x} (−e^{−y}) |_x^∞ = 2e^{−x} e^{−x} = 2e^{−2x},   0 < x < ∞
The marginal density of Y is
    f_Y(y) = ∫_0^y f_{X,Y}(x, y) dx = ∫_0^y 2e^{−(x+y)} dx = 2e^{−y} ∫_0^y e^{−x} dx = 2e^{−y} (−e^{−x}) |_0^y = 2e^{−y} (1 − e^{−y}),   0 < y < ∞
The conditional density of X given Y = y is
    f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = 2e^{−(x+y)} / (2e^{−y}(1 − e^{−y})) = e^{−x} / (1 − e^{−y}),   0 < x < y
Similarly, the conditional density of Y given X = x is
    f_{Y|X}(y|x) = e^{−y} / e^{−x},   x < y < ∞

We can also compute the conditional mean of X given Y = y:
    E(X|Y = y) = ∫_0^y x f_{X|Y}(x|y) dx = ∫_0^y x e^{−x} / (1 − e^{−y}) dx = (1/(1 − e^{−y})) ∫_0^y x e^{−x} dx
Integration by parts, letting u = x and dv = e^{−x} dx so that du = dx and v = −e^{−x}, gives
    ∫_0^y x e^{−x} dx = [−x e^{−x}]_0^y + ∫_0^y e^{−x} dx = −y e^{−y} + [−e^{−x}]_0^y = 1 − e^{−y} − y e^{−y}
Therefore,
    E(X|Y = y) = (1 − e^{−y} − y e^{−y}) / (1 − e^{−y})
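A minimal Python sketch that checks this closed form for E(X|Y = y) against a direct numerical integration of x e^{−x}/(1 − e^{−y}) over (0, y); the midpoint-rule step count and the test values of y are arbitrary choices.

```python
import math

def cond_mean_closed(y):
    """E(X | Y = y) = (1 - e^{-y} - y e^{-y}) / (1 - e^{-y})."""
    return (1 - math.exp(-y) - y * math.exp(-y)) / (1 - math.exp(-y))

def cond_mean_numeric(y, n=100_000):
    """Midpoint-rule approximation of  (1/(1 - e^{-y})) * ∫_0^y x e^{-x} dx."""
    h = y / n
    integral = sum((i + 0.5) * h * math.exp(-(i + 0.5) * h) for i in range(n)) * h
    return integral / (1 - math.exp(-y))

for y in (0.5, 2.0, 10.0):
    print(y, cond_mean_closed(y), cond_mean_numeric(y))   # the two columns should agree closely
```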
(c) Expected Values and Variance

• If X is discrete, taking the values x_1, x_2, . . ., then the expectation (expected value, or mean) of X is defined by
    E(X) = Σ_i x_i Pr{X = x_i}
Examples: If the probability mass function of X is given by f(0) = f(1) = 1/2, then
    E(X) = 0 · (1/2) + 1 · (1/2) = 1/2
If I is the indicator variable for the event A, that is, I = 1 if A occurs and 0 otherwise, then
    E(I) = 1 · P(A) + 0 · P(A^c) = P(A)
Therefore, the expectation of the indicator variable of an event is just the probability that the event occurs.

• If X is continuous with pdf f(x), then the expected value of X is
    E(X) = ∫_{−∞}^{∞} x f(x) dx
Example: If X has the pdf f(x) = 3x² for 0 < x < 1 and 0 otherwise, then the expected value of X is
    E(X) = ∫_0^1 x · 3x² dx = 3/4

• Sometimes we are interested not in the expectation of X itself but in the expectation of a function g(X). We then need the following results. If X is discrete with pmf f(x_i), then
    E(g(X)) = Σ_i g(x_i) f(x_i)
and if X is continuous with pdf f(x), then
    E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx
If a and b are constants, then
    E(aX + b) = a E(X) + b
If X_1 and X_2 are two random variables, then
    E(X_1 + X_2) = E(X_1) + E(X_2)

• Variance. To measure the spread of the values in the distribution, we use the variance. If X is a random variable with mean µ, then the variance of X is defined by
    Var(X) = E[(X − µ)²]
Alternative formula:
    Var(X) = E(X²) − µ²
For any constants a and b,
    Var(aX + b) = a² Var(X)

• If we have two random variables X_1 and X_2 and want to measure their dependence structure, we can use the covariance
    Cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)],   where µ_i = E(X_i), i = 1, 2.
Alternative formula:
    Cov(X_1, X_2) = E(X_1 X_2) − µ_1 µ_2
Also,
    Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2 Cov(X_1, X_2)
If X_1 and X_2 are independent, then Cov(X_1, X_2) = 0.
Correlation coefficient: a measure of the linear relationship between two random variables,
    ρ = Corr(X, Y) = Cov(X, Y) / ( √Var(X) √Var(Y) ),   with −1 ≤ ρ ≤ 1.

(d) Some Inequalities and Laws of Large Numbers

Markov's Inequality: If X takes on only nonnegative values, then for any value a > 0,
    Pr(X ≥ a) ≤ E[X] / a

Corollary (Chebyshev's Inequality): If X is a random variable having mean µ and variance σ², then for any value k > 0,
    Pr(|X − µ| ≥ kσ) ≤ 1/k²

Corollary (One-sided Chebyshev Inequality): If X is a random variable having mean 0 and variance σ², then for any value a > 0,
    Pr(X > a) ≤ σ² / (σ² + a²)

The Weak Law of Large Numbers: Let X_1, X_2, . . . be a sequence of independent and identically distributed random variables having mean µ. Then for any ε > 0,
    Pr( |(X_1 + · · · + X_n)/n − µ| > ε ) → 0   as n → ∞

A generalization, the Strong Law of Large Numbers: with probability 1,
    lim_{n→∞} (X_1 + · · · + X_n)/n = µ
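A short Python sketch illustrating the law of large numbers: the sample average of i.i.d. fair-coin indicators settles toward µ = 0.5 as n grows. The sample sizes and the seed are arbitrary choices made for the demonstration.

```python
import random

random.seed(0)

def sample_average(n):
    """Average of n i.i.d. Bernoulli(1/2) indicators (fair-coin flips)."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 100, 10_000, 1_000_000):
    print(n, sample_average(n))   # the averages approach mu = 0.5 as n grows
```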
(e) Some Discrete Random Variables

• Bernoulli Random Variable
  Only two possible outcomes: success or failure, with p = Pr{success}. The probability mass function of X is
      Pr{X = x} = p^x (1 − p)^{1−x},   x = 0, 1

• Binomial Random Variable
  If we perform the Bernoulli trial n times independently and count the number of successes, we have a binomial random variable. Its probability mass function is
      Pr{X = x} = (n choose x) p^x (1 − p)^{n−x},   x = 0, 1, . . . , n

• Poisson Random Variable
  If n is very large and np → λ, a constant, in the binomial setup, we obtain the Poisson random variable.
  Example: the number of customers visiting a shop.
  The probability mass function is
      Pr{X = x} = e^{−λ} λ^x / x!,   x = 0, 1, 2, . . .
  E(X) = λ and Var(X) = λ

• Geometric Random Variable
  If X is the number of the trial at which the first success occurs in a sequence of Bernoulli trials, then X is a geometric random variable. The p.m.f. is
      Pr{X = n} = p(1 − p)^{n−1},   n ≥ 1
  E(X) = 1/p and Var(X) = (1 − p)/p²

• Negative Binomial Random Variable
  If X is the number of the trial at which the r-th success occurs in a sequence of Bernoulli trials, then X is a negative binomial random variable. The p.m.f. is
      Pr{X = n} = (n−1 choose r−1) p^r (1 − p)^{n−r},   n ≥ r
  Note that the negative binomial random variable can be thought of as the sum of r geometric random variables; therefore
      E(X) = r/p and Var(X) = r(1 − p)/p²

• Hypergeometric Random Variable
  Consider an urn containing N + M balls, of which N are light coloured and M are dark coloured. A sample of size n is randomly chosen, and X is the number of light coloured balls selected. Then X is a hypergeometric random variable with p.m.f.
      Pr{X = i} = (N choose i)(M choose n−i) / (N+M choose n)
  E(X) = nN/(N + M) and Var(X) = [nNM/(N + M)²] [1 − (n − 1)/(N + M − 1)]

• Discrete Uniform Random Variable
  Let X be a number picked at random from {1, 2, . . . , m}. Then X is called a discrete uniform random variable over (1, m) with p.m.f.
      Pr{X = i} = 1/m,   i = 1, 2, . . . , m.

(f) Some Continuous Random Variables

• Uniform Random Variable
  X ∼ U(a, b) if its density function is given by
      f(x) = 1/(b − a) for a < x < b, and 0 otherwise
  The distribution function of X is given, for a < x < b, by
      F(x) = Pr{X ≤ x} = ∫_a^x (b − a)^{−1} dt = (x − a)/(b − a)
  E(X) = (a + b)/2 and Var(X) = (b − a)²/12

• Exponential Random Variable
  A random variable X is exponentially distributed if its density function is
      f(x) = λ e^{−λx},   x > 0
  Its cumulative distribution function is
      F(x) = ∫_0^x λ e^{−λy} dy = 1 − e^{−λx},   x > 0
  Memoryless property:
      Pr{X > s + t | X > t} = Pr{X > s}
  It is commonly used for modelling the inter-arrival time of customers in queueing theory.

• Normal Random Variable
  Density function:
      f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞
  The cumulative distribution function F(x) = ∫_{−∞}^{x} (1/√(2πσ²)) e^{−(t−µ)²/(2σ²)} dt has no closed form.

• Log-Normal Random Variable
  A random variable X is said to be log-normally distributed if Y = log(X) is a normally distributed random variable. The density function of X is given by
      f(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²)) for x > 0, and 0 otherwise
  It is used for modelling the price of a stock and is very useful in finance.

• Weibull Random Variable
  A random variable X is Weibull distributed if its density function is of the form
      f(x) = (β/α) ((x − L)/α)^{β−1} exp(−((x − L)/α)^β) for x > L, and 0 otherwise
  It is useful in life and fatigue tests and for equipment lifetimes. L is the so-called guarantee parameter. If L = 0 and β = 1, it reduces to the exponential distribution.

• Gamma Random Variable
  The density function is
      f(x) = (1/(λ(k − 1)!)) (x/λ)^{k−1} e^{−x/λ},   x > 0
  It can be shown that, for integer k, this is the distribution of the sum of k independent exponential random variables, each with mean λ. It is very useful in queueing, insurance risk theory and inventory control.

• Beta Random Variable
  The density function is
      f(x) = [(α + β − 1)! / ((α − 1)!(β − 1)! s)] (x/s)^{α−1} (1 − x/s)^{β−1},   0 < x < s
  It is very useful for modelling variation over a fixed interval from 0 to a positive constant s.

• Mixture Distributions
  A random variable Y is a k-point mixture of the random variables X_1, X_2, . . . , X_k if its cdf is given by
      F_Y(y) = a_1 F_{X_1}(y) + a_2 F_{X_2}(y) + · · · + a_k F_{X_k}(y),
  where all a_i > 0 and a_1 + a_2 + · · · + a_k = 1. Note: a_i is the mixing proportion.

• Random Walk
  – Definition: Let Z_t, t = 0, 1, 2, . . . be a sequence of random variables such that
      i. Z_0 = 0
      ii. Z_t = X_1 + X_2 + · · · + X_t, t = 1, 2, . . .
      iii. Pr(X_i = 1) = 1/2 and Pr(X_i = −1) = 1/2
      iv. X_1, X_2, . . . , X_t, . . . are independent
    Then {Z_t} is a symmetric random walk.
  – The path: [figure omitted: a zig-zag sample path of the symmetric random walk, moving up or down by 1 at each step]
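To visualise the definition, here is a minimal Python sketch that simulates a few sample paths of the symmetric random walk; the number of steps, the number of paths and the seed are arbitrary choices.

```python
import random

random.seed(1)

def random_walk(t):
    """Return the path [Z_0, Z_1, ..., Z_t] of a symmetric random walk."""
    z, path = 0, [0]
    for _ in range(t):
        z += random.choice((1, -1))   # X_i = +1 or -1, each with probability 1/2
        path.append(z)
    return path

for _ in range(3):
    print(random_walk(10))            # e.g. [0, 1, 0, -1, 0, ...]
```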
  – Theorem:
      i. E(Z_t) = 0
      ii. Var(Z_t) = t
      iii. If t > s, then Z_s and Z_t − Z_s are independent
      iv. Pr(Z_t = k) = (t choose (t+k)/2) (1/2)^t if t + k is even and −t ≤ k ≤ t, and 0 if t + k is odd or |k| > t

  – Proof:
      i. E(Z_t) = E(X_1) + · · · + E(X_t) = 0 + 0 + · · · + 0 = 0
      ii. Var(Z_t) = Var(X_1) + Var(X_2) + · · · + Var(X_t) = 1 + 1 + · · · + 1 (t terms) = t
      iii. Z_s = X_1 + X_2 + · · · + X_s and Z_t − Z_s = X_{s+1} + · · · + X_t involve disjoint sets of the independent X_i, hence they are independent.
      iv. Suppose there are l up-steps and m down-steps within the t steps. Then l + m = t and l − m = k, so l = (t + k)/2. Therefore
          Pr(Z_t = k) = Pr( B(t, 1/2) = (t + k)/2 ) = (t choose (t+k)/2) (1/2)^{(t+k)/2} (1/2)^{(t−k)/2} = (t choose (t+k)/2) (1/2)^t
          when t + k is even and −t ≤ k ≤ t, and 0 otherwise.

  – Examples: Find
      i. Pr(Z_3 = 1 ∩ Z_8 = 4)
      ii. Pr(Z_8 = 4 | Z_3 = 1)
      iii. E(Z_t²)
      iv. E(Z_3 Z_8)
      v. Cov(Z_3, Z_8)
      vi. E(Z_T² | Z_t = k) for a given T > t
      vii. Pr(τ_3 = 5), where τ_k = inf{t : Z_t = k}

  – Solutions:
      i. Pr(Z_3 = 1 ∩ Z_8 = 4) = Pr(Z_3 = 1 ∩ Z_8 − Z_3 = 4 − 1) = Pr(Z_3 = 1) Pr(Z_8 − Z_3 = 3) = Pr(Z_3 = 1) Pr(Z̃_5 = 3)
         = [ (3 choose 2) (1/2)^3 ] [ (5 choose 4) (1/2)^5 ],
         where Z_8 − Z_3 forms a new random walk Z̃_{8−3} = Z̃_5.
      ii. Pr(Z_8 = 4 | Z_3 = 1) = Pr(Z_8 = 4 ∩ Z_3 = 1) / Pr(Z_3 = 1)
          = Pr(Z_8 − Z_3 = 3 ∩ Z_3 = 1) / Pr(Z_3 = 1)
          = Pr(Z_8 − Z_3 = 3) Pr(Z_3 = 1) / Pr(Z_3 = 1)
          = Pr(Z̃_5 = 3) = (5 choose 4) (1/2)^5
      iii. E(Z_t²) = Var(Z_t) + (E(Z_t))² = t + 0 = t
      iv. E(Z_3 Z_8) = E((Z_8 − Z_3 + Z_3) Z_3) = E((Z_8 − Z_3) Z_3 + Z_3²) = E(Z_8 − Z_3) E(Z_3) + E(Z_3²) = 0 × 0 + 3 = 3
      v. Cov(Z_3, Z_8) = E(Z_3 Z_8) − E(Z_3) E(Z_8) = 3 − 0 × 0 = 3,
         or Cov(Z_3, Z_8) = Cov(Z_3, Z_3 + Z_8 − Z_3) = Cov(Z_3, Z_3) + Cov(Z_3, Z_8 − Z_3) = Var(Z_3) + 0 = 3
      vi. E(Z_T² | Z_t = k) = E((k + Z_T − Z_t)² | Z_t = k) = E((k + Z_T − Z_t)²)
          = E(k² + (Z_T − Z_t)² + 2k(Z_T − Z_t)) = k² + E(Z̃²_{T−t}) + 2k E(Z̃_{T−t}) = k² + T − t
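A small Python sketch that checks a couple of these answers by exact enumeration over all 2^8 equally likely step sequences; the walk length 8 is chosen to match the example.

```python
from itertools import product
from fractions import Fraction

paths = list(product((1, -1), repeat=8))   # all 2^8 equally likely sequences of +/-1 steps
p = Fraction(1, 2) ** 8                    # probability of each individual sequence

def z(path, t):
    """Z_t for a given sequence of +/-1 steps."""
    return sum(path[:t])

# i.  Pr(Z_3 = 1 and Z_8 = 4); the closed form is C(3,2)(1/2)^3 * C(5,4)(1/2)^5 = 15/256
print(sum(p for path in paths if z(path, 3) == 1 and z(path, 8) == 4))

# iv./v.  E(Z_3 Z_8) = Cov(Z_3, Z_8) = 3  (since E(Z_3) = E(Z_8) = 0)
print(sum(p * z(path, 3) * z(path, 8) for path in paths))
```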
© Copyright 2024