İnsan TUNALI
Econ 311 – Econometrics I
Lectures 3-5, 11 February 2014 (Revised Feb. 20th)

REVIEW OF PROBABILITY AND STATISTICS
Stock & Watson, Ch. 2 [Goldberger Ch. 3-4]

The probability framework for statistical inference:
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

(a) Population, random variable, and distribution
(b) Moments of a distribution (mean, variance, standard deviation, covariance, correlation)
(c) Conditional distributions and conditional means
(d) Distribution of a sample of data drawn randomly from a population: Y1, …, Yn (subject of another handout)

From the syllabus: "The prerequisites for ECON 311 include MATH 201 (Statistics), and ECON 201 (Intermediate Microeconomics). Students who got a grade of C− or below in MATH 201 are strongly advised to work independently to make up for any deficiency they have during the first two weeks of the semester."

(a) Population, random variable, and distribution

Population
• The group or collection of all possible entities of interest.
• We will think of populations as being "very big" (∞ is an approximation to "very big").

Random variable Y
• Numerical summary of a random outcome.

Population distribution of Y
• Discrete case: the probabilities of the different values of Y that occur in the population.
• Continuous case: the likelihood of particular ranges of Y.

How to envision discrete probability distributions — the urn model: Population = balls in an urn. (Outcomes… sample space… events… MATH 201.) Each ball has a value (Y) written on it; Y has K distinct values: y1, y2, ..., yi, ..., yK. Suppose we were to sample from this (univariate) population, with replacement, infinitely many times…

(Population) Distribution of Y: pi = Pr(Y = yi), i = 1, 2, …, K.
Gives the proportion of times we encounter a ball with value Y = yi, i = 1, 2, …, K.
Alternate notation: f(y) = Pr(Y = y); the "probability function (p.f.) of Y."
Clearly pi ≥ 0 for all i, and Σi pi = 1. Convention: Σi ≡ the sum over i = 1, …, K.

Examples:
> Gender: M (= 0) / F (= 1). We prefer the numerical representation…
> Standing: freshman (= 1), sophomore (= 2), junior (= 3), senior (= 4).
> Ranges of wages (group them in intervals first).

Cumulative distribution function (c.d.f.) of Y: F(y) = Pr(Y ≤ y) = Σ{i: yi ≤ y} f(yi) = Σ{i: yi ≤ y} pi.
That is, to find F(y) we sum the pi's over all values yi that do not exceed y.
Use of the c.d.f.: Pr(a < Y ≤ b) = F(b) – F(a).

Important features of the distribution of Y:

(Population) Mean of Y: µY = E(Y) = Σi yi pi. (Here Σi ≡ the sum over i = 1, …, K.)
Also known as "the expected value of Y" or simply the "expectation of Y."
Remark: Expectation is a weighted average of the values of Y, where the weights are the probabilities with which the distinct values occur.

The idea of "weighted averaging" can be extended to functions of Y. Suppose Z = h(Y), any function of Y. Then the expected value of h(Y) is: E(Z) = E[h(Y)] = Σi h(yi) pi. Thus knowledge of the probability distribution of Y is sufficient for calculating the expectation of functions of Y as well.

Examples:
(i) Take Z = Y². Then E(Z) = E(Y²) = Σi yi² pi.
(ii) Take Z = (Y – µY)². Then E(Z) = E[(Y – µY)²] = Σi (yi – µY)² pi. With this choice of h(Y), we get the:

(Population) Variance of Y: σY² = V(Y) = E[(Y – µY)²] = Σi (yi – µY)² pi.
In words, the variance equals the expected value of (or the expectation of) "the squared deviation of Y from its mean."
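A minimal Python sketch of these weighted-average calculations (not part of the original handout; the values and probabilities are hypothetical, chosen only for illustration):

```python
# Moments of a discrete distribution, computed as weighted averages.
# Hypothetical p.f.: values y_i and probabilities p_i = Pr(Y = y_i).
values = [1, 2, 3, 4]
probs  = [0.1, 0.2, 0.4, 0.3]          # must sum to 1

mean_Y = sum(y * p for y, p in zip(values, probs))                 # E(Y) = sum_i y_i p_i
EY2    = sum(y ** 2 * p for y, p in zip(values, probs))            # E(Y^2) = sum_i y_i^2 p_i
var_Y  = sum((y - mean_Y) ** 2 * p for y, p in zip(values, probs)) # V(Y) = E[(Y - mu_Y)^2]

print("E(Y)   =", mean_Y)
print("E(Y^2) =", EY2)
print("V(Y)   =", var_Y)
print("E(Y^2) - [E(Y)]^2 =", EY2 - mean_Y ** 2)   # equals V(Y); this shortcut is derived below
```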
Example: Suppose the random variable Y can take on one of two values, y0 = 0 and y1 = 1, with probabilities p0 and p1. Since p0 + p1 = 1, we may take Pr(Y = 1) = p and Pr(Y = 0) = 1 – p, 0 < p < 1. We say Y has a "Bernoulli distribution with parameter p" and write: Y ~ Bernoulli(p).

For Y ~ Bernoulli(p):
µY = E(Y) = Σi yi pi = (0)(1 – p) + (1)(p) = p;
E(Y²) = Σi yi² pi = (0)²(1 – p) + (1)²(p) = p;
σY² = V(Y) = Σi (yi – µY)² pi = (0 – p)²(1 – p) + (1 – p)²(p) = … = p(1 – p). ///

(iii) Linear functions: Take Z = a + bY, where a and b are constants.
E(Z) = E(a + bY) = Σk (a + b yk) pk = a Σk pk + b Σk yk pk = a + b E(Y).
In words, the expectation of a linear function (of Y) equals the linear function of the expectation (of Y).

Useful algebra: Let Y* = Y – µY, the deviation of Y from its population mean. This function is linear in Y, as in (iii), with a = –µY and b = 1. Hence
E(Y*) = –µY + (1)E(Y) = 0.
In words, the expectation of a deviation around the mean is zero.

Next, examine Y*² = (Y – µY)² = Y² + µY² – 2YµY; this function is not linear in Y.
E(Y*²) = E(Y² + µY² – 2YµY)
(*)    = E(Y²) + µY² – 2µY E(Y)
       = E(Y²) + µY² – 2µY² = E(Y²) – [E(Y)]².
In line (*) we exploited the fact that E(.), which involves weighted averaging, is a "linear" operator; thus the expectation of a sum equals the sum of the expectations.

From (ii), V(Y) = E(Y*²); thus V(Y) = E(Y²) – [E(Y)]².
In words, the variance of Y equals the "expectation of squared Y" minus the "square of expected Y."

Finally, let Z = a + bY as in (iii), and consider the deviation of Z from its mean:
Z* = Z – E(Z) = a + bY – [a + bE(Y)] = bY*.
It follows that the variance of Z is related to the variance of Y via:
V(Z) = E(Z*²) = E[(bY*)²] = E(b²Y*²) = b² E(Y*²) = b²V(Y).
In words, the variance of a linear function equals the slope squared times the variance of Y.

Exercise: Well-drilling project. Based on previous experience, a contractor believes he will find water within 1-5 days, and attaches a probability to each possible outcome. Let T denote the (random amount of) time it takes to complete drilling. The probability distribution (p.f.) of T is:

t = time (days):       1     2     3     4     5
Pr(T = t) = fT(t):    0.1   0.2   0.3   0.3   0.1

(i) Find the cumulative distribution function (c.d.f.) of T and interpret it.

t = time (days):       1     2     3     4     5
FT(t):                ___   ___   ___   ___   ___

(ii) Find the expected duration of the project and interpret the number you find.
The contractor's total project cost is made up of two parts: a fixed cost of TL 2,000, plus TL 500 for each day taken to complete the drilling.
(iii) Find the expected total project cost.
(iv) Find the variance of the project cost.
(A numerical check of (i)-(iv) appears in the sketch below.)

Prediction: Consider the urn model, where the population consists of balls in an urn. A ball is picked at random. Your task is to guess the value Y written on it. What would your guess be?
Example: Suppose you had to predict how long a particular well-drilling project would take. What would your guess be? One of the possible values of T? Some other number?
We need more structure. Clearly, prediction is subject to error. Errors can be costly, and large errors can be more costly.

What is the cost of a poor prediction? Let "c" be your guess (a number). Define the prediction error as U = Y – c. We would like to make U small. Since Y is a random variable, U is also a random variable.
More definitions:
E(U) = E(Y) – c = bias of your guess ("c").
E(U²) = E[(Y – c)²] = mean (expected) squared error of guess c.
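Not part of the handout: a small Python sketch, using the well-drilling distribution given above (the TL 2,000 and TL 500 cost figures are those in the exercise; the candidate guesses are my own choices), that computes the quantities asked for in (i)-(iv) and evaluates the mean squared error E(U²) for a few guesses c:

```python
# Well-drilling example: p.f. of T, its c.d.f., moments, project cost, and E[(T - c)^2].
from itertools import accumulate

t_values = [1, 2, 3, 4, 5]
t_probs  = [0.1, 0.2, 0.3, 0.3, 0.1]        # Pr(T = t), as given in the exercise

# (i) c.d.f.: F_T(t) = Pr(T <= t), a running sum of the probabilities
cdf = list(accumulate(t_probs))
print("F_T:", dict(zip(t_values, [round(f, 2) for f in cdf])))

# (ii) expected duration E(T), and V(T) for later use
ET = sum(t * p for t, p in zip(t_values, t_probs))
VT = sum((t - ET) ** 2 * p for t, p in zip(t_values, t_probs))
print("E(T) =", ET, " V(T) =", round(VT, 4))

# (iii)-(iv) cost = 2000 + 500*T is a linear function of T, so
# E(cost) = 2000 + 500*E(T) and V(cost) = 500^2 * V(T)
print("E(cost) =", 2000 + 500 * ET)
print("V(cost) =", 500 ** 2 * VT)

# Prediction: mean squared error E[(T - c)^2] for a few candidate guesses c
for c in [2, 3, ET, 4]:
    mse = sum((t - c) ** 2 * p for t, p in zip(t_values, t_probs))
    print(f"guess c = {c}: E(U^2) = {mse:.3f}")
```

Comparing the E(U²) values across guesses motivates the criterion stated next.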
Mean Squared Prediction Error criterion: Suppose the objective is to minimize E(U²). Then the best predictor (guess) is c = µY = E(Y).

Proof (calculus): E(U²) = E[(Y – c)²] = Σi (yi – c)² pi. Differentiation yields
∂E(U²)/∂c = Σi ∂[(yi – c)² pi]/∂c = Σi [–2(yi – c) pi].
Setting the derivative to zero yields the first order condition (F.O.C.) for a minimum:
Σi [–2(yi – c) pi] = 0, that is, Σi yi pi = c Σi pi.
We know Σi pi = 1 and Σi yi pi = E(Y), so the solution is c = µY. (Check the second order condition to verify that we located a minimum.) ///

Non-calculus proof: For brevity let µ = µY and reexamine the prediction error:
U = Y – c = Y – µ – (c – µ) = Y* – (c – µ), where Y* = Y – µ as usual.
Square both sides and expand:
U² = [Y* – (c – µ)]² = Y*² + (c – µ)² – 2Y*(c – µ).
Take expectations, and recall the "useful algebra":
E(U²) = E[Y*² + (c – µ)² – 2Y*(c – µ)] = E(Y*²) + (c – µ)² – 2(c – µ)E(Y*) = V(Y) + (c – µ)².
Since V(Y) > 0 and (c – µ)² ≥ 0, the minimum of E(U²) is obtained by setting c = µ. ///

Remarks:
(i) If we use the mean squared prediction error criterion, then the population mean (that is, the expectation of the random variable) is the best guess (predictor) of a draw from that population (distribution).
(ii) The variance equals the value of the expected squared prediction error when the population mean is used as the predictor.
(iii) Other criteria may yield different choices of best predictor. For example, if the criterion were minimization of the expected absolute prediction error, namely E(|U|), then the population median would be the best predictor.

How to envision joint, marginal, and conditional probability distributions

Urn model: Population = balls in an urn. Bivariate population: each ball has a pair of values (X, Y) written on it.
X has J distinct values: x1, x2, ..., xj, ..., xJ.
Y has K distinct values: y1, y2, ..., yk, ..., yK.

Joint (population) distribution of X and Y:
pjk = Pr(X = xj, Y = yk), j = 1, 2, …, J; k = 1, 2, …, K.
Gives the proportion of times we encounter a ball with the paired values (xj, yk), j = 1, 2, …, J; k = 1, 2, …, K. The joint distribution classifies the balls according to the values of both X and Y.

To obtain a "marginal" distribution, we reclassify the balls in the urn according to the distinct values of one "margin" and ignore the distinct values of the second margin.

Marginal (population) distribution of X: pj = Pr(X = xj), j = 1, 2, …, J.
Here we ignore the values of Y, and examine the proportion of times we encounter a ball with value xj, j = 1, 2, …, J.
How to obtain the marginal distribution of X from the joint distribution of X and Y (Stock & Watson):
pj = Σk pjk, j = 1, 2, …, J.
(Population) Mean of X: µX = E(X) = Σj xj pj.
(Population) Variance of X: σX² = V(X) = Σj (xj – µX)² pj.
The marginal distribution of Y, its mean and variance may be obtained in analogous fashion (write down the formulas!).

Exercise: Consider S&W Table 2.3, Panel A. Verify the derivation of the marginal distributions of A and M; find their means and variances.
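S&W Table 2.3 is not reproduced in these notes, so the sketch below (not from the handout) uses a small hypothetical joint distribution purely to illustrate the mechanics: marginals are obtained by summing the joint probabilities across the other margin, and the marginal moments follow as before.

```python
# Hypothetical joint distribution p_jk = Pr(X = x_j, Y = y_k); rows index x_j, columns y_k.
x_values = [0, 1]
y_values = [0, 1, 2]
p_joint = [
    [0.10, 0.20, 0.10],   # Pr(X = 0, Y = y_k)
    [0.15, 0.30, 0.15],   # Pr(X = 1, Y = y_k)
]

# Marginal of X: p_j = sum_k p_jk ; marginal of Y: p_k = sum_j p_jk
pX = [sum(row) for row in p_joint]
pY = [sum(p_joint[j][k] for j in range(len(x_values))) for k in range(len(y_values))]
print("marginal of X:", [round(p, 2) for p in pX])
print("marginal of Y:", [round(p, 2) for p in pY])

# Marginal mean and variance of X (the formulas for Y are analogous)
muX  = sum(x * p for x, p in zip(x_values, pX))
varX = sum((x - muX) ** 2 * p for x, p in zip(x_values, pX))
print("E(X) =", round(muX, 4), " V(X) =", round(varX, 4))
```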
To obtain a "conditional" distribution, we first sort the balls according to one of the two values and put them in different urns. We then examine the contents of a specific urn.

[Diagram: the POPULATION urn is split into SUBPOPULATION urns, one for each value X = x1, X = x2, …, X = xj, …, X = xJ. Each urn has its own distribution of values of Y!]

These conditional distributions may be different (hence each subpopulation may have a different mean and variance). We can distinguish between them as long as we record the distinct value of X for that urn. To obtain the conditional distributions of Y given X, we sort on the distinct values xj:

Conditional (population) distribution of Y given X = xj:
pk|j = Pr(Y = yk | X = xj) = pjk / pj, k = 1, 2, …, K. (The derivation requires pj > 0.)
Conditional (population) mean of Y given X = xj: µY|j = E(Y | X = xj) = Σk yk pk|j.
Conditional (population) variance of Y given X = xj: σY|j² = V(Y | X = xj) = Σk (yk – µY|j)² pk|j.

The conditional distributions of X given Y = yk, and their conditional means and variances, may be obtained in analogous fashion (write down the formulas you would use!).
Exercise: Verify the derivation in S&W Table 2.3, Panel B.

The Law of Iterated Expectations: We saw that "expectation" is a weighted average. As a consequence:
µY = E(Y) = Σj E(Y | X = xj) Pr(X = xj).
We may write: E(Y) = EX[E(Y|X)]. Observe that:
• The "inner" expectation E(Y|X) is a weighted average of the different values of y, weighted by the conditional probabilities Pr(Y = yk | X = xj) (here X is "given": we know which urn the balls come from).
• The "outer" expectation EX[.] is a weighted average of the different values of E(Y | X = xj), weighted by the probabilities Pr(X = xj).

Practical uses of conditional expectations:
• Consider the conditional distributions given in S&W Table 2.3 (p. 70). Suppose you have an old computer. How would you justify buying a new computer? Hint: calculate the benefit (reduction in expected crashes) of switching from an old computer to a new one.
• Consider the urn model. We obtain a random draw from the joint distribution of (X, Y). We tell you the value of X. What is your best guess of the value of Y?

Exercise: Earlier we used the marginal distribution of M to calculate E(M). Can you think of another way to compute E(M)? (See S&W: 72.)

Functions of jointly distributed random variables:
Let Z = h(X, Y), a function of the two random variables X and Y, and suppose the joint distribution of X and Y is known. Then the expectation of Z can be computed in the usual manner, as a weighted average:
E(Z) = E[h(X, Y)] = Σj Σk h(xj, yk) Pr(X = xj, Y = yk) = Σj Σk h(xj, yk) pjk,   (☯)
where the probability weights pjk, j = 1, 2, …, J and k = 1, 2, …, K, are obtained from the joint distribution.
Exercise: Use S&W Table 2.3 to compute E(MA).

(Population) covariance: In a joint distribution, the degree to which two random variables are related may be measured with the help of the covariance:
Cov(X, Y) = σXY = E(X*Y*) = E[(X – µX)(Y – µY)] = Σj Σk (xj – µX)(yk – µY) Pr(X = xj, Y = yk).
Remark: We took Z = h(X, Y) = (X – µX)(Y – µY) and found E(Z)…

Useful algebra: E(X*Y*) = E[(X – µX)(Y – µY)] = … = E(XY) – E(X)E(Y).   (**)
In words, the covariance equals the expected value of the product, minus the product of the expectations.

(Population) covariance cont'd: The "sign" of the covariance is informative about the nature of the relation:
If above-average values of X go together with above-average values of Y (so that below-average values of X go together with below-average values of Y), the covariance will be positive.
If above-average values of one variable go together with below-average values of the other, the covariance will be negative.
Exercise: Suppose X = weight and Y = height of individuals in a population. Can you guess the sign of Cov(X, Y)?
Remark: Think about the urn model. Think about prediction.

(Population) correlation: The magnitude of the covariance is affected by the units of measurement of the variables. For a unit-free measure, we turn to the correlation:
Corr(X, Y) = ρXY = σXY / (σX σY) = Cov(X, Y) / √[V(X) V(Y)].
It can be shown that –1 ≤ ρXY ≤ 1.
Random variables are said to be uncorrelated if ρXY = 0. Clearly, for this to happen σXY = 0 must hold.
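The sketch below is not from the handout; it reuses a hypothetical joint distribution (chosen only for illustration) to run through the conditional means, verify the law of iterated expectations, and compute the covariance and correlation with the formulas above.

```python
# Hypothetical joint distribution p_jk = Pr(X = x_j, Y = y_k); rows index x_j, columns y_k.
x_values = [0, 1]
y_values = [0, 1, 2]
p_joint = [
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
]

pX = [sum(row) for row in p_joint]                      # marginal of X: p_j = sum_k p_jk

# Conditional distribution and conditional mean of Y given X = x_j
cond_means = []
for j, row in enumerate(p_joint):
    p_cond = [p / pX[j] for p in row]                   # p_{k|j} = p_jk / p_j
    mu_Y_given_xj = sum(y * p for y, p in zip(y_values, p_cond))
    cond_means.append(mu_Y_given_xj)
    print(f"E(Y | X = {x_values[j]}) = {mu_Y_given_xj:.3f}")

# Law of iterated expectations: E(Y) = sum_j E(Y | X = x_j) Pr(X = x_j)
EY_iterated = sum(m * p for m, p in zip(cond_means, pX))
EY_marginal = sum(y * sum(p_joint[j][k] for j in range(len(x_values)))
                  for k, y in enumerate(y_values))
print("E(Y) via LIE:", round(EY_iterated, 4), " via marginal:", round(EY_marginal, 4))

# Covariance and correlation
EX  = sum(x * p for x, p in zip(x_values, pX))
EXY = sum(x_values[j] * y_values[k] * p_joint[j][k]
          for j in range(len(x_values)) for k in range(len(y_values)))
cov  = EXY - EX * EY_marginal                           # Cov(X, Y) = E(XY) - E(X)E(Y)
varX = sum((x - EX) ** 2 * p for x, p in zip(x_values, pX))
varY = sum((y - EY_marginal) ** 2 *
           sum(p_joint[j][k] for j in range(len(x_values)))
           for k, y in enumerate(y_values))
corr = cov / (varX ** 0.5 * varY ** 0.5)                # rho = Cov / (sd_X * sd_Y)
print("Cov(X, Y) =", round(cov, 4), " Corr(X, Y) =", round(corr, 4))
```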
Recall that in general E(Y|X) is a function of X; it tells us how the conditional mean of Y given X = xj changes with xj, j = 1, 2, …, J. Suppose instead that E(Y|X) = E(Y) = µY, a constant. To describe this case, we say Y is mean-independent of X.

Claim 1: If Y is mean-independent of X, then σXY = 0 (hence ρXY = 0).
Proof: E(XY) = E(YX) = EX[E(YX | X)] = EX[E(Y|X) X].*
(*When we "condition" on X, we set it equal to a particular value.)
If E(Y|X) = E(Y), the last expression simplifies: = EX[E(Y) X] = E(Y)E(X).
We showed: If Y is mean-independent of X, then E(XY) = E(Y)E(X). Return to (**) and note that σXY = 0 iff E(XY) = E(X)E(Y). Thus σXY = 0… ///

CAUTION: If σXY = 0, it does not follow that E(Y|X) = constant. Covariance/correlation capture the linear relation between X and Y. It could be that the relation is non-linear, so that E(Y|X) varies with X, and yet σXY = 0.
Example: Modify the joint distribution in Assignment 2 Part II as:

f(x, y)     y = 1    y = 2
x = –1       0.20     0.10
x = 0        0.10     0.30
x = 1        0.20     0.10

and (re)calculate Cov(X, Y). (A numerical check appears in the sketch at the end of this handout.)

Independence: Random variables X and Y are (statistically) independent if knowledge of the value of one of the variables provides no information about the other. Formally:
I.1. X and Y are independently distributed if, for all values of x and y, Pr(Y = y | X = x) = Pr(Y = y).
From the definition of conditional probabilities, Pr(Y = y, X = x) = Pr(Y = y | X = x) Pr(X = x). Thus an equivalent condition for, and implication of, independence is:
I.2. Pr(Y = y, X = x) = Pr(Y = y) Pr(X = x), for all values of x and y.

Claim 2: If X and Y are independently distributed, then E(X|Y) = E(X) and E(Y|X) = E(Y).
Proof: E(Y | X = xj) = Σk yk pk|j = Σk yk (pjk / pj) = Σk yk (pj pk / pj)   [by I.2]   = Σk yk pk = E(Y). ///

SUMMARY: Independence ⇒ Mean-independence ⇒ Zero correlation.
However: we cannot go from right to left! The stronger condition implies the weaker condition; not the other way around.

Additional Linear Function Rules (S&W Appendix 2.1):

Suppose Z = X + Y. Then using (☯), it is easy to show
E(Z) = E(X) + E(Y).
In words, the expectation of a sum equals the sum of the expectations.
Continuing, if Z = X + Y, then Z* = X* + Y*, and Z*² = X*² + Y*² + 2X*Y*, where the asterisk denotes the deviation from the expectation. So
V(Z) = E(Z*²) = E(X*²) + E(Y*²) + 2E(X*Y*) = V(X) + V(Y) + 2C(X, Y).
In words, the variance of a sum equals the sum of the variances plus twice the covariance.
Exercise: Use the same logic to find the variance of a difference.

Generalizing to linear functions, if Z = a + bX + cY, where a, b and c are constants, then
E(Z) = a + bE(X) + cE(Y),
so the deviation from the expectation is Z* = bX* + cY*, and the variance of Z is
V(Z) = E(Z*²) = b²V(X) + c²V(Y) + 2bc C(X, Y).

Still more generally, for a pair of random variables
Z1 = a1 + b1X + c1Y,  Z2 = a2 + b2X + c2Y,
where the a's, b's and c's are constants, the covariance of Z1 and Z2 is
C(Z1, Z2) = b1b2 V(X) + c1c2 V(Y) + (b1c2 + b2c1) C(X, Y).
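To close, a sketch (not part of the handout) that applies these rules to the modified joint distribution from the caution example above: it confirms that Cov(X, Y) = 0 even though E(Y|X) is not constant, and checks the rule for V(a + bX + cY) against a direct calculation. The constants a, b, c are my own illustrative choices.

```python
# Joint distribution from the caution example: rows are x in {-1, 0, 1}, columns y in {1, 2}.
x_values = [-1, 0, 1]
y_values = [1, 2]
p_joint = [
    [0.20, 0.10],   # x = -1
    [0.10, 0.30],   # x =  0
    [0.20, 0.10],   # x =  1
]

pX  = [sum(row) for row in p_joint]
pY  = [sum(p_joint[j][k] for j in range(3)) for k in range(2)]
EX  = sum(x * p for x, p in zip(x_values, pX))
EY  = sum(y * p for y, p in zip(y_values, pY))
EXY = sum(x_values[j] * y_values[k] * p_joint[j][k] for j in range(3) for k in range(2))
cov_XY = EXY - EX * EY
print("Cov(X, Y) =", round(cov_XY, 4))              # zero: X and Y are uncorrelated...

for j, row in enumerate(p_joint):                   # ...yet E(Y | X = x_j) is NOT constant
    print(f"E(Y | X = {x_values[j]}) =",
          round(sum(y * p / pX[j] for y, p in zip(y_values, row)), 3))

# Linear-function rules (S&W Appendix 2.1) with illustrative constants a, b, c:
a, b, c = 1.0, 2.0, -3.0                            # hypothetical constants
varX = sum((x - EX) ** 2 * p for x, p in zip(x_values, pX))
varY = sum((y - EY) ** 2 * p for y, p in zip(y_values, pY))

# Direct calculation of V(a + bX + cY) from the joint distribution...
EZ = a + b * EX + c * EY
varZ_direct = sum((a + b * x_values[j] + c * y_values[k] - EZ) ** 2 * p_joint[j][k]
                  for j in range(3) for k in range(2))
# ...versus the rule V(Z) = b^2 V(X) + c^2 V(Y) + 2bc C(X, Y)
varZ_rule = b ** 2 * varX + c ** 2 * varY + 2 * b * c * cov_XY
print("V(Z) direct:", round(varZ_direct, 4), " via rule:", round(varZ_rule, 4))
```

Because the covariance is zero in this example, the cross term drops out and V(Z) reduces to b²V(X) + c²V(Y); the direct calculation agrees.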