REVIEW OF PROBABILITY AND STATISTICS

İnsan TUNALI
Econ 311 – Econometrics I
11 February 2014
Lectures 3-5
Revised Feb. 20th
REVIEW OF PROBABILITY AND STATISTICS
Stock & Watson, Ch.2 [Goldberger Ch.3-4]

1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference
(a) Population, random variable, and distribution
(b) Moments of a distribution (mean, variance, standard deviation, covariance, correlation)
(c) Conditional distributions and conditional means
(d) Distribution of a sample of data drawn randomly from a population: Y1, …, Yn (subject of another handout)
From the syllabus: “The prerequisites for ECON 311 include MATH 201
(Statistics), and ECON 201 (Intermediate Microeconomics). Students who got a
grade of C− or below in MATH 201 are strongly advised to work independently to
make up for any deficiency they have during the first two weeks of the semester.”
(a) Population, random variable, and distribution

Population
• The group or collection of all possible entities of interest
• We will think of populations as being “very big” (∞ is an approximation to “very big”)
Outcomes... sample space... events... MATH 201

Random variable Y
• Numerical summary of a random outcome

Population distribution of Y
• Discrete case: The probabilities of different values of Y that occur in the population
• Continuous case: Likelihood of particular ranges of Y

How to envision Discrete Probability Distributions

Urn model: Population = Balls in an urn. Each ball has a value (Y) written on it; Y has K distinct values: y1, y2, ..., yi, ..., yK.
Suppose we were to sample from this (univariate) population, with replacement, infinitely many times…
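To make the urn metaphor concrete, here is a minimal Python sketch (not part of the original handout; the values and probabilities are made up) that samples with replacement from a small urn and compares relative frequencies with the population probabilities:

import random

random.seed(0)
values = [1, 2, 3]          # hypothetical values written on the balls
probs  = [0.2, 0.5, 0.3]    # hypothetical population probabilities
n = 100_000                 # a "very big" number of draws with replacement

draws = random.choices(values, weights=probs, k=n)
for v, pr in zip(values, probs):
    freq = draws.count(v) / n
    print(f"Y = {v}: population probability {pr:.2f}, relative frequency {freq:.3f}")

With many draws the relative frequencies settle close to the population probabilities, which is exactly the sense in which the urn “is” the distribution.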
(Population) Distribution of Y:
pi = Pr(Y = yi), i = 1, 2, …, K.
Gives the proportion of times we encounter a ball with value Y = yi, i = 1, 2, …, K.
Alternate notation: f(y) = Pr(Y = y); the “probability function (p.f.) of Y.”
Clearly pi ≥ 0 for all i, and Σi pi = 1. (Convention: Σi denotes the sum over i = 1, 2, …, K.)
Examples:
> Gender: M (=0)/F (=1). We prefer the numerical representation…
> Standing: freshman (=1), sophomore (=2), junior (=3), senior (=4).
> Ranges of wages (group them in intervals first).

Cumulative Distribution Function (c.d.f.) of Y:
F(y) = Pr(Y ≤ y) = Σ{yi ≤ y} f(yi) = Σ{yi ≤ y} pi.
That is, to find F(y) we sum the pi’s over all values yi that do not exceed y.
Use of the c.d.f.:
Pr(a < Y ≤ b) = F(b) – F(a).

Important features of the distribution of Y:
(Population) Mean of Y:
µY = E(Y) = Σi yi pi.
Also known as “the expected value of Y” or simply the “expectation of Y.”
Remark: Expectation is a weighted average of the values of Y, where the weights are the probabilities with which the distinct values occur.
The idea of “weighted averaging” can be extended to functions of Y.
Suppose Z = h(Y), any function of Y. Then the expected value of h(Y) is:
E(Z) = E[h(Y)] = Σi h(yi) pi.
Thus knowledge of the probability distribution of Y is sufficient for calculating the expectation of functions of Y as well.
Examples:
(i) Take Z = Y². Then E(Z) = E(Y²) = Σi yi² pi.
(ii) Take Z = (Y – µY)². Then E(Z) = E[(Y – µY)²] = Σi (yi – µY)² pi.

With this choice of h(Y), we get the:
(Population) Variance of Y:
σY² = V(Y) = E[(Y – µY)²] = Σi (yi – µY)² pi.
In words, variance equals the expected value of (or the expectation of) “the squared deviation of Y from its mean.”

Example: Suppose random variable Y can take on one of two values, y0 = 0 and y1 = 1, with probabilities p0 and p1. Since p0 + p1 = 1, we may take
Pr(Y = 1) = p and Pr(Y = 0) = 1 – p, 0 < p < 1.
We say Y has a “Bernoulli distribution with parameter p” and write: Y ~ Bernoulli (p).
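Before specializing to the Bernoulli case below, here is a small sketch (illustration only; the p.f. is made up) of the weighted-averaging recipe E[h(Y)] = Σi h(yi) pi for the two choices of h(Y) in examples (i) and (ii):

# hypothetical p.f.: values yi with probabilities pi
y = [0, 1, 2]
p = [0.5, 0.3, 0.2]

mu  = sum(yi * pi for yi, pi in zip(y, p))              # E(Y) = 0.7
EY2 = sum(yi**2 * pi for yi, pi in zip(y, p))           # E(Y^2) = 1.1, with h(y) = y^2
var = sum((yi - mu)**2 * pi for yi, pi in zip(y, p))    # E[(Y - mu)^2] = 0.61, with h(y) = (y - mu)^2

print(round(mu, 3), round(EY2, 3), round(var, 3))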
For Y ~ Bernoulli (p):
µY = E(Y) = Σi yi pi = (0)(1 – p) + (1)(p) = p;
E(Y²) = Σi yi² pi = (0)²(1 – p) + (1)²(p) = p;
σY² = V(Y) = Σi (yi – µY)² pi = (0 – p)²(1 – p) + (1 – p)²(p) = … = p(1 – p). ///

(iii) Linear functions: Take Z = a + bY, where a and b are constants.
E(Z) = E(a + bY) = Σk (a + byk) pk = a Σk pk + b Σk yk pk = a + b E(Y).
In words, the expectation of a linear function (of Y) equals the linear function of the expectation (of Y).

Useful algebra: Let Y* = Y – µY, the deviation of Y from its population mean. This function is linear in Y, as in (iii), where a = – µY and b = 1. Hence
E(Y*) = – µY + (1)E(Y) = 0.
In words, the expectation of a deviation around the mean is zero.
Next, examine Y*² = (Y – µY)² = Y² + µY² – 2YµY; this function is not linear in Y.
E(Y*²) = E(Y² + µY² – 2YµY)
= E(Y²) + µY² – 2µY E(Y)    (*)
= E(Y²) + µY² – 2µY² = E(Y²) – [E(Y)]².
In line (*) we exploited the fact that E(.), which involves weighted averaging, is a “linear” operator; thus the expectation of a sum equals the sum of expectations.
From (ii), V(Y) = E(Y*²); thus
V(Y) = E(Y²) – [E(Y)]².
In words, the variance of Y equals “the expectation of squared Y” minus “the square of expected Y”.
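A quick numerical check (not in the handout) of the Bernoulli results and of the shortcut V(Y) = E(Y²) – [E(Y)]², for an arbitrarily chosen p:

p_param = 0.3                        # any value in (0, 1); 0.3 is just for illustration
y_vals  = [0, 1]
probs   = [1 - p_param, p_param]

mu  = sum(y * pr for y, pr in zip(y_vals, probs))         # should equal p
EY2 = sum(y**2 * pr for y, pr in zip(y_vals, probs))      # should also equal p
var = sum((y - mu)**2 * pr for y, pr in zip(y_vals, probs))

print(round(mu, 3), round(EY2, 3), round(var, 3),
      round(p_param * (1 - p_param), 3), round(EY2 - mu**2, 3))   # 0.3 0.3 0.21 0.21 0.21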
Finally, let Z = a + bY as in (iii), and consider the deviation of Z from its mean:
Z* = Z – E(Z) = a + bY – [a + bE(Y)] = bY*.
It follows that the variance of Z is related to the variance of Y via:
V(Z) = E(Z*²) = E[(bY*)²] = E(b²Y*²) = b² E(Y*²) = b²V(Y).
In words, the variance of a linear function equals the slope squared times the variance of Y.
Exercise: Well-drilling project.
Based on previous experience, a contractor believes he will find water within 1-5 days, and attaches a probability to each possible outcome. Let T denote the (random amount of) time it takes to complete drilling. The probability distribution (p.f.) of T is:

t = time (days):       1     2     3     4     5
Pr(T = t) = fT(t):    0.1   0.2   0.3   0.3   0.1

(i) Find the cumulative distribution function (c.d.f.) of T and interpret it.

t = time (days):       1     2     3     4     5
FT(t):                 __    __    __    __    __
(ii) Find the expected duration of the project and interpret the number you find.
The contractor’s total project cost is made up of two parts: a fixed cost of TL 2,000, plus TL 500 for each day taken to complete the drilling.
(iii) Find the expected total project cost.
(iv) Find the variance of the project cost.
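A sketch of how one might check the answers to (i)–(iv) numerically (the probabilities are those in the table above; the cost C = 2,000 + 500·T is a linear function of T, so the rules E(a + bT) = a + bE(T) and V(a + bT) = b²V(T) apply):

t_vals = [1, 2, 3, 4, 5]
f_T    = [0.1, 0.2, 0.3, 0.3, 0.1]

# (i) c.d.f.: running sum of the p.f.
F, running = {}, 0.0
for t, pr in zip(t_vals, f_T):
    running += pr
    F[t] = round(running, 1)
print("F_T:", F)                                           # {1: 0.1, 2: 0.3, 3: 0.6, 4: 0.9, 5: 1.0}

# (ii) expected duration and, for later use, the variance of T
E_T = sum(t * pr for t, pr in zip(t_vals, f_T))
V_T = sum((t - E_T)**2 * pr for t, pr in zip(t_vals, f_T))
print(f"E(T) = {E_T:.1f} days, V(T) = {V_T:.2f}")          # 3.1 days, 1.29

# (iii)-(iv) cost is the linear function C = 2000 + 500*T (in TL)
a, b = 2000, 500
print(f"E(C) = {a + b * E_T:.0f} TL")                      # 3550 TL
print(f"V(C) = {b**2 * V_T:.0f}")                          # 322500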
Prediction: Consider the urn model, where the population consists of balls in an urn. A ball is picked at random. Your task is to guess the value Y written on it. What would your guess be?
Example: Suppose you had to predict how long a particular well-drilling project would take. What would your guess be? One of the possible values of T? Some other number?
We need more structure. Clearly, prediction is subject to error. Errors can be costly, and large errors can be more costly.
What is the cost of a poor prediction?
Let “c” be your guess (a number). Define the prediction error as U = Y – c.
We would like to make U small. Since Y is a random variable, U is also a random variable.
More definitions:
E(U) = E(Y) – c = bias of your guess (“c”).
E(U²) = E[(Y – c)²] = mean (expected) squared error of guess c.
Mean Squared Prediction Error criterion: Suppose the objective is to minimize E(U²). Then the best predictor (guess) is c = µY = E(Y).

Proof: Can use calculus.
E(U²) = E[(Y – c)²] = Σi (yi – c)² pi.
Differentiation yields:
∂ E(U²)/∂c = Σi ∂[(yi – c)² pi]/∂c = Σi [–2(yi – c)pi].
Setting the derivative to zero yields the first order condition (F.O.C.) for a minimum: Σi [–2(yi – c)pi] = 0.
That is,
Σi yi pi = c Σi pi.
We know: Σi pi = 1, and Σi yi pi = E(Y), so the solution is c = µY.
(Check the second order condition to verify that we located a minimum.) ///
Non-Calculus proof: For brevity let µ = µY and reexamine the prediction error:
U = Y – c = Y – µ – (c – µ) = Y* – (c – µ),
where Y* = Y – µ as usual. Square both sides and expand:
U² = [Y* – (c – µ)]² = Y*² + (c – µ)² – 2Y*(c – µ).
Take expectations, and recall the “useful algebra”:
E(U²) = E[Y*² + (c – µ)² – 2Y*(c – µ)]
= E(Y*²) + (c – µ)² – 2(c – µ)E(Y*)
= V(Y) + (c – µ)².
Since V(Y) > 0 and (c – µ)² ≥ 0, the minimum of E(U²) is obtained by setting c = µ. ///

Remarks:
(i) If we use the mean squared prediction error, then the population mean (that is, the expectation of the random variable) is the best guess (predictor) of a draw from that population (distribution).
(ii) Variance equals the value of the expected squared prediction error when the population mean is used as the predictor.
(iii) Other criteria may yield different choices of best predictor. For example, if the criterion were minimization of the expected absolute prediction error, namely E(|U|), then the population median would be the best predictor.
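To illustrate Remark (i) (this check is not part of the handout), take the well-drilling distribution from the earlier exercise and compare E[(T – c)²] across candidate guesses c; the minimum occurs at c = E(T) = 3.1:

t_vals = [1, 2, 3, 4, 5]
f_T    = [0.1, 0.2, 0.3, 0.3, 0.1]

def mse(c):
    """Mean squared prediction error E[(T - c)^2] for guess c."""
    return sum((t - c)**2 * pr for t, pr in zip(t_vals, f_T))

candidates = [c / 10 for c in range(10, 51)]             # grid 1.0, 1.1, ..., 5.0
best = min(candidates, key=mse)
print("best guess on the grid:", best)                   # 3.1 = E(T)
print("MSE at the best guess :", round(mse(best), 3))    # 1.29 = V(T), as in Remark (ii)
print("MSE at guess c = 4.0  :", round(mse(4.0), 3))     # 2.1, larger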
How to envision Joint, Marginal, and Conditional Probability Distributions

Urn model: Population = Balls in an urn.
Bivariate population: Each ball has a pair of values (X, Y) written on it.
X has J distinct values: x1, x2, ..., xj, ..., xJ.
Y has K distinct values: y1, y2, ..., yk, ..., yK.

Joint (population) distribution of X and Y:
pjk = Pr(X = xj, Y = yk), j = 1, 2, …, J; k = 1, 2, …, K.
Gives the proportion of times we encounter a ball with paired values (xj, yk), j = 1, 2, …, J; k = 1, 2, …, K.

The joint distribution classifies the balls according to the values of both X and Y. To obtain a “marginal” distribution, we reclassify the balls in the urn according to the distinct values of one “margin”; we ignore the distinct values of the second margin.

Marginal (population) distribution of X:
pj = Pr(X = xj), j = 1, 2,…, J.
Here we ignore the values of Y, and examine the proportion of times we encounter a ball with value xj, j = 1, 2,…, J.
How to obtain a marginal distribution of X from the joint distribution
of X and Y:
pj = Σk pjk, j = 1, 2,…, J.
(Stock & Watson)
(Population) Mean of X:
µX = E(X) = Σj xj pj.
(Population) Variance of X:
σX² = V(X) = Σj (xj – µX)² pj.
The marginal distribution of Y, its mean and variance may be
obtained in analogous fashion (write down the formula!).
Exercise: Consider S&W Table 2.3, Panel A. Verify the derivation
of the marginal distributions of A and M; find their means and
variances.
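The Table 2.3 numbers are not reproduced here; instead, the sketch below (a made-up joint distribution, for illustration only) shows the mechanics: the marginals are obtained by summing the joint probabilities over the other variable, and the means and variances follow as before.

# hypothetical joint p.f.: p[(x, y)] = Pr(X = x, Y = y); the probabilities sum to 1
p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}
x_vals = sorted({x for (x, _) in p})
y_vals = sorted({y for (_, y) in p})

p_x = {x: sum(p[(x, y)] for y in y_vals) for x in x_vals}   # marginal of X: p_j = sum_k p_jk
p_y = {y: sum(p[(x, y)] for x in x_vals) for y in y_vals}   # marginal of Y

mu_x  = sum(x * pr for x, pr in p_x.items())
var_x = sum((x - mu_x)**2 * pr for x, pr in p_x.items())
print("marginal of X:", p_x, " marginal of Y:", p_y)
print("E(X) =", round(mu_x, 3), " V(X) =", round(var_x, 3))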
To obtain a “conditional” distribution, we first sort the balls
according to one of the two values, and put them in different urns.
We then examine the contents of a specific urn.
POPULATION → SUBPOPULATIONS: sort the balls into separate urns labeled X = x1, X = x2, …, X = xj, …, X = xJ. Each urn has a distribution of values of Y!

These conditional distributions may be different (hence each subpopulation may have a different mean and variance). We can distinguish between them as long as we record the distinct value of X for that urn.

To obtain the conditional distributions of Y given X we sort on distinct values xj:

Conditional (population) distribution of Y given X = xj:
pk|j = Pr(Y = yk | X = xj) = pjk / pj, k = 1, 2, …, K.
The derivation requires pj > 0.

Conditional (population) mean of Y given X = xj:
µY|j = E(Y | X = xj) = Σk yk pk|j.

Conditional (population) variance of Y given X = xj:
σ²Y|j = V(Y | X = xj) = Σk (yk – µY|j)² pk|j.
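Continuing with the same made-up joint distribution as in the earlier sketch (again, not the S&W numbers), the conditional p.f., mean, and variance of Y given X = xj are obtained by dividing each joint probability by the marginal pj:

p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}
x_vals = sorted({x for (x, _) in p})
y_vals = sorted({y for (_, y) in p})

for xj in x_vals:
    p_j = sum(p[(xj, y)] for y in y_vals)                     # marginal Pr(X = xj), must be > 0
    cond = {y: p[(xj, y)] / p_j for y in y_vals}              # p_{k|j} = p_jk / p_j
    mu_c  = sum(y * pr for y, pr in cond.items())             # E(Y | X = xj)
    var_c = sum((y - mu_c)**2 * pr for y, pr in cond.items()) # V(Y | X = xj)
    print(f"X = {xj}: conditional p.f. {cond}, E(Y|X) = {mu_c:.2f}, V(Y|X) = {var_c:.2f}")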
The conditional distributions of X given Y = yk, and their conditional means and variances, may be obtained in analogous fashion (write down the formula you would use!).
Exercise: Verify the derivation in S&W Table 2.3, Panel B.

The Law of Iterated Expectations:
We saw that “expectation” is a weighted average. As a consequence:
µY = E(Y) = Σj E(Y | X = xj) Pr(X = xj).
We may write: E(Y) = EX[E(Y|X)].
Observe that:
• The “inner” expectation E(Y|X) is a weighted average of the different values of y, weighted by the conditional probabilities Pr(Y = yk | X = xj) (here X is “given”: we know which urn the ball comes from).
• The “outer” expectation EX[.] is a weighted average of the different values of E(Y | X = xj), weighted by the probabilities Pr(X = xj).
Exercise: Earlier we used the marginal distribution of M to calculate E(M). Can you think of another way to compute E(M)? (see S&W: 72)

Practical uses of conditional expectations:
• Consider the conditional distributions given in S&W Table 2.3 (p. 70). Suppose you have an old computer. How would you justify buying a new computer?
Hint: Calculate the benefit (reduction in expected crashes) of switching from an old computer to a new one.
• Consider the urn model. We obtain a random draw from the joint distribution of (X, Y). We tell you the value of X. What is your best guess of the value of Y?
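A check (not in the handout) of the Law of Iterated Expectations on the same made-up joint distribution: averaging the conditional means E(Y | X = xj) with weights Pr(X = xj) reproduces E(Y).

p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}
x_vals = sorted({x for (x, _) in p})
y_vals = sorted({y for (_, y) in p})

# E(Y) computed directly from the marginal distribution of Y
E_Y_direct = sum(y * sum(p[(x, y)] for x in x_vals) for y in y_vals)

# E_X[E(Y|X)]: weight each conditional mean by Pr(X = xj)
E_Y_iterated = 0.0
for xj in x_vals:
    p_j = sum(p[(xj, y)] for y in y_vals)
    cond_mean = sum(y * p[(xj, y)] / p_j for y in y_vals)
    E_Y_iterated += cond_mean * p_j

print(round(E_Y_direct, 3), round(E_Y_iterated, 3))          # both 15.5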
Functions of jointly distributed random variables:
Let Z = h(X, Y), a function of two random variables, X and Y. Suppose the joint distribution of X and Y is known. Then the expectation of Z can be computed in the usual manner, as a weighted average:
E(Z) = E[h(X, Y)] = Σj Σk h(xj, yk) Pr(X = xj, Y = yk) = Σj Σk h(xj, yk) pjk,   (☯)
where the probability weights pjk, j = 1, 2,…, J and k = 1, 2,…, K, are obtained from the joint distribution.
Exercise: Use S&W Table 2.3 to compute E(MA).

(Population) covariance:
In a joint distribution, the degree to which two random variables are related may be measured with the help of covariance:
Cov(X, Y) = σXY = E(X*Y*) = E[(X – µX)(Y – µY)]
= Σj Σk (xj – µX)(yk – µY) Pr(X = xj, Y = yk).
Remark: We took Z = h(X, Y) = (X – µX)(Y – µY) and found E(Z)…
Useful algebra:
E(X*Y*) = E[(X – µX)(Y – µY)] = … = E(XY) – E(X)E(Y).   (†)
In words, covariance equals the expected value of the product, minus the product of the expectations.
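The same made-up joint distribution as in the earlier sketches, with the covariance computed two ways: from the definition E[(X – µX)(Y – µY)] and from the shortcut (†).

p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}

E_X  = sum(x * pr for (x, y), pr in p.items())
E_Y  = sum(y * pr for (x, y), pr in p.items())
E_XY = sum(x * y * pr for (x, y), pr in p.items())

cov_definition = sum((x - E_X) * (y - E_Y) * pr for (x, y), pr in p.items())
cov_shortcut   = E_XY - E_X * E_Y
print(round(cov_definition, 3), round(cov_shortcut, 3))      # both -0.8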
(Population) covariance cont’d:
The “sign” of covariance is informative about the nature of the relation:
If above average values of X go together with above average values of Y (so that below average values of X go together with below average values of Y), covariance will be positive.
If above average values of one variable go together with below average values of the other, covariance will be negative.
Exercise: Suppose X = weight and Y = height of individuals in a population. Can you guess the sign of Cov(X, Y)?
Remark: Think about the urn model. Think about prediction.

(Population) correlation:
The magnitude of covariance is affected by the units of measurement of the variables. For a unit-free measure, we turn to correlation:
Corr(X, Y) = ρXY = σXY/(σX σY) = Cov(X, Y)/√[V(X)V(Y)],
where σX = √V(X) and σY = √V(Y) are the standard deviations.
It can be shown that –1 ≤ ρXY ≤ 1.
Random variables are said to be uncorrelated if ρXY = 0. Clearly, for this to happen, σXY = 0 must hold.
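Correlation for the same made-up joint distribution, plus a check that it is unit-free: multiplying Y by 100 (a change of units) scales the covariance by 100 but leaves ρXY unchanged.

from math import sqrt

p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}

def cov_and_corr(joint):
    E_X = sum(x * pr for (x, y), pr in joint.items())
    E_Y = sum(y * pr for (x, y), pr in joint.items())
    V_X = sum((x - E_X)**2 * pr for (x, y), pr in joint.items())
    V_Y = sum((y - E_Y)**2 * pr for (x, y), pr in joint.items())
    C   = sum((x - E_X) * (y - E_Y) * pr for (x, y), pr in joint.items())
    return round(C, 4), round(C / sqrt(V_X * V_Y), 4)

p_rescaled = {(x, 100 * y): pr for (x, y), pr in p.items()}  # change the units of Y
print(cov_and_corr(p))           # covariance -0.8,  correlation about -0.328
print(cov_and_corr(p_rescaled))  # covariance -80.0, correlation unchanged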
Recall that in general E(Y|X) is a function of X; it tells us how the conditional mean of Y given X = xj changes with xj, j = 1, 2, …, J.
Suppose E(Y|X) = E(Y) = µY, a constant. To describe this case, we say Y is mean-independent of X.

Claim 1: If Y is mean-independent of X, then σXY = 0 (hence ρXY = 0).
Proof:
E(XY) = E(YX) = EX[E(YX|X)] = EX[E(Y|X)X];*
If E(Y|X) = E(Y), the last expression simplifies:
= EX[E(Y)X] = E(Y)E(X).
We showed:
If Y is mean-independent of X, then E(XY) = E(Y)E(X).
Return to (†) and note that σXY = 0 iff E(XY) = E(X)E(Y).
Thus σXY = 0… ///
*When we “condition” on X, we set it equal to a particular value.

CAUTION: If σXY = 0, it does not follow that E(Y|X) = constant.
Covariance/correlation capture the linear relation between X and Y. It could be that the relation is non-linear, so that E(Y|X) varies with X, but yet σXY = 0.
Example: Modify the joint distribution in Assignment 2, Part II as:

f(x, y)      y = 1    y = 2
x = –1        0.20     0.10
x = 0         0.10     0.30
x = 1         0.20     0.10

and (re)calculate Cov(X, Y).
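A sketch that carries out the “(re)calculate Cov(X, Y)” step for the modified joint distribution above: the covariance is zero even though the conditional mean E(Y|X) clearly varies with X, so zero covariance/correlation does not imply mean independence.

f = {(-1, 1): 0.20, (-1, 2): 0.10,
     ( 0, 1): 0.10, ( 0, 2): 0.30,
     ( 1, 1): 0.20, ( 1, 2): 0.10}

E_X  = sum(x * pr for (x, y), pr in f.items())
E_Y  = sum(y * pr for (x, y), pr in f.items())
E_XY = sum(x * y * pr for (x, y), pr in f.items())
print("Cov(X, Y) =", round(E_XY - E_X * E_Y, 12))            # 0 (up to floating-point rounding)

for xj in (-1, 0, 1):
    p_j = sum(pr for (x, y), pr in f.items() if x == xj)
    cond_mean = sum(y * pr for (x, y), pr in f.items() if x == xj) / p_j
    print(f"E(Y | X = {xj}) = {cond_mean:.3f}")              # 1.333, 1.750, 1.333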
Independence: Random variables X and Y are (statistically) independent if knowledge of the value of one of the variables provides no information about the other. Formally:
I.1. X and Y are independently distributed if, for all values of x and y,
Pr(Y = y | X = x) = Pr(Y = y).
From the definition of conditional probabilities,
Pr(Y = y, X = x) = Pr(Y = y | X = x) Pr(X = x).
Thus an equivalent condition for, and implication of, independence is:
I.2. Pr(Y = y, X = x) = Pr(Y = y) Pr(X = x), for all values of x and y.

Claim 2: If X and Y are independently distributed, then E(X|Y) = E(X) and E(Y|X) = E(Y).
Proof: E(Y | X = xj) = Σk yk pk|j = Σk yk (pjk/pj) = Σk yk (pj pk/pj) = Σk yk pk = E(Y). ///
(The argument for E(X|Y) = E(X) is analogous.)

SUMMARY:
Independence ⇒ Mean-independence ⇒ Zero correlation.
However: We cannot go from right to left! The stronger condition implies the weaker condition; not the other way around.
Additional Linear Function Rules: (S&W Appendix 2.1)

Suppose Z = X + Y. Then using (☯), it is easy to show
E(Z) = E(X) + E(Y).
In words, the expectation of a sum equals the sum of the expectations.
Continuing, if Z = X + Y, then Z* = X* + Y*, and Z*² = X*² + Y*² + 2X*Y*, where the asterisk denotes the deviation from the expectation. So
V(Z) = E(Z*²) = E(X*²) + E(Y*²) + 2E(X*Y*)
= V(X) + V(Y) + 2C(X,Y).
In words, the variance of a sum equals the sum of the variances plus twice the covariance.
Exercise: Use the same logic to find the variance of a difference.

Generalizing to linear functions, if
Z = a + bX + cY,
where a, b and c are constants, then
E(Z) = a + bE(X) + cE(Y),
so the deviation from the expectation is Z* = bX* + cY*, and the variance of Z is
V(Z) = E(Z*²) = b²V(X) + c²V(Y) + 2bcC(X,Y).
Still more generally, for a pair of random variables
Z1 = a1 + b1X + c1Y,
Z2 = a2 + b2X + c2Y,
where the a’s, b’s and c’s are constants, the covariance of Z1 and Z2 is
C(Z1, Z2) = b1b2V(X) + c1c2V(Y) + (b1c2 + b2c1)C(X,Y).
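A final check (not in the handout) of the linear-function rules, using the same made-up joint distribution as in the earlier sketches: the variance of X + Y computed directly matches V(X) + V(Y) + 2C(X,Y), and likewise for Z = a + bX + cY with arbitrarily chosen constants.

p = {(0, 10): 0.10, (0, 20): 0.30,
     (1, 10): 0.35, (1, 20): 0.25}

def mean_var(g):
    """Mean and variance of g(X, Y) under the joint p.f. p."""
    m = sum(g(x, y) * pr for (x, y), pr in p.items())
    v = sum((g(x, y) - m)**2 * pr for (x, y), pr in p.items())
    return m, v

m_X, V_X = mean_var(lambda x, y: x)
m_Y, V_Y = mean_var(lambda x, y: y)
C_XY = sum((x - m_X) * (y - m_Y) * pr for (x, y), pr in p.items())

print(round(mean_var(lambda x, y: x + y)[1], 4),
      round(V_X + V_Y + 2 * C_XY, 4))                        # both 23.39

a, b, c = 5.0, 2.0, -1.0                                     # arbitrary illustrative constants
print(round(mean_var(lambda x, y: a + b * x + c * y)[1], 4),
      round(b**2 * V_X + c**2 * V_Y + 2 * b * c * C_XY, 4))  # both 28.91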