Sample Exam Questions: True or False?
1. If the sample means of x and y are zero, then the estimated y-intercept is zero.
2. The slope of the simple regression model indicates how the actual value of y changes as x changes.
3. If the sample covariance between x and y is zero, then the slope of the LS regression line is zero.

Keys to Success:
1. Read the reading assignments before you come to class.
2. Attend classes, and ask questions if you do not understand.
3. Do your homework independently.

IV. SLM: Properties of Least Squares Estimators
4.1 Unbiased LS Estimators
4.2 Best LS Estimators
    - variances and covariance
    - the Gauss-Markov Theorem (BLUE)
4.3 The Probability Distribution of the LS Estimators
4.4 Estimating the Variance of the Error Term
4.5 The Coefficient of Determination: R2, Adj. R2, AIC, SC
4.6 LS Predictor and its Variance

(Reminder!) Assumptions of the Simple Linear Regression Model
SR1. $y_t = \beta_1 + \beta_2 x_t + e_t$
SR2. $E(e_t) = 0 \;\Leftrightarrow\; E(y_t) = \beta_1 + \beta_2 x_t$
SR3. $\operatorname{var}(e_t) = \sigma^2 = \operatorname{var}(y_t)$
SR4. $\operatorname{cov}(e_i, e_j) = \operatorname{cov}(y_i, y_j) = 0$
SR5. $\{x_t,\ t = 1, \dots, T\}$ is a set of fixed values that takes at least two different values.

4.1 The Unbiased Estimators – a sampling property

Quantity = β1 + β2 Price + ε

Sample Number      b1         b2
1                  70.2034    -0.0657
2                  69.0453    -0.0557
3                  73.2357    -0.0478
4                  71.3232    -0.1098
5                  65.3365    -0.0987
6                  68.6789    -0.0655
7                  69.0037    -0.0789
8                  34.3387    -0.0599
9                  57.0098    -0.0768
10                 64.4455    -0.0699

• E(b) = β
• The OLS procedure (estimator) is unbiased, but we cannot say that an individual estimate is unbiased.

Estimator: when the formulas for b1 and b2 are taken to be rules that are applied whatever the sample data turn out to be, b1 and b2 are random variables. In this context we call b1 and b2 the least squares estimators.

Estimate: when actual sample values (numbers) are substituted into the formulas, we obtain numbers that are realized values of those random variables. In this context we call b1 and b2 the least squares estimates.
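To see the sampling property concretely, the short simulation below replicates, in spirit, the table of ten samples above. It is a minimal sketch, not part of the original notes: the data-generating values (β1 = 70, β2 = −0.07), the price grid, the error standard deviation, and the number of replications are all illustrative assumptions.

```python
# Minimal Monte Carlo sketch of the sampling property E(b) = beta.
# All numbers (beta1 = 70, beta2 = -0.07, sigma, T, number of samples)
# are illustrative assumptions, not values from the notes.
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, sigma, T = 70.0, -0.07, 5.0, 40
x = np.linspace(50, 150, T)           # fixed regressor values (SR5)

estimates = []
for _ in range(10_000):               # many repeated samples
    e = rng.normal(0.0, sigma, T)     # errors satisfying SR2-SR4
    y = beta1 + beta2 * x + e
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b1 = y.mean() - b2 * x.mean()
    estimates.append((b1, b2))

b1_bar, b2_bar = np.mean(estimates, axis=0)
print(b1_bar, b2_bar)  # averages are close to 70 and -0.07: the estimator is unbiased,
                       # even though an individual estimate (like sample 8 above) can
                       # sit far from the others
```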
• Proof that E(b) = β:

$b_2 = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}
\;\Rightarrow\;
E(b_2) = \frac{\sum (x_t - \bar{x})\, E(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}
       = \frac{\beta_2 \sum (x_t - \bar{x})^2}{\sum (x_t - \bar{x})^2} = \beta_2$

using
$y_t = \beta_1 + \beta_2 x_t + e_t \;\Rightarrow\; E(y_t) = \beta_1 + \beta_2 x_t$ and
$\bar{y} = \beta_1 + \beta_2 \bar{x} + \bar{e} \;\Rightarrow\; E(\bar{y}) = \beta_1 + \beta_2 \bar{x}$,
so that $E(y_t - \bar{y}) = \beta_2 (x_t - \bar{x})$.

For the intercept, $b_1 = \bar{y} - b_2 \bar{x}$, so

$E(b_1) = E(\bar{y}) - E(b_2)\,\bar{x} = \beta_1 + \beta_2 \bar{x} - \beta_2 \bar{x} = \beta_1.$

4.2 The Best Estimators

• If the regression model assumptions SR1-SR5 are correct (SR6 is not required), then the variances and covariance of b1 and b2 are

$\operatorname{var}(b_1) = E[b_1 - E(b_1)]^2 = \sigma^2 \left[ \frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2} \right]$

$\operatorname{var}(b_2) = E[b_2 - E(b_2)]^2 = \frac{\sigma^2}{\sum (x_t - \bar{x})^2}$

$\operatorname{cov}(b_1, b_2) = E\{[b_1 - E(b_1)][b_2 - E(b_2)]\} = \sigma^2 \left[ \frac{-\bar{x}}{\sum (x_t - \bar{x})^2} \right]$

Derivation of var(b2):

$\operatorname{var}(b_2) = E[b_2 - E(b_2)]^2 = E[b_2 - \beta_2]^2$

$b_2 = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}
     = \frac{\sum (x_t - \bar{x})[\beta_2 (x_t - \bar{x}) + (e_t - \bar{e})]}{\sum (x_t - \bar{x})^2}
     = \beta_2 + \frac{\sum (x_t - \bar{x}) e_t}{\sum (x_t - \bar{x})^2}
       - \bar{e}\,\frac{\sum (x_t - \bar{x})}{\sum (x_t - \bar{x})^2}
     = \beta_2 + \frac{\sum (x_t - \bar{x}) e_t}{\sum (x_t - \bar{x})^2}$

(the last term is zero because $\sum (x_t - \bar{x}) = 0$), so

$b_2 - \beta_2 = \frac{\sum (x_t - \bar{x}) e_t}{\sum (x_t - \bar{x})^2},
\qquad
(b_2 - \beta_2)^2 = \frac{[\sum (x_t - \bar{x}) e_t]^2}{[\sum (x_t - \bar{x})^2]^2}$

$E(b_2 - \beta_2)^2
= \frac{E[\sum (x_t - \bar{x}) e_t]^2}{[\sum (x_t - \bar{x})^2]^2}
= \frac{E[\{(x_1 - \bar{x})e_1 + \cdots + (x_T - \bar{x})e_T\}\{(x_1 - \bar{x})e_1 + \cdots + (x_T - \bar{x})e_T\}]}{[\sum (x_t - \bar{x})^2]^2}
= \frac{\sum (x_t - \bar{x})^2 E(e_t^2)}{[\sum (x_t - \bar{x})^2]^2}
= \frac{\sigma^2 \sum (x_t - \bar{x})^2}{[\sum (x_t - \bar{x})^2]^2}
= \frac{\sigma^2}{\sum (x_t - \bar{x})^2}$

(the cross-product terms vanish because cov(ei, ej) = 0 by SR4).

"Need to prove that this is the best."

The Gauss-Markov Theorem

Gauss-Markov Theorem: Under the assumptions SR1-SR5 of the linear regression model, the estimators b1 and b2 have the smallest variance of all linear and unbiased estimators of β1 and β2. They are the Best Linear Unbiased Estimators (BLUE) of β1 and β2.

1. The estimators b1 and b2 are "best" when compared to similar estimators, those that are linear and unbiased. The theorem does not say that b1 and b2 are the best of all possible estimators.
2. The estimators b1 and b2 are best within their class because they have the minimum variance.
3. For the Gauss-Markov Theorem to hold, assumptions SR1-SR5 must be true. If any of them is not true, then b1 and b2 are not the best linear unbiased estimators of β1 and β2.
4. The Gauss-Markov Theorem does not depend on the assumption of normality (SR6).
5. The Gauss-Markov Theorem applies to the least squares estimators. It does not apply to the least squares estimates from a single sample.

Proof of the Gauss-Markov Theorem:
Step 1. Define a general class of linear unbiased estimators that includes the OLS estimator.
Step 2. Derive the variance of this general linear unbiased estimator.
Step 3. Show that the general estimator attains its smallest variance exactly when it equals the OLS estimator.

4.3 The Probability Distribution of the LS Estimators

• If we make the normality assumption about the error term (assumption SR6), then the least squares estimators, being linear combinations of the errors, are normally distributed:

$b_1 \sim N\!\left( \beta_1,\; \frac{\sigma^2 \sum x_t^2}{T \sum (x_t - \bar{x})^2} \right),
\qquad
b_2 \sim N\!\left( \beta_2,\; \frac{\sigma^2}{\sum (x_t - \bar{x})^2} \right)$

• If assumptions SR1-SR5 hold, and if the sample size T is sufficiently large, then the least squares estimators have a distribution that approximates the normal distributions shown above.
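As a quick numerical check of the Section 4.2 formulas, the sketch below compares the analytic var(b1), var(b2), and cov(b1, b2) with their Monte Carlo counterparts. The design (the x values, σ, T, and the number of replications) is an illustrative assumption, not data from the notes.

```python
# Sketch checking the variance/covariance formulas for b1 and b2 against simulation.
# Design values (x, sigma, T, replications) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, sigma, T = 2.0, 0.5, 1.0, 30
x = np.linspace(0, 10, T)
Sxx = np.sum((x - x.mean()) ** 2)

# Analytic formulas from Section 4.2
var_b1 = sigma**2 * np.sum(x**2) / (T * Sxx)
var_b2 = sigma**2 / Sxx
cov_b1b2 = -sigma**2 * x.mean() / Sxx

draws = []
for _ in range(50_000):
    y = beta1 + beta2 * x + rng.normal(0.0, sigma, T)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b1 = y.mean() - b2 * x.mean()
    draws.append((b1, b2))
draws = np.array(draws)

print(var_b1, draws[:, 0].var())        # var(b1): analytic vs simulated
print(var_b2, draws[:, 1].var())        # var(b2): analytic vs simulated
print(cov_b1b2, np.cov(draws.T)[0, 1])  # cov(b1, b2): analytic vs simulated
```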
4.4 Estimating the Variance of the Error Term

The variance of the random error et is

$\operatorname{var}(e_t) = \sigma^2 = E[e_t - E(e_t)]^2 = E(e_t^2)$

since the assumption E(et) = 0 is maintained. Because the "expectation" is an average value, we might consider estimating σ² as the average of the squared errors,

$\hat{\sigma}^2 = \frac{\sum e_t^2}{T}.$

• Recall that the random errors are $e_t = y_t - \beta_1 - \beta_2 x_t$.
• The least squares residuals are obtained by replacing the unknown parameters with their least squares estimators, $\hat{e}_t = y_t - b_1 - b_2 x_t$, which suggests

$\hat{\sigma}^2 = \frac{\sum \hat{e}_t^2}{T}.$

• There is a simple modification that produces an unbiased estimator:

$\hat{\sigma}^2 = \frac{\sum \hat{e}_t^2}{T - 2}, \qquad E(\hat{\sigma}^2) = \sigma^2.$

• We need to show that $\hat{\sigma}^2 = \sum \hat{e}_t^2 / T$ is biased,
$E\!\left( \frac{1}{T} \sum \hat{e}_t^2 \right) = \frac{1}{T} E\!\left( \sum \hat{e}_t^2 \right) \neq \sigma^2,$
while $\hat{\sigma}^2 = \sum \hat{e}_t^2 / (T - 2)$ is unbiased,
$E\!\left( \frac{1}{T - 2} \sum \hat{e}_t^2 \right) = \frac{1}{T - 2} E\!\left( \sum \hat{e}_t^2 \right) = \sigma^2.$

• Evaluate $E(\sum \hat{e}_t^2)$. Since $\beta_1 = \bar{y} - \beta_2 \bar{x} - \bar{e}$ and $b_1 = \bar{y} - b_2 \bar{x}$,

$\hat{e}_t = y_t - b_1 - b_2 x_t
          = \beta_1 + \beta_2 x_t + e_t - (b_1 + b_2 x_t)
          = e_t + (\beta_1 - b_1) + (\beta_2 - b_2) x_t
          = (e_t - \bar{e}) - (b_2 - \beta_2)(x_t - \bar{x}).$

Squaring and summing,

$\sum \hat{e}_t^2 = \sum (e_t - \bar{e})^2 + (b_2 - \beta_2)^2 \sum (x_t - \bar{x})^2 - 2 (b_2 - \beta_2) \sum (e_t - \bar{e})(x_t - \bar{x}).$

Take expectations term by term; remember, $E(\sum \hat{e}_t^2)$ is the quantity we are looking for.

(1) $E\{\sum (e_t - \bar{e})^2\} = E\{\sum e_t^2 + T \bar{e}^2 - 2 \bar{e} \sum e_t\}
   = \sum E(e_t^2) + T\,\frac{\sum E(e_t^2)}{T^2} - 2\,\frac{E(\sum e_t \sum e_t)}{T}
   = T\sigma^2 + \sigma^2 - 2\sigma^2 = (T - 1)\sigma^2$

(2) $E\{(b_2 - \beta_2)^2 \sum (x_t - \bar{x})^2\}
   = \frac{\sigma^2}{\sum (x_t - \bar{x})^2} \sum (x_t - \bar{x})^2 = \sigma^2$

(3) Noting that $\sum (e_t - \bar{e})(x_t - \bar{x}) = \sum e_t (x_t - \bar{x})$ because $\sum (x_t - \bar{x}) = 0$,

$E\{-2 (b_2 - \beta_2) \sum e_t (x_t - \bar{x})\}
= -2\, E\!\left\{ \frac{\sum (x_t - \bar{x})(y_t - \bar{y}) - \beta_2 \sum (x_t - \bar{x})^2}{\sum (x_t - \bar{x})^2} \sum e_t (x_t - \bar{x}) \right\}$

$= -2\, E\!\left\{ \frac{\sum (x_t - \bar{x})\{\beta_2 (x_t - \bar{x}) + e_t\} - \beta_2 \sum (x_t - \bar{x})^2}{\sum (x_t - \bar{x})^2} \sum e_t (x_t - \bar{x}) \right\}
= -2\, E\!\left\{ \frac{\sum e_t (x_t - \bar{x}) \sum e_t (x_t - \bar{x})}{\sum (x_t - \bar{x})^2} \right\}$

$= -2\,\frac{\sum (x_t - \bar{x})^2 E(e_t^2)}{\sum (x_t - \bar{x})^2} = -2 E(e_t^2) = -2\sigma^2$

Adding the three terms,

$E\!\left( \sum \hat{e}_t^2 \right) = (T - 1)\sigma^2 + \sigma^2 - 2\sigma^2 = (T - 2)\sigma^2.$

Therefore

$E\!\left( \frac{1}{T} \sum \hat{e}_t^2 \right) = \frac{(T - 2)}{T}\,\sigma^2 \neq \sigma^2$ (biased), while
$E(\hat{\sigma}^2) = E\!\left( \frac{1}{T - 2} \sum \hat{e}_t^2 \right) = \frac{1}{T - 2}\,(T - 2)\sigma^2 = \sigma^2$ (unbiased),

so $\hat{\sigma}^2 = \frac{\sum \hat{e}_t^2}{T - 2}$ satisfies $E(\hat{\sigma}^2) = \sigma^2$.

Estimating the Variances and Covariances of the Least Squares Estimators

• Replace the unknown error variance σ² in the earlier formulas with σ̂² to obtain the variance and covariance estimators:

$\widehat{\operatorname{var}}(b_1) = \hat{\sigma}^2 \left[ \frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2} \right],
\qquad
\widehat{\operatorname{var}}(b_2) = \frac{\hat{\sigma}^2}{\sum (x_t - \bar{x})^2},
\qquad
\widehat{\operatorname{cov}}(b_1, b_2) = \hat{\sigma}^2 \left[ \frac{-\bar{x}}{\sum (x_t - \bar{x})^2} \right]$

$se(b_1) = \sqrt{\widehat{\operatorname{var}}(b_1)}, \qquad se(b_2) = \sqrt{\widehat{\operatorname{var}}(b_2)}$
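A minimal sketch of these estimators, assuming simulated data (the data-generating values and sample size are illustrative, not from the notes): it computes the residuals, the unbiased σ̂² with the T − 2 divisor, and the estimated standard errors se(b1) and se(b2).

```python
# Sketch of the unbiased error-variance estimator and the estimated standard errors.
# The data here are simulated for illustration; all numeric values are assumptions.
import numpy as np

rng = np.random.default_rng(2)
T = 40
x = rng.uniform(20, 100, T)
y = 40.0 + 0.13 * x + rng.normal(0.0, 35.0, T)   # assumed data-generating values

Sxx = np.sum((x - x.mean()) ** 2)
b2 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b1 = y.mean() - b2 * x.mean()
e_hat = y - b1 - b2 * x                          # least squares residuals

sigma2_hat = np.sum(e_hat**2) / (T - 2)          # unbiased: divide by T - 2, not T
var_b1_hat = sigma2_hat * np.sum(x**2) / (T * Sxx)
var_b2_hat = sigma2_hat / Sxx
cov_b1b2_hat = -sigma2_hat * x.mean() / Sxx

se_b1, se_b2 = np.sqrt(var_b1_hat), np.sqrt(var_b2_hat)
print(sigma2_hat, se_b1, se_b2)
```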
4.5 The Coefficient of Determination, R²

Two major reasons for analyzing the model $y_t = \beta_1 + \beta_2 x_t + e_t$ are
1. Estimation: to explain how the dependent variable (yt) changes as the independent variable (xt) changes, and
2. Prediction: to predict y0 given an x0.

• For the "prediction" purpose, we introduce the "explanatory" variable xt in the hope that its variation will "explain" the variation in yt. How well does the explanatory variable explain the variation in yt?

How to compute the coefficient of determination, R²?

1. To develop a measure of the variation in yt that is explained by the model, we begin by separating yt into its explainable and unexplainable components: $y_t = E(y_t) + e_t$.
   • $E(y_t) = \beta_1 + \beta_2 x_t$ is the explainable, "systematic" component of yt;
   • $e_t$ is the random, unsystematic, unexplainable noise component of yt.

2. We can estimate the unknown parameters β1 and β2 and decompose the value of yt as

$y_t = b_1 + b_2 x_t + \hat{e}_t = \hat{y}_t + \hat{e}_t, \qquad \bar{y} = b_1 + b_2 \bar{x} + \bar{\hat{e}},$

so that (using $\bar{\hat{e}} = 0$)

$y_t - \bar{y} = b_2 (x_t - \bar{x}) + (\hat{e}_t - \bar{\hat{e}}) = b_2 (x_t - \bar{x}) + \hat{e}_t = (\hat{y}_t - \bar{y}) + \hat{e}_t.$

3. Squaring and summing gives SST = SSR + SSE:

$\sum (y_t - \bar{y})^2 = b_2^2 \sum (x_t - \bar{x})^2 + \sum \hat{e}_t^2 = \sum (\hat{y}_t - \bar{y})^2 + \sum \hat{e}_t^2$

$\sum (y_t - \bar{y})^2$ = Sum of Squares for Total variation (SST)
$\sum (\hat{y}_t - \bar{y})^2$ = Sum of Squares from the Regression (SSR)
$\sum \hat{e}_t^2$ = Sum of Squares from the Errors (SSE)

[Figure: scatter of yt against xt with the fitted line ŷt = b1 + b2 xt. The total deviation (yt − ȳ) splits into the explained part (ŷt − ȳ) and the residual (yt − ŷt); summing the squared pieces gives SST, SSR, and SSE, and R² = SSR/SST.]

4. R², a measure of the proportion of the variation in y explained by x within the regression model:

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
     = \frac{b_2^2 \sum (x_t - \bar{x})^2}{\sum (y_t - \bar{y})^2}
     = 1 - \frac{\sum \hat{e}_t^2}{\sum (y_t - \bar{y})^2}$

• R² is called the coefficient of determination.
• The closer it is to one, the better the job we have done in explaining the variation in yt with $\hat{y}_t = b_1 + b_2 x_t$, and the greater the predictive ability of our model over all the sample observations.
• R² = 1 when SSE = 0; R² = 0 when SSR = 0; otherwise 0 < R² < 1.

Uncentered vs. centered R²

1. Centered:
$\sum_t (y_t - \bar{y})^2 = \sum_t (\hat{y}_t - \bar{y})^2 + \sum_t \hat{e}_t^2
\;\Rightarrow\;
R^2 = \frac{\sum_t (\hat{y}_t - \bar{y})^2}{\sum_t (y_t - \bar{y})^2}$

2. Uncentered:
$\sum_t y_t^2 = \sum_t \hat{y}_t^2 + \sum_t \hat{e}_t^2
\;\Rightarrow\;
R^2 = \frac{\sum_t \hat{y}_t^2}{\sum_t y_t^2}$

[Figure: the same scatter with deviations measured from zero rather than from ȳ: (yt − 0) splits into (ŷt − 0) and (yt − ŷt), giving the uncentered SST = Σyt², SSR = Σŷt², and SSE = Σêt².]

1. Uncentered R² in matrix form:

$\mathbf{y} = \mathbf{X}\mathbf{b} + \hat{\mathbf{e}} = \hat{\mathbf{y}} + \hat{\mathbf{e}}$

$\mathbf{y}'\mathbf{y} = (\hat{\mathbf{y}} + \hat{\mathbf{e}})'(\hat{\mathbf{y}} + \hat{\mathbf{e}})
= \hat{\mathbf{y}}'\hat{\mathbf{y}} + \hat{\mathbf{e}}'\hat{\mathbf{e}}
= \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} + \hat{\mathbf{e}}'\hat{\mathbf{e}}
\qquad (\text{since } \hat{\mathbf{y}}'\hat{\mathbf{e}} = \mathbf{b}'\mathbf{X}'\hat{\mathbf{e}} = 0)$

$= \{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\}'\mathbf{X}'\mathbf{X}\{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\} + \hat{\mathbf{e}}'\hat{\mathbf{e}}
= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} + \hat{\mathbf{e}}'\hat{\mathbf{e}}
= \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} + \hat{\mathbf{e}}'\hat{\mathbf{e}}$

$R^2 = \frac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}{\mathbf{y}'\mathbf{y}}
\qquad\Leftrightarrow\qquad
\sum_t y_t^2 = \sum_t \hat{y}_t^2 + \sum_t \hat{e}_t^2, \quad R^2 = \frac{\sum_t \hat{y}_t^2}{\sum_t y_t^2}$

2. Centered R² in matrix form:

$\sum_t (y_t - \bar{y})^2 = \sum_t (\hat{y}_t - \bar{y})^2 + \sum_t \hat{e}_t^2
\;\Rightarrow\;
R^2 = \frac{\sum_t (\hat{y}_t - \bar{y})^2}{\sum_t (y_t - \bar{y})^2}$

$\sum_t (y_t - \bar{y})^2 = \sum_t (y_t^2 + \bar{y}^2 - 2 y_t \bar{y})
= \sum_t y_t^2 + T\left(\frac{\sum_t y_t}{T}\right)^2 - 2\,\frac{\sum_t y_t \sum_t y_t}{T}
= \sum_t y_t^2 - \frac{(\sum_t y_t)^2}{T}
= \mathbf{y}'\mathbf{y} - \frac{(\mathbf{i}'\mathbf{y})^2}{T}$

$R^2 = \frac{\sum_t (\hat{y}_t - \bar{y})^2}{\sum_t (y_t - \bar{y})^2}
= \frac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - (\mathbf{i}'\mathbf{y})^2/T}{\mathbf{y}'\mathbf{y} - (\mathbf{i}'\mathbf{y})^2/T}
= \frac{\mathbf{b}'\mathbf{X}'\mathbf{A}\mathbf{X}\mathbf{b}}{\mathbf{y}'\mathbf{A}\mathbf{y}},
\qquad
\mathbf{A}_{T \times T} = \mathbf{I}_T - \frac{1}{T}\mathbf{i}\mathbf{i}'$

R² is a descriptive measure. By itself it does not measure the quality of the regression model. It is not the objective of regression analysis to find the model with the highest R², and following a regression strategy focused solely on maximizing R² is not a good idea.

Why is maximizing R² not the objective of regression?
Conceptually: R² has to do with predictability only; it measures the linear relationship between yt and the fitted values ŷt (the estimated E(yt)).
Empirically: adding explanatory variables to the regression never lowers R², so a larger model always looks "better" by this criterion.

5. Adjusted R², the coefficient of determination adjusted for degrees of freedom:

$\bar{R}^2 = 1 - \frac{SSE/(T - 2)}{SST/(T - 1)} = 1 - \frac{\hat{\sigma}^2}{\sum (y_t - \bar{y})^2/(T - 1)},
\qquad \bar{R}^2 < R^2$

6. Akaike Information Criterion:

$AIC = \ln\!\left(\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T}\right) + \frac{2k}{T}$

7. Schwarz Criterion:

$SC = \ln\!\left(\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T}\right) + \frac{k \ln T}{T}$

The computer output usually contains an Analysis of Variance table. For a simple regression with 40 observations it is:

Analysis of Variance Table
Source        DF           Sum of Squares
Explained      1  (k - 1)     25221.2229
Unexplained   38  (T - k)     54311.3314
Total         39  (T - 1)     79532.5544
R-square      0.3171

Sample Computer Output

Dependent Variable: FOODEXP
Method: Least Squares
Sample: 1 40
Included observations: 40

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            40.76756      22.13865     1.841465      0.0734
INCOME       0.128289      0.030539     4.200777      0.0002

R-squared            0.317118    Mean dependent var      130.3130
Adjusted R-squared   0.299148    S.D. dependent var      45.15857
S.E. of regression   37.80536    Akaike info criterion   10.15149
Sum squared error    54311.33    Schwarz criterion       10.23593
Log likelihood      -201.0297    Durbin-Watson stat      2.370373
F-statistic          17.64653    Prob(F-statistic)       0.000155
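The sketch below computes R², adjusted R², AIC, and SC exactly as defined above, on illustrative simulated data (we do not have the FOODEXP/INCOME data, so the numbers will not match the output). Note that the AIC and SC reported in the sample output appear to be log-likelihood-based versions (−2ℓ/T + 2k/T and −2ℓ/T + k ln T / T), which differ from the SSE-based formulas above by an additive constant.

```python
# Sketch computing R^2, adjusted R^2, AIC, and SC with the formulas in the notes.
# Simulated data; all numeric values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
T, k = 40, 2                                     # k = number of estimated coefficients
x = rng.uniform(20, 100, T)
y = 40.0 + 0.13 * x + rng.normal(0.0, 35.0, T)   # assumed data-generating values

b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b1 = y.mean() - b2 * x.mean()
e_hat = y - b1 - b2 * x

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum(e_hat ** 2)
R2 = 1.0 - SSE / SST                             # = SSR / SST
R2_adj = 1.0 - (SSE / (T - k)) / (SST / (T - 1))
AIC = np.log(SSE / T) + 2 * k / T                # SSE-based formulas from the notes;
SC = np.log(SSE / T) + k * np.log(T) / T         # software may report a variant that
                                                 # differs by an additive constant
print(R2, R2_adj, AIC, SC)
```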
4.6 The Least Squares Predictor

We want to predict, for a given value x0 of the explanatory variable, the value of the dependent variable y0, which is given by

$y_0 = \beta_1 + \beta_2 x_0 + e_0,$

where e0 is a random error with mean E(e0) = 0 and variance var(e0) = σ². We also assume that cov(e0, et) = 0 for all t.

The least squares predictor of y0 is $\hat{y}_0 = b_1 + b_2 x_0$.

The forecast error is

$f = \hat{y}_0 - y_0 = b_1 + b_2 x_0 - (\beta_1 + \beta_2 x_0 + e_0) = (b_1 - \beta_1) + (b_2 - \beta_2) x_0 - e_0.$

The expected value of f is

$E(f) = E(\hat{y}_0 - y_0) = E(b_1 - \beta_1) + E(b_2 - \beta_2)\,x_0 - E(e_0) = 0 + 0 - 0 = 0,$

so $\hat{y}_0$ is an unbiased linear predictor of y0.

Variance of the forecast error

• From $f = (b_1 - \beta_1) + (b_2 - \beta_2) x_0 - e_0$,

$\operatorname{var}(f) = \operatorname{var}(b_1) + x_0^2 \operatorname{var}(b_2) + \operatorname{var}(e_0) + 2 x_0 \operatorname{cov}(b_1, b_2)$

$= \frac{\sigma^2 \sum x_t^2}{T \sum (x_t - \bar{x})^2} + \frac{\sigma^2 x_0^2}{\sum (x_t - \bar{x})^2} + \sigma^2 - \frac{2 \sigma^2 x_0 \bar{x}}{\sum (x_t - \bar{x})^2}$

$= \sigma^2 \left[ \frac{\bar{x}^2}{\sum (x_t - \bar{x})^2} + \frac{1}{T} + \frac{x_0^2}{\sum (x_t - \bar{x})^2} + 1 - \frac{2 x_0 \bar{x}}{\sum (x_t - \bar{x})^2} \right]
= \sigma^2 \left[ 1 + \frac{1}{T} + \frac{(x_0 - \bar{x})^2}{\sum (x_t - \bar{x})^2} \right]$

• The covariance used above:

$\operatorname{cov}(b_1, b_2) = E[(b_1 - E(b_1))(b_2 - E(b_2))] = E[(\bar{y} - b_2 \bar{x} - \beta_1)(b_2 - \beta_2)]
= E[(\beta_1 + \beta_2 \bar{x} + \bar{e} - b_2 \bar{x} - \beta_1)(b_2 - \beta_2)]$

$= E\{-\bar{x}(b_2 - \beta_2)(b_2 - \beta_2)\}
= -\bar{x}\, E(b_2 - \beta_2)^2
= \frac{-\bar{x}\, \sigma^2}{\sum (x_t - \bar{x})^2}$

(the term involving $\bar{e}$ drops out because $E[\bar{e}(b_2 - \beta_2)] = \sigma^2 \sum (x_t - \bar{x}) / (T \sum (x_t - \bar{x})^2) = 0$).

• "Variance of the predicted value":

$\operatorname{var}(\hat{y}_0) = \operatorname{var}(b_1 + b_2 x_0) = \sigma^2 \left[ \frac{1}{T} + \frac{(x_0 - \bar{x})^2}{\sum (x_t - \bar{x})^2} \right]$

Estimated variance of the forecast error

$\operatorname{var}(f) = \operatorname{var}(\hat{y}_0 - y_0) = \sigma^2 \left[ 1 + \frac{1}{T} + \frac{(x_0 - \bar{x})^2}{\sum (x_t - \bar{x})^2} \right]$

The forecast error variance is estimated by replacing σ² with its estimator σ̂²,

$\widehat{\operatorname{var}}(f) = \hat{\sigma}^2 \left[ 1 + \frac{1}{T} + \frac{(x_0 - \bar{x})^2}{\sum (x_t - \bar{x})^2} \right],$

and the square root of the estimated variance is the standard error of the forecast,

$se(f) = \sqrt{\widehat{\operatorname{var}}(f)}.$
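A minimal sketch of the least squares predictor and its forecast standard error, assuming simulated data and an illustrative x0 (neither is from the notes); it evaluates ŷ0, se(f), and, for comparison, the standard error of the predicted value.

```python
# Sketch of the least squares predictor and the forecast standard error.
# x0 and the simulated data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
T = 40
x = rng.uniform(20, 100, T)
y = 40.0 + 0.13 * x + rng.normal(0.0, 35.0, T)   # assumed data-generating values

Sxx = np.sum((x - x.mean()) ** 2)
b2 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b1 = y.mean() - b2 * x.mean()
e_hat = y - b1 - b2 * x
sigma2_hat = np.sum(e_hat ** 2) / (T - 2)

x0 = 75.0                                        # assumed value of the explanatory variable
y0_hat = b1 + b2 * x0                            # point prediction

var_f_hat = sigma2_hat * (1.0 + 1.0 / T + (x0 - x.mean()) ** 2 / Sxx)  # forecast error
se_f = np.sqrt(var_f_hat)                                              # its standard error

var_y0hat = sigma2_hat * (1.0 / T + (x0 - x.mean()) ** 2 / Sxx)  # variance of predicted value
print(y0_hat, se_f, np.sqrt(var_y0hat))
```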
© Copyright 2024