Stat 104: Quantitative Methods for Economists
Class 7: Regression, a first look

Correlation vs. Regression
A scatter plot can be used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
Correlation is only concerned with the strength of the relationship.
No causal effect is implied with correlation.
In correlation, the two variables are treated as equals. In regression, one variable is treated as the independent (predictor) variable X and the other as the dependent (outcome) variable Y.

Regression Analysis
Regression analysis is a statistical technique that attempts to explain movements in one variable, the dependent variable, as a function of movements in a set of other variables, called independent (or explanatory) variables, through the quantification of a single equation.
However, a regression result, no matter how statistically significant, cannot prove causality. All regression analysis can do is test whether a significant quantitative relationship exists.

Some Notes and Terms
In Simple Linear Regression, one X variable is used to explain the variable Y.
In Multiple Regression, more than one X variable is used to explain the variable Y.
Model Assumption: Y and the X's are linearly related.
For now we will concentrate on simple regression.

Example
Suppose we want to predict the sale price of used Honda Accords.
Many factors influence the price of a used car: model year, condition, transmission type, 2 or 4 door, color, mileage, how badly the owner wants to sell, etc.
We suspect that the price of the Accord depends on its mileage to some extent. So if you knew just an Accord's mileage but not its price, maybe you could guess (or predict) its price.

Notation
Y = price of the car
X = mileage of the car
We will choose just the variable mileage and see if price can be predicted from the mileage of the car.
To see if this would work, let's collect some data where you know both price and mileage and see if there is a relationship.

Obtain data with n = 100:
Y = price
X = mileage
How can we see what is going on?

For any given Accord, an observation consists of the pair $(X_i, Y_i)$.
In general:
Let n be the number of observations in a sample.
Y will denote the variable we want to predict.
X will denote the explanatory variable.

Scatter Plot of Car Data
[Figure: scatter plot of Price (about 13500 to 16000) against Odometer (about 20000 to 50000), with a straight line drawn through the points.]
There appears to be a "linear relationship" between price and mileage.

What's going on?
Where did the line come from? I drew it. You might draw a different one.
Is there an exact linear relationship between mileage and price? How can you tell?
Important Point: (simple) regression is a method for fitting a line to data.

Review: Equation of a Line
$$Y = b_0 + b_1 X$$
[Figure: a line plotted against X, with the intercept $b_0$ and the slope $b_1$ marked.]

How do you interpret the slope?
We are interested in what happens when X changes by one unit:
$$Y(X) = b_0 + b_1 X$$
$$Y(X + 1) = b_0 + b_1 (X + 1)$$
Subtract the first equation from the second:
$$b_1 = Y(X + 1) - Y(X)$$
That is, if X changes by 1 unit, Y changes by $b_1$.

How do you interpret the intercept?
The intercept $b_0$ is simply the value of Y when X = 0.
$$Y = b_0 + b_1 X$$
In summary: the slope of the line is the amount by which Y increases when X increases by 1 unit. The intercept is the value of Y when X = 0.

Let's Now Fit a Line to Our Data
The equation of our line is given by
$$\hat{Y} = b_0 + b_1 X$$
Note: we use the symbol $\hat{Y}$ ("Y-hat") to stand for the fitted line; Y will always stand for the observed values.

The fitted value
We have data: $(X_i, Y_i)$, $i = 1, 2, \ldots,$
$n$.
We have the components of a line: $b_0$ and $b_1$.
For the $i$th observation the fitted value is defined to be:
$$\hat{Y}_i = b_0 + b_1 X_i$$
[Figure: the fitted line, with an observed point $Y_i$ above $X_i$ and the fitted value $\hat{Y}_i$ on the line.]

For the $i$th observation the residual is defined to be:
$$e_i = Y_i - \hat{Y}_i$$
[Figure: points scattered around the fitted line, with dashed vertical lines joining each point to the line.]
The lengths of the dashed lines are the residuals. Below the line the residual is negative, and above the line the residual is positive.

How do we fit a line to the data?
Ideally, we want the residual to be zero for all the points. If the residual is zero, it means that the fitted line passes through the observed point.
But it is clear from the scatter diagram that a straight line cannot pass through all the points, so this will be impossible to do.
We can't get the residuals to be 0, but we want them "small".
So, a good choice of line is one for which the residuals are small.

We need to choose a "criterion" which measures how small all the residuals are. The smaller the criterion, the better.
We then choose the line that makes the criterion smallest.
What criterion should we use?

The most popular criterion for fitting a line is called the least squares method.
This method says to find $b_0$ and $b_1$ that solve
$$\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
These two values define a line.
The farther away a point is from the estimated line, the more serious the error. By squaring the errors, we "penalize" large residuals so that we can avoid them.

I'm MAD Again
As a quick aside, (least squares) regression fits a line by choosing the $b_0$ and $b_1$ that make this sum as small as possible:
$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
Why not minimize the sum of the absolute residuals instead? This is called LAD (least absolute deviation) regression:
$$\min_{b_0, b_1} \sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|$$

The values of $b_0$ and $b_1$ which minimize the residual sum of squares are:
$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{or} \quad b_1 = r \, \frac{s_y}{s_x}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
Here $r$ is the sample correlation between X and Y, and $s_x$ and $s_y$ are the sample standard deviations.
These formulas can be derived using calculus; we pass.
These formulas are the intercept and slope for the "best fitting line".

Performing Regression in Stata
[Stata regression output not recovered. On the slide the two coefficient estimates are labeled $b_0$ and $b_1$, and three other parts of the printout are marked "Ignore".]
We will eventually explain this complete printout; ignore most of it for now.

Fitted Line Plot in Stata
[Figure: Price against Odometer, with the fitted values drawn as a line.]
Editorial comment: had to google these commands; Stata is a bit goofy to use.

Regression Plot
Price = 17066.8 - 0.0623155 Odometer
S = 303.138, R-Sq = 65.0%, R-Sq(adj) = 64.7%
[Figure: Price against Odometer with the fitted line.]
Interpretation of the slope: for each additional mile on the odometer, the price decreases by an average of $0.062.
Do not interpret the intercept as "cars that have not been driven cost $17066.8".
R-sq: 65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.

Is the effect of mileage on price important?

Prediction
The regression line says that
price = 17066.8 - 0.0623(odometer)
So the predicted price for an Accord with 40000 miles is
price = 17066.8 - 0.0623(40000) = $14574.8

Prediction in Stata
Run the regression command in Stata, then run the following command.
[Stata command and output not recovered.]

Another Example
Some critics of television complain that the amount of violence shown on television contributes to violence in our society. Others point out that television also contributes to the high level of obesity among children. We may have to add financial problems to the list. A sociologist theorized that people who watch television frequently are exposed to many commercials, which in turn leads them to buy more, finally resulting in increasing debt.
To test this belief, a sample of 430 families was drawn. For each family, the total debt and the number of hours the television is turned on per week were recorded.

Note from the Prediction in Stata slide: there is a little round-off error in the hand calculation on the Prediction slide; that is why the predictions don't match exactly.

The Scatter Plot
[Figure: scatter plot of Debt (about -100000 to 300000) against hours of television per week (0 to 60).]

The Regression Results
[Stata regression output not recovered.]
What is the interpretation?

A friend of your family tells you that his family watches 40 hours of television per week. What would you predict your friend's family's debt to be?
We would predict the debt to be: 48040 + 2581.8(40) = 151312

The Affleck Hypothesis
From Boston Magazine: the further from Boston an Affleck movie is, the worse the movie.

The Analysis
[Figure: scatter plot of rtscore (20 to 100) against miles (0 to 5000).]

The Regression
[Stata regression output not recovered.]
Negative correlation between Rotten Tomatoes movie score and mileage from Boston.
Interpret.

Predict
[Stata output not recovered.]

Compare
[Stata output not recovered.]

Market Model
In finance, a popular model is to regress stock returns against returns of some market index, such as the S&P 500.
$$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$$
The slope of the regression line, referred to as "beta", is a measure of how sensitive a stock is to movements in the market.

Regression Example: Market Model
$$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$$
Beta = 0: cash under the mattress
Beta = 1: same risk as the market
Beta < 1: safer than the market
Beta > 1: riskier than the market

Leveraged ETFs are the RAGE
FAZ = -3 x Financial Index
FAS = +3 x Financial Index
SDS = -2 x S&P 500
SSO = +2 x S&P 500
DDM = +2 x DJ30
DXD = -2 x DJ30

ETFs are Very Popular
[Figure not recovered; the slide carries the caption "Hey!"]

Market Model for SDS and SSO
SPY is the index, SSO is +2 and SDS is -2 (theoretically).

Market Model for SDS
[Stata regression output not recovered; the slope is labeled "Beta".]
Correlation is NOT Beta.

Market Model for SSO
[Stata regression output not recovered.]

The Search for Alpha
In the market model, what is the stock (mutual fund) return if the index does nothing?
$$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$$
People talk about "buying someone's alpha", i.e.
what does the fund manager bring to the table above the index returns?

Things you should know
The least squares estimates
Interpretation and Prediction

Sections Covered from the Book
Chapter 7 (just 7.1-7.2 for now)
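
As a recap of the least squares estimates and the prediction step, here is a minimal sketch in plain Python (not the Stata used in class). The mileage and price numbers are made up for illustration and are not the class data set, and the function names are hypothetical. The last step simply plugs 40000 miles into the fitted line reported on the slides.

```python
import math


def mean(values):
    return sum(values) / len(values)


def least_squares(x, y):
    """Return (b0, b1) minimizing sum((y_i - b0 - b1 * x_i) ** 2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up toy data (an assumption for illustration, NOT the class data).
mileage = [20000, 30000, 40000, 50000]
price = [15900, 15100, 14700, 13900]

b0, b1 = least_squares(mileage, price)

# Fitted values and residuals: e_i = Y_i - Yhat_i
fitted = [b0 + b1 * xi for xi in mileage]
residuals = [yi - fi for yi, fi in zip(price, fitted)]

# The alternative slope formula b1 = r * s_y / s_x gives the same number.
b1_alt = correlation(mileage, price) * sample_sd(price) / sample_sd(mileage)

# Prediction using the fitted line reported on the slides (rounded slope).
slide_b0, slide_b1 = 17066.8, -0.0623
predicted_price = slide_b0 + slide_b1 * 40000

print("toy fit: b0 =", round(b0, 2), " b1 =", round(b1, 4))
print("residuals:", [round(e, 2) for e in residuals])
print("slide prediction at 40000 miles:", round(predicted_price, 1))
```

The residuals from a least squares line with an intercept always sum to zero (up to floating-point error), which is a quick sanity check on any hand computation.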
© Copyright 2025