Stat 104: Quantitative Methods for Economists
Class 7: Regression, a first look

Correlation vs. Regression
A scatter plot can be used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
Correlation is only concerned with the strength of the relationship.
No causal effect is implied by correlation.
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable X and the other the dependent (outcome) variable Y.
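The difference between "treated as equals" and "one predictor, one outcome" can be seen numerically. The sketch below uses made-up data and the least squares slope that the later slides introduce; the helper names `correlation` and `slope` are illustrative and not part of the course material.

```python
from math import sqrt

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Sample correlation coefficient r between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / sqrt(s_aa * s_bb)

def slope(predictor, outcome):
    """Least squares slope from regressing outcome on predictor."""
    p_bar, o_bar = mean(predictor), mean(outcome)
    num = sum((p - p_bar) * (o - o_bar) for p, o in zip(predictor, outcome))
    den = sum((p - p_bar) ** 2 for p in predictor)
    return num / den

# Correlation is symmetric: swapping the variables changes nothing.
r_xy = correlation(x, y)
r_yx = correlation(y, x)

# Regression is not: the slope depends on which variable is the predictor.
slope_y_on_x = slope(x, y)
slope_x_on_y = slope(y, x)

print("r(x, y) =", r_xy, " r(y, x) =", r_yx)
print("slope of Y on X =", slope_y_on_x, " slope of X on Y =", slope_x_on_y)
```

The two slopes differ, but their product equals r squared, which ties the two views together.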
Regression Analysis
Regression analysis is a statistical technique that attempts to explain movements in one variable, the dependent variable, as a function of movements in a set of other variables, called independent (or explanatory) variables, through the quantification of a single equation.
However, a regression result, no matter how statistically significant, cannot prove causality. All regression analysis can do is test whether a significant quantitative relationship exists.
Some Notes and Terms
In Simple Linear Regression, one X variable is used to explain the variable Y.
In Multiple Regression, more than one X variable is used to explain the variable Y.
Model Assumption: Y and the X's are linearly related.
For now we will concentrate on simple regression.
Example
Suppose we want to predict the sale price of used Honda Accords.
Many factors influence the price of a used car: model year, condition, transmission type, 2 or 4 doors, color, mileage, how badly the owner wants to sell, etc.
We will choose just the variable mileage and see if price can be predicted from the mileage of the car.
Notation
Y = price of the car
X = mileage of the car
We suspect that the price of the Accord depends on its mileage to some extent. So if you knew an Accord's mileage but not its price, maybe you could guess (or predict) its price.
To see if this would work, let's collect some data where you know both price and mileage and see if there is a relationship.
For any given Accord, an observation consists of the pair $(X_i, Y_i)$.
In general:
Let n be the number of observations in a sample.
Y will denote the variable we want to predict.
X will denote the explanatory variable.
Obtain data (n = 100): Y = price, X = mileage.
How can we see what is going on?
Scatter Plot of Car Data
[Figure: scatter plot of Price (vertical axis, ticks from 13500 to 16000) against Odometer (horizontal axis, ticks from 20000 to 50000)]
What's going on?
There appears to be a "linear relationship" between price and mileage.
Where did the line come from? I drew it. You might draw a different one.
Is there an exact linear relationship between mileage and price? How can you tell?
Important Point
(Simple) regression is a method for fitting a line to data.
Review: Equation of a Line
$Y = b_0 + b_1 X$
[Figure: a straight line plotted with Y on the vertical axis and X on the horizontal axis, with the intercept $b_0$ and the slope $b_1$ marked]
How do you interpret the slope?
We are interested in what happens when X changes by one unit:
$Y(X) = b_0 + b_1 X$
$Y(X + 1) = b_0 + b_1 (X + 1)$
Subtract the top equation from the bottom one:
$b_1 = Y(X + 1) - Y(X)$
That is, if X changes by 1 unit, Y changes by $b_1$.
How do you interpret the intercept?
$Y = b_0 + b_1 X$
The intercept $b_0$ is simply the value of Y when X = 0.
In summary:
The slope of the line is the amount by which Y changes when X increases by 1 unit.
The intercept is the value of Y when X = 0.
Let's Now Fit a Line to Our Data
The equation of our line is given by
$\hat{Y} = b_0 + b_1 X$
Note: we use the symbol $\hat{Y}$ (Y-hat) to stand for the fitted line; Y will always stand for the observed values.
We have data:
$(X_i, Y_i), \quad i = 1, 2, \ldots, n$
We have the components of a line: $b_0$ and $b_1$.
For the ith observation the fitted value is defined to be:
$\hat{Y}_i = b_0 + b_1 X_i$
The fitted value:
[Figure: the observed point $(X_i, Y_i)$ and, on the line at the same $X_i$, the fitted value $\hat{Y}_i = b_0 + b_1 X_i$]
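For the ith observation the residual is defined to be:
$e_i = Y_i - \hat{Y}_i$
[Figure: points scattered around the fitted line, Y on the vertical axis and X on the horizontal axis; at $X_i$ the residual is the vertical gap between $Y_i$ and $\hat{Y}_i$, drawn as a dashed line]
The lengths of the dashed lines are the residuals: below the line the residual is negative and above the line the residual is positive.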
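As a small illustration of the two definitions, the sketch below computes fitted values and residuals for a hypothetical line and a few made-up points. The function name `fitted_and_residuals` and all numbers are assumptions chosen for the example, not values from the lecture.

```python
def fitted_and_residuals(b0, b1, xs, ys):
    """Return fitted values b0 + b1 * x and residuals y - fitted for each point."""
    fitted = [b0 + b1 * x for x in xs]
    residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]
    return fitted, residuals

# Hypothetical line and data, invented for illustration.
b0, b1 = 2.0, 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 2.5, 4.0, 4.0]

fitted, residuals = fitted_and_residuals(b0, b1, xs, ys)

for x, y, y_hat, e in zip(xs, ys, fitted, residuals):
    if e > 0:
        where = "above the line"
    elif e < 0:
        where = "below the line"
    else:
        where = "on the line"
    print(f"X={x}  Y={y}  fitted={y_hat}  residual={e}  ({where})")
```

A positive residual means the observed point sits above the line, a negative one means it sits below, and a zero residual means the line passes through the point.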
How do we fit a line to the data?
Ideally, we want the residual to be zero for all the points. If the residual is zero, it means that the fitted line passes through the observed point.
But it is clear from the scatter diagram that a straight line cannot pass through all the points, so this will be impossible to do.
We can't get the residuals to be 0, but we want them "small".
So, a good choice of line is one for which the residuals are small.
We need to choose a "criterion" which measures how small all the residuals are. The smaller the criterion the better.
We then choose the line that makes the criterion smallest.
What criterion should we use?
The most popular criterion for fitting a line is called the least squares method. This method says to find $b_0$ and $b_1$ (these two values define a line) that make this sum as small as possible:
$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
The farther away a point is from the estimated line, the more serious the error. By squaring the errors, we "penalize" large residuals so that we can avoid them.
I'm MAD Again
As a quick aside, (least squares) regression fits a line by solving the minimization problem
$$\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
Why not solve the following problem instead? This is called LAD (least absolute deviation) regression.
$$\min_{b_0, b_1} \sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|$$
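To see how the choice of criterion changes which line looks "best", the sketch below scores two candidate lines under both criteria. The data are hypothetical: four points lying exactly on Y = 1 + 2X plus one outlier. The function names are illustrative only.

```python
def sum_squared_residuals(b0, b1, xs, ys):
    """Least squares criterion: sum of squared residuals."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

def sum_absolute_residuals(b0, b1, xs, ys):
    """LAD criterion: sum of absolute residuals."""
    return sum(abs(y - b0 - b1 * x) for x, y in zip(xs, ys))

# Hypothetical data: the last point is an outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 19.0]

line_a = (1.0, 2.0)    # passes exactly through the first four points
line_b = (-1.0, 4.0)   # tilted toward the outlier (the least squares line for this data)

sse_a = sum_squared_residuals(*line_a, xs, ys)
sse_b = sum_squared_residuals(*line_b, xs, ys)
sad_a = sum_absolute_residuals(*line_a, xs, ys)
sad_b = sum_absolute_residuals(*line_b, xs, ys)

print("Squared criterion:  line A =", sse_a, " line B =", sse_b)
print("Absolute criterion: line A =", sad_a, " line B =", sad_b)
```

Under the squared criterion line B wins, because squaring makes the single large residual of line A very costly. Under the absolute criterion line A wins. This is the sense in which squaring "penalizes" large residuals.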
The values of $b_0$ and $b_1$ which minimize the residual sum of squares are:
$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{or} \quad b_1 = r \frac{s_y}{s_x}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
These formulas can be derived using calculus; we pass.
These formulas are the intercept and slope for the "best fitting line".
Performing Regression in Stata
[Figure: Stata regression printout with the estimates $b_0$ and $b_1$ labeled and the other parts marked "Ignore"]
We will eventually explain this complete printout; ignore most of it for now.
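The formulas can be checked by hand on a tiny data set. The sketch below is plain Python rather than Stata, uses made-up numbers, and computes the slope both ways: from the sums of deviations and from r times s_y / s_x. The function names are assumptions for this example.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (b0, b1) from the sums-of-deviations formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_from_correlation(xs, ys):
    """Return the slope computed as r * s_y / s_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / sqrt(s_xx * s_yy)
    s_x = sqrt(s_xx / (n - 1))  # sample standard deviation of X
    s_y = sqrt(s_yy / (n - 1))  # sample standard deviation of Y
    return r * s_y / s_x

# Hypothetical data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(xs, ys)
b1_alt = slope_from_correlation(xs, ys)
print("b0 =", b0, " b1 =", b1, " b1 via r*sy/sx =", b1_alt)
```

For this data the mean of X is 3 and the mean of Y is 4, the numerator is 6 and the denominator is 10, so the slope is 0.6 and the intercept is 4 - 0.6(3) = 2.2. Both slope formulas agree.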
Regression Plot
Price = 17066.8 - 0.0623155 Odometer
S = 303.138    R-Sq = 65.0%    R-Sq(adj) = 64.7%
[Figure: regression plot of Price against Odometer with the fitted line]
Interpretation of the slope: for each additional mile on the odometer, the price decreases by an average of $0.062.
Do not interpret the intercept as "cars that have not been driven cost $17066.8": an odometer reading of 0 lies far outside the range of the readings in the plot.
R-sq: 65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
Fitted Line Plot in Stata
[Figure: Stata plot of Price and fitted values against Odometer]
Editorial comment: had to google these commands; Stata is a bit goofy to use.
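The slides report R-sq without giving a formula. A common definition, assumed here, is one minus the residual sum of squares divided by the total sum of squares of Y around its mean. The sketch below applies that definition to made-up data; the function name `r_squared` is illustrative.

```python
def r_squared(xs, ys):
    """R-squared of the least squares line: 1 - SSE / SST (assumed definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares: variation left unexplained by the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: all variation of Y around its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - sse / sst

# Hypothetical data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r2 = r_squared(xs, ys)
print(f"R-squared = {r2:.3f}: {100 * r2:.1f}% explained, {100 * (1 - r2):.1f}% unexplained")
```

Here the total variation is 6 and the unexplained part is 2.4, so R-squared is 0.6: the line explains 60% of the variation in this toy data, read the same way as the 65% on the slide.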
Is the effect of mileage on price important?
Prediction
The regression line says that
price = 17066.8 − 0.0623(odometer)
So the predicted price for an Accord with 40000 miles is
price = 17066.8 − 0.0623(40000) = $14574.8
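The prediction is just the fitted line evaluated at a chosen mileage. The sketch below wraps it in a hypothetical helper, `predict_price`, and compares the slope rounded to 0.0623 with the unrounded 0.0623155 from the regression plot. That rounding is one plausible source of the small mismatch mentioned on the Stata prediction slide, though the slides do not say so explicitly.

```python
def predict_price(odometer, b0=17066.8, b1=-0.0623155):
    """Predicted price from the fitted line Price = b0 + b1 * Odometer."""
    return b0 + b1 * odometer

# Slope as reported on the regression plot.
full_precision = predict_price(40000)

# Slope rounded to four decimals, as in the hand calculation.
rounded_slope = predict_price(40000, b1=-0.0623)

print(f"Unrounded slope: {full_precision:.2f}")
print(f"Rounded slope:   {rounded_slope:.2f}")
print(f"Difference:      {rounded_slope - full_precision:.2f}")
```

With 40000 miles, rounding the slope shifts the prediction by about 62 cents, because the small rounding error is multiplied by a large X.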
Prediction in Stata
Run the regression command in Stata.
Then run the following command:
[Figure: Stata command and its output]
There is a little round-off error on the previous slide; that's why the predictions don't match exactly.
Another Example
Some critics of television complain that the amount of violence shown on television contributes to violence in our society. Others point out that television also contributes to the high level of obesity among children. We may have to add financial problems to the list.
A sociologist theorized that people who watch television frequently are exposed to many commercials, which in turn leads them to buy more, finally resulting in increasing debt.
To test this belief, a sample of 430 families was drawn. For each family, the total debt and the number of hours the television is turned on per week were recorded.
The Scatter Plot
[Figure: scatter plot of Debt (vertical axis, ticks from -100000 to 300000) against weekly television hours (horizontal axis, ticks from 0 to 60; axis label truncated to "Televisi")]
The Regression Results
[Figure: regression output]
What is the interpretation?
A friend of your family tells you that his family watches 40 hours of television per week. What would you predict your friend's family's debt to be?
We would predict the debt to be:
48040 + 2581.8(40) = 151312
The Affleck Hypothesis
From Boston Magazine: the further from Boston an Affleck movie is, the worse the movie.
The Analysis
[Figure: scatter plot of rtscore (vertical axis, ticks from 20 to 100) against miles (horizontal axis, ticks from 0 to 5000)]
Negative correlation between Rotten Tomatoes movie score and mileage from Boston.
The Regression
[Figure: regression output]
Interpret
Predict
Compare
Regression Example: Market Model
In finance, a popular model is to regress stock returns against returns of some market index, such as the S&P 500.
$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$
The slope of the regression line, referred to as "beta", is a measure of how sensitive a stock is to movements in the market.
Market Model
$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$
Beta = 0: cash under the mattress
Beta = 1: same risk as the market
Beta < 1: safer than the market
Beta > 1: riskier than the market
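The market model is an ordinary simple regression with the index return as X and the stock return as Y. The sketch below fits it to made-up returns for a hypothetical stock constructed to have alpha 0.1 and beta 1.5, then labels the beta using the rules of thumb from the slide. The names `fit_market_model` and `describe_beta` and all return figures are assumptions for illustration.

```python
def fit_market_model(index_returns, stock_returns):
    """Least squares fit of stock = alpha + beta * index; returns (alpha, beta)."""
    n = len(index_returns)
    i_bar = sum(index_returns) / n
    s_bar = sum(stock_returns) / n
    s_is = sum((i - i_bar) * (s - s_bar)
               for i, s in zip(index_returns, stock_returns))
    s_ii = sum((i - i_bar) ** 2 for i in index_returns)
    beta = s_is / s_ii
    alpha = s_bar - beta * i_bar
    return alpha, beta

def describe_beta(beta):
    """Label beta using the rules of thumb listed on the slide."""
    if beta > 1:
        return "riskier than the market"
    if beta < 1:
        return "safer than the market"
    return "same risk as the market"

# Hypothetical index returns in percent, invented for illustration.
index_returns = [1.0, -0.5, 0.3, -1.2, 0.8, 0.1]
# Hypothetical stock built to follow the index exactly with alpha 0.1 and beta 1.5.
stock_returns = [0.1 + 1.5 * r for r in index_returns]

alpha, beta = fit_market_model(index_returns, stock_returns)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}: {describe_beta(beta)}")
```

Real return data would not fall exactly on a line, so the estimates would only approximate the underlying alpha and beta.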
ETFs are Very Popular
Leveraged ETFs are the RAGE
FAZ = -3 × Financial Index
FAS = +3 × Financial Index
SDS = -2 × S&P500
SSO = +2 × S&P500
DDM = +2 × DJ30
DXD = -2 × DJ30
Hey!
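Market Model for SDS and SSO
SPY is the index, SSO is +2 and SDS is -2 (theoretically).
Market Model for SDS
[Figure: regression output with the slope labeled "Beta"]
Correlation is NOT Beta
Market Model for SSO
[Figure: regression output with the slope labeled "Beta"]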
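The point that correlation is not beta can be illustrated with idealized funds. The sketch below assumes hypothetical funds that multiply each period's index return by exactly +2, -2, or +1; real leveraged ETFs only approximate this, which is why the slide says "theoretically". All returns and names are invented for the example.

```python
from math import sqrt

def correlation_and_beta(index_returns, fund_returns):
    """Return (r, beta) for fund returns regressed on index returns."""
    n = len(index_returns)
    i_bar = sum(index_returns) / n
    f_bar = sum(fund_returns) / n
    s_if = sum((i - i_bar) * (f - f_bar)
               for i, f in zip(index_returns, fund_returns))
    s_ii = sum((i - i_bar) ** 2 for i in index_returns)
    s_ff = sum((f - f_bar) ** 2 for f in fund_returns)
    r = s_if / sqrt(s_ii * s_ff)
    beta = s_if / s_ii
    return r, beta

# Hypothetical index returns in percent, invented for illustration.
index_returns = [1.0, -0.5, 0.3, -1.2, 0.8, 0.1]

# Idealized funds (assumption: exact multiples of the index return).
plus_two = [2.0 * r for r in index_returns]    # behaves like a +2 fund
minus_two = [-2.0 * r for r in index_returns]  # behaves like a -2 fund
tracker = list(index_returns)                  # plain index tracker

r_plus, beta_plus = correlation_and_beta(index_returns, plus_two)
r_minus, beta_minus = correlation_and_beta(index_returns, minus_two)
r_track, beta_track = correlation_and_beta(index_returns, tracker)

print(f"+2 fund:  r = {r_plus:.2f}, beta = {beta_plus:.2f}")
print(f"-2 fund:  r = {r_minus:.2f}, beta = {beta_minus:.2f}")
print(f"tracker:  r = {r_track:.2f}, beta = {beta_track:.2f}")
```

The +2 fund and the tracker both have correlation 1 with the index, yet their betas are 2 and 1. Correlation measures how tightly the returns line up; beta measures how large the move is.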
The Search for Alpha
In the market model, what is the stock (mutual fund) return if the index does nothing?
$\text{Stockreturn}_t = \alpha + \beta \, \text{Indexreturn}_t$
Setting the index return to 0 leaves a return of $\alpha$, the intercept.
People talk about "buying someone's alpha", i.e., what the fund manager brings to the table above the index returns.
Things you should know
The least squares estimates
Interpretation and Prediction
Sections Covered from the Book
Chapter 7 (just 7.1-7.2 for now)