Christina Tran
California State University, Fullerton
Math 437, Spring 2015
3.3-3.5 Summary

3.3.1 Qualitative Predictors

Not all predictors are quantitative; some are qualitative, such as gender.

Predictors with Only Two Levels

Qualitative predictors are also known as factors. If a factor has two levels (possible values), then incorporating it into a regression is simple. We create a dummy variable that takes on two possible numerical values, 0 and 1. For example, we can define

    xi = 1 if the ith person is female
    xi = 0 if the ith person is male

and use this variable as a predictor in the regression model:

    yi = β0 + β1 xi + εi = β0 + β1 + εi   if the ith person is female
                           β0 + εi        if the ith person is male

So β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males.

Alternatively, we can create a dummy variable with xi = 1 if the ith person is female and xi = -1 if the ith person is male. Then

    yi = β0 + β1 xi + εi = β0 + β1 + εi   if the ith person is female
                           β0 − β1 + εi   if the ith person is male

Here β0 is the overall average balance (halfway between the male and female averages), and β1 is the amount by which females are above that overall average and males are below it, i.e. half of the female-male difference. Both codings give the same fitted values; the only difference is in the interpretation of the coefficients.

Qualitative Predictors with More than Two Levels

For a qualitative predictor with more than two levels we create additional dummy variables. For example, consider ethnicity with levels Asian, Caucasian, and African American. With two dummy variables (xi1 = 1 if the ith person is Asian, xi2 = 1 if the ith person is Caucasian), the model is

    yi = β0 + β1 xi1 + β2 xi2 + εi = β0 + β1 + εi   if the ith person is Asian
                                     β0 + β2 + εi   if the ith person is Caucasian
                                     β0 + εi        if the ith person is African American

Here β0 is the average credit card balance for African Americans, β1 is the difference between Asians and African Americans, and β2 is the difference between Caucasians and African Americans. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (African American in this example) is called the baseline. If the p-values associated with the ethnicity dummy variables are high, there is no statistical evidence of a real difference in credit card balance between the ethnicities.

3.3.2 Extensions of the Linear Model

The models discussed so far assume an additive and linear relationship between the predictors and the response, which is a strong assumption we often do not wish to make. Additivity means that the effect of a change in a predictor Xj on the response Y is independent of the values of the other predictors.

Removing the Additive Assumption

When allocating amounts across several predictors produces a larger increase in the response than allocating everything to a single predictor, this is called a synergy effect; in statistics it is called an interaction effect. Consider the model Y = β0 + β1 X1 + β2 X2 + ε. According to this model, if we increase X1 by one unit, then Y increases by an average of β1 units, regardless of the value of X2. One way of extending this model to allow for interaction effects is to include a third predictor, called an interaction term, constructed by computing the product of X1 and X2. Then we have

    Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε = β0 + (β1 + β3 X2) X1 + β2 X2 + ε

Since β1 + β3 X2 changes with X2, the effect of X1 on Y is no longer constant: adjusting X2 changes the impact of X1 on Y.
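To make this concrete, here is a minimal Python sketch (my own illustration, not from the text): it simulates data with a true interaction effect and recovers the coefficients by ordinary least squares. The variable names and coefficient values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors and a response with a true interaction effect
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x1 + 1.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, n)

# Design matrix with an intercept, both main effects, and the product term X1*X2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Ordinary least squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimates for (b0, b1, b2, b3):", beta_hat.round(2))

# The effect of a one-unit increase in x1 now depends on x2:
# it is b1 + b3 * x2 rather than a constant b1.
```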
The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between X1 and X2 seems important, we should include both X1 and X2 in the model even if their coefficient estimates have large p-values.

Non-linear Relationships

Polynomial regression accommodates non-linear relationships. A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors. For example, to allow for a quadratic shape we could use the model Y = β0 + β1 X1 + β2 X1² + ε.

3.3.3 Potential Problems

Problems include:

1. Non-linearity of the response-predictor relationships
2. Correlation of error terms
3. Non-constant variance of error terms
4. Outliers
5. High-leverage points
6. Collinearity

1. Non-linearity of the Data

Residual plots are useful for identifying non-linearity. We can plot the residuals, ei = yi − ŷi, versus the predictor xi. In multiple regression, we instead plot the residuals versus the predicted (or fitted) values ŷi.

2. Correlation of Error Terms

In linear regression we assume that the error terms are uncorrelated, meaning that the fact that εi is positive provides little or no information about the sign of εi+1. The standard errors that are computed are based on the assumption of uncorrelated error terms. If the errors are in fact correlated, the estimated standard errors will underestimate the true standard errors; this results in confidence and prediction intervals that are narrower than they should be, and p-values associated with the model will be lower than they should be. Correlations among the error terms frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time; observations obtained at adjacent time points will often have positively correlated errors. To determine whether this is the case, we can plot the residuals from the model as a function of time. If the errors are uncorrelated, there should be no discernible pattern. If they are positively correlated, we may see tracking in the residuals; that is, adjacent residuals may have similar values.

3. Non-constant Variance of Error Terms

The variances of the error terms may be non-constant; for instance, they may increase with the value of the response. We can identify non-constant variance in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot. A possible solution is to transform the response Y using a concave function such as log Y or √Y.

4. Outliers

An outlier is a point for which yi is far from the value predicted by the model. Outliers can distort the residual standard error and the R² statistic. Residual plots can be used to identify outliers, but it can be difficult to decide how large a residual needs to be before we consider the point an outlier. To address this, instead of plotting the raw residuals we can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
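The following numpy sketch (my own illustration; the simulated data and planted outlier are invented for the example) fits a simple regression, computes studentized residuals, and flags observations exceeding 3 in absolute value. The leverage values it computes along the way reappear in the next item.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
y[0] += 15  # plant an outlier

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Leverage values are the diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Residual standard error with p = 1 predictor
p = 1
rse = np.sqrt(np.sum(resid**2) / (n - p - 1))

# Studentized residuals; |value| > 3 suggests a possible outlier
studentized = resid / (rse * np.sqrt(1 - leverage))
print("possible outliers at indices:", np.where(np.abs(studentized) > 3)[0])
```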
5. High-Leverage Points

Observations with high leverage have an unusual value of xi; for example, an observation may have high leverage because its predictor value is large relative to the other observations. High-leverage points have a large impact on the estimated regression line, so it is important to identify them. In simple linear regression, we can simply look for observations whose predictor value is outside the normal range of the observations. In multiple regression it can be more difficult: with two predictors X1 and X2, an observation may fall within the usual range of X1 and of X2 individually yet still be unusual in terms of the pair of values, so it is hard to identify as a high-leverage point by looking at each predictor separately. Instead we compute the leverage statistic. For simple linear regression,

    hi = 1/n + (xi − x̄)² / Σ_{i′=1}^{n} (xi′ − x̄)²

so hi increases with the distance of xi from x̄. The leverage statistic hi is always between 1/n and 1, and the average leverage over all the observations is always equal to (p + 1)/n. So if the leverage statistic for an observation greatly exceeds (p + 1)/n, we may suspect that the point has high leverage.

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In other words, because two collinear predictors tend to increase or decrease together, it is difficult to determine how each one separately is associated with the response.

Contour plots of the RSS as a function of two coefficient estimates help show the difficulties that result from collinearity. Each ellipse represents a set of coefficient values that correspond to the same RSS, with ellipses nearest the center taking on the lowest values of RSS; the black dot marks the least squares estimates. When the predictors are not collinear, the contours are roughly circular and evenly spaced, so the minimizing pair of coefficients is well determined. When the predictors are collinear, the contours form a long, narrow ellipse, and a broad range of coefficient values yields nearly the same RSS. A small change in the data could then cause the pair of coefficient values minimizing the RSS to move anywhere along this ellipse, resulting in a great deal of uncertainty in the coefficient estimates.

Collinearity therefore reduces the accuracy of the estimates of the regression coefficients: it causes the standard error for β̂j to grow. Recall that the t-statistic for each predictor is calculated by dividing β̂j by its standard error. Consequently, collinearity results in a decline in the t-statistic, so in the presence of collinearity we may fail to reject the null hypothesis H0: βj = 0, and the power of the hypothesis test is reduced. When this happens, be sure to note the collinearity issues in the model.

A simple way to detect collinearity is to look at the correlation matrix of the predictors: the larger an element is in absolute value, the more highly correlated the corresponding pair of variables. Unfortunately, not all collinearity problems can be detected this way, since it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation; we call this multicollinearity. Instead of inspecting the correlation matrix, a better approach is to compute the variance inflation factor (VIF), which is the ratio of the variance of β̂j when fitting the full model to the variance of β̂j if fit on its own. The smallest possible value of the VIF is 1, which indicates the complete absence of collinearity; a VIF above 5 or 10 indicates a problematic amount of collinearity. The VIF can be computed as

    VIF(β̂j) = 1 / (1 − R²_{Xj|X−j})

where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other predictors. If R²_{Xj|X−j} is close to one, collinearity is present and the VIF will be large. When faced with collinearity, we can drop one of the problematic variables from the regression fit; this can usually be done without compromising the fit much, since collinearity implies that the information the variable provides about the response is redundant in the presence of the other variables. Another solution is to combine the collinear variables into a single predictor.
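A small numpy sketch of both diagnostics (my own illustration; the simulated predictors and the vif helper are invented for this example): the correlation matrix of the predictors, and the VIF of each predictor computed from the R² of a regression of that predictor on the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Two nearly collinear predictors plus one unrelated predictor
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is almost a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix: large off-diagonal entries flag pairwise collinearity
print(np.corrcoef(X, rowvar=False).round(2))

def vif(X, j):
    """VIF for predictor j: 1 / (1 - R^2 of X_j regressed on the other predictors)."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

# x1 and x2 will have very large VIFs; x3 will be near 1
print([round(vif(X, j), 1) for j in range(X.shape[1])])
```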
3.4 The Marketing Plan

1. Is there a relationship between advertising budget and sales? Fit a multiple regression model of sales onto TV, radio, and newspaper and test H0: βTV = βradio = βnewspaper = 0. The F-statistic can be used to determine whether or not to reject this null hypothesis.

2. How strong is the relationship? The RSE estimates the standard deviation of the response from the population regression line. For the Advertising data, the RSE is 1,681 units while the mean value of the response is 14,022, indicating a percentage error of roughly 12%. The R² statistic records the percentage of variability in the response that is explained by the predictors; here the predictors explain almost 90% of the variance in sales.

3. Which media contribute to sales? We look at the p-values associated with each predictor's t-statistic.

4. How large is the effect of each medium on sales? The standard error of β̂j can be used to construct confidence intervals for βj. Confidence intervals that are narrow and far from zero provide evidence that the corresponding media are related to the response; an interval that includes zero indicates that the variable is not statistically significant given the other predictors.

5. How accurately can we predict future sales? The accuracy of the estimate depends on whether we wish to predict an individual response, Y = f(X) + ε, or the average response, f(X). For the former we use a prediction interval; for the latter, a confidence interval. Prediction intervals are always wider than confidence intervals because they account for the uncertainty associated with ε, the irreducible error (see the sketch after this list).

6. Is the relationship linear? Residual plots can be used to identify non-linearity. If the relationships are linear, the residual plots should display no pattern. Remember that transformations of the predictors can be included in the linear regression model to accommodate non-linear relationships.

7. Is there synergy among the advertising media? The standard linear regression model assumes an additive, linear relationship between the predictors and the response. An additive model is easy to interpret because the effect of each predictor on the response is unrelated to the values of the other predictors, but the additive assumption may be unrealistic. Including an interaction term in the regression model accommodates non-additive relationships.
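As a worked illustration of item 5 (my own sketch, not from the text; the simulated data are invented), the following code computes a 95% confidence interval for the average response and a 95% prediction interval for an individual response at the same point. The prediction interval is wider because of the extra "1 +" term accounting for the irreducible error ε.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 150

x = rng.uniform(0, 10, n)
y = 3.0 + 1.2 * x + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
rse = np.sqrt(resid @ resid / (n - 2))       # p = 1 predictor

x0 = np.array([1.0, 5.0])                    # predict at x = 5
y0_hat = x0 @ beta_hat
t_crit = stats.t.ppf(0.975, df=n - 2)

se_mean = rse * np.sqrt(x0 @ XtX_inv @ x0)        # uncertainty in the average response f(x0)
se_pred = rse * np.sqrt(1 + x0 @ XtX_inv @ x0)    # adds the irreducible error for one response

print("confidence interval:", (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean))
print("prediction interval:", (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred))
```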
3.5 Comparison of Linear Regression with K-Nearest Neighbors

The KNN regression method is closely related to the KNN classifier. Given a value for K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0, and then estimates f(x0) by the average of their responses:

    f̂(x0) = (1/K) Σ_{xi ∈ N0} yi

The optimal value of K depends on the bias-variance tradeoff. A small K gives a flexible fit, which will have low bias but high variance; the variance is due to the fact that the prediction in a given region may depend entirely on just one observation. Larger values of K provide a smoother and less variable fit: the prediction in a region is an average of several points, so changing one observation has a smaller effect. However, the smoothing may introduce bias by masking some of the structure in f(X). Note that the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.
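To make the estimator concrete, here is a minimal one-dimensional numpy sketch of KNN regression (my own illustration; the sine-shaped f and the knn_predict helper are invented for the example). Small K tracks the training data closely (low bias, high variance), while larger K gives a smoother, less variable fit.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

x_train = rng.uniform(0, 10, n)
y_train = np.sin(x_train) + rng.normal(0, 0.3, n)

def knn_predict(x0, x_train, y_train, k):
    """KNN regression: average the responses of the k training points closest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

# Compare fits for several values of K at a few query points
for k in (1, 10, 50):
    preds = np.array([knn_predict(x0, x_train, y_train, k) for x0 in np.linspace(0, 10, 5)])
    print(f"K={k:>2}: {preds.round(2)}")
```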