ENCN305 Regression Modelling
Lecturer: Dr Raazesh Sainudiin (Raaz)
Notes by Dr Carl Scarrott
University of Canterbury
May 3 2013

Example: Evaporation at Different Air Velocities

Measurements from an experiment considering the evaporation coefficient of a liquid at fixed air velocities:

Velocity (cm/s):       20    60   100   140   180   220   260   300   340   380
Evaporation (mm²/s): 0.18  0.37  0.35  0.78  0.56  0.75  1.18  1.36  1.17  1.65

What is Regression?

Regression analysis uses various models and techniques for describing the relationships between two or more variables, which can potentially be used for prediction purposes. We will consider two basic regression models:
◮ Simple linear regression has a linear relationship between a single explanatory variable and a single response variable.
◮ Multiple linear regression extends this basic model to allow for multiple explanatory variables.

Alternative terminologies:
◮ explanatory variable - predictor, input or independent variable
◮ response variable - predictand, output or dependent variable.

Data for Regression Modelling

Observed "data vectors" of these variables are used to derive the form and quantify the strength of the linear relationship. The fitted model can then be used to predict the response given a value of the explanatory variable. For example, if you want to predict your exam mark in ENCN305 using your test and assignment marks, then the:
◮ explanatory variables - test marks, assignment 1-3 marks (possibly EMTH210 or MATH103 final grades?)
◮ response variable - ENCN305 exam mark.

We'd expect a strong positive association between the exam and test marks, though associations with assignments are often weaker. So you'd expect to be able to reliably predict your exam marks using the test marks, but often the assignment marks do not "significantly improve the model fit".

What Will We Cover In This Course?

On this course you will learn how to:
◮ specify appropriate simple/multiple linear regression models;
◮ fit the model to data;
◮ evaluate the model performance;
◮ choose between alternative models;
◮ predict using the model; and
◮ quantify the model uncertainties and evaluate its statistical usefulness.

You will also see some of the common pitfalls of regression models.

Simple Linear Regression

A simple linear regression model with explanatory variable X and response Y can be written:

    Y = β0 + β1 X + ε

where ε is the error term. The properties of the errors are discussed further below. In this course, the response and explanatory variables are assumed to be continuous, i.e. they can take any value over a range, although extensions to discrete explanatory variables or alternative response distributions are available (e.g. discrete distributions, or restriction to the range [0,1] for predicting probabilities).¹

¹ Google "generalised linear model" if you want to know more.
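To make the model concrete, here is a minimal MATLAB sketch (an addition to these notes, not part of the original; the parameter values β0 = 0.1, β1 = 0.004 and σ = 0.15 are made up, loosely chosen to mimic the evaporation example) that simulates data from a simple linear regression model:

>> beta0 = 0.1; beta1 = 0.004; sigma = 0.15;    % assumed "true" parameters (illustrative only)
>> x = (20:40:380)';                            % fixed design, as in the evaporation experiment
>> y = beta0 + beta1*x + sigma*randn(size(x));  % response = systematic line + random error
>> scatter(x,y); xlabel('Velocity (cm/s)'); ylabel('Simulated evaporation (mm^2/s)');

Re-running the last two lines shows how much scatter the error term alone produces around the same underlying line.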
Fixed Versus Conditional Modelling

We need to differentiate two situations for data collection of the explanatory variable(s) (let's draw it →):
◮ fixed X values, typically resulting from experiments where the values can be controlled in order to explore possible causal relationships; and
◮ random X values, in which case the modelling is conditional on the observed values and care must be taken, as formally the results cannot be used to explore causal relationships.

Note: in the latter case it is further assumed that these measurements are not subject to error (e.g. measurement error), otherwise a more sophisticated "errors-in-variables" model is required.

A Bit of Terminology

The primary parameters of the model Y = β0 + β1 X + ε are the intercept β0 and the gradient β1 (often called the regression coefficients). They are usually estimated using a sample dataset consisting of pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). Notice that the sample size n is the number of observation pairs.

Interpretation of the coefficients:
◮ β1 is the effect of a unit increase in the explanatory variable X on the response Y
◮ β0 is the value of the response Y when the explanatory variable X = 0

What Is Meant By "Linear" Regression Model?

"Linear" regression models are linear in the unknown parameters, but there may be a non-linear relationship between X and Y. This is a common misconception. For example, the following quadratic form is a linear model:

    Y = β0 + β1 X² + ε,

and this is a non-linear model:

    Y = β0 X1^β1 X2^β2 ε.

In this example, it is possible to turn the non-linear model into a linear model by taking the logarithm of both sides of the equation, giving log Y = log β0 + β1 log X1 + β2 log X2 + log ε, which is linear in the unknown parameters. However, general techniques for fitting non-linear models are beyond the scope of this course.

Descriptive Terms

The first step in regression is to determine whether or not there is an observed association between the variables of interest. The term "association" is used to avoid the complications of causal interpretations of the term "relationship". A scatterplot can be a useful tool to examine:
◮ strength - weak, moderate or strong association;
◮ direction - positive or negative;
◮ form - e.g. linear or quadratic.

Example: Evaporation Scatterplot

Let's return to the measurements from the experiment considering the evaporation coefficient of a liquid at fixed air velocities:

Velocity (cm/s):       20    60   100   140   180   220   260   300   340   380
Evaporation (mm²/s): 0.18  0.37  0.35  0.78  0.56  0.75  1.18  1.36  1.17  1.65

Use the MATLAB menu option File > Import Data... to import the data (a pretty intuitive interface). Then plot the data:
◮ coded in ../Datasets/Evaporation.m
◮ It looks as though there is a strong positive linear association between air velocity and evaporation.

But how strong is it? Could it be spurious (just due to natural sample variability with no underlying correlation)?

Correlation Coefficient

The strength of a linear association can be measured using the Pearson product moment "correlation coefficient":

    ρXY = Cov(X, Y) / √(Var(X) Var(Y))

where:

    Var(X) = E{(X − µX)²}
    Var(Y) = E{(Y − µY)²},

and the covariance measures how X and Y vary with each other:

    Cov(X, Y) = E{(X − µX)(Y − µY)}.

◮ There are many different measures of correlation; this is the most commonly used.

Problem With Covariance Interpretation

The covariance is a measure of how X and Y linearly vary with each other, where:
◮ if Y tends to increase as X increases then the covariance will be positive;
◮ if Y tends to decrease as X increases then the covariance will be negative; and
◮ the unit of measurement is (units of X) times (units of Y).

The lack of invariance to the scale of measurement makes the covariance hard to interpret.
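As a quick numerical illustration (a sketch added to these notes, assuming the Velocity and Evaporation vectors have been imported as above), rescaling X changes the covariance but leaves the correlation untouched:

>> C = cov(Velocity,Evaporation);        % 2x2 sample covariance matrix
>> C(1,2)                                % covariance, in (cm/s) x (mm^2/s)
>> C2 = cov(Velocity/100,Evaporation);   % rescale velocity to m/s
>> C2(1,2)                               % covariance shrinks by a factor of 100
>> corr(Velocity,Evaporation)            % correlation is unchanged by the rescaling
>> corr(Velocity/100,Evaporation)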
Calculating Correlation For Sample Data

The sample correlation coefficient is estimated using:

    ρ̂XY = σ̂XY / (σ̂X σ̂Y)    (1)

where the sample covariance and variance estimators are:

    σ̂XY = Σ(xi − x̄)(yi − ȳ) / (n−1)   or   (Σ xi yi − n x̄ ȳ) / (n−1)
    σ̂²X = Σ(xi − x̄)² / (n−1)          or   (Σ xi² − n x̄²) / (n−1)
    σ̂²Y = Σ(yi − ȳ)² / (n−1)          or   (Σ yi² − n ȳ²) / (n−1)

and the sample means of the random variables are x̄ and ȳ respectively. The correlation is on a standardised scale, so its interpretation does not depend on the measurement scales of X and Y. The division by the standard deviations (square roots of the variances) ensures the correlation is on the range −1 to 1.

An Aside: Alternative Formulae for Sample Correlation

The denominators (n − 1) on the numerator and denominator of equation (1) cancel, so the correlation is often simplified to:

    ρ̂XY = SSXY / √(SSXX SSYY)

where the sums of squares (SS) of the mean-corrected data are:

    SSXX = Σ(xi − x̄)²
    SSYY = Σ(yi − ȳ)²

and the sum of the mean-corrected cross products is:

    SSXY = Σ(xi − x̄)(yi − ȳ).

Correlation Terminology

The correlation coefficient satisfies −1 ≤ ρXY ≤ 1, where:
◮ 0 indicates no association;
◮ +1 and −1 represent perfect positive and negative correlation.

Commonly used descriptors are:
◮ |ρXY| > 0.7 strong correlation;
◮ 0.3 < |ρXY| < 0.7 moderate correlation;
◮ |ρXY| < 0.3 weak correlation.

Samples from Bivariate Normal (Joint & Marginal) Densities

[Figure: simulated samples from a bivariate normal (µ1 = µ2 = 0 and σ1 = σ2 = 1) of size 1000, with correlation = 0, 0.3, 0.7, −0.7.] See ./mfiles/Plot2DGaussianSamplesAndMarginals.m, shall we?

Simulated Data Using mvnrnd

[Figure: simulated samples from a bivariate normal (µ1 = µ2 = 0 and σ1 = σ2 = 1) of size 100 (using mvnrnd) with different correlations.]

Calculating Correlation in MATLAB

The correlation coefficient is calculated in MATLAB using:

>> corr(Velocity,Evaporation)   % if separate vectors
>> corr([Velocity Evaporation]) % if matrix with column per variable

The sample correlation coefficient for the Evaporation data is 0.95. Bootstrap simulations are obtained from:

>> nsim=10000;
>> corrstats=bootstrp(nsim,@corr,Velocity,Evaporation)

A bootstrap-based 95% confidence interval can be obtained from:

>> prctile(corrstats,[2.5 97.5]) % using bootstrap correlations

or more directly using the bootci function:

>> corrxy = @(x,y) corr(x,y)
>> bootci(nsim,{corrxy,Velocity,Evaporation},'alpha',0.05,'type','per')

Note: the 'type' option specifies the basic percentile method and 'alpha' is the significance level. Let's DIY: see ../../Datasets/Evaporation PoorMansBoot.m.

Bootstrapped Correlations

The bootstrap simulations below and the estimated 95% confidence interval of (0.88, 0.99) confirm the true correlation between evaporation and air velocity is statistically significantly different from zero.

Note: the bootstrap sampling distribution is highly skewed (it would be symmetric for normal populations with zero correlation). Traditional confidence interval estimation approaches typically use transformations to adjust for this (e.g. see Fisher's Z transform on Wikipedia).
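For comparison with the bootstrap interval, here is a sketch (an addition, not from the original notes) of the Fisher Z-transform interval mentioned above: transform the correlation to an approximately normal scale, build the interval there, and back-transform:

>> r = corr(Velocity,Evaporation);   % sample correlation (0.95 here)
>> n = length(Velocity);             % number of observation pairs
>> z = atanh(r);                     % Fisher's Z: z = 0.5*log((1+r)/(1-r))
>> se = 1/sqrt(n-3);                 % approximate standard error on the z scale
>> tanh([z-1.96*se z+1.96*se])       % back-transform to a 95% CI for the correlation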
Correlation Is Not the Full Story...

◮ Correlation is only a statistical summary, so always plot the data.

A famous example of datasets with the same correlation (and many other identical statistics) is Anscombe's Quartet (see Wikipedia for more details):

[Figure: the four Anscombe's Quartet scatterplots. Source: http://en.wikipedia.org/wiki/File:Anscombe's quartet 3.svg]

Pitfalls of Correlation

Some pitfalls of correlation:
◮ We all know "correlation DOES NOT imply causation"
◮ Independence implies zero correlation, but zero correlation does not imply independence
◮ In particular, the Pearson product moment correlation coefficient only measures linear association

[Figure: scatterplots of datasets with various correlation coefficients, illustrating that correlation misses non-linear structure. Source: http://en.wikipedia.org/wiki/File:Correlation examples2.svg]

Simple Linear Regression Matrix Representation

The simple linear regression equations for the observation pairs can be concisely written using matrix notation:

    y = Xβ + ε    (2)

where y = (y1, y2, ..., yn)′ is the response vector, β = (β0, β1)′ is the coefficient vector, ε = (ε1, ε2, ..., εn)′ is the error vector, and X is the n × 2 design matrix whose i-th row is (1, xi).

◮ The design matrix X has a column of ones for the intercept term and a second column of explanatory variable observations corresponding to the elements of the response vector y.

System Of Equations

Each row represents one of an overdetermined system of linear equations:

    yi = β0 + β1 xi + εi,   i = 1, ..., n,

to which an approximate solution is calculated to find the regression coefficients β0 and β1.
◮ The estimated regression coefficient vector is β̂.
◮ The error vector ε will also be estimated, giving the residuals vector ε̂.

Ordinary Least Squares Estimation

Ordinary least squares (OLS) is a commonly used criterion for fitting regression models.
◮ Find β to minimise the sum of the squares of the errors:

    S = Σ εi² = ε′ε = (y − Xβ)′(y − Xβ)    (3)

◮ which is a quadratic in the unknown vector β.
◮ Therefore the minimum is found by setting the partial derivative of S (w.r.t. β) equal to zero.

OLS Solution

The partial derivative of S with respect to β is:

    ∂S/∂β = −2X′(y − Xβ),

which when set equal to zero gives:

    β̂ = (X′X)⁻¹X′y.    (4)

Note that:

    X′X = [ n     Σxi  ]
          [ Σxi   Σxi² ]

    (X′X)⁻¹ = 1/(n Σ(xi − x̄)²) [ Σxi²   −Σxi ]
                                 [ −Σxi    n   ]

    X′y = [ Σyi    ]
          [ Σxi yi ].

Simple Linear Regression Assumptions

The OLS criterion is intuitively sensible and the same solution (4) can be derived in many different ways under various assumptions. The assumptions underlying OLS are:
◮ weak exogeneity - X values are fixed, or can be treated as fixed (i.e. conditioned upon), otherwise an "errors-in-variables" model is needed. This also implies that the errors are uncorrelated with the explanatory variable;
◮ linearity - the response is a linear function of the explanatory variables;
◮ homoscedasticity - the errors have constant variance (essentially all errors are equally important), so Var(εi) = σ²; and
◮ uncorrelated errors - the errors are uncorrelated with each other, so Corr(εi, εj) = 0 for {i, j = 1, ..., n : i ≠ j}.

Or in matrix terms: E(ε) = 0, Var(ε) = σ²I, and Corr(X, ε) = 0.

It is also commonly assumed that the errors are normally distributed, but we'll get to this later.

Fitted Regression Model

The fitted regression model is then:

    ŷ = Xβ̂

The estimated prediction error (residuals) can be written:

    ε̂ = y − ŷ = y − Xβ̂    (5)

An unbiased estimate of the error variance is given by:

    σ̂² = Σ(yi − ŷi)² / (n − 2) = ε̂′ε̂ / (n − 2).    (6)

Vector Geometry of Least Squares

There is a nice geometric interpretation of the OLS solution:

    ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy,    (7)

where H is called the hat or projection matrix, which has the following properties:
◮ Symmetric: H′ = H
◮ Idempotent: H² = H.

The prediction error can also be written:

    ε̂ = y − ŷ = y − Hy = (I − H)y.    (8)
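These properties are easy to verify numerically. Here is a small MATLAB check (an addition to these notes) on the evaporation data, building the design matrix as in the example below:

>> X = [ones(length(Velocity),1) Velocity];          % design matrix for the evaporation data
>> H = X*inv(X'*X)*X';                               % hat (projection) matrix
>> max(max(abs(H' - H)))                             % ~0, so H is symmetric
>> max(max(abs(H*H - H)))                            % ~0, so H is idempotent: H^2 = H
>> ehat = (eye(length(Velocity)) - H)*Evaporation;   % residuals via (I - H)y
>> (H*Evaporation)'*ehat                             % ~0: fitted values orthogonal to residuals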
Vector Geometry of Least Squares Diagram

The prediction vector is then simply an orthogonal projection of the original response vector onto the subspace spanned by the explanatory variables, span(X), and the residuals are the leftovers, which are perpendicular to this subspace:

[Figure: the response vector y, its orthogonal projection ŷ = Hy = x0β̂0 + x1β̂1 onto the plane spanned by the columns x0 (ones) and x1 of X, and the perpendicular residual vector ε̂ = (I − H)y.]

Example: Evaporation Model Fitting in MATLAB

The design matrix X must be created first:

>> X=[ones(length(Velocity),1) Velocity];

There are various ways to fit a regression model in MATLAB, with an extensive set of options available using the regress function:

>> [B,BINT,R,RINT,STATS] = regress(Evaporation,X,0.05)

We will explain all the results from this function later. First let's see how to fit the model and obtain the previous results by doing the explicit calculations manually in MATLAB.

Example: Evaporation Model Fitting (Manually) in MATLAB

The simplest inbuilt option to solve the regression equations for β̂ is to use the backslash operator (left matrix divide), which gives the estimated regression coefficient vector β̂:

>> B = X\Evaporation

Calculating the estimated coefficient vector β̂ explicitly gives the same result:

>> B = inv(X'*X) * X'*Evaporation

Hint: the backslash operator is very numerically stable, so it is generally more reliable than the latter explicit solution.

Example: Evaporation Model Fit

The OLS estimate of the regression is then:

    Evaporation = 0.0692 + 0.0038 Velocity

which is plotted below, including marking on the residuals:

[Figure: scatterplot of the data with the fitted line ŷ = β̂0 + β̂1 X and the residuals ε̂ marked as vertical offsets.]

A subjective examination indicates the least squares line seems to provide a good fit to the observed response.

Example: Evaporation Model Predictions (Manually) in MATLAB

The estimated coefficients can then be used to calculate the predictions and residuals:

>> yhat = X*B
>> resids = Evaporation - yhat

These are the same as the vector R from the regress function above. The estimated variance of the errors is then:

>> sigma2 = resids'*resids / (length(resids)-2)

which is automatically provided as the fourth element STATS(4) of the output from the regress function above.

Example: Evaporation Model Fit Diagnostic in MATLAB

A common diagnostic to assess the model fit (and in particular the assumptions) is to plot the residuals against the predicted values:

>> evapstats=regstats(Evaporation,Velocity,'linear',{'yhat','r'});
>> clf; scatter(evapstats.yhat,evapstats.r)
>> hold on; plot([0 1.6],[0 0],'r-') % add horizontal line at zero
>> xlabel('Fitted Values'); ylabel('Residuals');

Regression Diagnostic - What to Look For?

[Figure: example residual plots showing residuals that are random with constant variance (good), and problem cases with increasing variance, changing variance, and non-zero mean over X.]

Assessing Model Performance

◮ The variability (in terms of the sum of squares around the mean) in the response can be decomposed into two components:

    Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
    Total      = Regression + Residual.

◮ A commonly used measure of performance is the coefficient of determination R².
◮ It is the percentage of the total response sum of squares explained by the model:

    R² = Explained (Regression) variation / Total variation
       = Σ(ŷi − ȳ)² / Syy = 1 − Σ(yi − ŷi)² / Syy,    (9)

    or equivalently ρ̂²xy (the squared sample correlation).
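As a sanity check of this decomposition (an added sketch, assuming yhat and resids from the manual calculations above):

>> SStot = sum((Evaporation - mean(Evaporation)).^2); % total sum of squares
>> SSreg = sum((yhat - mean(Evaporation)).^2);        % regression sum of squares
>> SSres = sum(resids.^2);                            % residual sum of squares
>> [SStot SSreg+SSres]                                % the two entries should agree
>> [SSreg/SStot corr(Velocity,Evaporation)^2]         % R^2 equals the squared correlation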
Example: Evaporation Model Fit Diagnostic in MATLAB

It is easy to calculate R² directly:

>> R2 = 1 - (resids'*resids / sum((Evaporation-mean(Evaporation)).^2))

But it is also provided as the first element STATS(1) of the output from the regress function:

>> [B,BINT,R,RINT,STATS] = regress(Evaporation,X,0.05)
>> STATS(1)

Or you could use the regstats function above.

Resampling Methods for Simple Linear Regression

The two types of explanatory variables (fixed or random) impact on the resampling scheme used to explore the properties of the model:
◮ random X (called observation resampling) - resample observation pairs, as with correlation; or
◮ fixed X (called residual resampling) - resample residuals and reapply the fitted model:
  ◮ fit the model and compute the residuals;
  ◮ generate a bootstrap sample residual vector ε̃;
  ◮ obtain bootstrap simulations of the response by applying the fitted model: ỹ = ŷ + ε̃.

Comments on Bootstrap for Regression

Obtain the bootstrap simulated response vectors ỹ1, ..., ỹNBOOT and the corresponding explanatory variable matrices X̃1, ..., X̃NBOOT (either resampled under "observation resampling", or the same X for "residual resampling"). For each one:
◮ refit the linear regression model;
◮ calculate the statistic of interest (e.g. coefficients, predictions, leverages).

Thus the bootstrap simulations provide a "sampling distribution" for any statistic of interest. Generally, the "observation resampling" approach leads to wider confidence intervals (i.e. indicating more uncertainty) than residual resampling, as the latter ignores uncertainty due to the explanatory variable levels and relies on the original model fit.

Example: Evaporation Model Bootstrapping in MATLAB

The evaporation example has fixed X values for the explanatory variables, as it was a controlled experiment. Therefore "residual resampling" is required. The bootstrap regression coefficients (and the corresponding estimated 95% confidence interval) can be calculated in MATLAB using:

>> bootbetas = bootstrp(nsim, @(bootr)regress(yhat+bootr,X), resids);
>> prctile(bootbetas,[2.5 97.5],1)

which gives a CI for β1 of (0.0031, 0.0046), a little narrower than the (0.0028, 0.0048) given by the BINT result from the regress function above. If "observation resampling" were appropriate then the following MATLAB code would do this:

>> bootbetas = bootstrp(nsim,@(y,x) regress(y,x),Evaporation,X);
>> prctile(bootbetas,[2.5 97.5],1)
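For intuition, the residual resampling scheme can also be hand-rolled. Here is a "poor man's bootstrap" sketch in the spirit of the Evaporation PoorMansBoot.m file mentioned earlier (an added illustration, not that file itself; it assumes X, yhat, resids and nsim from above):

>> n = length(resids);
>> bootbetas = zeros(nsim,2);
>> for i = 1:nsim
       bootr = resids(randi(n,n,1));   % resample residuals with replacement
       ytilde = yhat + bootr;          % bootstrap response from the fitted model
       bootbetas(i,:) = (X\ytilde)';   % refit by least squares
   end
>> prctile(bootbetas,[2.5 97.5],1)     % bootstrap 95% CIs for beta0 and beta1

This should reproduce the bootstrp-based intervals above up to simulation noise.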
Revision: Idea Underlying Bootstrapping

◮ There is an unknown population parameter (some feature or characteristic of interest).
◮ We estimate it using a sample statistic.
◮ SRS - Simple Random Sampling.

[Figure: a population with unknown mean µ; repeated SRSs of size n each yield a sample mean x̄, and the collection of x̄ values forms the sampling distribution. Source: Moore & McCabe, Introduction to the Practice of Statistics; Chapter 14 on "Bootstrap Methods and Permutation Tests" is freely available from http://bcs.whfreeman.com/ips5e/content/cat 080/pdf/moore14.pdf]

Idea Underlying Sample Estimators

◮ How reliable is that estimator? Some key properties of estimators¹:
  ◮ bias (accuracy) - how far is the "average" of the estimator from the true parameter;
  ◮ variance (precision) - the variability of the estimator across samples;
  ◮ consistency - does the estimator converge (in probability) to the true parameter as the sample size increases;
  ◮ efficiency - is this the best estimator, or is there one that makes better use of the data?
◮ Bootstrap simulations can be used to explore these properties of estimators.
◮ On this course, bootstrapping is used to explore the variance (precision) of the estimator, i.e. how much variability do we get in samples of that sample size?

¹ See the Wikipedia page on "Estimator" for formal definitions.