ENCN305 Regression Modelling
Lecturer: Dr Raazesh Sainudiin (Raaz)
Notes by Dr Carl Scarrott
University of Canterbury
May 3 2013

Example: Evaporation at Different Air Velocities

Measurements from an experiment considering the evaporation coefficient of a liquid at fixed air velocities:

Velocity (cm/s):       20    60   100   140   180   220   260   300   340   380
Evaporation (mm²/s): 0.18  0.37  0.35  0.78  0.56  0.75  1.18  1.36  1.17  1.65

What is Regression?

Regression analysis uses various models and techniques for describing the relationships between two or more variables, which can potentially be used for prediction purposes. We will consider two basic regression models:
◮ Simple linear regression has a linear relationship between a single explanatory variable and a single response variable.
◮ Multiple linear regression extends this basic model to allow for multiple explanatory variables.

Alternative terminologies:
◮ explanatory variable - predictor, input or independent variable
◮ response variable - predictand, output or dependent variable.

Data for Regression Modelling

Observed "data vectors" of these variables are used to derive the form and quantify the strength of the linear relationship. The fitted model can then be used to predict the response given a value of the explanatory variable. For example, if you want to predict your exam mark in ENCN305 using your test and assignment marks, then the:
◮ explanatory variables - test marks, assignment 1-3 marks (possibly EMTH210 or MATH103 final grades?)
◮ response variable - ENCN305 exam mark.

We'd expect a strong positive association between the exam and test marks, though associations with assignments are often weaker. So you'd expect to be able to reliably predict your exam marks using the test marks, but often the assignment marks do not "significantly improve the model fit".

What Will We Cover In This Course?

On this course you will learn how to:
◮ specify appropriate simple/multiple linear regression models;
◮ fit the model to data;
◮ evaluate the model performance;
◮ choose between alternative models;
◮ predict using the model; and
◮ quantify the model uncertainties and evaluate its statistical usefulness.

You will also see some of the common pitfalls of regression models.

Simple Linear Regression

A simple linear regression model with explanatory variable X and response Y can be written:

    Y = β0 + β1 X + ε

where ε is the error term. The properties of the errors are discussed further below. In this course, the response and explanatory variables are assumed to be continuous, i.e. they can take any value over a range, although extensions to discrete explanatory variables or alternative response distributions are available (e.g. discrete distributions, or restriction to the range [0,1] for predicting probabilities).¹

¹ Google "generalised linear model" if you want to know more.
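To make the model concrete, here is a minimal MATLAB sketch (an addition to these notes, not part of the original; the parameter values β0 = 0.1, β1 = 0.004 and σ = 0.15 are made up, loosely chosen to mimic the evaporation example) that simulates data from a simple linear regression model:

>> beta0 = 0.1; beta1 = 0.004; sigma = 0.15;    % assumed "true" parameters (illustrative only)
>> x = (20:40:380)';                            % fixed design, as in the evaporation experiment
>> y = beta0 + beta1*x + sigma*randn(size(x));  % response = systematic line + random error
>> scatter(x,y); xlabel('Velocity (cm/s)'); ylabel('Simulated evaporation (mm^2/s)');

Re-running the last two lines shows how much scatter the error term alone produces around the same underlying line.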
Fixed Versus Conditional Modelling

We need to differentiate two situations for data collection of the explanatory variable(s) (let's draw it →):
◮ fixed X values, typically resulting from experiments where the values can be controlled in order to explore possible causal relationships; and
◮ random X values, in which case the modelling is conditional on the observed values and care must be taken, as formally the results cannot be used to explore causal relationships.

Note: in the latter case it is further assumed that these measurements are not subject to error (e.g. measurement error), otherwise a more sophisticated "errors-in-variables" model is required.

A Bit of Terminology

The primary parameters of the model Y = β0 + β1 X + ε are the intercept β0 and the gradient β1 (often called the regression coefficients). They are usually estimated using a sample dataset consisting of pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). Notice that the sample size n is the number of observation pairs.

Interpretation of the coefficients:
◮ β1 is the effect of a unit increase in the explanatory variable X on the response Y
◮ β0 is the value of the response Y when the explanatory variable X = 0

What Is Meant By "Linear" Regression Model?

"Linear" regression models are linear in the unknown parameters, but there may be a non-linear relationship between X and Y. This is a common misconception. For example, the following quadratic form is a linear model:

    Y = β0 + β1 X² + ε,

and this is a non-linear model:

    Y = β0 X1^β1 X2^β2 ε.

In this example, it is possible to turn the non-linear model into a linear model by taking the logarithm of both sides of the equation, giving log Y = log β0 + β1 log X1 + β2 log X2 + log ε, which is linear in the unknown parameters. However, general techniques for fitting non-linear models are beyond the scope of this course.

Descriptive Terms

The first step in regression is to determine whether or not there is an observed association between the variables of interest. The term "association" is used to avoid the complications of causal interpretations of the term "relationship". A scatterplot can be a useful tool to examine:
◮ strength - weak, moderate or strong association;
◮ direction - positive or negative;
◮ form - e.g. linear or quadratic.

Example: Evaporation Scatterplot

Let's return to the measurements from the experiment considering the evaporation coefficient of a liquid at fixed air velocities:

Velocity (cm/s):       20    60   100   140   180   220   260   300   340   380
Evaporation (mm²/s): 0.18  0.37  0.35  0.78  0.56  0.75  1.18  1.36  1.17  1.65

Use the MATLAB menu option File > Import Data... to import the data (a pretty intuitive interface). Then plot the data:
◮ coded in ../Datasets/Evaporation.m
◮ It looks as though there is a strong positive linear association between air velocity and evaporation.

But how strong is it? Could it be spurious (just due to natural sample variability with no underlying correlation)?

Correlation Coefficient

The strength of a linear association can be measured using the Pearson product moment "correlation coefficient":

    ρXY = Cov(X, Y) / √(Var(X) Var(Y))

where:

    Var(X) = E{(X − µX)²}
    Var(Y) = E{(Y − µY)²},

and the covariance measures how X and Y vary with each other:

    Cov(X, Y) = E{(X − µX)(Y − µY)}.

◮ There are many different measures of correlation; this is the most commonly used.

Problem With Covariance Interpretation

The covariance is a measure of how X and Y linearly vary with each other, where:
◮ if Y tends to increase as X increases then the covariance will be positive;
◮ if Y tends to decrease as X increases then the covariance will be negative; and
◮ the unit of measurement is (units of X) times (units of Y).

The lack of invariance to the scale of measurement makes the covariance hard to interpret.
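As a quick numerical illustration (a sketch added to these notes, assuming the Velocity and Evaporation vectors have been imported as above), rescaling X changes the covariance but leaves the correlation untouched:

>> C = cov(Velocity,Evaporation);        % 2x2 sample covariance matrix
>> C(1,2)                                % covariance, in (cm/s) x (mm^2/s)
>> C2 = cov(Velocity/100,Evaporation);   % rescale velocity to m/s
>> C2(1,2)                               % covariance shrinks by a factor of 100
>> corr(Velocity,Evaporation)            % correlation is unchanged by the rescaling
>> corr(Velocity/100,Evaporation)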
Calculating Correlation For Sample Data

The sample correlation coefficient is estimated using:

    ρ̂XY = σ̂XY / (σ̂X σ̂Y)    (1)

where the sample covariance and variance estimators are:

    σ̂XY = Σ(xi − x̄)(yi − ȳ) / (n−1)   or   (Σ xi yi − n x̄ ȳ) / (n−1)
    σ̂²X = Σ(xi − x̄)² / (n−1)          or   (Σ xi² − n x̄²) / (n−1)
    σ̂²Y = Σ(yi − ȳ)² / (n−1)          or   (Σ yi² − n ȳ²) / (n−1)

and the sample means of the random variables are x̄ and ȳ respectively. The correlation is on a standardised scale, so its interpretation does not depend on the measurement scales of X and Y. The division by the standard deviations (square roots of the variances) ensures the correlation is on the range −1 to 1.

An Aside: Alternative Formulae for Sample Correlation

The denominators (n − 1) on the numerator and denominator of equation (1) cancel, so the correlation is often simplified to:

    ρ̂XY = SSXY / √(SSXX SSYY)

where the sums of squares (SS) of the mean-corrected data are:

    SSXX = Σ(xi − x̄)²
    SSYY = Σ(yi − ȳ)²

and the sum of the mean-corrected cross products is:

    SSXY = Σ(xi − x̄)(yi − ȳ).

Correlation Terminology

The correlation coefficient satisfies −1 ≤ ρXY ≤ 1, where:
◮ 0 indicates no association;
◮ +1 and −1 represent perfect positive and negative correlation.

Commonly used descriptors are:
◮ |ρXY| > 0.7 strong correlation;
◮ 0.3 < |ρXY| < 0.7 moderate correlation;
◮ |ρXY| < 0.3 weak correlation.

Samples from Bivariate Normal (Joint & Marginal) Densities

[Figure: simulated samples from a bivariate normal (µ1 = µ2 = 0 and σ1 = σ2 = 1) of size 1000, with correlation = 0, 0.3, 0.7, −0.7.] See ./mfiles/Plot2DGaussianSamplesAndMarginals.m, shall we?

Simulated Data Using mvnrnd

[Figure: simulated samples from a bivariate normal (µ1 = µ2 = 0 and σ1 = σ2 = 1) of size 100 (using mvnrnd) with different correlations.]

Calculating Correlation in MATLAB

The correlation coefficient is calculated in MATLAB using:

>> corr(Velocity,Evaporation)   % if separate vectors
>> corr([Velocity Evaporation]) % if matrix with column per variable

The sample correlation coefficient for the Evaporation data is 0.95. Bootstrap simulations are obtained from:

>> nsim=10000;
>> corrstats=bootstrp(nsim,@corr,Velocity,Evaporation)

A bootstrap-based 95% confidence interval can be obtained from:

>> prctile(corrstats,[2.5 97.5]) % using bootstrap correlations

or more directly using the bootci function:

>> corrxy = @(x,y) corr(x,y)
>> bootci(nsim,{corrxy,Velocity,Evaporation},'alpha',0.05,'type','per')

Note: the 'type' option specifies the basic percentile method and 'alpha' is the significance level. Let's DIY: see ../../Datasets/Evaporation PoorMansBoot.m.

Bootstrapped Correlations

The bootstrap simulations below and the estimated 95% confidence interval of (0.88, 0.99) confirm the true correlation between evaporation and air velocity is statistically significantly different from zero.

Note: the bootstrap sampling distribution is highly skewed (it would be symmetric for normal populations with zero correlation). Traditional confidence interval estimation approaches typically use transformations to adjust for this (e.g. see Fisher's Z transform on Wikipedia).
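For comparison with the bootstrap interval, here is a sketch (an addition, not from the original notes) of the Fisher Z-transform interval mentioned above: transform the correlation to an approximately normal scale, build the interval there, and back-transform:

>> r = corr(Velocity,Evaporation);   % sample correlation (0.95 here)
>> n = length(Velocity);             % number of observation pairs
>> z = atanh(r);                     % Fisher's Z: z = 0.5*log((1+r)/(1-r))
>> se = 1/sqrt(n-3);                 % approximate standard error on the z scale
>> tanh([z-1.96*se z+1.96*se])       % back-transform to a 95% CI for the correlation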
Correlation Is Not the Full Story...

◮ Correlation is only a statistical summary, so always plot the data.

A famous example of datasets with the same correlation (and many other identical statistics) is Anscombe's Quartet (see Wikipedia for more details):

[Figure: the four Anscombe's Quartet scatterplots. Source: http://en.wikipedia.org/wiki/File:Anscombe's quartet 3.svg]

Pitfalls of Correlation

Some pitfalls of correlation:
◮ We all know "correlation DOES NOT imply causation"
◮ Independence implies zero correlation, but zero correlation does not imply independence
◮ In particular, the Pearson product moment correlation coefficient only measures linear association

[Figure: scatterplots of datasets with various correlation coefficients, illustrating that correlation misses non-linear structure. Source: http://en.wikipedia.org/wiki/File:Correlation examples2.svg]

Simple Linear Regression Matrix Representation

The simple linear regression equations for the observation pairs can be concisely written using matrix notation:

    y = Xβ + ε    (2)

where y = (y1, y2, ..., yn)′ is the response vector, β = (β0, β1)′ is the coefficient vector, ε = (ε1, ε2, ..., εn)′ is the error vector, and X is the n × 2 design matrix whose i-th row is (1, xi).

◮ The design matrix X has a column of ones for the intercept term and a second column of explanatory variable observations corresponding to the elements of the response vector y.

System Of Equations

Each row represents one of an overdetermined system of linear equations:

    yi = β0 + β1 xi + εi,   i = 1, ..., n,

to which an approximate solution is calculated to find the regression coefficients β0 and β1.
◮ The estimated regression coefficient vector is β̂.
◮ The error vector ε will also be estimated, giving the residuals vector ε̂.

Ordinary Least Squares Estimation

Ordinary least squares (OLS) is a commonly used criterion for fitting regression models.
◮ Find β to minimise the sum of the squares of the errors:

    S = Σ εi² = ε′ε = (y − Xβ)′(y − Xβ)    (3)

◮ which is a quadratic in the unknown vector β.
◮ Therefore the minimum is found by setting the partial derivative of S (w.r.t. β) equal to zero.

OLS Solution

The partial derivative of S with respect to β is:

    ∂S/∂β = −2X′(y − Xβ),

which when set equal to zero gives:

    β̂ = (X′X)⁻¹X′y.    (4)

Note that:

    X′X = [ n     Σxi  ]
          [ Σxi   Σxi² ]

    (X′X)⁻¹ = 1/(n Σ(xi − x̄)²) [ Σxi²   −Σxi ]
                                 [ −Σxi    n   ]

    X′y = [ Σyi    ]
          [ Σxi yi ].

Simple Linear Regression Assumptions

The OLS criterion is intuitively sensible and the same solution (4) can be derived in many different ways under various assumptions. The assumptions underlying OLS are:
◮ weak exogeneity - X values are fixed, or can be treated as fixed (i.e. conditioned upon), otherwise an "errors-in-variables" model is needed. This also implies that the errors are uncorrelated with the explanatory variable;
◮ linearity - the response is a linear function of the explanatory variables;
◮ homoscedasticity - the errors have constant variance (essentially all errors are equally important), so Var(εi) = σ²; and
◮ uncorrelated errors - the errors are uncorrelated with each other, so Corr(εi, εj) = 0 for {i, j = 1, ..., n : i ≠ j}.

Or in matrix terms: E(ε) = 0, Var(ε) = σ²I, and Corr(X, ε) = 0.

It is also commonly assumed that the errors are normally distributed, but we'll get to this later.

Fitted Regression Model

The fitted regression model is then:

    ŷ = Xβ̂

The estimated prediction error (residuals) can be written:

    ε̂ = y − ŷ = y − Xβ̂    (5)

An unbiased estimate of the error variance is given by:

    σ̂² = Σ(yi − ŷi)² / (n − 2) = ε̂′ε̂ / (n − 2).    (6)

Vector Geometry of Least Squares

There is a nice geometric interpretation of the OLS solution:

    ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy,    (7)

where H is called the hat or projection matrix, which has the following properties:
◮ Symmetric: H′ = H
◮ Idempotent: H² = H.

The prediction error can also be written:

    ε̂ = y − ŷ = y − Hy = (I − H)y.    (8)
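These properties are easy to verify numerically. Here is a small MATLAB check (an addition to these notes) on the evaporation data, building the design matrix as in the example below:

>> X = [ones(length(Velocity),1) Velocity];          % design matrix for the evaporation data
>> H = X*inv(X'*X)*X';                               % hat (projection) matrix
>> max(max(abs(H' - H)))                             % ~0, so H is symmetric
>> max(max(abs(H*H - H)))                            % ~0, so H is idempotent: H^2 = H
>> ehat = (eye(length(Velocity)) - H)*Evaporation;   % residuals via (I - H)y
>> (H*Evaporation)'*ehat                             % ~0: fitted values orthogonal to residuals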
Vector Geometry of Least Squares Diagram

The prediction vector is then simply an orthogonal projection of the original response vector onto the subspace spanned by the explanatory variables, span(X), and the residuals are the leftovers, which are perpendicular to this subspace:

[Figure: the response vector y, its orthogonal projection ŷ = Hy = x0β̂0 + x1β̂1 onto the plane spanned by the columns x0 (ones) and x1 of X, and the perpendicular residual vector ε̂ = (I − H)y.]

Example: Evaporation Model Fitting in MATLAB

The design matrix X must be created first:

>> X=[ones(length(Velocity),1) Velocity];

There are various ways to fit a regression model in MATLAB, with an extensive set of options available using the regress function:

>> [B,BINT,R,RINT,STATS] = regress(Evaporation,X,0.05)

We will explain all the results from this function later. First let's see how to fit the model and obtain the previous results by doing the explicit calculations manually in MATLAB.

Example: Evaporation Model Fitting (Manually) in MATLAB

The simplest inbuilt option to solve the regression equations for β̂ is to use the backslash operator (left matrix divide), which gives the estimated regression coefficient vector β̂:

>> B = X\Evaporation

Calculating the estimated coefficient vector β̂ explicitly gives the same result:

>> B = inv(X'*X) * X'*Evaporation

Hint: the backslash operator is very numerically stable, so it is generally more reliable than the latter explicit solution.

Example: Evaporation Model Fit

The OLS estimate of the regression is then:

    Evaporation = 0.0692 + 0.0038 Velocity

which is plotted below, including marking on the residuals:

[Figure: scatterplot of the data with the fitted line ŷ = β̂0 + β̂1 X and the residuals ε̂ marked as vertical offsets.]

A subjective examination indicates the least squares line seems to provide a good fit to the observed response.

Example: Evaporation Model Predictions (Manually) in MATLAB

The estimated coefficients can then be used to calculate the predictions and residuals:

>> yhat = X*B
>> resids = Evaporation - yhat

These are the same as the vector R from the regress function above. The estimated variance of the errors is then:

>> sigma2 = resids'*resids / (length(resids)-2)

which is automatically provided as the fourth element STATS(4) of the output from the regress function above.

Example: Evaporation Model Fit Diagnostic in MATLAB

A common diagnostic to assess the model fit (and in particular the assumptions) is to plot the residuals against the predicted values:

>> evapstats=regstats(Evaporation,Velocity,'linear',{'yhat','r'});
>> clf; scatter(evapstats.yhat,evapstats.r)
>> hold on; plot([0 1.6],[0 0],'r-') % add horizontal line at zero
>> xlabel('Fitted Values'); ylabel('Residuals');

Regression Diagnostic - What to Look For?

[Figure: example residual plots showing residuals that are random with constant variance (good), and problem cases with increasing variance, changing variance, and non-zero mean over X.]

Assessing Model Performance

◮ The variability (in terms of the sum of squares around the mean) in the response can be decomposed into two components:

    Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
    Total      = Regression + Residual.

◮ A commonly used measure of performance is the coefficient of determination R².
◮ It is the percentage of the total response sum of squares explained by the model:

    R² = Explained (Regression) variation / Total variation
       = Σ(ŷi − ȳ)² / Syy = 1 − Σ(yi − ŷi)² / Syy,    (9)

    or equivalently ρ̂²xy (the squared sample correlation).
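As a sanity check of this decomposition (an added sketch, assuming yhat and resids from the manual calculations above):

>> SStot = sum((Evaporation - mean(Evaporation)).^2); % total sum of squares
>> SSreg = sum((yhat - mean(Evaporation)).^2);        % regression sum of squares
>> SSres = sum(resids.^2);                            % residual sum of squares
>> [SStot SSreg+SSres]                                % the two entries should agree
>> [SSreg/SStot corr(Velocity,Evaporation)^2]         % R^2 equals the squared correlation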
Example: Evaporation Model Fit Diagnostic in MATLAB

It is easy to calculate R² directly:

>> R2 = 1 - (resids'*resids / sum((Evaporation-mean(Evaporation)).^2))

But it is also provided as the first element STATS(1) of the output from the regress function:

>> [B,BINT,R,RINT,STATS] = regress(Evaporation,X,0.05)
>> STATS(1)

Or you could use the regstats function above.

Resampling Methods for Simple Linear Regression

The two types of explanatory variables (fixed or random) impact on the resampling scheme used to explore the properties of the model:
◮ random X (called observation resampling) - resample observation pairs, as with correlation; or
◮ fixed X (called residual resampling) - resample residuals and reapply the fitted model:
  ◮ fit the model and compute the residuals;
  ◮ generate a bootstrap sample residual vector ε̃;
  ◮ obtain bootstrap simulations of the response by applying the fitted model: ỹ = ŷ + ε̃.

Comments on Bootstrap for Regression

Obtain the bootstrap simulated response vectors ỹ1, ..., ỹNBOOT and the corresponding explanatory variable matrices X̃1, ..., X̃NBOOT (either resampled under "observation resampling", or the same X for "residual resampling"). For each one:
◮ refit the linear regression model;
◮ calculate the statistic of interest (e.g. coefficients, predictions, leverages).

Thus the bootstrap simulations provide a "sampling distribution" for any statistic of interest. Generally, the "observation resampling" approach leads to wider confidence intervals (i.e. indicating more uncertainty) than residual resampling, as the latter ignores uncertainty due to the explanatory variable levels and relies on the original model fit.

Example: Evaporation Model Bootstrapping in MATLAB

The evaporation example has fixed X values for the explanatory variables, as it was a controlled experiment. Therefore "residual resampling" is required. The bootstrap regression coefficients (and the corresponding estimated 95% confidence interval) can be calculated in MATLAB using:

>> bootbetas = bootstrp(nsim, @(bootr)regress(yhat+bootr,X), resids);
>> prctile(bootbetas,[2.5 97.5],1)

which gives a CI for β1 of (0.0031, 0.0046), a little narrower than the (0.0028, 0.0048) given by the BINT result from the regress function above. If "observation resampling" were appropriate then the following MATLAB code would do this:

>> bootbetas = bootstrp(nsim,@(y,x) regress(y,x),Evaporation,X);
>> prctile(bootbetas,[2.5 97.5],1)
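For intuition, the residual resampling scheme can also be hand-rolled. Here is a "poor man's bootstrap" sketch in the spirit of the Evaporation PoorMansBoot.m file mentioned earlier (an added illustration, not that file itself; it assumes X, yhat, resids and nsim from above):

>> n = length(resids);
>> bootbetas = zeros(nsim,2);
>> for i = 1:nsim
       bootr = resids(randi(n,n,1));   % resample residuals with replacement
       ytilde = yhat + bootr;          % bootstrap response from the fitted model
       bootbetas(i,:) = (X\ytilde)';   % refit by least squares
   end
>> prctile(bootbetas,[2.5 97.5],1)     % bootstrap 95% CIs for beta0 and beta1

This should reproduce the bootstrp-based intervals above up to simulation noise.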
Revision: Idea Underlying Bootstrapping

◮ There is an unknown population parameter (some feature or characteristic of interest).
◮ We estimate it using a sample statistic.
◮ SRS - Simple Random Sampling.

[Figure: a population with unknown mean µ; repeated SRSs of size n each yield a sample mean x̄, and the collection of x̄ values forms the sampling distribution. Source: Moore & McCabe, Introduction to the Practice of Statistics; Chapter 14 on "Bootstrap Methods and Permutation Tests" is freely available from http://bcs.whfreeman.com/ips5e/content/cat 080/pdf/moore14.pdf]

Idea Underlying Sample Estimators

◮ How reliable is that estimator? Some key properties of estimators¹:
  ◮ bias (accuracy) - how far is the "average" of the estimator from the true parameter;
  ◮ variance (precision) - the variability of the estimator across samples;
  ◮ consistency - does the estimator converge (in probability) to the true parameter as the sample size increases;
  ◮ efficiency - is this the best estimator, or is there one that makes better use of the data?
◮ Bootstrap simulations can be used to explore these properties of estimators.
◮ On this course, bootstrapping is used to explore the variance (precision) of the estimator, i.e. how much variability do we get in samples of that sample size?

¹ See the Wikipedia page on "Estimator" for formal definitions.