Multiple Linear Regression
Doc: S-FB 154 E, Version: 1, © 2008 Q-DAS GmbH & Co. KG, 69469 Weinheim

Objective: This example shows how to perform a multiple linear regression analysis on the basis of existing process data (field data). After completing the example, you will be able to carry out such regression analyses on your own and to evaluate the results using different statistical values and graphical charts.

Main points: Performing a regression analysis

Pre-Requisites: Basics of regression

Important: Please switch to the qs-STAT Analysis of Regression/Variance program module in order to access the functions described below (menu item Module|qs-STAT Analysis of Regression/Variance).

Initial situation: Ammonium sulphate is filled into sacks. During this process, agglutination (clumping) often occurs and blocks the filling system. Observing (measuring) possible causes should indicate which influence factors the response, the flow rate of the filling system, depends on most. The following potential influence factors were analyzed:
x1 = humidity of the ammonium sulphate (in 0.01%)
x2 = ratio of length to width of the crystals
x3 = contamination in the ammonium sulphate (in 0.01%)
48 data sets were recorded.

Task: The recorded data shall be analyzed using a multiple linear regression analysis in order to determine the significance of the individual influence factors.

Remark: You can find the data in the FLOWRATE_REGRESSION.DFQ file. Alternatively, you can create a new file using the data in the table at the end of this document.

Procedure:
1. Select the File|Open menu function and open the FLOWRATE_REGRESSION.DFQ file.
2. Select the REGRESSION ANALYSIS function from the ANALYSIS / PROCEDURE menu item and then choose MULTIPLE REGRESSION / LINEAR REGRESSION.
3. Open the characteristic selection dialog with a mouse-click on “Linear regression” or on the Next button. The response and the influence factors are selected by setting “crosses” per mouse-click.

The interpretation of the results is discussed in the following sections. This example covers the most important aspects.

Parameter estimation

The first results are the estimates of the regression coefficients. The coefficient of determination is not very high at R² = 57.493%. It indicates how well the variation of the response can be explained by the influence factors. Next to the descriptions of the response and the influence factors, the regression coefficients (bi) and their confidence intervals are listed. For the evaluation of the coefficients, their standard deviation (sci) and the t-statistic are output as well. The bar chart shows, like the t-values, whether the effect of an influence factor is significant. In addition, the VIF values shown are a measure of how strongly the individual influence factors depend on each other.

Interpretation: The influence factors explain only 57.5% of the variation of the flow rate. This means that important effects have not been considered. The contamination is the most important influence factor, followed by the ratio of length to width.

Model evaluation

The next results serve to evaluate the overall model approach. It is tested whether the basic requirements for the model are met. The first test checks the (quasi-) linear relation of the selected approach. The second test checks whether all influence factors together have a significant effect on the response.

Interpretation: The selected linear relation can be used for this problem.
The hypothesis of a linear relation is not rejected (green color). The overall influence of all influence factors on the response is significant (red color). This means the selected regression approach is usable.

Response Surface Plots Overview

This chart allows a meaningful graphical interpretation of the relations. It shows the effect of two influence factors on the response for fixed settings of the other influence factors.

Interpretation: The chart shows the effect of the humidity [%] and the length-width ratio at a given contamination of ca. 2%. The surface changes depending on the setting for the contamination [%]. The highest flow rate for the selected setting is found in the top left corner (red coloring).

Single-factor plots overview

The Single-factor plots overview shows the effect of each influence factor and the predicted value that can be expected for a user-defined setting of the influence factors.

Interpretation: For the selected setting of the influence factors (red lines), we can expect a flow rate of ca. 4, with a prediction interval of ± 1.7, which indicates the accuracy of the prediction.

Analysis of the residuals

Using the residuals (the deviations between the calculated and the measured values of the response), we can check graphically whether the assumptions of the regression approach are fulfilled.

1. The value chart of the residuals (top chart) displays the residuals by value number. Ideally, the residuals behave randomly.
2. The probability plot (chart in the lower left corner) is used to assess the assumption that the residuals are normally distributed.
3. The chart in the lower right corner compares the residuals with the estimated values (fitted values) in a scatter plot. Ideally, the data points are distributed randomly in the coordinate system.

Interpretation: The residuals do not seem to be random, at least not after the 35th value. It should be checked whether something unusual happened during the data recording process. The assumption of a normal distribution of the residuals cannot be rejected based on the probability plot. The scatter plot of the fitted values and the residuals does not show anything unusual either.

Other graphical evaluations of the model

Leverage

The leverage measures how far the individual values of the influence factors deviate from their average (average vector). These values are used to check whether individual value sets of the influence factors can be interpreted as outliers. With small sample sizes, a few extreme values can strongly influence the estimation of the regression coefficients.

Interpretation: Data sets no. 22, 35 and 42 can be regarded as outliers that could influence the evaluation. For testing purposes, the regression could be re-evaluated without these values and compared to the existing approach. In this example, this would lead to a slightly smaller coefficient of determination, but the relative importance of the influence factors would not change.

Cook’s distance

The Cook’s distances evaluate the significance of individual data sets for the estimation of the model parameters. It is tested how much the estimation of the response changes if a data set is removed from the sample. Comparing the Cook’s distances to the leverage values in a scatter plot can indicate whether certain outliers of the influence factors are decisive for the estimation of the parameters.

Interpretation: Nothing out of the ordinary can be spotted in the top chart. The individual Cook’s distances are small. The extreme outliers (high leverage value: data sets no. 22, 35 and 42) show a relatively large Cook’s distance. Therefore, they have an important influence on the estimation of the model parameters.

Data sets:

No.  Flow rate  Humidity  Length to width  Contamination
  1       5.00        21             2.40              0
  2       4.81        20             2.40              0
  3       4.46        16             2.40              0
  4       4.81        18             2.50              0
  5       4.46        16             3.20              0
  6       3.85        18             3.10              1
  7       3.21        12             3.20              1
  8       3.25        12             2.70              0
  9       4.55        13             2.70              0
 10       4.85        13             2.70              0
 11       4.00        17             2.70              0
 12       3.62        24             2.80              0
 13       5.15        11             2.50              0
 14       3.76        10             2.60              0
 15       4.90        17             2.00              0
 16       4.13        14             2.00              0
 17       5.10        14             2.00              1
 18       5.05        14             1.90              0
 19       4.27        20             2.10              2
 20       4.90        12             1.90              1
 21       4.55        11             2.00              2
 22       5.32        10             2.00              7
 23       4.39        10             2.00              2
 24       4.85        16             2.00              2
 25       4.59        17             2.20              3
 26       5.00        17             2.40              4
 27       3.82        17             2.40              0
 28       3.68        15             2.40              2
 29       5.15        17             2.20              3
 30       2.94        21             2.20              4
 31       3.18        23             2.20             10
 32       2.28        22             2.00              7
 33       5.00        21             1.90              4
 34       2.43        24             2.10              8
 35       0.00        37             2.30             14
 36       4.10        21             2.40              2
 37       3.70        28             2.40              5
 38       3.36        29             2.40              7
 39       3.79        23             3.60              7
 40       3.40        32             3.30              8
 41       1.51        26             3.50              4
 42       0.00        28             3.50             12
 43       1.72        21             3.00              3
 44       2.33        22             3.00              6
 45       2.38        34             3.00              8
 46       3.68        29             3.50              5
 47       4.20        17             3.50              3
 48       5.00        11             3.20              2
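The statistics discussed in this example (regression coefficients, coefficient of determination, t-values, VIF, leverage and Cook's distance) can also be approximated outside qs-STAT from the 48 data sets listed above. The following is a minimal sketch in Python with NumPy, assuming ordinary least squares with an intercept and the usual textbook formulas. Whether qs-STAT computes each statistic in exactly this way is an assumption, so small differences from the values displayed in the program are possible. All variable names are illustrative.

```python
import numpy as np

# The 48 data sets from the table above (decimal commas written as points).
flow = [5.00, 4.81, 4.46, 4.81, 4.46, 3.85, 3.21, 3.25, 4.55, 4.85, 4.00, 3.62,
        5.15, 3.76, 4.90, 4.13, 5.10, 5.05, 4.27, 4.90, 4.55, 5.32, 4.39, 4.85,
        4.59, 5.00, 3.82, 3.68, 5.15, 2.94, 3.18, 2.28, 5.00, 2.43, 0.00, 4.10,
        3.70, 3.36, 3.79, 3.40, 1.51, 0.00, 1.72, 2.33, 2.38, 3.68, 4.20, 5.00]
humidity = [21, 20, 16, 18, 16, 18, 12, 12, 13, 13, 17, 24, 11, 10, 17, 14,
            14, 14, 20, 12, 11, 10, 10, 16, 17, 17, 17, 15, 17, 21, 23, 22,
            21, 24, 37, 21, 28, 29, 23, 32, 26, 28, 21, 22, 34, 29, 17, 11]
ratio = [2.4, 2.4, 2.4, 2.5, 3.2, 3.1, 3.2, 2.7, 2.7, 2.7, 2.7, 2.8,
         2.5, 2.6, 2.0, 2.0, 2.0, 1.9, 2.1, 1.9, 2.0, 2.0, 2.0, 2.0,
         2.2, 2.4, 2.4, 2.4, 2.2, 2.2, 2.2, 2.0, 1.9, 2.1, 2.3, 2.4,
         2.4, 2.4, 3.6, 3.3, 3.5, 3.5, 3.0, 3.0, 3.0, 3.5, 3.5, 3.2]
contamination = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 1, 0, 2, 1, 2, 7, 2, 2, 3, 4, 0, 2, 3, 4, 10, 7,
                 4, 8, 14, 2, 5, 7, 7, 8, 4, 12, 3, 6, 8, 5, 3, 2]

y = np.array(flow)
# Design matrix: intercept column plus x1, x2, x3.
X = np.column_stack([np.ones(len(y)), humidity, ratio, contamination])
n, p = X.shape

# Least-squares estimate b = (X'X)^-1 X'y.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b

# Coefficient of determination R^2.
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

# Standard deviations of the coefficients and t-statistics.
s2 = ss_res / (n - p)                      # residual variance
se = np.sqrt(s2 * np.diag(XtX_inv))
t_values = b / se

# VIF per influence factor, using VIF_j = C_jj * sum((x_j - mean_j)^2),
# where C = (X'X)^-1 of the model with intercept.
factors = X[:, 1:]
centered_ss = ((factors - factors.mean(axis=0)) ** 2).sum(axis=0)
vif = np.diag(XtX_inv)[1:] * centered_ss

# Leverage values: diagonal of the hat matrix X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Cook's distance for every data set.
cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

names = ["intercept", "humidity", "length/width", "contamination"]
print(f"R^2 = {100 * r2:.3f} %")
for name, coef, sd, tv in zip(names, b, se, t_values):
    print(f"{name:14s} b = {coef:8.4f}  s = {sd:7.4f}  t = {tv:7.3f}")
print("VIF:", dict(zip(names[1:], np.round(vif, 3))))
# Data set numbers (1-based) with the three largest leverage values.
print("highest leverage:", sorted(np.argsort(leverage)[-3:] + 1))
print("largest Cook's distance: data set", int(np.argmax(cooks)) + 1)
```

Running the sketch prints the fitted coefficients and lists the data sets with the highest leverage, which can then be compared with the charts and the interpretation given in the sections above.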