
Recall that the correlation coefficient (COC) indicates the amount of linear association between two quantitative variables and takes a value between -1.0 and 1.0. For example, a value of -0.2636 indicates a weak negative correlation.
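As a minimal sketch of how a correlation coefficient can be computed, the snippet below implements the standard Pearson formula in plain Python. The function name `pearson_r` and the speed/mpg numbers are hypothetical and chosen only for illustration; they are not data from this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data: driving speed vs. miles per gallon
speed = [40, 50, 60, 70, 80]
mpg = [33, 32, 30, 27, 24]
r = pearson_r(speed, mpg)
print(f"r = {r:.4f}")  # negative: mpg falls as speed rises in this made-up sample
```

The result always lies between -1.0 and 1.0; the sign gives the direction of the association and the magnitude gives its strength.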
Regression and correlation are similar in that both test a relationship rather than testing means or variances. They are used to identify the input variables that affect the response, and the degree to which they affect it, so that the team can control the key inputs. Controlling these key inputs is done to shift the mean and reduce the variation of the overall Project "Y".
Regression provides an equation describing the nature of the relationship, such as y = mx + b. This is the most commonly used form, but it is not always the best fit.
In this equation, m represents the slope (rise/run, or the change in Y per unit increase in X) and b represents the y-intercept, the value of y where the line crosses x = 0.
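To make m and b concrete, here is a small sketch that estimates them by ordinary least squares for a simple linear fit. The helper name `fit_line` and the x/y values are hypothetical illustration choices, not data from the batch size study.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # change in y per unit increase in x
    b = mean_y - m * mean_x  # value of y where the line crosses x = 0
    return m, b

# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = fit_line(x, y)
print(f"y = {m:.2f}x + {b:.2f}")  # y = 1.99x + 0.05
```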
Assumption:
Continuous data with interval or ratio measurement level
Applications:
Is there a relationship, and what is the relationship equation, between driving speed and miles per gallon achieved?
Is there a relationship between money spent on commercials and product sales?
Is there a relationship between training and job performance?
Jargon:
Response Variable: dependent, uncontrolled, "Y", output variable
Regressor Variable: independent, controlled, "X" variables, input variables, which affect the Response
Noise Variable: input variables that are not controlled
Regression Equation: describes the nature of the relationship between the independent variables and the dependent variable
Residuals: the difference between predicted response values and observed response values. These are assessed for normality to ensure the equation is applicable.
There are various types of Regression:
Simple Linear Regression
Single regressor (x) variable such as x1 and model linear with respect to coefficients.
Multiple Linear Regression
Multiple regressor (x) variables such as x1, x2...xn and model linear with respect to coefficients.
Simple Non-Linear Regression
Single regressor (x) variable such as x and model nonlinear with respect to coefficients.
Multiple Non-Linear Regression
Multiple regressor (x) variables such as x1, x2...xn and model nonlinear with respect to coefficients.
A Six Sigma Black Belt is interested in the relationship of the (input) batch size and its impact on the output of Machine Efficiency. The Predictor variable (x) is the Batch Size and the Response variable (output) is the Machine Efficiency.
The following data was gathered with as much care as possible to keep other variables constant. The same part number was run on the same machine with the same operator under similar operating conditions. The data can be in any order and does not have to be normal (the residuals should be normal, but the data itself does not have to be).
The data was modeled with a Linear, Cubic, and Quadratic fitted line. See the charts below. Notice the best fit is with the Quadratic fitted line plot, which has an R-squared value of nearly 98%.
The R-squared value will be in a range from 0% to 100%. It represents the % of explained variation: the % of the variation in the y-values that is explained by the fitted relationship with x.
It is not obvious from the scatter diagram whether the relationship is best explained by a linear fitted line plot, so the Black Belt decides to model it in various ways.
Note: The fitted lines are determined with the Method of Least Squares, which finds the line that minimizes the sum of the squared residuals.
The p-value refers to the hypothesis test of the slope of the best fit line. It is the probability of observing a slope at least as extreme as the one found if there were actually no relationship (a true slope of zero).
If the p-value is < alpha risk (usually 0.05), then the regression is statistically significant and X is linearly related to Y.
Ho: Regression model is not significant
Ha: Regression model is significant and can be used within the data range.
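The hypothesis test on the slope can be sketched for a simple linear fit as follows. This is an illustration under stated assumptions, not the output of any particular statistical package: the function name `slope_t_statistic` and the data are hypothetical, and instead of computing an exact p-value the sketch compares the t-statistic with the two-sided critical value from a t-table (3.182 for alpha = 0.05 and n - 2 = 3 degrees of freedom), which leads to the same reject or fail-to-reject decision.

```python
import math

def slope_t_statistic(xs, ys):
    """t-statistic for testing Ho: slope = 0 in a simple linear regression."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Sum of squared residuals and mean squared error with n - 2 degrees of freedom
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)  # standard error of the slope
    return m / se_slope

# Hypothetical illustration data (n = 5, so 3 degrees of freedom)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRITICAL = 3.182  # two-sided t-table value for alpha = 0.05, df = 3

t_stat = slope_t_statistic(x, y)
significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, reject Ho: {significant}")
```

Statistical software reports the p-value directly; rejecting Ho when |t| exceeds the critical value is equivalent to the p-value being below alpha.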
Use the 95% prediction bands as shown in the example below. With 95% confidence, the response "Y" for a given input "X" will fall within the 95% prediction band range.
R-squared refers to the quantity of Y variation explained by the model. It represents the % of the variation in "Y" that can be explained by variation in the input variable (remember, only this one input variable is being tested).
It is possible to have a p-value below 0.05 while the R-squared value is relatively low, which indicates there are other inputs and sources of variation (x's) that should be included in the model.
There are a few important points and takeaways from the results.
Producing in larger batches contradicts Lean principles and can affect Working Capital and other company metrics. Produce in larger batches only if approved and necessary. There is a justification and an Economic Order Quantity that should be targeted and agreed upon. Focus on SMED and on reducing the reasons to increase batch sizes in the first place.
When those efforts are complete and the team has agreed that the reasons for increasing batches are all addressed with improvements in place, then batch sizes can be justified with buy-in from management.
The best equation to use to predict the efficiency for batch sizes between 25,000 and 115,000 is:
Y = 65.07 - 0.8923X + 0.02565X^2 - 0.000136X^3
The p-value is 0.000, which is < alpha risk of 0.05, so the equation can be used to model estimations within the inference range.
The R-squared value = SS Regression / SS Total = 5268.98/5381.88 = 0.979 = 97.9%.
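The R-squared arithmetic above can be reproduced in a few lines. The two sums of squares are the values quoted in the text; the variable names are illustrative, and the error sum of squares is derived here as SS Total minus SS Regression.

```python
# Sums of squares as quoted in the text
ss_regression = 5268.98
ss_total = 5381.88

# Variation left unexplained by the model (SS Error)
ss_error = ss_total - ss_regression

# Two equivalent ways of writing R-squared for a least-squares fit
r_squared = ss_regression / ss_total
r_squared_alt = 1 - ss_error / ss_total

print(f"SS Error  = {ss_error:.2f}")   # SS Error  = 112.90
print(f"R-squared = {r_squared:.3f}")  # R-squared = 0.979
```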
Think about the inference range and why it is critical. If the batch size is 0, the equation gives Y = 65.07% machine efficiency, which is obviously not possible or realistic. A batch size of 1 also gives an unrealistic output. So it is important to use the regression equation only within the inference range, 25,000 to 115,000 in this case.
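One way to enforce the inference range is to refuse predictions outside it. The sketch below is illustrative only and rests on two assumptions that the text does not state explicitly: that the linear and cubic coefficients of the fitted equation are negative, and that X is expressed in thousands of units (so the range 25,000 to 115,000 becomes 25 to 115). The function and constant names are hypothetical.

```python
# Assumed scaling: X in thousands of units, so 25,000..115,000 -> 25..115
INFERENCE_MIN = 25.0
INFERENCE_MAX = 115.0

def predict_efficiency(batch_size_thousands):
    """Predicted machine efficiency (%) from the fitted cubic equation.

    Raises ValueError outside the inference range rather than extrapolating.
    """
    x = batch_size_thousands
    if not INFERENCE_MIN <= x <= INFERENCE_MAX:
        raise ValueError(
            f"batch size {x} is outside the inference range "
            f"[{INFERENCE_MIN}, {INFERENCE_MAX}] (thousands)"
        )
    # Coefficient signs are an assumption (see the note above)
    return 65.07 - 0.8923 * x + 0.02565 * x**2 - 0.000136 * x**3

print(f"{predict_efficiency(50):.2f}")  # 67.58

try:
    predict_efficiency(0)  # a batch size of 0 is outside the inference range
except ValueError as err:
    print(f"refused: {err}")
```

Raising an error instead of returning a number makes it harder to accidentally rely on an extrapolated value such as the 65.07% at a batch size of 0.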
Before performing a complete regression study, the assumptions about the residuals must be satisfied. Residuals represent the error in the fit of the regression line: each residual is the difference between the observed value of the response variable and the best fit value.
The residuals must be:
• independent
• normally distributed (or able to be assumed normal) with a mean of 0
• with equal variance
Statistical software programs often have the capability of handling these reviews to determine whether the regression results can be used.
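As a rough sketch of what such a review involves, the snippet below computes the residuals of a simple linear fit, confirms their mean is essentially 0, and computes the Durbin-Watson statistic as one common numeric check of independence (values near 2 suggest no first-order autocorrelation; the statistic ranges from 0 to 4). The function names and data are hypothetical, and normality and equal variance would still be judged from plots or formal tests in a statistics package.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def residuals_of_fit(xs, ys):
    """Observed minus fitted values, in the order the data was given."""
    m, b = fit_line(xs, ys)
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def durbin_watson(res):
    """Durbin-Watson statistic: sum of squared successive differences / SSE."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(e ** 2 for e in res)
    return num / den

# Hypothetical illustration data, listed in run order
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals_of_fit(x, y)
mean_res = sum(res) / len(res)
dw = durbin_watson(res)
print(f"mean residual = {mean_res:.6f}, Durbin-Watson = {dw:.2f}")
```

With only five made-up points the numbers are purely illustrative; a real residual review needs enough data for the plots and statistics to be meaningful.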
In the linear fitted line chart above, a residual is the vertical distance between a blue dot and the red fitted line.
The charts below are typical results from a software program used to analyze the residuals, which are the error in the fit of the regression line.
Each residual is the difference between the observed value of the response variable and the fitted value.
Examining a set of data, the visual results of the residuals are shown below.
The residuals appear to be independent, normal, and random, which is acceptable for proceeding with developing a regression equation.
The top two charts show that the residuals do not exhibit any obvious trends or patterns. They follow a random distribution with equal variance.
The bottom two charts of the histogram and "fat pencil" normality test indicate roughly that the residuals resemble a normal distribution.
If all the assumptions PASS, then the regression model is valid.
However, there may be instances where the residuals indicate an underlying behavior that needs further evaluation.
If there is a:
Pattern or Trend: Likely another underlying input variable that needs to be separated or studied further for its impact.
Outlier: Examine each case of special cause and correct or explain.
Non-normal distribution: There may be bimodality or skewness in the data that requires further understanding.
In any event, you and the team will learn more by having this discussion. It is important NOT to ignore the residual evaluation. It is better to understand the lurking variables and special causes now in order to make appropriate improvements and prevent surprises in the CONTROL phase.