# Regression

Recall that the Correlation Coefficient (COC) indicates the amount of linear association that exists between two quantitative variables, with a value between -1.0 and 1.0. For example, a value of -0.2636 indicates a weak negative correlation.
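As a reminder of how that coefficient is calculated, here is a minimal sketch in plain Python. The function name `pearson_r` and the speed/mpg numbers are made up for illustration and do not come from this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: driving speed vs. miles per gallon
speed = [40, 50, 60, 70, 80]
mpg = [33, 31, 28, 24, 20]

r = pearson_r(speed, mpg)
print(f"r = {r:.4f}")  # always between -1.0 and 1.0
```

A value near -1.0 or 1.0 suggests a strong linear association; a value near 0 suggests little or none.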

Regression and correlation are similar in that they both involve testing a relationship rather than testing means or variances. They are used to find out which variables impact the response, and to what degree, so that the team can control the key inputs. Controlling these key inputs is done to shift the mean and reduce the variation of an overall Project "Y".

Regression provides an equation describing the nature of the relationship, such as y = mx + b. This is the most commonly used form but not always the best fit.

In this case, m represents the slope (rise/run, or the change in Y per unit increase in X), and b represents the y-intercept, which is the value of y where the line crosses x = 0.
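To make m and b concrete, the sketch below estimates both from a handful of points using the usual least-squares formulas. The helper name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope: change in Y per unit increase in X
    b = mean_y - m * mean_x  # y-intercept: fitted value at x = 0
    return m, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```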

Assumption:

Continuous data with interval or ratio measurement level

Applications:

Is there a relationship between driving speed and miles per gallon achieved, and if so, what is the equation for that relationship?

Is there a relationship between money spent on commercials and product sales?

Is there a relationship between training and job performance?

Jargon:

Response Variable - dependent, uncontrolled, "Y", output variable

Regressor Variable - independent, controlled, "X" variables, input variables, which affect the Response

Noise Variable - input variables that are not controlled

Regression Equation - describes nature of the relationship between independent variables and dependent variable

Residuals - difference between predicted response values and observed response values. These are assessed for normality to ensure the equation is applicable.

There are various types of Regression:

Simple Linear Regression
A single regressor (x) variable, such as x1, with a model that is linear with respect to the coefficients.

Multiple Linear Regression
Multiple regressor (x) variables, such as x1, x2...xn, with a model that is linear with respect to the coefficients.

Simple Non-Linear Regression
A single regressor (x) variable, such as x, with a model that is non-linear with respect to the coefficients.

Multiple Non-Linear Regression
Multiple regressor (x) variables, such as x1, x2...xn, with a model that is non-linear with respect to the coefficients.

## Example

A Six Sigma Black Belt is interested in the relationship of the (input) batch size and its impact on the output of Machine Efficiency. The Predictor variable (x) is the Batch Size and the Response variable (output) is the Machine Efficiency.

The following data was gathered with as much care as possible to keep other variables constant. The same part number was run on the same machine with the same operator under similar operating conditions. The data can be in any order and does not have to be normal (the residuals should be normal but the data itself does not have to be normal).

The data was modeled with a Linear, a Quadratic, and a Cubic fitted line. See the charts below. Notice the best fit is the Cubic fitted line plot, which has an R-squared value of nearly 98%.

The R-squared value will be in a range from 0% to 100%. It represents the % of explained variation: the % of the variation in the y-values that is explained by the fitted relationship with x.

It isn't obvious from the scatter diagram that the relationship is best explained with a linear fitted line plot, so the Black Belt decides to model it in various ways.
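The idea of trying several fitted lines and comparing R-squared can be sketched in plain Python. The article's measurements are not reproduced here, so the batch sizes and efficiencies below are made up, and the helper names (`solve`, `polyfit`, `predict`, `r_squared`) are hypothetical. A statistical package would normally do this work.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] from the normal equations."""
    k = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(ata, aty)

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def r_squared(xs, ys, coeffs):
    """Share of the variation in y explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total

# Hypothetical data: batch size (arbitrary units) vs. machine efficiency (%)
batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
efficiency = [52, 58, 65, 72, 78, 83, 87, 90, 91, 92]

r2 = {}
for name, degree in [("Linear", 1), ("Quadratic", 2), ("Cubic", 3)]:
    coeffs = polyfit(batch, efficiency, degree)
    r2[name] = r_squared(batch, efficiency, coeffs)
    print(f"{name:9s} R-squared = {r2[name]:.1%}")
```

Because each higher-degree model contains the lower-degree one as a special case, R-squared cannot go down as terms are added. A higher R-squared alone does not prove the extra terms are worthwhile.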

Note: The fitted lines are found using the Method of Least Squares, which determines the line that minimizes the sum of squares of the residuals.
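The "minimizes" part can be checked numerically: nudging the least-squares slope or intercept in any direction should increase the sum of squared residuals. This is a small sketch with made-up data; `fit_line` and `sse` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def sse(xs, ys, m, b):
    """Sum of squared residuals (observed minus fitted) for a candidate line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5, 6]
ys = [3.2, 4.8, 7.1, 9.3, 10.7, 13.2]

m, b = fit_line(xs, ys)
best = sse(xs, ys, m, b)

# Every nudged line should have a larger sum of squares than the fitted one
nudged = [
    sse(xs, ys, m + dm, b + db)
    for dm in (-0.1, 0.0, 0.1)
    for db in (-0.5, 0.0, 0.5)
    if (dm, db) != (0.0, 0.0)
]
print(f"best SSE = {best:.4f}, smallest nudged SSE = {min(nudged):.4f}")
```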

The p-value comes from the hypothesis test on the slope of the best fit line. It is the probability of observing a slope at least this far from zero if there were actually no relationship between X and Y.

If the p-value is less than the alpha risk (usually 0.05), then the regression is statistically significant and X is linearly related to Y.

Ho: Regression model is not significant

Ha: Regression model is significant and can be used within the data range.
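For a straight-line fit, this test comes down to comparing the slope with its standard error. The sketch below computes the t statistic and compares it with the tabulated two-sided critical value for alpha = 0.05 and 8 degrees of freedom (2.306), which is equivalent to checking whether the p-value is below 0.05. The data and the name `slope_t_statistic` are hypothetical; software would report the p-value directly.

```python
import math

T_CRIT_DF8 = 2.306  # two-sided 95% Student-t critical value, 8 degrees of freedom (table value)

def slope_t_statistic(xs, ys):
    """Return (slope, standard error of slope, t statistic) for a straight-line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)          # residual variance, n - 2 degrees of freedom
    se_m = math.sqrt(s2 / sxx)  # standard error of the slope
    return m, se_m, m / se_m

# Hypothetical data: 10 points, so n - 2 = 8 degrees of freedom
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 58, 65, 72, 78, 83, 87, 90, 91, 92]

m, se_m, t_stat = slope_t_statistic(xs, ys)
significant = abs(t_stat) > T_CRIT_DF8  # reject Ho when True
print(f"slope = {m:.3f}, t = {t_stat:.2f}, reject Ho: {significant}")
```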

Use the 95% prediction bands as shown in the example below. With 95% confidence, the response "Y" for a given input "X" will fall within the 95% prediction band.
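For a straight-line model, the prediction band at a given X can be sketched as the fitted value plus or minus a t multiple of the prediction standard error. The data and the name `prediction_interval` are hypothetical, and the critical value 2.306 is the table value for 8 degrees of freedom. The bands in the article's charts come from the software and a curved model, so this only illustrates the idea.

```python
import math

T_CRIT_DF8 = 2.306  # two-sided 95% Student-t critical value, 8 degrees of freedom (table value)

def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, fitted, upper) for a new observation at x0 under a straight-line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    fitted = m * x0 + b
    # The "1 +" term accounts for the scatter of a single new observation
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return fitted - half_width, fitted, fitted + half_width

# Hypothetical data: 10 points, so 8 degrees of freedom
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 58, 65, 72, 78, 83, 87, 90, 91, 92]

mid_lo, mid_fit, mid_hi = prediction_interval(xs, ys, 5.5, T_CRIT_DF8)
end_lo, end_fit, end_hi = prediction_interval(xs, ys, 10, T_CRIT_DF8)
print(f"x = 5.5: {mid_lo:.1f} to {mid_hi:.1f}")
print(f"x = 10:  {end_lo:.1f} to {end_hi:.1f}")
```

The band is narrowest at the mean of X and widens toward the ends of the data range, which is one more reason to stay within that range.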

R-squared refers to the quantity of Y variation explained by the model. It represents the % of the variation in "Y" that can be explained by variation in the input variable (remember, only this one input variable is being tested).

It is possible to have a p-value below 0.05 and still have a relatively low R-squared value, which indicates there are other inputs and sources of variation (x’s) that should be included in the model.

## Cubic Fitted Line Plot

There are a few important points and takeaways from the results.

1. The Cubic Fitted Line Plot has the highest R-squared fit.
2. The formula can be used to assess outcomes within the inference space (the batch sizes from the lowest to the largest).
3. Also, adding the Confidence Intervals and Prediction Intervals at 95% can help the team determine the most likely best and worst case scenarios. Obviously, the machine efficiency can never exceed 100%.
4. Once the Batch Size reaches 80,000, the Machine Efficiency gains stabilize and appear to plateau.
5. The greatest improvement comes between Batch Sizes of 55,000 and 80,000.
6. If the inventory is justified, produce in batches of at least 80,000 to maximize the efficiency of the machine.

Producing in larger batches contradicts Lean principles and can impact Working Capital and other company metrics. Produce in larger batches if approved and necessary. There is a justification and an Economic Order Quantity that should be targeted and agreed upon. Focus on SMED and on reducing the reasons to increase batch sizes in the first place.

When those efforts are complete and the team has agreed that the reasons for increasing batches are all addressed with improvements in place, then batch sizes can be justified with buy-in from management.

## Statistical Analysis in Minitab

The best equation to use to predict the efficiency for batch sizes between 25,000 and 115,000 is:

Y = 65.07 - 0.8923X + 0.02565X^2 - 0.000136X^3

The p-value is 0.000 which is < alpha risk of 0.05 so the equation can be used to model estimations within the inference range.

The R-squared value = SS Regression / SS Total = 5268.98/5381.88 = 0.979 = 97.9%.
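The arithmetic can be reproduced directly from the two sums of squares quoted above. The leftover, often called SS Error (a term not used in this article), is whatever the model does not explain.

```python
ss_regression = 5268.98  # variation explained by the model
ss_total = 5381.88       # total variation in Y

r_squared = ss_regression / ss_total
ss_error = ss_total - ss_regression  # unexplained variation

print(f"R-squared = {r_squared:.3f}")  # prints "R-squared = 0.979"
print(f"SS Error  = {ss_error:.2f}")   # prints "SS Error  = 112.90"
```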

Think about the inference range and why it is critical. If the batch size is 0, then Y would be 65.07% machine efficiency. This is obviously not possible or realistic. If the Batch Size were 1, this would still represent an unrealistic output. So, it is important to use the Regression equation only within the Inference Range of 25,000 to 115,000 in this case.
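One way to enforce the inference range is to wrap the equation in a function that refuses inputs outside it. This sketch assumes X is entered in thousands of units (25 to 115). The article does not state the units, but the coefficients only give plausible efficiencies on that scale. The function name `predicted_efficiency` is hypothetical.

```python
X_MIN, X_MAX = 25.0, 115.0  # inference range, ASSUMING X is in thousands of units

def predicted_efficiency(x_thousands):
    """Evaluate the cubic regression equation, but only inside the inference range."""
    if not X_MIN <= x_thousands <= X_MAX:
        raise ValueError(
            f"Batch size {x_thousands} (thousands) is outside the inference range "
            f"{X_MIN} to {X_MAX}; the equation should not be used here."
        )
    x = x_thousands
    return 65.07 - 0.8923 * x + 0.02565 * x ** 2 - 0.000136 * x ** 3

for size in (25, 55, 80, 115):
    print(f"Batch size {size},000: {predicted_efficiency(size):.1f}% efficiency")
```

Calling `predicted_efficiency(0)` raises a `ValueError` instead of returning the unrealistic 65.07%.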


## Assumptions for using Regression

Before performing a complete regression study, there are assumptions about the residuals that must be satisfied. Residuals represent the error in the fit of the regression line: each residual is the difference between the observed value of the response variable and the best fit value.

The residuals must be:

• independent

• normally distributed (or close enough to assume normality), with a mean of 0

• of equal variance

Statistical software programs often have the capability of handling these reviews to determine whether the regression results can be used.
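Two of these checks have simple numeric screens that can be sketched by hand: the mean of the residuals, and the Durbin-Watson statistic for independence (values near 2 suggest no first-order autocorrelation; values well below 2 suggest neighboring residuals move together). The data is made up and deliberately curved, so a straight line fits it poorly. The helper names are hypothetical. Normality and equal variance are usually judged from the residual plots or formal tests in the software and are not covered here.

```python
def line_residuals(xs, ys):
    """Residuals (observed minus fitted) from a least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def durbin_watson(residuals):
    """Durbin-Watson statistic; ranges from 0 to 4, with 2 meaning no autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical curved data, listed in order of increasing x
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [52, 58, 65, 72, 78, 83, 87, 90, 91, 92]

residuals = line_residuals(xs, ys)
mean_resid = sum(residuals) / len(residuals)
dw = durbin_watson(residuals)

print(f"mean residual = {mean_resid:.6f}")
print(f"Durbin-Watson = {dw:.2f}")
```

Here the mean is essentially zero, as it always is for a least-squares line with an intercept, but the low Durbin-Watson value flags the run of negative, then positive, then negative residuals that a curved relationship leaves behind.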

In the linear fitted line plot above, a residual is the difference between a blue dot and the red fitted line.



## Example of Residual Testing

The charts below are typical results from using a software program to analyze the residuals, which are the error in the fit of the regression line.

This is the difference between the observed value of the response variable and fitted value.

Examining a set of data, the visual results of the residuals are shown below.

The residuals appear to be independent, normal, and random, which is acceptable for proceeding with developing a regression equation.

The top two charts show the residuals do not exhibit any obvious trends or patterns. They follow a random distribution with equal variance.

The bottom two charts of the histogram and "fat pencil" normality test indicate roughly that the residuals resemble a normal distribution.

If all the assumptions PASS, then the regression model is valid.

However, there may be instances where the residuals indicate an underlying behavior that needs further evaluation.

If there is a:

Pattern or Trend: Likely another underlying input variable that needs to be separated or studied further for its impact.

Outlier: Examine each case of special cause and correct or explain.

Non-normal distribution: The data may be bi-modal or skewed, which requires further understanding.
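As one example of screening for the outlier case, the sketch below divides each residual by the residual standard deviation and flags points beyond a cutoff. The cutoff of 2 is an assumed rule of thumb, not something this article specifies. The data and the function name `flag_outliers` are made up. A flagged point is a prompt for investigation, not an automatic deletion.

```python
import math

def flag_outliers(xs, ys, cutoff=2.0):
    """Return x values whose standardized residual exceeds the cutoff in absolute value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual standard deviation
    return [x for x, e in zip(xs, residuals) if abs(e / s) > cutoff]

# Hypothetical data: roughly y = 2x + 1, with one suspicious reading at x = 6
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 21.0, 15.1, 16.8, 19.2, 20.9]

flagged = flag_outliers(xs, ys)
print("Points to investigate at x =", flagged)
```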

In any event, you and the team will learn more by having this discussion. It is important NOT to ignore the residual evaluation. It is better to understand the lurking variables and special causes now in order to make appropriate improvements and prevent surprises in the CONTROL phase.
