
Correlation measures the relationship between the process inputs (x) and the output (y). It is the degree or extent of the relationship between two variables. These studies are used to examine whether an input has a predictive relationship with the process output.
Correlation and Regression studies are normally done together as part of the ANALYZE phase of a DMAIC project.
A couple of notes:
Correlation studies and dependencies tend to be stronger with more data and the maximum range being applied (be aware this can also hide areas of correlation or unique relationships with subsets of the data).
However, visualization of the data set can also show that there may exist varying relationships within the range of samples. Within a smaller specific range there could be a relationship, and then another range could show a different relationship.
The picture to the left shows that there is very little, if any, correlation of the variables.
They are independent variables at least within the range of inputs studied and the "r" value is approximately zero.
A correlation value may be close to zero, yet a closer review of the data may still reveal useful information. As mentioned earlier, be aware that sometimes too much data can hide relationships.
The point is to run the correlation visually and mathematically.
Regression and correlation involve testing a relationship rather than testing means or variances. They are used to find the input variables and the degree to which they impact the response so that the team can control the key inputs. Controlling these key inputs is done to shift the mean and reduce variation of an overall Project "Y".
There are several correlation coefficients in use but the most frequently used is the Pearson Product Moment Correlation, also referred to as the Coefficient of Correlation (COC) that measures only a linear relationship between two variables and is denoted by an "r" value. The formula is shown below.
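For reference, the standard sample form of the Pearson coefficient for n paired observations (x_i, y_i), with sample means x-bar and y-bar, is commonly written as follows. The notation here is an assumption and may differ from the symbols used in the original figure.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures how x and y vary together, and the denominator rescales that by the spread of each variable, which is why "r" carries no units.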
The "r" value is used to measure the linear correlation and it will always range from -1.0 (anticorrelation) to +1.0. As the value approaches 0 there is less linear correlation, or dependence, between the variables.
If the value:
The degree of linear association between two variables is quantified by the COC.
Pearson's Correlation does NOT assume that the data is normally distributed but is strongly influenced by outliers anywhere in the data set. It is most accurate when the data sets are normally distributed.
As expected, an outlier is likely to weaken the linear association shown by the other, non-outlying data points, whether that association is negative or positive.
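The effect of a single outlier can be illustrated with a small sketch. The data below are hypothetical (not from this page) and `pearson_r` is a helper written only for this illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y = 2x exactly, a perfectly linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2 * v for v in x]
r_clean = pearson_r(x, y)

# Replace only the last reading with a value far off the line.
y_outlier = y[:-1] + [0]
r_outlier = pearson_r(x, y_outlier)

print(f"r without outlier:  {r_clean:.3f}")
print(f"r with one outlier: {r_outlier:.3f}")
```

One bad reading out of ten is enough to pull "r" well below its original value, which is one reason to inspect the scatter plot before trusting the coefficient.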
The data classification for each of the variables must be ratio or interval types and the relationship must be monotonic.
The "r" value represents a unitless translation of Covariance, meaning the closer the value is to +1, the closer the linear relationship is between the x and y random variables.
As the value of "r" approaches zero from either side, the correlation is weaker. That is, the input, x, has a lower correlation with the output, y.
This is normally shown by an x-y plot referred to as a Scatter Graph. This graph shows all the data points where the input, x, is varied systematically and the output, or effect, y, is measured.
An "r" value of +1.0 indicates a perfect and strong POSITIVE correlation.
An "r" value of -1.0 indicates a perfect and strong NEGATIVE correlation or anticorrelation.
A data set that does not have a slope (slope = 0) will have a correlation coefficient that is undefined because the variance of Y is zero. In other words, the output is not affected by any of the input values.
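A minimal sketch of why the coefficient is undefined in that case: with zero variance in Y, the denominator of "r" is zero. The function name and data are hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson_r_or_none(x, y):
    """Return Pearson's r, or None when either variable has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # The denominator of r would be zero, so r is undefined.
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
flat_y = [7, 7, 7, 7, 7]   # output never changes: slope = 0
result = pearson_r_or_none(x, flat_y)
print(result)
```

Returning `None` is just one design choice; statistical packages typically report a missing value or a warning for the same situation.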
Shown below in the video is an example starting with a set of data and the progressive steps to manually calculate the LINEAR correlation coefficient, "r".
This is a study between the number of caterpillars in a cabbage patch and the quantity of cabbages destroyed.
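The manual steps can also be sketched in code. The caterpillar and cabbage counts below are hypothetical placeholders (the actual data from the video are not reproduced here); the steps are the usual ones: means, deviations, products of deviations, sums, then "r".

```python
from math import sqrt

# Hypothetical data, for illustration only.
caterpillars = [2, 4, 6, 8, 10]        # input, x
cabbages_destroyed = [1, 3, 4, 7, 10]  # output, y

n = len(caterpillars)

# Step 1: means of x and y.
mean_x = sum(caterpillars) / n
mean_y = sum(cabbages_destroyed) / n

# Step 2: deviations of each point from its mean.
dx = [x - mean_x for x in caterpillars]
dy = [y - mean_y for y in cabbages_destroyed]

# Step 3: sums of products and squared deviations.
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

# Step 4: the linear correlation coefficient.
r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)

print(f"mean x = {mean_x}, mean y = {mean_y}")
print(f"sum(dx*dy) = {sum_dxdy}, sum(dx^2) = {sum_dx2}, sum(dy^2) = {sum_dy2}")
print(f"r = {r:.4f}")
```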
The picture below indicates a strong relationship that would not be evident by simply analyzing the "r" value. The "r" value is going to be close to zero, which could be misread as meaning the variables are independent. Recall, the "r" value is a measure of linear association only.
There is another measurement that explains association. Visit Spearman's Rho Correlation Coefficient for an explanation of the monotonic association strength between two variables.
However, when it comes to data similar to the picture below, there is a strong indication that an association exists but it is nonlinear.
This module doesn't investigate nonlinear mathematical relationships, but it is important to understand they exist, as the picture below shows (which is nonlinear and nonmonotonic).
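A small sketch of this situation, using a hypothetical parabola (y = x squared) rather than the data in the picture: y is completely determined by x, yet the linear coefficient comes out as zero.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a symmetric parabola, nonlinear and nonmonotonic.
x = list(range(-5, 6))
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The positive and negative halves of the curve cancel each other in the numerator, so "r" reports no linear association even though the relationship is exact.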
What is the difference between the Coefficient of Correlation (COC) and Coefficient of Determination (COD)?
The COD ranges from 0 to 1 (0% to 100%).
The COD is the proportion of variability of the dependent variable (Y) accounted for, or explained by, the independent variable (x). It is equal to the COC value squared.
In other words, it is the percentage of variation in Y explained by the linear relationship with X.
The COC is a value from -1 to +1 that describes the linear correlation of the dependent and independent variable. A value near zero indicates no linear relationship.
The sign is necessary to see whether the relationship is positive or negative, so solving for the COC by taking the square root of the COD may not give the correct correlation: the square root alone does not say whether the sign is positive or negative.
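The relationship between the two coefficients, and the loss of sign, can be sketched as follows. The negative value is illustrative; 0.9851 is the "r" reported for the cereal example on this page.

```python
from math import sqrt

r_negative = -0.2636   # illustrative weak negative correlation
r_cereal = 0.9851      # r reported for the cereal example

# COD is the COC squared.
cod_negative = r_negative ** 2
cod_cereal = r_cereal ** 2

# Taking the square root recovers the magnitude only, never the sign.
recovered = sqrt(cod_negative)

print(f"COD for r = {r_negative}: {cod_negative:.4f}")
print(f"COD for r = {r_cereal}: {cod_cereal:.4f}")
print(f"sqrt(COD) = {recovered:.4f} (sign lost; original r was negative)")
```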
CAUTION:
Correlation interpretations from data or graphs can be wrong if the apparent correlation is purely coincidental.
Regardless of how strong (positive or negative) it may appear, Correlation never implies causation. There could be other variables behind the one charted that could be a factor.
For example, a chart or correlation value may indicate a strong relationship (linear or nonlinear) but in reality there may be no relationship or dependency at all.
Just like most statistical results, they must be reviewed with judgment and common sense. This is done with the Six Sigma team. The GB/BB is responsible for sharing the results in whatever way helps the team make the right decisions.
It is possible to have the same "r" value and have several different graphical representations, which is another reason to review the scatter plot and "r" value together.
Below is an example of monthly results of cereal sales related to marketing dollars. The intention is to determine the degree of linear correlation between marketing dollars spent and cereal sales. The data was compiled and is shown below.
Visually depicting the data is recommended, whether it is time-series charts, scatter plots, or box plots. This helps in seeing trends and overall behavioral relationships between data. A couple of graphs of the data are shown below. The scatter plot quickly shows that there appears to be a strong linear correlation.
Establish the Practical Problem
Is there a relationship between the amount of money dedicated to marketing and the sales of cereal, and what is the strength of the relationship?
Establish the Statistical Problem
Ho: Sales and Marketing dollars spent are not correlated
Ha: Sales and Marketing dollars spent are correlated
Choose a Level of Significance
Alpha risk selected is 0.05
If the calculated p-value is < 0.05, then reject the Ho (null) and infer the Ha.
The sample size = 12
Find Correlation from the pull-down menu, enter both continuous sets of data, and select Pearson Correlation; the results are then shown.
RESULT:
P value = 0.000
r = 0.9851 or 98.51%
With those results, reject the null and infer that there is a statistically significant correlation (which is the alternative hypothesis).
The linear correlation between the marketing dollars spent and resulting cereal sales is strong within a given month. The correlation coefficient (r) = 0.9851 within the inference range of $2,548 to $8,023 marketing dollars analyzed. This is a strong positive correlation. The more marketing, the higher the cereal sales.
It is likely that, at some point, cereal sales would level off regardless of how many marketing dollars are spent. That is why it is very important to keep your conclusion within the inference range.
Another method to perform the statistical evaluation is by comparing the r-calculated value of 0.9851 to the r-critical value.
The r-critical value for a sample size of 12 at alpha risk of 0.05 is 0.4973.
If the value of r-calculated is > 0.4973, then there is a statistically significant correlation, and in this example that is clearly the case.
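The link between the critical value of "r" and the t distribution can be sketched as below. The critical t values are assumptions taken from a standard t table for n - 2 = 10 degrees of freedom. The 0.4973 figure appears to match the one-tailed value; a two-tailed table gives roughly 0.576, and the conclusion for this example is the same either way.

```python
from math import sqrt

def t_statistic_from_r(r, n):
    """t statistic for testing a Pearson r against zero, with n - 2 df."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def r_critical_from_t(t_crit, n):
    """Convert a critical t value (n - 2 df) into a critical r value."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

n = 12
r_calc = 0.9851

# Assumed table values for df = 10 at alpha = 0.05.
T_CRIT_ONE_TAILED = 1.8125
T_CRIT_TWO_TAILED = 2.2281

r_crit_one = r_critical_from_t(T_CRIT_ONE_TAILED, n)
r_crit_two = r_critical_from_t(T_CRIT_TWO_TAILED, n)
t_calc = t_statistic_from_r(r_calc, n)

print(f"r-critical (one-tailed): {r_crit_one:.4f}")
print(f"r-critical (two-tailed): {r_crit_two:.4f}")
print(f"t-calculated: {t_calc:.2f}")
print("Significant:", abs(r_calc) > r_crit_two)
```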
Regression takes it a step further and develops a formula to describe the nature of the relationship. Visit the Regression module for more information.
CAUTION:
As indicated earlier, the scatter plot should be visually examined. Even if the correlation coefficient is very low (a weak linear relationship), there may be a nonlinear relationship, such as cubic or quadratic, that could be very strong.
Finding the Pearson Correlation Coefficient of two sets of data is done in Excel as shown below. The data do not have to be normally distributed, but the two sets must have equal sample sizes.
The Pearson Correlation Coefficient between these two sets of data is -0.2636, a weak negative correlation.
Recall that Correlation indicates the amount of linear association that exists between two variables in the form of a value between -1.0 and +1.0.
An example is the linear correlation from the earlier example, where the value of -0.2636 was found; it indicates a negative correlation, but not a very strong one.
Regression provides an equation describing the nature of relationship such as y=mx+b.
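As a minimal sketch of how the slope m and intercept b in y = mx + b can be estimated by ordinary least squares: the data and the `fit_line` helper below are hypothetical and only illustrate the idea; see the Regression module for the full treatment.

```python
def fit_line(x, y):
    """Least-squares slope (m) and intercept (b) for y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data that roughly follow y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [2.9, 5.1, 7.0, 9.2, 10.8]

m, b = fit_line(x, y)
print(f"y = {m:.2f}x + {b:.2f}")
```

Correlation only reports how tightly the points follow a line; the fitted m and b describe what that line is.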
There are various types of Regression:
Simple Linear Regression
Single regressor (x) variable such as x1 and model linear with respect to coefficients.
Multiple Linear Regression
Multiple regressor (x) variables such as x1, x2...xn and model linear with respect to coefficients.
Simple Nonlinear Regression
Single regressor (x) variable such as x and model nonlinear with respect to coefficients.
Multiple Nonlinear Regression
Multiple regressor (x) variables such as x1, x2, x3 and model nonlinear with respect to coefficients.