Correlation measures the relationship of the process inputs (x) on the output (y). It is the degree or extent of the relationship between two variables. These studies are used to examine if there is a predictive relationship of the input on the process.
Correlation studies and dependencies tend to be stronger with more data and the maximum range being applied (be aware this can also hide areas of correlation or unique relationships with subsets of the data).
However, visualization of the data set can also show that there may exist varying relationships within the range of samples. Within a smaller specific range there could be a relationship, and then another range could show a different relationship.
The picture to the left shows that there is very little, if any, correlation of the variables.
They are independent variables at least within the range of inputs studied and the "r" value is approximately zero.
A correlation value may be close to zero but closer review will indicate enlightening information. As mentioned earlier, be aware that sometimes too much data can hide relationships.
The point is to run the correlation visually and mathematically.
Regression and correlation involve testing a
relationship rather than testing of means or variances. They are used to find out the variables and to the degree the impact the response so that the team can control the key inputs. Controlling these key inputs is done to shift the mean and reduce variation of an overall Project "Y".
There are several correlation coefficients in use but the most frequently used is the Pearson Product Moment Correlation, also referred to as the Coefficient of Correlation (COC) that measures only a linear relationship between two variables and is denoted by an "r" value. The formula is shown below.
The "r" value is used to measure the linear correlation and it will always range from -1.0 (anticorrelation) to +1.0. As the value approaches 0 there is less linear correlation, or dependence, of the variables.
If the value:
The degree of linear association between two variables is quantified by the COC.
Pearson's Correlation does NOT assume that the data is normally distributed but is strongly influenced by outliers anywhere in the data set. It is most accurate when the data sets are normally distributed.
As expected, an outlier is likely to take away from the linear association of the other non-outlying variables whether the association is negative or positive.
The data classification for each of the variables must be ratio or interval types and the relationship must be monotonic.
The "r" value represents a unitless translation of Covariance, meaning the closer the
value is to +1, the closer the linear relationship is between the x and y
As the value of "r" approaches zero from either side, the correlation is weaker. That is the input, x, has a lower correlation on the output, y.
This is normally shown by a x-y plot referred to as a Scatter Graph. This graph shows all the data points where the input, x, is varied systematically and the output, or the effect, of y is measured.
A "r" value of +1.0 indicates a perfect and strong POSITIVE correlation.
A "r" value of -1.0 indicates a perfect and strong NEGATIVE correlation or anticorrelation.
A data set that does not have a slope (slope = 0) will have a correlation coefficient that is undefined because the variance of Y is zero. In other words, the output is not affected by any of the input values.
Shown below in the video is an example starting with a set of
data and the progessive steps to manually calculate the LINEAR
correlation coefficient, "r".
This is a study between the number of caterpillars in a cabbage patch and the quantity of cabbages destroyed.
The picture below indicates a strong relationship that would not be evident by simply analyzing the "r" value. The "r" value is going to be close to zero which means the variables are independent. Recall, the "r" value is measure of linear association only.
There is another measurement that explains association. Visit Spearman's Rho Correlation Coefficient for an explanation of the monotonic association strength between two variables.
However, when it comes to data similar to the picture below there is strong indication that an association exist but it is non-linear.
This module doesn't investigate non-linear mathematical relationships but it is
important to understand they exist as the picture below shows (which is
non-linear and non-monotonic).
What is the difference between the Coefficient of Correlation (COC) and Coefficient of Determination (COD)?
The COD ranges from 0-1 (0%-100%).
The COD is the proportion of variability of the dependent variable (Y) accounted for or explained by the independent variable (x) equal to the COC value squared.
In other words, it is the percentage of variation in Y explained by the linear relationship with X.
The COC is a value from -1 to +1 that describes the linear correlation of the dependent and independent variable. A value near zero indicates no linear relationship.
The sign is necessary to see if relationship is positive or negative so solving for COR by taking the square root of COD may not give the correct correlation since the sign can be positive or negative.
Correlation interpretations from data or graphs can be wrong if it is purely coincidental.
Regardless of how strong (positive or negative) it may appear, Correlation never implies causation. There could be other variables behind the one charted that could be a factor.
For example, a chart or correlation value may indicate a strong relationship (linear or non-linear) but in reality there may be no relationship or dependency at all.
Just like most statistical results they must be reviewed subjectively with consideration of common sense. This is done with the Six Sigma team. The GB/BB is responsible for sharing the results in any way to help the team make the right decisions.
It is possible to have the same "r" value and have several different graphical representations, another reason to review the scatter plot and "r" value together.
Below is an example of monthly results of cereal sales related to marketing dollars. The intention is to determine the degree of linear correlation between marketing dollars spent to cereal sales. The data was compiled and is shown below.
Visually depicting the data is recommended whether is it time-series
charts, scatter plots, or box plots. This helps in seeing trends and
overall behavioral relationships between data. A couple graphs of the data are shown below. The scatter plot shows quickly that there appears to be a strong linear correlation.
Establish the Practical Problem
Is there a relationship between the amount of money dedicated to marketing to the sales of cereal and what is the strength of the relations
Establish the Statistical Problem
Ho: Sales and Marketing dollars spent are not correlated
Ha: Sales and Marketing dollars spent are correlated
Choose a Level of Significance
Alpha risk selected is 0.05
If the calculated p-value is <0.05, then the reject the Ho (null) and infer the Ha.
The sample size = 12
Find Correlation from the pull-down menu and enter both continuous sets of data and use Pearson Correlation, then the results are shown.
P value = 0.000
r = 0.9851 or 98.51%
With those results, reject the null and infer that there is a statistically significant correlation (which is the alternative hypothesis).
The linear correlation between the marketing dollars spent and resulting cereal sales is strong within a given month. The correlation coefficient (r) = 0.9851 within the inference range of $2,548 to $8,023 marketing dollars analyzed. This is a strong positive correlation. The more marketing, the higher the cereal sales.
Likely, at some point, that cereal sales would level off regardless of how high the amount of marketing dollars. That is why it is very important to keep your conclusion with the inference range.
Another method to perform the statistical evaluation is by comparing the r- calculated value of 0.9851 to the r-critical value.
The r-critical value for a sample size of 12 at alpha risk of 0.05 is 0.4973.
If the value of r-calculated is >0.4973, then there is a statistical significant correlation and in this example that is clearly the case.
Regression takes it a step further and develops a formula to describe the nature of the relationship. Visit the Regression module for more information.
As indicated earlier, the scatter plot should be visually examined. Even if the correlation coefficient was very low (linear relationship) there may be a non-linear relationship such as cubic or quadratic that could be very strong.
Finding the Pearson Correlation
Coefficient of two sets of data is done in Excel as shown below. The
data does not have to be normally distributed but do have to be equal
The Pearson Correlation Coefficient between these two sets of data is -0.2636, a weak negative correlation.
Correlation indicates the amount of linear association that exists
between two variables in the form of a value between -1.0 to 1.0.
Such as the linear correlation from earlier example where the value of -0.2636 was found and indicates a negative correlation but it is not very strong.
Regression provides an equation describing the nature of relationship such as y=mx+b.
There are various types of Regression:
Simple Linear Regression
Single regressor (x) variable such as x1 and model linear with respect to coefficients.
Multiple Linear Regression
Multiple regressor (x) variables such as x1, x2...xn and model linear with respect to coefficients.
Simple Non-Linear Regression
Single regressor (x) variable such as x and model non-linear with respect to coefficients.
Multiple Non-Linear Regression
Multiple regressor (x) variables such as x1, x2, x3 and model nonlinear with respect to coefficients.
This module of slides provides additional insight into Correlation and Regression. This is critical component of statistical analysis and can quickly provide answers about the inputs and their effect on the outputs. These tools are frequently used in the DMAIC journey.
Click here to purchase the Correlation and Regression module and view others that are available.
Six Sigma Modules
The following presentations are available to download.
Green Belt Program (1,000+ Slides)
Cause & Effect Matrix
Central Limit Theorem
1-Way Anova Test
Correlation and Regression