Correlation analysis measures how two variables are related. The correlation coefficient (r) is a statistic that tells you the strength and direction of that relationship. It is expressed as a positive or negative number between -1 and 1. The magnitude of the number indicates the strength of the relationship: values near 0 indicate little or no linear relationship, and values near -1 or 1 indicate a strong one.
The sign of the correlation coefficient indicates whether the direction of the relationship is positive (direct) or negative (inverse). Variables which have a direct relationship (a positive correlation) increase together and decrease together. In an inverse relationship (a negative correlation), one variable increases while the other decreases. While the sign indicates how one variable changes with respect to another variable, the magnitude of the number indicates the strength of the relationship.

It is important to remember that while correlation coefficients can be used for prediction (i.e. if we know the value of one variable, and the correlation, we can predict what the value of the second variable will be), they may NOT be used to infer causation (i.e. we cannot say that one variable causes another).

Example
Suppose you are reading a study of Regents exams. The investigator wanted to know if performance in grade school was related to scores on the Regents exams. He did a correlation analysis on grade school performance and Regents exam score, and found that r = .75 in his study. This tells you two things: the relationship is positive (students who performed better in grade school tended to score higher on the Regents exams), and it is fairly strong, since .75 is close to 1.
If a correlation exists between two variables, this does NOT imply that one variable causes another. Causation and correlation are two very different things. The two correlation coefficients that appear most often in the literature are the Pearson product-moment and the Spearman rank-order coefficients.

The term ‘correlation coefficient’ was coined by Karl Pearson in 1896. Accordingly, this statistic is over a century old, and is still going strong. It is one of the most used statistics today, second to the mean. The correlation coefficient's weaknesses and warnings of misuse are well documented. As a consulting statistician with 15 years of practice, who also teaches continuing and professional studies to statisticians in the Database Marketing/Data Mining industry, I see too often that the weaknesses and warnings are not heeded. Among the weaknesses, I have never seen discussed the issue that the correlation coefficient interval [−1, +1] is restricted by the individual distributions of the two variables being correlated. The purpose of this article is (1) to introduce the effects the distributions of the two individual variables have on the correlation coefficient interval and (2) to provide a procedure for calculating an adjusted correlation coefficient, whose realised correlation coefficient interval is often shorter than the original one.

The implication for marketers is that now they have the adjusted correlation coefficient as a more reliable measure of the important ‘key-drivers’ of their marketing models. In turn, this allows the marketers to develop more effective targeted marketing strategies for their campaigns.

CORRELATION COEFFICIENT BASICS
The correlation coefficient, denoted by r, is a measure of the strength of the straight-line or linear relationship between two variables. The well-known correlation coefficient is often misused, because its linearity assumption is not tested.
The correlation coefficient can – by definition, that is, theoretically – assume any value in the interval between +1 and −1, including the end values +1 or −1. The following points are the accepted guidelines for interpreting the correlation coefficient:
CALCULATION OF THE CORRELATION COEFFICIENT
The calculation of the correlation coefficient for two variables, say X and Y, is simple to understand. Let zX and zY be the standardised versions of X and Y, respectively, that is, zX and zY are both re-expressed to have means equal to 0 and standard deviations (s.d.) equal to 1. The re-expressions used to obtain the standardised scores are in equations (1) and (2):

zX_i = [X_i − mean(X)] / s.d.(X)    (1)
zY_i = [Y_i − mean(Y)] / s.d.(Y)    (2)

The correlation coefficient is defined as the mean product of the paired standardised scores (zX_i, zY_i), as expressed in equation (3):

r X,Y = sum of [zX_i × zY_i] / (n − 1)    (3)

where n is the sample size.

For a simple illustration of the calculation, consider the sample of five observations in Table 1. Columns zX and zY contain the standardised scores of X and Y, respectively. The last column is the product of the paired standardised scores. The sum of these scores is 1.83. The mean of these scores (using the adjusted divisor n − 1, not n) is 0.46. Thus, r X,Y = 0.46.

Table 1 Calculation of correlation coefficient

REMATCHING
As mentioned above, the correlation coefficient theoretically assumes values in the interval between +1 and −1, including the end values +1 or −1. (An interval that includes the end values is called a closed interval, and is denoted with left and right square brackets, [ and ], respectively. Accordingly, the correlation coefficient assumes values in the closed interval [−1, +1].) However, it is not well known that the correlation coefficient closed interval is restricted by the shapes (distributions) of the individual X data and the individual Y data. The extent to which the shapes of the individual X and individual Y data differ affects the length of the realised correlation coefficient closed interval, which is often shorter than the theoretical interval. Clearly, a shorter realised correlation coefficient closed interval necessitates the calculation of the adjusted correlation coefficient (to be discussed below).
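The standardised-score definition of r given in the calculation section above can be sketched in Python. This is a minimal illustration, not the author's code: the function names `standardise` and `correlation` are hypothetical, and the sample values are made up for the example (they are not the Table 1 data).

```python
import math


def standardise(values):
    """Re-express values to have mean 0 and s.d. 1 (s.d. uses the n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        raise ValueError("a constant variable has no standardised scores")
    return [(v - mean) / sd for v in values]


def correlation(x, y):
    """Mean product of paired standardised scores, using the divisor n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    zx, zy = standardise(x), standardise(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical illustrative data (NOT the article's Table 1).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = correlation(x, y)
print(round(r, 2))
```

Using the n − 1 divisor both for the standard deviations and for the mean product keeps the result equal to the usual Pearson coefficient.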
The length of the realised correlation coefficient closed interval is determined by the process of ‘rematching’. Rematching takes the original (X, Y) paired data to create new (X, Y) ‘rematched-paired’ data such that all the rematched-paired data produce the strongest positive and strongest negative relationships. The correlation coefficients of the strongest positive and strongest negative relationships yield the length of the realised correlation coefficient closed interval. The rematching process is as follows:
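A minimal Python sketch of rematching follows. It assumes that the strongest positive relationship comes from pairing X and Y sorted in the same order, and the strongest negative from pairing them sorted in opposite orders (a consequence of the rearrangement inequality); the author's own steps may be worded differently. The function names and sample values are hypothetical, and `adjusted_r` assumes the adjusted coefficient is the original r divided by the rematched bound of the same sign, keeping the original sign.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of the paired values (x_i, y_i)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rematched_interval(x, y):
    """Realised closed interval [r_min, r_max] obtained by re-pairing x and y."""
    xs = sorted(x)
    r_max = pearson_r(xs, sorted(y))                # same order: strongest positive
    r_min = pearson_r(xs, sorted(y, reverse=True))  # opposite order: strongest negative
    return r_min, r_max


def adjusted_r(x, y):
    """Original r divided by the same-sign rematched bound; sign of r is kept."""
    r = pearson_r(x, y)
    r_min, r_max = rematched_interval(x, y)
    bound = r_max if r >= 0 else r_min
    return math.copysign(abs(r / bound), r)


# Hypothetical illustrative data (NOT the article's Table 1 or Table 2).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 20]
print(rematched_interval(x, y))
print(adjusted_r(x, y))
```

For these made-up values the skewed Y limits the realised interval to about [−0.8, +0.8], so an original r of about 0.76 corresponds to an adjusted value of about 0.95.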
Continuing with the data in Table 1, I rematch the X, Y data in Table 2. The rematching produces a strongest positive correlation of +0.90 and a strongest negative correlation of −0.99.

Table 2 Rematched (X, Y) data of Table 1

So, just as there is an adjustment for R², there is an adjustment for the correlation coefficient due to the individual shapes of the X and Y data. Thus, the restricted, realised correlation coefficient closed interval is [−0.99, +0.90], and the adjusted correlation coefficient can now be calculated.

CALCULATION OF THE ADJUSTED CORRELATION COEFFICIENT
The adjusted correlation coefficient is obtained by dividing the original correlation coefficient by the rematched correlation coefficient whose sign matches that of the original correlation coefficient. The adjusted coefficient takes the sign of the original: if the original r is negative, then the adjusted r is negative, even though the arithmetic of dividing two negative numbers yields a positive number. The expression in (4) provides only the numerical value of the adjusted correlation coefficient:

r X,Y (adjusted) = r X,Y (original) / r X,Y (rematched)    (4)

In this example, the original correlation coefficient, which is positive, is divided by the positive-rematched correlation. Thus, r X,Y (adjusted) = 0.51 (= 0.46/0.90), a 10.9 per cent increase over the original correlation coefficient.

IMPLICATION OF REMATCHING
The correlation coefficient is restricted by the observed shapes of the individual X- and Y-values. The shape of the data has the following effects:
CONCLUSION
The everyday correlation coefficient is still going strong more than 100 years after its introduction. The statistic is well studied, yet its weaknesses and warnings of misuse, unfortunately, at least in this author's experience, have not been heeded. I discuss a perhaps little-known restriction on the values that the correlation coefficient assumes, namely, that the observed values fall within an interval shorter than the always-taught [−1, +1]. I introduce the effects of the individual distributions of the two variables on the correlation coefficient closed interval, and provide a procedure for calculating an adjusted correlation coefficient, whose realised correlation coefficient closed interval is often shorter than the original one and which reflects a more precise measure of the linear relationship between the two variables under study.

The implication for marketers is that now they have the adjusted correlation coefficient as a more reliable measure of the important ‘key drivers’ of their marketing models. In turn, this allows the marketers to develop more effective targeted marketing strategies for their campaigns.

Which of the following correlation coefficients may represent a strong correlation?
The strongest linear relationship is indicated by a correlation coefficient of -1 or 1. The weakest linear relationship is indicated by a correlation coefficient equal to 0.
Is 0.95 a strong positive correlation?
For example, suppose the value of oil prices is directly related to the prices of airplane tickets, with a correlation coefficient of +0.95. The relationship between oil prices and airfares has a very strong positive correlation since the value is close to +1.
Is 1.2 a strong positive correlation?
No. A correlation coefficient cannot exceed 1, so 1.2 is not a valid value. Positive correlations range from just above 0 up to 1.0. By one common rule of thumb, weak positive correlation is in the range of 0.1 to 0.3, moderate positive correlation from 0.3 to 0.5, and strong positive correlation from 0.5 to 1.0.