# Should I use Kendall correlation or Pearson correlation?

Asked by timothykinney (2733) November 15th, 2009

I am analyzing some data for my research in preparation for a poster presentation this December. I have calculated correlation matrices for the data using the cor() function in R. I have used the Pearson method (which is default) and the Kendall method. I then exported these correlation matrices and plotted the row that I’m interested in correlating. The plots are significantly different.

If I split the data into two classes, the plots look more similar. But if I combine the data into one big set and perform the correlations, the Kendall plot shows more positive correlation while the Pearson plot shows more negative correlation.

In other words, they seem to be showing different results. I’m not a trained statistician, but I’m trying to validate the conclusions I am drawing from this data. I don’t want to choose a statistical method just because it suits my objectives better.

Is there a good reason to choose either Pearson or Kendall’s method for correlation?
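Here is a toy version of what I mean (made-up numbers, sketched in Python with pandas rather than my actual R code): on the same data, the two methods can point in opposite directions.

```python
import pandas as pd

# Made-up data: a small cluster of points plus one extreme outlier,
# roughly the shape described above.
df = pd.DataFrame({
    "x": [1.0, 1.2, 0.9, 1.1, 1.3, 10.0],
    "y": [2.0, 1.8, 2.1, 1.9, 1.7, 9.0],
})

# Same data, two methods -- the analogue of R's
# cor(d, method = "pearson") and cor(d, method = "kendall").
pearson = df.corr(method="pearson")
kendall = df.corr(method="kendall")

print(pearson.loc["x", "y"])  # strongly positive: dominated by the outlier
print(kendall.loc["x", "y"])  # negative: the ranks inside the cluster disagree
```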


How does it look on the scatter plot when you just look at raw data?

dpworkin (26995)

For the most highly correlated point (by both methods) I get a grouping of points near the origin that could be linear, plus two outliers: one near the x-axis (high x, low y) and one in the upper right-hand corner (high x, high y).

The upper-right outlier is probably pulling the correlation to be more positive than it really is. Would that imply that neither test is very good for establishing correlation here?

I have heard of a correlation ratio test, but I haven’t learned how or when to use it yet.

timothykinney (2733)

Throw out the outliers and see what happens to the distribution. It might be more rational.

dpworkin (26995)

I guess I should have thought of that. :)

timothykinney (2733)

I’m not a real statistician, it just seemed like a reasonable thing to try. Also, we were discouraged from using anything other than Pearson’s r, but I don’t know the technical reason why.

dpworkin (26995)

I do some general stats for my own research, but I must admit I don’t use these tests very often, so I’m shaky on exactly what they are. I looked them up: Pearson’s r is defined as “the strength of linear dependence between two variables,” and Kendall’s tau measures “the degree of correspondence between two rankings” – in other words, the strength of association in the cross tabulations. So given that they are different equations measuring different kinds of correspondence, it makes sense that you’re seeing two different plots. (So long as you’re talking about plots of the Kendall tau values and the Pearson’s r values!)

You should ask someone who has more experience with stats. Did you take a stats class? Your former professor might be willing to look over your stuff and give you a hand.

“I then exported these correlation matrices and plotted the row that I’m interested in correlating. The plots are significantly different.” I am not sure what you mean by this. If that should be clear in context, you can probably safely assume I don’t know what the hell I am talking about and disregard the rest.

That said…:

Wikipedia has good information on what these correlations mean, but in short, you can think of it this way: Pearson’s correlation tries to measure a linear relationship between two variables – as in, if X changes by some amount, then Y changes by a proportional amount. It operates on the actual covariances and standard deviations of the variables.

Kendall’s is a rank correlation test – it orders all your cases by both X and Y, and then checks whether those two orderings agree, without considering how far apart the values are along X or Y.

Which you should use depends on your exact question, on what the data look like, and so forth. Generally, if your data are continuous and you expect a correlation of a linear form (Y equals a times X plus some constant, Y = aX + b), Pearson’s should do the job. If your data are not continuous but are ordered, or if there is a non-linear but monotonic relationship, you can use Kendall’s instead.
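To make the linear-vs-monotonic distinction concrete, here’s a quick sketch (Python with scipy rather than R, toy numbers): when Y is a perfectly monotonic but non-linear function of X, Kendall’s tau is 1 while Pearson’s r falls short of 1.

```python
from scipy.stats import pearsonr, kendalltau

# Toy data: y is a perfectly monotonic but non-linear function of x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 3 for xi in x]

r, _ = pearsonr(x, y)      # below 1, because the relationship is not linear
tau, _ = kendalltau(x, y)  # 1, because the two orderings agree perfectly

print(r, tau)
```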

(Disclaimer: I am not a real statistician either. If you are really doing this for actual research, try to locate a statistician in your institute/university/whatever and get them to work with you.)

If an actual linear/ranked correlation exists, however, it would be likely that both tests would detect it. If you are getting a small positive correlation using one test, and a small negative using another, consider the possibility that there is no correlation. Your observation that points tend to form a cloud in one place when plotted would also suggest this, and you are correct that the outlier would probably drive the positive Pearson correlation.

As someone suggested, removing the outlier(s) and testing again is a good start.
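A toy illustration of why that helps (Python with scipy, made-up numbers): a trend-free cloud plus one extreme point gives a large Pearson’s r, which mostly disappears once the outlier is dropped.

```python
from scipy.stats import pearsonr

# Toy data: a cloud with no real trend, plus one extreme outlier.
x = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 10.0]
y = [2.0, 1.9, 2.1, 2.05, 1.95, 2.0, 9.0]

r_all, _ = pearsonr(x, y)             # the outlier drags this close to +1
r_trim, _ = pearsonr(x[:-1], y[:-1])  # near zero once the outlier is removed

print(r_all, r_trim)
```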

Janka (2172)

Have you tried Spearman’s? That’s what we use (I am a Ph.D. sociology student) when our Pearson comes up with unexpected results. Are you using SPSS?

jenandcolin (2296)

“That’s what we use (I am a Ph.D. sociology student) when our Pearson comes up with unexpected results.”

I cannot help pointing out that selecting a statistical procedure based on the procedure you tried first not giving the sort of results you expected is not good practice. You cannot simply go about trying different correlation measures until you find one the results of which please you.

The reason is that if you do that long enough, you can always find some test that will match your expectations. What you need to do is decide beforehand what your criteria for using Pearson/Spearman/Kendall/whatever are, and then stick to those criteria once you see the data. Otherwise you bias your research so that you are much more likely to draw the conclusions you already expect, just because you expect them, and to miss actual groundbreaking findings whenever you reject a correlation measure for “giving you unexpected results”.

Janka (2172)

It’s not that we simply use the one that fits our hypothesis best (you’re correct, that would be bad science). We do this to try to see what the issue might be (clarification of the issue is actually good practice). We actually routinely use a variety of measures (ideally both quantitative and qualitative) for any research project. In fact, it is very difficult to get funding if you do not do this.

jenandcolin (2296)

Also, as I am sure you would agree, you have an ethical obligation to report ALL correlation measures. Therefore, if you use two or three, you should include all of them in any write-up. Explaining the variance between the two (or three) is actually a key part of any sophisticated research paper.

jenandcolin (2296)

@jenandcolin I’m using R. I haven’t tried Spearman’s yet, but it’s worth a look. I am not quite sure how to interpret the values from Pearson’s though. Is anything over .33 a “good” correlation? Or is it more complicated than that?

timothykinney (2733)

@jenandcolin Yes, like you, I see no problem with using multiple measures, as long as you are aware beforehand what observations will lead to using which, and reporting all, etc. I am sorry if I implied a suspicion of fraud on your part, that was not my intent. I was just worried that people new to the fun would misinterpret.

Janka (2172)

Has anyone heard of the gamma correlation coefficient? I just read about it. Is it also commonly used in statistical research?

nc3bh (1)

So I realize you’re past your date of presenting… but for anyone who is looking at a similar situation:

Pearson’s assumes that each variable is normally distributed on its own (X is normal, Y is normal… each makes a bell curve), at least if you want the accompanying significance test to be valid.

Kendall’s does not require the data to follow any particular distribution; it uses only the ranks.

You can throw out the outliers if the data are normally distributed and, for the most part, look linear; if that’s the case, Pearson’s correlation is the one to use.

Spearman’s is more common than Kendall’s, but they are similar. Kendall’s statistic is written with the Greek letter tau (τ), and it is often recommended when you have a small sample rather than a full population.

A correlation of .33 is not strongly correlated. As far as “good” or “bad” goes, it depends on whether you want the variables to be correlated – on what your hypothesis is. A correlation is always between -1 and 1. If it is 0, the data are uncorrelated. +1 or -1 means a perfect correlation (one always moves exactly in step with the other), which is rare. The closer to 1 or -1, the stronger the correlation: for example, 0.89 is a strong positive correlation, while -0.13 is a weak negative correlation.
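You can sanity-check that scale with a few toy vectors (hypothetical numbers, using Python’s scipy here rather than R):

```python
from scipy.stats import pearsonr

# Toy vectors illustrating the scale described above.
x = [1, 2, 3, 4, 5]

r_perfect, _ = pearsonr(x, [2, 4, 6, 8, 10])  # +1: y is exactly 2x
r_inverse, _ = pearsonr(x, [10, 8, 6, 4, 2])  # -1: y decreases exactly in step
r_weak, _    = pearsonr(x, [3, 1, 4, 1, 5])   # a weak positive correlation

print(r_perfect, r_inverse, r_weak)
```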

R, SPSS, SAS, Minitab, and even Excel with the right add-in can help you.

Sorry I didn’t help in time :(

Styler (21)
