WEBVTT
00:00:02.480 --> 00:00:05.360
In this video, we’re gonna learn about linear correlation.
00:00:05.360 --> 00:00:12.080
There are lots of situations where we have two sets of data related to individuals or events, and we call this bivariate data.
00:00:12.880 --> 00:00:15.920
For example, students’ scores in math tests and English tests.
00:00:16.200 --> 00:00:18.120
Each student took both tests.
00:00:18.120 --> 00:00:22.560
So we have two sets of numbers related to individual students.
00:00:22.560 --> 00:00:28.520
We can use one set for the “𝑥”-coordinates and the other for the “𝑦”-coordinates and plot all the data as points on a scatterplot.
00:00:29.600 --> 00:00:35.760
Then we can examine any patterns that may emerge in the scatterplots to see if they suggest any association between the two data sets.
00:00:37.080 --> 00:00:40.200
One type of pattern that can emerge is a straight line relationship.
00:00:40.880 --> 00:00:51.160
This has turned out to be so useful in scientific and statistical analysis that techniques have been developed to quantify and interpret linear correlation between two associated sets of data.
00:00:52.200 --> 00:00:56.840
So we’re gonna talk about linear correlation and the terminology that we use to describe it.
00:00:58.200 --> 00:01:01.840
Let’s start by describing an experiment that I do with my math students.
00:01:01.840 --> 00:01:08.920
I give each student a different-sized circle and ask them to measure the diameter and circumference and then we gather in all the results.
00:01:09.880 --> 00:01:13.720
This sounds pretty easy maybe, but they only have straight rulers to measure with.
00:01:14.080 --> 00:01:20.640
So they need to be quite creative about how they measure the circumference, and I don’t let them calculate it if they happen to know about “𝜋” and the formula.
00:01:21.760 --> 00:01:32.960
So we have two bits of data about each circle and we use the diameters as the “𝑥”-coordinates and the circumferences as the “𝑦”-coordinates and we plot all these points on a scatterplot.
00:01:33.200 --> 00:01:38.880
So here’s the data that I gathered for one class, and here’s the scatterplot.
00:01:39.080 --> 00:01:43.960
Now the first thing that jumps off the page is this point here, which looks very different to all the others.
00:01:45.400 --> 00:01:51.480
Most points are close to a straight line running something like this, but the other point is a long way from the pack.
00:01:52.200 --> 00:01:57.320
In fact, it turned out to be due to a student who read out their diameter and circumference the wrong way round.
00:01:57.320 --> 00:02:01.800
So we were able to swap the “𝑥”- and “𝑦”-coordinates over to correct them.
00:02:02.120 --> 00:02:08.640
But if the student who made the mistake hadn’t been in the room to explain what they’d done, then we’d have had a tricky decision to make.
00:02:08.640 --> 00:02:10.480
Why’s that point so far away from the others?
00:02:10.680 --> 00:02:17.080
Was it because it was a genuine circle which was very different to all the others, or was there some kind of mistake?
00:02:17.280 --> 00:02:19.520
You shouldn’t just throw away data because it looks different.
00:02:19.920 --> 00:02:23.640
You need to find out more about it: is it real or is it a mistake?
00:02:23.960 --> 00:02:27.080
If it’s real, then you need to take it into account in your analysis.
00:02:28.200 --> 00:02:33.520
So after our correction, this is what the scatterplot looked like with a new line of best fit.
00:02:33.760 --> 00:02:42.280
The line of best fit that we’ve drawn is positioned in such a way that it minimises the sum of the squares of the vertical distances to all of the points, like these orange lines here.
00:02:42.280 --> 00:02:44.320
It’s called a least squares regression line.
00:02:44.920 --> 00:02:48.040
But we’re not gonna go into the detail of how we calculate that just now.
00:02:48.040 --> 00:03:02.360
We’re just gonna draw it by eye, trying our ruler in lots of different positions until we find a route that is as close as possible to as many of the points as possible, with a nice even balance of points above and below the line along its entire length.
00:03:02.360 --> 00:03:12.640
So we’ve got points above and below here, we’ve got points above and below here, and we’ve also got points above and below in the middle here.
00:03:12.880 --> 00:03:15.800
And now we can use the line of best fit to make predictions.
00:03:16.240 --> 00:03:27.800
For example, if we had a circle that had a diameter of “two” inches, we could draw a line up to our line of best fit and across to the “𝑦”-axis.
00:03:27.800 --> 00:03:32.480
And that looks like it would have a circumference of between “six” and “six and a half” inches.
00:03:32.880 --> 00:03:42.160
So without actually having to do measurements on the circle if you know the diameter of a circle, you can use this graph to make a prediction about what its circumference would be.
00:03:42.160 --> 00:03:45.840
And likewise if we know the circumference, we could make a prediction about the diameter.
00:03:45.840 --> 00:03:58.760
So if we had a circle with a circumference of “twenty” inches, we could draw a line across from the “𝑦”-axis to our line of best fit and then down to the “𝑥”-axis.
00:03:59.080 --> 00:04:02.920
And it looks like that’s just under “six point five” inches in diameter.
00:04:05.160 --> 00:04:11.280
We could even go as far as calculating the equation of that line of best fit and using that to make our predictions.
00:04:11.520 --> 00:04:16.320
So for example, if we had a diameter of “three” inches, “𝑥 will be equal to three”.
00:04:16.320 --> 00:04:27.440
We can plug that into our equation and then that would give us an answer of “nine point four” inches for the circumference, which is a bit easier and probably more accurate than reading off of that scale.
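Here’s a rough sketch in Python of how an equation like that could be found and then used to make a prediction. The measurements are made up for illustration (they’re not the class data shown in the video), and the names fit_line and predict are just names chosen for this sketch.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and sum of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Read a y-value off the fitted line for a given x-value."""
    return slope * x + intercept


# Hypothetical ruler measurements in inches (invented for this sketch).
diameters = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
circumferences = [3.2, 6.2, 9.5, 12.5, 15.8, 18.8]

slope, intercept = fit_line(diameters, circumferences)
predicted = predict(slope, intercept, 3.0)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted circumference for a 3 inch diameter: {predicted:.1f} inches")
```

With this made-up data the slope comes out a little above three, which is what we’d hope for given that the true relationship involves “𝜋”.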
00:04:29.320 --> 00:04:35.560
Now looking at the equation, we can see that the slope or the gradient is “positive three point one”.
00:04:35.960 --> 00:04:50.520
And because the pattern of dots makes a pretty close fit to a straight line and that line has this positive slope as we’ve just seen, we say that the points are positively correlated, or if you want to be really accurate, positively linearly correlated.
00:04:52.360 --> 00:04:59.040
And if the points had suggested a line with a negative slope, then we’d have said that they had negative correlation.
00:04:59.280 --> 00:05:04.200
So the terms positive and negative correlation are statements about bivariate data.
00:05:05.160 --> 00:05:18.520
So if higher values on one aspect of data are associated with higher values on the other aspect of data and lower values on one aspect of data are associated with lower values on the other aspect of data, we call that positive correlation.
00:05:18.760 --> 00:05:27.760
And if high values on one aspect of data are associated with low values on the other aspect of data, we call that negative correlation.
00:05:28.000 --> 00:05:34.320
And some people call positive correlation direct correlation and negative correlation inverse correlation.
00:05:34.640 --> 00:05:36.680
So they’re terms that you might also come across.
00:05:38.040 --> 00:05:39.160
But that doesn’t really cover it all.
00:05:39.560 --> 00:05:42.200
Sometimes there’s no correlation between two data sets.
00:05:42.720 --> 00:05:51.760
For example, if you plotted the number of doughnuts people can eat without licking their lips against the number of books that they’ve read over the past year, you might expect a scatterplot looking something like this.
00:05:52.400 --> 00:05:56.080
There’s no association between the two at all; there’s no correlation.
00:05:57.480 --> 00:06:05.360
Knowing how many books someone has read over the past year tells you nothing about how many doughnuts they’re likely to be able to eat without licking their lips and vice versa.
00:06:07.080 --> 00:06:15.360
Okay then, we’ve got a basic idea of what correlation is now: it’s a way to describe apparent associations between data sets or even the lack of association between them.
00:06:15.760 --> 00:06:18.400
Let’s go through a summary of what the basic types are.
00:06:19.720 --> 00:06:26.840
We’ve got positive or direct correlation, negative or inverse correlation, and no correlation.
00:06:26.840 --> 00:06:30.400
But there are also different strengths of correlation.
00:06:30.640 --> 00:06:36.040
So strong correlation is when the points are closer to a line of best fit.
00:06:36.040 --> 00:06:43.080
Weaker correlation is when they’re scattered a bit more randomly further away from that line of best fit; there’s a bit more variation going on there.
00:06:44.120 --> 00:06:55.160
So for example with weak positive correlation, you still get higher data values on one data aspect associated with higher values on the other data aspect and lower with lower and so on.
00:06:55.160 --> 00:06:59.960
But the picture is a little bit more confused; it’s not quite so clear that they’re correlated.
00:07:01.320 --> 00:07:08.800
And likewise with negative correlation, you’ve still got high values on one data aspect associated with low values on the other data aspect.
00:07:08.800 --> 00:07:14.880
But those points don’t conform to that line of best fit so clearly.
00:07:14.880 --> 00:07:18.560
Now this strong and weak correlation idea is all a bit fluffy and woolly.
00:07:20.120 --> 00:07:29.680
If we drew the axes slightly differently and used a different scale, we could make correlation look stronger or weaker by having the points more spaced out or closer to the line.
00:07:29.680 --> 00:07:32.280
So that’s not really that great.
00:07:32.280 --> 00:07:38.120
But luckily we have something called a correlation coefficient which quantifies the strength of the correlation.
00:07:38.440 --> 00:07:48.640
And this is a number that runs on a scale from “negative one” for perfect negative correlation through “zero” for no correlation up to “positive one” for perfect positive correlation.
00:07:49.880 --> 00:07:56.160
So perfect negative correlation would be when all of the points exactly sit on that line of best fit.
00:07:56.680 --> 00:08:02.920
In perfect positive correlation, all the points would exactly fit on that line of best fit.
00:08:02.920 --> 00:08:08.440
So in both of those cases, our line of best fit would make perfect predictions of one thing from the other.
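As a minimal sketch of how such a number can be calculated, here’s some Python, assuming the coefficient meant here is the usual Pearson one (the video doesn’t name it). The data values and the function name correlation_coefficient are invented for illustration.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired data (assumed definition)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
perfect_positive = [2 * x + 1 for x in xs]   # every point exactly on a rising line
perfect_negative = [10 - 3 * x for x in xs]  # every point exactly on a falling line
noisy_positive = [3, 4, 8, 9, 11]            # rising, but not exactly on a line

r_pos = correlation_coefficient(xs, perfect_positive)
r_neg = correlation_coefficient(xs, perfect_negative)
r_noisy = correlation_coefficient(xs, noisy_positive)

# Changing the units on one axis (here inches to centimetres) leaves r unchanged,
# so the number doesn't depend on how the axes happen to be drawn.
xs_rescaled = [2.54 * x for x in xs]
r_rescaled = correlation_coefficient(xs_rescaled, noisy_positive)

print(r_pos, r_neg, round(r_noisy, 3), round(r_rescaled, 3))
```

The two perfect cases give “positive one” and “negative one” (up to rounding), and the noisy case gives a value a little below “one”.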
00:08:09.920 --> 00:08:19.560
So going back to our circle measuring task that we did with my students, that should’ve given us perfect positive correlation between the diameter and the circumference of a circle.
00:08:19.560 --> 00:08:25.360
We know that there’s a formula that exactly describes this relationship: the circumference is “𝜋 times the diameter”.
00:08:26.440 --> 00:08:32.280
Now the only reason that it didn’t come out perfect was that the students weren’t able to measure the circles with “a hundred percent” accuracy.
00:08:32.840 --> 00:08:35.600
But we did see a pretty strong positive correlation.
00:08:35.800 --> 00:08:46.480
And we had a good deal of confidence that the predictions of one aspect based on the other using our line of best fit were going to be quite reliable because all the data points were close to that line.
00:08:48.000 --> 00:08:52.440
The line was a good predictor for the data points that we gathered.
00:08:52.440 --> 00:08:56.800
So going back to our scale, we had correlation which was quite strong.
00:08:56.800 --> 00:08:59.920
It was probably in this region, not “one” but approaching “one”.
00:09:01.400 --> 00:09:03.440
So in the real world, things are quite messy.
00:09:03.440 --> 00:09:08.840
So we would probably never expect to get perfect positive or perfect negative correlation.
00:09:08.840 --> 00:09:21.040
We would always be operating in this kind of zone in between here somewhere and we’ll be looking at the tendency: are we sort of generally closer to “negative one” or are we generally closer to “zero” or are we generally closer to “one”?
00:09:22.520 --> 00:09:29.200
So the value of the correlation coefficient tells us how reliable the predictions made using our line of best fit are.
00:09:29.440 --> 00:09:34.040
Close to “negative one” or “positive one”, that means they’re quite reliable.
00:09:34.440 --> 00:09:36.840
The closer it is to “zero”, the less reliable they are; at “zero”, they’re totally unreliable.
00:09:38.480 --> 00:09:41.360
So let’s have a look at these two scatterplots.
00:09:41.360 --> 00:09:45.480
So there’re two classes, A and B, and they both did a math test and an English test.
00:09:45.480 --> 00:09:50.120
And we’ve used the English scores as our “𝑥”-coordinates and the math scores as our “𝑦”-coordinates.
00:09:50.120 --> 00:09:53.320
So for class A, we’ve got this particular pattern.
00:09:53.320 --> 00:09:57.040
Everybody scored about “fifty” on English, but there’s a complete range of scores on math.
00:09:57.240 --> 00:10:03.240
And for class B everybody scored about “fifty” on math, but there’s a complete range of scores on English.
00:10:03.240 --> 00:10:06.920
Now those points suggest a pretty clear line of best fit in each case.
00:10:06.920 --> 00:10:13.720
So for class A, the line of best fit would be vertical; and for class B, the line of best fit would be horizontal.
00:10:14.520 --> 00:10:19.680
So how strong do you think the correlation is in each case?
00:10:19.680 --> 00:10:25.440
Well, in fact, in both cases we’ve got “zero” or no correlation.
00:10:25.440 --> 00:10:29.880
And that’s because knowing one of the scores tells you nothing about the other.
00:10:29.880 --> 00:10:33.560
There’s no predictability of one score based on the other score.
00:10:34.440 --> 00:10:41.880
In class A, if I know someone scored “fifty” for English, that doesn’t tell me anything at all about what they might have scored on their math test.
00:10:41.880 --> 00:10:46.680
People who scored “fifty” for English scored a whole range of different scores on their math test.
00:10:47.240 --> 00:11:01.800
And likewise for class B, if I know somebody scored “fifty” on math, that doesn’t enable me to predict what score they got on their English test because people who scored “fifty” on math scored the complete range of different scores on their English test.
00:11:03.400 --> 00:11:15.880
This means that although the points suggest a pretty good line of best fit because it’s exactly horizontal or exactly vertical, you can’t use one score to make a prediction about the other for any individual student.
00:11:16.080 --> 00:11:18.400
This means there is no correlation between the two.
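To see this with numbers, here’s a small Python sketch using invented scores for the two classes, again assuming the Pearson coefficient. One caveat for this sketch: if every score on one test were exactly the same, the formula would divide by zero and the coefficient would be undefined, so the made-up scores below are only “about fifty”.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired data (assumed definition)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical class A: English scores all about 50, math scores spread widely,
# so the points form a near-vertical band on the scatterplot.
english_a = [49, 51, 51, 49, 50]
math_a = [10, 30, 50, 70, 90]

# Hypothetical class B: the same numbers with the two subjects swapped,
# so the points form a near-horizontal band.
english_b = [10, 30, 50, 70, 90]
math_b = [49, 51, 51, 49, 50]

r_class_a = correlation_coefficient(english_a, math_a)
r_class_b = correlation_coefficient(english_b, math_b)

print(r_class_a, r_class_b)
```

Both values come out at “zero”: the points line up neatly, but one score still says nothing about the other.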
00:11:19.640 --> 00:11:24.440
Correlation is all about the predictive power of one piece of data for another piece of data.
00:11:25.840 --> 00:11:30.600
Now correlation is also about association between data within a certain range.
00:11:30.880 --> 00:11:37.360
For example, one March I planted some sunflower seeds in my garden and I measured how tall the plants were every day.
00:11:37.880 --> 00:11:40.720
By the end of September, I’d gathered a lot of data.
00:11:41.160 --> 00:11:51.040
And there was a pretty strong positive correlation between the number of days that had passed since I planted the seeds and the height of my plants, which were about “twelve” feet tall by that stage.
00:11:51.840 --> 00:12:00.880
Now by extending that pattern, I confidently predicted that by the end of the following January my plants would be “twenty” feet tall and I wondered if that would be a world record.
00:12:02.000 --> 00:12:03.200
Of course I was wrong.
00:12:03.480 --> 00:12:04.200
Autumn came.
00:12:04.520 --> 00:12:07.440
They stopped growing, they died, they fell over, and they rotted.
00:12:08.880 --> 00:12:21.640
Although the data that I gathered was very good at estimating how tall the plants would have been over the time that I was gathering the data in this region here, it turned out to be very bad at making predictions about the future.
00:12:23.360 --> 00:12:28.360
Using patterns to make estimations within the range of data you’ve collected is called interpolation.
00:12:28.560 --> 00:12:34.560
And this can be very reliable if the data has strong positive or strong negative correlation.
00:12:35.000 --> 00:12:42.440
But trying to use those patterns to make predictions about the future or beyond the range of the data that you’ve collected is called extrapolation.
00:12:42.840 --> 00:12:48.280
And it could be very unreliable even in data that was perfectly correlated within the data range that you gathered.
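Here’s a hedged Python sketch of the sunflower story. The heights are invented to roughly match what was described (about “twelve” feet after roughly two hundred days), and the day counts for the end of September and the end of January are approximations chosen for this sketch.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements: days since planting, and plant height in feet.
days = [0, 40, 80, 120, 160, 200]
heights = [0.0, 2.5, 4.7, 7.3, 9.5, 12.0]

slope, intercept = fit_line(days, heights)

# Day 100 lies inside the measured range of 0 to 200 days: interpolation.
inside = slope * 100 + intercept

# Day 330 (roughly the end of the following January) lies outside it: extrapolation.
outside = slope * 330 + intercept

print(f"estimated height on day 100: {inside:.1f} feet")
print(f"'predicted' height on day 330: {outside:.1f} feet")
```

The line happily returns a height of nearly “twenty” feet for day 330, because nothing in the fitted data knows that autumn is coming.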
00:12:49.440 --> 00:12:59.160
Another thing, although we’ve been talking about correlation in this video, really, as we’ve mentioned a couple of times, we mean linear correlation: how well the data fits a straight line pattern.
00:12:59.880 --> 00:13:04.440
Sometimes though the data doesn’t fit a straight line so well, but maybe it would fit a curve.
00:13:06.000 --> 00:13:12.480
Take this data about the number of visits to the UK between “nineteen seventy-eight” and “nineteen ninety-nine” for example.
00:13:12.480 --> 00:13:28.880
If we fit a linear pattern through the middle here, we can see that although it’s quite a good line of best fit, there’s a pattern emerging: at the ends, the line is tending to underpredict the number of thousands of visits made each year, but in the middle it’s overpredicting.
00:13:28.880 --> 00:13:36.360
So although it looks like a reasonable line of best fit, there’s a pattern to the way in which it’s making errors in its predictions.
00:13:37.560 --> 00:13:43.800
If we fitted more of a curve like this, then there’s a mix of underestimates and overestimates moving along that line.
00:13:43.800 --> 00:13:48.840
So it’s a slightly better predictor of the number of visits based on what year it is.
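That pattern in the errors can be sketched in Python. The data here is deliberately curved and completely made up (it is not the UK visits data), and a straight line is fitted to it so we can look at where the line underpredicts and overpredicts.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data that follows a curve rather than a straight line.
xs = list(range(11))
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)

# Residual = actual value minus the value the line predicts.
# Positive means the line underpredicts; negative means it overpredicts.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" for r in residuals)

print(signs)
```

The signs are not mixed up at random: the line underpredicts at both ends and overpredicts through the middle, which is a hint that a curve might describe the data better.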
00:13:49.760 --> 00:13:56.280
So although nonlinear correlation is beyond the scope of this video, we did just want you to be aware that it is something that does exist.
00:13:58.000 --> 00:14:02.280
So we’ve taken a look at strong or weak positive or direct correlation.
00:14:03.680 --> 00:14:12.920
So we’ve seen strong or weak positive or direct correlation: the closer the correlation coefficient is to “one”, the stronger the correlation.
00:14:14.440 --> 00:14:24.760
And we’ve seen strong or weak negative or inverse correlation: in this case, the closer the correlation coefficient is to “negative one”, the stronger the correlation.
00:14:26.200 --> 00:14:28.800
And we’ve seen examples of no correlation.
00:14:29.160 --> 00:14:37.200
Now this can happen if you’ve got this random splatter of points that look like this or if you’ve got a completely vertical or completely horizontal line of best fit.
00:14:38.400 --> 00:14:44.640
When the correlation coefficient is close to “zero”, knowing one piece of data doesn’t help you to predict what the other one would be.
00:14:45.120 --> 00:14:54.560
So for example, if we knew what their math score was, it wouldn’t help us to predict what their English score was because there’s a whole range of different values that it could’ve been.
00:14:55.760 --> 00:15:07.080
We’ve also seen that when we’ve got good strong correlation, doing interpolation, that is, making predictions of one piece of data based on the other within the range of data we’ve got, can be quite reliable.
00:15:08.080 --> 00:15:15.680
But trying to do extrapolation or make predictions beyond the data range that we’ve gathered can give us very bad results indeed.
00:15:17.080 --> 00:15:21.960
One last thing, correlation tells you about association, not necessarily causality.
00:15:22.880 --> 00:15:30.320
It could just be a coincidence that two sets of data correlate or maybe there’s some other underlying factor affecting both sets of data.
00:15:31.760 --> 00:15:45.680
For example, between “two thousand” and “two thousand and nine”, the average amount of margarine consumed per person in the United States each year correlated very strongly with the divorce rate per thousand people in the state of Maine that year.
00:15:45.880 --> 00:15:47.360
That’s just a coincidence.
00:15:48.920 --> 00:15:55.480
How could the number of divorces in one particular state be affected by how much margarine was being consumed elsewhere in the country?
00:15:56.760 --> 00:16:04.320
There’s also a very weak negative correlation between how yellow people’s teeth are and how long they live.
00:16:04.520 --> 00:16:06.760
Now there’s no causal link between the two.
00:16:06.960 --> 00:16:12.360
But shorter lifespans and having yellow teeth are both caused by smoking tobacco.
00:16:12.360 --> 00:16:19.360
So perhaps that underlying factor is causing the apparent weak correlation between those two other pieces of data.