Every year, at the first faculty meeting, representatives of the registrar tell us what percentage of the incoming class is [insert variable in which we are interested, such as American Indian, working class, international, etc]. They compare it to last year’s percentage. This drives me crazy because they do so as if comparing the last two data points in a sequence is indicative of a trend. But determining whether or not there is a trend, and therefore whether the increase or decrease in the percentage of [insert variable in which we are interested] is significant relative to last year, takes more than two data points!
xkcd does an excellent job of illustrating just how two data points can be utterly meaningless, even wildly fallacious.
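To make the point concrete, here is a minimal Python sketch. The percentages are made up for illustration (they are not real registrar numbers), and the helper names `last_two_change` and `ols_slope` are my own. It compares the change between the last two years with a least-squares slope fitted over all six years, which is one simple way, though not the only one, to summarize a trend.

```python
def last_two_change(series):
    """Change between the final two observations (the faculty-meeting comparison)."""
    return series[-1] - series[-2]


def ols_slope(series):
    """Least-squares slope of the series against year index 0, 1, 2, ..."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical percentages of an incoming class over six years (invented data).
percentages = [4.0, 5.0, 3.0, 5.0, 3.0, 4.0]

print("change over the last two years:", last_two_change(percentages))
print("fitted slope over all six years:", ols_slope(percentages))
```

With these invented numbers the last two years show an increase, while the slope fitted over all six years is slightly negative. The two summaries can disagree even in sign, which is why the year-over-year comparison alone says little about a trend.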
Other great xkcd cartoons: attribution and the in group, on statistical significance, correlation or causation, and the minimal group paradigm.
Originally posted in 2009.
Lisa Wade, PhD is an Associate Professor at Tulane University. She is the author of American Hookup, a book about college sexual culture; a textbook about gender; and a forthcoming introductory text: Terrible Magnificent Sociology. You can follow her on Twitter and Instagram.
Comments 21
Elena — July 6, 2009
And the alt text punchline for the comic is "By the third trimester, there will be hundreds of babies inside you."
Shinobi — July 6, 2009
This drives me totally bonkers. TOTALLY BONKERS.
Duran — July 6, 2009
Your reasoning, Lisa, as is quite often the case where math, science, or logic is involved, is wrong.
It is only important to include multiple data points if forecasting is intended. With two data points, you can indeed draw very accurate conclusions: about those two data points!
So, either your logic here is fallacious, or you didn't give us the full story:
But to determine whether or not there is a trend, and therefore whether the increase or decrease in the percentage of [insert variable in which we are interested] significant relative to last year,
Is sociology considered a science? I seriously cannot believe you are a professor (and therefore almost certainly hold an advanced degree). I really can't.
Jesse — July 6, 2009
Yeah, if someone notes that the percentage of international students has increased over the past few years, are they really saying that eventually 3 million percent of the student body will be from other countries? Come on.
Jesse — July 6, 2009
And of course you can have 50 million data points, you still need faith in some theoretical assumptions to extrapolate into the future. Problems with extrapolation are not derived from the small number of data points used.
Penny — July 6, 2009
"two data points can be utterly meaningless, even wildly fallacious"
Have to agree with the above--two data points aren't meaningless or fallacious -- they just don't *predict* much. But not all statistical reports are about making predictions; some can just be about stating the numerical facts from year A and year B, for the record.
Rhys — July 6, 2009
How hard would it be to go back further than last year's data point? Save them each year yourself and extrapolate a trend further down the line. Crikey.
Jeff — July 6, 2009
Yes, Duran, sociology is a science. It's not a natural science like math or physics, but a social science. Read any journal article in political science, sociology, psychology, etc. and you'll see that they're all heavily based on quantitative analyses of phenomena in the respective discipline.
The point here is about extrapolation versus interpolation. Extrapolation is making predictions outside your set of data, and interpolation is making inferences about "missing points" inside your set of data. If you have 2 American Indians one year, and then 4 the next year, is the growth exponential or not? There's no way to tell with just two years; you need more than two years of data to draw any meaningful, statistically significant conclusions. It's just like how in MLB stats, players' batting averages aren't ranked unless they have a certain number of at bats. A batting average of .750 is nice for the first 4 at bats, but clearly won't last until the end of the season.
Jeff — July 6, 2009
And with all sciences, the point of data gathering is to make predictions. And the basis for a good prediction is good data. Two points close together on the x axis (in this case time)=bad data.
Jesse — July 6, 2009
1. Math is not a natural science.
2. Sociology may be a social science, but I can't remember a scientific argument ever being advanced on this blog.
3. "The point here is about extrapolation versus interpolation" -- I don't see any points being made about interpolation. All you do is define it.
4. Who on earth asked if the growth in American Indian population is exponential? The answer is obviously no, it's not exponential. If the American Indian representation doubled every year, there would be a trillion American Indian students in 40 years. You don't need ANY data points to see that this is ridiculous. The only people who seem to think that this is an interesting possibility to explore are you and Lisa.
Sheesh. If you turn on the news and are told that Barack Obama has been President for X days now, and then you complain that everyone seems to think that Obama will be President forever, you're really not showing off how smart you are. Quite the opposite.
Jeff — July 6, 2009
1. OK, correction, math is the language of all sciences.
2. Sociology is most definitely a social science. Period. Interestingly enough, the word sociology literally means "social science." I just happened to stumble upon this website, so I have no idea of its history.
3. The argument in this blog is about erroneous assumptions about the growth/decline of certain groups of students (such as Native Americans) in successive years (more appropriately, in this case, 2 years). The x-axis is time in years; the y-axis is population of Native American Students.
Let's say that in 2008 there are (to keep the math simple) 2 Native American students. In 2009 there are 4, and in 2010 there are 6. So, if we just take 2008 and 2009 as our 2 data points, we are unsure how to "interpolate" how the growth from 2 to 4 happened. Is the equation:
A) y=2^x, the population increases exponentially: 4 (pop) = 2^(2 years)
B) y=2x, the population doubles: 4 (pop) = 2*(2 years)
C) y=x+2, the population increases by 2 each year: 4 (pop) = (2 years)+2
With only two data points, all of these are (mathematically) plausible. However, knowing that in 2010 there were six students, we can then presume that the growth equation is C, y=x+2.
Extrapolation would be predicting what would happen in 2011--a data point outside our data set. We would wrongly predict there would be 8 students with equations A and B.
4) See the original quotation below the comic. This is what the joke in the comic is about in the first place, as well as the comment the author of the blog made. You can't draw meaningful insights about a phenomenon (such as the increase in Native American students) by looking at only two consecutive years.
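For anyone who wants to see the two-point ambiguity from point 3 in numbers, here is a rough Python sketch. It assumes a convention the comment never pins down, namely that x=1 is 2008 and x=2 is 2009, and it compares just two of the candidate curves, y=2^x and y=2x. The function names are only for illustration.

```python
def exponential(x):
    """Candidate curve y = 2^x."""
    return 2 ** x


def linear(x):
    """Candidate curve y = 2x."""
    return 2 * x


# Assumed convention: x = 1 is 2008, x = 2 is 2009, and so on.
for x, year in enumerate(range(2008, 2012), start=1):
    print(year, "y=2^x:", exponential(x), "y=2x:", linear(x))
```

Under that convention the two curves give the same values for 2008 and 2009 and only separate from 2010 onward, so the first two years alone cannot tell them apart.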
In the case of the xkcd comic, the growth (presumably) is exponential or an addition of 1.
y (# of husbands) = 0^(days-1): 1 = 0^(1-1)
y (# of husbands) = 0+(days): 1 = 0+1
Obama comment) This model is not analogous to the above example because there are not varying degrees of being the president. Obama is either the president or not: a binary variable, which is another can of worms.
Hope that helps clear up any confusion!
Jeff — July 6, 2009
*Correction
The exponential growth for the comic should be y=x(1-x)^(x-1) but this would work for only days 0 and 1
Another possibility is simply y=x
Jesse — July 6, 2009
Wow Jeff, so you really think that if someone from a registrar tells you that American Indian representation has doubled in the last year, that it will continue to double every year going forward in the future? Is that really what the person means?
If I tell you that crime has gone down over the past year, am I really just saying that crime will eventually decline to zero? There's nothing else useful in my statement? It's totally pointless and useless unless we want to make forecasts?
Please. The only person doing silly extrapolation is you.
Obama has been President for X days. Tomorrow he will have been President for X+1 days. How many days will he have been a President in 100 years? Oh no the dangers of extrapolation! So no, the model is perfectly analogous. And somehow no one makes the mistake that you seem so eager to fight against.
In the real world, extrapolation is a problem no matter HOW MANY DATA POINTS YOU HAVE. Stock market returns in the USA between 1935 and 2008 look pretty good. That's a lot of data. And yet that data wouldn't have done you much good at forecasting 2008 stock returns.
You know, things are the way they are, until they aren't. No black President in America for over 200 years, and then there was one. The historical data was useless.
There are some regularities in the world that we can recognize and make use of but ultimately the future is somewhat unpredictable.
But yeah, thanks for clearing up all my confusion.
Jeff — July 6, 2009
Forget it, man. Not worth it. Back to some real studying instead of wasting my time trying to explain a joke you can't get.
Jeff — July 6, 2009
OK, I can't resist....
Wow Jeff, so you really think that if someone from a registrar tells you that American Indian representation has doubled in the last year, that it will continue to double every year going forward in the future? Is that really what the person means?
That's what the math would say...which is exactly why you need more than two points of data--so the math will be accurate. To recognize the REAL trend. That's what makes the joke funny.
If I tell you that crime has gone down over the past year, am I really just saying that crime will eventually decline to zero? There’s nothing else useful in my statement? It’s totally pointless and useless unless we want to make forecasts?
You're making the silly extrapolation here. And what good is data if we don't make forecasts and do research to figure out why the crime went down. Were there more police forces? Increased church attendance?
And as for the rest of it: all can be attributed to outliers. Stock Market in 1929: outlier. Black president in 2008: outlier
Social data are usually not pretty. Which is why (not to beat a dead horse) YOU NEED MORE THAN TWO DATA POINTS!
And here: http://stockcharts.com/charts/historical/djia1900.html
Hmmm...looks like a pretty noticeable trend to me....
However if you looked at the years 1930-1935, you would predict a pretty negative growth. Which is exactly why you must have a larger amount of data to get an accurate picture!
For real. I'm done.
Jesse — July 6, 2009
Wow that's a new level of stupid.
YOU NEED MORE THAN TWO DATA POINTS!
The only person who looked at two data points and even thought about extrapolating out to infinity was you. Who are you warning exactly?
And what good is data if we don’t make forecasts and do research to figure out why the crime went down. Were there more police forces? Increased church attendance?
You were the one who said that the data had no use if we weren't going to use it to forecast.
If crime rate goes down 50%, that fact is interesting in itself. You know, it's pretty hard to investigate why crime went down if you don't even know that it went down.
But if someone tells you that the crime rate was X last year and X/2 this year, that apparently means they're a moron. It's just so useless, because you can't extrapolate from just 2 data points! What possible point could they be making by comparing the crime rate from one year to the next??? Look at me, I'm smart!
There are plenty of predictive models that assume that future changes are independent of past changes. In fact, if you're looking at the % of students in your school from a certain background, it's virtually certain that no trend is sustainable in the long run. And virtually everyone understands this.
The point of taking this year's numbers and comparing them to last year's numbers is to (1) describe the current situation and (2) put the current situation in some sort of larger context. It's not done to extrapolate trends millions of years into the future.
This is rocket science; someone who doesn't understand it shouldn't be lecturing other people about anything.
Jesse — July 6, 2009
Oh sorry, meant that this is NOT rocket science.
Jesse — July 6, 2009
I mean, from the original post:
Every year, at the first faculty meeting, representatives of the registrar tell us what percentage of the incoming class is [insert variable in which we are interested, such as American Indian, working class, international, etc]. They compare it to last year’s percentage. This drives me crazy because they do so as if comparing the last two data points in a sequence is indicative of a trend.
This drives me crazy because two data points in a sequence IS indicative of a trend. Not a trend that will necessarily continue indefinitely into the future, or a trend that has existed since the beginning of time, but a trend between the two years.
If crime is down 50%, THAT IS INTERESTING. The fact that crime will not go down to zero in a few years, or that crime was not infinite a few years ago, IS IRRELEVANT NONSENSE.
Larry Charles Wilson — January 25, 2015
The fact that I only discovered this blog five years ago means I missed some real fun.
Gunnar Tveiten — February 4, 2015
They mostly do this to have something positive to report even when really nothing has changed. Our kids' school does this too; each year there's a survey on the performance of the school, and each year they'll point out that they made some progress on variables X, Y and Z.
In reality the only thing that is happening is that the variables bounce up and down a bit randomly, and thus some things go up and other things go down; long-term there's no trend at all, it's entirely flat.
(And of course they "forget" to mention those variables that went DOWN from last year, thus a situation where 5 variables go up and 5 go down is reported as "we made some improvements on (insert list of 5 up-going variables here).")
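A quick way to see this effect is to simulate it. The Python sketch below is only an illustration under assumptions I am making up: ten survey variables, each flat around a baseline of 50 with random year-to-year noise, and a hypothetical helper called `year_over_year_ups`. None of these numbers come from a real school survey.

```python
import random


def year_over_year_ups(n_variables=10, baseline=50.0, noise_sd=2.0, seed=0):
    """Count how many flat, noisy variables happen to rise from one year to the next.

    Every variable has the same true level (baseline) in both years, so any
    "improvement" is noise. Returns (number that went up, number that did not).
    """
    rng = random.Random(seed)
    ups = 0
    for _ in range(n_variables):
        last_year = baseline + rng.gauss(0.0, noise_sd)
        this_year = baseline + rng.gauss(0.0, noise_sd)
        if this_year > last_year:
            ups += 1
    return ups, n_variables - ups


ups, not_ups = year_over_year_ups()
print(f"{ups} variables went up and {not_ups} did not, with no underlying trend")
```

Changing the seed changes which variables "improve", but some of them nearly always will, so a report that lists only the ones that went up can look like progress when nothing has changed.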