Jakob Nielsen's Alertbox: March 1, 2004

# Probability Theory and Fishing for Significance

Sidebar to Jakob Nielsen's column Risks of Quantitative Studies, March 2004.

Let's say that we suspect a coin of being biased because it seems to come up "heads" too often when we're playing heads-or-tails. As an experiment, we toss the coin ten times, and sure enough, it comes up heads seven times, and tails three times. Is this proof that the coin is unfair?

When we toss a coin ten times, there are 1,024 different outcomes in terms of heads-and-tails sequences, and with a fair coin each of them is equally likely. So, how many of these outcomes include at least seven heads? There are 120 outcomes with exactly seven heads, 45 outcomes with eight heads, 10 outcomes with nine heads, and only one outcome in which all ten tosses come up heads. In all, 176 outcomes include at least seven heads.

In other words, assuming a fair coin, the probability of observing at least seven heads in ten tosses is 176 out of 1,024, or 17%. This is much too big a p to be statistically significant. To conclude that the coin is biased, we'd need at least nine heads out of ten, because only eleven outcomes out of 1,024 would generate nine or more heads with a fair coin, giving us a p of 1.1%.
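The counting argument above can be checked with a few lines of Python. This is only a sketch: the function name `tail_probability` is made up for this illustration, and it assumes a fair coin and independent tosses.

```python
from math import comb

def tail_probability(min_heads: int, tosses: int = 10) -> float:
    """Probability of at least `min_heads` heads in `tosses` fair-coin tosses."""
    # Count the sequences with exactly k heads for every k from min_heads up.
    favorable = sum(comb(tosses, k) for k in range(min_heads, tosses + 1))
    # Every one of the 2**tosses sequences is equally likely for a fair coin.
    return favorable / 2 ** tosses

for min_heads in (7, 8, 9, 10):
    print(f"at least {min_heads} heads: {tail_probability(min_heads):.1%}")
# at least 7 heads: 17.2%
# at least 8 heads: 5.5%
# at least 9 heads: 1.1%
# at least 10 heads: 0.1%
```

Note that eight heads out of ten gives about 5.5%. If we assume the conventional 5% cutoff (the sidebar doesn't name a threshold), that just misses, which is why nine heads is the first result that qualifies.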

## A Tale of 400 Quarters

Now, let's say that we go to the bank and get \$100 worth of quarters (400 coins). We toss each quarter ten times, and one of them comes up heads all ten times. Surely this coin must be biased. After all, ten heads has p = 0.1%, leading to a strong conclusion of statistical significance.

No. The problem here is that we've gone fishing for statistical significance, which violates the calculations' assumptions.

We know that if we toss a fair coin ten times, there is only a 0.1% probability that it will come up heads all ten times. But, if we repeat this experiment with 400 coins, there is a 32% probability that we will observe the all-ten-heads outcome at least once, because the chance that none of the 400 fair coins produces it is (1,023/1,024)^400, or about 68%. And p = 32% is much too high to be statistically significant.

There's also a 32% probability of observing a run of ten tails from at least one coin. In other words, more than half the time (about 54%) we will find at least one coin that gives ten identical outcomes.
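Both figures follow from the same "at least once" calculation, sketched here in Python. The helper name `at_least_once` is hypothetical, and the sketch assumes that all 400 coins are fair and independent of one another.

```python
def at_least_once(p_single: float, coins: int = 400) -> float:
    """Probability that an outcome with per-coin probability `p_single`
    shows up on at least one of `coins` independent coins."""
    # Complement rule: 1 minus the probability that every coin misses.
    return 1 - (1 - p_single) ** coins

p_all_heads = 1 / 2 ** 10   # one sequence out of 1,024
p_identical = 2 / 2 ** 10   # all heads or all tails

print(f"some coin shows ten heads:        {at_least_once(p_all_heads):.0%}")
print(f"some coin shows ten identical:    {at_least_once(p_identical):.0%}")
# some coin shows ten heads:        32%
# some coin shows ten identical:    54%
```

The second figure is not simply 32% + 32%, because the same batch of 400 coins can contain both an all-heads coin and an all-tails coin.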

This is why it is not valid research to conduct a study, collect lots of data about lots of variables, and then claim significance because some of the variables seem to correlate. Doing so is exactly the same as tossing lots of quarters, then reporting on the few coins that had an unusual outcome.
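To see the fishing effect rather than compute it, one can simulate many such "studies" with coins that are fair by construction. The following is a rough sketch, not a rigorous experiment: the function names, the number of simulated studies, and the seed are all arbitrary choices made for this illustration.

```python
import random

def count_extreme_coins(rng: random.Random, coins: int = 400, tosses: int = 10) -> int:
    """Toss `coins` fair coins `tosses` times each and count the coins
    that came up all heads or all tails."""
    all_heads = (1 << tosses) - 1  # bit pattern 1111111111
    all_tails = 0                  # bit pattern 0000000000
    extreme = 0
    for _ in range(coins):
        # Each of the `tosses` random bits stands for one fair toss.
        outcome = rng.getrandbits(tosses)
        if outcome in (all_heads, all_tails):
            extreme += 1
    return extreme

def fraction_of_studies_with_a_find(studies: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated 400-coin studies that turn up at least one
    'suspicious' coin, even though every coin is fair."""
    rng = random.Random(seed)
    hits = sum(count_extreme_coins(rng) > 0 for _ in range(studies))
    return hits / studies

print(f"studies with a 'biased' coin: {fraction_of_studies_with_a_find():.0%}")
```

The printed fraction should land somewhere near one half, so a researcher who reports only the unusual coins would have something "significant" to report in roughly every other study.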

The press reported an experiment with the Euro coin that was just barely shy of statistical significance in showing that the coin might come up heads too often. This is an example of publication bias, which I discussed in the main article: The experiment's finding was reported only because it was sensational. Many other statistics professors surely used Euro coins to demonstrate probability for their students, but when those results didn't produce anything suspicious, they weren't mentioned outside the classroom.