Summary: When collecting usability metrics, testing 20 users typically offers a reasonably tight confidence interval.
We can define usability in terms of quality metrics, such as learning time, efficiency of use, memorability, user errors, and subjective satisfaction. Sadly, few projects collect such metrics because doing so is expensive: it requires four times as many users as simple user testing.
Many users are required because of the substantial individual differences in user performance. When you measure people, you'll always get some who are really fast and some who are really slow. Given this, you need to average these measures across a fairly large number of observations to smooth over the variability.
Standard Deviation for Web Usability Data
We know from previous analysis that user performance on websites follows a normal distribution. This is fortunate, because normal distributions are fairly easy to deal with statistically. By knowing just two numbers -- the mean and the standard deviation -- you can draw the bell curve that represents your data.
I analyzed 1,520 measures of user time-on-task performance for 70 different tasks from a broad spectrum of websites and intranets. Across these many studies, the standard deviation was 52% of the mean values. For example, if it took an average of 10 minutes to complete a certain task, then the standard deviation for that metric would be 5.2 minutes.
To compute the standard deviation, I first removed the outliers representing excessively slow users. Is this reasonable to do? In some ways, no: slow users are real, and you should consider them when assessing a design's quality. Thus, even though I recommend removing outliers from the statistical analyses, you shouldn't forget about them. Do a qualitative analysis of outliers' test sessions and find out what "bad luck" (i.e., bad design) conspired to drag down their performance.
For most statistical analyses, however, you should eliminate the outliers. Because they occur randomly, you might have more outliers in one study than in another, and these few extreme values can seriously skew your averages and other conclusions.
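The article does not say which rule was used to identify outliers, so the following Python sketch is only an illustration. It assumes a common stand-in, the 1.5 x IQR rule applied to the slow side only, and the task times are made-up numbers, not data from the studies.

```python
from statistics import mean, quantiles

# Hypothetical task times in seconds; the last user is excessively slow.
times = [180, 210, 240, 260, 290, 300, 320, 350, 380, 1500]

def drop_slow_outliers(values, k=1.5):
    """Keep values at or below Q3 + k * IQR (an assumed rule, not the article's)."""
    q1, _, q3 = quantiles(values, n=4)
    cutoff = q3 + k * (q3 - q1)
    return [v for v in values if v <= cutoff]

kept = drop_slow_outliers(times)
print(mean(times))  # average with the slow user included
print(mean(kept))   # average after removing the slow user
```

A single extreme value moves the average substantially, which is why the slow sessions are better handled through qualitative review than left in the statistics.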
The only reason to compute statistics is to compare them with other statistics. That my hypothetical task took an average of ten minutes means little on its own. Is ten minutes good or bad? You can't tell from putting that one number on a slide and admiring it in splendid isolation.
If you asked users to subscribe to an email newsletter, a ten-minute average task time would be extremely bad. We know from studies of many newsletter subscription processes that the average task time across other websites is four minutes, and users are only really satisfied if it takes less than two minutes. On the other hand, ten minutes would indicate very high usability for more complex tasks, such as applying for a mortgage.
The point is that you collect usability metrics to compare them with other usability metrics, such as comparing your site with your competitors' sites or your new design with your old.
When you eliminate outliers from both statistics, you still have a valid comparison. True, average task time in both cases will be a bit higher if you keep the outliers. But, without the outliers, you're more likely to reach correct conclusions, because you're less likely to overestimate an average that happened to have a greater number of outliers.
Estimating Margin of Error
When you average together several observations from a normal distribution, the standard deviation (SD) of your average is the SD of the individual values divided by the square root of the number of observations. For example, if you have ten observations, then the SD of the average is 1/sqrt(10) = 0.316 times the original SD.
We know that for user testing of websites and intranets, the SD is 52% of the mean. In other words, if we tested ten users, then the SD of the average would be 16% of the mean, because 0.316 x 0.52 = 0.16.
Let's say we're testing a task that takes five minutes to perform. So, the SD of the average is 16% of 300 seconds = 48 seconds. For a normal distribution, two-thirds of the cases fall within +/- 1 SD from the mean. Thus, our average would be within 48 seconds of the five-minute mean two-thirds of the time.
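The arithmetic above can be sketched in a few lines of Python. The function and constant names are illustrative; the only input taken from the article is the 52% ratio of SD to mean.

```python
import math

SD_FRACTION = 0.52  # SD as a fraction of the mean, as reported in the article

def sd_of_average(mean_seconds, n_users, sd_fraction=SD_FRACTION):
    """SD of the average of n_users observations (the standard error)."""
    return sd_fraction * mean_seconds / math.sqrt(n_users)

se = sd_of_average(300, 10)
# Roughly 49 seconds unrounded; the 48 seconds in the text comes from
# rounding 16.4% down to 16% before multiplying.
print(round(se, 1))
```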
The following chart shows the margin of error for testing various numbers of users, assuming that you want a 90% confidence interval (blue curve). This means that 90% of the time, you hit within the interval, 5% of the time you hit too low, and 5% of the time you hit too high. For practical Web projects, you really don't need a more accurate interval than this.
The red curve shows what happens if we relax our requirements to being right half of the time. (Meaning that we'd hit too low 1/4 of the time and too high 1/4 of the time.)
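As a rough way to reproduce the two curves, the sketch below computes the margin of error as a fraction of the mean under a plain normal approximation. This is an assumption about how the chart was built, so the values may differ from the charted ones by about a percentage point. The function name is illustrative.

```python
import math
from statistics import NormalDist

def margin_of_error(n_users, confidence=0.90, sd_fraction=0.52):
    """Half-width of the confidence interval, as a fraction of the mean."""
    # Two-sided interval: a 90% level leaves 5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd_fraction / math.sqrt(n_users)

for n in (5, 10, 19, 40):
    blue = margin_of_error(n, confidence=0.90)  # 90% interval
    red = margin_of_error(n, confidence=0.50)   # 50% interval
    print(f"{n:3d} users: 90% -> +/- {blue:.0%}, 50% -> +/- {red:.0%}")
```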
Determining the Number of Users to Test
In the chart, the margin of error is expressed as a percent of the mean value of your usability metric. For example, if you test 10 users, the margin of error is +/- 27% of the mean. This means that if the mean task time is 300 seconds (five minutes), then your margin of error is +/- 81 seconds. Your confidence interval thus goes from 219 seconds to 381 seconds: 90% of the time you're inside this interval; 5% of the time you're below 219, and 5% of the time you're above 381.
This is a rather wide confidence interval, which is why I usually recommend testing with 20 users when collecting quantitative usability metrics. With 20 users, you'll probably have one outlier (since 6% of users are outliers), so you'll include data from 19 users in your average. This makes your confidence interval go from 243 to 357 seconds, since the margin of error is +/- 19% for testing 19 users.
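Turning a margin of error into an interval in seconds is a one-line calculation. A minimal sketch, using the percentages quoted above and an illustrative function name:

```python
def confidence_interval(mean_seconds, margin_fraction):
    """Return (low, high) bounds in whole seconds around the mean."""
    half_width = mean_seconds * margin_fraction
    return round(mean_seconds - half_width), round(mean_seconds + half_width)

ten_users = confidence_interval(300, 0.27)       # data from 10 users
nineteen_users = confidence_interval(300, 0.19)  # 20 tested, 19 analyzed
print(ten_users)       # (219, 381)
print(nineteen_users)  # (243, 357)
```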
You might say that this is still a wide confidence interval, but the truth is that it's extremely expensive to tighten it up further. To get a margin of error of +/- 10%, you need data from 71 users, so you'd have to test 76 to account for the five likely outliers.
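The same relationship can be turned around to estimate how many users a target margin requires. The sketch below assumes a plain normal approximation, which gives a figure in the low 70s for +/- 10%, a few users away from the 71 quoted above; small differences in rounding are enough to explain the gap. Both function names are illustrative.

```python
import math
from statistics import NormalDist

def users_needed(target_margin, confidence=0.90, sd_fraction=0.52):
    """Observations needed for a given margin (as a fraction of the mean)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd_fraction / target_margin) ** 2)

def users_to_recruit(n_needed, outlier_rate=0.06):
    """Inflate the sample so n_needed remain after outliers are removed."""
    return math.ceil(n_needed / (1 - outlier_rate))

print(users_needed(0.10))    # normal-approximation estimate for +/- 10%
print(users_to_recruit(71))  # 76, matching the figure in the text
```

Because the required sample grows with the inverse square of the margin, halving the margin of error roughly quadruples the number of users.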
Testing 76 users is a complete waste of money for almost all practical development projects. You can get good-enough data on four different designs by testing each of them with 20 users, rather than blow your budget on only slightly better metrics for a single design.
In practice, a confidence interval of +/- 19% is ample for most goals. Mainly, you're going to compare two designs to see which one measures best. And the average difference between websites is 68% -- much more than the margin of error.
Also, remember that the +/- 19% is pretty much a worst-case scenario; you'll do better 90% of the time. The red curve shows that half of the time you'll be within +/- 8% of the mean if you test with 20 users and analyze data from 19. In other words, half the time you get great accuracy and the other half you get good accuracy. That's all you need for non-academic projects.
Quantitative vs. Qualitative
Based on the above analysis, my recommendation is to test 20 users in quantitative studies. This is very expensive, because test users are hard to come by and require systematic recruiting to actually represent your target audience.
Luckily, you don't have to measure usability to improve it. Usually, it's enough to test with a handful of users and revise the design in the direction indicated by a qualitative analysis of their behavior. When you see several people being stumped by the same design element, you don't really need to know how much the users are being delayed. If it's hurting users, change it or get rid of it.
You can usually run a qualitative study with 5 users, so quantitative studies are about 4 times as expensive. Furthermore, it's easy to get a quantitative study wrong and end up with misleading data. When you collect numbers instead of insights, everything must be exactly right, or you might as well not do the study.
Because they're expensive and difficult to get right, I usually warn against quantitative studies. The first several usability studies you perform should be qualitative. Only after your organization has progressed in maturity with respect to integrating usability into the design lifecycle and you're routinely performing usability studies should you start including a few quant studies in the mix.