Risks of Quantitative Studies

by Jakob Nielsen on March 1, 2004

Summary: Number fetishism leads usability studies astray by focusing on statistical analyses that are often false, biased, misleading, or overly narrow. Better to emphasize insights and qualitative research.

There are two main types of user research: quantitative (statistics) and qualitative (insights). Quant has quaint advantages, but qualitative delivers the best results for the least money. Furthermore, quantitative studies are often too narrow to be useful and are sometimes directly misleading.

The key benefit of quantitative studies is simple: they boil a complex situation down to a single number that's easy to grasp and discuss. I exploit this communicative clarity myself, for example, in reporting that using websites is 206% more difficult for users with disabilities and 122% more difficult for senior citizens than for mainstream users.

Of course, using bottom-line scores to summarize elaborate usability study outcomes neglects details that take 273 pages to explain: Why are websites more difficult for these groups? What should you do about it?

Numbers, however, have their own stories to tell:

  • They tell us that the situation is much worse for users with disabilities than for seniors. Because there are many more seniors and they constitute a particularly affluent audience, websites might nonetheless choose to spend more resources catering to seniors than to the disabled. Knowing the score lets organizations make conscious decisions in how they allocate scarce resources.
  • They tell us that the problems are not small. If the Web were 5% more difficult for users with disabilities than for other users, most people would say "whatever; deal with it." But discriminating by 206% is too much for many of us to stomach.

Numbers also allow comparisons between designs and tracking over time. Ten years from now, if websites are only 50% harder to use for seniors than for younger users, we'll know that we've made substantial progress.

Beware Number Fetishism

When I read reports from other people's research, I usually find that their qualitative study results are more credible and trustworthy than their quantitative results. It's a dangerous mistake to believe that statistical research is somehow more scientific or credible than insight-based observational research. In fact, most statistical research is less credible than qualitative studies. Design research is not like medical science: ethnography is its closest analogy in traditional fields of science.

User interfaces and usability are highly contextual, and their effectiveness depends on a broad understanding of human behavior. Typically, designers must combine and trade off design guidelines, which requires some understanding of the rationale and principles behind the recommendations. Issues that are so specific that a formula can pinpoint them are usually irrelevant for practical design projects.

Fixating on numbers rather than qualitative insights has driven many usability studies astray. As the following points illustrate, quantitative approaches are inherently risky in a host of ways.

Random Results

Researchers often perform statistical analysis to determine whether numeric results are "statistically significant." By convention, they deem an outcome significant if there is less than 5% probability that it could have occurred randomly rather than signifying a true phenomenon.

This sounds reasonable, but it implies that about one out of twenty analyses will come out "significant" purely by chance, with no true phenomenon behind it, if researchers rely purely on quantitative methods.

Luckily, most good researchers — especially those in the user-interface field — use more than a simple quantitative analysis. Thus, they typically have insights beyond simple statistics when they publish a paper, which drives down, but doesn't eliminate, bogus findings.

There's a reverse phenomenon as well: Sometimes a true finding is statistically insignificant because of the experiment's design. Perhaps the study didn't include enough participants to observe a major — but rare — finding in sufficient numbers. It would therefore be wrong to dismiss issues as irrelevant just because they don't show up in quantitative study results.

The "butterfly ballot" in the 2000 election in Florida is a good example: a study of 100 voters would not have included a statistically significant number of people who intended to vote for Al Gore but instead punched the hole for Patrick Buchanan, because less than 1% of voters made this mistake. A qualitative study, on the other hand, would likely have revealed some voters saying something like, "Okay, I want to vote for Gore, so I'm punching the second hole ... oh, wait, it looks like Buchanan's arrow points to that hole. I have to go down one for Gore's hole." Hesitations and almost-errors are gold to the observant study facilitator, but to translate them into design recommendations requires a qualitative analysis that pairs observations with interpretive knowledge of usability principles.
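The arithmetic behind the ballot example can be sketched in a few lines. This is only an illustration: the text says "less than 1%" of voters made the mistake, so the 1% rate used below is an assumed upper bound, and the function names are hypothetical.

```python
def expected_count(error_rate: float, participants: int) -> float:
    """Average number of participants expected to make the error."""
    return error_rate * participants


def prob_at_least_one(error_rate: float, participants: int) -> float:
    """Chance that at least one participant makes the error,
    assuming participants err independently of one another."""
    return 1.0 - (1.0 - error_rate) ** participants


ASSUMED_ERROR_RATE = 0.01  # assumption: upper bound on "less than 1%"
STUDY_SIZE = 100           # the 100-voter study from the text

print(f"expected errors: {expected_count(ASSUMED_ERROR_RATE, STUDY_SIZE):.1f}")
print(f"P(at least one): {prob_at_least_one(ASSUMED_ERROR_RATE, STUDY_SIZE):.2f}")
```

Under these assumptions the study would see about one such error on average, and would quite often see none at all. That is far too few to register in a statistical analysis, yet a single observed hesitation is enough for a facilitator to notice.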

Pulling Correlations Out of a Hat

If you measure enough variables, you will inevitably discover that some seem to correlate. Run all your stats through the software and a few "significant" correlations will surely pop out. (Remember: one out of twenty analyses is "significant," even if there is no underlying true phenomenon.)

Studies that measure seven metrics will generate twenty-one possible correlations between the variables. Thus, on average, such studies will have one bogus correlation that the statistics program deems "significant," even if the issues being measured have no real connection.

In my Web Usability 2004 project, we collected metrics on fifty-three different aspects of user behavior on websites. There are thus 1,378 possible correlations that I could throw into the hopper. Even if we didn't discover anything at all in the study, about sixty-nine correlations would emerge as "statistically significant."
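The counts above follow from simple combinatorics, sketched here as a check. The function names are hypothetical, and the calculation assumes that none of the measured variables are truly related, so every "significant" correlation is spurious.

```python
from math import comb

ALPHA = 0.05  # the conventional 5% significance threshold


def pairwise_correlations(num_metrics: int) -> int:
    """Number of distinct pairs among the measured variables."""
    return comb(num_metrics, 2)


def expected_spurious(num_metrics: int, alpha: float = ALPHA) -> float:
    """Expected number of correlations flagged as significant
    when no pair of variables has any real connection."""
    return pairwise_correlations(num_metrics) * alpha


for metrics in (7, 53):
    pairs = pairwise_correlations(metrics)
    spurious = expected_spurious(metrics)
    print(f"{metrics} metrics: {pairs} pairs, about {spurious:.2f} spurious")
```

Because the number of pairs grows roughly with the square of the number of metrics, adding a few more measurements to a study adds many more opportunities for a bogus correlation.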

Obviously, I'm not going to stoop to correlation hunting; I'll only report statistics that relate to reasonable hypotheses founded on an understanding of the underlying phenomena. (In fact, statistics program analyses assume that researchers have specified the hypotheses in advance; if you hunt for "significance" in the output after the fact, you're abusing the software.)

Overlooking Covariants

Even when a correlation represents a true phenomenon, it can be misleading if the real action concerns a third variable that is related to the two you're studying.

For example, studies show that intelligence declines by birth order. In other words, a person who was a first-born child will on average have a higher IQ than someone who was born second. Third-, fourth-, fifth-born children and so on have progressively lower average IQs. This data seems to present a clear warning to prospective parents: Don't have too many kids, or they'll come out increasingly stupid. Not so.

There's a hidden third variable at play: smarter parents tend to have fewer children. When you want to measure the average IQ of first-born children, you sample the offspring of all parents, regardless of how many kids they have. But when you measure the average IQ of fifth-born children, you're obviously sampling only the offspring of parents who have five or more kids. There will thus be a bigger percentage of low-IQ children in the latter sample, giving us the true — but misleading — conclusion that fifth-born children have lower average IQs than first-born children. Any given couple can have as many children as they want, and their younger children are unlikely to be significantly less intelligent than their older ones. When you measure intelligence based on a random sample from the available pool of children, however, you're ignoring the parents, who are the true cause of the observed data.

(Update added 2007: The newest research suggests that there may actually be a tiny advantage in IQ for first-born children after correcting for family size and the parents' economic and educational status. But the point remains that you have to correct for these covariants, and when you do so, the IQ difference is much less than plain averages may lead you to believe.)
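The birth-order effect can be reproduced with a deliberately artificial population. In the sketch below, every number is made up for illustration: all children within a family have exactly the same IQ, so birth order has no effect at all, yet the averages by birth order still decline because only the larger families contribute later-born children.

```python
from collections import defaultdict

# Hypothetical population, invented purely for illustration:
# (number_of_families, children_per_family, iq_of_every_child)
FAMILY_TYPES = [
    (100, 2, 110),  # smaller families, higher IQ
    (100, 5, 95),   # larger families, lower IQ
]


def mean_iq_by_birth_order(family_types):
    """Average IQ of all first-borns, all second-borns, and so on."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for n_families, n_children, iq in family_types:
        for birth_order in range(1, n_children + 1):
            totals[birth_order] += n_families * iq
            counts[birth_order] += n_families
    return {order: totals[order] / counts[order] for order in sorted(totals)}


means = mean_iq_by_birth_order(FAMILY_TYPES)
for order, mean in means.items():
    print(f"birth order {order}: mean IQ {mean}")
```

First- and second-born children average 102.5 here, while third-born and later children average 95, even though no child differs from his or her own siblings. The drop comes entirely from which families are in each sample.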

As a Web example, you might observe that longer link texts are positively correlated with user success. This doesn't mean that you should write long links. Website designers are the hidden covariant here: clueless designers tend to use short text links like "more," "click here," and made-up words. Conversely, usability-conscious designers tend to explain the available options in user-centered language, emphasizing text and other content-rich design elements over more vaporous elements such as "smiling ladies." Many of these designers' links might indeed have a higher word count, but that's not why the designs work. Adding words won't make a bad design better; it'll simply make it more verbose.

Over-Simplified Analysis

To get good statistics, you must tightly control the experimental conditions — often so tightly that the findings don't generalize to real problems in the real world.

This is a common problem for university research, where the test subjects tend to be undergraduate students rather than mainstream users. Also, instead of testing real websites with their myriad contextual complexities, many academic studies test scaled-back designs with a small page count and simplified content.

For example, it's easy to run a study that shows breadcrumbs are useless: just give users directed tasks that require them to go in a straight line to the desired destination and stop there. Such users will (rightly) ignore any breadcrumb trail. Breadcrumbs are still recommended for many sites, of course. Not only are they lightweight, and thus unlikely to interfere with direct-movement users, but they're helpful to users who arrive deep within a site via search engines and direct links. Breadcrumbs give these users context and help users who are doing comparisons by offering direct access to higher levels of the information architecture.

Usability-in-the-large is often neglected by narrow research that doesn't consider, for example, revisitation behavior, search engine visibility, and multi-user decision-making. Many such issues are essential for the success of some of the highest-value designs, such as B2B websites and enterprise applications on intranets.

Distorted Measurements

It's easy to prejudice a usability study by helping the users at the wrong time or by using the wrong tasks. In fact, you can prove virtually anything you want if you design the study accordingly. This is often a factor behind "sponsored" studies that purport to show that one vendor's products are easier to use than a competitor's products.

Even if the experimenters aren't fraudulent, it's easy to get hoodwinked by methodological weaknesses, such as directing the users' attention to specific details on the screen. The very fact that you're asking about some design elements rather than others makes users notice them more and thus changes their behavior.

One study of online advertising attempted to avoid this mistake, but simply made another one instead. The experimenters didn't overtly ask users to comment on the ads. Instead, they asked users to simply comment on the overall design of a bunch of Web pages. After the test session, the experimenters measured users' awareness of various brands, resulting in high scores for companies that ran banners on the Web pages in the study.

Does this study prove that banner ads work for branding, even though they don't work for getting qualified sales leads? No. Remember that users were directed to comment on the page designs. These instructions obviously made users look around the page much more thoroughly than they would have during normal Web use. In particular, someone who's judging a design typically inspects all individual design elements on the page, including the ads.

Many Web advertising studies are misleading, possibly because most such studies come from advertising agencies. The most common distortion is the novelty effect: whenever a new advertising format is introduced, it's always accompanied by a study showing that the new type of ad generates more user clicks. Sure, that's because the new format enjoys a temporary advantage: it gathers user attention simply because it's new and users have yet to train themselves to ignore it. The study might be genuine as far as it goes, but it says nothing about the new advertising format's long-term advantages once the novelty effect wears off.

Publication Bias

Editors follow the "man bites dog" principle to highlight new and interesting stories. This is true for both scientific journals and popular magazines. While understandable, this preference for new and different findings imposes a significant bias in the results that get exposure.

Usability is a very stable field. User behavior is pretty much the same year after year. I keep finding the same results in study after study, as do many others. Every now and then, a bogus result emerges and publication bias ensures that it gets much more attention than it deserves.

Consider the question of Web page download time. Everyone knows that faster is better. Interaction design theory has documented the importance of response times since 1968, and this importance has been seen empirically in countless Web studies since 1995. E-commerce sites that speed up response times sell more. The day your server is slow, you lose traffic. (This happened to me recently: on January 14, Tog got "slashdotted"; because we share a server, my site lost 10% of its normal pageviews for a Wednesday when AskTog's increased traffic slowed useit.com down.)

If twenty people study download times, nineteen will conclude that faster is better. But again: one of every twenty statistical analyses will give the wrong result, and this one study might be widely discussed simply because it's new. The nineteen correct studies, in contrast, might easily escape mention.

Judging Bizarre Results

Bizarre results are sometimes supported by seemingly convincing numbers. You can use the issues I've raised here as a sanity check: Did the study pull correlations out of a hat? Was it biased or overly narrow? Was it promoted purely because it's different? Or was it just a fluke?

Typically, you'll discover that deviant findings should be ignored. The broad concepts of human behavior in interactive systems are stable and easy to understand.

The exceptions usually turn out to be exactly that: exceptions.

Of course, sometimes a strange finding turns out to be revolutionary rather than illusory. This is rare, but it does happen. The key differentiators are whether the finding is repeatable and whether others can see it now that they know where to look.

In 1989, for example, I published a paper on discount usability engineering, stating that small, fast user studies are superior to larger studies, and that testing with about five users is typically sufficient. This was quite contrary to the prevailing wisdom at the time, which was dominated by big-budget testing. During the fifteen years since my original claim, several other researchers reached similar conclusions, and we developed a mathematical model to substantiate the theory behind my empirical observation. Today, almost everyone who does user testing has concluded that they learn most of what they'll ever learn with about five users.
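The article does not spell out the mathematical model. A commonly cited form is that n test users uncover a share 1 - (1 - L)^n of the usability problems, where L is the share a single user uncovers. The sketch below assumes that form and assumes L = 0.31, a figure often quoted for it; both are assumptions rather than values taken from this article, and L varies from project to project.

```python
ASSUMED_L = 0.31  # assumption: share of problems found by one test user


def share_of_problems_found(users: int, per_user_rate: float = ASSUMED_L) -> float:
    """Share of usability problems uncovered after testing `users` people,
    assuming each user independently finds a fixed share of the problems."""
    return 1.0 - (1.0 - per_user_rate) ** users


for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users: {share_of_problems_found(n):.0%} of problems found")
```

Under these assumptions the curve flattens quickly: each extra user mostly repeats what earlier users already revealed, which is the intuition behind testing with about five users.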

As another example, my conclusion that PDF documents are bad for online information access was supported by four different studies. We're finding the same problems in our newest study, so the conclusion holds across several years as well. I was initially hesitant to come out against online PDF, because it works so well in other contexts (most notably, downloading documents for printing, which is what it was designed for). As the evidence kept mounting, however, it became clear that the conclusion for online PDF was very different than for print PDF.

You might dismiss one study that concluded that the otherwise good PDF was actually bad online. But four or five studies constitute a trend, which much enhances the finding's credibility as a general phenomenon.

Quantitative Studies: Intrinsic Risks

All the reasons I've listed for quantitative studies being misleading indicate bad research; it's possible to do good quantitative research and derive valid insights from measurements. But doing so is expensive and difficult.

Quantitative studies must be done exactly right in every detail or the numbers will be deceptive. There are so many pitfalls that you're likely to land in one of them and get into trouble.

If you rely on numbers without insights, you don't have backup when things go wrong. You'll stumble down the wrong path, because that's where the numbers will lead.

Qualitative studies are less brittle and thus less likely to break under the strain of a few methodological weaknesses. Even if your study isn't perfect in every last detail, you'll still get mostly good results from a qualitative method that relies on understanding users and their observed behavior.

Yes, experts get better results than beginners from qualitative studies. But for quantitative studies, only the best experts get any valid results at all, and only then if they're extremely careful.
