Putting A/B Testing in Its Place

by Jakob Nielsen on August 15, 2005

Summary: Measuring the live impact of design changes on key business metrics is valuable, but often creates a focus on short-term improvements. This near-term view neglects bigger issues that only qualitative studies can find.


In A/B testing, you unleash two different versions of a design on the world and see which performs the best. For decades, this has been a classic method in direct mail, where companies often split their mailing lists and send out different versions of a mailing to different recipients. A/B testing is also becoming popular on the Web, where it's easy to make your site show different page versions to different visitors.

Sometimes, A and B are directly competing designs and each version is served to half the users. Other times, A is the current design and serves as the control condition that most users see. In this scenario, B, which might be more daring or experimental, is served only to a small percentage of users until it has proven itself.

Finally, in multivariate tests, you vary multiple design elements at the same time, but the main issues are the same as with the more common A/B tests. For simplicity, I'll use the term "A/B" to refer to any study where you measure design alternatives by feeding them live traffic, regardless of the number of variables being tested.

Benefits

Compared with other methods, A/B testing has four huge benefits:

  • As a branch of website analytics, it measures the actual behavior of your customers under real-world conditions. You can confidently conclude that if version B sells more than version A, then version B is the design you should show all users in the future.
  • It can measure very small performance differences with high statistical significance because you can throw boatloads of traffic at each design. The sidebar shows how you can measure a 1% difference in sales between two designs.
  • It can resolve trade-offs between conflicting guidelines or qualitative usability findings by determining which one carries the most weight under the circumstances. For example, if an e-commerce site prominently asks users to enter a discount coupon, user testing shows that people will complain bitterly if they don't have a coupon because they don't want to pay more than other customers. At the same time, coupons are a good marketing tool, and usability for coupon holders is obviously diminished if there's no easy way to enter the code. When e-commerce sites have tried A/B testing with and without coupon entry fields, overall sales typically increased by 20-50% when users were not prompted for a coupon on the primary purchase and checkout path. Thus, the general guideline is to avoid prominent coupon fields. Still, your site might be among the exceptions, where coupons help more than they hurt. You can easily find out by doing your own A/B testing under your own particular circumstances.
  • It's cheap: once you've created the two design alternatives (or the one innovation to test against your current design), you simply put both of them on the server and employ a tiny bit of software to randomly serve each new user one version or the other. Also, you typically need to cookie users so that they'll see the same version on subsequent visits instead of suffering fluctuating pages, but that's also easy to implement. There's no need for expensive usability specialists to monitor each user's behavior or analyze complicated interaction design questions. You just wait until you've collected enough statistics, then go with the design that has the best numbers.
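As a rough illustration, the "tiny bit of software" mentioned in the last point could look something like the Python sketch below. The cookie name, the function name, and the `b_share` parameter are all assumptions made for this example; they do not come from any particular server framework. Setting `b_share` to a small value covers the case where the experimental version is served to only a small percentage of users.

```python
import random

COOKIE_NAME = "ab_variant"  # hypothetical cookie name, chosen for this sketch


def choose_variant(cookies, b_share=0.5, rng=random):
    """Pick the design version for one request.

    cookies: dict of cookies the browser sent with the request.
    b_share: fraction of new users who should see version B.
    Returns (variant, cookie_to_set); cookie_to_set is None for
    returning visitors who already carry an assignment.
    """
    existing = cookies.get(COOKIE_NAME)
    if existing in ("A", "B"):
        # Returning visitor: keep showing the version they saw before.
        return existing, None
    # New visitor: assign at random, then remember the choice in a cookie.
    variant = "B" if rng.random() < b_share else "A"
    return variant, {COOKIE_NAME: variant}


# A new visitor gets a random version; with b_share=0.1 roughly 10% see B.
variant, to_set = choose_variant({}, b_share=0.1)
# On the next visit the stored cookie wins, so the page does not fluctuate.
same_variant, _ = choose_variant({COOKIE_NAME: variant}, b_share=0.1)
```

A real site would also log which version each user saw alongside the action being counted, so the two groups can be compared later.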

Limitations

With these clear benefits, why don't we use A/B testing for all projects? Because the downsides usually outweigh the upsides.

First, A/B testing can only be used for projects that have one clear, all-important goal, that is, a single KPI (key performance indicator). Furthermore, this goal must be measurable by computer, by counting simple user actions. Examples of measurable actions include:

  • Sales for an e-commerce site.
  • Users subscribing to an email newsletter.
  • Users opening an online banking account.
  • Users downloading a white paper, asking for a salesperson to call, or otherwise explicitly moving ahead in the sales pipeline.

Unfortunately, it is rare that such actions are a site's only goal. Yes, for e-commerce, the amount of dollars collected through sales is probably paramount. But sites that don't close sales online can't usually say that a single desired user action is the only thing that counts. Yes, it's good if users fill in a form to be contacted by a salesperson. But it's also good if they leave the site feeling better about your product and place you on their shortlist of companies to be contacted later in the buying process, particularly for B2B sites. If, for example, your only decision criterion is to determine which design generates the most white paper downloads, you risk undermining other parts of your business.

For many sites, the ultimate goals are not measurable through user actions on the server. Goals like improving brand reputation or supporting the company's public relations efforts can't be measured by whether users click a specific button. Press coverage resulting from your online PR information might be measured by a clippings service, but it can't tell you whether the journalist visited the site before calling your CEO for a quote.

Similarly, while you can easily measure how many users sign up for your email newsletter, you can't assess the equally important issue of how they read your newsletter content without observing subscribers as they open the messages.

A second downside of A/B testing is that it only works for fully implemented designs. It's cheap to test a design once it's up and running, but we all know that implementation can take a long time. Before you can expose it to real customers on your live website, you must fully debug an experimental design. A/B testing is thus suitable for only a very small number of ideas.

In contrast, paper prototyping lets you try out several different ideas in a single day. Of course, prototype tests give you only qualitative data, but they typically help you reject truly bad ideas quickly and focus your efforts on polishing the good ones. Much experience shows that refining designs through multiple iterations produces superior user interfaces. If each iteration is slow or resource-intensive, you'll have too few iterations to truly refine a design.

A possible compromise is to use paper prototyping to develop your ideas. Once you have something great, you can subject it to A/B testing as a final stage to see whether it's truly better than the existing site. But A/B testing can't be the primary driver on a user interface design project.

Short-Term Focus

A/B testing's driving force is the number being measured as the test's outcome. Usually, this is an immediate user action, such as buying something. In theory, there's no reason why the metric couldn't be a long-term outcome, such as total customer value over a five-year period. In practice, however, such long-term tracking rarely occurs. Nobody has the patience to wait years before they know whether A or B is the way to go.

Basing your decisions on short-term numbers, however, can lead you astray. A common example: Should you add a promotion to your homepage or product pages? Unless you're promoting something relevant to a user's current need, every promotion you add clutters the pages and lowers the site's usability.

When I point out the usability problems with promotions, I typically get the counter-argument that promotions generate extra sales of the target item or service. Yes: any time you give something prominent placement, it'll sell more. The question is whether doing so hurts your site in other ways.

Sometimes, an A/B test can help you here, if you examine the impact on overall sales, not just sales of the promoted product. Other times, A/B tests will fail you if the negative impact doesn't occur immediately. A cluttered site is less pleasant to use, for example, and might reduce customer loyalty. Although customers might make their current purchases, they might also be less likely to return. However small, such an effect would gradually siphon off customers as they seek out other, better sites. (This is how more-cluttered search engines lost to Google over a four-year period.)
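To make the "overall sales, not just sales of the promoted product" comparison concrete, here is a minimal Python sketch. The record layout, the product names, and every number in it are invented for illustration; the point is only that both figures are computed per visitor for each version.

```python
def revenue_per_visitor(orders, visitors, promoted_sku):
    """Compare promoted-item revenue with overall revenue, per visitor.

    orders: iterable of (variant, sku, amount) tuples (hypothetical layout).
    visitors: dict mapping variant -> number of visitors who saw it.
    """
    totals = {v: {"promoted": 0.0, "overall": 0.0} for v in visitors}
    for variant, sku, amount in orders:
        totals[variant]["overall"] += amount
        if sku == promoted_sku:
            totals[variant]["promoted"] += amount
    # Divide by traffic so versions with unequal exposure stay comparable.
    return {
        variant: {key: value / visitors[variant] for key, value in sums.items()}
        for variant, sums in totals.items()
    }


# Made-up data: version B carries a promotion for "widget".
visitors = {"A": 100, "B": 100}
orders = [
    ("A", "widget", 50.0), ("A", "gadget", 150.0),
    ("B", "widget", 90.0), ("B", "gadget", 80.0),
]
result = revenue_per_visitor(orders, visitors, promoted_sku="widget")
print(result)
```

In this invented data set, the promotion lifts sales of the promoted item while overall revenue per visitor falls, which is the pattern a test that only counts the promoted product would miss. Delayed effects such as reduced loyalty would still not show up in either number.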

No Behavioral Insights

The biggest problem with A/B testing is that you don't know why you get the measured results. You're not observing the users or listening in on their thoughts. All you know is that, statistically, more people performed a certain action with design A than with design B. Sure, this supports the launch of design A, but it doesn't help you move ahead with other design decisions.

Say, for example, that you tested two sizes of Buy buttons and discovered that the big button generated 1% more sales than the small button. Does that mean that you would sell even more with an even bigger button? Or maybe an intermediate button size would increase sales by 2%. You don't know, and to find out you have no choice but to try again with another collection of buttons.

Of course, you also have no idea whether other changes might bring even bigger improvements, such as changing the button's color or the wording on its label. Or maybe changing the button's page position or its label's font size, rather than changing the button's size, would create the same or better results. Basically, you know nothing about why button B was not optimal, which leaves you guessing about what else might help. After each guess, you have to implement more variations and wait until you collect enough statistics to accept or reject the guess.
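The "collect enough statistics" step can be sketched as a significance test on the two conversion rates. The Python example below uses a two-proportion z-test, which is an assumption on my part: it is one common choice, not a method this article prescribes. The traffic and sales counts are made up to represent a 1% relative difference (2.00% versus 2.02% conversion).

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of users who converted under each design.
    n_a, n_b: number of users who saw each design.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the same 1% relative lift at two traffic levels.
z_small, p_small = two_proportion_z_test(200, 10_000, 202, 10_000)
z_large, p_large = two_proportion_z_test(200_000, 10_000_000, 202_000, 10_000_000)
print(f"10,000 users per design:     z={z_small:.2f}, p={p_small:.3f}")
print(f"10,000,000 users per design: z={z_large:.2f}, p={p_large:.4f}")
```

With these made-up counts, the small sample cannot distinguish the two buttons, while the large one can. That is why each new guess about button size, color, or wording means another long wait for traffic.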

Worst of all, A/B testing provides data only on the element you're testing. It's not an open-ended method like user testing, where users often reveal stumbling blocks you never would have expected. It's common, for example, to discover problems related to trust, where users simply don't want to do business with you because your site undermines your credibility.

Bigger issues like trust or uninformative product information often have effect sizes of 100% or more, meaning that your sales would double if such problems were identified and fixed. If you spend all your time fiddling with 1-2% improvements, you can easily overlook the 100% improvements that come from qualitative insights into users' needs, desires, and fears.

Combining Methods

A/B testing has more problems than benefits. You should not make it the first method you choose for improving your site's conversion rates. And it should certainly never be the only method used on a project. Qualitative observation of user behavior is faster and generates deeper insights. Also, qualitative research is less subject to the many errors and pitfalls that plague quantitative research.

A/B testing does have its own advantages, however, and provides a great supplement to qualitative studies. Once your company's commitment to usability has grown to a level where you're regularly conducting many forms of user research, A/B testing definitely has its place in the toolbox.

(More on this angle in the full-day course on combining UX methods and analytics methods like A/B testing.)

