Summary: In addition to being expensive, collecting usability metrics interferes with the goal of gathering qualitative insights to drive design decisions. As a compromise, you can measure users' ability to complete tasks. Success rates are easy to understand and represent usability's bottom line.
Numbers are powerful. They offer a simple way to communicate usability findings to a general audience. Saying, for example, that "Amazon.com complies with 72% of the e-commerce usability guidelines" is a much more specific statement than "Amazon.com has great usability, but they don't do everything right."
In a previous Alertbox, I discussed ways of measuring and comparing usability metrics like time on task. Such metrics are great for assessing long-term progress on a project: Does your site's usability improve by at least 20% per year? If not, you are falling behind relative to both the competition and the needs of the new, less technically inclined users who are coming online.
Unfortunately, there is a conflict between the need for numbers and the need for insight. Although numbers can help you communicate usability status and the need for improvements, the true purpose of usability is to set the design direction, not to generate numbers for reports and presentations. In addition, the best methods for usability testing conflict with the demands of metrics collection.
The best usability tests involve frequent small tests, rather than a few big ones. You gain maximum insight by working with 4-5 users and asking them to think out loud during the test. As soon as users identify a problem, you fix it immediately (rather than continue testing to see how bad it is). You then test again to see if the "fix" solved the problem.
Although small tests give you ample insight into how to improve design, such tests do not generate the sufficiently tight confidence intervals that traditional metrics require. Thinking aloud protocols are the best way to understand users' thinking and thus how to design for them, but the extra time it takes for users to verbalize their thoughts contaminates task time measures.
Thus, the best usability methodology is the one least suited for generating detailed numbers.
To collect metrics, I recommend using a very simple usability measure: the user success rate. I define this rate as the percentage of tasks that users complete correctly. This is an admittedly coarse metric; it says nothing about why users fail or how well they perform the tasks they did complete.
Nonetheless, I like success rates because they are easy to collect and a very telling statistic. After all, if users can't accomplish their target task, all else is irrelevant. User success is the bottom line of usability.
Success rates are easy to measure, with one major exception: How do we account for cases of partial success? If users can accomplish part of a task, but fail other parts, how should we score them?
Let's say, for example, that the users' task is to order twelve yellow roses to be delivered to their mothers on their mothers' birthdays. True task success would mean just that: Mom receives a dozen roses on her birthday. If a test user leaves the site in a state where this will occur, we can certainly score the task as a success. If the user fails to place any order, we can just as easily score the task as a failure.
But there are other possibilities as well. For example, a user might:
- order twelve yellow tulips, twenty-four yellow roses, or some other deviant bouquet;
- fail to specify a shipping address, and thus have the flowers delivered to their own billing address;
- specify the correct address, but the wrong date; or
- do everything perfectly except forget to specify a gift message to enclose with the shipment, so that Mom gets the flowers but has no idea who they are from.
Each of these cases constitutes some degree of failure (though if in the first instance the user openly states a desire to send, say, tulips rather than roses, you could count this as a success).
If a user does not perform a task as specified, you could be strict and score it as a failure. It's certainly a simple model: Users either do everything correctly or they fail. No middle ground. Success is success, without qualification.
However, I often grant partial credit for a partially successful task. To me, it seems unreasonable to give the same score (zero) to both users who did nothing and those who successfully completed much of the task. How to score partial success depends on the magnitude of user error.
In the flower example, we might give 80% credit for placing a correct order, but omitting the gift message; 50% credit for (unintentionally) ordering the wrong flowers or having them delivered on the wrong date; and only 25% credit for having the wrong delivery address. Of course, the precise numbers would depend on a domain analysis.
There is no firm rule for assigning credit for partial success. Partial scores are only estimates, but they still provide a more realistic impression of design quality than an absolute approach to success and failure.
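As a rough illustration of graded partial credit, the sketch below averages per-attempt credit using the percentages from the flower example. The outcome labels, the `CREDIT` table, and the sample list of attempts are hypothetical names and data chosen for this sketch; they are not taken from any actual study.

```python
# Credit weights from the flower example; the labels themselves are hypothetical.
CREDIT = {
    "success": 1.0,
    "missing_gift_message": 0.8,
    "wrong_flowers": 0.5,
    "wrong_date": 0.5,
    "wrong_address": 0.25,
    "failure": 0.0,
}


def success_rate(outcomes, credit=CREDIT):
    """Return the mean credit across all task attempts, as a fraction in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(credit[outcome] for outcome in outcomes) / len(outcomes)


# Made-up attempts, purely to show the calculation.
attempts = ["success", "missing_gift_message", "wrong_date", "failure"]
rate = success_rate(attempts)
print(f"success rate: {rate:.3f}")
```

Keeping the weights in one table makes it easy to revise them after a domain analysis without touching the scoring logic.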
The following table shows task success data from a study I recently completed. In it, we tested a fairly big content site, asking four users to perform six tasks.
Note: S = success, F = failure, P = partial success
In total, we observed 24 attempts to perform the tasks. Of those attempts, 9 were successful and 4 were partially successful. For this particular site, we gave each partial success half a point. In general, 50% credit works well if you have no compelling reasons to give different types of errors especially high or low scores.
In this example, the success rate was (9+(4*0.5))/24 = 46%.
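The same arithmetic can be written as a small helper. This is only a sketch: the function name and its parameters are assumptions made for illustration, with the 50% partial credit used in this study as the default.

```python
def simplified_success_rate(successes, partials, attempts, partial_credit=0.5):
    """Success rate where each partial success earns a fixed fraction of a point."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return (successes + partials * partial_credit) / attempts


# Figures from the study: 9 successes and 4 partial successes in 24 attempts.
study_rate = simplified_success_rate(9, 4, 24)
print(round(study_rate * 100))  # 46
```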
Simplified success rates are best used to provide a general picture of how your site supports users and how much improvement is needed to make the site really work. You should not get too hung up on the details of such numbers, especially if you're dealing with a small number of observations and a rough estimate of partial success scores. For example, if your site scored 46% and another site scored 47%, the other site is not necessarily better.
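To get a feel for how much uncertainty 24 observations leave, one option is to put an approximate confidence interval around the rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not a method prescribed in this article. It rests on two loud assumptions: that the partial-credit score can be treated as a simple proportion, and that the 24 attempts are independent, which is not strictly true when they come from only four users.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Approximate Wilson score interval for a proportion p_hat observed over n attempts.

    z=1.96 corresponds to a 95% confidence level.
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 11 points out of 24 attempts, as in the study above.
low, high = wilson_interval(11 / 24, 24)
print(f"{low:.0%} to {high:.0%}")
```

Under these assumptions the interval spans roughly 28% to 65%, so both 46% and 47% sit comfortably inside it and a one-point difference carries no real information.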
That a 46% success rate is not at all uncommon might provide some cold comfort. In fact, most websites score less than 50%. Given this, the average Internet user's experience is one of failure. When users try to do something on the Web for the first time, they typically fail.
Although metrics alone will not solve this problem, they give us a way to measure our progress toward better, more usable designs.