Usability Metrics and Methodologies: Trip Report from a British Computer Society Seminar, London, 21 May 1990

by Jakob Nielsen on May 22, 1990

Summary: Seminar participants discussed frameworks for usability evaluation, task analysis strategies, and measurement scales.

The British Computer Society's human-computer interaction specialist group sponsored a one-day seminar on usability metrics and methodologies in connection with its 1990 annual meeting in London. The seminar was organized by Nigel Bevan from the National Physical Laboratory and attracted an audience of about 80 people. The lively discussions during the seminar and the tea breaks indicated substantial interest in the issues of usability engineering.

There is obviously a trend towards taking usability methodology seriously as an object of study in itself in order to improve our ability to choose the right method for the job at hand. My own work on "discount usability engineering" is one exponent of this trend, and my presentation at the seminar was based on a study of the effects of expertise on the quality of a usability evaluation. It does turn out that people are better at finding usability problems in an interface the more they know about usability principles. This may come as no particular surprise, but this type of precise understanding of the tradeoffs in using different personnel categories for usability work is essential for practical usability engineering.

As another example of the discount approach, Ralph Rengger from the National Physical Laboratory discussed the number of test subjects needed for user testing and mentioned that the lists of usability problems encountered by the first five test subjects and the first ten subjects looked very similar. So for most practical tests, there is really no need to run a large number of subjects.

Adaptable Systems Coping with Change

The first speaker was John Brooke from Digital. His basic premise (based on his work in the ISO usability standards group) was that usability cannot be defined without knowing the context in which the system will be used. There are no absolute criteria for usability and there are no bad systems, only systems applied in the wrong context. Furthermore, Brooke emphasized that it is not enough to understand the users, their tasks, and their environment, since the only constant is change itself. The users, tasks, and the environment will change over time (if nothing else, then as a result of using the new system). As a matter of fact, the environment is also changing because people do not spend all their time in their office. They also work at home, in the airport, etc.

Because of this substantial potential for change, Brooke felt that the system itself must be capable of change, unless, of course, its designer is omniscient and can predict all the changes. Since designers are not omniscient, he advocated an ecological metaphor for systems development: The survival, change, and adaptation of the fittest through some kind of "macro-selection" over time.

Brooke described two models for achieving ecological change of systems:

  1. Adaptive systems that modify themselves through, for example, building a dynamic model of the user. He was very sceptical about this approach since he had never seen a non-trivial system of this kind that worked.
  2. The Darwinist approach of having adaptable systems constructed such that external agents can bring about change. This is done in some current systems through programming such as the editing of DOS batch files, but this approach means that the people who need the changes often cannot make them without outside help.

Brooke's project at DEC was to develop an "ecology management system" called DB_builder to allow users to construct their working environment by putting together elements of the environments used by other people in their organization. To construct these "virtual applications," users only need knowledge of their tasks and not of the underlying implementation. Users can create environments either by reference to specific other users ("let me have such-and-such feature from John") or by reference to general user groups ("give me a secretarial environment").

With the DB_builder, innovations spread through cultural diffusion as users acquire the tools, techniques, and experiences of each other. Brooke and his team assume that there is no single end-point to the development of a system: It will continue to grow. By the way, this diffusion principle seems similar to that used by the user-tailorable "buttons" project at Xerox EuroPARC, described by Allan MacLean at CHI'90.

Analyzing Video Tapes

Another tool-based approach to usability was presented by Paul Walsh and Jackie Laws from STC Technology Ltd. (STL). Their company has built a usability laboratory complete with mirrors and video cameras. To help analyze the video tapes from their usability tests, they have built a computerized tool running on a Macintosh II with a video board. The tool allows the usability engineer to annotate the tape with codes for various events during a test and to retrieve specific video clips quickly.

STL is not currently selling their video analysis tool but they are "definitely interested in developing it further." I talked to others at the seminar who were very interested in buying such a tool, so there seems to be a good market opportunity in making a human factors video analysis product commercially available. Walsh and Laws claimed that the use of their new tool had reduced the time required for analyzing the video of a usability test from ten times the duration of the test to only three times the duration. Even so, a factor of three still seems like a lot of time to me.

In the discussion session, John Brooke mentioned that he had also set up a usability lab with all the works when he came to DEC but that this lab is now falling into disuse as they conduct more and more of their testing on a contextual basis in the field. Brian Shackel added that the main reason for usability labs was to demonstrate to developers that users have real problems. We will still need the usability labs for this purpose every five years or so, when the next generation of eager computerists enters industry and needs convincing.

A Framework for Evaluation Methods

The traditional approach to user interface evaluation involves iterative design and experimental user testing. Andy Whitefield from University College London mentioned that these methods can sometimes be difficult to use: For example, iterative design is not suitable for safety-critical systems developed using formal specification methods, and user testing is difficult if the intended users are in another part of the world.

Because of these and other issues, usability methods vary widely in software development projects, and Whitefield wanted to progress beyond the stage where the choice of methods depended solely on the experience of the human factors specialist. He advocated work towards an increased understanding of the nature of usability engineering, and he presented a new framework for evaluation methods.

First, he defined human factors evaluation as an assessment of the conformity between a system's performance and its desired performance, where

"System" = user+computer+task+environment.
"Performance" = quality of the task product in relation to the resource costs associated with producing this product.

Second, he wanted an "assessment statement" to include both a report of any measurement studies and a diagnosis identifying the reasons for any shortfalls in the interface. Such assessment statements can in principle be produced on the basis of an evaluation of the real (working) computer system or a representation of the system such as a mockup of screen designs or a specification. Also, the evaluation can be performed using real users or using some surrogate to represent the users. For example, a usability specialist can perform a walkthrough of the interface and try to estimate where users will have trouble. Since the two evaluation components (computer and users) can be either real or representational independently of each other, Whitefield concluded that there are in principle four categories of evaluation methods:

                                 | Representational users           | Real users
Representational computer system | Analytic methods                 | User reports such as surveys, questionnaires
Real computer system             | Specialist reports, walkthroughs | Observational methods, experiments
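
Whitefield's two-by-two classification can be written down as a small lookup. The following Python sketch is only an illustration of the framework as reported here; the names Fidelity, EVALUATION_METHODS, and method_category are my own assumptions and do not come from his presentation.

```python
from enum import Enum


class Fidelity(Enum):
    """Whether an evaluation component is the real thing or a stand-in."""
    REPRESENTATIONAL = "representational"
    REAL = "real"


# Keys are (computer system, users); values are the method categories
# from the table above. The identifiers are hypothetical.
EVALUATION_METHODS = {
    (Fidelity.REPRESENTATIONAL, Fidelity.REPRESENTATIONAL): "analytic methods",
    (Fidelity.REPRESENTATIONAL, Fidelity.REAL): "user reports such as surveys, questionnaires",
    (Fidelity.REAL, Fidelity.REPRESENTATIONAL): "specialist reports, walkthroughs",
    (Fidelity.REAL, Fidelity.REAL): "observational methods, experiments",
}


def method_category(computer: Fidelity, users: Fidelity) -> str:
    """Look up the category of evaluation method for a given combination."""
    return EVALUATION_METHODS[(computer, users)]


# A specialist walking through a working system: real computer, surrogate users.
print(method_category(Fidelity.REAL, Fidelity.REPRESENTATIONAL))
```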

Identifying Key Elements of a Task Hierarchy

To decrease the cost of a usability evaluation, Linda Macaulay from the University of Manchester stressed the need to focus on the users' key tasks. She based her work on a method called USTM (User Skills and Task Match), which Brian Shackel later mentioned originated in work by Ken Eason back in 1974.

The basic principle is to divide the user's work up into a task hierarchy as shown in the figure. Task 1 represents the user's entire work with the system, while tasks 2 and 3 represent the two hypothetical main components into which task 1 can be subdivided. To perform task 2, the user actually performs subtasks 4 and 5, and the task hierarchy can be further recursively subdivided until one reaches the primitive subtasks that can be directly mapped onto commands or other operations.

To specify usability targets for new computer systems or to test a design, one then identifies those subtasks that are of the highest importance for overall usability. For example, it might be the case that subtasks 4, 6, 7, and 8 are the ones first encountered by new users or that they are the ones most frequently performed by experienced users. In either case, it becomes important to obtain high usability ratings for these subtasks. Because tasks 6, 7, and 8 are constituent subtasks of task 4, one would normally only measure the usability of performing task 4 since the exact scores on tasks 6, 7, and 8 would be irrelevant. The performance on task 4 is the sum of the performance on its subtasks and the relative performance on the subtasks is therefore not important.

One could further imagine that, say, task 10 was a critical task because it formed a potential differentiator where we hoped our system would be different from, and better than, its competitors. If so, task 10 would also be classified as a key task where we would want to set specific usability targets.

[Figure: conceptual diagram of a task hierarchy]
The user's task is recursively subdivided into progressively more primitive constituent subtasks and one then identifies those key subtasks that are essential for the usability of the product.
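
The selection of key tasks described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the USTM method itself: the Task class, the tasks_to_measure function, the timing figures, and the placement of tasks 9 and 10 under task 3 are all assumptions made for the example. Only the relations 1 → (2, 3), 2 → (4, 5), and 4 → (6, 7, 8) are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Task:
    number: int
    subtasks: List["Task"] = field(default_factory=list)
    time_seconds: float = 0.0  # only used for primitive (leaf) tasks

    def total_time(self) -> float:
        """Performance on a task is the sum of performance on its subtasks."""
        if not self.subtasks:
            return self.time_seconds
        return sum(sub.total_time() for sub in self.subtasks)


def tasks_to_measure(root: Task, key_numbers: Set[int]) -> List[int]:
    """Return the key tasks to measure, skipping key tasks nested in another key task."""
    selected: List[int] = []

    def visit(task: Task) -> None:
        if task.number in key_numbers:
            selected.append(task.number)
            return  # its subtasks are covered by measuring this task
        for sub in task.subtasks:
            visit(sub)

    visit(root)
    return selected


# Hypothetical hierarchy with made-up times for the primitive tasks.
task4 = Task(4, [Task(6, time_seconds=20.0), Task(7, time_seconds=35.0), Task(8, time_seconds=5.0)])
task2 = Task(2, [task4, Task(5, time_seconds=40.0)])
task3 = Task(3, [Task(9, time_seconds=15.0), Task(10, time_seconds=25.0)])
task1 = Task(1, [task2, task3])

print(tasks_to_measure(task1, {4, 6, 7, 8, 10}))  # [4, 10]
print(task4.total_time())                          # 60.0
```

Tasks 6, 7, and 8 drop out of the list because measuring task 4 already covers them, which is the economy argued for above.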

Measurement Scales

Peter Wright from the University of York cited Lord Kelvin's statement that "when you cannot measure, your knowledge is of a meagre and unsatisfactory kind," and confessed to suffering from a case of "physics envy" because usability is so difficult to measure. A major problem is that a measurement result of, say, 12% decrease in task time might mean 12% better productivity, but could just as well mean 12% more time wasted on other activities or 12% more stress for the users. So we do not necessarily have a true ratio scale for usability measurements.

We also have difficulties with the simpler types of measurement scale. It may actually be hard to achieve a nominal scale for usability problems, even though nominal scales are the weakest form of measurement scale and only require us to be capable of deciding when two usability problems are the same. Unfortunately, it is hard to know to what extent two observations are really cases of the same underlying usability problem, and we do not even have a good enumeration of usability problems with which to match observations.

Using the somewhat more ambitious ordinal scales to measure usability problems, we face the counting problem. We would like to know whether one interface is better than another (the ordering of the ordinal scale) based on counting the frequencies of various usability problems in the two systems. Assume, for example, that we have measured the frequencies of usability problems A, B, and C as follows:

         | Problem A | Problem B | Problem C
System 1 |     1     |     1     |     8
System 2 |     2     |     1     |     1

It may initially seem as if System 2 is much better than System 1, but that is only true if Problem C is actually a key problem. It could be the case that Problem A is a usability catastrophe that keeps users from ever getting to use the system, whereas Problem C is a triviality that slows down users a second or so. Then it would obviously be preferable to use System 1. Unfortunately, usability comparisons are normally much less clear-cut than that.
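
The counting problem can be made concrete with a short Python sketch. The problem frequencies are those from the table above; the severity weights are purely hypothetical numbers chosen to represent the "catastrophe versus triviality" case, and nothing of the kind was presented at the seminar.

```python
# Frequencies of usability problems A, B, and C, from the table above.
problem_counts = {
    "System 1": {"A": 1, "B": 1, "C": 8},
    "System 2": {"A": 2, "B": 1, "C": 1},
}

# Hypothetical severity weights: A is a catastrophe, C is a triviality.
severity = {"A": 100.0, "B": 5.0, "C": 1.0}


def raw_count(counts):
    """Total number of problems, treating every problem as equally serious."""
    return sum(counts.values())


def weighted_cost(counts, weights):
    """Total cost when each problem occurrence is weighted by its severity."""
    return sum(n * weights[problem] for problem, n in counts.items())


for system, counts in problem_counts.items():
    print(system, raw_count(counts), weighted_cost(counts, severity))
# System 1 10 113.0
# System 2 4 206.0
```

With plain counting, System 2 looks better (4 problems against 10). With these assumed weights the ordering flips, so the ordinal ranking depends entirely on severity judgments that the raw frequencies do not contain.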
