Traffic Log Patterns

by Jakob Nielsen on July 10, 2006

Summary: The relative popularity of a site's pages, the number of visitors referred by other sites, and the traffic from search queries continue to follow a Zipf distribution.


About 10 years ago, I showed that the popularity of a website's pages followed a power law. Briefly stated, a few pages on a website were extremely popular, a larger set was moderately popular, and the vast majority constituted the "long tail" of low-traffic pages.

Mathematically, the Zipf curve is a straight line on a double-logarithmic diagram when we plot pages with their popularity rank on the x-axis and their number of hits on the y-axis.
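A minimal sketch of what that straight line means, assuming the simplest Zipf form in which predicted hits equal the top page's hits divided by rank raised to an exponent. The function names and the default exponent of 1.0 are illustrative assumptions; the article does not state the slope of its fitted line.

```python
import math

def zipf_prediction(top_hits, rank, exponent=1.0):
    """Predicted hits at `rank`, given the hits of the rank-1 page.

    exponent=1.0 is an assumed default, not a value taken from the article.
    """
    return top_hits / rank ** exponent

def loglog_slope(top_hits, rank_a, rank_b, exponent=1.0):
    """Slope of the predicted curve between two ranks on log-log axes."""
    dy = (math.log10(zipf_prediction(top_hits, rank_b, exponent))
          - math.log10(zipf_prediction(top_hits, rank_a, exponent)))
    dx = math.log10(rank_b) - math.log10(rank_a)
    return dy / dx
```

Whichever two ranks are compared, the slope equals the negative exponent, which is why the curve is a straight line on log-log axes. Under the assumed exponent of 1.0, a top page with 261,024 pageviews implies a prediction of one pageview at rank 261,024.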

The old analyses showed that this same distribution also described both the number of incoming references to a website from other sites and the outgoing traffic from a company's employees.

Do these findings continue to hold today? The Web is now 2,200 times bigger, so traffic patterns might have changed. I decided to find out.

Page Popularity

The following chart shows useit.com traffic during a recent eight-week period. Each dot represents one page, and the pages are sorted by popularity. The most popular page (the homepage) got 261,024 pageviews.

For the most popular 350 pages, the empirical data follows the theory remarkably well. Thereafter, however, the data trails off and the next 700 pages have less traffic than predicted. Also, in theory, there should have been about a quarter-million additional pages with low traffic, but I simply haven't gotten around to writing that much.

Zipf distribution of the popularity of Web pages sorted by traffic rank

The small insert in the upper right shows the equivalent diagram from my analysis 10 years ago. It's uncanny how closely it resembles the new data. In particular, the old curve also trailed off toward the end and was missing a vast number of low-traffic pages.

The end of the "long tail" is absent from both diagrams because the sites haven't accumulated enough old content to have the expected number of low-traffic pages. Instead, they have a drooping tail. It might take hundreds of years for a site written by a single person (or even a single marketing department) to accumulate a quarter-million pages.

Incoming Traffic from Other Sites

The next diagram shows the number of visitors referred to useit.com through links from other websites during the eight-week period. Each dot represents one external site.

Zipf distribution of incoming referrals from other sites, sorted by traffic rank

In this case, the empirical data hugs the theory's red line all the way down to the x-axis. The difference here is that there's no lack of other sites that might link to useit.com, and a huge number of these sites are low-traffic blogs that only occasionally refer individual users.

The chart's one obvious exception to the theory is that the site that referred the most visitors accounted for many more visits than predicted. Google (depicted as an extra-large dot) referred 257,040 visitors; in theory, it should have referred only 52,479.

Google is five times as popular as the theory predicts, but this phenomenon could fade as other search engines catch up. Only time will tell.

Also, while Google is disproportionally important, the other 35,631 referring sites together accounted for 35% more traffic than Google did. Clearly, it's not a good idea to focus only on #1.

Search Engine Queries

To find useit.com, users employed a total of 110,399 different queries across various search engines. Of these queries, 83% were used only once during the eight-week period.

The top 10 queries accounted for 10% of the total traffic, so each one of these queries is obviously more important than those that brought only one visitor. Taken together, however, the single-use queries accounted for three times as much traffic as the top 10 queries. This statistic shows the folly of focusing search engine optimization solely on a few high-performing queries. Your site must be found when users enter relevant queries -- and the possibilities are typically vast.
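The head-versus-tail comparison above can be reproduced from any list of per-query visit counts. The following is a rough sketch; the function name, the `top_n` parameter, and the result keys are hypothetical, and it assumes each entry is the number of visits brought by one distinct query.

```python
def traffic_shares(visits_per_query, top_n=10):
    """Compare the traffic share of the top queries with that of single-use queries."""
    counts = sorted(visits_per_query, reverse=True)
    total = sum(counts)
    single_use = [c for c in counts if c == 1]
    return {
        # Share of all search visits brought by the top_n queries.
        "top_share": sum(counts[:top_n]) / total,
        # Share of all search visits brought by queries used exactly once.
        "single_use_share": len(single_use) / total,
        # Fraction of distinct queries that were used exactly once.
        "single_use_fraction_of_queries": len(single_use) / len(counts),
    }
```

If `single_use_share` is well above `top_share`, as in the data described here, optimizing only for the top queries would ignore most of the search traffic.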

The following diagram shows the distribution of search engine queries, sorted by the number of incoming users who arrived after searching for that string.

Zipf distribution of the search queries sorted by traffic rank

This chart shows roughly the expected distribution, but with a hump for queries #5–300. In other words, queries in the middle range are more important than the theory predicts. Sample queries in this range include "response time", "open link in new window", "teenagers", "site maps", "eye tracking", and "link color". Useit.com is not particularly focused on any of these topics, which is why they're not at the top of the list. For each query, however, I have at least one good related article.

In general, my site covers a broad range of topics in some depth, which is probably why it has this hump of mid-range queries with extra traffic.

As we proceed down the list of query popularity, the queries become longer and longer. The following diagram shows the number of query words for each of the first nineteen groups of a thousand queries. (That is, queries #1-1,000 are first, followed by queries #1,001-2,000, and so on through queries #18,001-19,000.)

Length of search queries by traffic cohort

Single-word queries are fairly common among the first thousand queries (i.e., those that generated the most traffic), but drop off quickly. Conversely, four- and five-word queries are rare among the most popular queries, but make up a big proportion of queries starting at about #7,000.
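The cohort breakdown can be sketched as follows, assuming the queries are already sorted from most to least traffic and that words are separated by whitespace. The function name and the `cohort_size` parameter are illustrative, not taken from the original analysis.

```python
from collections import Counter

def length_by_cohort(ranked_queries, cohort_size=1000):
    """For each cohort of ranked queries, return the share of queries per word count.

    ranked_queries: query strings sorted most popular first (an assumption).
    Returns one dict per cohort, mapping word count -> fraction of that cohort.
    """
    cohorts = []
    for start in range(0, len(ranked_queries), cohort_size):
        group = ranked_queries[start:start + cohort_size]
        lengths = Counter(len(query.split()) for query in group)
        cohorts.append({n: c / len(group) for n, c in sorted(lengths.items())})
    return cohorts
```

For example, with `cohort_size=2` and the ranked list `["usability", "eye tracking", "radio buttons and check boxes", "horizontal scrollbar in html"]`, the first cohort is half one-word and half two-word queries, and the second is half five-word and half four-word queries.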

This shows the importance of considering longer queries in your search engine marketing: Multiple-word queries are the best way to capture the vast range of user interests.

In this case, long queries in the #7,000 traffic range included "radio buttons and check boxes" (five words) and "horizontal scrollbar in html" (four words).

In 10 Years, Almost No Change

In comparing the new data with data from 10 years ago, the biggest finding is that the curves look almost the same. Several measures of Web traffic followed a Zipf curve in 1996, and they still do.

The two exceptions are both search related:

  • A single search engine is disproportionally popular. Is this a temporary fluke or a fundamental change in the Internet's fabric? Check back in 10 years!
  • Users today enter more long queries. (I discuss this trend in more detail in my course on Fundamental Guidelines for Web Usability.)

Mainly, though, the big patterns of Web use remain remarkably robust. This is explained by the same phenomenon that dictates the long-term durability of usability guidelines. In both cases, conclusions are independent of changes in technology or fashion. Rather, they are due to the fundamental nature of human behavior.

Knowing that a single distribution describes these many forms of Web use can help you analyze your own log files: plot your statistics on a log-log scale and see if they fall on a straight line. If yes, your site follows the theory. If no, see where you deviate: in the head, the middle, or the tail? Above or below the prediction line? Any deviations help you understand ways in which your traffic is different from the norm. These insights may also help you spot opportunities for growth.
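One way to run this check without a plotting tool is to fit a line to the log-log points and compare each rank against it. This is a minimal sketch using ordinary least squares, which is an assumed choice and not necessarily how the prediction lines in the charts above were drawn. The function names are hypothetical, and the input is assumed to be positive hit counts sorted from most to least popular.

```python
import math

def fit_loglog(hits):
    """Least-squares line through (log10 rank, log10 hits).

    hits: positive counts sorted in descending order, at least two values.
    Returns (slope, intercept) of the fitted line in log-log space.
    """
    xs = [math.log10(rank) for rank in range(1, len(hits) + 1)]
    ys = [math.log10(h) for h in hits]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def deviations(hits, slope, intercept):
    """Ratio of actual to predicted hits per rank: above 1 is over the line, below 1 is under."""
    return [h / 10 ** (intercept + slope * math.log10(rank))
            for rank, h in enumerate(hits, start=1)]
```

Ratios well above 1 at the first few ranks point to an oversized head, a run of ratios above 1 in the middle ranks corresponds to a hump, and ratios falling toward 0 at the end indicate a drooping tail. Because a fit over all ranks can be pulled down by a drooping tail, fitting only the head and then computing deviations for every rank may give a cleaner reference line.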

