Zipf curves follow a straight line when plotted on a double-logarithmic diagram. The figure shows a simple dataset with 300 elements that follow a Zipf distribution. Note how the line connecting the datapoints is straight in the right diagram (with logarithmic scales on both axes). Most of the plots you are used to seeing have linear scales, so for the sake of comparison, the left diagram in the figure has the same datapoints plotted on linear scales.
[Figure: left diagram, linear scales on both axes; right diagram, logarithmic scales on both axes]
It is clear from the figure that Zipf curves have a tendency to hug the axes of the diagram when plotted on linear scales. This is why we usually plot them on double-logarithmic diagrams, even though most people are not used to interpreting such diagrams. A simple description of data that follow a Zipf distribution is that they have
- a few elements that score very high (the left tail in the diagrams)
- a medium number of elements with middle-of-the-road scores (the middle part of the diagrams)
- a huge number of elements that score very low (the right tail in the diagrams)
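To make the straight-line property concrete, here is a minimal Python sketch. It assumes the idealized form of a Zipf distribution, where the score of an element is inversely proportional to its rank raised to some exponent; the exponent of 1 and the top score of 1,000 are illustrative choices of mine, not values taken from the dataset in the figure.

```python
import math

def zipf_scores(n, exponent=1.0, top_score=1000.0):
    """Score at each rank 1..n under an idealized Zipf law (assumed form)."""
    return [top_score / rank ** exponent for rank in range(1, n + 1)]

def loglog_slope(scores, rank_a, rank_b):
    """Slope of the line through two ranks on double-logarithmic axes."""
    rise = math.log10(scores[rank_b - 1]) - math.log10(scores[rank_a - 1])
    run = math.log10(rank_b) - math.log10(rank_a)
    return rise / run

# 300 elements, matching the size of the example dataset.
scores = zipf_scores(300)

# On log-log axes the slope is the same at the head and in the tail,
# which is what "a straight line" means.
head_slope = loglog_slope(scores, 1, 10)
tail_slope = loglog_slope(scores, 100, 300)
print(head_slope, tail_slope)

# On linear axes the same numbers hug the axes: the score falls to a
# tenth of its top value by rank 10 and then creeps toward zero.
print(scores[0], scores[9], scores[299])
```

Under these assumptions, both slopes equal the negative of the exponent, so a steeper line on the double-logarithmic diagram simply means a larger exponent.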
Zipf distributions have been shown to characterize use of words in a natural language (like English) and the popularity of library books, so typically
- a language has a few words ("the", "and", etc.) that are used extremely often, and a library has a few books that everybody wants to borrow (current bestsellers)
- a language has quite a lot of words ("dog", "house", etc.) that are used fairly often, and a library has a good number of books that many people want to borrow (crime novels and such)
- a language has an abundance of words ("Zipf", "double-logarithmic", etc.) that are almost never used, and a library has piles and piles of books that are only checked out every few years (reference manuals for Apple II word processors, etc.)
Much available data suggests that Web use follows a Zipf distribution. The figure shows the distribution of incoming page requests to www.sun.com during a one-month period last year. Each datapoint represents one page, with the x-axis showing pages sorted according to popularity: the first page is the most popular one (the home page), the second page is the one that received the second-most requests that month, and so on until we reach page number 10,000, which was only requested a single time that month. The heavy line shows the actual empirical data from the log files, and the thin red line shows a Zipf curve that seems to fit the data quite well except at the low end. The deviation at the low end is due to a variety of factors, including the fact that the site is not yet old enough to have accumulated enough pages of low-frequency interest.
Comparing empirical log data from Sun's website with a theoretical Zipf distribution.
Note the use of a double-logarithmic scale.
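One simple way to draw a comparison curve like the thin line in the figure is to fit a straight line to the data in log-log space. The sketch below does this with ordinary least squares in plain Python. It is a rough, eyeball-level fit under my own assumptions, not necessarily the method used for the figure, and the request counts are made up for illustration; they are not Sun's log data.

```python
import math

def fit_zipf(request_counts):
    """Least-squares line through (log rank, log count).

    Returns (exponent, scale) for the curve count = scale / rank**exponent.
    All counts must be greater than zero.
    """
    counts = sorted(request_counts, reverse=True)  # rank 1 = most popular
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Hypothetical counts: an exact Zipf curve with exponent 1 over 1,000 pages.
hypothetical_counts = [10000 / rank for rank in range(1, 1001)]
exponent, scale = fit_zipf(hypothetical_counts)
print(exponent, scale)
```

With real log files, you would expect the fitted line to drift away from the data at the low end, for the reasons discussed above, so it can be worth fitting only the ranks where the data actually look straight.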
The figure shows incoming page requests to a single site. Other studies have found that Zipf curves characterize the outgoing page requests from the employees of an organization (there are a few pages that everybody looks at and a large number of pages that are seen only once). It also seems that the distribution of hypertext references on the Web follows a Zipf distribution, in two senses: there are a few sites that everybody links to and many sites that almost nobody links to, and any given site gets much of its referral traffic from a few other sites while receiving small amounts of traffic from a vast variety of other sites. Participation in Usenet discussion groups seems to follow a Zipf distribution as well (a few people post most of the messages and many people post very sparingly).
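The "few people post most of the messages" pattern falls out of the arithmetic. As a rough sketch, assume a hypothetical discussion group with 10,000 participants whose posting volume follows an idealized Zipf law with exponent 1 (both numbers are illustrative assumptions, not measurements):

```python
def top_share(population, top_fraction, exponent=1.0):
    """Fraction of all messages written by the most active posters."""
    # Relative posting volume at each rank under an assumed Zipf law.
    weights = [1.0 / rank ** exponent for rank in range(1, population + 1)]
    top_count = max(1, round(population * top_fraction))
    return sum(weights[:top_count]) / sum(weights)

share = top_share(population=10_000, top_fraction=0.01)
print(f"Top 1% of posters write {share:.0%} of the messages")
```

Under these assumptions, the most active 1% of participants account for a little over half of all messages, while the remaining 99% share the rest.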
Update: More Evidence
See also my 2003 article, Diversity is Power for Specialized Sites, for additional information about the Web's Zipf distribution (and the Zipf distribution of weblogs).