Mark Bernstein recently sent me the following email:
Driving to work, I realized that I'm not sure I know the answer to the following word problem. I suspect that most webmasters don't know it either, and that it would be useful to know. Possible Alertbox column?
A word problem:
A. Beauregard Clump, boy webmaster, is reviewing his logs over the past weeks. The total number of hits that one of his pages received is as follows:
| Week | Hits |
|------|------|
| 1    | 120  |
| 2    | 132  |
| 3    | 116  |
| 4    | 120  |
| 5    | 148  |
- Is the increased traffic in week 5 statistically significant?
- Is the growth in weekly traffic throughout this period statistically significant?
Let us start by looking at the second question.
When analyzing numbers related to the growth of a website, I normally recommend looking at them on a logarithmic scale. The reason is that the Web and the Internet both experience exponential growth. Therefore, Web statistics are better analyzed in terms of growth rates than in terms of linear growth.
Admittedly, the Web's growth has slowed down since the phenomenal pace we experienced in 1993 and 1994, but it still doubles in size every year. New sites go live, new users come online, and traffic keeps climbing.
The diagram shows the reported traffic stats plotted with a logarithmic scale on the y-axis. With a logarithmic scale, an exponential growth curve shows as a straight line, and the best-fit growth curve has been added to the figure in red.
A regression analysis gives R² = 0.26, which means that 26% of the variance in the data seems to be due to an underlying growth in site traffic (the remaining 74% of the variance is random fluctuation). Unfortunately, the statistical significance of the regression is p = 0.37, which means that a trend at least this strong would show up 37% of the time even if there were no underlying growth at all but only random fluctuations in site traffic.
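The regression described above can be reproduced in a few lines. The following is a minimal sketch in Python, assuming an ordinary least-squares fit of the natural logarithm of weekly hits against the week number. The function names are illustrative, and the p-value helper relies on a closed-form expression that holds only for exactly 3 degrees of freedom (five data points minus two fitted parameters).

```python
import math

# Data from the word problem.
weeks = [1, 2, 3, 4, 5]
hits = [120, 132, 116, 120, 148]

def fit_log_linear(x, y):
    """Least-squares fit of ln(y) = intercept + slope * x.

    Returns (slope, intercept, r_squared, t_stat), where t_stat is the
    t statistic of the slope with len(x) - 2 degrees of freedom.
    """
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, log_y))
    syy = sum((yi - mean_y) ** 2 for yi in log_y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    t_stat = math.copysign(
        math.sqrt(r_squared * (n - 2) / (1 - r_squared)), slope
    )
    return slope, intercept, r_squared, t_stat

def two_sided_p_df3(t):
    """Two-sided p-value for Student's t with exactly 3 degrees of freedom."""
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

slope, intercept, r_squared, t_stat = fit_log_linear(weeks, hits)
p_value = two_sided_p_df3(t_stat)
print(f"R^2 = {r_squared:.2f}, p = {p_value:.2f}")
```

For datasets with a different number of weeks the closed-form p-value no longer applies, and a statistics package would be the more practical choice.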
Is a significance level of 37% low enough to conclude that the site is growing? Most scientists would say no, since they usually require a significance level of p < 0.05 or even p < 0.01 to conclude that a study supports a hypothesis. Since we are not looking for scientific truth, we can relax our requirements somewhat and conclude that the odds are in favor of slight growth.
The main reason to study Web growth is to plan server capacity and future business models. To do so, we need to know not just the most likely growth of the site, but also the possible range of growth. The most likely growth of the site comes from the best-fit regression curve in the chart: an annualized growth rate of 442%, meaning that traffic in a year will be about 662 pageviews per week.
When running a regression analysis, you can ask for a confidence interval, which is the range of possible values for the growth rate. I like working with 90% confidence intervals, meaning that the true value will fall outside the estimated range 10% of the time. For the sample site, the 90% confidence interval for the annualized growth rate is from -88% to 24,606%. This huge spread is a good indication that the data is too weak for any real conclusions. In other words, Beauregard Clump could easily be on a decline to 14 pageviews per week next year, or his page could explode to 30,000 weekly pageviews. We will need more data to narrow the range and plan the site's future.
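To make the step from a weekly regression slope to an annualized growth rate and its confidence interval concrete, here is a hedged sketch in Python. It assumes a least-squares fit on the natural-log scale, a year of 365/7 weeks, and the standard two-sided 90% critical value of Student's t for 3 degrees of freedom (about 2.353). The helper names are illustrative and not taken from any particular statistics package.

```python
import math

WEEKS_PER_YEAR = 365 / 7   # assumption: how a weekly rate is annualized
T_CRIT_90_DF3 = 2.353      # two-sided 90% critical t value, 3 degrees of freedom

weeks = [1, 2, 3, 4, 5]
hits = [120, 132, 116, 120, 148]

def slope_and_stderr(x, y):
    """Slope of ln(y) against x and the standard error of that slope."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, log_y))
    syy = sum((yi - mean_y) ** 2 for yi in log_y)
    slope = sxy / sxx
    residual_ss = syy - slope * sxy
    stderr = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, stderr

def annualized_growth(weekly_log_slope):
    """Convert a per-week slope on the log scale to a yearly growth fraction."""
    return math.exp(weekly_log_slope * WEEKS_PER_YEAR) - 1

slope, stderr = slope_and_stderr(weeks, hits)
margin = T_CRIT_90_DF3 * stderr
best = annualized_growth(slope)
low = annualized_growth(slope - margin)
high = annualized_growth(slope + margin)
print(f"best estimate {best:.0%}, 90% interval {low:.0%} to {high:.0%}")
```

Because the interval is computed on the log scale and then exponentiated over a full year, a modest uncertainty in the weekly slope turns into an enormous spread in the annual figure. The upper bound in particular is very sensitive to rounding.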
Example With More Data: www.useit.com
With data from more weeks it becomes possible to predict growth much more precisely. The following chart shows usage data from www.useit.com. Again, the data has been plotted on a logarithmic scale and fitted with a red regression line.
For this dataset, the exponential growth curve has a much better regression fit, with R² = 0.96 and p < 0.001. The best fit corresponds to an annualized growth rate of 505%, with a 90% confidence interval ranging from 433% to 588%. In other words, weekly traffic in February 1999 is most likely to be 253,000 pageviews but could be anywhere from 223,000 to 298,000.
In calculating the statistics for useit.com I eliminated the traffic data from the last two weeks in December and the first week in January (shown as lighter dots in the diagram). Traffic slows dramatically on many sites during the holiday season, and it is therefore best to discard the data from this period when calculating any long-term trends. The exception to this rule would obviously be any site that specifically aims to sell Christmas presents or provides other holiday services.
In the early years of the Web, it was also necessary to analyze traffic data differently during the summer months because the Web was dominated by university users who spent much of the summer away from their Internet accounts. Summer traffic is less exceptional now, both because of increased business use and because many students use other accounts when they are away from school. It may still be necessary to treat summer traffic differently when analyzing sites that are targeted at users in countries like Germany where people take long summer vacations and may not bring a laptop to the beach.
A final point is that it is best to analyze traffic data in weekly chunks. Looking at daily traffic introduces too much noise from the varying transmission problems on the Internet. Also, many sites have very different traffic patterns on weekdays and on weekends. For a business-oriented site, it is very common for weekend traffic to drop to less than half of the normal rate. Fluctuations from Friday to Saturday and from Sunday to Monday obviously do not indicate any real change in site usage, so the statistics become much easier to analyze when aggregated over a full week.
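Aggregating daily counts into weekly totals is straightforward. The sketch below, in Python, groups by ISO week and drops incomplete weeks so that a partial week at either end of the log does not look like a traffic drop. The daily figures are hypothetical (100 hits on weekdays, 40 on weekend days) and only illustrate the weekday/weekend pattern described above.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_totals(daily_hits):
    """Sum {date: hits} into {(iso_year, iso_week): hits}.

    Only weeks with all 7 days present are kept.
    """
    totals = defaultdict(int)
    days_seen = defaultdict(int)
    for day, count in daily_hits.items():
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[key] += count
        days_seen[key] += 1
    return {k: totals[k] for k in sorted(totals) if days_seen[k] == 7}

# Hypothetical log: 17 consecutive days starting on a Monday.
start = date(1998, 11, 2)
start -= timedelta(days=start.weekday())  # snap to the Monday of that week
daily = {}
for offset in range(17):
    day = start + timedelta(days=offset)
    daily[day] = 40 if day.weekday() >= 5 else 100

weekly = weekly_totals(daily)
for (year, week), total in weekly.items():
    print(year, week, total)
```

The three trailing days form an incomplete week and are excluded, leaving two comparable weekly totals even though the daily numbers swing by more than a factor of two.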
Individual Weekly Stats
Now returning to question 1 in Mark Bernstein's word problem: Is the increase in traffic in Week 5 significant? The traditional way to answer this question would be to build a statistical model for the data from Weeks 1 through 4 and then calculate the probability that the observation from Week 5 would fit within the model's predictions. If the probability is low enough (typically, less than 5%), then you conclude that it is too unlikely to have happened if the site had behaved the same in Week 5 as in Weeks 1-4. In other words, something new must have happened to cause the increased traffic. If, on the other hand, the probability for fitting within the statistical model is high, then you conclude that nothing significant had changed and that the increase was just a random fluctuation.
Performing this statistical analysis requires the data to follow certain assumptions. For most statistical analyses, the data must follow a normal distribution (or at least be close to one). In this case, we have much too little data to conclude anything about the distribution of the observations, so it would be very difficult to conclude anything about a single week's traffic.
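The mechanics of the traditional test can be sketched as follows, in Python. This is an illustration only: it assumes that Weeks 1 through 4 are independent draws from a single normal distribution with no trend, which is exactly the assumption that cannot be checked with four data points. The function names are illustrative, and the p-value helper is a closed form valid only for 3 degrees of freedom (four baseline weeks minus one estimated mean).

```python
import math
import statistics

baseline = [120, 132, 116, 120]  # Weeks 1-4
observed = 148                   # Week 5

def prediction_t(baseline, observed):
    """t statistic for a new observation against a normal baseline sample.

    The sqrt(1 + 1/n) factor accounts for uncertainty in the estimated mean.
    """
    n = len(baseline)
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)  # sample standard deviation
    return (observed - mean) / (sd * math.sqrt(1 + 1 / n))

def two_sided_p_df3(t):
    """Two-sided p-value for Student's t with exactly 3 degrees of freedom."""
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

t_stat = prediction_t(baseline, observed)
p_value = two_sided_p_df3(t_stat)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With only four baseline weeks the estimated spread is itself very uncertain, so whatever p-value comes out of this calculation should not be read as settling the question.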
I am being cautious because Web traffic has not yet been sufficiently studied for us to know its statistical properties. The two main things that are known are that long-term traffic tends to grow exponentially (the average site grew 130% in 1997) and that short-term traffic is extremely volatile. It is very common for a website to have its traffic double or be cut in half from one week to the next. This high variability means that it is extremely difficult to judge traffic patterns based on short-term data. Only by looking at observations over several months does it become possible to distinguish between fluctuations and trends.
Once the "normal" volatility for a given site is known it will be possible to calculate the probability that any given week's traffic is uncommonly large or small. There is insufficient data to provide an answer for the sample site, but my intuition says that the increase in Week 5 falls well within the random fluctuations expected on the Web.
Because of the high variability in site traffic, it is usually best to have enough spare server capacity to handle a doubled load without warning. Furthermore, the long-term traffic trends should be tracked regularly, and upgrades planned before it is too late and irate users start leaving you for competing sites.
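One way to turn a tracked growth rate into an upgrade schedule is to estimate how long the current headroom will last. The sketch below, in Python, assumes steady exponential growth and a year of 365/7 weeks. The function name and parameters are hypothetical, and the calculation ignores the short-term volatility discussed above, so it gives at best an optimistic upper bound on the time available.

```python
import math

WEEKS_PER_YEAR = 365 / 7  # assumption: how an annual rate is converted to weekly

def weeks_until_exhausted(headroom_factor, annual_growth):
    """Weeks until average traffic grows by headroom_factor.

    headroom_factor: capacity divided by current average load (e.g. 2.0).
    annual_growth:   yearly growth as a fraction (e.g. 5.05 for 505%).
    """
    if headroom_factor < 1 or annual_growth <= 0:
        raise ValueError("need headroom_factor >= 1 and annual_growth > 0")
    weekly_log_growth = math.log(1 + annual_growth) / WEEKS_PER_YEAR
    return math.log(headroom_factor) / weekly_log_growth

# Example: capacity for a doubled load, traffic growing 505% per year.
print(round(weeks_until_exhausted(2.0, 5.05), 1))
```

A site that doubles every year and has capacity for a doubled load has, by this estimate, one year before its average traffic alone fills that capacity. Faster-growing sites have proportionally less time.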
Ironically, a few days after I wrote this essay, my own site experienced a capacity overload. www.useit.com has about five times the capacity needed to serve normal traffic, but this turned out to be insufficient to handle the load the day Jesse Berst wrote about my work in AnchorDesk. Traffic exploded and it was almost impossible to get through for about two hours, until the server was upgraded. At least I had a contingency plan, even though it took a little time to activate.