I have received some interesting user comments on my June Alertbox on linkrot.
Old Links Keep Propagating
Karl M. Bunday writes:
Your Alertbox "Fighting Linkrot" came out just as I was engaging in a laborious procedure to fight linkrot--notifying more than 100 Web masters, over two days, by E-mail about the current URL for my site, formerly hosted on first one, then another commercial on-line service. My School Is Dead, Learn in Freedom! site has had its domain name and thus own URL, http://learninfreedom.org/, since mid-December 1997. Trying to find ways to alert webmasters of sites that link to my earlier sites about my new URL helped me discover some of the reasons linkrot is such a persistent and growing problem.
My very earliest Web presence began in early 1995. A pioneer in on-line commerce offered me some space on his site to post some of my FAQ files and gave me my own directory on his server. It didn't take me long at all to see that my link from a famous virtual library page about my site's subject had been copied on to several other Web sites, and the Yahoo link was very widely copied. I think the Yahoo code is how my site became known to the Point Survey Top 5 Percent Award people when I won their award. I have not publicized that site for two years now, but it still gets new links set up to it all the time. That site has been defunct since about November 1997--just as it won a favorable review in that month's issue of Yahoo Internet Life magazine. Many people writing Web pages copy links they see on other sites and put those links on their own sites. Links propagate by plagiarism more than they propagate by bookmarking.
In January of 1996 I got my own site on a commercial on-line service with an awkward, too-long domain-directory name. That site's provider didn't give me access statistics. By March of 1997 I had set up yet another commercial on-line service directory as my site, and that provider gave me access statistics. As soon as I tested the second site, I put up "moving notices" on the earlier commercial on-line service site to send people to the new site. Quite a few people came over to the second site. By December of 1997 a friendly site master who shares my subject-matter interest helped me set up the School Is Dead, Learn in Freedom! site with my own URL, learninfreedom.org, and I then changed my "moving notices" and began heavily publicizing my permanent domain name.
But as I write this, there are still many more links, according to AltaVista, to old URLs of my site than to the current URL. I have been visiting those linking sites (and also visiting referring sites that appear in the March 1997 site's access logs) to tell the webmasters to update their links. I am appalled at how often I find pages, for example, by college students that purport to be original works of research that are actually stolen from a professional association, or family site link pages that copy Yahoo's directory pages verbatim, or other pages that engage in blatant plagiarism. Surely any page written since November 1997 that links to my very earliest Web site is entirely in error, and was probably written by a webmaster who didn't visit links before posting them. I did a first round of notifying webmasters back in February 1998, and now in June 1998 I have done another round, only to see many new sites appear with my oldest, long obsolete link.
Search engine behavior only exacerbates this problem. I agree, after thinking about it, with your suggestion that my two earlier Web sites that provide only "moving notices" now should be kept up as long as possible, even though I'm paying for them out of a too-limited budget. But the search engines sometimes rank those obsolete sites higher in well-constructed searches (even after I've pared down those pages to being just redirects) than my current site. That's crazy. There ought to be a universally observed search engine adherence to the META robots tag so that <META NAME="robots" CONTENT="noindex"> really made sure that a page didn't get indexed--I have had all my old obsolete pages marked that way for a long time, but they still get indexed and reindexed. It would be better still to develop a new LINK tag that would inform compliant robots (which all robots should be) that a page was obsolete and replaced by a new page, something like <LINK REV="dead" REL="current" HREF="http://learninfreedom.org/"> or something like that.
Jakob's reply: It is indeed a pain to get other sites to update their listings. Currently, the only two approaches are
- Manual email to the webmaster, pointing out the obsolete link. A large site gets so much email that they often don't deal with it (indeed, it is one of the challenges in running a large site to set up processes for handling email)
- Hoping that the other site runs a link validator that notices redirects and causes the links to be updated with the new URLs
Neither approach is very effective. In the long term, I hope for an automated solution that would allow the owner of a site to send a standardized message (digitally signed, of course) to an automated agent at the remote webserver that would cause the server to update the obsolete URLs with the new ones. Message authentication is necessary to prevent link kidnapping, where competitors cause links to go to their site.
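The second approach above, a link validator that notices redirects, might be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `classify_link` and the action labels are hypothetical, and the HTTP status code and `Location` header are assumed to have been fetched already by whatever checker the site runs.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin

PERMANENT_REDIRECTS = {301, 308}        # the target has moved for good
TEMPORARY_REDIRECTS = {302, 303, 307}   # the old URL is still the one to link to


def classify_link(url: str, status: int,
                  location: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (action, suggested_url) for one checked outbound link.

    `classify_link` and its action labels are hypothetical names for this sketch.
    """
    if status in PERMANENT_REDIRECTS and location:
        # Location may be relative, so resolve it against the old URL.
        return ("update", urljoin(url, location))
    if status in TEMPORARY_REDIRECTS:
        return ("keep", None)
    if status in (404, 410):
        return ("dead", None)
    if 200 <= status < 300:
        return ("ok", None)
    return ("review", None)  # anything else goes to a human
```

Only permanent redirects produce an "update" suggestion here; a temporary redirect is left alone because the old URL is still the address the other site is advertising.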
A recent study found that the main search engines had between 2% and 5% linkrot. This is less than the Web average of 6% but still too much for comfort. Search engines should focus more on their core competency and clean up their databases to remove dead links and update redirected links as soon as possible. Many other steps are necessary to enhance search usability: for example, search engines could record what links users actually follow when presented with a hitlist for a query term and then promote these useful links to a more prominent position for the next user.
As a short term solution, if you run a large site, consider implementing a special subpage under your feedback area for reporting link updates. This page should contain standard fields to report the page on your site that needs change, the old URL, the new URL, and a contact email in case of problems. You can then automatically retrieve the old and the new URL and present your Web gardener with a view of three pages: your own page (highlighting the link that might need change) as well as the result of retrieving the old and the new URL. If everything seems to be OK (i.e., not a link kidnap), then the gardener could click a single button to update the site without the need for any manual HTML editing.
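The four report fields could be captured and pre-screened before the gardener sees them, along the lines of the sketch below. The class `LinkUpdateReport`, the function `review_notes`, and the example hosts are all hypothetical; the checks are simple sanity checks and do not replace the gardener's own look at the three pages.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class LinkUpdateReport:
    """One submission from the (hypothetical) link-update feedback form."""
    page: str            # page on our own site that contains the link
    old_url: str
    new_url: str
    contact_email: str


def _is_http_url(value: str) -> bool:
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def review_notes(report: LinkUpdateReport, own_host: str) -> List[str]:
    """Return notes for the gardener; an empty list means nothing looked odd."""
    notes = []
    if urlparse(report.page).hostname != own_host:
        notes.append("reported page is not on this site")
    for label, value in (("old URL", report.old_url), ("new URL", report.new_url)):
        if not _is_http_url(value):
            notes.append(f"{label} is not an absolute http(s) URL")
    if report.old_url == report.new_url:
        notes.append("old and new URL are identical")
    if "@" not in report.contact_email:
        notes.append("contact email looks malformed")
    if urlparse(report.old_url).hostname != urlparse(report.new_url).hostname:
        # A legitimate move often changes hosts, so this is a flag, not a rejection.
        notes.append("new URL is on a different host: check for link kidnapping")
    return notes
```

A host change is only flagged, never rejected outright, because a genuine move (such as Bunday's move to his own domain) looks the same as a kidnap attempt until a person compares the pages.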
Alan Levine writes:
There is a nifty, well maintained site called "Ghostsites" that chronicles once-hyped web sites that now R.I.P.
We've been combing our error logs and are amazed at the URLs people are attempting to reach: sites that have not had an active link in more than 4 years and other pages that were never linked from other pages.
Should Old Pages Live Forever?
Ihor Prociuk writes:
You stated that: "Any URL that has ever been exposed to the Internet should live forever:..."
I agree that in all instances, pages that have been moved or removed should have, at the very minimum, a link page pointing from the old location to the new one. And any decent webserver/browser should be able to handle redirects and automatic bookmark updating. However if, as you say, "6% of the links on the Web are broken", after a while, the number of redirection pages could become quite large. (The percentage will not increase at the same rate as the increase in the number of overall links). I'm not sure what impact this would have on search engines. One of the annoying things about search engines is all the broken links they return. You would think they could run "link validators" more often to reduce clutter in their databases and improve overall performance.
I think that a time-limited notice (one year?) or redirect would be sufficient for webmasters to update their outbound links. And if, as you suggest, link validators are run on a regular basis, this would catch most of the bad links. Think of a company that is about to move. It makes reasonable efforts to inform its current customers that it is about to change phone numbers. However, after a while, the letterhead just gives the new number. It certainly doesn't have "5 years ago our phone number was xxx-xxxx and 2 years ago it was yyy-yyyy, now it's zzz-zzzz". Phone companies take this approach. When a number changes, they tell you the new number. My phone company doesn't offer an option that would allow me to connect to the new number without redialing. They want me to update whatever record system (addressbook, auto-dialer, etc.) I'm using. If I want to avoid the message, I'll do the update.
Jakob's reply: Unfortunately, a year is not nearly enough time for redirecting links from old pages to the new ones. Even the search engines seem to be more than a year behind in many cases, despite having substantially more resources and professional staff than the average site. Bunday's case story above shows that two-year-old links are still rampant.
Luckily, the resources needed for redirections are relatively limited: typically a single line per obsolete URL in a server-configuration file. Thus, it should be possible to keep redirects alive for several years. I do agree that they probably can't literally live forever, but I advise extreme conservatism in pruning the database of URL redirects. The server log files will keep track of the frequency of access to the old URLs, and I would keep a redirect active as long as the old URL got at least one hit per month. After all, the cost of that extra line in the configuration file is very low. Once, say, two months have gone by without any hits, then it may be time to prune the redirect database.
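The pruning rule described above can be written down in a few lines. This is a minimal sketch, assuming the date of the most recent hit for each old URL has already been extracted from the server logs; the function name `redirects_to_prune`, the 60-day reading of "two months", and the example paths are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List


def redirects_to_prune(last_hit: Dict[str, date], today: date,
                       quiet_days: int = 60) -> List[str]:
    """Return old URLs whose most recent hit is more than `quiet_days` ago.

    `last_hit` maps each redirected URL to the date of its latest log entry.
    """
    cutoff = today - timedelta(days=quiet_days)
    return sorted(path for path, seen in last_hit.items() if seen < cutoff)


# Hypothetical data: one redirect still in use, one silent for months.
hits = {
    "/old/faq.html": date(1998, 8, 20),
    "/older/index.html": date(1998, 5, 1),
}
candidates = redirects_to_prune(hits, today=date(1998, 9, 1))
# candidates == ["/older/index.html"]
```

Given the advice to be extremely conservative, the output is best treated as a list of candidates for review rather than something to delete automatically.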
SiteDeath: The Ultimate Linkrot
Mark Nottingham writes:
I had the extreme misfortune to sit through a Sybase marketing session about their involvement in the soccer World Cup today (lots of marketing speak, no content). I won't even go into the usability issues I spotted when they demo'd the France '98 Web site, but one thing struck me out of the blue:
What's going to happen to the content?
Sybase, EDS and other companies have gone to enormous expense to showcase their abilities in creating an 'end-to-end solution'. There are copious amounts of data on the site. It's getting millions of hits a day, apparently. But, when the whole thing is over, how long will they keep it up? A week? A year? How long before it isn't profitable, and then what happens?
It's the ultimate form of link rot.
I've done a quick survey of Olympics sites; www.atlantagames.com is no more, but www.olympic.nbc.com is still going, as is www.nagano.olympic.org. But as events get more corporatized (such as the World Cup), will we see content disappearing? It'd be a shame to see all of those dead links, as well as losing material for future study of the early days of the Internet.
To me, this brings up issues about the ownership of Internet content, and what it engenders in the user base.
Jakob's reply: Many of these showcase sites have horrible usability because they were designed to show off the sponsoring company's technology and not to provide a useful service to users. Quite often, they are built by technology companies that don't have a single usability engineer on staff. An interesting exception is the World Cup USA 1994 which was designed by Darrell Sano: he does know what he is doing, as evidenced by the analysis of the project in Sano's book Designing Large-Scale Web Sites: A Visual Design Methodology. The 1994 site is currently preserved at mirror servers in London and Tokyo, even though the main site in California is long gone.
For major sports events, I advise the World Cup organizers and the International Olympic Committee to make sponsors sign a contract to keep their sites alive for at least ten years. These sites form the major historical presence for the games after the event, so it is sad to see them go. I would also include a contractual requirement for formal usability testing: after all, the organizers' reputation is at stake, and a bad site reflects poorly on their commitment to their fans.
Update added July 2000:
- The London mirror of the 1994 World Cup is now gone
- The Tokyo mirror is still going strong (good work, Japan)
- The 1998 World Cup site in France is still operational
- The 1998 NBC Olympics site now redirects to a site for the 2000 Olympics in Sydney - probably fine, but there is no way to get the old content about what happened two years ago
- The official site for the Nagano Olympics also redirects to a site for the Sydney games - less appropriate in my view
Update added July 2005:
- The Tokyo mirror continues to be available
- The 1998 World Cup site in France is gone; the domain is now an advertising site for various soccer-related products
- NBC Olympic site is now gone (used to be the site for the 1998 winter Olympics)
- The Nagano Olympics site now redirects to a general site for the International Olympic Committee -- an OK redirect, but at the loss of the original data
Of six sports sites mentioned in 1998, only one retains the original mission and content in 2005. 83% of the services are no longer available. (And we're talking sites for billion-dollar operations like the Olympics and soccer World Cup.)
Pavel Podvoiski writes:
WWW is rotting, and not only because of the rotten links. Content is rotting at ever greater rate :(
Once designed for sharing information, WWW now seems to be hiding that information. I was just pissed off one more time, so I decided to write to you in hope your voice would be heard.
- In mid 1996 I visited www.microsoft.com for help in russification (i'm from Russia) of Windows NT 3.51. In few minutes I had found it just by following links.
- In 1997 I requested this information again (NT died). Surprise!! I can't find it. After approx. 30 minutes of struggling with "search" engine I found this document again.
- In 1998, just for curiosity, I repeated this experiment. No surprise, I didn't find it. A lot of buzz and almost no facts. BTW, can you tell me, does MS Internet Information Server 4.0 support FTP transfer resume or not? I was unable to figure it out from the MS site. (BTW, can you find IIS info on MS site anymore?)
Another example, 3Com this time:
- In April 1998 we were upgrading from 10MBit to 100Mbit Ethernet. In five or so clicks I was right there. Info on cards and hubs was right in my hands with useful figures and suggestions.
- Yesterday, 20-Jan-1999, I visited this site again (company is moving+expanding) and found none, NONE!!!!! useful information, - a lot of "how good we are" and outdated tables with corps of info (when I said outdated it means you trying to find product listed in table and got nothing). Finally I and my colleague followed links to _dealers_ and found some info.
I can go on and on. Yahoo, even NCSA and a lot of others ..... :( :( :(
Update Link Destinations
Helmut Wollmersdorfer writes:
Not only dead links but also wrong links can be a problem.
Wrong links mean that the link exists, but it links to another page; not the page the designer wanted it to link to.
To avoid both - dead and wrong links - I go through my pages each month, test each link and look what happens. Additionally I do the same procedure after each update of my pages.
Also, each month I ask myself: Are my links to foreign pages a good tip for my users, are there some links to remove, or some that should be changed to better ones concerning the same topic?
Jakob's reply: This is a great point. For a large site it may not be feasible to check every single link destination every month, but doing some form of check would be a good task for the content gardener (a job category I have proposed for large sites to maintain their old content). Guidelines for checking the destinations of old links include:
- Know what old content gets the most page views: check the links on these pages more frequently than links on less-visited pages
- Have a robot that visits linked pages at regular intervals and compares their current content with an archived snapshot of their content the day you linked to them. Destinations that change the most are likely candidates for replacement. Of course, a change could be caused by an update or an improvement in the destination page, so the simple fact of the change is not enough to yank the link. Also an unchanging page can become obsolete, meaning that it would be better to link to a new source on the Web.
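The snapshot comparison in the second guideline could be approximated as in the sketch below. It assumes the robot has already stored the archived text and fetched the current text of each destination; the names `change_score` and `rank_by_change` are hypothetical, and word-level similarity from Python's standard `difflib` is just one possible measure of change.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def change_score(snapshot: str, current: str) -> float:
    """Return 0.0 for identical text, rising toward 1.0 as the pages diverge."""
    matcher = SequenceMatcher(None, snapshot.split(), current.split())
    return 1.0 - matcher.ratio()


def rank_by_change(pages: Dict[str, Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Sort destinations so the most-changed ones are reviewed first.

    `pages` maps a destination URL to (archived snapshot text, current text).
    """
    scored = [(url, change_score(old, new)) for url, (old, new) in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The ranking only orders the gardener's review queue. As noted above, a high score may simply mean the destination improved, and a score of zero says nothing about whether an unchanged page has become obsolete.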