Preventing Public Search Engines from Spidering PDF Files

by Jakob Nielsen on July 28, 2003

Sidebar to Jakob Nielsen's column Gateway Pages Prevent PDF Shock.

There are four mechanisms you can use to keep your PDF files out of search engines:

  • Place all PDF files in a separate directory and use a robots.txt file to tell search engines to avoid anything in that directory. If the PDF files are in a directory called /pdf, for example, add the following two lines to your robots.txt file:
    User-agent: *
    Disallow: /pdf/
    The robots.txt file should be at your website's root level (e.g., www.useit.com/robots.txt).
  • If you prefer to keep your PDF files in the directories where they logically belong, you can instead list the individual PDF files on separate lines in the robots.txt file (see the sketch after this list). This is obviously a maintenance nightmare: the basic robots.txt standard offers no way to disallow spidering of a certain file type, so you must list every file if you want to use robots.txt without a special directory for banned files.
  • Google supports an extension to the robots.txt standard that lets you keep its crawler from spidering PDF files. Unfortunately, this extension is not part of the official standard and thus will not work for other search engines. Add the following lines to robots.txt:
    User-agent: Googlebot
    Disallow: /*.pdf$
  • To keep search engines from following links from your proper Web pages to the PDF files, add the following meta tag to the head of each page:
    <meta name="robots" content="nofollow">
    If you've followed the recommendation to use a gateway page for each PDF file, and you ensure that the gateway page contains the only link to the PDF, then preventing search engines from following the link will do the trick.
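
As a concrete sketch of the file-by-file approach, a robots.txt that blocks individual PDF files might look like the following (the paths are hypothetical examples, not files from this site):

    User-agent: *
    Disallow: /reports/annual-report-2002.pdf
    Disallow: /whitepapers/usability-returns.pdf
    Disallow: /manuals/installation-guide.pdf

Every new PDF you publish requires another Disallow line, which is why this approach scales poorly.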

None of these solutions is ideal. It would be much better if you could simply tell search engines which file types you want them to index.

Even if you use the "nofollow" convention for PDF file links, there is still a risk that other websites will cluelessly link directly to your PDF files, and thus expose the URLs to spiders. (See sidebar for advice on how to link to PDF documents on other websites.)

As a final option, you can password protect all PDF files. Because search engines won't know the password, they won't be able to index the PDF file. This approach is good for extranets and for documents that you're selling, because users will accept the need for authentication. For standard Web browsing, however, passwords are a bad idea because they're an additional barrier between users and the information they seek.
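
The article does not prescribe a particular tool for applying the password. As one possible sketch (an assumption, not part of the original recommendation), an open-source command-line tool such as qpdf can encrypt a PDF with a user password; the file names and passwords below are placeholders:

    # Require the password "secret" to open the file, using 128-bit encryption
    qpdf --encrypt secret secret 128 -- report.pdf report-protected.pdf

A spider that retrieves report-protected.pdf cannot extract its text without the password, so nothing useful ends up in the search index.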

