There are four mechanisms you can use to keep your PDF files out of search engines:
- Place all PDF files in a separate directory and use a robots.txt file to tell search engines to avoid anything in that directory. With all PDF files collected in one directory, a single Disallow line in robots.txt covers them all.
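As a sketch, assuming the PDF files live in a directory named /pdf/ (a hypothetical name for illustration), the robots.txt file would contain:

```
# Applies to all compliant crawlers
User-agent: *
# Skip every URL whose path begins with /pdf/
Disallow: /pdf/
```

The robots.txt file must sit at the root of the site (e.g., at /robots.txt) for crawlers to find it.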
- If you prefer to keep your PDF files in the directories where they logically belong, you can list the individual PDF files on separate lines in the robots.txt file. This is obviously a maintenance nightmare. Unfortunately, there is no way to disallow spidering of a certain file type, so you must list each file if you want to use robots.txt without a special directory for banned files.
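To illustrate the per-file approach, a robots.txt listing individual PDF files might look like the following (the paths are hypothetical examples, not from the original text):

```
User-agent: *
# Each banned PDF file must be listed on its own line
Disallow: /products/spec-sheet.pdf
Disallow: /reports/annual-report.pdf
Disallow: /whitepapers/overview.pdf
```

Every new PDF you publish requires another line here, which is why this approach becomes a maintenance burden as the site grows.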
- Google supports an extension to the robots.txt standard that lets you keep its crawler from spidering PDF files. Unfortunately, this is not part of the official standard and thus will not work for other search engines. Add the following lines to robots.txt:
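A sketch of those lines, using Google's wildcard (`*`) and end-of-URL (`$`) pattern-matching extensions:

```
# Googlebot understands * (any characters) and $ (end of URL)
User-agent: Googlebot
# Block any URL that ends in .pdf
Disallow: /*.pdf$
```

The `$` anchors the pattern to the end of the URL, so only URLs ending in `.pdf` are blocked; crawlers that do not support these extensions will simply ignore or misread the pattern.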
- To keep search engines from following links from your proper Web pages to the PDF files, add the following meta tag to the head of each page:
<meta name="robots" content="nofollow">
None of these solutions is ideal. It would be much better if you could tell search engines the file types that you want them to index.
Even if you use the "nofollow" convention for PDF file links, there is still a risk that other websites will cluelessly link directly to your PDF files, and thus expose the URLs to spiders. (See the sidebar for advice on how to link to PDF documents on other websites.)
As a final option, you can password protect all PDF files. Because search engines won't know the password, they won't be able to index the PDF file. This approach is good for extranets and for documents that you're selling, because users will accept the need for authentication. For standard Web browsing, however, passwords are a bad idea because they're an additional barrier between users and the information they seek.