How to tell the search engine spiders and crawlers which directories and files to include, and which to avoid.
Search engines find your web pages and files by sending out robots (also called bots, spiders or crawlers) that follow the links found on your site, read the pages they find and store the content in the search engine databases.
Dan Crow of Google puts it this way: “Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot ‘crawls’ the web.”
But you may have directories and files you would prefer the search engine robots not to index. You may, for instance, have different versions of the same text, and you would like to tell the search engines which is the authoritative one (see: How to avoid duplicate content in search engine promotion).
How do you stop the robots?
The robots.txt file
If you are serious about search engine optimization, you should make use of the Robots Exclusion Standard by adding a robots.txt file to the root of your domain.
By using the robots.txt file you can tell the search engines which directories and files they should spider and include in their search results, and which directories and files to avoid.
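The standard rests on two directives: User-agent, which names the robot a rule applies to, and Disallow, which lists the path prefixes that robot should stay out of. A minimal sketch, with hypothetical directory names chosen only to illustrate the syntax, looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /drafts/

The asterisk means the rules apply to every robot, and each Disallow line blocks one path prefix.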
This file must be uploaded to the root directory of your site, not to a subdirectory. Hence Pandia’s robots.txt file is found at http://www.pandia.com/robots.txt.
Plain ASCII please!
robots.txt should be a plain ASCII text file.
Use a text editor or a text-based HTML editor to write it, not a word processor like Word.
Pandia’s robots.txt file, found at the address given above, is a good example of an uncomplicated file of this type.
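A file along those lines might read as follows; the robot name and directory names here are illustrative stand-ins, not Pandia’s actual entries:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: gigabot
Disallow: /

The first record asks all robots to skip two directories; the second, separated by a blank line, bars one named robot (the hypothetical gigabot) from the entire site.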