robots.txt

If you operate a public web site, there is little doubt that you'd like an occasional visit from search engine minions, known as robots. Robots are little agents that search engines dispatch to your site to scan your page contents and send them back to the mother ship for cataloguing and eventual inclusion in search engine results pages (SERPs).

If you have ever scanned your web logs, you have undoubtedly noticed these agents. They come with different names like googlebot, msnbot, and yahoo slurp. Almost all legitimate robots ask for permission before crawling a site, and they do so through a file named robots.txt. This capability has been around since the early days of search engines, but it is one of those often-forgotten details. The reason is that if a robot can't locate /robots.txt at a Web site's root, it takes that as a green light to crawl and index the whole site.
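As a rough illustration of that lookup, here is a minimal Python sketch. The function names `robots_txt_url` and `crawl_policy` are made up for this example, and the status handling is a simplification of what real robots do, not a description of any particular search engine.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def crawl_policy(status_code: int) -> str:
    """Hypothetical helper: map the HTTP status of the robots.txt request to a policy."""
    if status_code == 404:
        # File missing: taken as a green light to crawl the whole site.
        return "crawl everything"
    if status_code == 200:
        return "follow the rules in the file"
    # Assumption: other statuses are left undecided in this sketch.
    return "undecided"


if __name__ == "__main__":
    print(robots_txt_url("http://www.tmcnet.com/some/page.html?id=1"))
    print(crawl_policy(404))
```

Whatever page a robot starts from, it looks for the file at the same root-level location.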

robots.txt is a flat ASCII file with a simple format. It is placed in the root directory of the Web site, so it can be accessed, for example, this way: http://www.tmcnet.com/robots.txt. If you want search engines to crawl your whole site, you would specify this inside robots.txt:

User-agent: *
Disallow:

If you want to block robots from a certain location of your site, say a directory named /private/ (a made-up path used here only for illustration), you would specify this:

User-agent: *
Disallow: /private/
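To see how a well-behaved robot might interpret rules like these, here is a small sketch using Python's standard `urllib.robotparser` module. The `/private/` path, the `ExampleBot` name, and the `www.example.com` URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Hypothetical rule set that blocks one directory for every robot.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/private/report.html"
    print(is_allowed(ALLOW_ALL, "ExampleBot", page))
    print(is_allowed(BLOCK_PRIVATE, "ExampleBot", page))
    print(is_allowed(BLOCK_PRIVATE, "ExampleBot", "http://www.example.com/index.html"))
```

Note that a `Disallow:` line with no value blocks nothing. A path is needed to keep robots out of a location.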

I won't bore you with the details. You can read about them at http://www.robotstxt.org/wc/norobots.html.

Now the question is: if a missing robots.txt file is an open permission to crawl, why bother creating one? The best reason is to save bandwidth. Many sites are configured to return a standard error page to help users who request a missing page. A robot looking for a missing /robots.txt file would also receive this page. In most instances the error page causes no harm, but the robot still has to download and parse it, wasting bandwidth and resources. A safe way to avoid this waste is to place an empty robots.txt on your Web site.
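As a sketch of that practice, the Python snippet below creates an empty robots.txt in a directory and checks that the standard library parser reads it as "everything allowed". The temporary directory stands in for your real document root, and the helper names and `ExampleBot` are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser


def create_empty_robots_txt(doc_root: Path) -> Path:
    """Create robots.txt in doc_root if it is missing; leave an existing file alone."""
    target = doc_root / "robots.txt"
    target.touch(exist_ok=True)
    return target


def allows_everything(robots_file: Path, sample_url: str) -> bool:
    """Check whether a hypothetical robot may fetch sample_url under this file."""
    parser = RobotFileParser()
    parser.parse(robots_file.read_text(encoding="ascii").splitlines())
    return parser.can_fetch("ExampleBot", sample_url)


if __name__ == "__main__":
    # A temporary directory is used here in place of a real document root.
    with tempfile.TemporaryDirectory() as tmp:
        path = create_empty_robots_txt(Path(tmp))
        print(path.stat().st_size)
        print(allows_everything(path, "http://www.example.com/any/page.html"))
```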

Finally, understand that /robots.txt works based on the honor system. While most legitimate search engines follow its instructions, there is no way to enforce obedience via this file.
