robots.txt

If you operate a public web site, there is little doubt that you'd like an occasional visit from search engine minions, known as robots. Robots are little agents that search engines dispatch to your site to scan your page contents and send them back to the mother ship for cataloguing and, finally, inclusion in search engine results pages (SERPs).

If you have ever scanned your web logs, you have undoubtedly noticed these agents. They come with different names like googlebot, msnbot, and yahoo slurp. Almost all legitimate robots ask for permission before crawling a site, and the way that's done is through a file named robots.txt. This capability has been around since the early days of the search engines, but it is perhaps one of those often forgotten details. The reason is that if a robot can't locate /robots.txt at a Web site's root, it takes that as a green light to crawl and index the whole site.
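As a rough illustration of that lookup, here is a minimal Python sketch of how a polite robot might locate the file and treat a missing one as permission to crawl. The helper names `robots_url` and `may_crawl`, the agent name `examplebot`, and the `www.example.com` address are made up for this example; real robots have their own implementations.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt lives at the site root, whatever page we started from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_body: Optional[str], agent: str, url: str) -> bool:
    """Decide whether agent may fetch url.

    robots_body is the text of robots.txt, or None when the file is missing.
    """
    if robots_body is None:
        # No robots.txt found: taken as a green light to crawl.
        return True
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/news/story.html"  # hypothetical page
print(robots_url(page))
print(may_crawl(None, "examplebot", page))
```

The fetch itself is left out on purpose so the sketch runs offline; a real robot would download the URL returned by `robots_url` and pass the body (or `None` on a "not found" response) to `may_crawl`.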

robots.txt is a flat ASCII file with a simple format. It is placed in the root directory of the Web site, so, for example, it can be accessed this way: http://www.tmcnet.com/robots.txt. If you want search engines to crawl your whole site, you would specify this inside robots.txt:

User-agent: *
Disallow:

If you want to block robots from a certain location of your site, you would specify this (here /private/ is only a placeholder for the directory you want to exclude):

User-agent: *
Disallow: /private/
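To see how a robot might interpret such rules, the following sketch feeds two rule sets to `urllib.robotparser` from the Python standard library. The `/private/` directory, the `www.example.com` host, and the helper name `can_fetch` are assumptions made for illustration, not anything a particular search engine uses.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow line blocks nothing, so the whole site is open.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Hypothetical rule set that blocks a single directory.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


site = "http://www.example.com"  # placeholder host
for name, rules in (("allow all", ALLOW_ALL), ("block /private/", BLOCK_PRIVATE)):
    for path in ("/index.html", "/private/report.html"):
        print(name, path, can_fetch(rules, "googlebot", site + path))
```

Only the combination of the second rule set and the `/private/report.html` path is refused; everything else is allowed.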

I won't bore you with the details. You can read about the format at http://www.robotstxt.org/wc/norobots.html.

Now the question is: if a missing robots.txt file is an open permission to crawl, why bother creating one? The best reason is to save bandwidth. Many sites are designed to deliver a standard page that helps lost users who request a missing page. A robot looking for a missing /robots.txt file would also receive this page. In most instances the standard error page will not cause any harm, but the robot still has to download and parse it, wasting bandwidth and resources. A safe practice to avoid this waste is to place an empty robots.txt on your Web site.
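The difference can be sketched with Python's standard `urllib.robotparser`. The error page below is invented for the example, and other robots may handle stray HTML differently; the point is only that an empty file yields the same "crawl everything" answer at zero bytes.

```python
from urllib.robotparser import RobotFileParser

EMPTY_ROBOTS = ""

# Hypothetical "friendly" error page a site might serve for a missing file.
ERROR_PAGE = """\
<html>
<head><title>Page not found</title></head>
<body><p>Sorry, we could not find that page.</p></body>
</html>
"""


def allows(body: str, agent: str, url: str) -> bool:
    """Parse body as if it were robots.txt and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(agent, url)


url = "http://www.example.com/news/story.html"  # placeholder URL
for label, body in (("empty robots.txt", EMPTY_ROBOTS), ("error page", ERROR_PAGE)):
    size = len(body.encode("ascii"))
    print(label, size, "bytes, crawl allowed:", allows(body, "googlebot", url))
```

With this parser both inputs leave the site fully crawlable, because the HTML contains no recognizable rules, but only the error page costs bytes to transfer and parse.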

Finally, understand that /robots.txt works based on the honor system. While most legitimate search engines follow its instructions, there is no way to enforce obedience via this file.

This page contains a single entry by published on May 23, 2005 11:17 AM.
