May 2005 Archives

robots.txt

May 23, 2005 11:17 AM

If you operate a public web site, there is little doubt that you'd like an occasional visit from search engine minions, known as robots. Robots are little agents that search engines dispatch to your site to scan your page contents and send them back to the mother ship for cataloguing and, finally, inclusion in search engine results pages (SERPs).

If you have ever scanned your web logs, you have undoubtedly noticed these agents. They come with different names like googlebot, msnbot, and Yahoo Slurp. Almost all legitimate robots ask for permission before crawling a site, and the way that's done is through a file named robots.txt. This capability has been around since the early days of the search engines, but it is perhaps one of those often forgotten details. The reason is that if a robot can't locate /robots.txt at a Web site's root, it takes that as a green light to crawl and index the whole site.

robots.txt is a flat ASCII file with a simple format. It is placed in the root directory of the Web site, so it can be accessed like this: http://www.tmcnet.com/robots.txt. If you want search engines to crawl your whole site, you would specify this inside robots.txt:

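User-agent: *
Disallow:

If you want to block robots from a certain location on your site, you would list that path after Disallow. For example, with /private/ standing in for whatever directory you want to keep robots out of:

User-agent: *
Disallow: /private/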

I won't bore you with the details. You can read about them at http://www.robotstxt.org/wc/norobots.html.
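For readers who want to see how a robot might interpret these rules, here is a minimal sketch using the robots.txt parser in Python's standard library. The /private/ path and the www.example.com URLs are placeholders chosen for illustration, not anything taken from a real site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# An empty Disallow value blocks nothing, so the whole site is open.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# Hypothetical rule set: keep every robot out of /private/.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

open_site = build_parser(ALLOW_ALL)
partly_closed = build_parser(BLOCK_PRIVATE)

BASE = "http://www.example.com"  # placeholder host
print(open_site.can_fetch("googlebot", BASE + "/private/report.html"))      # True
print(partly_closed.can_fetch("googlebot", BASE + "/private/report.html"))  # False
print(partly_closed.can_fetch("msnbot", BASE + "/index.html"))              # True
```

The parser matches each rule as a path prefix, which is why everything under the blocked directory is refused while the rest of the site stays crawlable.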

Now the question is: if a missing robots.txt file is an open permission to crawl, why bother creating one? The best reason is to save on bandwidth. Many sites are designed to deliver a standard page to help users who have landed on a missing page. A robot looking for a missing /robots.txt file would also receive this page, and while in most instances the standard error page will not cause any harm, the robot would still have to download and parse it, wasting bandwidth and resources. A safe practice to avoid this waste is to place an empty robots.txt on your Web site.
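The difference can be sketched roughly in Python. The error page below is made up for illustration, and the sketch assumes the robot simply feeds whatever body it receives to a robots.txt parser (here, the one in Python's standard library); real robots may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "page not found" helper page that a site might return
# in place of a missing /robots.txt.
ERROR_PAGE = """<html>
<head><title>Page not found</title></head>
<body>
<h1>Sorry, we could not find that page</h1>
<p>Try the <a href="http://www.example.com/">home page</a> or the site search.</p>
</body>
</html>
"""
EMPTY_FILE = ""


def rules_from(body: str) -> RobotFileParser:
    """Parse a response body as if it were robots.txt content."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser


for label, body in (("error page", ERROR_PAGE), ("empty robots.txt", EMPTY_FILE)):
    allowed = rules_from(body).can_fetch("googlebot", "http://www.example.com/news/")
    size = len(body.encode("utf-8"))
    print(f"{label}: {size} bytes sent, crawl allowed = {allowed}")
```

Both responses end in the same decision, since neither contains a rule, but only the empty file gets there without sending and parsing a page of HTML.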

Finally, understand that /robots.txt works based on the honor system. While most legitimate search engines follow its instructions, there is no way to enforce obedience via this file.
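To make the honor system concrete, here is a rough sketch of the check a well-behaved robot might run before crawling. The function name urls_to_crawl, the examplebot agent, and the URLs are all hypothetical. Note that the check lives in the robot's own code, so a robot that skips it faces no obstacle.

```python
from typing import Iterable, List, Optional
from urllib.robotparser import RobotFileParser


def urls_to_crawl(robots_txt: Optional[str], user_agent: str,
                  urls: Iterable[str]) -> List[str]:
    """Return the URLs a compliant robot would fetch.

    robots_txt is None when the site has no /robots.txt, which a
    robot takes as permission to crawl everything.
    """
    candidates = list(urls)
    if robots_txt is None:
        return candidates
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in candidates if parser.can_fetch(user_agent, url)]


# Placeholder site and rule set, for illustration only.
SITE_URLS = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]
RULES = "User-agent: *\nDisallow: /private/\n"

polite = urls_to_crawl(RULES, "examplebot", SITE_URLS)
no_file = urls_to_crawl(None, "examplebot", SITE_URLS)
print(polite)   # only the index page
print(no_file)  # both pages
```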

As a frequent reader of MSNBC's Web site, I started noticing this past weekend that their left-side navigation menu items no longer expanded. As of this writing, the menu has yet to regain its dynamic trait.

The expanding menu has been part of MSNBC's navigation for many years. As the user hovered over the different items, a submenu would branch off, displaying links to the top news for that section and other relevant sub-sections within. The sub-section items, once hovered over, would in turn open up their own menus displaying relevant links.

I always liked this functionality. It provided one-click access to the stories I wanted to view. Dynamic menus do come with some inherent issues, though. One of the most problematic is layering. Most dynamic menus have the unfortunate side effect of being eclipsed by active controls in a web browser. Those controls include items such as drop-down lists, applets, and Flash.

To solve that problem, MSNBC would hide the active controls on the page whenever the user hovered over a menu item, so that the expanded menu would not clash with other controls on the page. It meant that interactive banners would often vanish suddenly, and I suspect the advertisers weren't so pleased about their banners doing the disappearing act.

Now, with the expandable menu gone (at least for now), MSNBC is reaping several benefits, albeit at the expense of upsetting the dynamic menu fans. The banners no longer need to be hidden; users who want to see the relevant links now have to click through to the various section pages, where they are greeted with a sponsored splash page (read: more page impressions); and MSNBC.com no longer needs to maintain the dynamic menu.

According to one of our Web designers, most people dislike dynamic menus because they interfere with the page and irritate the user. Perhaps that was part of MSNBC's reasoning for killing its dynamic menu. But given the other benefits, I doubt MSNBC agonized much over this decision.

By the time I got home tonight, my 9-year-old was nearly finished with her homework. The only question remaining on her assignment sheet was "What is Olympus Mons?"

Now I knew I had heard this term before, but I just couldn't come up with a definitive answer. Was it a crater on the Moon? A rock formation on Mars? I was certain the term pertained to some off-Earth object, but space is a big place with lots of objects.

So I promised her that we would look it up in the dictionary after dinner. Her response: "Let's look it up on Google first?" I was struck by how fast the Internet has endeared itself even to elementary school kids. The truth is that I do the same when I am looking for something, so why shouldn't she? But somehow I can't help feeling sad about how drastically the Web has changed our culture.

Instead of opening a book or two, now we just Google it. In some ways we have been robbed of the fun and challenge of searching for something the old-fashioned way. But there is no defying progress.

As a compromise, I suggested we look the term up on wikipedia.com. At least that Web site has some semblance of a real encyclopedia. No dice: Wikipedia was stumped, though it came up with some suggested links. But clicking on those would have meant too much effort. And so Google became the clear winner, and we didn't even have to click on any search results. In the flash of a page load, the answer sat before us.

Olympus Mons, located on Mars, is the largest volcano in the solar system. A speedy answer, courtesy of the omniscient Google.

About this Archive

This page is an archive of entries from May 2005 listed from newest to oldest.
