spidering blogs

Because I am personally offended by 404 errors in my logs, I finally got around to adding a robots.txt file to this blog. If you’re a web developer, you use robots.txt to control how spiders such as Google’s or Yahoo’s index your site: for example, to keep them from indexing particular parts of your site, or to keep a particular spider from hitting your site at all.

Initially I thought that I didn’t need a robots.txt because I wanted all parts of my blog to be indexed by everyone! More indexing good! But then I noticed that I was getting an enormous number of hits on trackback links (/mt/tb.cgi) which, since they are generated by a script, change every time they are accessed. That seems silly. If I add the trackbacks to the individual posting archives (something that’s on my to-do list), the content will get indexed via those pages, so I won’t need tb.cgi indexed at all. Ditto for comments if I ever get around to adding those (/mt/mt-comments.cgi), so those don’t need to be indexed either. Really, the whole /mt/ directory should be off limits to spiders in the first place. So. robots.txt:

User-agent: *
Disallow: /mt/
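
If you want to double-check that a robots.txt does what you think it does, Python’s standard library can parse it for you. Here’s a quick sketch that feeds the two lines above to `urllib.robotparser` and asks whether the trackback script is blocked (the /archives/ path is just a made-up example of a normal page):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /mt/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The whole /mt/ directory is off limits to every user agent...
print(rp.can_fetch("*", "/mt/tb.cgi"))            # False
print(rp.can_fetch("*", "/mt/mt-comments.cgi"))   # False

# ...but ordinary pages are still fair game.
print(rp.can_fetch("*", "/archives/000001.html")) # True
```

Well-behaved spiders do essentially this check before every fetch; badly behaved ones, of course, ignore robots.txt entirely.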