First date with the Googlebot: Headers and compression

Thursday, March 06, 2008

[Image: Googlebot with flowers]

Name/User-Agent: Googlebot
IP Address: Learn how to verify Googlebot (see the sketch just below the profile)
Looking For: Websites with unique and compelling content
Major Turn Off: Violations of the Webmaster Guidelines

Googlebot, what a dreamboat. It's like they know us <head>, <body>, and soul. They're probably not looking for anything exclusive; they see billions of other sites (though we share our data with other bots as well), but tonight we'll really get to know each other as website and crawler.
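
The profile's "verify Googlebot" pointer refers to Google's documented reverse-then-forward DNS check: look up the hostname for the claimed crawler IP, confirm it falls under googlebot.com or google.com, then resolve that hostname back and make sure it returns the same IP. A minimal Python sketch; the is_googlebot helper name and the sample address are ours, for illustration:

import socket

def is_googlebot(ip):
    """Hypothetical helper implementing the documented
    reverse-then-forward DNS check."""
    try:
        # Step 1: reverse DNS. A genuine crawler IP resolves to a
        # hostname under googlebot.com or google.com.
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward DNS. That hostname must resolve back to the
        # original IP, which defeats spoofed reverse records.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        # Lookup failed; treat the visitor as unverified.
        return False

# Sample address from a published Googlebot range.
print(is_googlebot("66.249.66.1"))

A spoofed User-Agent header passes no such check, which is why the DNS lookup matters more than anything the visitor claims about itself.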

I know, it's never good to over-analyze a first date. We're going to get to know Googlebot a bit more slowly, in a series of posts:

  1. Our first date (tonight!): Headers Googlebot sends, file formats they "notice," whether it's better to compress data
  2. Judging their response: Response codes (301, 302), how they handle redirects and If-Modified-Since
  3. Next steps: Following links, having them crawl faster or slower (so they don't come on too strong)

And tonight is just the first date...


Googlebot: ACK

Website: Googlebot, you're here!

Googlebot: I am.

GET / HTTP/1.1
Host: example.com
Connection: Keep-alive
Accept: */*
From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
Accept-Encoding: gzip,deflate
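
Since tonight is partly about compression, you can check whether your own server honors that Accept-Encoding header by sending a similar request and inspecting the Content-Encoding of the reply. A small standard-library Python sketch; the URL and the script's User-Agent are placeholders:

import urllib.request

# Placeholder URL; point this at your own site.
req = urllib.request.Request(
    "https://example.com/",
    headers={
        # Mirror the Accept-Encoding header from the request above.
        "Accept-Encoding": "gzip,deflate",
        "User-Agent": "header-check/0.1",  # placeholder, not Googlebot
    },
)
with urllib.request.urlopen(req) as resp:
    # 'gzip' here means the server compressed the response.
    print("Content-Encoding:", resp.headers.get("Content-Encoding", "(none)"))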

Website: Those headers are so flashy! Would you crawl with the same headers if my site were in the U.S., Asia or Europe? Do you ever use different headers?

Googlebot: My headers are typically consistent worldwide. I'm trying to see what a page looks like with the site's default language and settings. Sometimes my User-Agent is different; for instance, AdSense fetches use Mediapartners-Google:

User-Agent: Mediapartners-Google

Or for image search:

User-Agent: Googlebot-Image/1.0

Wireless fetches often have carrier-specific user agents, whereas Google Reader RSS fetches include extra info, such as the number of subscribers.
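
If you're curious which of these fetchers is visiting, the User-Agent tokens above are enough to tell them apart in your logs. A rough server-side sketch, assuming Python; the tokens are the documented ones, but the mapping and function are our own illustration:

# Order matters: "Googlebot" is a substring of "Googlebot-Image",
# so the more specific tokens must be checked first.
GOOGLE_FETCHERS = [
    ("Mediapartners-Google", "AdSense"),
    ("Googlebot-Image", "image search"),
    ("Googlebot", "web search"),
]

def classify_fetcher(user_agent):
    for token, purpose in GOOGLE_FETCHERS:
        if token in user_agent:
            return purpose
    return "other"

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
print(classify_fetcher(ua))  # -> web search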

I usually avoid cookies (so no Cookie: header) since I don't want the content affected too much by session-specific info. And if a server uses a session ID in a dynamic URL rather than a cookie, I can usually figure that out, so I don't end up crawling the same page a million times under a million different session IDs.
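
Googlebot's actual session-ID detection is internal, but the idea is easy to illustrate: strip the session parameter so a million URL variants collapse to one canonical page. A toy Python sketch; the parameter names below are common conventions, not a definitive list:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; a real crawler infers these patterns.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonicalize(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    # Rebuild the URL without the session parameter.
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(canonicalize("https://example.com/page?id=7&sid=abc123"))
# -> https://example.com/page?id=7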

Website: I'm very complex. I have many file types. Your headers say Accept: */*. Do you index all URLs or are certain file extensions automatically filtered?

Googlebot: That depends on what I'm looking for. If I'm indexing for regular web search, and I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I will treat it differently than an HTML or PDF link. For instance, JPG is much less likely to change frequently than HTML, so I will check the JPG for changes less often to save bandwidth. Meanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the PDF article than the JPG file.
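
The scheduling logic Googlebot describes is, of course, its own. Purely as an illustration of the intuition (recheck HTML often, JPGs rarely), here is a toy Python sketch with made-up intervals:

# Made-up recheck intervals, in days; real scheduling is Googlebot's own.
RECHECK_DAYS = {".html": 1, ".pdf": 7, ".jpg": 30}

def recheck_interval(url, default=7):
    for ext, days in RECHECK_DAYS.items():
        if url.lower().endswith(ext):
            return days
    return default

print(recheck_interval("https://example.com/photo.jpg"))  # -> 30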