Thursday, March 06, 2008
IP Address: Learn how to verify Googlebot
Looking For: Websites with unique and compelling content
Major Turn Off: Violations of the Webmaster Guidelines
Googlebot, what a dreamboat. It's like they know us
<body>, and soul. They're probably
not looking for anything exclusive; they see billions of other sites (though we share our data
with other bots as well), but tonight we'll really get to know each other as website and crawler.
I know, it's never good to over-analyze a first date. We're going to get to know Googlebot a bit more slowly, in a series of posts:
- Our first date (tonight!): Headers Googlebot sends, file formats they "notice," whether it's better to compress data
- Judging their response: Response codes (302) and how they handle redirects
- Next steps: Following links, having them crawl faster or slower (so they don't come on too strong)
And tonight is just the first date...
Website: Googlebot, you're here!
Googlebot: I am.
GET / HTTP/1.1
Host: example.com
Connection: Keep-alive
Accept: */*
From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
Accept-Encoding: gzip,deflate
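For a website that wants to read those headers programmatically, here is a minimal Python sketch. It parses the request shown above and, because the Accept-Encoding header lists gzip, compresses the response body. The helper names (parse_headers, accepts_gzip, encode_body) are made up for illustration, and the Accept-Encoding check is deliberately simplified: it ignores quality values such as q=0.

```python
import gzip

# The request from the conversation above, written out as raw HTTP text.
RAW_REQUEST = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: Keep-alive\r\n"
    "Accept: */*\r\n"
    "From: googlebot(at)googlebot.com\r\n"
    "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+https://www.google.com/bot.html)\r\n"
    "Accept-Encoding: gzip,deflate\r\n"
    "\r\n"
)


def parse_headers(raw):
    """Return the request line and a dict of lower-cased header names."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs inside values survive.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


def accepts_gzip(headers):
    """Simplified check: is 'gzip' listed in Accept-Encoding? (q-values ignored)"""
    raw_value = headers.get("accept-encoding", "")
    encodings = [t.split(";")[0].strip().lower() for t in raw_value.split(",")]
    return "gzip" in encodings


def encode_body(body, headers):
    """Compress the body only when the client advertised gzip support."""
    if accepts_gzip(headers):
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}


request_line, headers = parse_headers(RAW_REQUEST)
payload, extra_headers = encode_body(b"<html>hello</html>" * 50, headers)
```

A client that sends no Accept-Encoding header gets the body back uncompressed and without a Content-Encoding header.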
Website: Those headers are so flashy! Would you crawl with the same headers if my site were in the U.S., Asia or Europe? Do you ever use different headers?
Googlebot: My headers are typically consistent world-wide. I'm trying to see what a page
looks like for the default language and settings for the site. Sometimes the
User-Agent is different, for instance for AdSense fetches or for image search.
Wireless fetches often have carrier-specific user agents, whereas Google Reader RSS fetches include extra info such as number of subscribers.
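A website that wants to tell these fetches apart in its logs could match on the User-Agent string. The sketch below is one hedged way to do that in Python. The tokens Mediapartners-Google and Googlebot-Image are assumptions taken from Google's public crawler documentation, not from this conversation, and classify_user_agent is a hypothetical helper name. A User-Agent header can be spoofed, so this labels traffic but does not verify it.

```python
# (token, label) pairs; more specific tokens come first because
# "Googlebot-Image" also contains the substring "Googlebot".
# The AdSense and image tokens are assumptions from Google's public docs.
CRAWLER_TOKENS = [
    ("Googlebot-Image", "image search"),
    ("Mediapartners-Google", "AdSense"),
    ("Googlebot", "web search"),
]


def classify_user_agent(user_agent):
    """Return a label for a known Google crawler token, or None."""
    ua = user_agent.lower()
    for token, label in CRAWLER_TOKENS:
        if token.lower() in ua:
            return label
    return None


label = classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
)
```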
I usually avoid cookies (so no
Cookie: header) since I don't want the content
affected too much by session-specific info. And, if a server uses a session id in a dynamic URL
rather than a cookie, I can usually figure this out, so that I don't end up crawling your same
page a million times with a million different session ids.
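To see why session ids in URLs are a problem, here is a small Python sketch of how a crawler (or a site's own tooling) might collapse such URLs into one canonical form. This is not Googlebot's actual method, which the conversation does not describe. The parameter names in SESSION_PARAMS and the function name strip_session_id are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that commonly carry a session id.
SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Drop session-like query parameters so duplicate pages share one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same page with different session ids collapse to one URL.
first = strip_session_id("http://example.com/page?id=7&sid=abc123")
second = strip_session_id("http://example.com/page?id=7&sid=zzz999")
```

Both calls return the same URL, so the page would only need to be fetched once.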
Website: I'm very complex. I have many file types. Your headers say
Accept: */*. Do you index all URLs or are certain file extensions automatically filtered?
Googlebot: That depends on what I'm looking for. If I'm indexing for regular web search, and I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I will treat it differently than an HTML or PDF link. For instance, JPG is much less likely to change frequently than HTML, so I will check the JPG for changes less often to save bandwidth. Meanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the PDF article than the JP