Then, during a peak time with high user traffic, the site switches its robots.txt to something highly restrictive:

```
# Can you go away for a while? I'll let you back
# again in the future. Really, I promise!

User-Agent: *
Disallow: /
```
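To make the effect of that swap concrete, here is a minimal sketch using Python's standard `urllib.robotparser` (not Googlebot's own parser; the URLs are placeholders): once `Disallow: /` is served for `User-Agent: *`, every URL on the site is off-limits to every compliant crawler.

```python
# Minimal sketch: how a compliant crawler evaluates the restrictive rules above.
# Uses Python's standard urllib.robotparser, not Googlebot's own parser.
from urllib.robotparser import RobotFileParser

restrictive_rules = """\
# Can you go away for a while? I'll let you back
# again in the future. Really, I promise!
User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(restrictive_rules.splitlines())

# With "Disallow: /" in place, nothing may be fetched, by any user agent.
for url in ("https://www.example.com/", "https://www.example.com/page1.html"):
    print(url, parser.can_fetch("Googlebot", url))  # prints False for both
```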
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["没有我需要的信息","missingTheInformationINeed","thumb-down"],["太复杂/步骤太多","tooComplicatedTooManySteps","thumb-down"],["内容需要更新","outOfDate","thumb-down"],["翻译问题","translationIssue","thumb-down"],["示例/代码问题","samplesCodeIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2008-03-01。"],[[["\u003cp\u003eGooglebot's headers are generally consistent globally, using variations for specific purposes like AdSense or image search.\u003c/p\u003e\n"],["\u003cp\u003eGooglebot selectively indexes content based on the search type, prioritizing HTML, PDFs, and images accordingly while avoiding less useful file types.\u003c/p\u003e\n"],["\u003cp\u003eGooglebot utilizes the \u003ccode\u003eContent-Type\u003c/code\u003e header to determine file formats and validates their structure before indexing to ensure content quality.\u003c/p\u003e\n"],["\u003cp\u003eGooglebot supports \u003ccode\u003egzip\u003c/code\u003e and \u003ccode\u003edeflate\u003c/code\u003e compression for content, with a preference for \u003ccode\u003egzip\u003c/code\u003e due to its robust encoding, which can improve crawl efficiency if implemented strategically.\u003c/p\u003e\n"],["\u003cp\u003eControlling crawl rate through robots.txt swapping is discouraged, and using Webmaster Tools to adjust crawl settings is recommended for better results.\u003c/p\u003e\n"]]],["Googlebot, a web crawler, interacts with a website to discuss its crawling behavior. Googlebot uses consistent headers globally but may vary the `User-Agent` for different tasks like AdSense or image search. It generally avoids cookies and prefers `gzip` compression. While it can download various file types, it prioritizes HTML, PDFs, and text, and it will likely download files with unknown extensions to assess the `Content-Type`. It prefers not to see a website frequently swap the robots.txt file, as it only checks it daily.\n"],null,["# First date with the Googlebot: Headers and compression\n\n| It's been a while since we published this blog post. Some of the information may be outdated (for example, some images may be missing, and some links may not work anymore).\n\nThursday, March 06, 2008\n\n\n**Name/User-Agent** : Googlebot \n\n**IP Address** :\n[Learn how to verify Googlebot](/search/docs/crawling-indexing/verifying-googlebot) \n\n**Looking For** : Websites with unique and compelling content \n\n**Major Turn Off** : Violations of the\n[Webmaster Guidelines](/search/docs/essentials)\n[Googlebot](/search/docs/crawling-indexing/googlebot) ---what a dreamboat. It's\nlike they know us `\u003chead\u003e`, `\u003cbody\u003e`, and soul. They're probably\nnot looking for anything exclusive; they see billions of other sites (though we share our data\nwith other bots as well), but tonight we'll really get to know each other as website and crawler.\n\n\nI know, it's never good to over-analyze a first date. We're going to get to know Googlebot a bit\nmore slowly, in a series of posts:\n\n1. Our first date (tonight!): Headers Googlebot sends, file formats they \"notice,\" whether it's better to compress data\n2. Judging their response: Response codes (`301`, `302`), how they handle redirects and `If-Modified-Since`\n3. Next steps: Following links, having them crawl faster or slower (so they don't come on too strong)\n\nAnd tonight is just the first date...\n\n*** ** * ** ***\n\n**Googlebot:** ACK\n\n**Website:** Googlebot, you're here!\n\n**Googlebot:** I am. 
\n\n```\nGET / HTTP/1.1\nHost: example.com\nConnection: Keep-alive\nAccept: */*\nFrom: googlebot(at)googlebot.com\nUser-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)\nAccept-Encoding: gzip,deflate\n```\n\n\n**Website:** Those headers are so flashy! Would you crawl with the same headers if my site\nwere in the U.S., Asia or Europe? Do you ever use different headers?\n\n\n**Googlebot:** My headers are typically consistent world-wide. I'm trying to see what a page\nlooks like for the default language and settings for the site. Sometimes the\n`User-Agent` is different, for instance AdSense fetches use\n`Mediapartners-Google`: \n\n```\nUser-Agent: Mediapartners-Google\n```\n\nOr for image search: \n\n```\nUser-Agent: Googlebot-Image/1.0\n```\n\n\nWireless fetches often have carrier-specific user agents, whereas Google Reader RSS fetches\ninclude extra info such as number of subscribers.\n\n\nI usually avoid cookies (so no `Cookie:` header) since I don't want the content\naffected too much by session-specific info. And, if a server uses a session id in a dynamic URL\nrather than a cookie, I can usually figure this out, so that I don't end up crawling your same\npage a million times with a million different session ids.\n\n\n**Website:** I'm very complex. I have many file types. Your headers say\n`Accept: */*`. Do you index all URLs or are certain file extensions automatically\nfiltered?\n\n**Googlebot:** That depends on what I'm looking for. If I'm indexing for regular web search,\nand I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I\nwill treat it differently than an HTML or PDF link. For instance, JPG is much less likely to\nchange frequently than HTML, so I will check the JPG for changes less often to save bandwidth.\nMeanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the\nPDF article than the JPG file. Downloading doodles (like JPGs) and videos of skateboarding dogs\nis distracting for a scholar---do you agree?\n\n\n**Website:** Yes, they can be distracting. I'm in awe of your dedication. I love doodles (JPGs)\nand find them hard to resist.\n\n\n**Googlebot:** Me, too; I'm not always so scholarly. When I crawl for image search, I'm very\ninterested in JPGs. And for news, I'm mostly looking at HTML and nearby images.\n\n\nThere are also plenty of extensions (exe, dll, zip, dmg...), that tend to be big and less useful\nfor a search engine.\n\n\n**Website:** If you saw my URL, `https://www.example.com/page1.LOL111`, would you\n(whimper whimper) reject it just because it contains an unknown file extension?\n\n\n**Googlebot:** Website, let me give a bit more background. After actually downloading a file, I\nuse the `Content-Type` header to check whether it really is HTML, an image, text, or\nsomething else. If it's a special data type like a PDF file, Word document, or Excel spreadsheet, I\n'll make sure it's in the valid format and extract the text content. Maybe it has a virus; you\nnever know. If the document or data type is really garbled, there's usually not much to do besides\ndiscard the content.\n\n\nSo, if I'm crawling `https://www.example.com/page1.LOL111` with an unknown file\nextension, it's likely that I would start to download it. 
If I can't figure out the content type\nfrom the header, or it's a format that we don't index (for example, mp3), then it'll be put aside.\nOtherwise, we proceed indexing the file.\n\n\n**Website:** My apologies for scrutinizing your style, Googlebot, but I noticed your\n`Accept-Encoding` headers say: \n\n```\nAccept-Encoding: gzip,deflate\n```\n\nCan you explain these headers to me?\n\n\n**Googlebot:** Sure. All major search engines and web browsers support gzip compression for\ncontent to save bandwidth. Other entries that you might see here include `x-gzip` (the\nsame as `gzip`), `deflate` (which we also support), and\n`identity` (none).\n\n\n**Website:** Can you talk more about file compression and\n`Accept-Encoding: gzip,deflate`? Many of my URLs consist of big Flash files and\nstunning images, not just HTML. Would it help you to crawl faster if I compressed my larger files?\n\n\n**Googlebot:** There's not a simple answer to this question. First of all, many file formats,\nsuch as swf (Flash), jpg, png, gif, and pdf are already compressed (there are also specialized\nFlash optimizers).\n\n\n**Website:** Perhaps I've been compressing my Flash files and I didn't even know? I'm obviously\nvery efficient.\n\n\n**Googlebot:** Both Apache and IIS have options to enable gzip and deflate compression, though\nthere's a CPU cost involved for the bandwidth saved. Typically, it's only enabled for easily\ncompressible text HTML/CSS/PHP content. And it only gets used if the user's browser or I (a search\nengine crawler) allow it. Personally, I prefer `gzip` over `deflate`. Gzip\nis a slightly more robust encoding---there is consistently a checksum and a full header,\ngiving me less guess-work than with deflate. Otherwise they're very similar compression\nalgorithms.\n\n\nIf you have some spare CPU on your servers, it might be worth experimenting with compression\n(links:\n[Apache](https://www.sitepoint.com/article/web-output-mod_gzip-apache),\n[IIS](https://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/502ef631-3695-4616-b268-cbe7cf1351ce.mspx?mfr=true)).\nBut, if you're serving dynamic content and your servers are already heavily CPU loaded, you might\nwant to hold off.\n\n\n**Website:** Great information. I'm really glad you came tonight---thank goodness my\n[robots.txt](/search/docs/crawling-indexing/robots/intro) allowed it. That file can be like an\nover-protective parent!\n\n\n**Googlebot:** Ah yes; meeting the parents, the robots.txt. I've met plenty of intense ones.\nSome are really just HTML error pages rather than valid robots.txt. Some have infinite redirects\nall over the place, maybe to totally unrelated sites, while others are just huge and have\nthousands of different URLs listed individually. Here's one unfortunate pattern. The site is\nnormally eager for me to crawl: \n\n```\nUser-Agent: *\nAllow: /\n```\n\n\nThen, during a peak time with high user traffic, the site switches the robots.txt to something\nrestrictive: \n\n```\n# Can you go away for a while? I'll let you back\n# again in the future. Really, I promise!\n\nUser-Agent: *\nDisallow: /\n```\n\n\nThe problem with the above robots.txt file-swapping is that once I see the restrictive robots.txt,\nI may have to start throwing away content I've already crawled in the index. And then I have to\nrecrawl a lot of content once I'm allowed to crawl the site again. 
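If the goal is only to shed load during a traffic spike, a temporary `503 Service Unavailable` response says "come back later" without implying the content is gone. Here is a minimal sketch using Python's standard `http.server`; the `is_overloaded()` check and the `Retry-After` value are hypothetical placeholders, not anything Googlebot requires.

```python
# Minimal sketch: return 503 with Retry-After during an overload window
# instead of swapping in a restrictive robots.txt.
from http.server import BaseHTTPRequestHandler, HTTPServer

def is_overloaded() -> bool:
    # Hypothetical placeholder: wire this to a real signal
    # (load average, request queue depth, an ops flag, ...).
    return True

class TemporarilyBusyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_overloaded():
            self.send_response(503)                  # temporary: "try again later"
            self.send_header("Retry-After", "3600")  # optional hint, in seconds
            self.end_headers()
            self.wfile.write(b"Busy right now; please retry later.\n")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Normal content</body></html>\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), TemporarilyBusyHandler).serve_forever()
```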
I typically only re-check robots.txt once a day (otherwise, on many virtual hosting sites, I'd spend a large fraction of my fetches just getting robots.txt, and no date wants to "meet the parents" that often). For webmasters, trying to control crawl rate through robots.txt swapping usually backfires. It's better to [set the rate to "slower"](https://support.google.com/webmasters/answer/48620) in Webmaster Tools.

**Googlebot:** Website, thanks for all of your questions; you've been wonderful, but I'm going to have to say "FIN, my love."

**Website:** Oh, Googlebot... ACK/FIN. :)

*** ** * ** ***

Written by [Maile Ohye](/search/blog/authors/maile-ohye) as the website, Jeremy Lilley as the Googlebot