Large site owner's guide to managing your crawl budget
This guide describes how to optimize Google's crawling of very large and frequently updated sites.
If your site does not have a large number of pages that change rapidly, or if your pages seem to be crawled the same day that they are published, you don't need to read this guide; merely keeping your sitemap up to date and checking your index coverage regularly is adequate.
If you have content that's been available for a while but has never been indexed, this is a different problem; use the URL Inspection tool instead to find out why your page isn't being indexed.
Who this guide is for
This is an advanced guide and is intended for:
- Large sites (1 million+ unique pages) with content that changes moderately often (once a week)
- Medium or larger sites (10,000+ unique pages) with very rapidly changing content (daily)
- Sites with a large portion of their total URLs classified by Search Console as Discovered - currently not indexed
General theory of crawling
The web is a nearly infinite space, exceeding Google's ability to explore and index every available URL. As a result, there are limits to how much time Googlebot can spend crawling any single site. The amount of time and resources that Google devotes to crawling a site is commonly called the site's crawl budget. Note that not everything crawled on your site will necessarily be indexed; each page must be evaluated, consolidated, and assessed to determine whether it will be indexed after it has been crawled.
Crawl budget is determined by two main elements: crawl capacity limit and crawl demand.
Crawl capacity limit
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
The crawl capacity limit can go up and down based on a few factors:
- Crawl health: If the site responds quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
- Google's crawling limits: Google has a lot of machines, but not infinite machines. We still need to make choices with the resources that we have.
Crawl demand
Google typically spends as much time as necessary crawling a site, given its size, update frequency, page quality, and relevance, compared to other sites.
The factors that play a significant role in determining crawl demand are:
- Perceived inventory: Without guidance from you, Googlebot will try to crawl all or most of the URLs that it knows about on your site. If many of these URLs are duplicates, or you don't want them crawled for some other reason (removed, unimportant, and so on), this wastes a lot of Google crawling time on your site. This is the factor that you can positively control the most.
- Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
- Staleness: Our systems want to recrawl documents frequently enough to pick up any changes.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl. Even if the crawl capacity limit isn't reached, if crawl demand is low, Googlebot will crawl your site less.
Follow these best practices to maximize your crawling efficiency:
- Manage your URL inventory: Use the appropriate tools to tell Google which pages to crawl and which not to crawl. If Google spends too much time crawling URLs that aren't appropriate for the index, Googlebot might decide that it's not worth the time to look at the rest of your site (or increase your budget to do so).
- Consolidate duplicate content. Eliminate duplicate content to focus crawling on unique content rather than unique URLs.
- Block crawling of URLs using robots.txt. Some pages might be important to users, but you don't necessarily want them to appear in Search results. For example, infinite scrolling pages that duplicate information on linked pages, or differently sorted versions of the same page. If you can't consolidate them as described in the first bullet, block these unimportant (for search) pages using robots.txt. Blocking URLs with robots.txt significantly decreases the chance the URLs will be indexed.
- Return a 404 or 410 status code for permanently removed pages. Google won't forget a URL that it knows about, but a 404 status code is a strong signal not to crawl that URL again. Blocked URLs, however, will stay part of your crawl queue much longer, and will be recrawled when the block is removed.
- Eliminate soft 404 errors. Soft 404 pages will continue to be crawled and waste your budget. Check the Index Coverage report for soft 404 errors.
- Keep your sitemaps up to date. Google reads your sitemap regularly, so be sure to include all the content that you want Google to crawl. If your site includes updated content, we recommend including the <lastmod> tag.
- Avoid long redirect chains, which have a negative effect on crawling.
- Make your pages efficient to load. If Google can load and render your pages faster, we might be able to read more content from your site.
- Monitor your site crawling. Monitor whether your site had any availability issues during crawling, and look for ways to make your crawling more efficient.
Monitor your site's crawling and indexing
Here are the key steps to monitoring your site's crawl profile:
- See if Googlebot is encountering availability issues on your site.
- See whether you have pages that aren't being crawled, but should be.
- See whether any parts of your site need to be crawled more quickly than they already are.
- Improve your site's crawl efficiency.
- Handle overcrawling of your site.
See if Googlebot is encountering availability issues on your site
Improving your site availability won't necessarily increase your crawl budget; Google determines the best crawl rate based on the crawl demand, as described previously. However, availability issues do prevent Google from crawling your site as much as it might want to.
Use the Crawl Stats report to see Googlebot's crawling history for your site. The report shows when Google encountered availability issues on your site. If availability errors or warnings are reported for your site, look for instances in the Host availability graphs where Googlebot requests exceeded the red limit line, click into the graph to see which URLs were failing, and try to correlate those with issues on your site.
You can also use the URL Inspection tool to test a few URLs on your site. If the tool returns Hostload exceeded warnings, Googlebot can't crawl as many URLs from your site as it discovered.
- Read the documentation for the Crawl Stats report to learn how to find and handle some availability issues.
- Block pages from crawling if you don't want them to be crawled. (See manage your inventory)
- Increase page loading and rendering speed. (See Improve your site's crawl efficiency)
- Increase your server capacity. If Google consistently seems to be crawling your site at its serving capacity limit, but you still have important URLs that aren't being crawled or updated as much as they need, having more serving resources might enable Google to request more pages on your site. Check your host availability history in the Crawl Stats report to see if Google's crawl rate seems to be crossing the limit line often. If so, increase your serving resources for a month and see whether crawling requests increased during that same period.
See if any parts of your site are not crawled, but should be
Google spends as much time as necessary on your site in order to index all the high-quality, user-valuable content that it can find. If you think that Googlebot is missing important content, either it doesn't know about the content, the content is blocked from Google, or your site availability is throttling Google's access (or Google is trying not to overload your site).
Search Console doesn't provide a crawl history for your site that can be filtered by URL or path, but you can inspect your site logs to see whether specific URLs have been crawled by Googlebot. Whether or not those crawled URLs have been indexed is another story.
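As a rough sketch of the log inspection described above, the following Python snippet scans access-log lines and reports the most recent time each path was requested by a client whose user agent contains "Googlebot". The combined log format, the sample lines, and the function name `last_googlebot_crawl` are assumptions for illustration; adapt the pattern to your server's actual log format. A user-agent match alone doesn't prove that a request came from Google.

```python
import re
from datetime import datetime

# Assumed combined log format:
# IP - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def last_googlebot_crawl(lines):
    """Map each path to the latest timestamp of a Googlebot-labelled request."""
    latest = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue  # Skip unparseable lines and other clients.
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in latest or when > latest[path]:
            latest[path] = when
    return latest

# Made-up sample lines standing in for a real access log.
sample_log = [
    '66.249.66.1 - - [10/Mar/2024:06:25:01 +0000] "GET /products/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Mar/2024:07:00:00 +0000] "GET /products/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '66.249.66.1 - - [12/Mar/2024:09:10:30 +0000] "GET /products/1 HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

crawled = last_googlebot_crawl(sample_log)
for path, when in sorted(crawled.items()):
    print(path, when.isoformat())
```

Paths that are missing from the result were not requested by a Googlebot-labelled client in the log window you examined.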
Remember that for most sites, new pages will take several days minimum to be noticed; most sites shouldn't expect same-day crawling for URLs, with the exception of time-sensitive sites such as news sites.
If you are adding pages to your site and they are not being crawled in a reasonable amount of time, either Google doesn't know about them, the content is blocked, your site has reached its maximum serving capacity, or you are out of crawl budget.
- Tell Google about your new pages: update your sitemaps to reflect new URLs.
- Examine your robots.txt rules to confirm that you're not accidentally blocking pages.
- Review your crawling priorities (a.k.a. use your crawl budget wisely). Manage your inventory and improve your site's crawling efficiency.
- Check that you're not running out of serving capacity. Googlebot will scale back its crawling if it detects that your servers are having trouble responding to crawl requests.
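One way to check for accidental blocking, sketched here with Python's standard `urllib.robotparser` module: feed it your robots.txt rules and a list of URLs that you expect to be crawlable. The rules and URLs below are hypothetical. Note that this parser uses simple prefix matching and its own rule precedence, so treat its answer as a first check and not as a statement of how Googlebot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; replace with your site's actual file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""

def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical URLs that you want crawled.
important_urls = [
    "https://example.com/products/blue-widget",
    "https://example.com/search-tips/",
    "https://example.com/cart/checkout",
]

for url in blocked_urls(ROBOTS_TXT, important_urls):
    print("Blocked:", url)
```

In this sample, the rule `Disallow: /search` also blocks `/search-tips/` because the rule is a path prefix, which is the kind of accidental block worth looking for.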
Note that pages might not be shown in search results, even if crawled, if there isn't sufficient value or user demand for the content.
See if updates are crawled quickly enough
If we're missing new or updated pages on your site, perhaps it's because we haven't seen them, or haven't noticed that they are updated. Here is how you can help us be aware of page updates.
Note that Google strives to check and index pages in a reasonably timely manner. For most sites, this is three days or more. Don't expect Google to index pages the same day that you publish them unless you are a news site or have other high-value, extremely time-sensitive content.
Examine your site logs to see when specific URLs were crawled by Googlebot.
To learn the indexing date, use the URL Inspection tool or do a Google search for URLs that you updated.
- Use a news sitemap if your site has news content.
- Use the <lastmod> tag in sitemaps to indicate when an indexed URL has been updated.
- Use a simple URL structure to help Google find your pages.
- Provide standard, crawlable <a> links to help Google find your pages.
Avoid the following:
- Submitting the same, unchanged sitemap multiple times per day.
- Expecting that Googlebot will crawl everything in a sitemap, or crawl them immediately. Sitemaps are useful suggestions to Googlebot, not absolute requirements.
- Including URLs in your sitemaps that you don't want to appear in Search. This can waste your crawl budget on pages that you don't want indexed.
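To illustrate the <lastmod> recommendation, here is a minimal sketch that builds a sitemap from a list of pages and their last-modified dates using Python's standard library. The URLs, dates, and the function name `build_sitemap` are placeholders; in practice the dates would come from your content system and should reflect real content changes.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # <lastmod> uses the W3C date format, for example 2024-03-12.
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Placeholder pages; only include URLs that you want to appear in Search.
pages = [
    ("https://example.com/", date(2024, 3, 12)),
    ("https://example.com/news/launch", date(2024, 3, 10)),
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```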
Improve your site's crawl efficiency
Increase your page loading speed
Google's crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests quicker, we might be able to crawl more pages on your site. That said, Google only wants to crawl high quality content, so simply making low quality pages faster won't encourage Googlebot to crawl more of your site; conversely, if we think that we're missing high-quality content on your site, we'll probably increase your budget to crawl that content.
Here's how you can optimize your pages and resources for crawling:
- Prevent large but unimportant resources from being loaded by Googlebot using robots.txt. Be sure to block only non-critical resources—that is, resources that aren't important to understanding the meaning of the page (such as decorative images).
- Make sure that your pages are fast to load.
- Watch out for long redirect chains, which have a negative effect on crawling.
- Both the time to respond to server requests and the time needed to render pages matter, including load and run time for embedded resources such as images and scripts. Be aware of large or slow resources required for indexing.
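For the redirect chain advice above, one possible approach is to audit your redirect rules offline. The sketch below assumes that you can export your redirects as a mapping from source URL to target URL; the mapping, the function names, and the one-hop threshold are illustrative assumptions, not limits published by Google.

```python
def redirect_chain(start_url, redirects):
    """Return every URL visited when following redirects from start_url."""
    chain = [start_url]
    while chain[-1] in redirects:
        next_url = redirects[chain[-1]]
        if next_url in chain:
            raise ValueError(f"Redirect loop detected at {next_url}")
        chain.append(next_url)
    return chain

def long_chains(redirects, max_hops=1):
    """Find starting URLs whose redirect chain has more than max_hops hops."""
    found = {}
    for start in redirects:
        chain = redirect_chain(start, redirects)
        if len(chain) - 1 > max_hops:
            found[start] = chain
    return found

# Hypothetical export of a site's redirect rules.
redirect_map = {
    "http://example.com/a": "https://example.com/a",
    "https://example.com/a": "https://example.com/a/",
    "https://example.com/a/": "https://example.com/new-a",
    "https://example.com/b": "https://example.com/new-b",
}

for start, chain in long_chains(redirect_map).items():
    print(f"{start}: {len(chain) - 1} hops, final target {chain[-1]}")
```

Each reported URL could be pointed directly at its final target, which shortens the chain to a single hop.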
Specify content changes with HTTP status codes
Google generally supports the If-Modified-Since and If-None-Match HTTP request headers for crawling. Google's crawlers don't send the headers with all crawl attempts; it depends on the use case of the request (for example, AdsBot is more likely to set the If-Modified-Since and If-None-Match HTTP request headers). If our crawlers send the If-Modified-Since header, the header's value is the date and time the content was last crawled. Based on that value, the server may choose to return a 304 (Not Modified) HTTP status code with no response body, in which case Google will reuse the content version it crawled the last time. If the content is newer than the date specified by the crawler in the If-Modified-Since header, the server can return a 200 (OK) HTTP status code with the response body.
Independently of the request headers, you can send a 304 (Not Modified) HTTP status code and no response body for any Googlebot request if the content hasn't changed since Googlebot last visited the URL. This will save your server processing time and resources, which may indirectly improve crawl efficiency.
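The decision between 304 and 200 can be sketched as a small function. This is a simplified illustration, not a complete implementation of HTTP conditional requests: the function name and arguments are hypothetical, header names are looked up with exact capitalization, and entity tags are compared after removing any weak `W/` prefix.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200.

    request_headers: dict of request header names to values.
    etag: the current entity tag of the resource, including quotes.
    last_modified: timezone-aware datetime of the last content change.
    """
    def strip_weak(tag):
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When present, If-None-Match is evaluated instead of If-Modified-Since.
        candidates = [strip_weak(tag) for tag in if_none_match.split(",")]
        matched = "*" in candidates or strip_weak(etag) in candidates
        return 304 if matched else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution.
            unchanged = last_modified.replace(microsecond=0) <= since
        except (TypeError, ValueError):
            return 200  # Unusable date: serve the full response.
        return 304 if unchanged else 200

    return 200

# Example: content last changed on 10 March, crawler last fetched on 12 March.
changed = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
headers = {"If-Modified-Since": "Tue, 12 Mar 2024 09:10:30 GMT"}
print(conditional_status(headers, '"v1"', changed))
```

When the function returns 304, the server sends no response body; when it returns 200, it sends the current content as usual.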
Hide URLs that you don't want in search results
Wasting server resources on unnecessary pages can reduce crawl activity from pages that are important to you, which may cause a significant delay in discovering great new or updated content on a site.
Exposing many URLs on your site that you don't want crawled by Search can negatively affect a site's crawling and indexing. Typically these URLs fall into the following categories:
- Faceted navigation and session identifiers: Faceted navigation is typically duplicate content from the site; session identifiers and other URL parameters that simply sort or filter the page don't provide new content. Use robots.txt to block faceted navigation pages.
- Duplicate content: Help Google identify duplicate content to avoid unnecessary crawling.
- Soft 404 pages: Return a 404 code when a page no longer exists.
- Hacked pages: Be sure to check the Security Issues report and fix or remove any hacked pages you find.
- Infinite spaces and proxies: Block these from crawling with robots.txt.
- Low quality and spam content: Good to avoid, obviously.
- Shopping cart pages, infinite scrolling pages, and pages that perform an action (such as "sign up" or "buy now" pages).
- Use robots.txt if you don't want Google to crawl a resource or page at all.
- Don't add or remove pages or directories from robots.txt regularly as a way of reallocating crawl budget for your site. Use robots.txt only for pages or resources that you don't want to appear on Google for the long run.
- Don't rotate sitemaps or use other temporary hiding mechanisms to reallocate budget.
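To get a sense of how many of your URLs differ only by session identifiers or sorting parameters, you could group known URLs by a normalized form. The sketch below is one possible approach; the parameter names in `IGNORED_PARAMS` and the sample URLs are assumptions that you would replace with the parameters your own site uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only sort the page or track a session.
IGNORED_PARAMS = {"sessionid", "sort", "order"}

def normalize(url):
    """Drop ignored parameters and sort the rest to get a comparable URL."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )

def duplicate_groups(urls):
    """Group URLs that normalize to the same address; keep groups of 2+."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Made-up URLs, for example taken from server logs or a site crawl.
known_urls = [
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?sessionid=abc123",
    "https://example.com/shoes",
    "https://example.com/shoes?color=red",
]

for canonical, members in duplicate_groups(known_urls).items():
    print(canonical, "<-", len(members), "variants")
```

Large groups point to URL patterns that are candidates for consolidation or for a long-term robots.txt rule.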
Handle overcrawling of your site (emergencies)
Googlebot has algorithms to prevent it from overwhelming your site with crawl requests. However, if you find that Googlebot is overwhelming your site, there are a few things you can do.
Monitor your server for excessive Googlebot requests to your site.
In an emergency, we recommend the following steps to slow down an overwhelming crawl from Googlebot:
- Return 503 or 429 HTTP response status codes temporarily for Googlebot requests when your server is overloaded. Googlebot will retry these URLs for about 2 days. Note that returning "no availability" codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps.
- When the crawl rate goes down, stop returning 503 or 429 HTTP response status codes for crawl requests; returning 503 or 429 for more than 2 days will cause Google to drop those URLs from the index.
- Monitor your crawling and your host capacity over time.
- If the problematic crawler is one of the AdsBot crawlers, the problem is likely that you have created Dynamic Search Ad targets for your site that Google is trying to crawl. This crawl will reoccur every 3 weeks. If you don't have the server capacity to handle these crawls, either limit your ad targets or get increased serving capacity.
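The temporary throttling described in the emergency steps above could be sketched as follows. This is an illustration under stated assumptions, not a production load shedder: the class name, the `load` measure (fraction of server capacity in use), and the 0.9 threshold are hypothetical. The 2-day cap follows the guidance above about not returning 429 for longer than that.

```python
from datetime import datetime, timedelta, timezone

# From the guidance above: don't keep returning 429 for more than 2 days.
MAX_THROTTLE = timedelta(days=2)

class CrawlThrottle:
    """Decide when to answer Googlebot requests with 429 during overload."""

    def __init__(self, load_threshold=0.9):
        self.load_threshold = load_threshold  # Hypothetical overload cutoff.
        self.throttling_since = None

    def status_for(self, user_agent, load, now):
        """Return 429 for Googlebot while overloaded, otherwise 200."""
        if load < self.load_threshold:
            self.throttling_since = None  # Load is back to normal.
            return 200
        if "Googlebot" not in user_agent:
            return 200
        if self.throttling_since is None:
            self.throttling_since = now
        if now - self.throttling_since >= MAX_THROTTLE:
            # Throttled for too long: stop, and address capacity another way.
            return 200
        return 429

throttle = CrawlThrottle()
start = datetime(2024, 3, 10, tzinfo=timezone.utc)
print(throttle.status_for("Googlebot/2.1", 0.95, start))
print(throttle.status_for("Googlebot/2.1", 0.95, start + timedelta(days=2)))
```

In a real deployment you would also verify that a request claiming to be Googlebot actually comes from Google before treating it differently from other traffic.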
Myths and facts about crawling
Test your knowledge on how Google crawls and indexes websites.
5xx HTTP response status codes (server errors) or connection timeouts signal the opposite, and crawling slows down. We recommend paying attention to the Crawl Stats report in Search Console and keeping the number of server errors low.
The nofollow rule affects crawl budget.
Even if your page marks a URL as nofollow, it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.
You can use noindex to control crawl budget.
noindex is there to help you keep things out of the index. If you want to ensure that those pages don't end up in Google's index, continue using noindex and don't worry about crawl budget. It's also important to note that if you remove URLs from Google's index with noindex or otherwise, Googlebot can focus on other URLs on your site, which means noindex can indirectly free up some crawl budget for your site in the long run.
Pages that serve 4xx HTTP status codes are wasting crawl budget.
Pages that serve 4xx HTTP status codes (except 429) don't waste crawl budget. Google attempted to crawl the page, but received a status code and no other content.