What Crawl Budget Means for Googlebot

Monday, January 16, 2017

Recently, we've heard a number of definitions for "crawl budget"; however, we don't have a single term that would describe everything that "crawl budget" stands for externally. With this post we'll clarify what we actually have and what it means for Googlebot.

First, we'd like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.

Prioritizing what to crawl, when to crawl it, and how many resources the server hosting the site can allocate to crawling matters more for bigger sites, or for sites that auto-generate pages based on URL parameters, for example.

Crawl rate limit

Googlebot is designed to be a good citizen of the web. Crawling is its main priority, but it also makes sure that crawling doesn't degrade the experience of users visiting the site. We call this the "crawl rate limit," which limits the maximum fetching rate for a given site.

Simply put, this represents the number of parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between fetches. The crawl rate can go up and down based on a couple of factors:

Crawl health: If the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
Limit set in Search Console: Website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.
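
The two factors above can be pictured with a small sketch. This is a toy model, not Googlebot's actual algorithm: the class name `CrawlRateLimiter`, the thresholds (0.2 s and 2 s), the step sizes, and the per-fetch adjustment are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRateLimiter:
    """Toy per-site crawl rate limit. All names and numbers are hypothetical."""

    connections: int = 2              # parallel connections currently allowed
    delay_s: float = 1.0              # wait between fetches, in seconds
    max_connections: int = 10         # ceiling used by this sketch only
    owner_cap: Optional[int] = None   # optional limit set by the site owner

    def record_fetch(self, response_time_s: float, status: int) -> None:
        """Adjust the limit after one fetch, based on crawl health."""
        if status >= 500 or response_time_s > 2.0:
            # Server errors or slow responses: back off and crawl less.
            self.connections = max(1, self.connections // 2)
            self.delay_s = min(60.0, self.delay_s * 2)
        elif response_time_s < 0.2:
            # Quick, healthy responses: allow slightly more crawling.
            self.connections = min(self.max_connections, self.connections + 1)
            self.delay_s = max(0.1, self.delay_s * 0.9)

    def allowed_connections(self) -> int:
        """Apply the owner's cap; a cap above the current limit changes nothing."""
        if self.owner_cap is None:
            return self.connections
        return min(self.connections, self.owner_cap)
```

In this sketch the owner's setting can only lower the effective limit, which mirrors the note that setting a higher limit doesn't automatically increase crawling.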

Crawl demand

Even if the crawl rate limit isn't reached, if there's no demand from indexing, there will be low activity from Googlebot. The two factors that play a significant role in determining crawl demand are:

Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
Staleness: Our systems attempt to prevent URLs from becoming stale in the index.

Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
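
As a rough illustration of how popularity and staleness could combine into a single demand signal, here is a minimal sketch. The function name `crawl_demand`, the equal weighting, and the seven-day staleness window are assumptions made for this example; the post does not describe how the real signals are computed or combined.

```python
from datetime import datetime, timedelta


def crawl_demand(
    popularity: float,
    last_crawled: datetime,
    now: datetime,
    stale_after: timedelta = timedelta(days=7),
) -> float:
    """Return a hypothetical demand score in [0, 1] for one URL.

    popularity is assumed to be pre-normalized to [0, 1]; staleness grows
    linearly with time since the last crawl and saturates at stale_after.
    """
    popularity = min(max(popularity, 0.0), 1.0)
    age = max(now - last_crawled, timedelta(0))
    staleness = min(age / stale_after, 1.0)
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * popularity + 0.5 * staleness
```

With these assumptions, a popular URL that hasn't been crawled for two weeks scores 1.0, while an unpopular URL crawled a moment ago scores 0.0.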

Taking crawl rate and crawl demand together, we define crawl budget as the number of URLs Googlebot can and wants to crawl.
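
One way to read "can and wants to crawl" is as an intersection: the URLs that have enough demand, truncated to what the rate limit allows. The sketch below assumes a hypothetical per-URL demand score, a `capacity` standing in for the rate limit, and an arbitrary `min_demand` threshold; none of these names come from the post.

```python
from typing import Dict, List


def select_urls_to_crawl(
    demand_by_url: Dict[str, float],
    capacity: int,
    min_demand: float = 0.1,
) -> List[str]:
    """Pick the URLs a crawler both wants (demand) and can (capacity) fetch."""
    # "Wants": keep only URLs whose demand meets the threshold.
    wanted = [url for url, demand in demand_by_url.items() if demand >= min_demand]
    # Highest demand first, so the limited capacity goes to the URLs that matter most.
    wanted.sort(key=lambda url: demand_by_url[url], reverse=True)
    # "Can": never exceed what the rate limit permits.
    return wanted[: max(capacity, 0)]
```

If demand is low, the result is shorter than `capacity`, which matches the point that a site can see little crawl activity even when the rate limit isn't reached.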

Factors affecting crawl budget

According to our analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing. We found that the low-value-add URLs fall into these categories, in order of significance:

Faceted navigation and session identifiers
On-site duplicate content
Soft error pages
Hacked pages
Infinite spaces and proxies
Low quality and spam content

Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.
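
To get a feel for how faceted navigation and session identifiers multiply URLs, a site owner could group the URLs from a crawl or server log by a normalized form. The sketch below uses only Python's standard `urllib.parse`; the parameter names in `IGNORED_PARAMS` are hypothetical and would need to be replaced with the ones a given site actually uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session and facet parameters; adjust for the site in question.
IGNORED_PARAMS = {"sessionid", "sid", "sort", "color", "size"}


def canonicalize(url: str) -> str:
    """Drop ignored query parameters and the fragment, and sort the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Map each normalized URL to its variants, keeping only groups with duplicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return {canonical: variants for canonical, variants in groups.items() if len(variants) > 1}
```

Large groups point to URL patterns that likely serve the same content many times over, which is the kind of low-value-add crawling described above.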

Top questions

Does site speed affect my crawl budget? How about errors?

Is crawling a ranking factor?

Do alternate URLs and embedded content count in the crawl budget?

Can I control Googlebot with the `crawl-delay` rule?

Does the `nofollow` rule affect crawl budget?

Do URLs I disallowed through robots.txt affect my crawl budget in any way?
