Large site owner's guide to managing your crawl budget: Synopsis with Comments
Synopsis with comments: Google’s Large site owner's guide to managing your crawl budget
Original document’s web address: https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
Synopsis
Crawl budget management is mostly relevant for large sites (1 million+ pages).
Crawling does not guarantee indexing: after it has been crawled, each page must still be evaluated, consolidated, and assessed to determine whether it will be indexed.
Crawl budget is determined by two main elements:
crawl capacity limit and
crawl demand.
The crawl capacity limit is based on two factors:
Crawl health: a function of the site's server response time and server errors. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
Google's crawling limits: a function of the crawling capacity that Google itself has available.
Crawl demand has the following factors:
Perceived inventory: the factor that site owners can influence the most. Without guidance from the site, Googlebot wastes time crawling duplicates and URLs that shouldn't be crawled for some other reason (removed, unimportant, and so on).
Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in Google's index.
Staleness: Google wants to recrawl documents frequently enough to pick up any changes.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl.
Crawl budget allocation is prioritised based on popularity, user value, uniqueness, and serving capacity.
Managing the URL inventory is seen as one of the most efficient ways to maximise the available crawl capacity.
Best practices for managing the URL inventory include the following steps:
Consolidate duplicate content by specifying canonical versions of similar or duplicate pages. Eliminating duplicate content focuses crawling on unique content rather than unique URLs.
Block crawling of URLs using robots.txt.
Include the <lastmod> tag in your sitemaps for updated content (a sketch follows this list).
Make your pages efficient to load. If Google can load and render your pages faster, it might be able to read more content from your site.
Monitor your site crawling. Monitor whether your site had any availability issues during crawling, and look for ways to make your crawling more efficient.
Avoid including URLs in your sitemaps that you don't want to appear in Search; this can waste crawl budget on pages that you don't want indexed.
Don't use noindex for this purpose: Google will still request the page, then drop it when it sees a noindex meta tag or header in the HTTP response, wasting crawling time.
Use robots.txt to block pages or resources that you don't want Google to crawl at all; don't use it as a way to temporarily reallocate crawl budget to other pages.
Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
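To make the <lastmod> advice above concrete, here is a minimal sketch (not code from Google's guide) that emits one sitemap entry whose <loc> is the canonical URL and whose <lastmod> records the last content change; the URL and date are hypothetical placeholders.

```python
# Minimal sketch: one sitemap <url> entry with a canonical <loc> and a
# W3C-formatted <lastmod>. The URL and date below are hypothetical.
from datetime import date
from xml.sax.saxutils import escape

def sitemap_entry(canonical_url: str, last_modified: date) -> str:
    """Return a <url> element listing the canonical URL and its last change."""
    return (
        "  <url>\n"
        f"    <loc>{escape(canonical_url)}</loc>\n"
        f"    <lastmod>{last_modified.isoformat()}</lastmod>\n"
        "  </url>"
    )

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + sitemap_entry("https://example.com/products/blue-widget", date(2024, 1, 15))
    + "\n</urlset>"
)
print(sitemap)
```

Listing only canonical URLs in the sitemap also supports the consolidation advice above, since duplicate URLs never enter the crawl queue through the sitemap.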
Other ways to improve your site's crawling efficiency:
Specify content changes with HTTP status codes
You can send a 304 (Not Modified) HTTP status code and no response body for any Googlebot request if the content hasn't changed since Googlebot last visited the URL. This will save your server processing time and resources, which may indirectly improve crawl efficiency.
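As an illustration, here is a minimal sketch built on Python's standard-library HTTP server (not code from Google's guide); the page body and last-changed date are hypothetical. It answers a conditional request with 304 and no body when the If-Modified-Since date is not older than the content's last change.

```python
# Minimal sketch: return 304 Not Modified with no body when the content has
# not changed since the crawler's last visit. Date and body are hypothetical.
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

LAST_CHANGED = datetime(2024, 1, 15, tzinfo=timezone.utc)  # last content change
BODY = b"<html><body>Example page</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        since = self.headers.get("If-Modified-Since")
        if since:
            try:
                if parsedate_to_datetime(since) >= LAST_CHANGED:
                    # Unchanged since the client's copy: headers only, no body.
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # Unparseable or naive date: fall through to a full 200.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Last-Modified",
                         formatdate(LAST_CHANGED.timestamp(), usegmt=True))
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

In a production stack the web server or framework would usually handle conditional requests; the point of the sketch is only that a 304 response skips regenerating and resending the body.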
Hide URLs that you don't want in search results
Faceted navigation and session identifiers: Faceted navigation is typically duplicate content from the site; session identifiers and other URL parameters that simply sort or filter the page don't provide new content. Use robots.txt to block faceted navigation pages.
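A minimal sketch of this advice, assuming path-based facet and internal-search URLs (the paths and rules are hypothetical, not taken from the guide): a robots.txt snippet that disallows those paths, checked locally with Python's urllib.robotparser. Note that urllib.robotparser matches plain path prefixes and does not implement the * wildcard syntax that Google's own robots.txt parser supports.

```python
# Minimal sketch: block crawl-wasting facet and internal-search paths in
# robots.txt, then sanity-check the rules locally. Paths are hypothetical.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /products/filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in (
    "https://example.com/products/blue-widget",       # unique content: crawlable
    "https://example.com/products/filter/color-red",  # faceted variant: blocked
    "https://example.com/search?q=widgets",            # internal search: blocked
):
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)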
Questions and Answers
Question: The closer your content is to the home page, the more important it is to Google.
Answer: Your site's home page is often the most important page on your site, and so pages linked directly to the home page may be seen as more important, and therefore crawled more often. However, this doesn't mean that these pages will be ranked more highly than other pages on your site.
Question: Alternate URLs and embedded content count in the crawl budget.
Answer: Generally, any URL that Googlebot crawls will count towards a site's crawl budget. Alternate URLs, like AMP or hreflang, as well as embedded content, such as CSS and JavaScript, including XHR fetches, may have to be crawled and will consume a site's crawl budget.
Question: The nofollow rule affects crawl budget.
Answer: Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow, it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.
Question: I can use noindex to control crawl budget.
Answer: Any URL that is crawled affects crawl budget, and Google has to crawl the page in order to find the noindex rule.
However, noindex is there to help you keep things out of the index. If you want to ensure that those pages don't end up in Google's index, continue using noindex and don't worry about the crawl budget. It's also important to note that if you remove URLs from Google's index with noindex or otherwise, Googlebot can focus on other URLs on your site, which means noindex can indirectly free up some crawl budget for your site in the long run.
Question: URL versioning is a good way to encourage Google to recrawl my pages.
Answer: Using a versioned URL for your page in order to entice Google to crawl it again sooner will probably work, but often this is not necessary, and will waste crawl resources if the page is not actually changed. If you do use versioned URLs to indicate new content, we recommend that you only change the URL when the page content has changed meaningfully.
Comments
Google determines the crawl rate based on crawl demand more than on crawl capacity, so improving your site's availability won't necessarily increase your crawl budget.
Google uses the term "popularity" for URLs that it wants to crawl more often than others; this suggests that clicks and impressions predict crawl frequency. Another predictive factor is URL depth: pages linked directly from the home page may be seen as more important, and therefore crawled more often.
Google emphasises that crawling does not equal indexing, yet one of the biggest factors of crawl demand, popularity, itself depends on the page already being indexed and served.
Google prefers to crawl high-quality content, so it may be inferred that the pages crawled most frequently are the ones Google considers the most valuable.
Indirect ways to improve crawl efficiency include:
Using the <lastmod> tag
Using noindex in an HTML meta tag or HTTP header (though robots.txt is recommended instead when the goal is to stop crawling), and so on.