Optimize your crawling and indexing
Monday, August 10, 2009
Many questions about website architecture, crawling and indexing, and even ranking can be
boiled down to one central question:
How easy is it for search engines to crawl your site?
We've spoken on this topic at a number of recent events, and below you'll find our presentation
and some key takeaways.
The Internet is a big place;
new content is being created all the time. Google has a finite number of resources, so when faced
with the nearly-infinite quantity of content that's available online, Googlebot is only able to
find and crawl a percentage of that content. Then, of the content we've crawled, we're only able
to index a portion.
URLs are like the bridges between your website and a search engine's crawler: crawlers need to be
able to find and cross those bridges (that is, find and crawl your URLs) in order to get to your
site's content. If your URLs are complicated or redundant, crawlers are going to spend time
tracing and retracing their steps; if your URLs are organized and lead directly to distinct
content, crawlers can spend their time accessing your content rather than crawling through empty
pages, or crawling the same content over and over via different URLs.
In the slides above you can see some examples of what not to do—real-life examples
(though names have been changed to protect the innocent) of homegrown URL hacks and encodings,
parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also
find some recommendations for straightening out that labyrinth of URLs and helping crawlers find
more of your content faster, including:
Remove user-specific details from URLs. URL parameters that don't change the
content of the page—like session IDs or sort order—can be removed from the URL and put into a
cookie. By putting this information in a cookie and
301 redirecting
to a "clean" URL, you retain the information and reduce the number of URLs pointing to that same
content.
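As a rough sketch of that pattern (assuming a Flask app and a hypothetical sessionid parameter, neither of which comes from the original post), the handler below moves the parameter's value into a cookie and issues a 301 redirect to the same URL without it:

```python
# A sketch, not the original post's code: assumes Flask and a hypothetical
# "sessionid" query parameter that doesn't change the page's content.
from urllib.parse import urlencode

from flask import Flask, redirect, request

app = Flask(__name__)

@app.route("/products")
def products():
    session_id = request.args.get("sessionid")
    if session_id:
        # Rebuild the same URL without the session parameter.
        kept = {k: v for k, v in request.args.items() if k != "sessionid"}
        clean_url = request.path + ("?" + urlencode(kept) if kept else "")
        # Permanently redirect to the clean URL and keep the session in a cookie.
        resp = redirect(clean_url, code=301)
        resp.set_cookie("sessionid", session_id)
        return resp
    return "product listing"
```

However it's implemented, the goal is the same: user-specific state lives in a cookie, and every visitor and crawler ends up on one clean URL for the same content.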
Rein in infinite spaces. Do you have a calendar that links to an infinite
number of past or future dates (each with their own unique URL)? Do you have paginated data
that returns a
status code of 200
when you add &page=3563 to the URL, even if there aren't that many pages of
data? If so, you have an
infinite crawl space on your
website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all.
Consider
these tips
for reining in infinite spaces.
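One possible way to close off that kind of space, again sketched with Flask and placeholder data rather than anything from the original post: work out the last real page and return a 404 for anything beyond it, instead of a 200 with an empty result.

```python
# Sketch only: reject out-of-range page numbers instead of returning a 200
# with no data. The item list and page size are placeholders.
from flask import Flask, abort, request

app = Flask(__name__)

ITEMS = [f"item {i}" for i in range(1, 101)]  # 100 items -> 10 real pages
PAGE_SIZE = 10

@app.route("/archive")
def archive():
    page = request.args.get("page", default=1, type=int)
    last_page = (len(ITEMS) + PAGE_SIZE - 1) // PAGE_SIZE
    if page < 1 or page > last_page:
        abort(404)  # for example, ?page=3563 now gets a 404, not a 200
    start = (page - 1) * PAGE_SIZE
    return {"page": page, "items": ITEMS[start:start + PAGE_SIZE]}
```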
Disallow actions Googlebot can't perform. Using your
robots.txt file, you can disallow crawling of
login pages, contact forms, shopping carts, and other pages whose sole functionality is
something that a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't
usually "Add to cart" or "Contact us.") This lets crawlers spend more of their time crawling
content that they can actually do something with.
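For illustration, the hypothetical robots.txt rules below block cart, login, and contact-form URLs (the paths and URLs are placeholders, not from the original post); the snippet uses Python's standard urllib.robotparser to confirm the rules behave as intended.

```python
# Hypothetical robots.txt rules, checked with Python's standard library parser.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /cart/
Disallow: /login
Disallow: /contact-form
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/cart/add?item=42"))   # False
print(rp.can_fetch("Googlebot", "https://example.com/products/widget-1"))  # True
```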
One man, one vote. One URL, one set of content. In an ideal world, there's a
one-to-one pairing between URL and content: each URL leads to a unique piece of content, and
each piece of content can only be accessed via one URL. The closer you can get to this ideal,
the more streamlined your site will be for crawling and indexing. If your CMS or current site
setup makes this difficult, you can
use the rel="canonical" element
to indicate the preferred URL for a particular piece of content.
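As a small illustration (the URLs are placeholders), every duplicate variant of a page can declare its preferred URL with a link element in its head; the helper below simply builds that tag:

```python
# Sketch: build the rel="canonical" link element for a page's preferred URL.
# The URLs here are placeholders.
from html import escape

def canonical_link(preferred_url: str) -> str:
    """Return the <link> tag to place in the <head> of every duplicate variant."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'

# Emitted from /shoes, /shoes?sort=price, /shoes?sessionid=abc123, and so on:
print(canonical_link("https://www.example.com/shoes"))
# -> <link rel="canonical" href="https://www.example.com/shoes">
```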
If you have further questions about optimizing your site for crawling and indexing, check out some
of our previous writing on the subject, or stop by
our
Help Forum.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eGooglebot has limited resources and can only crawl and index a portion of the web's content, so site architecture is crucial for efficient crawling.\u003c/p\u003e\n"],["\u003cp\u003eWell-structured URLs help search engines easily access and understand website content, while disorganized URLs waste crawl resources.\u003c/p\u003e\n"],["\u003cp\u003eRemoving unnecessary URL parameters, managing infinite crawl spaces, and disallowing irrelevant actions for Googlebot improves crawl efficiency.\u003c/p\u003e\n"],["\u003cp\u003eEnsure each unique piece of content has one corresponding URL, using canonicalization if needed, to optimize crawling and indexing.\u003c/p\u003e\n"],["\u003cp\u003eOptimizing your website's crawlability allows Googlebot to discover and index valuable content more effectively.\u003c/p\u003e\n"]]],["Search engine crawlers navigate websites via URLs; simplifying these URLs is crucial for efficient crawling. Key actions include removing irrelevant URL parameters, managing infinite crawl spaces like calendars or excessive pagination, and disallowing non-functional pages (e.g., login pages) in `robots.txt`. Ideally, each URL should lead to unique content. Using cookies for session data, employing `301` redirects for cleaner URLs, and the `rel=\"canonical\"` tag can streamline crawling and indexing processes.\n"],null,["# Optimize your crawling and indexing\n\nMonday, August 10, 2009\n| It's been a while since we published this blog post. Some of the information may be outdated (for example, some images may be missing, and some links may not work anymore). For current information, check out our [Advanced guide to how Search works](/search/docs/fundamentals/how-search-works).\n\n\nMany questions about website architecture, crawling and indexing, and even ranking issues can be\nboiled down to one central issue:\n**How easy is it for search engines to crawl your site?**\nWe've spoken on this topic at a number of recent events, and below you'll find our presentation\nand some key takeaways on this topic.\n\n\n[The Internet is a *big*place](https://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html);\nnew content is being created all the time. Google has a finite number of resources, so when faced\nwith the nearly-infinite quantity of content that's available online, Googlebot is only able to\nfind and crawl a percentage of that content. Then, of the content we've crawled, we're only able\nto index a portion.\n\n\nURLs are like the bridges between your website and a search engine's crawler: crawlers need to be\nable to find and cross those bridges (that is, find and crawl your URLs) in order to get to your\nsite's content. 
If your URLs are complicated or redundant, crawlers are going to spend time\ntracing and retracing their steps; if your URLs are organized and lead directly to distinct\ncontent, crawlers can spend their time accessing your content rather than crawling through empty\npages, or crawling the same content over and over via different URLs.\n\n\nIn the slides above you can see some examples of what *not* to do---real-life examples\n(though names have been changed to protect the innocent) of homegrown URL hacks and encodings,\nparameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also\nfind some recommendations for straightening out that labyrinth of URLs and helping crawlers find\nmore of your content faster, including:\n\n- **Remove user-specific details from URLs.** URL parameters that don't change the content of the page---like session IDs or sort order---can be removed from the URL and put into a cookie. By putting this information in a cookie and [`301` redirecting](/search/docs/crawling-indexing/301-redirects) to a \"clean\" URL, you retain the information and reduce the number of URLs pointing to that same content.\n- **Rein in infinite spaces.** Do you have a calendar that links to an infinite number of past or future dates (each with their own unique URL)? Do you have paginated data that returns a [status code of `200`](/search/docs/crawling-indexing/http-network-errors) when you add `&page=3563` to the URL, even if there aren't that many pages of data? If so, you have an [infinite crawl space](/search/blog/2008/08/to-infinity-and-beyond-no) on your website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all. Consider [these tips](https://www.google.com/support/webmasters/bin/answer.py?answer=76401) for reining in infinite spaces.\n- **Disallow actions Googlebot can't perform.** Using your [robots.txt file](/search/docs/crawling-indexing/robots/intro), you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually \"Add to cart\" or \"Contact us.\") This lets crawlers spend more of their time crawling content that they can actually do something with.\n- **One man, one vote. One URL, one set of content.** In an ideal world, there's a one-to-one pairing between URL and content: each URL leads to a unique piece of content, and each piece of content can only be accessed via one URL. The closer you can get to this ideal, the more streamlined your site will be for crawling and indexing. If your CMS or current site setup makes this difficult, you can [use the `rel=\"canonical\"` element](/search/docs/crawling-indexing/consolidate-duplicate-urls) to indicate the preferred URL for a particular piece of content.\n\n\nIf you have further questions about optimizing your site for crawling and indexing, check out some\nof our [previous writing](/search/help/crawling-index-faq) on the subject, or stop by\nour\n[Help Forum](https://support.google.com/webmasters/community).\n\n\nPosted by\n[Susan Moskwa](/search/blog/authors/susan-moskwa),\nWebmaster Trends Analyst"]]