Optimize your crawling and indexing

Monday, August 10, 2009

Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue: How easy is it for search engines to crawl your site? We've spoken on this topic at a number of recent events, and below you'll find our presentation and some key takeaways.

The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion.

URLs are like the bridges between your website and a search engine's crawler: crawlers need to be able to find and cross those bridges (that is, find and crawl your URLs) in order to get to your site's content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.
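
To make the "same content via different URLs" problem concrete, here is a minimal sketch in Python. It assumes a hypothetical site where `sessionid` and `sort` are parameters that don't change what the page shows; the parameter names, the `clean_url` helper, and the example.com URLs are illustrative assumptions, not anything taken from a real site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: on this hypothetical site, these parameters never change page content.
NON_CONTENT_PARAMS = {"sessionid", "sort"}


def clean_url(url: str) -> str:
    """Return the URL with non-content query parameters dropped."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Three URLs a crawler might discover for what is really one page.
urls = [
    "https://www.example.com/shoes?color=red&sessionid=abc123",
    "https://www.example.com/shoes?color=red&sort=price",
    "https://www.example.com/shoes?color=red",
]

distinct = {clean_url(u) for u in urls}
print(distinct)  # {'https://www.example.com/shoes?color=red'}
```

A crawler that sees all three original URLs has to fetch each one to learn they're duplicates; once the non-content parameters are gone, there is only one URL to crawl.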

In the slides above you can see some examples of what not to do—real-life examples (though names have been changed to protect the innocent) of homegrown URL hacks and encodings, parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also find some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster, including:

  • Remove user-specific details from URLs. URL parameters that don't change the content of the page—like session IDs or sort order—can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a "clean" URL, you retain the information and reduce the number of URLs pointing to that same content.
  • Rein in infinite spaces. Do you have a calendar that links to an infinite number of past or future dates (each with their own unique URL)? Do you have paginated data that returns a status code of 200 when you add &page=3563 to the URL, even if there aren't that many pages of data? If so, you have an infinite crawl space on your website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all. Consider these tips for reining in infinite spaces.
  • Disallow actions Googlebot can't perform. Using your robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is s