Wednesday, June 13, 2007
Last week, I participated in the duplicate content summit at SMX Advanced. I couldn't resist the opportunity to show how Buffy is applicable to the everday Search marketing world, but mostly I was there to get input from you on the duplicate content issues you face and to brainstorm how search engines can help.
A few months ago, Adam wrote a great post on dealing with duplicate content. The most important things to know about duplicate content are:
- Google wants to serve up unique results and does a great job of picking a version of your content to show if your sites includes duplication. If you don't want to worry about sorting through duplication on your site, you can let us worry about it instead.
- Duplicate content doesn't cause your site to be penalized. If duplicate pages are detected, one version will be returned in the search results to ensure variety for searchers.
- Duplicate content doesn't cause your site to be placed in the supplemental index. Duplication may indirectly influence this however, if links to your pages are split among the various versions, causing lower per-page PageRank.
At the summit at SMX Advanced, we asked what duplicate content issues were most worrisome. Those in the audience were concerned about scraper sites, syndication, and internal duplication. We discussed lots of potential solutions to these issues and we'll definitely consider these options along with others as we continue to evolve our toolset. Here's the list of some of the potential solutions we discussed so that those of you who couldn't attend can get in on the conversation.
Specifying the preferred version of a URL in the site's Sitemap file
One thing we discussed was the possibility of specifying the preferred version of a URL in a Sitemap file, with the suggestion that if we encountered multiple URLs that point to the same content, we could consolidate links to that page and could index the preferred version.
Providing a method for indicating parameters that should be stripped from a URL during indexing
We discussed providing this in either an interface such as Webmaster Tools on in the site's robots.txt file. For instance, if a URL contains sessions IDs, the webmaster could indicate the variable for the session ID, which would help search engines index the clean version of the URL and consolidate links to it. The audience leaned towards an addition in robots.txt for this.
Providing a way to authenticate ownership of content
This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now and not many people