To benefit from better search engine treatment, some large sites are appending SEO-friendly “slugs” (strings of keywords, usually separated by hyphens) to their URLs. They are doing so in a way that confuses bots and flouts convention.
Search for H.L Mencken Chrestomathy on Google and the first result is a link to the Amazon.com page where you can buy A Mencken Chrestomathy. The URL is http://www.amazon.com/Mencken-Chrestomathy-H-L/dp/0394752090. This is all well and good, but look what happens if you modify the first portion of that URL’s path: http://www.amazon.com/inelegant.org/dp/0394752090 still shows the same page. Amazon.com is effectively ignoring the first path segment and serving the content identified by the last, i.e. 0394752090.
Another offender is the metafilter.com network of sites. For instance, compare http://ask.metafilter.com/54166/Reading-material-on-English-language-origins with http://ask.metafilter.com/54166/inelegant.org.
These URLs all return a status code of 200 OK, so as far as any user agent is concerned they refer to different resources.
Problems
- Most search engines will penalise such a site for “duplicate content” which can adversely affect its ranking.
- Caches will be less effective because they have no way of knowing that the two URLs are equivalent.
- Malicious third parties can poison the site’s search engine standing by linking to it with unflattering text in place of the original slug, causing the page to be returned for the malicious phrase.
- It breaks the web. Web clients expect there to be a one-to-one mapping between resources and their URLs.
Bizarrely, Amazon.com must have recognised this problem on some level, because they do handle it sanely in some cases. Specifically, they seem to sniff the user agent and issue a 301 “Moved Permanently” redirect if they suspect that it’s a bot. Hit http://www.amazon.com/inelegant.org/dp/0394752090 with a user agent of “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”, and it redirects to http://www.amazon.com/Mencken-Chrestomathy-H-L/dp/0394752090. Hit it with a user agent of “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20070601 Firefox/2.0.0.4 (Swiftfox)” (the user agent my browser reports itself as) and it doesn’t redirect. Further, if you send a user agent unknown to Amazon.com, such as “inelegant.org bot”, you won’t be redirected either. The upshot of this… “strategy” is that Amazon.com is broken for unknown bots and, seemingly by design, even popular web browsers.
Metafilter doesn’t even pay attention to user agents. Hit http://ask.metafilter.com/54166/inelegant.org with a Googlebot user agent, and you’ll get a 200 OK response.
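Checks like these could be reproduced with a short Python sketch along the following lines. It sends a GET request with a chosen User-Agent header and reports the status code and any Location header, without following redirects. The function name fetch_status is my own illustrative choice, and the responses the sites give today may well differ from those described here.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, user_agent, timeout=10):
    """Return (status_code, location_header) for url, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)

    # Rebuild the request target from the path and optional query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    try:
        # http.client never follows redirects itself, so a 301 is reported as a 301.
        conn.request("GET", target, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_status once per user agent string on the same URL makes it easy to see whether a site varies its response by client: a pair such as (301, canonical URL) for one agent and (200, None) for another is the behaviour described above.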
Solution
If a site has multiple URLs for the same content, it should pick one canonical URL for that content and redirect all the other URLs to it with a 301 redirect.
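As a rough sketch of that rule for a site that keeps the numeric ID in the URL, the handler below looks the item up by ID, serves it only at its canonical path, and redirects every other slug. The CANONICAL_SLUGS table and the respond_with_id function are hypothetical stand-ins for a real database and web framework.

```python
# Hypothetical lookup table standing in for the site's database: ID -> canonical slug.
CANONICAL_SLUGS = {
    "54166": "Reading-material-on-English-language-origins",
}


def respond_with_id(path):
    """Return (status_code, redirect_location) for a path of the form /<id>/<slug>."""
    segments = path.strip("/").split("/")
    item_id = segments[0]

    slug = CANONICAL_SLUGS.get(item_id)
    if slug is None:
        return 404, None  # unknown ID: no such resource

    canonical_path = "/" + item_id + "/" + slug
    if path == canonical_path:
        return 200, None  # the one URL that actually serves the content

    # Any other slug (or none at all) is sent to the canonical URL.
    return 301, canonical_path
```

With this in place, respond_with_id("/54166/inelegant.org") yields a 301 to the canonical path rather than a second copy of the page, regardless of which user agent asks.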
Ideally, these sites would remove the numeric ID from the URL entirely. http://www.amazon.com/Mencken-Chrestomathy-H-L/dp/0394752090 could become http://www.amazon.com/h-l-mencken-chrestomathy (or even http://www.amazon.com/h-l-mencken/chrestomathy, if they wanted to support hierarchical navigation). http://ask.metafilter.com/54166/Reading-material-on-English-language-origins could become http://ask.metafilter.com/reading-material-on-english-language-origins.
The general solution to this problem is:
- For each piece of content in the database, store a unique slug. This will likely take the form of normalizing the title/name of each item, lowercasing it (useful URLs are lowercase), and replacing spaces with hyphens (useful URLs prefer hyphens to separate words).
- Canonical URLs containing slugs should use the slug as the unique key for the database look-up, then return the content with a status code of 200. (If the slug does not match any content, a 404 should be returned.)
- Set up 301 redirects from the URLs containing the numeric IDs to those containing the slugs. For example, http://ask.metafilter.com/54166/ -> http://ask.metafilter.com/reading-material-on-english-language-origins.
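The three steps above might be sketched in Python as follows. Everything here is illustrative: the in-memory dictionaries stand in for a database, the names slugify, add_post and respond_by_slug are my own, and the post title is a guess reconstructed from the existing slug.

```python
import re
import unicodedata

POSTS_BY_SLUG = {}  # slug -> content (stand-in for a database table keyed by slug)
SLUG_BY_ID = {}     # legacy numeric ID -> slug, used only for redirects


def slugify(title):
    """Normalize a title: strip accents, lowercase, hyphen-separate the words."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def add_post(post_id, title, body):
    """Store a post under a unique slug and remember its legacy ID."""
    slug = slugify(title)
    if slug in POSTS_BY_SLUG:
        raise ValueError("slug already in use: " + slug)
    POSTS_BY_SLUG[slug] = body
    SLUG_BY_ID[post_id] = slug
    return slug


def respond_by_slug(path):
    """Return (status_code, redirect_location) for a request path."""
    key = path.strip("/")
    if key in POSTS_BY_SLUG:
        return 200, None  # canonical URL: the slug is the look-up key

    legacy_id = key.split("/")[0]
    if legacy_id in SLUG_BY_ID:
        return 301, "/" + SLUG_BY_ID[legacy_id]  # old ID-based URL

    return 404, None  # invalid slug


add_post("54166", "Reading material on English language origins", "...")
```

One design point this sketch glosses over: a title consisting only of digits would produce a slug that could be mistaken for a legacy ID, so a real implementation would need a rule for resolving that collision, along with a policy for titles that normalize to the same slug.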
(For more information on URL design, see my ebook Useful URLs).