Friday, May 21, 2010

Preventing/solving duplicate content: canonical link/URL vs 301 redirect

Index

What is duplicate content?
   What is a canonical URL?
   What is a 301 redirect?
Duplicate content scenario I - www versus non-www
   Solutions for "www versus non-www" scenario
Duplicate content scenario II - index.php for home
   Solutions for "index.php for home" scenario
Duplicate content scenario III - dynamic pages
   Solutions for "dynamic pages" scenario
Duplicate content scenario IV - uncontrolled extended URLs
   Solution for "uncontrolled extended URLs" scenario
Conclusions

What is duplicate content?

Duplicate content is content (usually text) that is identical to the content of other web pages. These pages may be on the same domain or on different domains; this article covers only duplicate content that is accidentally created within a single domain.
Regarding duplicate content, Google Webmaster Central mentions that ". . . the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results."

What is a canonical URL?

In search engine lingo, the canonical URL (Uniform Resource Locator, aka website address) of a website page is the preferred location of that page. With the relatively new link element attribute rel="canonical", you can tell the Google, Yahoo and Microsoft search engines which location of a page is the preferred one.

According to Matt Cutts (currently the head of Google’s Webspam team) "this is much like a 301 permanent redirect that only flows within a site. . . the unpreferred URL will disappear from our index, but the anchortext/PageRank from the unpreferred URL will be transferred to the preferred URL".

What is a 301 redirect?

A 301 redirect is the permanent redirection of a web page executed by the hosting server. All traffic intended for the old URL is permanently routed to the new URL, and page ranking value for the old URL is also transferred to the new URL.
Though incomplete, this definition should be sufficient for the purposes of this article.

Duplicate content scenario I - www versus non-www

http://www.domainname.tld/ (tld stands for top-level domain, such as com, au or nl) is the homepage with the intended/expected URL.
http://domainname.tld/ is the same page.

If your website receives links from other websites through a mix of both URLs, the potential benefit of valuable link popularity is split. Link popularity plays an important role when search engines determine the ranking of web pages.

Checking whether your domain is vulnerable to this kind of duplicate content is easy. Just type the undesired URL into the address bar of your browser (with the browser cache disabled, so that a previously cached redirect does not hide the server's real response). If your browser is not redirected to the correct location, there may be a problem.

Aside from the duplicate content problem, this simple check may reveal other problems, including the following:
1) A "404 Not Found" error.
2) A "this domain is reserved" placeholder page.
3) A promotional page from your webmaster.

The 301 redirect solution will solve these other problems as well and works for all "bots", while the canonical link solution will only solve the duplicate content problem and is, at present, restricted to Google, Yahoo and Microsoft search engines.
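
The manual browser check described above can also be scripted. The following is a minimal sketch, not a definitive tool: the function names fetch_status and diagnose are made up for this illustration, and it assumes the server answers HEAD requests the same way it answers GET requests, which is not guaranteed.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Request url once, without following redirects.

    Returns (HTTP status code, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Assumption: HEAD is answered like GET on this server.
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def diagnose(status, location, preferred_url):
    """Classify the server's response to a request for the undesired URL."""
    if status == 301 and location == preferred_url:
        return "ok: permanently redirected to the preferred URL"
    if status in (301, 302, 303, 307, 308):
        return "check: redirected, but not with a 301 to the preferred URL"
    if status == 200:
        return "possible duplicate: the undesired URL serves a page itself"
    return "other problem: HTTP status %d" % status
```

For example, calling fetch_status("http://domainname.tld/") and passing the result to diagnose together with "http://www.domainname.tld/" reports whether the non-www URL is permanently redirected. Because the script never caches anything, the browser cache caveat does not apply to it.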

Solutions for "www versus non-www" scenario

301 redirect

The hosting server needs to be instructed to "301 redirect" all traffic to the desired domain.
The exact method depends on the hosting server and web development techniques.
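Here, I will explain the technique favored by Wasseo (for Apache servers with mod_rewrite enabled) to redirect all requests to the www domain.
Create or edit the .htaccess file in your website’s root web folder to include the following:
# Enable the rewrite engine for this folder
RewriteEngine On
RewriteBase /
# Match any host name other than the preferred one (case-insensitive)
RewriteCond %{HTTP_HOST} !^www\.domainname\.tld$ [NC]
# Permanently redirect to the same path on the preferred host
RewriteRule ^(.*)$ http://www.domainname.tld/$1 [R=301,L]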
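
To make the logic of this rule easier to follow, here is a small Python sketch of the decision it makes for each request. It is only an illustration of the idea, not part of the server configuration; the function name redirect_target is hypothetical. Note that the host pattern in the sketch escapes its dots so that they match literal dots only.

```python
import re

PREFERRED_HOST = "www.domainname.tld"

# Host pattern with escaped dots, matched case-insensitively (like the NC flag).
PREFERRED_HOST_RE = re.compile(r"^www\.domainname\.tld$", re.IGNORECASE)


def redirect_target(host, path):
    """Return the URL to 301-redirect to, or None if the host is already preferred."""
    if PREFERRED_HOST_RE.match(host):
        return None
    # Keep the requested path, as $1 does in the RewriteRule.
    return "http://%s/%s" % (PREFERRED_HOST, path.lstrip("/"))


if __name__ == "__main__":
    for host in ("domainname.tld", "WWW.DomainName.tld"):
        print(host, "->", redirect_target(host, "/about.php"))
```

An unescaped dot in a regular expression matches any character, so a pattern without the backslashes would also treat a host such as wwwxdomainname.tld as the preferred host.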

canonical link

Specify the following in the head section of the source code of your homepage:
<link rel="canonical" href="http://www.domainname.tld/" />
This tells search engines that the preferred location of the website page (the canonical URL, in search engine speak) is
http://www.domainname.tld/
instead of
http://domainname.tld/.

Duplicate content scenario II - index.php for home

http://www.domainname.tld/ is the homepage with the intended/expected URL.
http://www.domainname.tld/index.php is the same page. This URL often ends up in the search index because it is used somewhere on the website as an internal link.
Again, the potential benefit of link popularity is split between the two URLs.
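
Since internal links are a common source of this problem, it can help to scan your own pages for links that point to index.php. The following sketch uses only the Python standard library; the class and function names are invented for this example, and it assumes the homepage script is /index.php in the root web folder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class IndexLinkFinder(HTMLParser):
    """Collect href values of <a> tags that resolve to /index.php on the same host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlsplit(page_url).netloc.lower()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page being scanned.
        target = urlsplit(urljoin(self.page_url, href))
        if target.netloc.lower() == self.host and target.path == "/index.php":
            self.found.append(href)


def find_index_links(html, page_url):
    """Return the href values in html that point to the site's own /index.php."""
    finder = IndexLinkFinder(page_url)
    finder.feed(html)
    finder.close()
    return finder.found
```

Each link that this reports can then be changed to point to http://www.domainname.tld/ directly.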

Solutions for "index.php for home" scenario

301 redirect

The hosting server needs to be instructed to "301 redirect"
http://www.domainname.tld/index.php
to
http://www.domainname.tld/
The exact method depends on the hosting server and web development techniques.
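The technique favored by Wasseo (for Apache servers):
Create or edit the .htaccess file in your website’s root web folder to include the following (after the RewriteEngine On line). The RewriteCond limits the rule to requests in which the visitor explicitly asked for /index.php; without it, the server's own internal mapping of / to index.php can trigger the rule again and cause a redirect loop:
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.php[?\s]
RewriteRule ^index\.php$ http://www.domainname.tld/ [R=301,L]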

canonical link

Specify the following in the head section of the source code of your homepage:
<link rel="canonical" href="http://www.domainname.tld/" />
This tells search engines that the preferred location is
http://www.domainname.tld/
instead of
http://www.domainname.tld/index.php.
Note that this is the same method as used for scenario I.

Duplicate content scenario III - dynamic pages

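Dynamic website pages that are generated through scripts may be reachable through additional URLs. For example, a product page whose preferred URL is
http://www.domainname.tld/product001.php
may also be displayed at:
http://www.domainname.tld/categoryA/product001.php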

Solutions for "dynamic pages" scenario

Since such URLs are not always easy to predict (consider URLs that contain a session ID), 301 redirection may not be a reasonable option.
While these URLs can still be blocked from being crawled by the search engines via the robots.txt file, Google Webmaster Central now recommends implementing the canonical link element.

canonical link

Specify the following in the head section of the source code of your product001 page:
<link rel="canonical" href="http://www.domainname.tld/product001.php" />
This tells search engines that the preferred location of the website page (the canonical URL, in search engine speak) is
http://www.domainname.tld/product001.php
instead of
http://www.domainname.tld/categoryA/product001.php.
Note that this is the same method as used for scenarios I and II.

Duplicate content scenario IV - uncontrolled extended URLs


Depending on server configuration and web development techniques, it may be possible for other parties to create and link to undesired URLs simply by extending a URL with some parameters, e.g. ?param=value.

Solution for "uncontrolled extended URLs" scenario

Since it is impossible to predict such URLs, 301 redirection is not an option.
While these URLs can still be blocked from being crawled by the search engines via the robots.txt file, Google Webmaster Central now recommends implementing the canonical link element.

canonical link

Specify the following in the head section of the source code of your page:
<link rel="canonical" href="http://www.domainname.tld/page.php" />
This tells search engines that the preferred location of the website page (the canonical URL, in search engine speak) is
http://www.domainname.tld/page.php
instead of
http://www.domainname.tld/page.php?param=value.
Note that this is the same method as used for scenarios I, II and III.
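
If your pages are generated by a script, the canonical link element can be generated as well, instead of being typed into every page by hand. The sketch below shows the idea in Python; the function names are hypothetical, and it rests on one important assumption: no query parameter on your site selects different content. If a parameter does (for example a product ID), it must be kept in the canonical URL, and this sketch does not apply as written.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(requested_url):
    """Drop the query string and fragment from requested_url.

    Assumption: parameters never change which content is shown.
    """
    parts = urlsplit(requested_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", "", ""))


def canonical_link_tag(requested_url):
    """Build the link element to place in the head section of the page."""
    href = escape(canonical_url(requested_url), quote=True)
    return '<link rel="canonical" href="%s" />' % href


if __name__ == "__main__":
    print(canonical_link_tag("http://www.domainname.tld/page.php?param=value"))
```

Whatever parameters another party appends to the URL, the page then always declares the same preferred location.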

Conclusions

canonical link

  1. easy to implement
  2. solves multiple duplicate content issues at once, especially ones you cannot predict or control
  3. is supported by some (leading) search engines only

301 redirect

  1. more complicated to implement than canonical link
  2. only solves the specific duplicate content URL for which it is implemented
  3. works for all web bots, including search engine bots
  4. may prevent/solve other issues, like undesired website pages created by webmasters or hosting companies

