WordPress SEO and Robots.txt
One step in the WordPress SEO process is to properly set up robots.txt so that you avoid indexing duplicate content. I’ve spent the better part of two weeks setting up my blog and ran into some conflicting information on the subject. I decided to test things out, and found some interesting things. I thought I’d share my results.
Essentially, robots.txt is a simple text file placed in the root of your site that asks search engine crawlers to stay out of the parts of your site you list in it. Since duplicate content can be an issue, you want to do your best to keep the engines from indexing sections of your site more than once.
Note that the way your content is linked to can determine how it is indexed and whether or not it is seen as duplicate. Google’s advice on dealing with this issue includes this…
Most of the time when we see this, it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on.
Further down that page, they state…
Be consistent: Endeavor to keep your internal linking consistent; don’t link to /page/ and /page and /page/index.htm.
And that is exactly what many blog/CMS platforms, including WordPress, do. Google also warns about that possibility…
Understand your CMS: Make sure you’re familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
The Googlebot supports some syntax that other bots do not. Interestingly, Google’s webmaster tools starts you out with:
User-Agent: *
Allow: /
As far as I know, other bots don’t directly support the “Allow” directive, although Live will validate this robots.txt, so it appears to be relatively safe. Google also supports wildcards within a path, such as */feed/, which Live does not. As a result, this robots.txt, suggested by a WordPress SEO plugin developer, fails to validate at the Live webmaster tools…
User-agent: *
Disallow: */trackback*
Disallow: /wp-*
Disallow: */feed*
Disallow: /20*
User-Agent: MediaPartners-Google
Allow: /
While we want to focus on getting traffic from Google, we don’t want to do it at the expense of traffic from the other engines if there is another way. In addition, /20* blocks every URL that begins with /20, so you won’t be able to use a post name such as “/200-wordpress-seo-steps/” or date-formatted URLs such as “/2008/04/24/post-name/”.
The above example robots.txt also doesn’t cover some of the possible URL paths that we want to block, so it is incomplete as well. We need a different solution. I’m currently using an SEO blog template from mytypes.com. They offer a sample robots.txt that looks like this…
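To see why /20* is too broad, here is a minimal Python sketch of Google-style pattern matching, assuming only the two documented wildcard rules: * matches any run of characters, and a trailing $ anchors the end of the URL. The function names are my own, and this approximates the matching logic rather than reproducing Googlebot's actual code.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Google-style robots.txt path pattern into a regex.

    Assumed rules: '*' matches any run of characters and a trailing '$'
    anchors the end of the path. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if a Disallow rule with this pattern would match the path."""
    # re.match anchors at the start, which mirrors prefix matching.
    return robots_pattern_to_regex(pattern).match(path) is not None


for path in ("/200-wordpress-seo-steps/", "/2008/04/24/post-name/", "/about/"):
    print(path, "blocked" if is_blocked("/20*", path) else "allowed")
```

Both the “/200-…” post name and the dated URL come back as blocked, while “/about/” is allowed, which is the collateral damage described above.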
User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /comments/feed/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
At first glance, that looks like it will do most of what we want, but a second look shows otherwise. According to my testing, while this will validate at Live webmaster tools, this is still not good enough. Below is a link to a video where I show how this robots.txt doesn’t block everything we want it to block in Google, and how we need to do some editing.
How to test your robots.txt in Google Webmaster Tools
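One likely reason the sample falls short, consistent with Google's documented behavior, is that a crawler obeys only the most specific User-agent group that matches it. Once a Googlebot group exists, Googlebot ignores the rules under User-agent: *. The sketch below illustrates that with Python's standard urllib.robotparser. Two assumptions: “msnbot” is only a stand-in for any non-Google crawler, and Python's parser does not implement Google's wildcards, so it demonstrates group selection only.

```python
from urllib.robotparser import RobotFileParser

# The mytypes.com sample robots.txt quoted above.
SAMPLE = """\
User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /comments/feed/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# Googlebot matches its own group, so the catch-all rules never apply to it.
googlebot_allowed = parser.can_fetch("Googlebot", "/wp-admin/")
# Any other crawler ("msnbot" is an illustrative name) uses the '*' group.
other_allowed = parser.can_fetch("msnbot", "/wp-admin/")

print("Googlebot may fetch /wp-admin/:", googlebot_allowed)  # True
print("msnbot may fetch /wp-admin/:", other_allowed)         # False
```

If that is what is happening, every rule meant for Google has to be repeated inside the Googlebot group.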
Starting with that file, I’ve made some edits to come up with the current version. Here’s the final version of robots.txt that I’m using, which validates at both Google and Live and also correctly blocks or allows the content we want, according to Google’s robots.txt checker…
User-agent: Googlebot
Disallow: /feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-Agent: MediaPartners-Google
Allow: /
We’re allowing the Adsense bot to hit everything so that it can display ads properly on all pages. So, go check your robots.txt file, will ya?!
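If you would rather script a quick check than click through a webmaster console, a rough sketch with Python's standard urllib.robotparser might look like the following. It is not a substitute for Google's own checker: Python's parser ignores Google's * and $ wildcards, so only the plain prefix rules are exercised. The test paths and the “msnbot” agent name are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The final robots.txt from this post.
FINAL = """\
User-agent: Googlebot
Disallow: /feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-Agent: MediaPartners-Google
Allow: /
"""

# Hypothetical paths chosen to hit a few rules plus one normal post URL.
PATHS = ["/wp-admin/", "/feed/", "/tag/seo/", "/2008/04/24/post-name/"]

parser = RobotFileParser()
parser.parse(FINAL.splitlines())


def blocked_paths(agent: str) -> list:
    """Return the test paths that this user agent is not allowed to fetch."""
    return [path for path in PATHS if not parser.can_fetch(agent, path)]


for agent in ("Googlebot", "msnbot", "Mediapartners-Google"):
    print(agent, "is blocked from:", blocked_paths(agent))
```

Googlebot and the catch-all group should report the same blocked paths, the post URL should stay crawlable, and the AdSense bot should be blocked from nothing.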