WordPress SEO and Robots.txt
One step in the WordPress SEO process is to properly set up robots.txt so that you avoid indexing duplicate content. I’ve spent the better part of two weeks setting up my blog and ran into some conflicting information on the subject. I decided to test things out, and found some interesting things. I thought I’d share my results.
You can read up on robots.txt and what it does here, but essentially, it’s a simple text file placed in the root of your site that prevents parts of your site from being crawled by the engines. Since duplicate content can be an issue, you want to do your best to keep the engines from indexing sections of your site more than once.
Note that the way your content is linked to can determine how it is indexed and whether or not it is seen as duplicate. Google’s advice on dealing with this issue includes this…
Most of the time when we see this, it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on.
Further down that page, they state…
Be consistent: Endeavor to keep your internal linking consistent; don’t link to /page/ and /page and /page/index.htm.
And that is exactly what many blog/CMS platforms, including WordPress, do. Google also warns about that possibility…
Understand your CMS: Make sure you’re familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
The Googlebot supports some syntax that other bots do not. Interestingly, Google’s webmaster tools starts you out with:
User-Agent: *
Allow: /
As far as I know, other bots don’t directly support the “Allow” command, although Live will validate this robots.txt, so it appears to be relatively safe. Google also supports wildcards within a string, such as */feed/, which Live does not. So, that makes this robots.txt, as suggested by a WordPress SEO plugin developer, fail to validate at the Live webmaster tools…
User-agent: *
Disallow: */trackback*
Disallow: /wp-*
Disallow: */feed*
Disallow: /20*
User-Agent: MediaPartners-Google
Allow: /
While we want to focus on getting traffic from Google, we don’t want to do it at the expense of traffic from the other engines if there is another way. In addition, if you use /20*, then you won’t be able to use a post name such as “/200-wordpress-seo-steps/” or date-formatted archive URLs such as “/2008/04/24/post-name/”, because both start with /20 and would be blocked.
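To see why /20* is too greedy, here is a minimal Python sketch of wildcard matching. It assumes Google-style semantics (rules match as prefixes, * matches any run of characters, and a trailing $ anchors the end of the path). The helper names rule_to_regex and is_blocked are made up for this illustration, and this is not how any engine is actually implemented.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a regex (assumed semantics:
    prefix match, '*' = any run of characters, trailing '$' = end of path)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule, path):
    # re.match anchors only at the start, which gives prefix matching.
    return rule_to_regex(rule).match(path) is not None

# Hypothetical paths: a post slug, a date archive, and an unrelated page.
for path in ["/200-wordpress-seo-steps/", "/2008/04/24/post-name/", "/about/"]:
    print(path, "blocked" if is_blocked("/20*", path) else "allowed")
```

Under those assumptions, both the post slug and the date archive are caught by /20*, while /about/ is not.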
The plugin developer’s robots.txt also misses some of the URL paths that we want to block, so it is incomplete as well. We need a different solution. I’m currently using an SEO blog template from mytypes.com. They offer a sample robots.txt that looks like this…
User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /comments/feed/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
At first glance, that looks like it will do most of what we want, but a second look shows otherwise. According to my testing, while this will validate at Live webmaster tools, this is still not good enough. Below is a link to a video where I show how this robots.txt doesn’t block everything we want it to block in Google, and how we need to do some editing.
How to test your robots.txt in Google Webmaster Tools
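One likely reason the sample falls short for Google is an assumption on my part, based on Google’s documented behavior rather than anything specific to this template: a crawler follows only the most specific User-agent group that names it and ignores the rest. The Python sketch below parses the sample into groups and shows which rules each bot would end up with. The function names are hypothetical, and matching agents by exact name is a simplification.

```python
SAMPLE = """\
User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /comments/feed/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
"""

def parse_groups(text):
    """Map each lowercase user-agent name to its list of Disallow paths."""
    groups = {}
    current = []           # agents named by the group currently being read
    reading_rules = False  # True once the current group has at least one rule
    for line in text.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                current, reading_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            reading_rules = True
            for agent in current:
                groups[agent].append(value)
    return groups

def rules_for(groups, agent):
    """Use the group that names this agent; fall back to '*' only if none does."""
    return groups.get(agent.lower(), groups.get("*", []))

groups = parse_groups(SAMPLE)
print("Googlebot:", rules_for(groups, "Googlebot"))
print("msnbot:   ", rules_for(groups, "msnbot"))
```

If that reading is right, Googlebot gets only the three wildcard rules and never sees /wp-, /page/, /date/ and the others, which would explain why those rules have to be repeated in the Googlebot group.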
Starting with that file, I’ve made some edits to come up with the current version. Here’s the final version of robots.txt that I’m using, which validates at both Google and Live, and also correctly blocks and allows the content we want according to Google’s robots.txt checker…
User-agent: Googlebot
Disallow: /feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-Agent: MediaPartners-Google
Allow: /
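As a rough cross-check outside Webmaster Tools, Python’s standard urllib.robotparser can evaluate the catch-all group. It only does plain prefix matching and doesn’t understand Google’s * and $ wildcards, so this sketch feeds it just the User-agent: * section. The example.com host, the msnbot agent string, and the sample paths are placeholders I picked for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the catch-all group from the final robots.txt (no wildcard rules).
CATCH_ALL_GROUP = """\
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(CATCH_ALL_GROUP.splitlines())

def allowed(path, agent="msnbot"):
    """True if a bot that falls under 'User-agent: *' may fetch this path."""
    return parser.can_fetch(agent, "http://example.com" + path)

# Hypothetical paths to check against the rules.
for path in ["/wp-admin/", "/feed/", "/tag/seo/",
             "/2008/04/24/post-name/", "/my-post/trackback/"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Note that a per-post path such as /my-post/trackback/ is allowed here, because /trackback/ only matches at the start of the path. That is the gap the wildcard rules in the Googlebot group are meant to cover.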
We’re allowing the Adsense bot to hit everything so that it can display ads properly on all pages. So, go check your robots.txt file, will ya?!
7 Responses to “WordPress SEO and Robots.txt”
Thanks for this informative post. Can you please tell me why you need to specify useragent: Googlebot? Wouldn’t you want other spiders to avoid the same files? Thanks!
@Shae,
Yes, you do, but each user agent supports slightly different syntax. For example, Google does not support crawl delay. In my testing, I found that Google ignored my requests to avoid certain files/directories if I used “user-agent: *” so I split it up. A specific set of commands for Google, and a set of commands for all engines. That covers everybody (at least those engines that pay attention to robots.txt). So, the final robots.txt looks like this:
User-agent: Googlebot
Disallow: /feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /rss/
Disallow: /page/
Disallow: /date/
Disallow: /comments/
Disallow: /tag/
Disallow: /cgi-bin/
User-Agent: MediaPartners-Google
Allow: /
Kurt - Thanks for the quick response!
I understand. I tested my modified version of robots.txt with Google Webmaster Tools and it worked as I expected it to when I clumped all disallows under User-agent: *, so I think I’ll go with that.
I don’t have ads on my site, so I assume this group is not needed:
User-Agent: MediaPartners-Google
Allow: /
Thank you again!
@Shae,
Right, the Media bot is only for Adsense. Actually, you probably don’t need that one in there at all, it’s just that Google starts you out with an “allow” for all, so I keep that one in there for clarity. I’m not even running Adsense on this blog at this time, but it doesn’t hurt to have it in there in case I do.
Just found your blog today. Really like it - keep up the good work. Domain info is more important than you think :-) Domain information such as DNS, age of domain, and even the expiration date is used to distinguish between illegitimate and legitimate domains. Why is Google doing this? Simply to gather all the factors it can for an internal “trust score”. This “trust score” is used to eliminate “doorway” pages and spam in the search results. I’m not saying that it’s working perfectly, but they are doing a pretty good job.