Understanding Robots.txt Files For Blogspot Blogs

We begin our discussion by understanding what robots.txt files are all about and how they influence search crawlers. A robots.txt file, set up by the webmaster, tells search engine spiders which pages of a domain they may crawl and which they should leave alone. Pages that spiders are told not to crawl are generally kept out of the search index.

Blogspot by default restricts search engines from indexing the Label pages of Blogspot blogs. This is done with the help of the blog's robots.txt file.

Blogspot Labels

Labels help bloggers categorize their posts easily. For example, all my posts related to 'Blogging' can be found here:
http://www.inkjam.org/search/label/Blogging

An interesting thing to note here is that a Blogspot blog does not have a separate page assigned to each of its labels. In fact, posts categorized under a label can be accessed only through a 'search' URL.

Robots.txt for Blogspot Blogs

As stated earlier, Blogspot by default prevents web spiders from crawling its Label pages. This is what the robots.txt file of a Blogspot blog looks like by default:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated

The "User-agent: *" line means that this section applies to all robots.

The "Disallow: /search" line tells web spiders not to crawl any page whose URL path begins with /search.

Note that the slash (/) stands for the root of your blog, that is, your homepage URL. The effect of the above robots.txt file is easier to see with an example. Let us imagine that your homepage URL is http://abc.blogspot.com/

The cumulative effect of the above robots.txt file is that search engines index all your pages except those whose URLs begin with:

http://abc.blogspot.com/search/

Thus label pages, which can be accessed only through a search URL, are kept out of the search engine index.
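A quick way to sanity-check these rules is with Python's standard urllib.robotparser module. The sketch below is only an illustration: the post URL is a hypothetical example, the helper name is my own, and Python's parser applies rules in the order they appear in the file, which may differ from how a particular search engine resolves conflicts between Allow and Disallow.

```python
from urllib.robotparser import RobotFileParser

# The default Blogspot rules quoted above.
DEFAULT_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /

Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    urls = (
        "http://abc.blogspot.com/",                       # homepage
        "http://abc.blogspot.com/2013/01/my-post.html",   # hypothetical post URL
        "http://abc.blogspot.com/search/label/Blogging",  # label page
    )
    for url in urls:
        print(url, is_crawlable(DEFAULT_ROBOTS, url))
```

Under these rules the homepage and the post should come back as crawlable, while the label page should not.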

To get search engines to index your Blogspot blog's label pages, the above robots.txt file can be modified as follows:

User-agent: *
Disallow: 
Allow: /

Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated
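To see what the empty Disallow line does, the modified rules can be run through Python's standard urllib.robotparser module. This is a minimal sketch under my own naming, not a record of how any search engine actually behaves. Note that an empty Disallow opens up everything under /search, so ordinary search result pages become crawlable along with label pages.

```python
from urllib.robotparser import RobotFileParser

# The modified rules quoted above: an empty Disallow blocks nothing.
MODIFIED_ROBOTS = """\
User-agent: *
Disallow:
Allow: /
"""


def can_crawl(robots_txt: str, url: str) -> bool:
    """Return True if a generic robot ("*") may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    # Label page from the example blog used in this post.
    print(can_crawl(MODIFIED_ROBOTS, "http://abc.blogspot.com/search/label/Blogging"))
    # A plain search results page (the query string is a made-up example).
    print(can_crawl(MODIFIED_ROBOTS, "http://abc.blogspot.com/search?q=robots"))
```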

Accessing your Robots.txt file

The robots.txt file for your Blogspot blog can be accessed as follows:

1. Go to 'Settings' and then click on 'Search Preferences'

2. Under 'Crawlers and Indexing', click 'Edit' next to 'Custom robots.txt'

3. Select 'Enable Custom robots.txt content' to access your Blogspot blog's robots.txt file

Important Note: I would strongly recommend that you do not modify your robots.txt file to allow search engines to index your Label pages, as this might lead to duplicate content issues.

Setting up robots.txt to prevent Search Engines from indexing Blogspot Archive pages

Archive pages, if indexed by search engines, can lead to duplicate content worries and push your search rankings down. You can, however, set up your robots.txt file to prevent search engines from indexing your Blogspot blog's Archive pages.

Continuing with our example blog - http://abc.blogspot.com

The monthly archive pages for the said blog would look something like this:

Month	URL
January, 2013	http://abc.blogspot.com/2013_01_01_archive.html
February, 2013	http://abc.blogspot.com/2013_02_01_archive.html
March, 2013	http://abc.blogspot.com/2013_03_01_archive.html

and so forth...

You can keep Blogspot archive pages out of the search index by modifying the robots.txt file as follows:

User-agent: *
Disallow: /search
Disallow: /2013_01_01_archive.html 
Disallow: /2013_02_01_archive.html 
Disallow: /2013_03_01_archive.html 
Allow: /

Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated

Note that you will have to manually add a Disallow line for each of your archive pages, in the same manner as shown above.
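Since typing one Disallow line per month gets tedious, a small script can generate them. The following is a rough sketch that assumes your archive URLs follow the YYYY_MM_01_archive.html pattern shown in the table above; the function names are my own and the output still has to be pasted into the custom robots.txt box by hand.

```python
from typing import List


def archive_disallow_lines(year: int, first_month: int, last_month: int) -> List[str]:
    """Return one Disallow line per monthly archive page in the given range."""
    return [
        f"Disallow: /{year}_{month:02d}_01_archive.html"
        for month in range(first_month, last_month + 1)
    ]


def build_robots_txt(blog_url: str, archive_lines: List[str]) -> str:
    """Assemble a full robots.txt: default rules plus the archive Disallow lines."""
    lines = ["User-agent: *", "Disallow: /search"]
    lines += archive_lines
    lines += [
        "Allow: /",
        "",
        f"Sitemap: {blog_url}/feeds/posts/default?orderby=updated",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example blog from this post, January to March 2013.
    print(build_robots_txt("http://abc.blogspot.com", archive_disallow_lines(2013, 1, 3)))
```

Change the year and month range to cover every month in which your blog has posts.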
