We begin by looking at what robots.txt files are and how they influence search crawlers. A robots.txt file, set up by the webmaster, tells search engine spiders which pages of a domain they may crawl and index and which pages they should leave out.
By default, Blogspot restricts search engines from indexing the Label pages of Blogspot blogs. It does this through the blog's robots.txt file.
Blogspot Labels
Labels help bloggers categorize their posts easily. For example, all my posts related to 'Blogging' can be found here:
http://www.inkjam.org/search/label/Blogging
An interesting thing to note here is that a Blogspot blog does not have a separate page assigned to each of its Labels. Instead, posts categorized under a label can be accessed only with a 'search' command, which is why label URLs begin with /search.
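To make the URL structure concrete, here is a minimal Python sketch that builds a label URL from a blog address and a label name. The helper name label_url is made up for illustration, and percent-encoding the label (so that a label containing a space still produces a valid URL) is an assumption on my part rather than something taken from Blogspot's documentation.

```python
from urllib.parse import quote


def label_url(blog_url: str, label: str) -> str:
    """Build the /search/label/ URL for a label (hypothetical helper)."""
    # Drop any trailing slash so the result never contains a double slash.
    base = blog_url.rstrip("/")
    # Percent-encode the label; a space becomes %20.
    return f"{base}/search/label/{quote(label, safe='')}"


if __name__ == "__main__":
    print(label_url("http://www.inkjam.org/", "Blogging"))
```

Every URL built this way starts with /search, which is the detail that matters for the robots.txt rules discussed next.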
Robots.txt for Blogspot Blogs
As stated earlier, Blogspot by default prevents web spiders from crawling its Label pages. This is what the robots.txt file of a Blogspot blog looks like by default:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated
The "User-agent: *" line means that this section applies to all robots.
The "Disallow: /search" line tells web spiders not to crawl pages whose path begins with /search, that is, pages that use the search command.
Note that the slash (/) stands for the root of your blog, so "Allow: /" permits everything that is not explicitly disallowed. The effect of the above robots.txt file is best explained with an example. Let us imagine that your homepage URL is http://abc.blogspot.com/
Taken together, these rules let search engines index all your pages except those that use the following URL structure:
http://abc.blogspot.com/search/
Thus label pages, which can be accessed only using a search command, are kept out of the search engine index.
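If you would like to check this behaviour yourself, the rules can be fed to the robots.txt parser in Python's standard library. The snippet below is only a sketch: is_crawlable is a helper name I made up, and the post URL is an invented example. Python's parser applies the first rule that matches, whereas Google picks the most specific rule, but for this particular file both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Default Blogspot rules as quoted above (Sitemap line omitted for brevity).
DEFAULT_RULES = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_crawlable(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://abc.blogspot.com/",
        "http://abc.blogspot.com/2013/01/some-post.html",  # invented post URL
        "http://abc.blogspot.com/search/label/Blogging",
    ):
        print(url, "->", "allowed" if is_crawlable(DEFAULT_RULES, url) else "blocked")
```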
To get search engines to index your Blogspot blog's label pages, the robots.txt file shown above can be modified as follows:
User-agent: *
Disallow:
Allow: /
Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated
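An empty "Disallow:" line blocks nothing, so the label page that was off limits under the default rules becomes crawlable. Here is a small Python sketch that compares the two rule sets using the standard library parser; the names parse_rules, DEFAULT and MODIFIED are mine and purely illustrative.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Default rules versus the modified rules with an empty Disallow line.
DEFAULT = parse_rules("User-agent: *\nDisallow: /search\nAllow: /\n")
MODIFIED = parse_rules("User-agent: *\nDisallow:\nAllow: /\n")

LABEL_URL = "http://abc.blogspot.com/search/label/Blogging"

if __name__ == "__main__":
    print("default rules :", DEFAULT.can_fetch("*", LABEL_URL))
    print("modified rules:", MODIFIED.can_fetch("*", LABEL_URL))
```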
Accessing your Robots.txt files
The robots.txt file for your Blogspot blog can be accessed as follows:
1. Go to 'Settings' and then click on 'Search Preferences'.
2. Under 'Crawlers and Indexing', click 'Edit' next to 'Custom robots.txt'.
3. Choose 'Enable custom robots.txt content' to access your Blogspot blog's robots.txt file.
Important Note: I strongly recommend that you do not modify your robots.txt file to allow search engines to index your Label pages, as this might lead to duplicate content issues.
Setting up robots.txt files to prevent Search Engines from indexing Blogspot Archive pages
Archive pages, if indexed by search engines, can lead to duplicate content worries and push your search rankings down. You can, however, set up your robots.txt file to prevent search engines from indexing your Blogspot blog's Archive pages.
Continuing with our example blog - http://abc.blogspot.com
The monthly archive pages for the said blog would look something like this:
Month | URL
January 2013 | http://abc.blogspot.com/2013_01_01_archive.html
February 2013 | http://abc.blogspot.com/2013_02_01_archive.html
March 2013 | http://abc.blogspot.com/2013_03_01_archive.html
You can keep these Blogspot archive pages out of the search index by modifying the robots.txt file as follows:
User-agent: *
Disallow: /search
Disallow: /2013_01_01_archive.html
Disallow: /2013_02_01_archive.html
Disallow: /2013_03_01_archive.html
Allow: /
Sitemap: http://abc.blogspot.com/feeds/posts/default?orderby=updated
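Since a new archive page appears every month, typing these lines by hand gets tedious. Below is a rough Python sketch that generates them. It assumes every archive page follows the YYYY_MM_01_archive.html pattern shown in the table above; the function names archive_path, build_robots_txt and is_blocked are my own, not anything Blogspot provides. Each Disallow rule must carry its full path on a single line, because a bare "Disallow: /" would block the entire blog.

```python
from urllib.robotparser import RobotFileParser


def archive_path(year: int, month: int) -> str:
    """Path of a monthly archive page, assuming the pattern shown above."""
    return f"/{year:04d}_{month:02d}_01_archive.html"


def build_robots_txt(months, sitemap_url: str) -> str:
    """Build robots.txt text that blocks /search and the given (year, month) archives."""
    lines = ["User-agent: *", "Disallow: /search"]
    lines += [f"Disallow: {archive_path(y, m)}" for y, m in months]
    lines += ["Allow: /", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


def is_blocked(rules: str, url: str) -> bool:
    """Return True if a generic crawler may not fetch `url` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch("*", url)


if __name__ == "__main__":
    rules = build_robots_txt(
        [(2013, 1), (2013, 2), (2013, 3)],
        "http://abc.blogspot.com/feeds/posts/default?orderby=updated",
    )
    print(rules)
    print(is_blocked(rules, "http://abc.blogspot.com/2013_02_01_archive.html"))
```

Paste the printed text into the 'Custom robots.txt' box only after checking it line by line, since a stray rule here can hide far more of your blog than you intended.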