SEMGURU – A Web 2.0 Blog


What is robots.txt?

A robots.txt file is a plain text file placed in the root directory of a site; it tells visiting robots which pages of the site they may crawl and index.

A robots.txt file provides restrictions to search engine robots (known as “bots”) that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages.

Creating the robots.txt file

There is nothing difficult about creating a basic robots.txt file. You can create it in Notepad or any other plain text editor. Each entry has just two lines:

User-Agent: [Spider or Bot name]
Disallow: [Directory or File Name]

This pair of lines can be repeated for each directory or file you want to exclude, and for each spider or bot you want to exclude.

A few examples will make it clearer.

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called ‘private’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm
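If you have Python handy, you can sanity-check a rule like this locally with the standard library's urllib.robotparser (a sketch; example.com and the Bingbot check are just placeholders for illustration):

```python
from urllib.robotparser import RobotFileParser

# The two-line entry from the example above.
rules = [
    "User-Agent: Googlebot",
    "Disallow: /private/privatefile.htm",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from the private file...
print(parser.can_fetch("Googlebot", "http://example.com/private/privatefile.htm"))  # prints False
# ...but a bot with no matching entry is still allowed.
print(parser.can_fetch("Bingbot", "http://example.com/private/privatefile.htm"))  # prints True
```

Note that this parser implements the classic robots.txt rules, so it is a rough local check rather than a guarantee of how any particular search engine will behave.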

2. Exclude a section of your site from all spiders and bots

You are building a new section of your site in a directory called ‘newsection’ and do not wish it to be indexed before you are finished. In this case you do not need to name each robot you wish to exclude; you can simply use the wildcard character, ‘*’, to exclude them all.

User-Agent: *
Disallow: /newsection/

Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.

3. Allow all spiders to index everything

Once again you can use the wildcard, ‘*’, to let all spiders know they are welcome. The second, Disallow, line you just leave empty; with nothing listed, nothing is disallowed.

User-agent: *
Disallow:

4. Allow no spiders to index any part of your site

This requires just a tiny change from the command above – be careful!

User-agent: *
Disallow: /

If you use this command while building your site, don’t forget to remove it once your site is live!
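To see why this one-character change matters, here is a quick local check using Python's standard-library urllib.robotparser (a sketch; example.com and the bot name are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The catch-all "block everything" entry from example 4.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Every URL on the site is now off limits to every bot,
# because every path starts with "/".
for url in ("http://example.com/", "http://example.com/any/page.htm"):
    print(url, parser.can_fetch("AnyBot", url))  # prints False for both
```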

How to validate the robots.txt file

You can use the robots.txt analysis tool in Google Webmaster Tools to:

  • Check specific URLs to see if your robots.txt file allows or blocks them.
  • See if Googlebot had trouble parsing any lines in your robots.txt file.
  • Test changes to your robots.txt file.
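Before uploading changes, you can also test a draft file locally with Python's urllib.robotparser, checking several bot/URL combinations at once (a sketch; example.com and the bot names are placeholders):

```python
from urllib.robotparser import RobotFileParser

# A draft robots.txt combining the rules from examples 1 and 2.
draft = """\
User-Agent: Googlebot
Disallow: /private/privatefile.htm

User-Agent: *
Disallow: /newsection/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# Check specific URLs, mirroring the first bullet point above.
for bot, url in [
    ("Googlebot", "http://example.com/private/privatefile.htm"),
    ("Bingbot", "http://example.com/newsection/"),
    ("Bingbot", "http://example.com/index.htm"),
]:
    verdict = "allowed" if parser.can_fetch(bot, url) else "blocked"
    print(bot, url, verdict)
```

Note that a bot obeys only the most specific group that matches it, so Googlebot here follows its own entry rather than the ‘*’ entry; the parser reproduces that behaviour.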
