Robots.txt Guide 2018


The robots.txt file provides valuable information to the search engine spiders that crawl the Internet. Before going through the pages of your site, these spiders check the file. A correctly configured robots.txt lets them scan the site more efficiently, because you point the robots straight to the valuable content you want indexed. However, the directives in robots.txt, like the noindex instruction in the robots meta tag, are only recommendations for robots. They do not guarantee that closed pages will not be crawled and added to the index. If you need to reliably keep part of the site out of the index, you can additionally protect the directory with a password.
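For example, on an Apache server a directory can be password-protected with HTTP Basic authentication. This is only a sketch; the file paths and names below are assumptions that depend on your hosting setup:

```
# .htaccess inside the directory you want to protect (assumes Apache)
AuthType Basic
AuthName "Restricted area"
AuthUserFile /home/user/.htpasswd   # hypothetical path to the password file
Require valid-user
```

The .htpasswd file itself is created with the htpasswd utility and should live outside the web root.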

Basic Syntax

  • User-Agent: the robot to which the following rules apply (for example, Googlebot).
  • Disallow: the pages you want to block access to (you can specify a long list of directives, each on a new line).
  • Each User-Agent/Disallow group is separated by a blank line; however, blank lines should not appear within a group (between the User-Agent line and the last Disallow directive).
  • The '#' symbol is used for comments in robots.txt: everything after '#' on the current line is ignored. Comments can occupy a whole line or follow a directive at the end of a line.
  • Directories and file names are case sensitive: 'catalog', 'Catalog' and 'CATALOG' are all different directories for search engines.
  • Host: specifies the primary mirror of the site for search engines. Therefore, if you want to glue two sites together with a 301 page-by-page redirect, you should not redirect the robots.txt file on the duplicate site, so that search engines can still see this directive on the site you want to glue.
  • Crawl-delay: lets you limit how fast robots crawl your site. If your site has very high traffic, the load on the server from a variety of search robots can lead to additional problems.

For more flexible directives you can use two characters:

  • * (asterisk) means any sequence of characters,
  • $ (dollar sign) marks the end of a line.
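Putting these rules together, a minimal robots.txt could look like the sketch below. The paths, the crawl delay value and the host name are hypothetical:

```
# Comments start with '#'
User-agent: Googlebot
Disallow: /catalog/       # block this directory for Googlebot only

User-agent: *
Disallow: /*.pdf$         # block any URL ending in .pdf for all robots
Crawl-delay: 10           # hypothetical: wait 10 seconds between requests

Host: www.example.com     # hypothetical primary mirror
```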

Basic Examples of Using Robots.txt

Ban on indexing the entire site (deny all):

User-agent: *
Disallow: /

This instruction is important when you are developing a new website and give access to it, for example, through a subdomain. Developers often forget to close the site from indexing, and search engines immediately index a complete copy of the site. If this happens, you have to do a 301 page-by-page redirect to your primary domain.

The next construction allows the whole site to be indexed:

User-agent: *
Disallow:

Ban on indexing a specific folder:

User-agent: Googlebot
Disallow: /no-index/

Ban on visiting a page for a particular robot:

User-agent: Googlebot
Disallow: /no-index/this-page.html

Ban on indexing a specific file type:

User-agent: *
Disallow: /*.pdf$

Allow a specific search robot to visit a particular page that is closed for everyone else:

User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html

Link to a Sitemap:

User-agent: *
Disallow:

Sitemap: http://www.example.com/none-standard-location/sitemap.xml

Nuances of using this directive: if you regularly add unique content to the site, it is better not to add the link to your sitemap to robots.txt. Since many unscrupulous webmasters parse content from other sites and use it for their own projects, it is safer to give the sitemap an unusual name, for example my-new-sitemap.xml, and submit the link through the search engines' webmaster tools instead.
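Simple path rules like these can be sanity-checked locally with Python's standard urllib.robotparser module (the URLs below are hypothetical). Note that this parser follows the original robots.txt specification and does not understand the * and $ wildcards, so it is only suitable for checking plain path rules:

```python
from urllib import robotparser

# The "ban a particular page for Googlebot" example from above
rules = """\
User-agent: Googlebot
Disallow: /no-index/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse() accepts the file's lines directly

# Anything under /no-index/ is blocked for Googlebot...
print(rp.can_fetch("Googlebot", "http://www.example.com/no-index/this-page.html"))  # False
# ...while other pages stay open
print(rp.can_fetch("Googlebot", "http://www.example.com/other-page.html"))  # True
```

The same object can be pointed at a live site with set_url() and read() instead of parse().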


Which Is Better to Use: Robots.txt or Meta Robots?

If you do not want a page included in the index, it is better to use the noindex robots meta tag: add the tag <meta name="robots" content="noindex"> to the <head> section of the page. This allows you to:

  • remove the page from the index on the search robot's next visit (so you do not have to delete the page manually via the webmaster tools),
  • keep passing link juice.

It is better to close from indexing through robots.txt:

  • the site admin panel,
  • site search results,
  • registration, login and password recovery pages.
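A sketch of such a file follows; the paths are hypothetical and depend on your CMS:

```
User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /login/
Disallow: /password-recovery/
```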


How to Check the Robots.txt File?

Once the robots.txt file is finalized, check it for errors. You can use the verification tools from search engines:

[Screenshot: Google Webmaster Tools robots.txt Tester]

Google Webmasters: sign in with your verified site and go to Crawl -> robots.txt Tester to check the robots.txt file. With this tool you can:

  • immediately see all the errors and possible problems,
  • make all the changes at once, re-check for errors, and then move the final file to your website,
  • check whether all pages that should not be indexed are closed and all desired pages are open.
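As a complementary local check, a short script can assert that important URLs stay open and private ones stay blocked before you upload the file. The rules and URLs below are hypothetical, and, as noted above, urllib.robotparser does not support the * and $ wildcards:

```python
from urllib import robotparser

# Hypothetical draft robots.txt to validate before uploading
DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""

# Hypothetical URLs that must stay open, and URLs that must be blocked
must_allow = ["http://www.example.com/",
              "http://www.example.com/blog/post-1.html"]
must_block = ["http://www.example.com/admin/",
              "http://www.example.com/search/?q=test"]

rp = robotparser.RobotFileParser()
rp.parse(DRAFT.splitlines())

problems = []
for url in must_allow:
    if not rp.can_fetch("*", url):
        problems.append(f"blocked but should be open: {url}")
for url in must_block:
    if rp.can_fetch("*", url):
        problems.append(f"open but should be blocked: {url}")

print("OK" if not problems else "\n".join(problems))
```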

Note that creating and configuring robots.txt is among the first steps of internal website optimization and the starting point of search engine promotion. It is important to set it up correctly, so that the right pages and sections are available for search engines to index while the undesirable ones are closed. However, remember that robots.txt does not guarantee that pages will stay out of the index.