Analyzing a robots.txt file

February 10, 2006

Earlier this week, we told you about a feature we made available through the Sitemaps program that analyzes the robots.txt file for a site. Here are more details about that feature.

What the analysis means

The Sitemaps robots.txt tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed.

This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard. It understands Allow: lines, as well as the * wildcard (which matches any sequence of characters) and the $ character (which marks the end of a URL). So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.
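
To make the * and $ extensions concrete, here is a minimal Python sketch of how such a pattern could be matched against a URL path. It is not Googlebot's actual implementation. The function names pattern_to_regex and path_matches are hypothetical, and the sketch assumes that * matches any run of characters and that a trailing $ anchors the end of the path.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics (a sketch, not Googlebot's real code):
    '*' matches any run of characters, a trailing '$' anchors the
    end of the path, and every other character is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the pattern matches starting at the beginning of path."""
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*.pdf$", "/docs/file.pdf"))      # True
print(path_matches("/*.pdf$", "/docs/file.pdf?x=1"))  # False
print(path_matches("/private", "/private/page.html")) # True
```

A bot that implements only the original standard would treat * and $ in a path as ordinary characters, which is why the same file can behave differently for other crawlers.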

Subdirectory sites

A robots.txt file is valid only when it's located in the root of a site. So, if you are looking at a site in your account that is located in a subdirectory (such as https://www.example.com/mysite/), we show you information on the robots.txt file at the root (https://www.example.com/robots.txt). You may not have access to this file, but we show it to you because the robots.txt file can impact crawling of your subdirectory site and you may want to make sure it's allowing URLs as you expect.
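
As a small illustration of the rule above, this Python sketch derives the root robots.txt URL that applies to a site located in a subdirectory. The helper name robots_txt_url is hypothetical; it only uses the standard library's URL parsing.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url: str) -> str:
    """Return the root-level robots.txt URL for the host that serves site_url."""
    parts = urlsplit(site_url)
    # Keep the scheme and host, drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/mysite/"))
# https://www.example.com/robots.txt
```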

Testing access to directories

If you test a URL that resolves to a file (such as https://www.example.com/myfile.html), this tool can determine if the robots.txt file allows or blocks that file. If you test a URL that resolves to a directory (such as https://www.example.com/folder1/), this tool can determine if the robots.txt file allows or blocks access to that URL, but it can't tell you about access to the files inside that folder. The robots.txt file may set restrictions on URLs inside the folder that are different from those on the URL of the folder itself.

Consider this robots.txt file:

User-Agent: *
Disallow: /folder1/

User-Agent: *
Allow: /folder1/myfile.html

If you test https://www.example.com/folder1/, the tool will say that it's blocked. But if you test https://www.example.com/folder1/myfile.html, you'll see that it's not blocked even though it's located inside folder1.
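
The result above can be reproduced with a short Python sketch. It is only an approximation of the tool's behavior, built on two assumptions: groups that name the same user-agent are combined, and the longest matching rule wins, with Allow winning a tie. The function names parse_rules and is_allowed are hypothetical, and the sketch does plain prefix matching without * or $ support.

```python
ROBOTS_TXT = """\
User-Agent: *
Disallow: /folder1/

User-Agent: *
Allow: /folder1/myfile.html
"""


def parse_rules(robots_txt: str, agent: str = "*") -> list[tuple[str, str]]:
    """Collect (directive, path) pairs from every group that names `agent`."""
    rules = []
    group_agents = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-Agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if agent.lower() in group_agents and value:
                rules.append((field, value))
    return rules


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply the assumed longest-match rule; no matching rule means allowed."""
    best = ("allow", "")
    for directive, prefix in rules:
        if path.startswith(prefix):
            longer = len(prefix) > len(best[1])
            tie_allow = len(prefix) == len(best[1]) and directive == "allow"
            if longer or tie_allow:
                best = (directive, prefix)
    return best[0] == "allow"


rules = parse_rules(ROBOTS_TXT)
print(is_allowed(rules, "/folder1/"))             # False (blocked)
print(is_allowed(rules, "/folder1/myfile.html"))  # True (allowed)
```

Parsers that stop at the first matching rule instead would block both URLs here, so results from other libraries or bots may differ.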

Syntax not understood

You might see a "syntax not understood" error for a few different reasons. The most common one is that Googlebot couldn't parse the line. However, some other potential reasons are:

The site doesn't have a robots.txt file, but the server returns a status of 200 for pages that aren't found. If the server is configured this way, then when Googlebot requests the robots.txt file, the server returns a page. However, this page isn't actually a robots.txt file, so Googlebot can't process it.
The robots.txt file isn't a valid robots.txt file. If Googlebot requests a robots.txt file and receives a different type of file (for instance, an HTML file), this tool won't show a syntax error for every line in the file. Rather, it shows one error for the entire file.
The robots.txt file contains a rule that Googlebot doesn't follow. Some user-agents obey rules other than those in the robots.txt standard. If Googlebot encounters one of the more common additional rules, the tool lists them as syntax errors.
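
The cases above can be illustrated with a rough Python sketch that inspects a fetched robots.txt body. This is a heuristic written for illustration, not the tool's actual logic. The names KNOWN_FIELDS, looks_like_html, and check_robots_body are hypothetical, the HTML check is a simple guess based on the start of the response, and the set of understood fields is assumed from the directives mentioned in this post. Crawl-delay stands in for a rule that some other crawlers honor.

```python
# Assumed set of fields the checker understands (hypothetical, for illustration).
KNOWN_FIELDS = {"user-agent", "allow", "disallow"}


def looks_like_html(body: str) -> bool:
    """Guess whether the response is an HTML page rather than robots.txt."""
    head = body.lstrip()[:200].lower()
    return head.startswith(("<!doctype", "<html"))


def check_robots_body(body: str) -> list[str]:
    """Return human-readable problems found in a fetched robots.txt body."""
    if looks_like_html(body):
        # One error for the whole file instead of one per line.
        return ["whole file: response looks like HTML, not robots.txt"]
    problems = []
    for number, raw in enumerate(body.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        field, sep, _ = line.partition(":")
        if not sep or field.strip().lower() not in KNOWN_FIELDS:
            problems.append(f"line {number}: syntax not understood: {raw.strip()}")
    return problems


print(check_robots_body("<html><body>Page not found</body></html>"))
print(check_robots_body("User-agent: *\nCrawl-delay: 10\nDisallow: /tmp/\n"))
```

The first call reports a single file-level problem, which mirrors a server that returns a status of 200 with an HTML page for a missing robots.txt file. The second call flags only the Crawl-delay line.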

Known issues

We are working on a few known issues with the tool, including the way the tool processes capitalization and the analysis with Google user-agents other than Googlebot. We'll keep you posted as we get these issues resolved.

Posted by Vanessa Fox