I was looking at my IIS log files, and noticed that many URLs with paths listed in my robots.txt file were being read by a User-Agent claiming to be Googlebot. My first thought was that it might be some bad bot masquerading as Google. So, I checked the IP address, and it looked legit. I double-checked my robots.txt file, and it looked fine. What could the problem be?
Here's a snapshot of the file:
User-agent: Googlebot
Disallow: /*-hires.
Disallow: /*-big.
Disallow: /scripts/
Disallow: /p/s/
Disallow: /clientbin/
Disallow: /ClientBin/
Disallow: /app_themes/
Disallow: /App_Themes/
Disallow: /samples/
Disallow: /p/id.ashx

User-agent: *
Disallow: /scripts/
Disallow: /p/s/
Disallow: /clientbin/
Disallow: /ClientBin/
Disallow: /app_themes/
Disallow: /App_Themes/
Disallow: /samples/
Disallow: /p/id.ashx
I decided to double-check the syntax by running the file through an online syntax checker, to see if I was missing something obvious. Sure enough, it found an error, although one that's not readily visible.
It turns out that when you create a .txt file from within Visual Studio, it gets stored in UTF-8 format with a byte order mark (BOM), the three bytes EF BB BF, written at the beginning of the file. The BOM, in turn, corrupts the first line, which explains why Googlebot was reading files that it shouldn't.
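To see what the syntax checker was complaining about, here is a minimal Python sketch that checks raw bytes for the UTF-8 BOM and shows the first line the way a parser that doesn't strip the BOM would see it. The function names and the sample bytes are mine, made up for illustration; this is not how Googlebot or any particular checker is implemented.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def first_line_as_parsed(data: bytes) -> str:
    """Decode the first line without stripping the BOM.

    Plain "utf-8" decoding keeps the BOM as the character U+FEFF,
    which is what a BOM-unaware parser would end up comparing against.
    """
    return data.split(b"\n", 1)[0].decode("utf-8").rstrip("\r")


# Hypothetical sample: a robots.txt saved as UTF-8 with a BOM.
sample = codecs.BOM_UTF8 + b"User-agent: Googlebot\r\nDisallow: /scripts/\r\n"

print(has_utf8_bom(sample))                # the BOM is present
print(repr(first_line_as_parsed(sample)))  # the line starts with '\ufeff'
```

The first line no longer starts with the text "User-agent", so a strict parser has no reason to treat it as the start of a group.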
An easy fix is to make the first line of the file either a blank line or a comment, so that the BOM ends up on a line that carries no directive.
If you create a .txt file with Notepad, the default is to save it as an ANSI file, which doesn't use a BOM. However, if you do most of your work within Visual Studio, like I do, then this is more likely to be an issue.
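Another option, which I'm adding as a sketch rather than something I did at the time, is to strip the BOM from the file after saving it. The Python below assumes the file is otherwise valid UTF-8; the helper names are hypothetical, and the path you pass in is up to you.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(path: str) -> bool:
    """Rewrite the file in place without a BOM.

    Returns True if the file was changed, False if it had no BOM.
    """
    file_path = Path(path)
    original = file_path.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned != original:
        file_path.write_bytes(cleaned)
        return True
    return False
```

Calling rewrite_without_bom("robots.txt") from the site's root folder would clean the file in place. Working on raw bytes means the line endings and everything else in the file are left untouched.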