by Darrin J. Ward:
Although this is not a new topic, we still see a lot of people attempting to trick search engine robots by using JavaScript "include" files to perform nasty redirects, or to set a particular element to display:none or visibility:hidden (via JS or CSS) - thus making it invisible to users but visible to search engine robots. Similarly, we see a lot of people who use CSS to set the H (H1, H2, etc.) family of elements to a much smaller font size than their natural appearance.
The fundamental premise of such implementations is normally that the search engines do not actively look at these include files, so the "tricks" will remain undiscovered. That assumption is incorrect.
It's not exactly "new news" that search engine crawlers do indeed look at these files. In fact, I distinctly remember posting about Google's crawler making hits on .js (JavaScript) and .css (style sheet) files years ago, literally. To prove it, here are some hits taken from the raw access log files of this very blog (which has only been on this new domain for a number of days):
66.249.72.20 - - [16/Jul/2007:18:27:44 -0400] "GET /inc/js/share-this.js HTTP/1.1" 200 1178 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.20 - - [16/Jul/2007:18:28:40 -0400] "GET /inc/js/prototype.js HTTP/1.1" 200 14471 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.20 - - [15/Jul/2007:19:15:47 -0400] "GET /inc/css/styles.css HTTP/1.1" 200 2180 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.20 - - [17/Jul/2007:22:22:24 -0400] "GET /inc/css/styles.css HTTP/1.1" 200 2264 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
How do I know that these are actually from Google and not a fake? Simple. I take the IP 66.249.72.20 over to the ARIN IP Whois tool and see who owns that IP. OK, so how did I actually get to see these entries? Well, even though SpyderTrax is a great tool for checking on robot activity at the page level, it doesn't show details on hits to these .css and .js files. So I logged into my server via SSH and ran the following command on my access log:
grep 'Googlebot' FILENAME | grep '\.js\|\.css'
This command shows me all of the hits that contain "Googlebot" along with either ".js" or ".css". Note the escaping backslashes: \. matches a literal dot, and \| means "or" (GNU grep syntax). If I only wanted to see one or the other, then I would use only '\.js' or '\.css' for that last part.
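For anyone who would rather parse the log than chain greps, the same filter can be sketched in Python. This is only a rough sketch: it assumes the log is in the Apache combined format seen in the sample lines above, and the function name googlebot_asset_hits is my own invention, not part of any tool mentioned here.

```python
import re

# Apache combined log format, as in the sample lines in this post:
# ip identity user [timestamp] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_asset_hits(lines, extensions=(".js", ".css")):
    """Yield parsed hits whose user-agent mentions Googlebot and whose
    requested path ends in one of the given extensions.

    Hypothetical helper: `extensions` must be a tuple of suffixes.
    """
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        hit = match.groupdict()
        # Ignore any query string so "/a.js?v=2" still counts as a .js hit.
        path = hit["path"].split("?", 1)[0]
        if "Googlebot" in hit["agent"] and path.lower().endswith(extensions):
            yield hit
```

It takes any iterable of lines, so an open log file can be passed in directly. Unlike the grep version, it only matches the extension at the end of the requested path, so a Googlebot hit on something like /js-tips.html would not be counted by mistake.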
So, in writing this, I was wondering when all of this activity actually started. I know it's been going on for years. Lucky for me, I'm a fanatical Analytics fan (not Google Analytics), and I know the value of being able to retroactively look at Key Performance Indicators, so I have log files dating back to the start of 2003. And not just on one site, but on enough sites to actually take a peep and see when Google's activity might have started. So I did.
64.68.89.138 - - [23/Mar/2004:22:49:08 -0500] "GET /includes/js/nav/menu_com.js HTTP/1.1" 200 21960 "-" "Googlebot/Test"
64.68.89.138 - - [24/Mar/2004:01:05:20 -0500] "GET /includes/js/functions.js HTTP/1.1" 200 1085 "-" "Googlebot/Test"
64.68.89.167 - - [25/Mar/2004:08:32:50 -0500] "GET /includes/js/nav/exmplmenu_var.js HTTP/1.1" 200 3406 "-" "Googlebot/Test"
64.68.89.182 - - [25/Mar/2004:21:31:06 -0500] "GET /includes/js/nav/menu_com.js HTTP/1.1" 200 21960 "-" "Googlebot/Test"
64.68.89.182 - - [26/Mar/2004:04:09:04 -0500] "GET /includes/js/functions.js HTTP/1.1" 200 1085 "-" "Googlebot/Test"
These are the first hits that I have tracked from Googlebot. Admittedly, I only looked at 2004, because I started with that year and the log files were so large they took forever to process. As you can see, they were using the "Googlebot/Test" User-agent then, but an IP Whois confirms that it's Google's IP block. So it would appear as though there was a three-day test or so going on at that point. One week before my birthday.
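The Whois lookups in this post are done by hand. If you wanted to script the "is this really Google?" check, one option is a reverse-then-forward DNS lookup. To be clear, that is a different technique from the Whois lookup I used, and I have not run it against the 2004 IPs above. The sketch below is only a sketch: the function name is made up, and the two lookup parameters exist purely so the check can be tried without touching the network.

```python
import socket

# Hostname suffixes treated as Google's for this check (an assumption
# of this sketch; adjust to whatever you consider trustworthy).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_crawler(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a Google hostname that
    in turn resolves back to the same IP. Hypothetical helper."""
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False  # no reverse DNS entry, or lookup failed
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the original IP; otherwise anyone
    # could point their own reverse DNS at a Google-looking name.
    return ip in addresses
```

The forward step is the important part: a User-agent string is trivial to fake, and a reverse DNS record on its own is controlled by whoever owns the IP.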
I had intended to strip out all valid hits from Googlebot over the last few years and plot a graph to show activity levels, but there's work to be done and I'm not sure something like that would have all that much value, even if it is super-interesting.
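For anyone who does want to chart that kind of activity, a per-day tally is probably the first step. Here is a minimal sketch, assuming the bracketed timestamp format shown in the log lines above; the function name hits_per_day is hypothetical, and it does nothing to verify that a hit really came from Google.

```python
from collections import Counter
from datetime import datetime


def hits_per_day(lines, agent_substring="Googlebot"):
    """Count log lines mentioning `agent_substring`, grouped by the
    date in the bracketed timestamp. Hypothetical helper."""
    counts = Counter()
    for line in lines:
        if agent_substring not in line:
            continue
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # no timestamp found
        try:
            # e.g. 23/Mar/2004:22:49:08 -0500
            # (%b expects English month names, i.e. the default C locale)
            stamp = datetime.strptime(line[start + 1:end],
                                      "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp in some other format
        counts[stamp.date()] += 1
    return counts
```

The result maps each date (in the log's own timezone offset) to a hit count, which is about all a plotting tool or spreadsheet needs.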
So what is the take-away from all of this nonsense? Simple. Take a look through your JavaScript and CSS files to make sure that they validate and that there are no functions that might accidentally perform redirects in what might be considered a sneaky way. I'm not necessarily worried about re-styling H1 or H2 tags with CSS - I do that myself. However, I wouldn't ever have them 100 pixels off screen or invisibly small, because that's obviously very easy to detect. Of course - only the big G knows what they do with those files for sure!