Preventing Bad Bots / Scraping / Email Harvesting

July 19, 2007 10:12am
Categories: Organic SEO

Protecting your content from bad robots and scrapers is important for SEO. Here are some thoughts.

Every serious webmaster that I know, and especially every SEO, has complained at least once about their content being stolen (scraping). Computer software known as "spiders", "robots" or "bots" regularly crawls the internet and visits our websites. Search engines use this spidering software to visit your website and include your pages in their indices. Unfortunately, people who are out to steal your content also use this software to download websites in huge volumes and then republish them elsewhere, thus detracting from the uniqueness of the "scraped" content. Other robots just relentlessly rip through pages in an attempt to harvest as many email addresses as possible, so that they can spam them later. The way to stop either these scrapers or email harvesters is the same.

Obviously, we don't mind search engines accessing our websites, because they are "good" bots. But how do we prevent the content-robbing "bad bots" from accessing our websites?

Many people will tell you about something called mod_rewrite in the ".htaccess" file, which is an Apache directives file. Many of the directives that you will find on the internet use a very simple filtering system to prevent known-bad robots from accessing your content. The problem is that almost all of them rely on the spider-supplied "User-Agent" field. Some others rely on blocking known-bad IP address blocks. But none of them allow you, as a webmaster, to dynamically detect and block these bad robots.

First, let me give you a brief summary of all of the information that we will have available to us for each hit in order to make a sound judgement:

  1. The IP address of the client.
  2. The requested URL.
  3. The User-Agent field.

Through analysis of these 3 parameters, it is possible to design an accurate system which will prevent unauthorized robots from accessing (scraping) our content.

I have been using a proprietary script for some time that attempts to solve the bad robots problem. Today, I'll share with you the logic that I have implemented, which has demonstrated great success. I know of a few other people who also use the following method with some variation. In fact, you may be able to find a script or service that does this for you.

A brief overview of the steps that I take to detect scraping "bad bots" is as follows (also see the additional considerations below):

  1. See if client/spider follows a blank URL.
  2. If it does, get the IP address and perform a reverse DNS lookup.
  3. If the DNS lookup resolves to an untrusted domain, block access in .htaccess (or httpd.conf).

So, here is some more explanation of the procedure:

I have coded a hidden link into all of my web pages. Why? Because robots/spiders follow almost all links, and I want those spiders, not ordinary users, to follow this special hidden link, so I made it invisible. By invisible, I mean that it has no anchor text, for example something like <a href="robots.php" rel="nofollow"></a>. Note that this code links to a php file and that I have used the rel=nofollow attribute (to prevent it from being followed by spiders that understand that command).

It is logical to assume that the majority of hits to this php file will be culpable, though we cannot yet accurately form an opinion as to whether the hit is from a good robot, a bad "bot" or some other source (such as a browser pre-fetch). However, we can use the php file to try and help us out, since only automated requests should be made to this file.

So, again, what information do we have from the hit to the php file? Remember, for each hit we have an IP address, the requested URL and the User-Agent. We can't really use the User-Agent field, because it is supplied by the requesting agent, and is thus potentially mendacious. We can trust the requested URL because it is logged on our side (but we've already used this to our advantage by detecting the hit to robots.php). We can also trust the IP address.

Yes, IPs can be faked/spoofed, but the majority of scrapers are small time and don't have the resources or know-how to perform IP spoofing (which is actually very difficult to do over TCP/IP to websites).

OK, so let's look at the IP. If the request is suspicious (which it is, since it followed the invisible link to the php file), then we need to determine whether or not the IP can be trusted, i.e. whether it's a good or bad bot, and that can be done via a reverse DNS lookup. A reverse DNS lookup tries to translate an IP address into the domain name that it belongs to. Here is an example of some IP addresses (taken directly from my log file) and the domain names to which they correspond:

So, we found that the first IP address does actually belong to Google, but the second one doesn't. We immediately ban the second IP address by adding it as a "deny from" in the .htaccess file, because it is crawling our site and downloading our content, as evidenced by its request to the php file in the blank link.

We don't know what that domain is and we don't want to trust it with unrestricted access to our content. When I add an IP address to my list, I regenerate my .htaccess ban file immediately, which ensures the bad bot is banned immediately.
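
Regenerating the ban file could be sketched roughly as follows. This is only a minimal illustration in Python, not the script described in this post: the function name render_ban_file is hypothetical, and it assumes Apache 2.2-style "Order" / "Deny from" directives (on Apache 2.4 these are only available through mod_access_compat).

```python
import ipaddress


def render_ban_file(banned_ips):
    """Build .htaccess text with one "Deny from" line per banned IP.

    Hypothetical helper: validates every entry and removes duplicates,
    so that a malformed value never ends up in the Apache config.
    """
    # ip_address() raises ValueError for anything that is not a valid IP.
    unique = {ipaddress.ip_address(ip) for ip in banned_ips}
    # Sort by (version, numeric value) so IPv4 and IPv6 can be mixed safely.
    ordered = sorted(unique, key=lambda addr: (addr.version, int(addr)))

    lines = ["Order Allow,Deny", "Allow from all"]
    lines.extend(f"Deny from {addr}" for addr in ordered)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Documentation-range addresses, used purely as placeholders.
    print(render_ban_file(["203.0.113.9", "198.51.100.7", "203.0.113.9"]))
```

The returned text would then be written into the ban section of .htaccess each time the list changes, which is what makes the ban take effect on the next request.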

At this point, you should realize that you will need to compile a list of trusted domain names. You should also thoroughly test your script so that you know it won't erroneously block a trusted domain.
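
The reverse DNS check against a trusted list might look something like the sketch below. It is written in Python for illustration only; the names reverse_dns, is_trusted and should_ban are hypothetical, the TRUSTED_DOMAINS entries are placeholders rather than a real list, and treating an IP with no reverse DNS record as untrusted is an assumption of this sketch, not something stated above.

```python
import socket

# Placeholder allow-list (assumption): build and test your own list.
TRUSTED_DOMAINS = ("googlebot.com", "crawler.example.org")


def reverse_dns(ip):
    """Return the hostname an IP reverse-resolves to, or None on failure."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def is_trusted(hostname, trusted_domains=TRUSTED_DOMAINS):
    """True if hostname is a trusted domain or a subdomain of one."""
    if not hostname:
        return False
    host = hostname.lower().rstrip(".")
    # Match on a dot boundary so "evilgooglebot.com" does not pass.
    return any(host == d or host.endswith("." + d) for d in trusted_domains)


def should_ban(ip, resolve=reverse_dns, trusted_domains=TRUSTED_DOMAINS):
    """Decide whether an IP that requested the hidden link gets banned.

    The resolver is injectable so the decision can be tested offline.
    """
    return not is_trusted(resolve(ip), trusted_domains)


if __name__ == "__main__":
    fake = lambda ip: "crawl-66-249-66-1.googlebot.com"
    print(should_ban("66.249.66.1", resolve=fake))
```

Matching on the dot boundary matters here: a plain "ends with" test would let a scraper register a look-alike domain and slip through, which is exactly the kind of erroneous result that thorough testing should catch.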

I'm going to leave the logistics of the unban up to you. I present my banned users with a captcha challenge to see if they are really humans. I also automatically unban IP addresses after a set time frame. Please read the additional considerations below.

I'm not going to talk about the script too much other than how it operates. Many people will use different platforms and programming languages. The system described above could be coded for almost any platform, in almost any language, so I won't talk about the actual code. You should, however, consider the following:

Additional Considerations:
I'm not going to talk too much about these, but you should be aware of the following if you choose to use the system described above:

  1. IMPORTANT: You simply must have some kind of feature on the script that will let a user unban themselves. Inevitably, a minute percentage of false positives will occur, and you don't want that visitor to be turned away forever. The way to accomplish this is by having some kind of a challenge response if the user's IP does get banned. I use a custom Captcha system which allows the user to unban their IP. Human users can perform the unban and continue surfing whereas the automated bad bots cannot.
  2. Some users will visit your site through multi-IP proxies (AOL is a good example), so you need to account for hits from multiple IPs, otherwise your ban program may get confused.
  3. Browser Pre-Fetch mechanisms are likely to give a false positive on this system, since they will follow the blank URL, so looking for the pre-fetch header is something you might want to do. However, be aware that this header could also be sent by a bad bot, so don't trust it too much. It's a little more trustworthy than the User-Agent field.
  4. There can be freak occurrences where a hit is made to the php script by a legit visitor. So, you may want to think about running multiple instances of the system in parallel, wherein a ban only occurs if hits are made to all of them consecutively. If you do choose to work on a system like this, one thing to look at might be the number of seconds between the hits, because it's only automated systems that will make multiple hits in very short time-frames (1 or 2 seconds).
  5. Spam bots change IP addresses every now and then because they do eventually get blacklisted. When that happens, the old IP address gets reassigned to other web surfers that are not necessarily bad, so only block IP addresses for a maximum amount of time (I go for 1 month).
  6. It is also possible to perform an "IP Whois" on an IP address to find the NetBlock owner, which would give you some extra details, in addition to a simple reverse DNS lookup. This could be helpful for making a ban decision.
  7. Sometimes search engines have IPs that resolve to domain names other than what would seem intuitive. You need to be aware of these so that they don't get banned. These are usually found through trial and error.
  8. You may also wish to consider adding the entire C-Class IP block to the ban list, since it is possible that the bad bot will use multiple IPs from within the same range, and you certainly do want to prevent those from grabbing your content.
  9. You will need to use a database to keep track of all the bans, IPs, unbans, times, requested and referring URLs, etc. I use MySQL with about 15 fields (yes, there are that many parameters to track, when you think about it!).
  10. You must keep the script under close observation during its test run. You will need to be "tailing" and "grepping" log files to keep an eye on your script and what it is doing.
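
A few of the considerations above (ban expiry, C-Class blocks, and the timing of consecutive hits) can be sketched in a handful of lines. This is a hedged Python illustration only: the function names are hypothetical, "1 month" is approximated as 30 days, and the default of 3 required hits is an assumption made for the example.

```python
import ipaddress
from datetime import datetime, timedelta

# "1 month" from the list above, approximated as 30 days (assumption).
BAN_LIFETIME = timedelta(days=30)


def ban_is_active(banned_at, now, lifetime=BAN_LIFETIME):
    """True while a ban is younger than the maximum ban lifetime."""
    return now - banned_at < lifetime


def class_c_block(ip):
    """Return the /24 ("C-Class") network that contains an IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def hits_look_automated(hit_times, required_hits=3,
                        max_gap=timedelta(seconds=2)):
    """True if enough trap hits arrived, each within max_gap of the last.

    required_hits stands for the number of parallel trap instances;
    the value 3 is a placeholder, not a recommendation.
    """
    if len(hit_times) < required_hits:
        return False
    ordered = sorted(hit_times)
    return all(later - earlier <= max_gap
               for earlier, later in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    start = datetime(2007, 7, 19, 10, 12)
    print(class_c_block("203.0.113.77"))
    print(ban_is_active(start, start + timedelta(days=29)))
```

The string form of the network returned by class_c_block is in CIDR notation, which Apache accepts in a "Deny from" line, so a whole block can be banned with a single entry.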

This all sounds like a relatively easy process, and for the most part it is, but it is very tedious. You simply must try to think outside the box when you are implementing the logistics of this program. I haven't quite told you everything, but you do now have enough information to detect and block bad bots.

I wanted to make the script that I use available to the public (for free, of course). The unfortunate reality is that it would take too much effort to tidy up the code and write documentation, so I'm going to have to refrain from doing that for now, unless I get enough requests for it.

I hope that this will shed some light on how to block bad bots, because you really need to protect your content from thieves, and from email harvesters too! Your thoughts and feedback are welcome.

 