Every serious webmaster that I know, and especially the SEOs, has complained at least once about their content being stolen (scraped). Computer software known as "spiders", "robots" or "bots" regularly crawls the internet and visits our websites. Search engines use this spidering software to visit your website and include your pages in their indices. Unfortunately, people who are out to steal your content also use this software to download websites in huge volumes and then republish them elsewhere, thus detracting from the uniqueness of the "scraped" content. Other robots just relentlessly rip through pages in an attempt to harvest as many email addresses as possible, so that they can spam them later. The way to stop either these scrapers or email harvesters is the same.
Obviously, we don't mind the search engines accessing our website because they are "good" bots, but how do we prevent the content-robbing "bad bots" from accessing our websites?
Many people will tell you about something called mod_rewrite in the ".htaccess" file, which is an Apache directives file. Many of the directives that you will find on the internet use a very simple filtering system to prevent known-bad robots from accessing your content. The problem is that almost all of them rely on the spider-supplied "User-Agent" field, which a bad bot can set to anything it likes. Some others rely on blocking known-bad IP address blocks. But none of them allows you as a webmaster to dynamically detect and block these bad robots.
First, let me give you a brief summary of all of the information that we will have available to us in order to make a sound judgement. For each hit we have three things: the IP address, the requested URL and the User-Agent.
I have been using a proprietary script for some time that attempts to solve the bad robots problem. Today, I'll share with you the logic that I have implemented, which has demonstrated great success. I know of a few other people who also use the following method with some variation. In fact, you may be able to find a script or service that does this for you.
A brief overview of the steps that I take to detect scraping "bad bots" is (also see additional considerations):
So, here is some more explanation of the procedure:
I have coded a hidden link into all of my web pages. Why? Because robots/spiders follow almost all links, and I want those spiders to follow this special hidden link, not ordinary users, so I made it invisible. By invisible, I mean that it has no anchor text... for example: . Note that this code links to a php file and that I have used the rel=nofollow attribute (to prevent it from being followed by spiders that understand that command).
It is logical to assume that the majority of hits to this php file will be culpable, though we cannot yet accurately form an opinion as to whether the hit is from a good robot, a bad "bot" or some other source (such as a browser pre-fetch). However, we can use the php file to try and help us out, since only automated requests should be made to this file.
So, again, what information do we know from the hit to the php file? Remember, for each hit we have an IP address, requested URL and the User Agent. We can't really use the User-Agent field, because it is supplied by the requesting agent, and thus potentially mendacious. We can trust the requested URL because it is logged on our side (but we've already exhausted this to our advantage by detecting the hit to robots.php). We can also trust the IP address.
Yes, IPs can be faked/spoofed, but the majority of scrapers are small-time and don't have the resources or know-how to perform IP spoofing (which is actually very difficult to do over TCP/IP to websites, because the TCP handshake only completes if the sender can receive the replies sent to the spoofed address).
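To make the three pieces of information concrete, here is a minimal Python sketch of what a trap script might record for each hit. It is not the proprietary script described in this post. It assumes the page is served through a WSGI/CGI-style interface, and the names `describe_hit` and `is_trap_hit` and the exact trap path are hypothetical.

```python
def describe_hit(environ):
    """Collect the three per-request facts from a WSGI/CGI-style environ dict."""
    return {
        # Taken from the connection by the server, not typed in by the client.
        # (Behind a proxy or load balancer this is the proxy's address.)
        "ip": environ.get("REMOTE_ADDR", ""),
        # Logged on our side; a hit on the trap URL is what raises suspicion.
        "url": environ.get("PATH_INFO", ""),
        # Supplied by the requester, so kept for reference only, never trusted.
        "user_agent": environ.get("HTTP_USER_AGENT", ""),
    }


def is_trap_hit(hit, trap_path="/robots.php"):
    """True when the request went to the hidden-link target (path is assumed)."""
    return hit["url"] == trap_path
```

Only the "ip" value feeds into the decision that follows; the User-Agent is stored purely so that you can look at it later.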
OK, so let's look at the IP. Since the request is suspicious (it followed the invisible link to the php file), we need to determine whether or not the IP can be trusted, i.e. whether it's a good or bad bot, and that can be done via a reverse DNS lookup. A reverse DNS lookup tries to translate an IP address into the domain name that it belongs to. Here is an example of some IP addresses (taken directly from my log file) and the domain names to which they correspond:
So, we found that the first IP address does actually belong to Google, but the second one doesn't. We immediately ban the second IP address by adding it as a "deny from" in the .htaccess file, because it is crawling our site and downloading our content, as evidenced by its request to the php file behind the blank link.
We don't know what that domain is and we don't want to trust it with unrestricted access to our content. When I add an IP address to my list, I regenerate my .htaccess ban file right away, so the bad bot is banned immediately.
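Regenerating the ban file could look roughly like the Python sketch below. It is an illustration under assumptions, not the actual script: the function names are hypothetical, it uses the older Apache 2.2-style "Order/Allow/Deny" directives (Apache 2.4 provides these through mod_access_compat), and it overwrites the whole file, so any other directives you keep in .htaccess would have to be merged in separately.

```python
import os
import tempfile


def render_ban_rules(banned_ips):
    """Return .htaccess text that allows everyone except the banned IPs."""
    lines = ["Order Allow,Deny", "Allow from all"]
    # De-duplicate and sort so the generated file is stable between runs.
    lines += ["Deny from " + ip for ip in sorted(set(banned_ips))]
    return "\n".join(lines) + "\n"


def write_ban_file(path, banned_ips):
    """Rewrite the ban file atomically so Apache never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(render_ban_rules(banned_ips))
    # mkstemp creates the file readable by its owner only; the web server
    # usually runs as a different user and must be able to read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

Writing to a temporary file and renaming it into place means a request that arrives in the middle of a regeneration still sees either the old rules or the new ones, never a truncated file.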
At this point, you should realize that you will need to compile a list of trusted domain names. You should also thoroughly test your script so that you know it won't erroneously block a trusted domain.
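The trust check can be sketched in Python as follows. Treat it as a hedged illustration rather than the script itself: the suffixes in `TRUSTED_SUFFIXES` are placeholders for your own list, all function names are hypothetical, and the forward-confirmation step (resolving the hostname back to the IP) is an extra precaution added here, because whoever controls an IP block also controls what its reverse DNS record claims. The sketch handles IPv4 only.

```python
import socket

# Placeholder allow-list; compile your own from the crawlers you want to admit.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_is_trusted(hostname, suffixes=TRUSTED_SUFFIXES):
    """True if the hostname sits under one of the trusted domains."""
    host = hostname.lower().rstrip(".")
    # The leading dot in each suffix stops "evilgooglebot.com" from matching.
    return any(host.endswith(suffix) for suffix in suffixes)


def reverse_lookup(ip):
    """Reverse DNS: IP address -> hostname, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Forward DNS: hostname -> list of IPv4 addresses (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def ip_is_trusted(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Trust an IP only if its hostname is on the list and resolves back to it."""
    hostname = reverse(ip)
    if hostname is None or not hostname_is_trusted(hostname):
        return False
    return ip in forward(hostname)
```

The `reverse` and `forward` parameters exist so that you can test the logic thoroughly with canned answers, without touching the network, before letting it ban anything.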
I'm going to leave the logistics of the unban up to you. I present my banned users with a captcha challenge to see if they are really humans. I also automatically unban IP addresses after a set time frame. Please read the additional considerations below.
I'm not going to talk about the script too much other than how it operates. Many people will use different platforms and programming languages. The system described above could be coded for almost any platform, with almost any language, so I won't talk about the actual code. You should, however, consider the following:
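The time-based unban is the easier half to illustrate. A minimal Python sketch, assuming each ban is stored with the time it started; the one-week `BAN_SECONDS` value and the name `active_bans` are hypothetical choices, not the settings I use:

```python
import time

BAN_SECONDS = 7 * 24 * 60 * 60  # hypothetical ban length: one week


def active_bans(bans, now=None, ttl=BAN_SECONDS):
    """Return only the bans that have not expired yet.

    bans maps an IP address to the Unix timestamp at which it was banned.
    """
    if now is None:
        now = time.time()
    return {ip: started for ip, started in bans.items() if now - started < ttl}
```

Running this just before the ban file is regenerated drops the expired entries, so an address that has stopped misbehaving gets back in without manual work.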
Additional Considerations:
I'm not going to talk too much about these, but you should be aware of the following if you choose to use the system described above:
This all sounds like a relatively easy process, and for the most part it is, but it is very tedious. You simply must try to think outside the box when you are implementing the logistics of this program. I haven't quite told you everything, but you do now have enough information to detect and block bad bots.
I wanted to make the script that I use available to the public (for free, of course). The unfortunate reality is that it would take too much effort to tidy up the code and write documentation, so I'm going to have to refrain from doing that for now, unless I get enough requests for it.
I hope that this will shed some light on how to block bad bots, because you really need to protect your content from thieves, and from email harvesters too! Your thoughts and feedback are welcome.