Stop Bots on Your Site with .htaccess
How to Identify the Bot You Want to Block
Before you can block a bot, you will need to know at least one of two things: the IP address where the bot is coming from or the "User Agent string" that the bot is using. The easiest way to find this is to look into your raw web log.
Download your web log from your web host, uncompress it using an archiver, and open it in an ASCII text editor. You'll probably need to use a better editor than Notepad if your logs are large. If you have a search and replace utility like those listed on the Free Text Search and Replace Utilities page, you can use that instead of the editor. Search through the file for the bot you want to block. It helps if you know either the page it tried to access or the time it hit your website, so that you can narrow your search down.
Once you've located the entries that belong to the bot, look for the IP address and the user agent string.
The IP address is a series of 4 numbers separated by dots. It looks like "127.0.0.1". The "User Agent string" is just the name that the program accessing your site goes by. For example, version 9.51 of the Opera web browser has a user agent string of "Opera/9.51 (Windows NT 5.1; U; en)" (among others) while the Google search engine bot goes by "Googlebot/2.1 (+http://www.google.com/bot.html)" (among others). You won't need to know the entire user agent string. Just find some part of it that is unique to that particular bot, that is, a part that no other bot or web browser uses.
Note the IP addresses used by the bot and the user agent string.
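If you're not sure which part of a log entry is which, it may help to see how the log is laid out. The following is a sketch that assumes your web host uses Apache's widely used "combined" log format; that is only an assumption, since hosts are free to customise their logs. The line is shown for reference only. It belongs in the main server configuration, not in your .htaccess file, and you do not need to add it anywhere.

```apache
# Apache's "combined" log format, as commonly defined in the main server configuration.
#   %h                 the IP address of the program making the request (first item on each log line)
#   %t                 the time of the request
#   \"%r\"             the request itself, including the page that was accessed
#   %>s                the status code the server returned
#   \"%{User-Agent}i\" the user agent string (last item on each log line, in quotation marks)
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
```

If your log follows this layout, the IP address is the first item on each line and the user agent string is the last quoted item.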
Be careful though. Just because a bad bot has visited your website using a particular IP address does not mean that if you block that IP address, you'll be rid of that bot forever. Some viruses and malware infect a normal computer user's machine to turn it into a machine that sends spam and probes sites for vulnerabilities. The IP address that you plan to block may well belong to such an ordinary person, and you could be blocking an internet provider's IP address. When that user disconnects from the Internet, and another user logs in, the internet provider could assign the new user the same IP address. When you block by IP address, you may end up blocking an entire internet provider, and thus a lot of real users and potential customers.
Likewise, many bad bots intentionally use User Agent names that correspond to normal web browsers. As such you won't be able to tell from the user agent alone whether it's a bot or a real user. If you wantonly block user agents with the name of "Mozilla", for example, you could end up blocking nearly every human from your website.
In general, if you don't know what you're doing, it's best not to block anything, unless you don't mind inadvertently blocking users and perhaps even whole countries (if you're especially careless).
Download Your .htaccess File
Once you know the bot's IP address or user agent string, connect to your site using an FTP or SFTP client. Go to the top web directory of your site, where your home page is located. Look for a file named ".htaccess". If it exists, download it to your computer.
If it doesn't exist, make sure that it is not hidden from your view. Depending on the FTP program you use, you may need to log off, set a "Remote file mask" of "-a" (without the quotation marks) in the options for the program, and log in again to check.
(The "remote file mask" is the term used in the FTP client that I use. Your program may use a different term.)
Alternatively, log into your site using your web host's control panel. Most commercial web hosts allow you to access your web directories from your web browser and download files that way. If your host has a setting to "show hidden files" or the like, make sure you enable it to look for the .htaccess file.
If, after all your efforts to find it, you cannot locate any .htaccess file in the top web directory of your site, don't worry. It's quite normal not to have any in the default setup for most web hosts. You will simply have to create one yourself. The reason we went to all that trouble to locate it is that if one exists, you will need to get it so that you can add to the settings already present in that file. If you don't, and you create one from scratch to overwrite an existing one, you may inadvertently wipe out some other settings that you want for your site.
Open or Create the .htaccess File
If you've managed to get the .htaccess file, open it in an ASCII text editor (like Notepad). If one does not exist, use the editor to create a new blank document. The rest of this article will assume that you have already started the editor with the .htaccess open or with a blank document if no .htaccess file previously existed.
WARNING: do not use a word processor like Word, Office, or WordPad to create or edit your .htaccess file. If you do, your site will mysteriously fail when you upload the file to your web server.
How to Block by IP Addresses
To block a certain IP address, say, 127.0.0.1, add the following lines to your .htaccess file. If your file already has some content, just move your cursor to the end of the file, and add the following on a new line in the file. If you don't have an existing .htaccess file, just type it into your blank document. You should of course change the numbers "127.0.0.1" to point to the correct IP address you want to block.
Order Deny,Allow
Deny from 127.0.0.1
The first line sets the order in which the web server evaluates its access rules: Deny rules are checked first, and a request that matches a Deny rule (and no Allow rule) is denied. If the request does not match any Deny rule, it will be allowed. This is generally the behaviour that most people want for the normal web directories on their site.
The second line sets the rule that if a request comes from the IP address "127.0.0.1", the web server is to deny the request. The program making that request will receive the "Forbidden" error instead of the normal page at that address.
If you have more than one IP address to block, just add another "Deny from" line with that IP address underneath. For example, if you also want to block "192.168.1.1" in addition to "127.0.0.1", the code to use is as follows:
Order Deny,Allow
Deny from 127.0.0.1
Deny from 192.168.1.1
You may add as many IP addresses as you wish, although if your .htaccess file becomes very large, your site may become sluggish due to the number of rules the server has to process each time it has to deliver your site's pages.
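If the bot comes from many neighbouring addresses, one way to keep the file short is to block a range with a single "Deny from" line instead of listing every address. The following is only a sketch: the addresses are placeholders, not real bot addresses, and the earlier warning applies even more strongly here, since a range can easily cover an entire internet provider.

```apache
Order Deny,Allow

# Block a single address
Deny from 127.0.0.1

# Block a whole range using CIDR notation
# (this covers 192.168.1.0 through 192.168.1.255)
Deny from 192.168.1.0/24

# Block every address that begins with these numbers (a partial IP address)
Deny from 10.1
```

Each comment sits on its own line because Apache does not allow a comment on the same line as a directive.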
How to Block by User Agent String
To block a bot by a user agent string, look for a part of the user agent string that is unique to that robot and that contains ordinary letters of the alphabet with no spaces, slashes or punctuation marks (unless you are familiar with regular expressions).
For example, if you are planning to block a robot that has this user agent string, "SpammerRobot/5.1 (+http://www.example.com/bot.html)", and you decide that the portion "SpammerRobot" is unique to this robot, add the following lines to your .htaccess file.
As in the case of blocking by IP address, add the lines to the end of the file if you already have an existing .htaccess file. Otherwise, type the lines into your blank document. You should of course change "SpammerRobot" to the actual user agent you want to block.
BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
The first line tells the web server to check the user agent string of the program making the request. If the user agent string contains the word "SpammerRobot", it will set an "environment variable" (a sort of internal flag used by the server) called bad_bot. Note that the word "SpammerRobot" can be in any mixture of capital (uppercase) or small (lowercase) letters. If you only want to match the exact case, use BrowserMatch instead of BrowserMatchNoCase. In addition, I simply made up the name "bad_bot" for the purpose of this article. You can call your environment variable some other name if you wish, although if you're not familiar with the rules for naming such variables, just accept "bad_bot".
The second line has already been explained above, in the section on blocking by IP address.
The third line tells the server to deny the request if it finds that an environment variable called "bad_bot" has been set.
To add more user agent strings to your block list, just add another "BrowserMatchNoCase" line. For example, if you want to block "SecurityHoleRobot" in addition to "SpammerRobot", the lines to use are:
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
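For those who are familiar with regular expressions, the pattern given to BrowserMatchNoCase is a regular expression, so it can be made stricter than a plain word. The sketch below assumes the made-up user agent string from earlier, "SpammerRobot/5.1 (+http://www.example.com/bot.html)", and matches only user agents that begin with "SpammerRobot/5.1". It also shows the same rule written with SetEnvIfNoCase, which is the more general directive that BrowserMatchNoCase is a shorthand for. You only need one of the two lines; both are included here for comparison.

```apache
# "^" anchors the match to the start of the user agent string,
# and "\." matches a literal dot. Quote the pattern if it contains spaces.
BrowserMatchNoCase "^SpammerRobot/5\.1" bad_bot

# The same test written with the more general directive
SetEnvIfNoCase User-Agent "^SpammerRobot/5\.1" bad_bot

Order Deny,Allow
Deny from env=bad_bot
```

If you are unsure about regular expressions, the plain-word form described above is the safer choice.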
(Before you ask, "SpammerRobot" and "SecurityHoleRobot" are just names I invented for this article. As far as I know, these robots don't exist.)
Note that your .htaccess file can contain block rules for both user agents and IP addresses. Just put them all in the same file. An example of a .htaccess file with rules to block both by IP address and user agent strings is as follows:
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
Deny from 127.0.0.1
Deny from 192.168.1.1
There's no need to repeat the "Order Deny,Allow" line when you combine the rules.
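The "Order" and "Deny" directives used in this article come from the older Apache access control syntax. On Apache 2.4 they are provided by a compatibility module (mod_access_compat), and the newer "Require" directives are the preferred way of writing such rules. The following is a sketch of how the combined example might look in the newer syntax. It assumes your web host runs Apache 2.4 and allows these directives in .htaccess files, which you will need to check with your host. The robot names and IP addresses are the same placeholders used throughout this article.

```apache
# Flag the unwanted user agents, exactly as before
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot

<RequireAll>
    # Allow everyone by default
    Require all granted

    # Except requests from these addresses
    Require not ip 127.0.0.1
    Require not ip 192.168.1.1

    # And except requests flagged as bad_bot above
    Require not env bad_bot
</RequireAll>
```

Use either the older lines or the newer ones for your block rules, not a mixture of both in the same file, since combining the two styles can give confusing results.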
Uploading the .htaccess File
Once you've finished with blocking unwanted bots in your .htaccess file, save the file. If you are using Notepad, and are creating a new document, remember to save the file as ".htaccess", including the quotation marks, otherwise you will encounter the problem of Notepad adding a .txt extension to your filename.
Then upload the file to your web server using an FTP/SFTP program (or with your web host's control panel). If you want to use an FTP program, and don't know how to do so, check out my tutorial on How to Upload a File to Your Website Using the FileZilla FTP Client.
Properly implemented, the method described in this article will allow you to block specific bots from accessing your website by either their IP address or their User Agent string.
This article can be found at http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml