Limit web crawlers' impact on your Apache 2 site

Your web site is slow. What do you do? A quick grep over access.log shows a flood of various crawling bots:

koha:~# grep bot /var/log/apache2/other_vhosts_access.log | tail
koha.ffzg.hr:80 71.181.32.73 - - [28/Oct/2009:14:16:04 +0100] "GET /cgi-bin/koha/opac-search.pl?q=su:in%C5%BEenjerstvo HTTP/1.1" 200 6579 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:09 +0100] "GET /cgi-bin/koha/opac-detailprint.pl?biblionumber=127887 HTTP/1.1" 200 849 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:14 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=81565 HTTP/1.1" 200 4240 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:19 +0100] "GET /cgi-bin/koha/opac-showmarc.pl?id=164101 HTTP/1.1" 200 1675 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:24 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=38700 HTTP/1.1" 200 4343 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 71.181.32.73 - - [28/Oct/2009:14:16:28 +0100] "GET /cgi-bin/koha/opac-search.pl?q=su:iracionalno HTTP/1.1" 200 5140 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:30 +0100] "GET /cgi-bin/koha/opac-ISBDdetail.pl?biblionumber=73120 HTTP/1.1" 200 3255 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:35 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=53576 HTTP/1.1" 200 4466 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:40 +0100] "GET /cgi-bin/koha/opac-showmarc.pl?id=85958 HTTP/1.1" 200 1691 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:45 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=91684 HTTP/1.1" 200 4477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
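To check the request rate rather than eyeball it, the gaps between consecutive requests from one client can be computed from the timestamps. This is a minimal sketch, assuming the vhost-prefixed log format shown above and that all lines fall within a single day; request_gaps is a hypothetical helper name, not part of any existing tool:

```shell
# Print the gap in seconds between consecutive requests from one client IP.
# Assumption: log lines look like the ones above (vhost, IP, -, -, [timestamp ...)
# and only the hh:mm:ss part matters, so a day rollover is not handled.
request_gaps() {
    awk -v ip="$1" '
        $2 == ip {
            split($5, t, ":")                 # $5 looks like [28/Oct/2009:14:16:04
            now = t[2] * 3600 + t[3] * 60 + t[4]
            if (seen) print now - prev        # skip the first request, it has no gap
            prev = now
            seen = 1
        }'
}
```

Feeding it the log on stdin, as in request_gaps 66.249.65.156 < /var/log/apache2/other_vhosts_access.log | sort -n | uniq -c, gives a histogram of gaps. For the Googlebot lines above the gaps are 5 or 6 seconds.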
Googlebot is requesting a page every 5 seconds? This hurts our performance, especially during working hours. So, for a start, let's teach bots which URLs are really interesting. Create a robots.txt, which all well-behaved crawlers should honor:
koha:~# cat /var/www/robots.txt 
User-Agent: *
Crawl-Delay: 300
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl

User-Agent: MJ12bot
Crawl-Delay: 300
...and configure it using a rule like this (if your web root isn't /var/www):
koha:~# cat /etc/apache2/conf.d/robots.conf 
Alias /robots.txt /var/www/robots.txt
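One caveat worth checking: under the usual robots.txt group-matching rules (a crawler obeys the group that names it, and falls back to the * group only when no group does), the MJ12bot group above would replace the * group for that bot rather than add to it. MJ12bot would then see the Crawl-Delay but none of the Disallow lines. The sketch below prints the rules a given agent would end up with. It assumes exact, case-insensitive name matching and one User-Agent line per group, which is simpler than what real crawlers do; rules_for is a hypothetical helper, not a standard tool:

```shell
# Print the rule lines that apply to agent $1 in a robots.txt read from stdin:
# the group naming that agent if there is one, otherwise the "*" group.
# Assumptions: one User-Agent line per group, exact case-insensitive name match.
rules_for() {
    awk -v agent="$1" '
        { sub(/\r$/, "") }                     # tolerate CRLF line endings
        tolower($1) == "user-agent:" {         # a User-Agent line starts a new group
            group = tolower($2)
            seen[group] = 1
            next
        }
        NF && group != "" {                    # collect non-blank rule lines
            rules[group] = rules[group] $0 "\n"
        }
        END {
            key = (tolower(agent) in seen) ? tolower(agent) : "*"
            printf "%s", rules[key]
        }'
}
```

Running rules_for MJ12bot < /var/www/robots.txt on the file above prints only Crawl-Delay: 300. Repeating the Disallow lines in the MJ12bot group, or dropping that group altogether, would bring it in line with the other crawlers.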
The robots.txt approach would work, but Googlebot doesn't honor Crawl-Delay (sigh!). So we have to hop over to Google's webmaster tools - Site config - Settings and ask Googlebot nicely to slow down (screenshot: google-crawl-rate.png).

However, that's not enough, because it takes a while for Googlebot to update its crawl speed, and our site is heavily loaded right now! Let's throttle all crawlers (identified by bot in the User-Agent header) with bandwidth limiting.

First, install mod-bw:

koha:/# apt-get install libapache2-mod-bw

koha:/# a2enmod bw
Enabling module bw.
Run '/etc/init.d/apache2 restart' to activate new configuration!
Edit your configuration to include something like the following (very draconian) policy:
        BandwidthModule On
        ForceBandWidthModule On

        BandWidth "u:bot" 8192
        MaxConnection "u:bot" 1

        BandWidth "u:wget" 8192
        MaxConnection "u:wget" 1
It seems that BandWidth has to be set to a value larger than BandWidthPacket (which is 8192 by default) to have any effect.
This limits the download speed of all crawlers (and of wget, for testing) to 8 KB/s and allows them one connection at a time. Requests turned away by the connection limit show up as 503 responses in the log. So, does this work?
koha:~# cat /var/log/apache2/other_vhosts_access.log | grep ' 503 ' | cut -d\" -f6 | sort | uniq -c
    107 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
     61 Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)
     54 Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
This is just for the first two hours of operation, so it does help.
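The 503 count alone doesn't show what share of crawler traffic is being turned away. A small variation on the pipeline above tallies status codes for every request whose User-Agent contains bot. This is a sketch assuming the same log format, where the status code is the first word after the quoted request line; bot_status_counts is a hypothetical helper name:

```shell
# Tally HTTP status codes for requests whose User-Agent contains "bot".
# Splitting on double quotes makes field 6 the User-Agent (as in the cut above)
# and field 3 the " status bytes " part that follows the request line.
bot_status_counts() {
    awk -F'"' 'tolower($6) ~ /bot/ { split($3, s, " "); print s[1] }' |
        sort | uniq -c
}
```

Called as bot_status_counts < /var/log/apache2/other_vhosts_access.log, it prints one line per status code with its count, so the 503s can be compared against the 200s for the same crawlers.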