Protecting Koha from web crawlers: how the idea evolved over the last year

A year ago, I wrote about our attempts to keep Koha running under the load of web crawlers in Protecting Koha from web crawlers.

Since then, however, the situation has gotten worse. There are more crawlers, some of them are so poorly written that they get stuck in loops, and some are so aggressive that they amount to a denial of service.

fail2ban-month.png

To fight that, we extended our approach into a more comprehensive set of Apache rules that deny such requests. fail2ban then picks up the offending requests from the Apache log and bans the client addresses in the iptables firewall, which keeps the load on the server under control.

Our current setup looks like this:

Apache Configuration (/etc/apache2/conf-available/block-aggressive-bots.conf)

# FFZG Global Bot Mitigation Policy - Version 3.15 (Referer Priority)

# 1. LEGITIMATE SOURCE DEFINITIONS
# Mandate HTTPS Referer
SetEnvIfNoCase Referer "^https://koha\.ffzg\.(hr|unizg\.hr)" local_referer
SetEnvIfNoCase Referer "^https://knjiznica\.ffzg\.unizg\.hr" local_referer
SetEnvIfNoCase Referer "^https://.*koha\.rot13\.org" local_referer
SetEnvIfNoCase Referer "^https://.*koha-dev\.rot13\.org" local_referer
SetEnvIfNoCase Referer "^https://katalog\.(bioetika|hpm)\.hr" local_referer

# 2. IDENTIFY EXPENSIVE ENDPOINTS
SetEnvIf Request_URI "^/(cgi-bin/koha/opac-|opac/opac-|search|tags)" expensive_script
SetEnvIf Request_URI "opac-((ISBD)?detail|user|shelves|discharge)\.pl" !expensive_script

<Location />
    <RequireAll>
        Require all granted
        
        # --- GROUP 1: PERIMETER BLOCKS (Hard Denials) ---
        Require not ip 43.0.0.0/8
        Require not ip 47.74.0.0/11
        Require not ip 101.47.0.0/16
        Require not ip 116.179.0.0/16
        Require not ip 119.249.0.0/16
        Require not ip 150.158.0.0/16
        Require not ip 220.181.0.0/16
        
        # --- GROUP 2: BEHAVIORAL/SIGNATURE BLOCKS (Hard Denials) ---
        # Block RSS/Shelf harvest
        Require not expr "%{QUERY_STRING} =~ /format=rss/i"
        Require not expr "%{QUERY_STRING} =~ /shelfbrowse_itemnumber/i"
        
        # Block known high-volume specific malicious bots
        # NOTE: GROUP 2 denials sit directly in the outer RequireAll, so they
        # block unconditionally, even if a bot spoofs a local referer.
        Require not expr "req('User-Agent') =~ /(GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot|PerplexityBot|Barkrowler|megaindex)/i"
        
        # --- GROUP 3: CONDITIONAL ACCESS ---
        <RequireAny>
            # Trusted IPs (Loopback, Private Network, & Local Subnet)
            Require ip 127.0.0.1
            Require ip ::1
            Require ip 10.0.0.0/8
            Require ip 193.198.212.0/22

            # 3A. Allowlisted Search Engines (Global Bypass for Keywords/Shielding)
            Require expr "req('User-Agent') =~ /(Googlebot|bingbot|DuckDuckBot|archive\.org_bot)/i"
            
            # 3B. Valid Internal HTTPS Referer (Human-like browsing)
            # This allows access even if UA has bot keywords (safety for humans)
            Require env local_referer
            
            # 3C. Traffic without a local Referer (direct entry or external link)
            <RequireAll>
                # Must NOT be identified as a generic unidentified bot keyword
                Require expr "! req('User-Agent') =~ /(bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery)/i"
                
                # AND must NOT be an expensive script (accessibility for humans)
                Require expr "-z reqenv('expensive_script')"
            </RequireAll>
        </RequireAny>
    </RequireAll>
</Location>
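
To make the evaluation order easier to follow, the access rules above can be modelled as a small Python function. This is only a sketch of the logic, not something Apache runs: the function name `allowed` and its parameters are made up for illustration, the networks and patterns are copied from the config, and it assumes that 47.74.0.0/11 is treated as 47.64.0.0/11.

```python
import ipaddress
import re

# Networks and patterns copied from the Apache config above.
# strict=False because 47.74.0.0/11 has host bits set; this sketch assumes
# it is treated as 47.64.0.0/11.
BLOCKED = [ipaddress.ip_network(n, strict=False) for n in (
    "43.0.0.0/8", "47.74.0.0/11", "101.47.0.0/16", "116.179.0.0/16",
    "119.249.0.0/16", "150.158.0.0/16", "220.181.0.0/16")]
TRUSTED = [ipaddress.ip_network(n) for n in (
    "127.0.0.1/32", "::1/128", "10.0.0.0/8", "193.198.212.0/22")]
LOCAL_REFERER = [re.compile(p, re.I) for p in (
    r"^https://koha\.ffzg\.(hr|unizg\.hr)",
    r"^https://knjiznica\.ffzg\.unizg\.hr",
    r"^https://.*koha\.rot13\.org",
    r"^https://.*koha-dev\.rot13\.org",
    r"^https://katalog\.(bioetika|hpm)\.hr")]
BAD_QUERY = re.compile(r"format=rss|shelfbrowse_itemnumber", re.I)
BAD_UA = re.compile(
    r"(GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot"
    r"|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot"
    r"|PerplexityBot|Barkrowler|megaindex)", re.I)
SEARCH_ENGINE = re.compile(r"(Googlebot|bingbot|DuckDuckBot|archive\.org_bot)", re.I)
GENERIC_BOT = re.compile(
    r"(bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery)", re.I)
EXPENSIVE = re.compile(r"^/(cgi-bin/koha/opac-|opac/opac-|search|tags)")
CHEAP = re.compile(r"opac-((ISBD)?detail|user|shelves|discharge)\.pl")


def allowed(ip, path, query="", referer="", user_agent=""):
    """Return True if the modelled rules would grant the request."""
    addr = ipaddress.ip_address(ip)
    # Groups 1 and 2: hard denials, applied regardless of referer.
    if any(addr in net for net in BLOCKED):
        return False
    if BAD_QUERY.search(query) or BAD_UA.search(user_agent):
        return False
    # Group 3: any one of the following is enough.
    if any(addr in net for net in TRUSTED):
        return True
    if SEARCH_ENGINE.search(user_agent):
        return True
    if any(p.search(referer) for p in LOCAL_REFERER):
        return True
    # 3C: no generic bot keyword in the UA and not an expensive script.
    expensive = bool(EXPENSIVE.search(path)) and not CHEAP.search(path)
    return not GENERIC_BOT.search(user_agent) and not expensive
```

In this model, a browser asking for opac-detail.pl without any referer is granted, the same browser asking for opac-search.pl is denied unless it sends a local referer, and GPTBot is denied even with a local referer.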

These requests end up in the Apache access log, which is then picked up by fail2ban rules:

Fail2ban Jail Definitions (/etc/fail2ban/jail.d/apache-koha.conf)

[apache-koha-hard]
enabled  = true
port     = http,https
filter   = apache-koha-hard
logpath  = %(apache_access_log)s
# 24 hour ban for hard targets
bantime  = 86400
findtime = 600
maxretry = 3
# Full whitelist including major Croatian ISPs + CARNet
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

[apache-shell-scanner]
enabled  = true
port     = http,https
filter   = apache-shell-scanner
logpath  = %(apache_access_log)s
# 7-day ban -- webshell scanners are always malicious
bantime  = 604800
findtime = 60
maxretry = 1
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

[apache-koha]
enabled  = true
port     = http,https
filter   = apache-koha
logpath  = %(apache_access_log)s
bantime  = 3600
findtime = 600
maxretry = 100
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

Fail2ban 403 Jail Definition (/etc/fail2ban/jail.d/apache-s403.conf)

[apache-s403]
enabled = true
port     = http,https
logpath  = %(apache_access_log)s
bantime  = 3600
maxretry = 5
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19
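
The ignoreip list is long and some of its entries overlap (for example, 193.198.212.0/22 is already inside 193.198.0.0/16). As a rough sketch, Python's standard `ipaddress` module can collapse such a list into its minimal form. The helper name `collapse_ignoreip` is made up for illustration, and the sketch assumes the value contains only IPv4 CIDR entries.

```python
import ipaddress


def collapse_ignoreip(value):
    """Collapse a space-separated ignoreip value into minimal CIDR blocks."""
    # strict=False accepts entries such as 127.0.0.1/8 that have host bits set.
    nets = [ipaddress.ip_network(item, strict=False) for item in value.split()]
    return [str(net) for net in ipaddress.collapse_addresses(nets)]


# A subset of the ignoreip entries used in the jails above.
sample = ("127.0.0.1/8 193.198.212.0/22 193.198.0.0/16 31.217.0.0/19 "
          "31.217.32.0/19 31.217.36.0/22 31.217.48.0/20")
print(" ".join(collapse_ignoreip(sample)))
```

For this subset the seven entries collapse into three: 31.217.0.0/18, 127.0.0.0/8 and 193.198.0.0/16.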

Fail2ban Filter Definitions

1. Hard Koha Targets (/etc/fail2ban/filter.d/apache-koha-hard.conf)

[Definition]

# FAST REACTION: Probes and Confirmed Malicious Signatures
# Designed for low maxretry (e.g. 1-3)

# 1. Target Subnets
failregex = ^\S+ (?=(?:43|47\.82|101\.47|116\.179|119\.249|150\.158|220\.181)\.)<HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
# 2. Known Malicious User-Agents (no referer exception -- UA is UA regardless of referer)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ ".*?" ".*(GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot|PerplexityBot|Barkrowler|megaindex).*"
# 3. Generic Bot Keywords (no referer exception -- bots faking local referers must not bypass)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ ".*?" ".*(bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery).*"
# 4. Malicious Probes (Exploits like PHP)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*\.(php|env|git|sh|sql|asp|jsp).*"
# 5. Behavioral Probes
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^"]*shelfbrowse_itemnumber=.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/[^"]*format=rss.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
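
The lookahead in front of <HOST> in rule 1 is the least obvious part of this filter, so here is a small Python sketch that exercises it. It is an approximation, not how fail2ban runs the filter: <HOST> is replaced with a simplified named group, the log lines are hypothetical, and fail2ban's own timestamp handling is ignored.

```python
import re

# Rule 1 of the hard filter, copied from above.
FAILREGEX = r'''^\S+ (?=(?:43|47\.82|101\.47|116\.179|119\.249|150\.158|220\.181)\.)<HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"'''

# Simplified stand-in for fail2ban's <HOST> expansion (assumption).
HOST = r"(?P<host>\S+)"
rule = re.compile(FAILREGEX.replace("<HOST>", HOST))


def banned_host(line):
    """Return the captured client address, or None if the rule does not match."""
    match = rule.search(line)
    return match.group("host") if match else None


def log_line(ip):
    # Hypothetical line in the vhost-prefixed log format the filters expect.
    return (f'koha.ffzg.hr:443 {ip} - - [16/Sep/2024:04:36:05 +0200] '
            '"GET /cgi-bin/koha/opac-search.pl?q=test HTTP/1.1" 200 16159 '
            '"-" "Mozilla/5.0"')


for ip in ("43.1.2.3", "143.1.2.3", "47.82.9.9", "47.74.9.9"):
    print(ip, banned_host(log_line(ip)))
```

Because the lookahead is anchored right after the vhost field, 143.1.2.3 is not mistaken for 43.x.x.x. Note also that this rule lists 47.82, so an address such as 47.74.9.9 is not matched by it even though the Apache block above covers that range; such a client would presumably be caught through the 403 jail instead.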

2. Soft Koha Targets (/etc/fail2ban/filter.d/apache-koha.conf)

[Definition]

# SOFT FILTER: ALL OPAC scripts without local referer
# Designed for a high maxretry (e.g. 100) to allow human browsing

# Matches ALL opac scripts. Uses negative lookaheads to explicitly ignore legitimate 
# referers (ffzg, unizg, etc.) and search engine User-Agents. This avoids fail2ban 0.9.x
# ignoreregex parsing bugs.
failregex = ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^ ]+\.pl.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
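
To see what the two negative lookaheads do, the soft filter can be tried against the sample log line from the 403 filter further down, with different referers and user agents. This is a rough sketch: <HOST> is replaced by a simplified named group (an assumption about what fail2ban expands it to), the function name `counts_as_failure` is made up, and fail2ban's own timestamp handling is ignored.

```python
import re

# The soft filter failregex, copied from above.
FAILREGEX = r'''^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^ ]+\.pl.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"'''
rule = re.compile(FAILREGEX.replace("<HOST>", r"(?P<host>\S+)"))

# Everything up to the referer field of the sample log line.
PREFIX = ('koha.ffzg.hr:443 150.109.24.245 - - [16/Sep/2024:04:36:05 +0200] '
          '"GET /cgi-bin/koha/opac-detail.pl?biblionumber=216663 HTTP/1.1" '
          '200 16159 ')
CHROME = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')
GOOGLEBOT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'


def counts_as_failure(referer, user_agent):
    """True if the soft filter would count this request towards maxretry."""
    line = f'{PREFIX}"{referer}" "{user_agent}" 0 769541'
    return rule.search(line) is not None


print(counts_as_failure("-", CHROME))
print(counts_as_failure("https://koha.ffzg.hr/cgi-bin/koha/opac-main.pl", CHROME))
print(counts_as_failure("-", GOOGLEBOT))
```

Only the first call counts as a failure: a request without a local referer from an ordinary browser. The local referer and the Googlebot user agent each make their lookahead fail, so those lines are skipped without needing an ignoreregex.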

3. Zero-Tolerance Shell Scans (/etc/fail2ban/filter.d/apache-shell-scanner.conf)

[Definition]

# Webshell/exploit scanner detection -- zero tolerance
# Matches classic backdoor/RCE probe paths regardless of IP, UA, or referer.
# Designed for maxretry=1 (one hit = instant ban).

failregex =
# PHP webshell probes (backdoor filenames, wp-content exploits, xmlrpc)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*(?:shell|webshell|wp_filemanager|eval-stdin|eval\.php|passthru|cmd\.php|c99\.php|r57\.php|b374k|wso\.php|alfa\.php|priv8|indoxploit|symlink|adminer\.php|phpinfo\.php|phpmyadmin|myphpadmin|pma/|/xmlrpc\.php|/wp-content/plugins/[^"]*\.php|/wp-includes/wlwmanifest|/wp-login\.php|/setup-config\.php|/install\.php).*"
# Dotfile/credential harvesting (.env, .git, backup files)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*(?:\.env|\.git/config|backup\.sql|dump\.sql|db\.sql|\.aws/credentials|wp-config\.php|config\.php\.bak|settings\.php\.bak|web\.config|\.htpasswd).*"
# Common numeric/random PHP shell filenames (e.g. /a5.php, /wo.php, /gptsh.php)
# [ ?] accepts either a query string or the " HTTP/1.1" suffix right after .php
            ^\S+ <HOST> - - .* "(GET|POST) /[a-z0-9_-]{2,12}\.php(?:[ ?][^"]{0,100})?" \d+ \d+ "-" "(?:-|[^"]{0,200})"

ignoreregex =
# Ignore known legitimate PHP apps on this server (only /cgi-bin/koha/ is legitimate)
            ^\S+ <HOST> - - .* "(GET|POST) /cgi-bin/koha/.*\.php.*"
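
The interplay between failregex and ignoreregex can be sketched in a few lines of Python: a line counts only if some failregex matches it and no ignoreregex does. The sketch uses the dotfile rule and the ignoreregex from this filter; the simplified <HOST> group, the helper names and the log lines (including the 203.0.113.7 documentation address) are assumptions for illustration.

```python
import re

HOST = r"(?P<host>\S+)"  # simplified stand-in for <HOST> (assumption)

# Dotfile/credential rule and the ignoreregex, copied from the filter above.
FAIL = r'''^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*(?:\.env|\.git/config|backup\.sql|dump\.sql|db\.sql|\.aws/credentials|wp-config\.php|config\.php\.bak|settings\.php\.bak|web\.config|\.htpasswd).*"'''
IGNORE = r'''^\S+ <HOST> - - .* "(GET|POST) /cgi-bin/koha/.*\.php.*"'''

fail_re = re.compile(FAIL.replace("<HOST>", HOST))
ignore_re = re.compile(IGNORE.replace("<HOST>", HOST))


def is_failure(line):
    """A line counts only if failregex matches and ignoreregex does not."""
    return bool(fail_re.search(line)) and not ignore_re.search(line)


def log_line(path):
    # Hypothetical line in the vhost-prefixed log format the filters expect.
    return (f'koha.ffzg.hr:443 203.0.113.7 - - [16/Sep/2024:04:36:05 +0200] '
            f'"GET {path} HTTP/1.1" 404 196 "-" "-"')


for path in ("/.env", "/cgi-bin/koha/wp-config.php", "/cgi-bin/koha/opac-main.pl"):
    print(path, is_failure(log_line(path)))
```

Here /.env counts as a hit, the wp-config.php probe under /cgi-bin/koha/ matches the failregex but is dropped by the ignoreregex, and the ordinary OPAC page matches nothing.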

4. 403 Forbidden Detection Filter (/etc/fail2ban/filter.d/apache-s403.conf)

[Definition]

# Log format reference (vhost:port, then client IP); this sample has status 200, so it does not match:
# koha.ffzg.hr:443 150.109.24.245 - - [16/Sep/2024:04:36:05 +0200] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=216663 HTTP/1.1" 200 16159 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 0 769541
failregex = ^.* <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" 403

ignoreregex =

Some of the rules are redundant (call it cruft or slop if you want a negative spin, or defense in depth if you want a positive one), but the setup serves us well, so you can pick and choose the parts that apply to your situation.