A year ago, in Protecting Koha from web crawlers, I wrote about our attempts to keep Koha running under the load of web crawlers.

However, since then, the situation has gotten worse. There are more crawlers, some of them are really poorly written and get stuck in loops, and some of them are so aggressive that they result in a denial of service.

[Image: fail2ban-month.png]

To fight that, we extended our approach into more comprehensive Apache rules which deny requests. fail2ban then detects those denied requests in the logs and adds the offending IPs to the iptables firewall, so that the load on the server stays under control.

Our current setup looks like this:

Apache Configuration (/etc/apache2/conf-available/block-aggressive-bots.conf)

# FFZG Global Bot Mitigation Policy - Version 3.15 (Referer Priority)

# 1. LEGITIMATE SOURCE DEFINITIONS
# Mandate HTTPS Referer
SetEnvIfNoCase Referer "^https://koha\.ffzg\.(hr|unizg\.hr)" local_referer
SetEnvIfNoCase Referer "^https://knjiznica\.ffzg\.unizg\.hr" local_referer
SetEnvIfNoCase Referer "^https://.*koha\.rot13\.org" local_referer
SetEnvIfNoCase Referer "^https://.*koha-dev\.rot13\.org" local_referer
SetEnvIfNoCase Referer "^https://katalog\.(bioetika|hpm)\.hr" local_referer

# 2. IDENTIFY EXPENSIVE ENDPOINTS
SetEnvIf Request_URI "^/(cgi-bin/koha/opac-|opac/opac-|search|tags)" expensive_script
SetEnvIf Request_URI "opac-((ISBD)?detail|user|shelves|discharge)\.pl" !expensive_script

<Location />
    <RequireAll>
        Require all granted
        
        # --- GROUP 1: PERIMETER BLOCKS (Hard Denials) ---
        Require not ip 43.0.0.0/8
        Require not ip 47.74.0.0/11
        Require not ip 101.47.0.0/16
        Require not ip 116.179.0.0/16
        Require not ip 119.249.0.0/16
        Require not ip 150.158.0.0/16
        Require not ip 220.181.0.0/16
        
        # --- GROUP 2: BEHAVIORAL/SIGNATURE BLOCKS (Hard Denials) ---
        # Block RSS/Shelf harvest
        Require not expr "%{QUERY_STRING} =~ /format=rss/i"
        Require not expr "%{QUERY_STRING} =~ /shelfbrowse_itemnumber/i"
        
        # Block known high-volume specific malicious bots
        # NOTE: GROUP 2 denials run before the RequireAny/local_referer check,
        # so these block unconditionally even if bot spoofs a local referer.
        Require not expr "req('User-Agent') =~ /(GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot|PerplexityBot|Barkrowler|megaindex)/i"
        
        # --- GROUP 3: CONDITIONAL ACCESS ---
        <RequireAny>
            # Trusted IPs (Loopback, Private Network, & Local Subnet)
            Require ip 127.0.0.1
            Require ip ::1
            Require ip 10.0.0.0/8
            Require ip 193.198.212.0/22

            # 3A. Allowlisted Search Engines (Global Bypass for Keywords/Shielding)
            Require expr "req('User-Agent') =~ /(Googlebot|bingbot|DuckDuckBot|archive\.org_bot)/i"
            
            # 3B. Valid Internal HTTPS Referer (Human-like browsing)
            # This allows access even if UA has bot keywords (safety for humans)
            Require env local_referer
            
            # 3C. Traffic without Referer (Direct Entry)
            <RequireAll>
                # Must NOT be identified as a generic unidentified bot keyword
                Require expr "! req('User-Agent') =~ /(bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery)/i"
                
                # AND must NOT be an expensive script (accessibility for humans)
                Require expr "-z reqenv('expensive_script')"
            </RequireAll>
        </RequireAny>
    </RequireAll>
</Location>
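To make the decision flow easier to follow, here is a minimal Python sketch of the same logic. It is only a model for reasoning about the rules, not something Apache runs. The function name is_allowed and its parameters are my own invention, and the blocked network list is shortened to three entries from the configuration above.

```python
import ipaddress
import re

# Patterns copied from the Apache configuration above.
# The blocked network list is shortened to keep the sketch small.
BLOCKED_NETS = [ipaddress.ip_network(n) for n in
                ("43.0.0.0/8", "101.47.0.0/16", "220.181.0.0/16")]
TRUSTED_NETS = [ipaddress.ip_network(n) for n in
                ("127.0.0.1/32", "::1/128", "10.0.0.0/8", "193.198.212.0/22")]
LOCAL_REFERER = re.compile(
    r"^https://(koha\.ffzg\.(hr|unizg\.hr)|knjiznica\.ffzg\.unizg\.hr"
    r"|.*koha\.rot13\.org|.*koha-dev\.rot13\.org|katalog\.(bioetika|hpm)\.hr)",
    re.I)
EXPENSIVE = re.compile(r"^/(cgi-bin/koha/opac-|opac/opac-|search|tags)")
NOT_EXPENSIVE = re.compile(r"opac-((ISBD)?detail|user|shelves|discharge)\.pl")
BAD_QUERY = re.compile(r"format=rss|shelfbrowse_itemnumber", re.I)
BAD_UA = re.compile(
    r"GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot"
    r"|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot"
    r"|PerplexityBot|Barkrowler|megaindex", re.I)
GOOD_UA = re.compile(r"Googlebot|bingbot|DuckDuckBot|archive\.org_bot", re.I)
GENERIC_BOT = re.compile(
    r"bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery",
    re.I)


def is_allowed(ip, uri, query="", referer="", user_agent=""):
    """Model of the <RequireAll> block: True means the request is served."""
    addr = ipaddress.ip_address(ip)

    # GROUP 1: perimeter blocks
    if any(addr in net for net in BLOCKED_NETS):
        return False
    # GROUP 2: behavioral and signature blocks (apply to everyone)
    if BAD_QUERY.search(query) or BAD_UA.search(user_agent):
        return False

    # GROUP 3: any single condition below is enough
    if any(addr in net for net in TRUSTED_NETS):
        return True
    if GOOD_UA.search(user_agent):
        return True
    if LOCAL_REFERER.search(referer):
        return True
    # 3C: no local referer, so only cheap pages for non-bot user agents
    expensive = bool(EXPENSIVE.search(uri)) and not NOT_EXPENSIVE.search(uri)
    return not GENERIC_BOT.search(user_agent) and not expensive
```

For example, a browser which opens opac-search.pl directly (no referer) is denied, while the same request with a referer from koha.ffzg.hr is served. A direct hit on opac-detail.pl is served because that script is excluded from the expensive list.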

The denied requests end up in the Apache log files, which are picked up by fail2ban rules:

Fail2ban Jail Definitions (/etc/fail2ban/jail.d/apache-koha.conf)

[apache-koha-hard]
enabled  = true
port     = http,https
filter   = apache-koha-hard
logpath  = %(apache_access_log)s
# 24 hour ban for hard targets
bantime  = 86400
findtime = 600
maxretry = 3
# Full whitelist including major Croatian ISPs + CARNet
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

[apache-shell-scanner]
enabled  = true
port     = http,https
filter   = apache-shell-scanner
logpath  = %(apache_access_log)s
# 7-day ban -- webshell scanners are always malicious
bantime  = 604800
findtime = 60
maxretry = 1
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

[apache-koha]
enabled  = true
port     = http,https
filter   = apache-koha
logpath  = %(apache_access_log)s
bantime  = 3600
findtime = 600
maxretry = 100
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19

Fail2ban 403 Jail Definition (/etc/fail2ban/jail.d/apache-s403.conf)

[apache-s403]
enabled = true
port     = http,https
logpath  = %(apache_access_log)s
bantime  = 3600
maxretry = 5
ignoreip = 10.111.0.0/16 127.0.0.1/8 193.198.212.0/22 161.53.0.0/16 193.198.0.0/16 109.60.0.0/17 188.129.76.0/22 188.129.80.0/22 188.252.128.0/17 188.252.196.0/22 31.217.0.0/19 31.217.32.0/19 31.217.36.0/22 31.217.48.0/20 46.188.224.0/19 78.0.0.0/15 83.131.0.0/16 86.32.0.0/15 89.172.0.0/16 93.136.0.0/14 94.253.128.0/17 95.168.96.0/19 95.178.160.0/19 212.15.160.0/19 141.136.128.0/17 31.147.0.0/16 31.45.128.0/17 37.244.128.0/17 78.134.128.0/17 82.193.192.0/19 88.207.0.0/17 94.250.128.0/18 109.227.0.0/18 5.39.128.0/19 5.43.160.0/19 80.80.48.0/20 85.114.32.0/19 89.17.0.0/19 89.201.0.0/17 176.10.240.0/20 89.164.0.0/16 141.138.0.0/18 213.191.128.0/19 213.202.64.0/19 178.160.0.0/17 195.29.0.0/16 213.147.96.0/19
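If the interplay of maxretry, findtime and bantime is not obvious, the following toy model may help. It is a rough sketch of the idea (ban an IP once it produces maxretry matching log lines within findtime seconds), not fail2ban's actual implementation, and the Jail class is purely hypothetical.

```python
from collections import defaultdict, deque


class Jail:
    """Toy model of a jail: ban after maxretry matches within findtime."""

    def __init__(self, maxretry, findtime, bantime):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self.hits = defaultdict(deque)   # ip -> timestamps of recent matches
        self.banned_until = {}           # ip -> time when the ban expires

    def record(self, ip, now):
        """Register one matching log line; return True if ip is banned."""
        if self.banned_until.get(ip, 0) > now:
            return True
        window = self.hits[ip]
        window.append(now)
        # Forget matches which are older than findtime seconds.
        while window and window[0] <= now - self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False


# Values from the [apache-koha-hard] jail above.
hard = Jail(maxretry=3, findtime=600, bantime=86400)
for second in (0, 10, 20):
    banned = hard.record("203.0.113.7", now=second)
print(banned)
```

With the hard jail, three matches within ten minutes are enough for a 24 hour ban, while the soft apache-koha jail tolerates up to 100 matches in the same window, which leaves room for a human clicking through search results.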

Fail2ban Filter Definitions

1. Hard Koha Targets (/etc/fail2ban/filter.d/apache-koha-hard.conf)

[Definition]

# FAST REACTION: Probes and Confirmed Malicious Signatures
# Designed for low maxretry (e.g. 1-3)

# 1. Target Subnets
failregex = ^\S+ (?=(?:43|47\.82|101\.47|116\.179|119\.249|150\.158|220\.181)\.)<HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
# 2. Known Malicious User-Agents (no referer exception -- UA is UA regardless of referer)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ ".*?" ".*(GPTBot|ChatGPT-User|Sogou|Thinkbot|Applebot|YandexBot|OAI-SearchBot|ClaudeBot|TikTokSpider|Bytespider|PetalBot|AhrefsBot|DotBot|PerplexityBot|Barkrowler|megaindex).*"
# 3. Generic Bot Keywords (no referer exception -- bots faking local referers must not bypass)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" \d+ \d+ ".*?" ".*(bot|spider|crawler|indexer|archiver|headless|scrap|preview|discovery).*"
# 4. Malicious Probes (Exploits like PHP)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*\.(php|env|git|sh|sql|asp|jsp).*"
# 5. Behavioral Probes
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^"]*shelfbrowse_itemnumber=.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/[^"]*format=rss.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"

2. Soft Koha Targets (/etc/fail2ban/filter.d/apache-koha.conf)

[Definition]

# SOFT FILTER: ALL OPAC scripts without local referer
# Designed for a high maxretry (e.g. 100) to allow human browsing

# Matches ALL opac scripts. Uses negative lookaheads to explicitly ignore legitimate 
# referers (ffzg, unizg, etc.) and search engine User-Agents. This avoids fail2ban 0.9.x
# ignoreregex parsing bugs.
failregex = ^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^ ]+\.pl.*" \d+ \d+ "(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" "(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"
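Lookahead-heavy expressions like this one are easy to get wrong, so it is worth testing them against real log lines. The authoritative tool is fail2ban-regex, but because fail2ban filters are Python regular expressions, a quick check is also possible in plain Python. The sketch below makes two simplifying assumptions: it replaces the HOST tag with a plain non-whitespace group (fail2ban uses a more elaborate expression), and it leaves the timestamp in the line (fail2ban handles the date separately). The log line is the sample from the apache-s403 filter in this post.

```python
import re

# The soft filter's failregex, copied from apache-koha.conf above.
FAILREGEX = (
    r'^\S+ <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/opac-[^ ]+\.pl.*" \d+ \d+ '
    r'"(?!https?://[^"]*(?:ffzg|unizg|rot13|bioetika|hpm)).*?" '
    r'"(?!.*(?:Googlebot|bingbot|DuckDuckBot|archive\.org_bot)).*?"'
)

# Assumption: a simplified stand-in for fail2ban's own <HOST> expansion.
PATTERN = re.compile(FAILREGEX.replace("<HOST>", r"(?P<host>\S+)"))


def log_line(referer, user_agent):
    """Build a test line in our log format with the given referer and UA."""
    return (
        'koha.ffzg.hr:443 150.109.24.245 - - [16/Sep/2024:04:36:05 +0200] '
        '"GET /cgi-bin/koha/opac-detail.pl?biblionumber=216663 HTTP/1.1" '
        f'200 16159 "{referer}" "{user_agent}" 0 769541'
    )


def matched_host(line):
    """Return the IP which the filter would count, or None if it is ignored."""
    match = PATTERN.search(line)
    return match.group("host") if match else None


# No referer and a browser-like UA: counted against the client.
print(matched_host(log_line("-", "Mozilla/5.0")))
# Local referer: ignored by the negative lookahead.
print(matched_host(log_line("https://koha.ffzg.hr/", "Mozilla/5.0")))
# Allowlisted search engine: ignored as well.
print(matched_host(log_line("-", "Mozilla/5.0 (compatible; Googlebot/2.1)")))
```

The first call returns the client IP, while the other two return None, which is exactly the behaviour that would otherwise need an ignoreregex.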

3. Zero-Tolerance Shell Scans (/etc/fail2ban/filter.d/apache-shell-scanner.conf)

[Definition]

# Webshell/exploit scanner detection -- zero tolerance
# Matches classic backdoor/RCE probe paths regardless of IP, UA, or referer.
# Designed for maxretry=1 (one hit = instant ban).

failregex =
# PHP webshell probes (backdoor filenames, wp-content exploits, xmlrpc)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*(?:shell|webshell|wp_filemanager|eval-stdin|eval\.php|passthru|cmd\.php|c99\.php|r57\.php|b374k|wso\.php|alfa\.php|priv8|indoxploit|symlink|adminer\.php|phpinfo\.php|phpmyadmin|myphpadmin|pma/|/xmlrpc\.php|/wp-content/plugins/[^"]*\.php|/wp-includes/wlwmanifest|/wp-login\.php|/setup-config\.php|/install\.php).*"
# Dotfile/credential harvesting (.env, .git, backup files)
            ^\S+ <HOST> - - .* "(GET|POST|HEAD) /.*(?:\.env|\.git/config|backup\.sql|dump\.sql|db\.sql|\.aws/credentials|wp-config\.php|config\.php\.bak|settings\.php\.bak|web\.config|\.htpasswd).*"
# Common numeric/random PHP shell filenames (e.g. /a5.php, /wo.php, /gptsh.php)
            ^\S+ <HOST> - - .* "(GET|POST) /[a-z0-9_-]{2,12}\.php(?:\?[^"]{0,100})?" \d+ \d+ "-" "(?:-|[^"]{0,200})"

ignoreregex =
# Ignore known legitimate PHP apps on this server (only /cgi-bin/koha/ is legitimate)
            ^\S+ <HOST> - - .* "(GET|POST) /cgi-bin/koha/.*\.php.*"

4. 403 Forbidden Detection Filter (/etc/fail2ban/filter.d/apache-s403.conf)

[Definition]

# koha.ffzg.hr:443 150.109.24.245 - - [16/Sep/2024:04:36:05 +0200] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=216663 HTTP/1.1" 200 16159 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 0 769541
failregex = ^.* <HOST> - - .* "(GET|POST|HEAD) /cgi-bin/koha/.*" 403

ignoreregex =

Some of the rules are redundant (you might call them cruft or slop if you want a negative spin, or defense in depth if you want a positive one), but they serve us well, so you can pick and choose the parts which might apply to your situation.

I have been doing this job for 30 years. I still love it as much as I did when I first saw the Internet and Linux on November 29, 1995, and got sucked into sysadmin tasks.

To me, it seems that AI agents have enabled me to keep my infrastructure changes in Git (as I always did, using etckeeper and local Git repositories on my machines) but in a code and text form that I can review and store for later reference and documentation.

In fact, documentation is underrated. It is useful for me, so I always kept local notes on my machines in the form of:

yyyy-mm-dd-topic-of-change.txt

This allowed for quick lookups using grep to see what I changed. I always made sure to include all commands and outputs, because this file was the result of testing changes and was used to deploy them to production.

Nowadays, I have a local Git repository with code and documentation in a much more verbose form.

I really like that. I didn't write that content myself, but I read every line of it.

Did my work change? In one sense, no, but the result is totally different.

This is not quite GitOps or DevOps, but it suits my workflow.

Did I get a speedup?

Yes, 10x on most tasks.

If I can think of a solution to a problem in a few seconds, and spend 10-30 seconds more typing it into my agent, the task is done.

Commands can be executed much faster, but I still need time to process output. That's why I love ds4 from antirez, which I run on my local Aimax machine.

Its reading speed is ~10 tok/s for generation, and modem-like (with a progress bar!) for tokenization at ~50 tok/s. This provides just enough time to read everything and make sure that I'm not generating AI slop.

Am I getting dumber?

No, in fact, I can skim-read up to 200 tok/s since I use AI as my second brain. If I see something with which I don't agree, it's time to stop the agent (or use /btw in Gemini agy) and steer it in the right direction.

But more often than not, I'm learning something that the agent knew and I didn't. In the last year, I learned as much as I did in the three years before that. It's amazing what the Internet can surface to you if you use the biggest agents.

I think Gemini has the best web search of all agents, and since the Google index has been in RAM since we last saw the "Cached" option in Google search, I suspect that search results in agents are best in agy. Agents still don't search Google as I do. I know that the secret of Google search is skip lists, so I start with the most probable word and add more words to filter the results. Agents are more verbose than that, but they are quite good.

Are we getting locked-in?

Yes and no. Local agents (like DeepSeek v4) are really as good as SOTA models for everything I tried. I'm not afraid that I'm burning dinosaurs in vain.

My local AI usage of 85W should be offset by solar energy in the future, so it's sustainable for the planet.

On the other hand, I did spend the equivalent of about 8 years of a Gemini subscription to acquire my AMD Ryzen AI Max+ 395 machine, but having something locally which works and doesn't come with strings attached was worth it to me.

Let's see how these thoughts stand up to the test of time, since this was written in May 2026, and who knows what the future will bring.

Antigravity CLI helper utility

As you have seen from my previous blog posts, I'm usually missing some functionality when using AI. Since Google killed Gemini CLI and replaced it with Antigravity CLI, I was missing the ability to see old sessions, grep through them, or commit local brain files to Git.

So, I wrote agy-explore, which solves this problem for me.

When I started using AI for serious work, I had the same requirement as for any other project: a local backup of my files. To achieve this, I decided to mirror Google AI Studio session files from Google Drive using Rclone.

To view them, I wrote a Google AI Studio HTML viewer. For Gemini CLI, I wrote a similar viewer for .gemini/tmp/*/chats/session*.json session files.

I find having these files locally very useful for searching (Ctrl+F); you might too. You can also copy any part as Markdown or HTML.
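Having the session files locally also means they can be searched from a script. The following is a minimal sketch which makes no assumption about the JSON schema: it walks every value in each file and reports strings which contain the search term. The glob pattern is the Gemini CLI path mentioned above, and the function names are made up for this example.

```python
import glob
import json
import os


def find_in_json(obj, needle, path=""):
    """Yield (path, text) for every string value containing needle."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_in_json(value, needle, f"{path}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_in_json(value, needle, f"{path}/{index}")
    elif isinstance(obj, str) and needle.lower() in obj.lower():
        yield path, obj


def grep_sessions(needle, pattern="~/.gemini/tmp/*/chats/session*.json"):
    """Print every match of needle in the session files found by pattern."""
    for filename in sorted(glob.glob(os.path.expanduser(pattern))):
        try:
            with open(filename, encoding="utf-8") as handle:
                data = json.load(handle)
        except (OSError, ValueError):
            continue  # skip unreadable or truncated session files
        for path, text in find_in_json(data, needle):
            print(f"{filename}:{path}: {text[:80]!r}")
```

Calling grep_sessions("fail2ban") then prints the file name, the location inside the JSON document, and the first 80 characters of each matching string.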

Let's assume that you inherited hundreds of WordPress sites, installed across 4 hosts during the last 20 years. From time to time, these WordPress installations get infected by malicious actors who somehow acquired the password of an administrator user and use these newly acquired privileges to install additional plugins, create new users, and insert spam into Google (using the User-Agent to serve spam content to Google and normal content to regular visitors).

You are also not an administrator on those sites, so your tools of choice are WP-CLI and the command line, together with some code snippets which extend its functionality, described here.

Logging user logins

The first logical question is: how can I know which user logged in to WordPress and infected it? Unfortunately, WordPress doesn't emit any log files.

wp-fail2ban

Several years ago, I found a plugin which sends logs to syslog: WP-Fail2Ban.
However, this plugin tries to insert the site name into the syslog tag field, which is limited in length, so the generated logs are less useful if the site name gets truncated.
Newer versions are also overly complex, since they include a full WordPress interface with additional tables in each WordPress installation, which we don't want or need.
So I decided to keep using an older, simpler version of WP-Fail2Ban from before it had any interface, which is simple enough to audit. I modified it to produce more useful syslog messages and implemented everything in a single PHP file, wp-fail2ban-ffzg.php.
Logs are then sent to a central syslog server, which runs fail2ban and inserts firewall rules, but here we are mostly concerned with log generation for user logins.

jeepers-peepers

An honorable mention goes to jeepers-peepers, a very interesting plugin with a very strange PHP coding style. It does generate logs on disk, but it creates the files as the current user, so logs written by the web server are owned by the nobody user, and wp-cli commands run by other users won't be able to update them.
Even worse, if you first run wp-cli under any user other than nobody, you will create log files which the web server can't update.
I really wanted central syslog logging, so this was not a suitable solution. It also didn't support multi-site WordPress installations, which I also had.

mu-plugins

This seems good so far, but installing a plugin on hundreds of sites is somewhat involved, and I want to minimize the modifications which I have to make on each site.
WordPress Must Use plugins, which are automatically loaded and activated, are perfect for such a task.
Even better, we can have one mu-plugins directory which is then symlinked into all sites, making installation nice and simple.
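The symlinking step can be scripted. This is a minimal sketch, assuming the standard wp-content/mu-plugins location inside each site. The function name and the way the list of site roots is obtained are hypothetical and will differ per host.

```python
import os


def link_mu_plugins(shared_dir, site_roots):
    """Symlink shared_dir as wp-content/mu-plugins in each site.

    Returns the list of site roots which were skipped and need a manual look.
    """
    skipped = []
    for root in site_roots:
        content_dir = os.path.join(root, "wp-content")
        target = os.path.join(content_dir, "mu-plugins")
        if (os.path.islink(target)
                and os.path.realpath(target) == os.path.realpath(shared_dir)):
            continue  # already linked to the shared directory
        if not os.path.isdir(content_dir) or os.path.lexists(target):
            # Not a WordPress root, or the site has its own mu-plugins.
            skipped.append(root)
            continue
        os.symlink(shared_dir, target)
    return skipped
```

The function never overwrites an existing mu-plugins directory, so it is safe to run repeatedly. Sites which already have their own must-use plugins are returned for manual inspection instead.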

Mitigation on infected sites

When a site gets infected, WP-CLI can help us find modified files using:

wp core verify-checksums
wp plugin verify-checksums --all

Plugin verification works for most plugins, but some paid ones (like WPML) don't have checksums upstream, which is a shame.

Disabling compromised user

When a compromised user is identified, it's good to remove administrator privileges from it (which might be somewhat involved if this is a WordPress Multisite, so wp super-admin list might be useful).

wp eval-file disable-user.php login will display the user's current capabilities, reset the password, remove all capabilities from the user, list and destroy user sessions, regenerate WordPress salts, and, if WordPress is multisite, iterate over all sites and remove administrator privileges.
Regenerating salts with wp config shuffle-salts is useful because all users are forced to log in again, which invalidates saved logins. For that, the script has to be run as the correct user (the owner of wp-config.php), so there is a wrapper script, wp-disable-user.sh, which ensures that.

Auditing Logins with Wordfence

If you have Wordfence installed, it tracks user logins in user metadata, which can be invaluable for forensics even if you don't have wp-fail2ban logs.
You can use wp eval to extract this information across your sites (based on wp-wordfence-login.sh). The snippet lists users who have logged in, showing their username, last login timestamp, and IP address, sorted by last_login descending.

Scanning WordPress using Wordfence CLI

We have daily backups of all WordPress sites, so an alternative is to check on the backup server which files have changed. However, we can also use wordfence-cli to check for exploits using:

wordfence malware-scan --match-engine=vectorscan -q -a --output-format csv --output-path malware.csv /path/to/wordpress

This works well on the backup server (the vectorscan engine, which is much faster, requires an SSE-capable CPU), but vuln-scan, which checks for known vulnerabilities in installed plugins, works only on a WordPress installation and not on plain backup files.

You should really examine all warnings from malware-scan, but gzip fonts will be reported as a possible compromise, which can be safely ignored:

zamd/cluster/pauk.ffzg.hr/2/www/ffzg.hr/fonet2/eufonija/public_html/wp-content/plugins/easy-digital-downloads/includes/libraries/fpdf/font/c67085188799208adeb5b784b9483ad0_droidserif-italic.z,7741,IOC:ZIP/CompressedZlib.7741,Raw compressed zlib file - occasionally used to store fonts or exports but may be an IOC (Indicator of Compromise)

Finding content created in some time range

If you want to examine the content on a site to see whether any spam content was added, you can use:

wp eval-file find-modified-content.php 2025-12-05 2025-12-08

syslog geolocation of logins

The best way I have found to detect infected sites is to geolocate all logins to WordPress and send mail for logins from outside Croatia. A common pattern is several logins from different IPs and countries, which is then a trigger for closer investigation.
For that, there is a simple script, tail-wordpress-accepted.sh, which in turn uses geolocate_ips.sh to geolocate IPs using geoiplookup and a web API, which usually has more precise country, town, and ISP data.
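The pattern to look for (the same user logging in from several foreign IPs and countries) is easy to express in code once logins are geolocated. The sketch below is not the logic of the scripts mentioned above. It assumes that some earlier step has already produced (user, ip, country code) records, and the function name and the record layout are hypothetical.

```python
from collections import defaultdict


def foreign_logins(logins, home_country="HR"):
    """Group logins from outside home_country by user.

    logins is an iterable of (user, ip, country_code) tuples.
    Returns {user: sorted list of (country_code, ip)} for foreign logins only.
    """
    suspicious = defaultdict(set)
    for user, ip, country in logins:
        if country != home_country:
            suspicious[user].add((country, ip))
    return {user: sorted(places) for user, places in suspicious.items()}


# Addresses from the documentation ranges, used purely as an illustration.
records = [
    ("ana", "193.198.212.5", "HR"),
    ("admin", "203.0.113.7", "VN"),
    ("admin", "198.51.100.9", "BR"),
    ("admin", "193.198.212.8", "HR"),
]
for user, places in foreign_logins(records).items():
    print(user, places)
```

A user with a single foreign login is often just someone on vacation, while a user with logins from several countries within a short time is a good candidate for a closer look.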

Last week, a user attempted to upload a large 12GB dataset to our Dataverse 6.2 installation, which consisted of 84 zip files containing TSV files. The upload caused our installation to stop responding to web requests because the disk usage unexpectedly reached 100%, even though we had 15GB of free space available (our data is stored on the local file system).

What went wrong?

When uploading zip files, Dataverse leaves temporary files on the file system in /usr/local/payara6/glassfish/domains/domain1/uploads. If the disk usage limit is reached during this process, temporary files may also be left in /usr/local/dvn/data/temp.

During the ingestion of zipped TSV files, Dataverse creates two uncompressed copies of each TSV file. One copy has an .orig extension (the original file), and the other is an identical version but without the header in the first line. This behavior is highly sub-optimal. In our case, the uncompressed TSV files would have required 42GB of space, and creating two copies was not a feasible option for our storage.
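The space needed for ingestion can be estimated before uploading, because each zip archive records the uncompressed size of its members. A minimal sketch follows, assuming the two uncompressed copies per file described above. The function name and the copies parameter are my own.

```python
import zipfile


def ingest_space_needed(zip_files, copies=2):
    """Estimate bytes needed to unpack all members of the given zip files.

    copies=2 models the .orig file plus the header-less copy of each TSV.
    """
    total = 0
    for zip_file in zip_files:
        with zipfile.ZipFile(zip_file) as archive:
            total += sum(info.file_size for info in archive.infolist())
    return total * copies
```

Running this over the zip files before an upload and comparing the result with the free space reported by df shows in advance whether ingestion can succeed. For our 42GB of uncompressed TSV files, the estimate would be about 84GB.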

The solution was to keep the TSV files compressed inside their zip archives.

A suggestion from a mailing list was to upload a single zip file that contains all the other zip files. This method preserves the compression of the inner zip files. However, it's important to note that the outer zip file will remain in the upload directory, meaning you will need at least twice the amount of disk space for the upload to complete successfully.
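The outer archive can be created with any zip tool. Here is a minimal Python sketch which stores the inner zip files without recompressing them, since they are already compressed. The function name and the file names in the usage note are placeholders.

```python
import os
import zipfile


def wrap_zips(inner_paths, outer_path):
    """Store existing zip files unmodified inside one outer zip archive."""
    # ZIP_STORED: the inner files are already compressed, so just copy them.
    with zipfile.ZipFile(outer_path, "w", compression=zipfile.ZIP_STORED) as outer:
        for path in inner_paths:
            outer.write(path, arcname=os.path.basename(path))
    return outer_path
```

For example, wrap_zips(["part1.zip", "part2.zip"], "upload.zip") produces one upload.zip whose members are the original zip files. Keep in mind that this needs additional disk space roughly equal to the combined size of the inner files.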

This approach will also generate a number of "out of memory" errors from Solr after publishing the dataset, as it cannot decompress these nested zip files. In our case, this was an acceptable outcome.

To monitor disk usage I used this snippet:

dpavlin@debian-crossda:~$ cat du-check.sh
df -h /
sudo du -hcs /usr/local/payara6/glassfish/domains/domain1/uploads
sudo du -hcs /usr/local/dvn/data/temp

The workflow was: upload one file, monitor du and atop 2 for CPU usage, wait for the upload to finish, clean up temporary files, and upload the next part. The whole upload was split into 4 parts, which were logical divisions based on the dataset: 3 zip files and a README.

The issue with temporary files left on disk after upload has been reported and is fixed in 6.4.

Hopefully, this post will help someone else who encounters the same problem.