Last week, a user attempted to upload a 12GB dataset, consisting of 84 zip files of TSV files, to our Dataverse 6.2 installation. The upload caused the installation to stop responding to web requests: disk usage unexpectedly reached 100%, even though we had 15GB of free space available (our data is stored on the local file system).
What went wrong?
When uploading zip files, Dataverse leaves temporary files on the file system in /usr/local/payara6/glassfish/domains/domain1/uploads. If the disk usage limit is reached during this process, temporary files may also be left in /usr/local/dvn/data/temp.
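To see whether anything has been left behind, a quick check along these lines can help (the paths are the ones above; the one-day age threshold is just an example):

# list leftover temporary files in both directories, older than one day
sudo find /usr/local/payara6/glassfish/domains/domain1/uploads \
     /usr/local/dvn/data/temp \
     -type f -mtime +1 -ls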
During the ingestion of zipped TSV files, Dataverse creates two uncompressed copies of each TSV file: one with an .orig extension (the original file), and one that is identical except that the header line is removed. This behavior is highly sub-optimal. In our case, the uncompressed TSV files alone would have required 85GB of space, and creating two copies of them was not feasible with our storage.
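Before uploading, it is easy to get a rough estimate of the uncompressed size, since unzip -l reports the total uncompressed length of each archive. A sketch, assuming the parts are named part*.zip:

# sum the uncompressed sizes reported by unzip for all zip parts
for z in part*.zip ; do
    unzip -l "$z" | tail -1
done | awk '{ sum += $1 } END { printf "%.1f GB uncompressed\n", sum / 1024 / 1024 / 1024 }'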
The solution was to keep the TSV files compressed inside their zip archives.
A suggestion from a mailing list was to upload a single zip file that contains all the other zip files. This method preserves the compression of the inner zip files. However, it's important to note that the outer zip file will remain in the upload directory, meaning you will need at least twice the amount of disk space for the upload to complete successfully.
This approach will also generate a number of "out of memory" errors from Solr after publishing the dataset, as it cannot decompress these nested zip files. In our case, this was an acceptable outcome.
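If you try this approach, the outer zip can be created in store-only mode, since the inner archives are already compressed and recompressing them only costs CPU time. Something like this (the file names are placeholders):

# -0 = store only; the inner zip files keep their own compression
zip -0 all-parts.zip part1.zip part2.zip part3.zip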
To monitor disk usage I used this snippet:
dpavlin@debian-crossda:~$ cat du-check.sh
df -h /
sudo du -hcs /usr/local/payara6/glassfish/domains/domain1/uploads
sudo du -hcs /usr/local/dvn/data/temp
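It can also be kept running with watch during an upload (the 60-second interval is arbitrary, and the sudo calls assume passwordless sudo for du; otherwise just rerun the script by hand):

# refresh the disk usage numbers every minute during an upload
watch -n 60 sh du-check.sh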
The workflow was: upload one part, monitor du and atop for CPU usage, wait for the upload to finish, clean up the temporary files, then upload the next part. The whole upload was split into four parts, which was a natural division of the dataset: three zip files and a README.
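Cleaning up between parts presumably comes down to emptying the two temporary directories mentioned earlier; a cautious version might look like this:

# remove leftover temporary files between upload parts
# (run only when no upload or ingest is in progress)
sudo find /usr/local/payara6/glassfish/domains/domain1/uploads -type f -delete
sudo find /usr/local/dvn/data/temp -type f -delete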
Hopefully, this post will help someone else who encounters the same problem.