dovecot maildir on compressed zfs pool

This is a story about our mail server, which is coming close to its disk space capacity:

root@mudrac:/home/prof/dpavlin# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  7.7G   11G  42% /
/dev/vdb        4.0T  3.9T   74G  99% /home
/dev/vdc        591G  502G   89G  85% /home/stud

You might say that it's easy to just resize the disk and provide more storage, but unfortunately it's not that simple. We are using ganeti for our virtualization platform, and the current version of ganeti has a limit of 4T for a single drbd disk.

This could be solved by increasing the third (vdc) disk and moving some users to it, but that is not ideal. Another possibility is to use dovecot's zlib plugin to compress mails. However, since our Maildir doesn't have the required S=12345 size field as part of the filename, this solution also wasn't applicable to us.
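
For reference, the zlib plugin itself needs only a few lines of dovecot 2.x configuration, roughly like this (a sketch, not something we deployed; the file name is just an example, and it compresses only mail saved after the change):

root@mudrac:~# cat /etc/dovecot/conf.d/99-zlib.conf
mail_plugins = $mail_plugins zlib
plugin {
  # gzip newly saved mails; already stored mail stays as-is
  zlib_save = gz
  zlib_save_level = 6
}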

Installing lvm would allow us to use more than one disk to provide additional storage, but since ganeti already uses lvm to provide virtual disks to instances, this also isn't ideal.

OpenZFS comes to the rescue

Another solution is to use OpenZFS to provide multiple disks as a single filesystem, and at the same time provide compression. Let's create a pool:

zpool create -o ashift=9 mudrac /dev/vdb
zfs create mudrac/mudrac
zfs set compression=zstd-6 mudrac
zfs set atime=off mudrac
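
A quick way to confirm that the dataset picked up those settings is to query the properties back:

zfs get compression,atime mudrac/mudrac
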
We are using an ashift of 9 instead of 12 since it uses 512-byte blocks on storage (which our SSD storage supports), and that saves quite a bit of space:
root@t1:~# df | grep mudrac
Filesystem      1K-blocks       Used Available Use% Mounted on
mudrac/mudrac  3104245632 3062591616  41654016  99% /mudrac/mudrac # ashift=12
m2/mudrac      3104303872 2917941376 186362496  94% /m2/mudrac     # ashift=9
That is a saving of about 137 GB just by choosing a smaller ashift.
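
If you ever need to double-check which ashift a pool actually got, it is exposed as a pool property (and per-vdev in zdb output):

zpool get ashift mudrac
zdb -C mudrac | grep ashift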

Most of our e-mail consists of messages that are kept on the server but rarely accessed. Because of that, I opted for zstd-6 (instead of the default zstd-3) to compress it as much as possible. But to be sure that's the right choice, I also tested zstd-12 and zstd-19; the results are below:

LEVEL    USED (bytes)   COMP  H:S
zstd-6   2987971933184  60%   11:2400
zstd-12  2980591115776  59%   15:600
zstd-19  2972514841600  59%   52:600
Compression levels higher than 6 seem to need at least 6 cores to compress the data, so zstd-6 seemed like the best performance/space tradeoff, especially if we take into account the additional time needed for compression to finish.
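
Once real mail is sitting on the dataset, zfs itself reports how well compression is doing, so the numbers above can be re-checked on the live pool at any time:

zfs get compressratio,used,logicalused mudrac/mudrac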

bullseye kernel for zfs and systemd-nspawn

To have zfs, we need a recent kernel. Instead of upgrading the whole server to bullseye at this point, I decided to boot bullseye with zfs and start the unmodified installation inside it using systemd-nspawn. This is easy using the following command line:

systemd-nspawn --directory /mudrac/mudrac/ --boot --machine mudrac --network-interface=eth1010 --hostname mudrac
but it's not ideal for automatic start of the machine, so a better solution is to use machinectl and a systemd service for this. Converting this command line into an nspawn file is non-trivial, but after reading man systemd.nspawn the configuration needed is:
root@t1:~# cat /etc/systemd/nspawn/mudrac.nspawn
[Exec]
Boot=on
#WorkingDirectory=/mudrac/mudrac
# ln -s /mudrac/mudrac /var/lib/machines/
# don't chown files
PrivateUsers=false

[Network]
Interface=eth1010
Please note that we are not using WorkingDirectory (which would copy files from /var/lib/machines/name) but instead just created a symlink to the zfs filesystem in /var/lib/machines/.

To enable and start the container on boot, we can use:

systemctl enable systemd-nspawn@mudrac
systemctl start systemd-nspawn@mudrac
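
Once the unit is running, machinectl can be used to inspect the container and get a shell inside it:

machinectl list
machinectl status mudrac
machinectl shell mudrac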

Keep network device linked to mac address

The predictable network interface names which bullseye uses should provide stable device names. This seems like a clean solution, but in testing I figured out that adding additional disks changes the names of the network devices. Previously, Debian used udev to map a network interface name to a device MAC address via /etc/udev/rules.d/70-persistent-net.rules. Since this is no longer the case, the solution is to define a similar mapping using a systemd .link file like this:

root@t1:~# cat /etc/systemd/network/11-eth1010.link
[Match]
MACAddress=aa:00:00:39:90:0f

[Link]
Name=eth1010
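
The .link file is applied by udev when the device appears, so after a reboot a quick check that the interface really got the expected name is:

ip -br link show eth1010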

Increasing disk space

When we do run out of disk space again, we can add a new disk to the zfs pool using:

root@t2:~# zpool set autoexpand=on mudrac
root@t2:~# zpool add mudrac /dev/vdc
Thanks to autoexpand=on above, this will automatically make the new space available. However, if we instead grow the existing disk up to 4T, the new space isn't visible immediately since zfs keeps a partition table on the disk, so we need to tell zfs to expand the device to use all available space:
root@t2:~# zpool online -e mudrac vdb
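
Either way, the added capacity should then be visible in the pool:

zpool list -v mudrac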

zfs snapshots for backup

Now that we have zfs under our mail server, it's logical to also use zfs snapshots to provide a nice, low-overhead incremental backup. It's as easy as:

zfs snap mudrac/mudrac@$( date +%Y-%m-%d )
in cron.daily and then shipping the snapshots to the backup machine. I did look into existing zfs snapshot solutions, but they all seemed a little bit too complicated for my use case, so I wrote zfs-snap-to-dr.pl which copies snapshots to the backup site.
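
Under the hood, shipping a snapshot to another machine boils down to zfs send piped into zfs receive over ssh; a minimal sketch (the backup host and destination dataset here are made-up examples, and zfs-snap-to-dr.pl adds its own bookkeeping on top):

# initial full copy of the first snapshot
zfs send mudrac/mudrac@2021-06-01 | ssh backup-host zfs recv -u backup/mudrac
# later runs only need to send the difference between two snapshots
zfs send -i mudrac/mudrac@2021-06-01 mudrac/mudrac@2021-06-02 | ssh backup-host zfs recv -u backup/mudrac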

To keep just the last two snapshots on the mail server, a simple shell snippet is enough:

zfs list -r -t snapshot -o name -H mudrac/mudrac > /dev/shm/zfs.all
tail -2 /dev/shm/zfs.all > /dev/shm/zfs.tail-2
grep -v -f /dev/shm/zfs.tail-2 /dev/shm/zfs.all | xargs -i zfs destroy {}
Using shell to create and expire snapshots, and a simpler script to just transfer them, seems to me like a better and more flexible solution than implementing it all in a single perl script. In a sense, it's the unix way of small tools which each do one thing well. The only feature which zfs-snap-to-dr.pl has aside from snapshot transfer is the ability to keep just a configurable number of snapshots on the destination, which keeps disk usage there in check (and re-uses the snapshot data it has already collected).

This was an interesting journey. In the future, we will migrate the mail server to bullseye and remove systemd-nspawn (it feels like we are twisting its arm by using it like this). But it does work, and it's a simple solution which will come in handy in the future.