This is a story about our mail server which is coming close to
its disk space capacity:
root@mudrac:/home/prof/dpavlin# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 7.7G 11G 42% /
/dev/vdb 4.0T 3.9T 74G 99% /home
/dev/vdc 591G 502G 89G 85% /home/stud
You might say that it's easy to resize the disk and provide more
storage, but unfortunately it's not that easy. We are using ganeti
for our virtualization platform, and the current version of ganeti
has a limit of 4T for a single drbd disk.
This could be solved by increasing the third (vdc) disk and moving some
users to it, but that is not ideal. Another possibility is to
use dovecot's zlib plugin to compress mails. However, since
our Maildir doesn't have the required S=12345 part of the filename
which describes the size of the mail, this solution also wasn't applicable to us.
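A quick way to check whether your Maildir already carries that size hint is to count filenames with the ,S= marker, roughly like this (the path is just an example assuming per-user Maildir folders under /home; adjust it to your layout):
# count messages whose filename already contains the S=<size> marker
find /home -type f -path '*/Maildir/cur/*' -name '*,S=*' | wc -l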
Installing lvm would allow us to use more than one
disk to provide additional storage, but since ganeti already uses
lvm to provide virtual disks to instances, this also isn't ideal.
OpenZFS comes to the rescue
Another solution is to use OpenZFS to combine multiple disks
into a single filesystem and, at the same time, provide disk
compression. Let's create a pool:
zpool create -o ashift=9 mudrac /dev/vdb
zfs create mudrac/mudrac
zfs set compression=zstd-6 mudrac
zfs set atime=off mudrac
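To double-check that the settings really apply to the dataset we just created, zfs get is enough:
# verify compression and atime on the new dataset (inherited from the pool root)
zfs get compression,atime mudrac/mudrac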
We are using an ashift of 9 instead of 12 since it uses 512-byte
blocks on storage (which our SSD storage supports), and that
saves quite a bit of space:
root@t1:~# df | grep mudrac
Filesystem 1K-blocks Used Available Use% Mounted on
mudrac/mudrac 3104245632 3062591616 41654016 99% /mudrac/mudrac # ashift=12
m2/mudrac 3104303872 2917941376 186362496 94% /m2/mudrac # ashift=9
This is a saving of 137 GB just by choosing a smaller ashift.
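Before settling on ashift=9 it's worth confirming that the backing device really reports 512-byte sectors, and afterwards that the pool uses the ashift we asked for; roughly like this (zdb output format differs a bit between OpenZFS versions):
# logical and physical sector size of the backing device
blockdev --getss --getpbsz /dev/vdb
# ashift actually recorded in the pool configuration
zdb -C mudrac | grep ashift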
Most of our e-mail consists of messages which are kept on the server but rarely
accessed. Because of that I opted for zstd-6 (instead of the default zstd-3)
to compress them as much as possible. But, to be sure it's the right choice,
I also tested zstd-12 and zstd-19; the results are below:
LEVEL   | USED (bytes)  | COMP | H:S     |
zstd-6  | 2987971933184 | 60%  | 11:2400 |
zstd-12 | 2980591115776 | 59%  | 15:600  |
zstd-19 | 2972514841600 | 59%  | 52:600  |
Compression levels higher than 6 seem to need at least 6 cores to compress
the data, so zstd-6 seemed like the best performance/space tradeoff, especially
if we take into account the additional time needed for compression to finish.
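For the pool in production, the achieved ratio can be read directly from zfs properties; keep in mind that changing compression= only affects newly written blocks, so comparing levels means rewriting the data:
# physical vs. logical usage and the resulting compression ratio
zfs get used,logicalused,compressratio mudrac/mudrac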
bullseye kernel for zfs and systemd-nspawn
To have zfs, we need a recent kernel. Instead of upgrading the whole
server to bullseye at this moment, I decided to boot bullseye
with zfs and start the unmodified installation using systemd-nspawn.
This is easy using the following command line:
systemd-nspawn --directory /mudrac/mudrac/ --boot --machine mudrac --network-interface=eth1010 --hostname mudrac
but it's not ideal for automatic start of the machine, so a better solution
is to use machinectl and a systemd service for this. Converting
this command line into an nspawn file is non-trivial, but after reading
man systemd.nspawn the needed configuration is:
root@t1:~# cat /etc/systemd/nspawn/mudrac.nspawn
[Exec]
Boot=on
#WorkingDirectory=/mudrac/mudrac
# ln -s /mudrac/mudrac /var/lib/machines/
# don't chown files
PrivateUsers=false
[Network]
Interface=eth1010
Please note that we are not using WorkingDirectory (which would copy
files from /var/lib/machines/name) but instead just created a symlink
to the zfs filesystem in /var/lib/machines/.
To enable the container on boot and start it now, we can use:
systemctl enable systemd-nspawn@mudrac
systemctl start systemd-nspawn@mudrac
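Once it's up, machinectl confirms that the container is running and can give us a shell inside it:
# list running containers and open a shell in ours
machinectl list
machinectl shell mudrac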
Keep network device linked to MAC address
The predictable network device names which bullseye uses should provide
stable names for the interfaces.
This seems like a clean solution, but in testing I figured out that
adding additional disks changes the names of the network devices. Previously,
Debian used udev to provide a mapping between network interface name and
device MAC address using /etc/udev/rules.d/70-persistent-net.rules.
Since this is no longer the case, the solution is to define a similar mapping
using a systemd .link file like this:
root@t1:~# cat /etc/systemd/network/11-eth1010.link
[Match]
MACAddress=aa:00:00:39:90:0f
[Link]
Name=eth1010
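.link files are applied by udev when the device appears, so the rename takes effect on the next boot (or device re-plug); to check that the match works without rebooting, you can dry-run the builtin and then look at the interface (exact output varies with the systemd version):
# dry-run udev's link configuration for this device
udevadm test-builtin net_setup_link /sys/class/net/eth1010
# confirm name and MAC address
ip -br link show eth1010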
Increasing disk space
When we do run out of disk space again, we can add a new disk and
add it to the zfs pool using:
root@t2:~# zpool set autoexpand=on mudrac
root@t2:~# zpool add mudrac /dev/vdc
Thanks to
autoexpand=on above, this will automatically
make the new space available. However, if we instead increase an existing disk
up to 4T, the new space isn't visible immediately since zfs has a partition
table on the disk, so we need to extend the device to use all available space using:
root@t2:~# zpool online -e mudrac vdb
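Either way, zpool list shows whether the extra capacity is really there:
# per-vdev view of pool size and free space
zpool list -v mudrac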
zfs snapshots for backup
Now that we have zfs under our mail server, it's logical to
also use zfs snapshots to provide nice, low-overhead incremental
backups. It's as easy as:
zfs snap mudrac/mudrac@$( date +%Y-%m-%d )
in cron.daily and then shipping snapshots to the backup machine.
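A minimal /etc/cron.daily/zfs-snap along these lines is enough (just a sketch; note that run-parts skips files with a dot in the name, so don't call it zfs-snap.sh, and remember to make it executable):
#!/bin/sh
# /etc/cron.daily/zfs-snap -- create today's snapshot of the mail dataset
zfs snap mudrac/mudrac@$( date +%Y-%m-%d )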
I did look into existing zfs snapshot solutions, but they all
seemed a little bit too complicated for my use-case, so I wrote
zfs-snap-to-dr.pl which copies snapshots to the backup site.
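The manual equivalent of such a transfer is an incremental zfs send piped over ssh; roughly like this (the backup host name and dataset below are just examples, as are the snapshot dates):
# first time: full copy of the oldest snapshot we have
zfs send mudrac/mudrac@2021-05-01 | ssh backup zfs recv -F backup/mudrac
# afterwards: send only the difference between the previous and the new snapshot
zfs send -i mudrac/mudrac@2021-05-01 mudrac/mudrac@2021-05-02 | ssh backup zfs recv backup/mudrac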
To keep just the two last snapshots on the mail server, a simple shell snippet is enough:
# dump all snapshot names of the dataset
zfs list -r -t snapshot -o name -H mudrac/mudrac > /dev/shm/zfs.all
# the last two lines are the two snapshots we want to keep
tail -2 /dev/shm/zfs.all > /dev/shm/zfs.tail-2
# destroy everything that is not among those two
grep -v -f /dev/shm/zfs.tail-2 /dev/shm/zfs.all | xargs -i zfs destroy {}
Using shell to create and expire snapshots and a simpler script to just
transfer snapshots seems to me like a better and more flexible solution
than implementing it all in a single perl script. In a sense, it's the
unix way of small tools which do one thing well. The only feature which
zfs-snap-to-dr.pl has aside from snapshot transfer is the ability
to keep just a configurable number of snapshots on the destination, which
keeps disk usage under control (and re-uses already
collected data about snapshots).
This was an interesting journey. In the future, we will migrate the
mail server to bullseye and remove systemd-nspawn (it feels like we
are twisting its arm by using it like this). But it does work,
and it is a simple solution which will come in handy in the future.