Collecting disk SMART information from a large disk pool

For the last few weeks, I have been configuring a huge ZFS pool of 50 disks spread over three machines. Aside from benchmarking, I wanted to set up monitoring of this disk pool. smartctl is the natural candidate for getting SMART data, but where should I keep it? I recently learned about the git log -p output format, which nicely shows changes to your source files, so the natural question was: can I use git to track SMART disk statistics?
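
As a minimal sketch of the idea (the ~/smart directory and the commit message are my own choices for illustration, not anything from the script discussed below):

    # one-time setup: a git repository to hold SMART dumps
    mkdir -p ~/smart && cd ~/smart && git init

    # dump SMART data for one disk and commit it
    smartctl -a /dev/sda > smart.sda
    git add smart.sda
    git commit -m "smart dump $(date -I)"

    # changes between runs then show up nicely as diffs
    git log -p smart.sda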

As it turns out, getting an overview of the disk layout is really easy under Linux if you know where to look. /proc/partitions comes to mind first, but it lacks one really important piece of information: the disk serial number. It's the only piece of information which won't change between reboots when you have to spin up 30+ disks, so you really want to use it to identify disks, instead of the device name for example (which I tried on my first attempt, only to learn that disks move around).
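
For example, smartctl itself reports the serial number, and a reasonably recent lsblk can list it for all disks at once (both commands below are generic examples, not part of the setup described here):

    # serial number of a single disk
    smartctl -i /dev/sda | grep -i 'serial number'

    # or all whole disks at once (-d skips partitions)
    lsblk -d -o NAME,SERIAL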

Good naming of dump files is as important as always. In the end, I opted for smart.id, where the id part comes from /dev/disk/by-id/scsi-something. Paths in /dev/disk/by-id/ are especially useful when creating storage pools because they also don't change between reboots.
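
A sketch of that naming scheme (the partition filter is my assumption; this is not the actual smart-dump.sh):

    # one dump file per whole disk, named after its stable /dev/disk/by-id path
    for dev in /dev/disk/by-id/scsi-*; do
        case "$dev" in *-part*) continue ;; esac    # skip partition symlinks
        smartctl -a "$dev" > "smart.$(basename "$dev")"
    done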

Now that we know where to look for disk identification and serial numbers, we are ready to start collecting SMART data. However, this data is much more useful when coupled with information from the controllers, so the final version of the smart-dump.sh script also supports dumping controller status for LSI Logic / Symbios Logic and 3ware controllers. Keep in mind that collecting SMART info from disks does interrupt data transfers, so if you have a huge pool you might want to spread those requests out (or issue them in parallel if you prefer one big interruption to several smaller ones); both variants are sketched below.
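
As a sketch (reusing the by-id loop from above; the 10-second delay is an arbitrary illustration, tune it to your pool):

    # spread the interruptions: one disk at a time, with a pause in between
    for dev in /dev/disk/by-id/scsi-*; do
        case "$dev" in *-part*) continue ;; esac
        smartctl -a "$dev" > "smart.$(basename "$dev")"
        sleep 10    # arbitrary delay between disks
    done

    # or take one big hit: query all disks in parallel, then wait
    for dev in /dev/disk/by-id/scsi-*; do
        case "$dev" in *-part*) continue ;; esac
        smartctl -a "$dev" > "smart.$(basename "$dev")" &
    done
    wait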

So, was all this worth the effort? In fact, it was! In our sample of 50 3 TB disks, one disk reported errors after just 192 hours of lifetime. It would probably have reported them earlier, but this was only the second time I ran smartctl -t long on it; it passed the first long test, at 8 hours of LifeTime. Even if you have read the Failure Trends in a Large Disk Drive Population paper from Google and concluded that SMART is lying to you and you can ignore it, please monitor your drives!
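
For reference, here is a minimal, generic example of running and checking a long self-test (/dev/sda stands in for any of your disks):

    # start a long (extended) self-test; the drive runs it in the background
    smartctl -t long /dev/sda

    # hours later, read the self-test log: status and LifeTime(hours) at test time
    smartctl -l selftest /dev/sda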