My point of view
First, let me explain my position. I worked for quite a few years at a big corporation and followed EMC storage systems (one from the end of the last century, and the improvement that CLARiiON brought to our production SAP deployment). I even visited the EMC factory in Cork, Ireland, and it was a very eye-opening experience. They claim that 95% of customers who visit the factory go on to buy EMC storage, and I believe them (we did upgrade to CLARiiON, btw).
In my Linux-based deployments on HP, Compaq and IBM hardware I did various crazy RAID configurations (RAID5 across the disks on one controller and then a stripe across the other controller, for example). Those were the easy parts: you got a RAID controller with DRAM cache (~256 MB) and some kind of battery backup, which greatly improved write performance.
Later on, at CARNet, we had HP EVA storage, which proved quite flaky. I heard from a friend that in one enterprise deployment they use them only for testing. And you know, it's just a shelf of disks with redundant controllers and a fiber interface...
Solid state drives
However, solid state drives changed a lot of that. I still haven't had the pleasure of using Intel SSDs, which are supposed to be good, but USB sticks are flash storage too, although with quirky characteristics.
The particular one I have is reported by Linux as ID 0951:1603 Kingston Technology Data Traveler 1GB/2GB Pen Drive, but it is in fact an 8 GB model. It seems to have 128 MB of memory that is writable at about 6 MB/s; after that, write speed drops to 45 KB/s.
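A drop like that is easy to miss if you only look at an average, so it helps to time each chunk separately. The following is a minimal sketch of how such a measurement might be done, not the tool I used: the function name, chunk size and target path are all assumptions for illustration, and the demo writes to a temporary file, so you would point it at a file on the mounted stick instead.

```python
import os
import tempfile
import time


def measure_write_throughput(path, chunk_mb=1, chunks=8):
    """Write `chunks` chunks of `chunk_mb` MiB to `path`, syncing after each.

    Returns one MiB/s figure per chunk, so a sudden slowdown part-way
    through the device shows up as a drop in the list.
    """
    block = os.urandom(chunk_mb * 1024 * 1024)  # incompressible test data
    rates = []
    with open(path, "wb") as f:
        for _ in range(chunks):
            start = time.perf_counter()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push data to the device, not just the page cache
            elapsed = time.perf_counter() - start
            rates.append(chunk_mb / max(elapsed, 1e-9))
    return rates


if __name__ == "__main__":
    # Hypothetical target: replace with a file on the mounted USB stick.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "throughput.bin")
        for i, rate in enumerate(measure_write_throughput(target)):
            print(f"chunk {i}: {rate:.1f} MiB/s")
```

To see the cliff described above you would need to write more than the fast region holds, for example chunks=256 with 1 MiB chunks.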
On the other hand, there is the ZFS on FUSE project, which enables some really interesting applications of Sun's (and now Oracle's) file system. I do have to mention Sun at this point. Ever since I heard about Oracle's acquisition of Sun, I have wondered what will happen to ZFS. I might even suspect that ZFS is the main reason why Oracle bought Sun. Let me explain...
If you look at the database market (where Oracle is), the only interesting way to improve relational databases is to make them extremely fast. And that revolution is already here. Don MacAskill from SmugMug makes a compelling case for the performance of SSD storage. If you don't believe words, watch this video from 24:50 to see the solution to the MySQL storage performance problem: hardware! Sun's hardware. Do you think that Oracle didn't notice that?
Enterprise storage cheaply
Did you watch the video? I really don't agree that it's hardware. Come on! It's Opteron boxes with custom-built SSD disks optimized for write speed: SSDs with super-capacitors instead of the batteries in old RAID controllers.
But, to make it really fun, I will try to re-create at least some of those abilities using commodity hardware in my university environment. I have Dell's OptiPlex boxes which come loaded with a lot of goodies to put together a commodity storage cluster:
- Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
- 3 GB RAM
- 2 SATA disks with ~80 MB/s of read/write performance
- multi-card reader and 8 USB slots
- fake software RAID on Intel chipset (supported by dmraid, but even its documentation suggests not using it)
Why ZFS? Isn't btrfs the way to go? For this particular application, I don't think so. Let me list the features of ZFS which excite me:
- ability to store the log on a separate (mirrored) device (SSD, or USB sticks if that helps)
- scrub: read all data in the pool, verify checksums, and rewrite any block that fails from a redundant copy (beats smartctl -t long because the rewrite also lets the disk re-allocate bad blocks; I've seen 80 MB/s scrub)
- balancing of IO over devices (I will use this over nbd to split a mirror between machines for fail-over)
- arbitrary number of copies (nice for bigger clusters of storage machines)
- nice snapshots which display their size and can be cloned into writable ones
- snapshot send/receive to make off-site backup copies
- L2ARC - second-level read cache on SSD devices which, together with the separate log device for writes, lets you balance caching over devices with different characteristics (USB sticks have fast read and slow write, so they might be a good fit)
You might think of it as git with POSIX file-system semantics.
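To make the feature list a bit more concrete, here is a rough sketch of the zpool and zfs commands involved. It only builds and prints the command lines; it does not run them. The pool name, dataset names and device paths are hypothetical placeholders and not my actual layout, and whether a given zfs-fuse release supports every feature (cache devices in particular) is something to check before relying on it.

```python
import shlex

POOL = "tank"                # hypothetical pool name
DATASET = "tank/backup"      # hypothetical dataset name


def pool_create_cmd(pool, mirror, log_mirror, cache):
    """zpool create: mirrored data disks, mirrored log device, cache device."""
    return ["zpool", "create", pool, "mirror", *mirror,
            "log", "mirror", *log_mirror,
            "cache", *cache]


def snapshot_cmds(dataset, snap, clone):
    """Take a snapshot, then clone it into a writable dataset."""
    full = f"{dataset}@{snap}"
    return [["zfs", "snapshot", full], ["zfs", "clone", full, clone]]


def plan():
    """Return the command lines, in order, as argument lists."""
    cmds = [pool_create_cmd(POOL,
                            ["/dev/sda3", "/dev/sdb3"],   # placeholder disks
                            ["/dev/sdc", "/dev/sdd"],     # placeholder log devices
                            ["/dev/sde"])]                # placeholder cache device
    cmds.append(["zfs", "create", DATASET])
    cmds.append(["zfs", "set", "copies=2", DATASET])      # extra copies of each block
    cmds.append(["zpool", "scrub", POOL])
    cmds += snapshot_cmds(DATASET, "monday", f"{POOL}/restore")
    return cmds


if __name__ == "__main__":
    for cmd in plan():
        print(shlex.join(cmd))
    # Off-site copy is a pipeline; "backuphost" and "pool2" are placeholders.
    print(f"zfs send {DATASET}@monday | ssh backuphost zfs receive pool2/backup")
```

Printing the plan first and running it by hand is deliberate: zpool create is destructive to whatever is on the named devices.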
But it's in user space, you say, it must be slow! It isn't. Really. Linux user space is much faster than the disks, and having a separate process is nice for monitoring purposes. File-system overhead gets counted as user time, not system time, so system time is a clear indicator of driver (hardware) activity rather than file-system overhead.
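One way to watch that user/system split for a single process on Linux is to read its line in /proc. The sketch below is an illustration under assumptions rather than a monitoring tool: the helper names are made up for this example, and in practice you would pass the PID of the file-system daemon instead of the script's own PID.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def cpu_seconds(pid):
    """Return (user_seconds, system_seconds) consumed so far by `pid`."""
    with open(f"/proc/{pid}/stat") as f:
        utime, stime = parse_cpu_ticks(f.read())
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    return utime / hz, stime / hz


if __name__ == "__main__":
    # Demo on this script's own PID; substitute the daemon's PID in practice.
    if os.path.exists("/proc/self/stat"):
        user, system = cpu_seconds(os.getpid())
        print(f"user {user:.2f}s  system {system:.2f}s")
```

Sampling these two numbers before and after a backup run gives the time the process spent in its own code versus in the kernel on its behalf.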
I have most parts of this setup ready, and I'm using it to back up OpenVZ containers. So, I'm running an OpenVZ kernel and I can even make virtual machines from backup snapshots to recover to some point in time. After I finish this setup, expect a detailed guide (it will probably be part of my upcoming virtualization workshop as an alternative to LVM).