Know How -- ZFS
ZFS is lightweight to administer. You just tell the machine what you have, and the rest is up to ZFS. You can hardly do anything wrong. And do not be surprised: it is even easier than you might think, given your trainload of system administration experience from the last decades.
I have been running Linux/ZFS-FUSE for some months now, and I had a lot of trouble, mainly caused by bad disks. RAIDZ2 blew all of this away like a fresh breeze, and I now trust ZFS to be good enough to protect my valuable data.
The interesting part is that ZFS (with raidz2; I wish there were raidz3 and raidz4) is even able to repair disk arrays that are extremely faulty.
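To make the raidz2 trade-off concrete, here is a minimal Python sketch of the usual rule of thumb: a raidz vdev with p parity disks survives p simultaneous disk failures and offers roughly (n - p) times the smallest disk as usable space. The helper name and the six 500 GB disks are made up for illustration, and the estimate ignores metadata and padding overhead.

```python
def raidz_estimate(disk_sizes_gb, parity=2):
    """Rough estimate for one raidz vdev (hypothetical helper, not a ZFS API).

    parity=2 corresponds to raidz2. Real pools lose additional space to
    metadata and padding, which this sketch ignores.
    """
    n = len(disk_sizes_gb)
    if parity < 1:
        raise ValueError("parity must be at least 1")
    if n <= parity:
        raise ValueError("need more disks than parity devices")
    # Every member contributes only as much as the smallest disk.
    smallest = min(disk_sizes_gb)
    return {
        "disks": n,
        "tolerated_failures": parity,
        "usable_gb": (n - parity) * smallest,
    }


if __name__ == "__main__":
    # Six 500 GB disks (made-up numbers), as raidz2 and with three parity disks.
    print(raidz_estimate([500] * 6, parity=2))
    print(raidz_estimate([500] * 6, parity=3))
```

With these assumed disks, going from two to three parity devices costs one more disk of capacity in exchange for surviving one more failure.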
Advantages:
- I do not know of any other filesystem with better fault detection and self-healing capabilities.
- It informs you about any trouble it encounters. 90% of this is trouble you would never have noticed, because nothing else would have told you about it.
- Internally it is highly optimized, which means it is fast as long as there is enough free RAM, CPU and IO. (You can never have enough of all three.)
Disadvantages:
- ZFS is still not the last word in filesystems (it is a good leap ahead, though).
- ZFS IS STILL NO REPLACEMENT FOR BACKUPS! But I don't think much is missing to fix even that.
- ZFS-FUSE is dog slow on external USB drives: RAIDZ2 has a high IO overhead, and USB copes badly with highly concurrent IO loads.
- ZFS uses trainloads of CPU: If you have more than one CPU, expect to dedicate one CPU to ZFS whenever your programs do IO on ZFS! If you have a single CPU, expect 95% of the CPU to be spent on ZFS when copying. (This disadvantage is due to the internal compression, bookkeeping, checksumming and error correction in the ZFS code. It cannot be avoided as long as there are no dedicated RAID controllers that natively speak ZFS.)
- ZFS-FUSE uses a lot of memory: The virtual size exceeds 2.5 GB while the resident size is normally about 600 MB; the rest is swapped out. (This is a memory management problem. Usually ZFS runs in the kernel and is designed to use up all free memory. In userland there is always as much free memory as the process is allowed to allocate.) ZFS probably does not run if it is not allowed to allocate at least 1 GB of memory.
- ZFS lacks a bad sector relocation plan. (Some drives seem to have "weak sectors" which quickly develop bit errors. ZFS could detect that and stop using those sectors. Disks often relocate sectors only on failed writes, which do not happen here.)
Preface
Sun's ZFS is nice, but today the way to benefit from ZFS is anything but easy.
(The information below will slowly move into subpages. Give it time.)
Wrong way
The funny thing is that all the sources which must be considered "near" ZFS failed for me:
- OpenSolaris: I failed with OpenSolaris. Not only does it need an enormous amount of hardware (well, ZFS needs even more, read below), but I also failed miserably to get the networking up and running. Without networking I do not even need to start with ZFS. Note that I did not try to install a separate network card; I switched the OS instead, as this was easier to do.
- FreeBSD: FreeBSD did better on the networking side. However, the kernel then crashed now and then on this hardware while using ZFS. I do not think ZFS is the source of the problem; it only triggers a major flaw in the IO subsystem of FreeBSD, because it is too aggressive in demanding DMA buffers. The high error level on USB-connected drives seems to drive the kernel into serious OOM conditions, so that it halts.
Right way
Guess what? Linux! Well, ZFS is not in the kernel, though. But it is available as a FUSE userland driver. And the good thing first: It works as expected.
Note that my expectations are not high. All I expect is that it works reliably and the machine stays stable. Everything else can be fixed later.
Missing things and issues seen
- FUSE does not support certain advanced features of ZFS, such as mapping snapshots into the FS tree. Either there are workarounds (like explicitly creating a snapshot) or I can live without those features.
- The userland driver dumped core on my machine while mounting a ZFS volume. The problem was that I removed the mount point while ZFS was mounting. Yes, I did it in parallel, so I managed to trigger something like a race condition. However, there really was no harm done. Linux stayed stable and rejected IO to the FS. After restarting the ZFS userland driver and re-mounting, everything was back to normal.
- Setup wasn't as easy as expected. FYI, read the documentation which comes with ZFS and use debugging; it really is helpful, although it misses some important steps. Perhaps read my ZFS FUSE introduction. (Coming soon)
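As a rough illustration of the "explicitly creating a snapshot" workaround mentioned in the list above, the following Python sketch builds and runs the standard `zfs snapshot dataset@label` command. The dataset name `pool/data`, the label and both helper names are hypothetical, and whether this matches the exact workaround used here is an assumption.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_command(dataset, label=None):
    """Build the argv for `zfs snapshot dataset@label` (hypothetical helper)."""
    if label is None:
        # Default label: a UTC timestamp, an arbitrary naming choice.
        label = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    if not dataset or "@" in dataset or "@" in label:
        raise ValueError("dataset and label must be non-empty and free of '@'")
    return ["zfs", "snapshot", f"{dataset}@{label}"]


def take_snapshot(dataset, label=None):
    """Run the command; raises CalledProcessError if zfs reports a failure."""
    subprocess.run(snapshot_command(dataset, label), check=True)


if __name__ == "__main__":
    # Only prints the command, so it is safe to run without a pool.
    print(" ".join(snapshot_command("pool/data", "before-cleanup")))
```

Keeping the command construction separate from the call makes it easy to inspect what would be executed before touching a real pool.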
Additional info
- zpool scrub pool: Works asynchronously on Linux (on BSD it blocked access to the mounted area, dunno why). This means you can use it on a production system without any drawback.
- It is dog slow: On a 466 MHz Celeron with 640 MB RAM it takes up more than half the memory (nearly 400 MB), eats 100% CPU (FYI: the processes accessing ZFS must have a higher nice value than the userland driver, else you will run into trouble) and slows down the machine extremely. In return this delivers a raw write speed of 400 KB/s to the FS (with peaks of 1 MB/s at good times) with raidz2 and compression on. So in this configuration (which is the only reasonable one) you need roughly 1 MHz of CPU clock per KB/s written to the drives. (But note that reading is an order of magnitude faster!)
- Compression: Only delivers a compression ratio of 1.2 for me (well, large parts of my data are compressed archives anyway, so this perhaps sounds worse than it is).
- It survives as expected: Unplugging a USB drive while it is in use does not halt anything. That's it. Well, it is anything but easy to get the drive back online, but I managed to do this as well. All I need is a data area which survives hard drive crashes and is able to tell me about broken data.
- It ignores typical problems of external hard drives: Currently I use Firewire (IEEE 1394), but the same can be seen on USB: write errors and timeouts. Up to now there have been no problems at all (except that the kernel chokes until the timeout hits, but this is normal).
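The "1 MHz per KB" rule of thumb from the speed figures above can be checked with a few lines of Python. The clock and throughput numbers are the ones quoted in the list; reading "1 MB/s" as 1000 KB/s and the 100 GB data set are assumptions made only for this sketch.

```python
CPU_MHZ = 466          # Celeron clock from the text
SUSTAINED_KB_S = 400   # sustained raw write speed from the text
PEAK_KB_S = 1000       # "1 MB/s" peak, taken as 1000 KB/s (assumption)


def mhz_per_kb_s(cpu_mhz, kb_per_s):
    """CPU clock spent per KB/s of write throughput."""
    return cpu_mhz / kb_per_s


def hours_to_write(gigabytes, kb_per_s):
    """Hours needed to write `gigabytes` (decimal GB) at a constant rate."""
    kilobytes = gigabytes * 1000 * 1000
    return kilobytes / kb_per_s / 3600


if __name__ == "__main__":
    print(f"sustained: {mhz_per_kb_s(CPU_MHZ, SUSTAINED_KB_S):.3f} MHz per KB/s")
    print(f"peak:      {mhz_per_kb_s(CPU_MHZ, PEAK_KB_S):.3f} MHz per KB/s")
    # 100 GB is a made-up data set size for illustration.
    print(f"100 GB at the sustained rate: {hours_to_write(100, SUSTAINED_KB_S):.1f} h")
```

At the sustained rate this comes out a bit above 1 MHz per KB/s, and filling 100 GB would take close to three days on this machine.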