Know How - ZFS - Why ZFS: ZFS is a lifesaver

In this article I present some of my experiences with ZFS that make it a lifesaver.

Even though I do not think that ZFS-FUSE under Linux is ready for production use, it is good to have around to protect valuable data. See the conclusion below.

Disk Drive = Trouble Guarantee

I have a legion of disk drives. Most of them are external USB drives in the range of 300 to 500 GB. I have terabytes of data on those drives, and I have lost track of what is where.

Why do I have so many drives? Because drives were failing. Not only failing like "I am dead", but failing like "I am acting erratically". You might think that this is due to the "cheap" nature of big USB drives. Well, those were not cheap drives. I have paid several thousand EUR for them. However, those are of course IDE drives, so they are "cheap" compared to SCSI.

SCSI drives are much more stable, that's true. But are you really sure that your SCSI drives never err?

MTBF != time to failure

Think about MTBF. If you have an MTBF of 10 years, you might think you can expect one error every 10 years, and that this is quite good enough. That's wrong!

An MTBF of 100000 hours tells you that, each hour, one drive out of 100000 will show some error. It may be that the drive fails completely, but most often it means that the drive starts to behave erratically. Also, MTBF is not the only thing that counts. Every piece of hardware has a residual error probability that depends on the CRC/ECC used. This error probability is really low, like 10^-15. You might think that something this unlikely will never happen to you. That's wrong, too!
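To put numbers on this, here is a minimal sketch. It assumes a constant failure rate (an exponential model), which is a simplification and not something a drive vendor guarantees. The function names are made up for this illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_failures(mtbf_hours, drives, hours):
    """Expected number of failure events in a fleet of drives over a time span."""
    return drives * hours / mtbf_hours


def prob_at_least_one_failure(mtbf_hours, drives, hours):
    """Chance that at least one drive fails, assuming a constant failure rate."""
    return 1.0 - math.exp(-expected_failures(mtbf_hours, drives, hours))


# How a 100000 hour MTBF looks for a growing pile of drives over one year.
for n in (1, 10, 100):
    p = prob_at_least_one_failure(100_000, n, HOURS_PER_YEAR)
    print(f"{n:4d} drives: {p:.1%} chance of at least one failure per year")
```

The point is that MTBF is a statement about a population of drives, so the more drives you own, the sooner you will meet a failure.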

Drives talk in bits. This means that each bit can hit you with an error at a probability of 10^-15. 10^-15 roughly translates to one error every 2^50 bits. That is 2^47 bytes or 128 terabytes. Enough for you, you think. But with today's data rates and drive counts, 128 terabytes are reached sooner than you might expect.
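The conversion from a bit error rate to an amount of data can be checked with a few lines. This is only a sketch: the 100 MB/s transfer rate is an assumed example value, and note that 2^50 bits divided by 8 is 2^47 bytes.

```python
import math


def bytes_per_expected_error(bit_error_rate):
    """Average number of bytes transferred per expected bit error."""
    bits = 1.0 / bit_error_rate
    return bits / 8


def hours_to_transfer(num_bytes, rate_bytes_per_s):
    """Hours needed to move num_bytes at a sustained rate."""
    return num_bytes / rate_bytes_per_s / 3600


ber = 1e-15
volume = bytes_per_expected_error(ber)
print(f"1/{ber:g} bits is about 2^{math.log2(1 / ber):.1f} bits")
print(f"that is about {volume / 2**40:.0f} TiB per expected bit error")

# Assumed example rate: 100 MB/s sustained, summed over all drives in use.
print(f"reached after about {hours_to_transfer(volume, 100e6) / 24:.0f} days")
```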

Other bugs

There are also other types of bugs affecting drives: bugs in the kernel, bugs in some driver or hardware, unnoticed RAM errors (even with ECC) and so on. Most bugs are harmless, but some are real killers when it comes to data.

The threat

The most dangerous threat is data alteration which goes unnoticed.

Consider the following example:

There are two files on the drive. Now sector 2 of file1 is supposed to be written. Due to some bug, sector 1 of file2 is written instead.

Now you might notice that file1 is corrupt, as it contains wrong data. But can you find out that file2 is now corrupt, too?

With only 2 files there seems to be no problem. But what if file2 is one out of several million files? Are you prepared to check all those files?

Backups are not the solution

You might think that backups are the solution, but usual backups are only kept for, say, the last year. So what happens if you suspect that file2 is corrupt now, but you do not know when the corruption happened? You take the oldest backup you can find, which is 1 year old, and you see that the backup is identical to the file on the drive. Is the file really still correct, or did the corruption happen more than 1 year ago? You cannot tell from backups.

Archives are no real solution either

If you have an archive, you can now dig into the archive. You can check where the last archived version differs from the backed up version. However, looking into the archive you still cannot be sure that the version you see there is the right one. Perhaps there was some version between the last archived version and the version which was corrupted. Even worse, you cannot tell whether the corruption took place before the version was archived (so it might be that the archive only contains a previous, already corrupted, version).

Safe applications are the solution

If an application is able to check its data for corruption, this is a solution. In this case you can find out (with a high probability) whether your data is corrupted, and if it is, rebuild the data based on the last valid version. But only few applications are constructed that way.
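What such a check looks like at the application level can be sketched with a checksum manifest. This is only an illustration using SHA-256 from the Python standard library. The function names are invented for this example, and it is not how any particular application does it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file below root (as a relative path) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(root, manifest):
    """Return the files that are missing or no longer match the manifest."""
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != digest:
            bad.append(rel)
    return bad
```

The manifest has to be rebuilt after every intended change, and it must be stored somewhere safe itself. A check like this only tells you that a file changed, not whether the change was wanted.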

The solution

Filesystems like ZFS are the solution.

If a file becomes corrupted on ZFS, regardless of the cause, ZFS detects the problem with a very high probability (high enough for me, so it's likely to be high enough for you, too) and is able to react accordingly. If ZFS has no redundancy, it will not allow you to access the corrupted data (note that I never tested that). With redundancy, ZFS can repair the data. And more important: it can tell whether the repair was successful or not.

The latter is the key feature. You can repair a RAID5, too. But you cannot tell afterwards whether the recovered data is correct or not. It is only correct from the RAID5 perspective, but not necessarily from the application's viewpoint.

RAID5 does not protect data. It only protects against a single type of fault: a drive that dies.

My experience shows me that you cannot rely on this being the only problem. With SCSI you are on the safer side, as expensive server SCSI drives seem to either function properly or die completely (which is a safety measure). With really inexpensive disks (the I in RAID) RAID5 fails completely, as my observation is that most times IDE/SATA drives start to behave erratically before they die.

With RAID5 you have no chance to detect which drive behaves erratically! There is no way to even access a single drive. Because of RAID5 I have lost more data than I ever lost through single drive failures. This happened because one drive died and during the rebuild process (= stress) another drive started to behave erratically. The result was a filesystem full of random data.

With SCSI, RAID5 and a standalone controller that periodically verifies the RAID5 parity, I have never observed this. However, for the price of the controller alone you can have a 4 TeraByte ZFS running!
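The difference between parity alone and parity plus checksums can be shown with a toy model. This is a sketch only: it uses XOR parity over three tiny "drives" and SHA-256 as the checksum, and it is not the real on-disk logic of RAID5 or ZFS.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    return hashlib.sha256(block).digest()


def parity_consistent(blocks, parity):
    """All that parity can say: something is wrong, but not where."""
    return xor_blocks(blocks) == parity


def locate_and_repair(blocks, parity, sums):
    """With per-block checksums the bad block can be found, rebuilt and verified."""
    bad = [i for i, b in enumerate(blocks) if checksum(b) != sums[i]]
    if len(bad) != 1:
        return bad, None  # nothing to do, or too much damage for one parity block
    i = bad[0]
    others = [b for j, b in enumerate(blocks) if j != i]
    rebuilt = xor_blocks(others + [parity])
    # The checksum also tells whether the repair really succeeded.
    return bad, rebuilt if checksum(rebuilt) == sums[i] else None


data = [b"AAAA", b"BBBB", b"CCCC"]       # three data "drives"
parity = xor_blocks(data)                # the parity "drive"
sums = [checksum(b) for b in data]       # stored apart from the data

read_back = [data[0], b"BxBB", data[2]]  # drive 1 silently returns garbage
print("parity consistent:", parity_consistent(read_back, parity))
print("bad drives, repaired block:", locate_and_repair(read_back, parity, sums))
```

Parity alone only reports a mismatch. It needs the checksums to say which drive lied and to confirm that the rebuilt block is the right one.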

RAID is no solution

RAID is good if you need a lot of heads to speed up data access. So if you need a sustained random read rate of 1 GB/s and above, then RAID is for you. You then have at least 64 SCSI drives of 100 GB each on your SAN (smaller = faster, as server drives only operate on the outer 30% of the disks to get a higher data rate and shorter seek times), with several data movers with 10 GB or more of battery-backed RAM. And your datacenter infrastructure costs more than 1 million $.

RAID is then essential, because with the number of drives the probability of having to replace one rises. Usually you will not do RAID5, you will do RAID51, so that you can always replace a drive quickly. RAID51 often survives a two-drive failure as well.

However, you must still have a backup ready in such a situation, as shit happens and it might be that more than two drives fail and therefore render the RAID unpredictable (it's called the Birthday Phenomenon, as the probability of two persons out of a group having the same birthday is higher than it seems).
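The effect can be sketched in a few lines. The first function is the classic birthday calculation. The second one applies the same kind of thinking to drives, under the assumption that drives fail independently with some per-drive probability p within a given time window. Both the independence and the value of p are assumptions of this sketch. In reality failures tend to cluster (for example under rebuild stress), which makes things worse.

```python
from math import comb


def birthday_collision(people, days=365):
    """Chance that at least two people in a group share a birthday."""
    p_distinct = 1.0
    for i in range(people):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct


def prob_more_than_k_failures(k, n, p):
    """Chance that more than k of n drives fail, each independently with probability p."""
    at_most_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - at_most_k


print(f"23 people: {birthday_collision(23):.1%} chance of a shared birthday")

# p = 0.01 per drive and window is a made-up example value.
for n in (8, 64, 256):
    print(f"{n:4d} drives: {prob_more_than_k_failures(2, n, 0.01):.2%} "
          "chance of more than two failures in the same window")
```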

ZFS is a solution

With RAID-Z2 (raidz2) two drives can fail. If a third one fails, too, then ZFS can still tell which data was corrupted.

In this way ZFS protects against the typical bugs where data is written to the wrong place (or not written at all).

Example

Here is a live example of a drive which behaves erratically:
zoo:~# zpool status
  pool: zoo
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
 attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
 using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 1.41% done, 99h20m to go
config:

 NAME        STATE     READ WRITE CKSUM
 zoo         ONLINE       0     0     0
   raidz2    ONLINE       0     0     0
     sde     ONLINE       0     0     0
     sdd     ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdf     ONLINE       0     0 22.6K
     sdh     ONLINE       0     0     0
     sdg     ONLINE       0     0     0
     sdi     ONLINE       0     0     0

For some reason the USB drive "sdf" has checksum errors. The reason is unknown, however ZFS was able to repair the problem.
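If you want to watch for this automatically, the device table of such a status report can be scanned with a small script. This is a sketch that only assumes the column layout shown above. Treating the "K" suffix as a factor of 1000 is my assumption, so take the numbers as approximate.

```python
SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}  # assumed meaning of suffixes


def parse_count(text):
    """Turn a counter such as '0' or '22.6K' into a (possibly approximate) number."""
    if text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def devices_with_errors(status_output):
    """Return {device: (read, write, cksum)} for all rows with nonzero counters."""
    result = {}
    in_config = False
    for line in status_output.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        try:
            counts = tuple(parse_count(x) for x in fields[2:5])
        except ValueError:
            continue  # not a device row
        if any(counts):
            result[fields[0]] = counts
    return result


SAMPLE = """
 NAME        STATE     READ WRITE CKSUM
 zoo         ONLINE       0     0     0
   raidz2    ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdf     ONLINE       0     0 22.6K
     sdh     ONLINE       0     0     0
"""

print(devices_with_errors(SAMPLE))
```

On a real system the text would come from running zpool status, for example through Python's subprocess module.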

FYI: The drive needed a power cycle.

For some unknown reason it ignored writes without reporting any error (at least that is what I think happened). Some weeks before, one of my USB switches crashed, which might have switched this drive into a "write ignore" mode. If so, this is very strange!

Note that the drive did not behave erratically all the time. Only some hundred thousand sectors became corrupt. Currently it is unknown why this problem happened, perhaps it's a FUSE thing. It showed up after I copied a 300 GB image of a faulted drive onto ZFS and ran e2fsck on this image (and not on the original drive).

ZFS-FUSE for Linux

ZFS-FUSE for Linux still has some downsides. The major one is that the RAM usage is too high. You cannot run ZFS-FUSE on machines with 512 MB or less (yes, that includes machines with 512 MB):
zoo:~# ps u13664
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     13664  7.8 21.8 1977024 454204 pts/0  Ssl+ Apr03 1243:52 ./zfs-fuse --no-daemon

firebird:~# ps u4227
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      4227 51.8 67.0 2610964 433764 pts/1  SNl+ Jan20 63798:17 ./zfs-fuse --no-daemon

Do not be confused by the VSZ size. The footprint of ZFS-FUSE is roughly 512 MB. And you need more RAM for the rest of your system.

Also, the FUSE driver uses a lot of CPU. On a 466 MHz Celeron (firebird) I hardly get more than 2 MB/s, mainly because the machine does not have enough CPU power. With random reads (when the seek times of the drives rise) you should not expect data rates higher than 100 KB/s in this situation. (The situation is similar with standalone disks. If you get 20 MB/s on linear reads, then you cannot expect more than 1 MB/s on random reads, probably less.)

As I attach drives using USB (or FireWire), this slows things down even more. On my fast machine (zoo), a 3.6 GHz P4 HT, I get something like 8 MB/s linear read, with one core at 10% to 20% CPU usage.

Conclusion

ZFS is good to have. If your OS supports a production version of ZFS, then use it. This is probably only true for Solaris fileservers (I am not quite sure about OpenSolaris, though).

However, I can recommend having a dedicated Linux-based ZFS-FUSE machine around. (If you have 2 GB of RAM or more and can spare 512 MB, you can run it on your local machine as well. However, be sure to have ECC RAM. You have been warned.)

You can attach a lot of storage to this machine. You can then use it as an online archive and as a disk backup system (see also: Acronis True Image). In my experience, ZFS works very reliably. More reliably than single drives. Much more reliably than RAID.

However, you must be able to take ZFS down sometimes, as it is not completely stable. This means you must be able to reboot the ZFS-FUSE machine without harming other things (most times you only need to take down the ZFS process, but this requires unmounting all FUSE drives before ZFS can be re-activated).

For me it has replaced the following: single-drive NAS solutions, backup disks and RAID fileservers.

I need it

Personally, I do not need a high throughput. I do not have any problems with processes taking a little bit longer. However I do not want them to fail. And I do not want them to fail later on.

So my method is as follows:

  • Use local hard drives if speed is needed.
  • Have a backup of those machines. The backup is written to the network on the ZFS server.
  • Have a fileservice on ZFS (ZFS can deliver 8 MB/s which is good enough for 100 MBit Ethernet).
  • Have an archive server on a different location. (Planned)

The idea behind the archive server is that a virus could probably access all data on the first machine and destroy it. The secondary machine must therefore keep an archive, so that even a disaster (like a fire) cannot destroy all data.

Note that the archive server does not need a permanent Internet connection, does not need high bandwidth and does not need to be accessible from the outside (it can open the connection itself). So it will be highly secure. However it needs a lot of storage.

As it must run unattended and is meant for the disaster case, it must be protected against simple failures (like disk failures). This probably means that it must run RAID-Z2 with a spare drive as well. It also needs to be able to store a sustained data rate of 100 KB/s for around 1 year, which is about 3 TB. However, it must still have that much "free" afterwards, which means it needs double this space, so 6 TB of data space. With a spare and 2 redundancy drives this makes 15 drives of 500 GB each.

This is only FYI, so that you can see what others are doing.
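The sizing can be redone with a short sketch. It assumes decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) and the function names are made up for this example. With the rounded 6 TB it gives the 15 drives mentioned above. Without rounding 3.15 TB down to 3 TB, it comes out one drive higher.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def stored_bytes(rate_bytes_per_s, years):
    """Data written at a sustained rate over a number of years."""
    return rate_bytes_per_s * years * SECONDS_PER_YEAR


def drives_needed(data_bytes, drive_bytes, parity_drives, spare_drives):
    """Data drives (rounded up) plus parity drives plus spares."""
    return math.ceil(data_bytes / drive_bytes) + parity_drives + spare_drives


one_year = stored_bytes(100e3, 1)
print(f"100 KB/s for one year: {one_year / 1e12:.2f} TB")

# Rounded to 3 TB and doubled to keep the same amount free.
print("drives with rounded 6 TB:", drives_needed(6e12, 500e9, 2, 1))

# The same calculation without rounding down first.
print("drives without rounding:", drives_needed(2 * one_year, 500e9, 2, 1))
```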

-Tino, 2008-04-14