Can ZFS cope with sudden power loss? (What events cause a pool to be irrecoverable if the disk itself hasn’t failed or become unreliable)

data-integrity, data-loss-prevention, zfs

All the resources say ZFS doesn't have an fsck or recovery tools, that you should use a battery-backed SSD for the ZIL, and so on.

If the plug is suddenly pulled somehow (total power loss despite a UPS etc., but assuming no physical damage, no head crashes, etc.), the SSDs will flush their write caches to NVRAM and then go quiet…

What chance does ZFS have of being in a consistent state (even if some data was lost) and the pool being usable/readable, when it reboots?

Update

I realise I actually mean to ask something closer to: what events would lead to a situation where ZFS gives up on being able to read the pool, despite the data being basically intact? It's not clear what ZFS can recover from (or can recover from given the right hardware) and what it can't (or can't without the right hardware), because it does so much internally to self-check and fix things. Clearly insufficient redundancy plus disk failure (or some other major hardware issue) is one case, and a complete wipe/overwrite due to a firmware/software bug is another. But assuming the storage media, hardware and software are still working reliably, what else has to go wrong for the result to be the loss of a pool? Where are the limits on its ability to fix a pool? Which situations have to arise before it can't, and what has to happen to give rise to them?

Best Answer

What chance does ZFS have of being in a consistent state (even if some data was lost) and the pool being usable/readable, when it reboots?

ZFS operates like a transactional database management system in that old data is not overwritten in place when it is updated, the way traditional filesystems do it. Instead, the new data is written elsewhere on the disk, then the filesystem metadata structures are updated to point to the new data, and only then is the old data's block freed for reuse by the filesystem. In this way, a sudden power loss leaves the old copy of the data in place if the new updates were not 100% committed to persistent storage. You won't get half a block replaced or anything like that, causing data corruption.
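For example, after an unclean shutdown an ordinary import simply picks up the last fully committed transaction group; in the unusual case where the newest transactions are themselves unreadable, ZFS can be asked to discard them and roll back to a slightly older consistent state. A minimal sketch, assuming a pool named tank (the name is just a placeholder):

    # A normal import after a crash needs no fsck-style repair pass; the
    # pool comes up at its last fully committed transaction group.
    zpool import tank

    # If the newest transaction groups are damaged, recovery mode can
    # discard them, sacrificing the last few seconds of writes.
    # -n does a dry run to check whether the rewind would succeed.
    zpool import -Fn tank
    zpool import -F tank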

On top of that, ZFS uses a sophisticated checksumming scheme that allows the filesystem to detect miswritten or corrupted data.

If you're using ZFS with redundant storage, this same scheme allows the filesystem to choose between two or more redundant copies of the data when repairing the filesystem. That is, if you have two copies of a given block and only one of them matches its stored checksum, the filesystem knows that it should repair the bad copy/copies with the clean one.
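As a sketch of what that looks like in practice (the pool name and device paths below are placeholders, not anything from the question):

    # A 2-way mirror gives ZFS two independent, checksummed copies of
    # every block, so a copy that fails its checksum can be rewritten
    # from the good one automatically.
    zpool create tank mirror /dev/sda /dev/sdb

    # Checksumming is on by default; the algorithm is a per-dataset
    # property you can inspect.
    zfs get checksum tank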

These repairs may happen on the fly, when you try to read or modify the data and the filesystem notices that the requested blocks aren't entirely kosher, or during a zpool scrub operation. It is common to schedule a scrub to run periodically on ZFS pools that hold rarely-accessed files, since the filesystem wouldn't otherwise discover hardware data loss in the normal course of operation. Pools running on dodgy hardware commonly show some number of fixed blocks after every scrub.

Scrubbing is kinda sorta like fsck for other Unix-type filesystems, except that it happens online, while the filesystem is mounted and usable, and it runs in the background at low priority so that normal pool I/O takes precedence. Also, fsck implementations typically check only metadata, not data, but ZFS checksums both and so can detect errors in both. If these integrity mechanisms decide that one copy of a block is bad, they can use the checksums to work out which copy is good and repair the corrupted copies from it.
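In practice that looks something like the following; the pool name is a placeholder, and many distributions already ship their own periodic scrub job, so check before adding one:

    # Start a scrub; the pool stays mounted and usable while it runs.
    zpool scrub tank

    # Check progress and see whether any blocks were repaired.
    zpool status tank

    # One way to scrub monthly, as a crontab entry:
    # 0 3 1 * * /sbin/zpool scrub tank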

assuming the storage media, hardware and software are still working reliably/properly, what else has to have gone wrong, for the result to be loss of a pool?

As far as I'm aware, there is no such case. Either one of the three things you mention has failed or ZFS will mount the pool and read from it.

Clearly insufficient redundancy+ disk failure (or other major hardware issue) is one case

Yes, though that can happen in a subtler case than I think you're considering.

Take a simple 2-way mirror. I think you're thinking about one of the disks being physically removed from the computer, or at least inaccessible for some reason. But, imagine sector 12345 being corrupted on both disks. Then all the clever checksums and redundancy in ZFS can't help you: both copies are damaged, so the whole block containing that sector cannot be read.

But here's the clever bit: because ZFS is both a filesystem and a volume manager, as opposed to a lash-up like hardware RAID + ext4 or LVM2 + ext4, a zpool status -v command will tell you which file is irrecoverably damaged. If you remove that file, the pool returns to an undamaged state; the problem has been removed along with it. The lash-ups that separate the filesystem from the RAID and LVM pieces can't do that.
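A rough sketch of that workflow, again with a placeholder pool name (the file path reported would be whatever zpool status -v prints):

    # List the files affected by permanent (unrecoverable) errors.
    zpool status -v tank

    # After deleting the damaged file or restoring it from backup,
    # clear the error counters and re-scrub to confirm the pool is clean.
    zpool clear tank
    zpool scrub tank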

Which situations have to arise before it can't, and what has to happen to give rise to them?

The only case I'm aware of is something like the above example, where data corruption has damaged enough of the redundant copies of key filesystem metadata that ZFS is unable to read it.

For that reason, with today's extremely large disks — 100 trillion bits! — I recommend that you configure ZFS (or any other RAID or LVM system for that matter) with at least dual redundancy. In ZFS terms, that means raidz2, 3-way mirrors, or higher.
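For illustration, two layouts that satisfy that recommendation (pool name and device paths are placeholders):

    # raidz2: survives the loss of any two disks in the vdev, or
    # corruption of the same block on two of them.
    zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

    # 3-way mirror: three full copies of every block.
    zpool create tank mirror /dev/sda /dev/sdb /dev/sdc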

That said, ZFS normally stores additional copies of all filesystem metadata beyond the normal levels of redundancy used for regular file data. For example, a 2-way mirror will store 2 copies of regular user data but 4 copies of all metadata. You can dial this back for performance, but you can't turn it off entirely.
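The knobs involved, as a sketch (the dataset names are placeholders, and the accepted values vary somewhat between ZFS versions):

    # Extra metadata copies ("ditto blocks") are controlled per dataset;
    # "all" is the default, "most" reduces the extra copies for some
    # metadata types.
    zfs get redundant_metadata tank
    zfs set redundant_metadata=most tank

    # The copies property works in the other direction: it stores extra
    # copies of regular file data on top of the vdev-level redundancy.
    zfs set copies=2 tank/important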


There is a chapter in the ZFS manual on ZFS failure modes which you may find enlightening.