Upgrading ZFS pool size with different-sized disks

hard drive, raid, zfs

I currently have a setup where I am using an old desktop as a media server, but I have no fault tolerance and the amount of media on there is too large for me to reasonably back it all up. I'm not terribly concerned about losing it in case of a drive failure, since it's just movies and TV shows and the like (many of which I still have the DVDs for, packed away somewhere), but I'm currently upgrading my system and I'd like to add in some fault tolerance here.

Currently, I have a 1TB, 2TB and 3TB drive, with probably around 5.5TB used, and I'm thinking that I will buy two more 4TB drives and set up a 4TB x 4TB x 3TB RAIDZ array.

My questions are:

  1. My understanding is that for heterogeneous disk arrays like this, the pool size will be limited to the size of the smallest disk, which would mean I'd be looking at a 3 x 3 x 3 TB RAIDZ with 6TB usable space and 2TB of non-fault-tolerant space – is this true?*
  2. Assuming #1 is true, when I eventually need more space, if I add a single 4TB or 6TB drive to the array, will it be a simple matter to extend the pool to become a 4 x 4 x 4 TB array, or will I need to find somewhere to stash the 6TB of data while upgrading the array?

*The 2TB of non-fault-tolerant space is not that big a deal, because I was planning on setting aside around 2TB for "stuff that needs proper backup" (personal photos, computer snapshots, etc), which I would mirror to the remaining 2TB disk and a 2nd 2TB external drive that I will keep somewhere else.

Best Answer

Currently, I have a 1TB, 2TB and 3TB drive, with probably around 5.5TB used, and I'm thinking that I will buy two more 4TB drives and set up a 4TB x 4TB x 3TB RAIDZ array.

With 4 TB drives, you shouldn't be looking at single redundancy RAIDZ. I would recommend RAIDZ2 because of the additional protection it affords in case one drive somehow breaks or otherwise develops problems.
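For reference, creating such a pool is a single command. A minimal sketch, assuming a hypothetical pool name of tank and placeholder device paths (on Linux, prefer stable /dev/disk/by-id paths over sdX names, which can change between boots):

    # Create a RAIDZ2 pool from four disks (two disks' worth of parity).
    # Pool name and device paths are placeholders - substitute your own.
    # ashift=12 forces 4K sectors; it is a create-time property on OpenZFS.
    zpool create -o ashift=12 tank raidz2 \
        /dev/disk/by-id/ata-DISK1 \
        /dev/disk/by-id/ata-DISK2 \
        /dev/disk/by-id/ata-DISK3 \
        /dev/disk/by-id/ata-DISK4

    # Verify the layout before copying data onto it
    zpool status tank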

Remember that consumer drives are usually spec'd to a URE rate of one failed sector per 10^14 bits read. 1 TB (hard disk drive manufacturer terabyte, that is) is 10^12 bytes or close to 10^13 bits, give or take a small amount of change. A full read pass of the array you have in mind is statistically likely to encounter a problem, and in practice, read problems tend to develop in batches.

I'm not sure why you are suggesting RAIDZ2. Is it more likely that I will develop two simultaneous drive failures if I use RAIDZ1 than if I use no RAID? I want some improvement to the fault tolerance of my system. Nothing unrecoverable will exist in only one place, so the RAID array is just a matter of convenience.

RAIDZ1 uses a single disk's worth of space to provide redundancy in a vdev, whereas RAIDZ2 uses two (and somewhat more complex parity calculations, but you are unlikely to be throughput-limited by RAIDZ calculations anyway). The benefit of a second redundant disk is in case the first fails or otherwise becomes unavailable: with only one disk's worth of redundancy, once a disk has failed, any additional read error during the rebuild is critical. With 4+4+3 TB you have 11 TB of raw storage, and initially about 6 TB of it may need to be read to reconstruct a lost disk (8 TB once you upgrade the 3 TB drive to a 4 TB one and expand the pool to match). For order-of-magnitude estimates, that is somewhere between 10^13 and 10^14 bits. Statistically, with single redundancy you are looking at something like even odds of hitting an unrecoverable read error during a resilver of that size, and since read problems tend to come in batches, the real-world risk is likely higher. Sure, you may very well luck out, but the point is that once a single drive fails, you have next to no protection left.
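To put a rough number on that (a back-of-the-envelope sketch, taking the one-error-per-10^14-bits spec at face value and assuming errors are independent, which real drives do not strictly obey):

    P(\text{at least one URE}) \;\approx\; 1 - \left(1 - 10^{-14}\right)^{N} \;\approx\; 1 - e^{-N \cdot 10^{-14}}

For a resilver that reads N ≈ 4.8 × 10^13 bits (6 TB), that comes out to roughly 40%; at 8 TB read it is closer to 50%. Since errors cluster in practice, treat these as optimistic lower bounds.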

My understanding is that for heterogeneous disk arrays like this, the pool size will be limited to the size of the smallest disk, which would mean I'd be looking at a 3 x 3 x 3 TB RAIDZ with 6TB usable space and 2TB of non-fault-tolerant space - is this true?

Almost. ZFS will restrict the vdev to the size of its smallest constituent device, so you get the effective capacity of a three-device RAIDZ vdev made up of 3 TB devices: 6 TB of user-accessible storage (give or take metadata). The remaining 2 TB of raw storage is wasted; it is not available for use even without redundancy. (It will show up in the EXPANDSZ column of zpool list, but it isn't being used.)
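You can check this on a live pool; something along these lines (the pool name is a placeholder):

    zpool list -v tank    # raw per-vdev sizes; EXPANDSZ is space present but not yet usable
    zfs list tank         # capacity actually visible to users, after parity and metadata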

Once you replace the 3 TB drive with a 4 TB drive and expand the vdev (both of which are online operations in ZFS), the pool can use the additional storage space.
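A sketch of what that replacement would look like, assuming a pool named tank and placeholder device names; with autoexpand=on the pool grows as soon as the last smaller disk in the vdev has been replaced:

    # Allow the pool to grow automatically once all devices in a vdev are bigger
    zpool set autoexpand=on tank

    # Swap the 3 TB disk for the 4 TB one; ZFS resilvers onto the new device
    zpool replace tank /dev/disk/by-id/ata-OLD3TB /dev/disk/by-id/ata-NEW4TB

    # Watch the resilver; the pool stays online and usable throughout
    zpool status tank

    # If autoexpand was off when you replaced the disk, expand manually
    zpool online -e tank /dev/disk/by-id/ata-NEW4TB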

There are ways around this restriction -- for example, you could partition the drives to present three 3 TB devices and two 1 TB devices (the remainder of the two 4 TB drives) to ZFS -- but it's going to seriously complicate your setup and it's unlikely to work the way you plan. I strongly recommend against that.

The 2 TB of non-fault-tolerant space would not be backed up by ZFS to the offline disks; sorry if that was not clear. I was suggesting that I would back it up with normal disk-syncing operations like rsync.

That implies that ZFS has no knowledge of those two 1 TB remainders, and that you are creating some other file system in that space. Yes, you can do that, but again, it's going to seriously complicate your setup for, quite frankly, very little apparent gain.

Assuming #1 is true, when I eventually need more space, if I add a single 4TB or 6TB drive to the array, will it be a simple matter to extend the pool to become a 4 x 4 x 4 TB array, or will I need to find somewhere to stash the 6TB of data while upgrading the array?

As I said above, ZFS vdevs and pools can be grown as an online operation if you do it by gradually replacing devices. (It is not, however, possible to shrink a ZFS pool or vdev.)

What you cannot do, however, is add additional devices to an existing vdev (such as the three-device RAIDZ vdev you are envisioning); instead, an entirely new vdev must be added to the pool, and data written afterwards is striped across the vdevs in the pool. Each vdev has its own redundancy requirements, though they can share hot spares.

You also cannot remove devices from a vdev, except in the case of mirrors (where removing a device only reduces the redundancy level of that particular mirror vdev and does not affect the amount of user-accessible storage), and you cannot remove vdevs from a pool. The only way to do the latter (and by consequence, the only way to fix some pool configuration mishaps) is to recreate the pool and transfer the data from the old pool, possibly by way of backups, to the new pool.
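Concretely, the two growth options look something like this (a sketch with placeholder pool and device names):

    # Option 1: grow the existing RAIDZ vdev by replacing its members one at a time.
    # Repeat for each disk; the extra capacity only appears once every member is larger.
    zpool replace tank ata-OLDDISK ata-BIGGERDISK

    # Option 2: add a second, independent vdev; new writes are striped across both.
    # This vdev needs its own redundancy - a single bare disk added here would be
    # a single point of failure for the entire pool.
    zpool add tank raidz ata-NEWDISK1 ata-NEWDISK2 ata-NEWDISK3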

The 2TB of non-fault-tolerant space is not that big a deal, because I was planning on setting aside around 2TB for "stuff that needs proper backup" (personal photos, computer snapshots, etc), which I would mirror to the remaining 2TB disk and a 2nd 2TB external drive that I will keep somewhere else.

ZFS redundancy isn't really designed for the mostly-offline offsite-backup-drive use case. I discuss this in some depth in Would a ZFS mirror with one drive mostly offline work?, but the long and short of it is that it's better to use zfs send/zfs receive to copy the contents of a ZFS file system (including snapshots and other paraphernalia), or plain rsync if you don't care about snapshots, than to use mirrors in a mostly-offline setup.
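A minimal sketch of the send/receive approach, assuming a source dataset tank/important and a pool named backup on the external drive (all names hypothetical):

    # Full copy: snapshot the dataset and send it to the backup pool
    zfs snapshot -r tank/important@snap1
    zfs send -R tank/important@snap1 | zfs receive -u backup/important

    # Later runs only transfer the changes since the previous snapshot
    zfs snapshot -r tank/important@snap2
    zfs send -R -i tank/important@snap1 tank/important@snap2 | \
        zfs receive -u backup/important

    # Export the backup pool before unplugging the external drive
    zpool export backup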

If I'm using half my disks for fault tolerance, I might as well just use traditional offline backups.

This admittedly depends a lot on your situation. What are your recovery-time requirements in different failure scenarios? RAID is about uptime and time to recovery, not about safeguarding data; you need backups either way.
