Current gptzfsboot limitations

Matt Reimer mattjreimer at gmail.com
Tue Dec 8 00:23:26 UTC 2009


On Fri, Nov 20, 2009 at 4:46 PM, Matt Reimer <mattjreimer at gmail.com> wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).

With some help from John Baldwin I've been able to fix these
limitations. I've posted three patches to fs at . I've successfully
booted from a degraded raidz2 pool, with a drive offline, and also
with single and double errors in a given block.

> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make more the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).

The heap size has been increased from roughly 400KB to 48MB, and the
ZFS temp buffer has been increased from 128KB to 1MB. I think 1MB
should be enough for the worst case, where compression is enabled and
two child vdevs in a raidz2 vdev are offline.
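
For reference, the temp allocator (zfs_alloc_temp() in zfsimpl.c) is
just a bump pointer over a static zfs_temp_buf arena, so one big
allocation early in an I/O can starve everything after it. Roughly
like this -- a sketch using the names from this thread, not the
committed code:

    /*
     * Sketch of a bump-style temp allocator along the lines of
     * zfsimpl.c's zfs_alloc_temp().  TEMP_SIZE, zfs_temp_buf, etc.
     * follow the names discussed above; the details are illustrative.
     */
    #include <stddef.h>
    #include <stdio.h>

    #define TEMP_SIZE   (1024 * 1024)       /* was 128KB, now 1MB */

    static char zfs_temp_buf[TEMP_SIZE];    /* the arena */
    static char *zfs_temp_ptr = zfs_temp_buf;
    static char *zfs_temp_end = zfs_temp_buf + TEMP_SIZE;

    static void *
    zfs_alloc_temp(size_t sz)
    {
            char *p;

            if (zfs_temp_ptr + sz > zfs_temp_end) {
                    /* Real boot code has no good way to recover here. */
                    printf("ZFS: out of temporary buffer space\n");
                    return (NULL);
            }
            p = zfs_temp_ptr;
            zfs_temp_ptr += sz;
            return (p);
    }

    /* Nothing is freed individually; the arena is reset between I/Os. */
    static void
    zfs_reset_temp(void)
    {
            zfs_temp_ptr = zfs_temp_buf;
    }

Since nothing is freed until the whole arena is reset, a 128KB
compressed block plus the raidz column buffers easily blew through the
old 128KB arena; 1MB leaves headroom even in the degraded cases.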

> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable. The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. pxeboot) so one or no disks
> are bad.

This is fixed by the increased heap size and temp buffer size, and
also by tweaking the raidz code a bit to reuse its temp buffers.
Before, temp buffer usage could grow exponentially with the number of
drives in the raidz vdev; now it uses at most 4x the size of the
largest column's I/O. For example, if a 128KB raidz2 block is broken
down into 4 * 32KB data + 2 * 32KB parity, then the largest column I/O
is 32KB and the max memory use is 4 * 32KB = 128KB.
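
In other words, the bound is four column-sized buffers. A trivial
host-side check of the arithmetic (hypothetical helper, not loader
code):

    /*
     * Worked example of the bound above: a raidz block is split across
     * the data columns, so the largest column I/O is roughly
     * blocksize / dcols, and the reworked reconstruction path needs at
     * most four such buffers.  Hypothetical helper, not loader code.
     */
    #include <stdio.h>

    static size_t
    raidz_max_temp(size_t blocksize, int dcols)
    {
            size_t colsize = blocksize / dcols;     /* largest column I/O */

            return (4 * colsize);                   /* at most 4 temp buffers */
    }

    int
    main(void)
    {
            /* The case above: 128KB block, raidz2 with 4 data columns. */
            printf("max temp use: %zu bytes\n",
                raidz_max_temp(128 * 1024, 4));
            return (0);
    }

which prints 131072 bytes, i.e. the 4 * 32KB from the example.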

> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot then will read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums,
> resulting in the recovery code path being run in vdev_raidz_read(),
> possibly leading to zfs_temp_buf exhaustion as in #2 above.

This is fixed by getting each drive's status from the newest vdev
label, rather than assuming all drives are healthy.
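
The idea is just to trust the on-disk state: each drive carries
several copies of the vdev label, and the copy with the highest txg
wins. Something along these lines -- the types and field names are
illustrative, not the actual zfsimpl.c structures:

    /*
     * Sketch of the idea behind the fix: take each child vdev's state
     * from the newest label (highest txg) instead of hard-coding
     * "healthy".  Types and field names are illustrative only.
     */
    #include <stdint.h>

    typedef enum {
            VDEV_STATE_UNKNOWN,
            VDEV_STATE_OFFLINE,
            VDEV_STATE_REMOVED,
            VDEV_STATE_FAULTED,
            VDEV_STATE_HEALTHY
    } vdev_state_t;

    struct label_info {
            uint64_t        txg;    /* txg the label was written in */
            vdev_state_t    state;  /* drive state recorded in the label */
    };

    static vdev_state_t
    vdev_state_from_labels(const struct label_info *labels, int nlabels)
    {
            uint64_t best_txg = 0;
            vdev_state_t state = VDEV_STATE_UNKNOWN;
            int i;

            for (i = 0; i < nlabels; i++) {
                    if (labels[i].txg >= best_txg) {
                            best_txg = labels[i].txg;
                            state = labels[i].state;
                    }
            }
            return (state);
    }

With the real state known up front, vdev_raidz_read() goes straight to
reconstructing the missing columns instead of reading from the bad
vdev as if it were healthy, failing checksums, and only then falling
into the recovery path that used to exhaust the temp buffer.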

> I think I've also hit a stack overflow a couple of times while debugging.

Actually, what was happening was that the heap was overrunning the
stack. I fixed this by moving the heap to the 16MB-64MB range, so the
stack now has much more room to grow.
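
Sketching the idea with libstand's setheap(void *base, void *top)
interface (the actual patch may set the bounds up differently, and
gptzfsboot's own simple malloc() may track them in its own variables):

    /*
     * Sketch of the heap placement.  The 16MB-64MB range matches the
     * numbers above; where exactly this happens in the boot/loader
     * startup is not shown here.
     */
    #include <stand.h>

    #define HEAP_START      0x1000000      /* 16MB */
    #define HEAP_END        0x4000000      /* 64MB */

    static void
    heap_init(void)
    {
            /*
             * malloc() now draws from [16MB, 64MB), well away from the
             * stack, instead of from the scraps below 640KB.
             */
            setheap((void *)HEAP_START, (void *)HEAP_END);
    }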

If these three patches are acceptable, can someone commit them and MFC?

Matt

