Current gptzfsboot limitations

Matt Reimer mattjreimer at gmail.com
Sat Nov 21 00:46:55 UTC 2009


I've been analyzing gptzfsboot to see what its limitations are. I
think it should now work fine for a healthy pool with any number of
disks, with any type of vdev, whether single disk, stripe, mirror,
raidz or raidz2.

But there are currently several limitations (likely in loader.zfs
too), mostly due to the limited amount of memory available (< 640KB)
and the simple memory allocators in use (a minimal malloc() and
zfs_alloc_temp()).
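
For context, my understanding is that zfs_alloc_temp() is just a bump
allocator over a fixed static buffer, roughly along the lines of the
sketch below (simplified, not the exact zfsimpl.c code):

#include <stddef.h>

/*
 * Simplified sketch of the zfs_temp_buf arena; the real code is in
 * sys/boot/zfs/zfsimpl.c and differs in detail (e.g. in how it
 * reports exhaustion).
 */
#define TEMP_SIZE (128 * 1024)          /* the 128KB default */

static char zfs_temp_buf[TEMP_SIZE];
static char *zfs_temp_ptr = zfs_temp_buf;

static void *
zfs_alloc_temp(size_t sz)
{
        char *p;

        /*
         * Bump-pointer allocation with no free(): space is only ever
         * consumed, never returned, until the arena runs out.
         */
        if (zfs_temp_ptr + sz > zfs_temp_buf + TEMP_SIZE)
                return (NULL);          /* arena exhausted */
        p = zfs_temp_ptr;
        zfs_temp_ptr += sz;
        return (p);
}

Everywhere below that mentions "the temporary buffer" it is this
arena that gets exhausted.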

1. gptzfsboot might fail to read compressed files on raidz/raidz2
pools. The reason is that the temporary buffer used for I/O
(zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
compressed block will require a 128KB buffer to be allocated before
the I/O is done, leaving nothing for the raidz code further on. The
fix would be to make the temporary buffer larger, but for some
reason it's not as simple as just changing the TEMP_SIZE define
(possibly a stack overflow results; more debugging needed).
Workaround: don't enable compression on your root filesystem (aka
bootfs).
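
To make the failure mode concrete, the ordering is roughly as in the
sketch below. The helper names (read_physical_block(),
decompress_block()) are placeholders of mine, not the actual
zfsimpl.c routines; only the order of the allocations matters:

#include <errno.h>
#include <stddef.h>

/* Placeholders standing in for the real zfsimpl.c routines. */
void *zfs_alloc_temp(size_t);
int read_physical_block(void *buf, size_t psize);
int decompress_block(const void *src, size_t psize, void *dst,
    size_t lsize);

static int
read_compressed_block(size_t psize, size_t lsize, void *dest)
{
        void *staging;

        /*
         * The compressed (physical) bytes need a staging buffer
         * before the vdev I/O runs; a 128KB compressed block takes
         * the entire 128KB default arena in this one allocation...
         */
        staging = zfs_alloc_temp(psize);
        if (staging == NULL)
                return (ENOMEM);

        /*
         * ...so when this read lands on a raidz/raidz2 vdev and that
         * code asks the same arena for its own buffers, nothing is
         * left and the read fails.
         */
        if (read_physical_block(staging, psize) != 0)
                return (EIO);

        return (decompress_block(staging, psize, dest, lsize));
}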

2. gptzfsboot might fail to reconstruct a file that is read from a
degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
the pool is healthy but the checksums don't match). The reason again
is that the temporary buffer gets exhausted. I think this will only
happen in the case where more than one physical block is corrupt or
unreadable. The fix has several aspects: 1) make the temporary buffer
much larger, perhaps larger than 640KB; 2) change
zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
when possible; and 3) either restructure
zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
buffers once per I/O, or use a malloc that has free() implemented.
Workaround: repair your pool somehow (e.g. by booting via pxeboot) so
that at most one disk is bad.
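
For aspect 2, the reuse could be as little as a grow-only cache in
front of zfs_alloc_temp(), so that repeated reconstruction attempts
within one top-level read don't each take a fresh slice of the arena.
A sketch of the idea (get_reconstruct_buf() is a name I made up):

#include <stddef.h>

void *zfs_alloc_temp(size_t);   /* the existing allocator */

static void *reconstruct_buf;   /* reused across recovery attempts */
static size_t reconstruct_buf_size;

static void *
get_reconstruct_buf(size_t sz)
{
        /*
         * Grow-only cache: take a new slice of the arena only when a
         * larger buffer than any seen so far is needed; otherwise
         * hand back the one we already have.
         */
        if (sz > reconstruct_buf_size) {
                reconstruct_buf = zfs_alloc_temp(sz);
                if (reconstruct_buf == NULL) {
                        reconstruct_buf_size = 0;
                        return (NULL);
                }
                reconstruct_buf_size = sz;
        }
        return (reconstruct_buf);
}

If the arena is ever reset between top-level reads, this cache would
need to be reset at the same point.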

3. gptzfsboot might fail to boot from a degraded pool that has one or
more drives marked offline, removed, or faulted. The reason is that
vdev_probe() assumes that all vdevs are healthy, regardless of their
true state. gptzfsboot will then read from an offline/removed/faulted
vdev as if it were healthy, likely producing checksum failures, which
trigger the recovery code path in vdev_raidz_read() and can lead to
zfs_temp_buf exhaustion as in #2 above.
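
The obvious shape of a fix is a per-vdev readability check driven by
the state recorded in the label, along the lines of the sketch below.
The vdev_state_t values mirror sys/fs/zfs.h; the vdev struct here is
just a stand-in for the one in the boot code:

/* Mirrors vdev_state_t from sys/fs/zfs.h. */
typedef enum vdev_state {
        VDEV_STATE_UNKNOWN = 0,
        VDEV_STATE_CLOSED,
        VDEV_STATE_OFFLINE,
        VDEV_STATE_REMOVED,
        VDEV_STATE_CANT_OPEN,
        VDEV_STATE_FAULTED,
        VDEV_STATE_DEGRADED,
        VDEV_STATE_HEALTHY
} vdev_state_t;

/* Stand-in for the boot code's vdev; only the state matters here. */
typedef struct vdev {
        vdev_state_t v_state;
} vdev_t;

static int
vdev_is_readable(const vdev_t *vd)
{
        switch (vd->v_state) {
        case VDEV_STATE_HEALTHY:
        case VDEV_STATE_DEGRADED:  /* readable, may need reconstruction */
                return (1);
        default:                   /* offline, removed, faulted, ... */
                return (0);
        }
}

The read paths would then skip any child for which this returns 0
instead of issuing the I/O and failing checksums.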

A partial patch for #3 is attached, but it is inadequate: it only
reads a vdev's status from the first device's (in BIOS order)
vdev_label. As a result, if the first device is marked offline,
gptzfsboot won't notice, because only the other devices' vdev_labels
record that the first device is offline. (Once a device has been
offlined, no further writes are made to it, so its own vdev_label is
never updated to say it is offline.) To complete the patch, each leaf
vdev's status would need to be set from the newest vdev_label rather
than from the first vdev_label seen.
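
Concretely, that means keeping, for each leaf vdev, the state from
whichever label is newest as labels are read off each BIOS-visible
device. Assuming the label's pool-config txg works as the "newer
than" test, the bookkeeping is small; the names below are mine, not
the patch's:

#include <stdint.h>

/* Per-leaf-vdev record of where the current state came from. */
struct vdev_status {
        uint64_t vs_label_txg;  /* txg of the label that gave vs_state */
        int      vs_state;      /* VDEV_STATE_* value from that label */
};

/*
 * Called for each label read from each BIOS-visible device: only a
 * newer label may override the state recorded so far, so the stale
 * label on an offlined disk loses to the labels the pool kept
 * writing on the other disks.
 */
static void
vdev_update_state(struct vdev_status *vs, uint64_t label_txg,
    int label_state)
{
        if (label_txg > vs->vs_label_txg) {
                vs->vs_label_txg = label_txg;
                vs->vs_state = label_state;
        }
}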

I think I've also hit a stack overflow a couple of times while debugging.

I don't know enough about the gptzfsboot/loader.zfs environment to
know whether the heap size could be easily enlarged, or whether there
is room for a real malloc() with free(). loader(8) seems to use the
malloc() in libstand. Can anyone shed some light on the memory
limitations and possible solutions?
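
For what it's worth, libstand's malloc()/free() (the zalloc code)
only needs a heap range handed to setheap(), so if gptzfsboot could
spare a region the setup itself is one call. The addresses below are
made up purely for illustration; whether such a region exists in the
< 640KB environment is exactly the question:

#include <stand.h>      /* libstand: setheap(), malloc(), free() */

/*
 * Hypothetical heap placement, for illustration only; finding real
 * room for it is the open question.
 */
#define HEAP_START ((void *)0x80000)    /* 512KB */
#define HEAP_END   ((void *)0xa0000)    /* 640KB */

static void
heap_init(void)
{
        /*
         * From here on malloc()/free() work, so temporary buffers
         * could actually be released instead of leaking out of a
         * bump arena.
         */
        setheap(HEAP_START, HEAP_END);
}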

I won't be able to spend much more time on this, but I wanted to pass
on what I've learned in case someone else has the time and boot fu to
take it to the next step.

Matt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: zfsboot-status.patch
Type: application/octet-stream
Size: 3687 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20091121/59e95c29/zfsboot-status.obj

