Current gptzfsboot limitations
John Baldwin
jhb at freebsd.org
Mon Nov 23 15:40:56 UTC 2009
On Friday 20 November 2009 7:46:54 pm Matt Reimer wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).
>
> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).
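The exhaustion sequence in #1 can be sketched as a toy model (the names
reserve() and read_compressed_block() are illustrative, not the real
zfsimpl.c code): with a 128KB arena, reserving 128KB for the compressed
copy of a block leaves nothing for the raidz buffers needed next.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the #1 failure ordering: a 128KB arena with no free(),
 * in the style of zfs_alloc_temp().  Names are hypothetical. */
#define TEMP_SIZE (128 * 1024)

static size_t temp_left = TEMP_SIZE;

/* Stand-in for zfs_alloc_temp(): carve from the arena, never reclaim. */
static int reserve(size_t size)
{
	if (size > temp_left)
		return (-1);
	temp_left -= size;
	return (0);
}

static int read_compressed_block(size_t blocksize, size_t raidz_need)
{
	/* The buffer for the compressed bytes is reserved first... */
	if (reserve(blocksize) != 0)
		return (-1);
	/* ...so the raidz per-column buffers cannot be allocated. */
	if (reserve(raidz_need) != 0)
		return (-1);
	return (0);
}
```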
>
> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable. The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. boot via pxeboot) so that
> at most one disk is bad.
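Fix aspect 3 above (a malloc that has free() implemented) could look
roughly like this minimal first-fit allocator over a static arena; it is
a sketch of the general technique, not the real boot code, and all names
here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a tiny first-fit allocator with a working free(), as a
 * stand-in for the free()-less boot malloc.  Hypothetical names. */
#define HEAP_SIZE (256 * 1024)

struct chunk {
	size_t size;		/* payload bytes in this chunk */
	int used;
	struct chunk *next;
};

static _Alignas(16) uint8_t heap[HEAP_SIZE];
static struct chunk *head;

static void heap_init(void)
{
	head = (struct chunk *)heap;
	head->size = HEAP_SIZE - sizeof(struct chunk);
	head->used = 0;
	head->next = NULL;
}

static void *heap_alloc(size_t size)
{
	size = (size + 15) & ~(size_t)15;	/* round sizes to 16 bytes */
	for (struct chunk *c = head; c != NULL; c = c->next) {
		if (c->used || c->size < size)
			continue;
		/* Split off the tail if there is room for another chunk. */
		if (c->size >= size + sizeof(struct chunk) + 16) {
			struct chunk *tail =
			    (struct chunk *)((uint8_t *)(c + 1) + size);
			tail->size = c->size - size - sizeof(struct chunk);
			tail->used = 0;
			tail->next = c->next;
			c->size = size;
			c->next = tail;
		}
		c->used = 1;
		return (c + 1);
	}
	return (NULL);
}

static void heap_free(void *p)
{
	if (p == NULL)
		return;
	struct chunk *c = (struct chunk *)p - 1;
	c->used = 0;
	/* Coalesce with the following chunk if it is also free. */
	if (c->next != NULL && !c->next->used) {
		c->size += sizeof(struct chunk) + c->next->size;
		c->next = c->next->next;
	}
}
```

With free() available, vdev_raidz_reconstruct_pq() could release its
temporary buffers after each I/O instead of leaking them from the arena.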
>
> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot then will read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums,
> resulting in the recovery code path being run in vdev_raidz_read(),
> possibly leading to zfs_temp_buf exhaustion as in #2 above.
>
> A partial patch for #3 is attached, but it is inadequate because it
> only reads a vdev's status from the first device's (in BIOS order)
> vdev_label, with the result that if the first device is marked
> offline, gptzfsboot won't see this because only the other devices'
> vdev_labels will indicate that the first device is offline. (Since
> after a device is offlined no further writes will be made to the
> device, its vdev_label is not updated to reflect that it's offline.)
> To complete the patch it would be necessary to set each leaf vdev's
> status from the newest vdev_label rather than from the first
> vdev_label seen.
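The "newest label wins" rule needed to complete the patch can be
sketched as follows; the struct and field names are illustrative stand-ins
for whatever zfsimpl.c actually parses out of the nvlists, but the idea
is just to keep the state from the label with the highest txg rather than
the first label found in BIOS probe order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical digest of one vdev_label as seen while probing. */
struct label_info {
	uint64_t txg;	/* transaction group that wrote this label */
	int state;	/* e.g. 0 = healthy, 2 = offline (illustrative) */
};

/* Return the vdev state recorded in the newest label, regardless of
 * the order in which the labels were discovered. */
static int newest_state(const struct label_info *labels, size_t n)
{
	uint64_t best_txg = 0;
	int state = -1;

	for (size_t i = 0; i < n; i++) {
		if (labels[i].txg >= best_txg) {
			best_txg = labels[i].txg;
			state = labels[i].state;
		}
	}
	return (state);
}
```

This handles the case described above: an offlined device's own stale
label still claims it is healthy, but a sibling's newer label records the
offline state and wins on txg.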
>
> I think I've also hit a stack overflow a couple of times while debugging.
>
> I don't know enough about the gptzfsboot/loader.zfs environment to
> know whether the heap size could be easily enlarged, or whether there
> is room for a real malloc() with free(). loader(8) seems to use the
> malloc() in libstand. Can anyone shed some light on the memory
> limitations and possible solutions?
>
> I won't be able to spend much more time on this, but I wanted to pass
> on what I've learned in case someone else has the time and boot fu to
> take it the next step.
One issue is that disk transfers need to happen in the lower 1MB due to BIOS
limitations. The loader uses a bounce buffer (in biosdisk.c in libi386) to
make this work ok. The loader uses memory > 1MB for malloc(). You could
probably change zfsboot to do that as well if not already. Just note that
drvread() has to bounce buffer requests in that case. The text + data + bss
+ stack is all in the lower 640k and there's not much you can do about that.
The stack grows down from 640k, and the boot program text + data starts at
64k with the bss following. Hmm, drvread() might already be bounce buffering
since boot2 has to do so since it copies the loader up to memory > 1MB as
well. You might need to use memory > 2MB for zfsboot's malloc() so that the
loader can be copied up to 1MB. It looks like you could patch malloc() in
zfsboot.c to use 4*1024*1024 as heap_next and maybe 64*1024*1024 as heap_end
(this assumes all machines that boot ZFS have at least 64MB of RAM, which is
probably safe).
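The suggested zfsboot.c change could be sketched like this (a bump
allocator over the 4MB..64MB range; heap_next/heap_end mirror the names
above, boot_malloc() and the rest are illustrative, and it assumes, as
stated, at least 64MB of RAM):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed heap placement: above 2MB so the loader can
 * still be copied up to 1MB, with drvread() bounce-buffering any disk
 * transfers through memory below 1MB for the BIOS. */
static uintptr_t heap_next = 4UL * 1024 * 1024;		/* start at 4MB */
static uintptr_t heap_end = 64UL * 1024 * 1024;		/* assume >= 64MB RAM */

static void *boot_malloc(size_t size)
{
	size = (size + 15) & ~(size_t)15;	/* round sizes to 16 bytes */
	if (size > heap_end - heap_next)
		return (NULL);			/* 60MB heap exhausted */
	void *p = (void *)heap_next;
	heap_next += size;
	return (p);
}
```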
--
John Baldwin