Current gptzfsboot limitations
John Baldwin
jhb at freebsd.org
Mon Nov 23 15:40:56 UTC 2009
On Friday 20 November 2009 7:46:54 pm Matt Reimer wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).
>
> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).
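The exhaustion sequence in #1 can be sketched as a toy model (the names
reserve() and read_compressed_block() are illustrative, not the real
zfsimpl.c code): with a 128KB arena, reserving 128KB for the compressed
copy of a block leaves nothing for the raidz buffers needed next.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the #1 failure ordering: a 128KB arena with no free(),
 * in the style of zfs_alloc_temp().  Names are hypothetical. */
#define TEMP_SIZE (128 * 1024)

static size_t temp_left = TEMP_SIZE;

/* Stand-in for zfs_alloc_temp(): carve from the arena, never reclaim. */
static int reserve(size_t size)
{
	if (size > temp_left)
		return (-1);
	temp_left -= size;
	return (0);
}

static int read_compressed_block(size_t blocksize, size_t raidz_need)
{
	/* The buffer for the compressed bytes is reserved first... */
	if (reserve(blocksize) != 0)
		return (-1);
	/* ...so the raidz per-column buffers cannot be allocated. */
	if (reserve(raidz_need) != 0)
		return (-1);
	return (0);
}
```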
>
> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable. The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. boot via pxeboot) so that
> at most one disk is bad.
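Fix aspect 3 above (a malloc that has free() implemented) could look
roughly like this minimal first-fit allocator over a static arena; it is
a sketch of the general technique, not the real boot code, and all names
here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a tiny first-fit allocator with a working free(), as a
 * stand-in for the free()-less boot malloc.  Hypothetical names. */
#define HEAP_SIZE (256 * 1024)

struct chunk {
	size_t size;		/* payload bytes in this chunk */
	int used;
	struct chunk *next;
};

static _Alignas(16) uint8_t heap[HEAP_SIZE];
static struct chunk *head;

static void heap_init(void)
{
	head = (struct chunk *)heap;
	head->size = HEAP_SIZE - sizeof(struct chunk);
	head->used = 0;
	head->next = NULL;
}

static void *heap_alloc(size_t size)
{
	size = (size + 15) & ~(size_t)15;	/* round sizes to 16 bytes */
	for (struct chunk *c = head; c != NULL; c = c->next) {
		if (c->used || c->size < size)
			continue;
		/* Split off the tail if there is room for another chunk. */
		if (c->size >= size + sizeof(struct chunk) + 16) {
			struct chunk *tail =
			    (struct chunk *)((uint8_t *)(c + 1) + size);
			tail->size = c->size - size - sizeof(struct chunk);
			tail->used = 0;
			tail->next = c->next;
			c->size = size;
			c->next = tail;
		}
		c->used = 1;
		return (c + 1);
	}
	return (NULL);
}

static void heap_free(void *p)
{
	if (p == NULL)
		return;
	struct chunk *c = (struct chunk *)p - 1;
	c->used = 0;
	/* Coalesce with the following chunk if it is also free. */
	if (c->next != NULL && !c->next->used) {
		c->size += sizeof(struct chunk) + c->next->size;
		c->next = c->next->next;
	}
}
```

With free() available, vdev_raidz_reconstruct_pq() could release its
temporary buffers after each I/O instead of leaking them from the arena.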
>
> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot then will read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums,
> resulting in the recovery code path being run in vdev_raidz_read(),
> possibly leading to zfs_temp_buf exhaustion as in #2 above.
>
> A partial patch for #3 is attached, but it is inadequate because it
> only reads a vdev's status from the first device's (in BIOS order)
> vdev_label, with the result that if the first device is marked
> offline, gptzfsboot won't see this because only the other devices'
> vdev_labels will indicate that the first device is offline. (Since
> after a device is offlined no further writes will be made to the
> device, its vdev_label is not updated to reflect that it's offline.)
> To complete the patch it would be necessary to set each leaf vdev's
> status from the newest vdev_label rather than from the first
> vdev_label seen.
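The "newest label wins" rule needed to complete the patch can be
sketched as follows; the struct and field names are illustrative stand-ins
for whatever zfsimpl.c actually parses out of the nvlists, but the idea
is just to keep the state from the label with the highest txg rather than
the first label found in BIOS probe order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical digest of one vdev_label as seen while probing. */
struct label_info {
	uint64_t txg;	/* transaction group that wrote this label */
	int state;	/* e.g. 0 = healthy, 2 = offline (illustrative) */
};

/* Return the vdev state recorded in the newest label, regardless of
 * the order in which the labels were discovered. */
static int newest_state(const struct label_info *labels, size_t n)
{
	uint64_t best_txg = 0;
	int state = -1;

	for (size_t i = 0; i < n; i++) {
		if (labels[i].txg >= best_txg) {
			best_txg = labels[i].txg;
			state = labels[i].state;
		}
	}
	return (state);
}
```

This handles the case described above: an offlined device's own stale
label still claims it is healthy, but a sibling's newer label records the
offline state and wins on txg.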
>
> I think I've also hit a stack overflow a couple of times while debugging.
>
> I don't know enough about the gptzfsboot/loader.zfs environment to
> know whether the heap size could be easily enlarged, or whether there
> is room for a real malloc() with free(). loader(8) seems to use the
> malloc() in libstand. Can anyone shed some light on the memory
> limitations and possible solutions?
>
> I won't be able to spend much more time on this, but I wanted to pass
> on what I've learned in case someone else has the time and boot fu to
> take it the next step.
One issue is that disk transfers need to happen in the lower 1MB due to BIOS
limitations. The loader uses a bounce buffer (in biosdisk.c in libi386) to
make this work ok. The loader uses memory > 1MB for malloc(). You could
probably change zfsboot to do that as well if not already. Just note that
drvread() has to bounce buffer requests in that case. The text + data + bss
+ stack is all in the lower 640k and there's not much you can do about that.
The stack grows down from 640k, and the boot program text + data starts at
64k with the bss following. Hmm, drvread() might already be bounce buffering
since boot2 has to do so since it copies the loader up to memory > 1MB as
well. You might need to use memory > 2MB for zfsboot's malloc() so that the
loader can be copied up to 1MB. It looks like you could patch malloc() in
zfsboot.c to use 4*1024*1024 as heap_next and maybe 64*1024*1024 as heap_end
(this assumes all machines that boot ZFS have at least 64MB of RAM, which is
probably safe).
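The suggested zfsboot.c change could be sketched like this (a bump
allocator over the 4MB..64MB range; heap_next/heap_end mirror the names
above, boot_malloc() and the rest are illustrative, and it assumes, as
stated, at least 64MB of RAM):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed heap placement: above 2MB so the loader can
 * still be copied up to 1MB, with drvread() bounce-buffering any disk
 * transfers through memory below 1MB for the BIOS. */
static uintptr_t heap_next = 4UL * 1024 * 1024;		/* start at 4MB */
static uintptr_t heap_end = 64UL * 1024 * 1024;		/* assume >= 64MB RAM */

static void *boot_malloc(size_t size)
{
	size = (size + 15) & ~(size_t)15;	/* round sizes to 16 bytes */
	if (size > heap_end - heap_next)
		return (NULL);			/* 60MB heap exhausted */
	void *p = (void *)heap_next;
	heap_next += size;
	return (p);
}
```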
--
John Baldwin