bin/144214: zfsboot fails on gang block after upgrade to zfs v14

Thu May 27 15:03:18 UTC 2010

On 27 May 2010 09:35, Andriy Gapon <avg at freebsd.org> wrote:

>
>
> I think I nailed this problem now.
> What was additionally needed was the following change:
>        if (!vdev || !vdev->v_read)
>                return (EIO);
> -       if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
> +       if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
>                return (EIO);
>
> Full patch is here:
> http://people.freebsd.org/~avg/boot-zfs-gang.diff
>
> Apparently I am not as smart as Roman :) because I couldn't find the bug
> by just staring at this rather small function (for a couple of hours), so
> I had to reproduce the problem to catch it.  Hence I am copying hackers@
> to share a couple of tricks that were new to me.  Perhaps they could help
> someone else some other day.
>
> First, after very helpful hints that I received in parallel from pjd and
> two Oracle/Sun developers, it became very easy to produce a pool with
> files with gang blocks in them.
> One can set the metaslab_gang_bang variable in metaslab.c to some value
> < 128K and then blocks with size greater than metaslab_gang_bang will be
> allocated as gang blocks with a 25% chance.  I personally did something
> similar but slightly more deterministic:
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> @@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
>        ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
>        ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
>
> +       /*XXX XXX XXX XXX*/
> +       if (zio->io_size > 8 * 1024) {
> +               return (zio_write_gang_block(zio));
> +       }
> +       /*XXX XXX XXX XXX*/
> +
>        error = metaslab_alloc(spa, mc, zio->io_size, bp,
>            zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);
>
> This ensured that any block > 8K would be a gang block.
> Then I compiled zfs.ko with this change and put it into a virtual machine,
> where I created a pool and populated its root/boot filesystem with the
> /boot directory.
> I booted the virtual machine from the new virtual disk and immediately
> hit the problem.
>
> So far, so good, but still no clue why zfsboot crashes upon encountering
> a gang block.
>
> So I decided to debug the crash with gdb.
> Standard steps:
> $ qemu ... -S -s
> $ gdb
> ...
> (gdb) target remote localhost:1234
>
> Now I didn't want to single-step through the whole boot process, so I
> decided to get some help from gdb. Here's a trick:
> (gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000
>
> gptzfsboot.out is an ELF image produced by GCC, which then gets
> transformed into a raw binary and then into the final BTX binary
> (gptzfsboot).
> gptzfsboot.out is built without much debugging data but at least it
> contains information about function names.  Perhaps it's even possible to
> compile gptzfsboot.out with a higher debug level; then debugging would be
> much more pleasant.
>
> 0xA000 is where _code_ from gptzfsboot.out ends up being loaded in memory.
> BTW, having only shallow knowledge of the boot chain and BTX, I didn't
> know this address. Another GDB trick helped me:
> (gdb) append memory boot.memdump  0x0 0x10000
>
> This command dumps memory content in range 0x0-0x10000 to a file named
> boot.memdump.  Then I produced a hex dump and searched for the byte
> sequence with which gptzfsboot.bin starts (the raw binary produced from
> gptzfsboot.out).
>
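
[The search step described above can also be done programmatically rather than by eye in a hex dump. The following is only a minimal sketch, not something from the original message: find_bytes() is a hypothetical helper name, and it assumes the memory dump and the first few bytes of gptzfsboot.bin have already been read into buffers. The returned offset is the candidate load address within the dumped range.]

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the offset of the first occurrence of
 * 'needle' (e.g. the leading bytes of gptzfsboot.bin) inside 'hay'
 * (e.g. the contents of boot.memdump), or -1 if it does not occur.
 */
static long
find_bytes(const unsigned char *hay, size_t hay_len,
    const unsigned char *needle, size_t needle_len)
{
	size_t i;

	if (needle_len == 0 || needle_len > hay_len)
		return (-1);
	/* Plain linear scan; fine for a 64K dump. */
	for (i = 0; i + needle_len <= hay_len; i++) {
		if (memcmp(hay + i, needle, needle_len) == 0)
			return ((long)i);
	}
	return (-1);
}
```

[Since the dump above starts at address 0x0, the offset returned is the memory address itself. A longer needle reduces the chance of a false match.]
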
> Of course, the memory dump should be taken after gptzfsboot is loaded
> into memory :)
> Catching the right moment requires a little bit of boot process knowledge.
> I caught it with:
> (gdb) b *0xC000
>
> That is, memory dump was taken after gdb stopped at the above break point.
>
> After that it was a piece of cake.  I set a break point on the
> zio_read_gang function (after the add-symbol-file command) and then
> stepi-ed through the code (that is, instruction by instruction).  The
> following command made it easier to see what's getting executed:
> (gdb) display/i 0xA000 + $eip
>
> I quickly stepped through the code and saw that a large value was passed
> to vdev_read as the 'bytes' parameter.  But this should have been 512.
> The oversized read into a buffer allocated on the stack smashed the stack
> and that was the end.
>
> Backtracking the call chain in source code I immediately noticed the bp
> condition in vdev_read_phys and realized what the problem was.
>
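
[To make the failure mode concrete, here is a small model of the kind of "bp condition" described above. It is a sketch under assumptions, not the actual vdev_read_phys code: struct model_bp and model_read_size() are hypothetical names, and the model only captures the idea that a non-NULL block pointer overrides the size the caller asked for.]

```c
#include <stddef.h>

/* Hypothetical stand-in for a block pointer; only the size matters here. */
struct model_bp {
	size_t psize;	/* physical size recorded for the block */
};

/*
 * Model of a read routine with a "bp condition": when bp is non-NULL,
 * its recorded size is used instead of the caller's 'bytes'.
 * Returns the number of bytes the read would transfer.
 */
static size_t
model_read_size(const struct model_bp *bp, size_t bytes)
{
	if (bp != NULL)
		return (bp->psize);
	return (bytes);
}
```

[With a block pointer describing, say, a 16K block, model_read_size(&bp, 512) yields 16384 even though the caller's buffer holds only 512 bytes. Passing NULL, as in the fix quoted at the top, yields the requested 512.]
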
> Hope this makes for useful reading.
>

Excellent work - thanks for looking into this. I still think it's easier to
debug this code in userland using a shim that redirects the zfsboot i/o
calls to simple read system calls...
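
A rough idea of what such a shim could look like is below. This is only a sketch under assumptions: shim_read() and its parameter layout are hypothetical and are not taken from the zfsboot sources. It assumes the low-level read callback can be pointed at a function that receives a private pointer (here, a pointer to a file descriptor for a disk image opened in userland), an offset, a buffer, and a byte count.

```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical userland read callback.  'priv' points at an open file
 * descriptor for a disk image.  Returns 0 on success and EIO on an
 * error or a short read.
 */
static int
shim_read(void *priv, off_t offset, void *buf, size_t bytes)
{
	int fd = *(int *)priv;
	ssize_t n;

	/* pread() does not move the file offset, so calls are independent. */
	n = pread(fd, buf, bytes, offset);
	if (n < 0 || (size_t)n != bytes)
		return (EIO);
	return (0);
}
```

With something like this in place, the same reading code can run as an ordinary process against a pool image, where a stack-smashing read is much easier to catch with the usual userland debugging tools.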