bin/144214: zfsboot fails on gang block after upgrade to zfs v14

Thu May 27 14:35:19 UTC 2010

Andriy Gapon wrote:
> 
> I think I nailed this problem now.
> What was additionally needed was the following change:
>  	if (!vdev || !vdev->v_read)
>  		return (EIO);
> -	if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
> +	if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
>  		return (EIO);
> 
> Full patch is here:
> http://people.freebsd.org/~avg/boot-zfs-gang.diff
> 
> Apparently I am not as smart as Roman :) because I couldn't find the bug by just
> staring at this rather small function (for a couple of hours), so I had to
> reproduce the problem to catch it.  Hence I am copying hackers@ to share a couple
> of tricks that were new to me.  Perhaps they could help someone else some other
> day.

Excellent, I'm glad that this is finally tested and seems to be working.
When I initially added the code, I wasn't able to test it, and it
turned out that the issue that I was trying to resolve wasn't actually
gang block related anyway.

robert.

> First, after very helpful hints that I received in parallel from pjd and two
> Oracle/Sun developers, it became very easy to create a pool containing files
> with gang blocks in them.
> One can set the metaslab_gang_bang variable in metaslab.c to some value < 128K,
> and then blocks with size greater than metaslab_gang_bang will be allocated as
> gang blocks with a 25% chance.  I personally did something similar but slightly
> more deterministic:
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> @@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
>  	ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
>  	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
> 
> +	/*XXX XXX XXX XXX*/
> +	if (zio->io_size > 8 * 1024) {
> +		return (zio_write_gang_block(zio));
> +	}
> +	/*XXX XXX XXX XXX*/
> +
>  	error = metaslab_alloc(spa, mc, zio->io_size, bp,
>  	    zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);
> 
> This ensured that any block > 8K would be a gang block.
> Then I compiled zfs.ko with this change and put it into a virtual machine, where
> I created a pool and populated its root/boot filesystem with a /boot directory.
> I booted the virtual machine from the new virtual disk and immediately hit the
> problem.
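
To make the difference between the two ways of forcing gang blocks concrete, here is a minimal sketch in C. It is not the ZFS code: both function names are hypothetical, and the "low two bits of a tick counter" test is only an assumed way of modelling the 25% chance mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Deterministic variant, modelled on the zio.c hack quoted above:
 * every block larger than 8K is written as a gang block.
 */
int
gang_deterministic(size_t io_size)
{
	return (io_size > 8 * 1024);
}

/*
 * Probabilistic variant (assumption: the 25% chance is modelled as
 * "one tick value out of every four").  'threshold' plays the role
 * of metaslab_gang_bang; 'tick' is a hypothetical clock counter.
 */
int
gang_probabilistic(size_t io_size, size_t threshold, unsigned long tick)
{
	assert(threshold > 0);
	return (io_size > threshold && (tick & 3) == 0);
}
```

With the probabilistic variant, whether a given file ends up with gang blocks depends on timing, which is why the deterministic variant is more convenient for reproducing a boot failure.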
> 
> So far, so good, but still no clue why zfsboot crashes upon encountering a gang
> block.
> 
> So I decided to debug the crash with gdb.
> Standard steps:
> $ qemu ... -S -s
> $ gdb
> ...
> (gdb) target remote localhost:1234
> 
> Now I didn't want to single-step through the whole boot process, so I decided to
> get some help from gdb. Here's a trick:
> (gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000
> 
> gptzfsboot.out is an ELF image produced by GCC, which then gets transformed into
> a raw binary and then into the final BTX binary (gptzfsboot).
> gptzfsboot.out is built without much debugging data, but at least it contains
> information about function names.  Perhaps it's even possible to compile
> gptzfsboot.out with a higher debug level, which would make debugging much more
> pleasant.
> 
> 0xA000 is where _code_ from gptzfsboot.out ends up being loaded in memory.
> BTW, having only shallow knowledge of the boot chain and BTX, I didn't know this
> address. Another GDB trick helped me:
> (gdb) append memory boot.memdump  0x0 0x10000
> 
> This command dumps the memory content in the range 0x0-0x10000 to a file named
> boot.memdump.  Then I produced a hex dump and searched for the byte sequence with
> which gptzfsboot.bin (the raw binary produced from gptzfsboot.out) starts.
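
The search described above can also be done programmatically instead of by eye in a hex dump. The following is only a sketch: find_signature is a hypothetical helper, and the caller is assumed to have already read boot.memdump and the first few bytes of gptzfsboot.bin into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first occurrence of sig[0..sig_len) in
 * dump[0..dump_len), or -1 if the signature does not occur.
 */
long
find_signature(const unsigned char *dump, size_t dump_len,
    const unsigned char *sig, size_t sig_len)
{
	size_t i;

	assert(dump != NULL && sig != NULL);
	if (sig_len == 0 || sig_len > dump_len)
		return (-1);
	/* Naive scan; fast enough for a 64K dump. */
	for (i = 0; i + sig_len <= dump_len; i++) {
		if (memcmp(dump + i, sig, sig_len) == 0)
			return ((long)i);
	}
	return (-1);
}
```

Because the dump starts at address 0x0, the returned offset is directly the memory address at which the binary was loaded.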
> 
> Of course, the memory dump should be taken after gptzfsboot is loaded into
> memory :)
> Catching the right moment requires a little bit of boot process knowledge.
> I caught it with:
> (gdb) b *0xC000
> 
> That is, the memory dump was taken after gdb stopped at the above breakpoint.
> 
> After that it was a piece of cake.  I set a breakpoint on the zio_read_gang
> function (after the add-symbol-file command) and then stepi-ed through the code
> (that is, instruction by instruction).  The following command made it easier to
> see what was getting executed:
> (gdb) display/i 0xA000 + $eip
> 
> I quickly stepped through the code and saw that a large value was passed to
> vdev_read as the 'bytes' parameter.  But this should have been 512.  The
> oversized read into a buffer allocated on the stack smashed the stack, and that
> was the end.
> 
> Backtracking the call chain in the source code, I immediately noticed the bp
> condition in vdev_read_phys and realized what the problem was.
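
As an illustration of why passing NULL instead of bp fixes the crash, here is a simplified model in C. It is a sketch, not the boot code itself: struct fake_bp and both helpers are hypothetical, and it assumes that the bp condition selects the block pointer's physical size whenever a block pointer is supplied.

```c
#include <assert.h>
#include <stddef.h>

#define SPA_GANGBLOCKSIZE 512	/* size of the on-stack gang header buffer */

/* Hypothetical stand-in for blkptr_t; only the physical size matters here. */
struct fake_bp {
	size_t psize;
};

/*
 * Assumed size selection: when a block pointer is supplied, its
 * physical size overrides the caller's 'size' argument.
 */
size_t
read_size(const struct fake_bp *bp, size_t size)
{
	assert(size > 0);
	if (bp != NULL)
		return (bp->psize);
	return (size);
}

/* Returns 1 if a read of 'nbytes' fits in the gang header buffer. */
int
fits_gang_buffer(size_t nbytes)
{
	return (nbytes <= SPA_GANGBLOCKSIZE);
}
```

With a block pointer describing, say, a 128K block, read_size(&bp, SPA_GANGBLOCKSIZE) yields 128K and overflows the 512-byte buffer; with NULL, the requested 512 bytes are used.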
> 
> I hope this makes for useful reading.