bin/144214: zfsboot fails on gang block after upgrade to zfs v14

Thu May 27 08:35:39 UTC 2010

I think I have nailed this problem now.
The following additional change was needed:
 	if (!vdev || !vdev->v_read)
 		return (EIO);
-	if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
+	if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
 		return (EIO);

Full patch is here:
http://people.freebsd.org/~avg/boot-zfs-gang.diff

Apparently I am not as smart as Roman :) because I couldn't find the bug by just
staring at this rather small function (for a couple of hours), so I had to
reproduce the problem to catch it.  Hence I am copying hackers@ to share a couple
of tricks that were new to me.  Perhaps they could help someone else some other
day.

First, thanks to very helpful hints that I received in parallel from pjd and two
Oracle/Sun developers, it became very easy to create a pool whose files contain
gang blocks.
One can set the metaslab_gang_bang variable in metaslab.c to some value < 128K,
and then blocks larger than metaslab_gang_bang will be allocated as gang
blocks with a 25% chance.  I personally did something similar but slightly more
deterministic:

--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
@@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
 	ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
 	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));

+	/*XXX XXX XXX XXX*/
+	if (zio->io_size > 8 * 1024) {
+		return (zio_write_gang_block(zio));
+	}
+	/*XXX XXX XXX XXX*/
+
 	error = metaslab_alloc(spa, mc, zio->io_size, bp,
 	    zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);

This ensured that any block > 8K would be a gang block.
Then I compiled zfs.ko with this change and put it into a virtual machine, where
I created a pool and populated its root/boot filesystem with a /boot directory.
I booted the virtual machine from the new virtual disk and immediately hit the
problem.
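
For comparison, the two ways of forcing gang blocks described above can be
modelled in a few lines of standalone C.  This is only a sketch and not the
metaslab.c or zio.c code: gang_bang_threshold, gang_random(), gang_always() and
the 'tick' argument are hypothetical names, and the one-in-four mask is simply
chosen to match the 25% chance mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for metaslab_gang_bang; any value < 128K would do. */
static const size_t gang_bang_threshold = 32 * 1024;

/*
 * Probabilistic policy: a block larger than the threshold becomes a gang
 * block about one time in four.  'tick' stands in for whatever varying
 * value the real code consults; that detail is an assumption.
 */
static bool
gang_random(size_t io_size, uint64_t tick)
{
	return (io_size > gang_bang_threshold && (tick & 3) == 0);
}

/* Deterministic policy, as in the zio.c hack: every block > 8K is ganged. */
static bool
gang_always(size_t io_size)
{
	return (io_size > 8 * 1024);
}
```

With the deterministic variant a single large file in /boot is enough to hit
the gang block path on the first boot, which is why I preferred it.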

So far, so good, but still no clue why zfsboot crashes upon encountering a gang
block.

So I decided to debug the crash with gdb.
Standard steps:
$ qemu ... -S -s
$ gdb
...
(gdb) target remote localhost:1234

Now I didn't want to single-step through the whole boot process, so I decided to
get some help from gdb. Here's a trick:
(gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000

gptzfsboot.out is an ELF image produced by GCC, which then gets transformed into
a raw binary and then into the final BTX binary (gptzfsboot).
gptzfsboot.out is built without much debugging data, but at least it contains
information about function names.  Perhaps it's even possible to compile
gptzfsboot.out with a higher debug level; debugging would then be much more
pleasant.

0xA000 is where the _code_ from gptzfsboot.out ends up being loaded in memory.
BTW, having only a shallow knowledge of the boot chain and BTX, I didn't know this
address.  Another GDB trick helped me:
(gdb) append memory boot.memdump  0x0 0x10000

This command dumps the memory content in the range 0x0-0x10000 to a file named
boot.memdump.  Then I produced a hex dump and searched for the byte sequence with
which gptzfsboot.bin starts (the raw binary produced from gptzfsboot.out).
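
The search itself can also be automated.  The following is a minimal sketch of
that idea, not something I actually used: find_bytes(), read_file() and
locate_image() are hypothetical helpers, and the number of leading bytes to
match is left to the caller.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the offset of the first occurrence of needle in hay, or -1. */
static long
find_bytes(const unsigned char *hay, size_t hay_len,
    const unsigned char *needle, size_t needle_len)
{
	size_t i;

	if (needle_len == 0 || needle_len > hay_len)
		return (-1);
	for (i = 0; i + needle_len <= hay_len; i++) {
		if (memcmp(hay + i, needle, needle_len) == 0)
			return ((long)i);
	}
	return (-1);
}

/* Read a whole file into a malloc'ed buffer; returns NULL on failure. */
static unsigned char *
read_file(const char *path, size_t *len)
{
	FILE *fp;
	unsigned char *buf = NULL;
	long size;

	if ((fp = fopen(path, "rb")) == NULL)
		return (NULL);
	if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) > 0 &&
	    fseek(fp, 0, SEEK_SET) == 0 &&
	    (buf = malloc((size_t)size)) != NULL) {
		if (fread(buf, 1, (size_t)size, fp) != (size_t)size) {
			free(buf);
			buf = NULL;
		} else {
			*len = (size_t)size;
		}
	}
	fclose(fp);
	return (buf);
}

/*
 * Offset in the dump at which the first 'prefix' bytes of the image
 * appear, or -1 if they are not found or a file cannot be read.
 */
static long
locate_image(const char *dump_path, const char *image_path, size_t prefix)
{
	unsigned char *dump, *image;
	size_t dump_len = 0, image_len = 0;
	long off = -1;

	dump = read_file(dump_path, &dump_len);
	image = read_file(image_path, &image_len);
	if (dump != NULL && image != NULL) {
		if (prefix > image_len)
			prefix = image_len;
		off = find_bytes(dump, dump_len, image, prefix);
	}
	free(dump);
	free(image);
	return (off);
}
```

Because the dump starts at address 0x0, the offset returned for boot.memdump and
gptzfsboot.bin is directly the address at which the code was loaded.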

Of course, the memory dump should be taken after gptzfsboot is loaded into
memory :)
Catching the right moment requires a little bit of knowledge of the boot process.
I caught it with:
(gdb) b *0xC000

That is, the memory dump was taken after gdb stopped at the above breakpoint.

After that it was a piece of cake.  I set a breakpoint on the zio_read_gang
function (after the add-symbol-file command) and then stepi-ed through the code
(that is, instruction by instruction).  The following command made it easier to
see what was getting executed:
(gdb) display/i 0xA000 + $eip

I quickly stepped through the code and saw that a large value was passed to
vdev_read as the 'bytes' parameter.  But this should have been 512.  The oversized
read into a buffer allocated on the stack smashed the stack, and that was the end.

Tracing back through the call chain in the source code, I immediately noticed
the bp condition in vdev_read_phys and realized what the problem was.
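
To make the failure mode concrete, here is a much-reduced model of it.  It is
not the zfsboot source: struct mock_bp and both functions are hypothetical, and
the sketch assumes that the bp condition means "if a block pointer is supplied,
take the read size from it rather than from the 'bytes' argument".

```c
#include <stddef.h>

#define GANG_HEADER_SIZE 512	/* the 512 bytes the read should have used */

/* Hypothetical, reduced stand-in for blkptr_t: only the physical size. */
struct mock_bp {
	size_t psize;
};

/*
 * Assumed size selection: with a block pointer its physical size wins,
 * with NULL the caller's 'bytes' argument is used as given.
 */
static size_t
effective_read_size(const struct mock_bp *bp, size_t bytes)
{
	return (bp != NULL ? bp->psize : bytes);
}

/* Would reading a gang header into a 512-byte stack buffer overflow it? */
static int
gang_header_read_overflows(const struct mock_bp *bp)
{
	return (effective_read_size(bp, GANG_HEADER_SIZE) > GANG_HEADER_SIZE);
}
```

Under that assumption, passing the bp of a large gang block makes the read
bigger than the 512-byte header buffer, while passing NULL, as in the fix at the
top of this message, keeps it at 512 bytes.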

I hope this makes for useful reading.
-- 
Andriy Gapon