ZFS Assertion Fault with FreeBSD 7.2

Thu Jun 25 18:40:03 UTC 2009

Good Evening,

This is a heads up really; I think I've got this sorted. I'm writing this
as my system backs up data to another array in case it all explodes.

This afternoon I was performing some MPEG4 encoding with ffmpeg. The source
file and destination file were both located on the same ZFS partition.

Part way through the ffmpeg encode, the process went to the "zfs:lo" state
and hung. All processes that attempted to browse to the partition
"data/domains" hung immediately.

I attempted to reboot the machine in order to restore normality; however,
the system got stuck halfway through shutting down. In the end I issued a
hard power off to shut the machine down.

Upon reboot, during the ZFS rc.d init script, I saw the following:

panic: solaris assert: 0 == dmu_bonus_hold(zfsvfs->z_os, *oid, NULL,
&dbp), file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c,
line: 472

Apologies for any single-character errors; that was typed from an image.
Through diagnosis I was able to determine that the error was being caused
by a mirrored zpool called "store".

I booted into single user mode and ran /etc/rc.d/hostname and
/etc/rc.d/hostid.

Looking at the ZFS rc.d file, I was able to run "zfs volinit" with no
issues; the panic was reproducible on "zfs mount -a".

I then began to mount each mount point one by one until I found the one
causing the issue. It is "store/sara/unix/Maildir", a compressed volume
with otherwise nothing custom.
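
For anyone wanting to repeat the one-by-one approach, here is a minimal
Python sketch that only prints the mount commands rather than running them.
Because a bad dataset panics the kernel, the idea is to run the printed
commands by hand, one at a time. The helper name mount_commands and every
dataset name other than "store/sara/unix/Maildir" are made up for
illustration.

```python
import shlex


def mount_commands(datasets, read_only=False):
    """Return one 'zfs mount' command string per dataset.

    Names are sorted so that a parent such as "store" comes before its
    children such as "store/sara".
    """
    cmds = []
    for name in sorted(datasets):
        argv = ["zfs", "mount"]
        if read_only:
            # Pass the read-only mount option for this mount only.
            argv += ["-o", "ro"]
        argv.append(name)
        cmds.append(" ".join(shlex.quote(a) for a in argv))
    return cmds


if __name__ == "__main__":
    # Hypothetical list; in practice it could come from "zfs list -H -o name".
    for cmd in mount_commands(["store/sara/unix/Maildir", "store", "store/sara"]):
        print(cmd)
```

Passing read_only=True produces the read-only variant of each command.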

Following my ancient UFS logic, I attempted to mount it read only. This
worked and spat out the following kernel warning:

Solaris: WARNING: ZFS replay transaction error 30, dataset
store/sara/unix/Maildir, seq 0x77001, txtype 5
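
As a side note for anyone decoding that warning, here is a small Python
sketch that pulls the fields out of the message. The errno table is a
subset of FreeBSD's sys/errno.h, where 30 is EROFS (read-only file system).
The transaction type names are my assumption from reading the ZFS intent
log header and should be checked against the source. The function name
decode_replay_warning is made up for illustration. If the decoding is
right, the warning may simply mean the intent log could not be replayed
onto a read-only mount, but that is a guess.

```python
import re

# Subset of FreeBSD errno values (sys/errno.h).
ERRNO_NAMES = {5: "EIO", 28: "ENOSPC", 30: "EROFS"}

# ASSUMPTION: ZFS intent log transaction type names; verify against zil.h.
TX_NAMES = {1: "TX_CREATE", 2: "TX_MKDIR", 5: "TX_REMOVE", 9: "TX_WRITE"}

WARNING_RE = re.compile(
    r"ZFS replay transaction error (?P<error>\d+), "
    r"dataset (?P<dataset>[^,\s]+), "
    r"seq (?P<seq>0x[0-9a-fA-F]+), txtype (?P<txtype>\d+)"
)


def decode_replay_warning(text):
    """Parse a replay warning; return a dict of fields, or None if no match."""
    # Collapse line wraps (such as those added by email) into single spaces.
    flat = " ".join(text.split())
    m = WARNING_RE.search(flat)
    if m is None:
        return None
    error = int(m.group("error"))
    txtype = int(m.group("txtype"))
    return {
        "dataset": m.group("dataset"),
        "error": error,
        "error_name": ERRNO_NAMES.get(error, "unknown"),
        "seq": int(m.group("seq"), 16),
        "txtype": txtype,
        "txtype_name": TX_NAMES.get(txtype, "unknown"),
    }


if __name__ == "__main__":
    msg = ("Solaris: WARNING: ZFS replay transaction error 30, dataset\n"
           "store/sara/unix/Maildir, seq 0x77001, txtype 5")
    print(decode_replay_warning(msg))
```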

To aid diagnosis, and because I'd damaged the rc environment while
debugging, I rebooted into single user mode and mounted the whole of
"store" read only. However, this time the warning did not show. I am
currently in the process of copying the entirety of "store" to "data". I
am planning to attempt remounting all of the volume's mount points
read/write once the backup is done.

Is there anything else that I should be doing to a) attempt to ensure my
data structures are now okay and b) help find the problem? I understand a)
will probably prevent b), but the data is too important to risk, sorry. A
scrub of the volume came to mind as a double check.

Any thoughts are greatly appreciated. Apologies if this email comes out
badly; my email is on this server, so I'm webmailing and scrounging
through mqueues on the upstream.

Peter.
-- 
Peter Wood :: peter at alastria.net