Agreed: I had not disabled journaling completely before doing a clean full fsck.

After taking the requested actions, I was still not able to recover from this race condition with SUJ,
but snapshots are still OK with SU only.

So here are the investigations I have done (sorry for the length):

This test system was freshly installed from the 9.0-RC1 ISO (18 October, after the fix) and then csup'ed to RELENG_9.
It is a very basic setup (40G available), with just dovecot running, on a GENERIC kernel.

Operations and results:

Since it's the root fs:
clean shutdown, boot single-user, disable SUJ, mount read-write, remove .sujournal and the bad snapshot file,
then clean halt.

I rebooted into single-user mode again, then ran fsck_ufs -y /dev/ufs/ROOTFS
I got only some very minor fixups: free block count wrong, summary information bad, and blocks missing in bitmaps.
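
For reference, the two single-user passes above can be sketched as a small sh script. This is only a sketch under assumptions: the path of the bad snapshot file (BAD_SNAP) is hypothetical, since it is not given here, and the helper names (run, suj_disable_pass, full_fsck_pass, suj_enable_pass) are made up for illustration. By default it only prints the commands instead of running them.

```sh
#!/bin/sh
# Sketch only. By default every command is printed, not executed;
# set DRY_RUN=0 (as root, in single-user mode) to really run them.
DEV=/dev/ufs/ROOTFS            # UFS label from the post
BAD_SNAP=/.snap/bad_snapshot   # hypothetical path, not given in the post

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# First single-user boot: root is still mounted read-only here.
suj_disable_pass() {
    run tunefs -j disable "$DEV"         # turn off soft updates journaling
    run mount -u -o rw /                 # remount root read-write
    run rm -f /.sujournal "$BAD_SNAP"    # drop journal file and bad snapshot
    # then halt cleanly and boot single-user again
}

# Second single-user boot: full check while root is still read-only.
full_fsck_pass() {
    run fsck_ufs -y "$DEV"
}

# Later, from single-user mode, to turn journaling back on before retesting.
suj_enable_pass() {
    run tunefs -j enable "$DEV"
}
```

Calling suj_disable_pass, then (after the halt and a new single-user boot) full_fsck_pass, follows the order described above; suj_enable_pass corresponds to the later step where SUJ is reactivated.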

After a normal reboot, I took a snapshot successfully with soft updates only (no journaling).

I rebooted into single-user mode again, re-enabled SUJ, then rebooted in normal mode.

I took a snapshot, and again mksnap_ffs was eating all the CPU and could be neither suspended nor killed.

So I tried to figure out what was going on with systat -v / gstat / top -SCHP
and with strace / truss / ktrace (on ramfs and NFS) to trace mksnap_ffs:

Here are some results:

gstat: 26 seconds of intense I/O activity, like a normal snapshot.
A bad sparse snapshot file was created (the UFS label (ROOTFS) is not present, and there is some garbage at the beginning).
The real and sparse sizes of the file are very near those of a normal snapshot file.

truss: begins showing output, then hangs before being useful.
mksnap_ffs is in running/runnable state, eating 100% CPU in kernel mode, 0% in user mode.
systat: hangs.
top: still running correctly; 15 to 25% CPU in interrupt SWI4: CLOCK (the CPU has 2 cores).

strace: only for i386 :-(
ktrace: blocks before showing any valuable info, even on remote NFS.
Regular processes hang on suspfs.

Hard power cycle:

After a normal reboot and the regular SUJ fixup,
I got a panic at the login prompt (bg_fsck not started):

panic: ffs_sync: rofs mod (it's a physical machine, so no screenshots)

The backtrace shows ffs_write_suspend+0x... before the ffs_sync.

So I retried, booting the 9.0-RC1 CD in live mode: disable SUJ, disable SU, fsck, re-enable SU and SUJ,
mount the fs without doing anything on it, and take a snapshot (still in live mode).
This time the snapshot was OK, even with SUJ.

So I wrongly concluded that touching the root fs in single-user mode is not as good as touching it from a live CD.

But after returning to normal operation, the race is still there.

After various tracing tests, when rebooting in normal mode after the standard SUJ recovery:

I sometimes got a double panic after the login prompt,

just after the backtrace (softdep_process_worklist ...):
-> panic: bufwrite: bufwrite is not busy.

I also saw, when there is more I/O activity while taking the snapshot, a kernel panic saying:

panic: softdep_deallocate_dependencies: dangling deps

Surely something is wrong in this setup, because SUJ snapshots work well on other systems and on 9.0-RELEASE, so
I am lost in cyberspace :-)
If I don't take a snapshot, the system is very stable, even under heavy activity.

(smartd has never shown anything bad.)

Since it's not a production system, I could do a fresh reinstall with 9.0-RELEASE, but since some other people
have trouble too, we prefer to investigate.


It looks like the journal gets out of sync after these race situations?
Idea: would it make sense to reinitialize the log file at shutdown time?

Is it possible that bad drive write caching (or too aggressive caching in the VM), with bad ordering, can
trigger this kind of issue with the journal when the snapshot is quiescing the fs?

(ada0: <ST9320423AS 0002SDM1> ATA-8 SATA 2.x device)

If required, I can do some more tests, with KDB compiled in, or whatever.

Thanks again and again for your wonderful work.

Very best regards.


FreeBSD: The way to go :-)

