zfs process hang on pool access

Tue Aug 2 18:01:27 UTC 2011

On Aug 1, 2011, at 5:55 AM, Martin Matuska wrote:

>>> ===================================================================
>>> --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c (revision
>>> 224527)
>>> +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c (working copy)
>>> @@ -488,7 +488,7 @@
>>> txg_delay(dsl_pool_t *dp, uint64_t txg, int ticks)
>>> {
>>> tx_state_t *tx = &dp->dp_tx;
>>> - int timeout = ddi_get_lbolt() + ticks;
>>> + clock_t timeout = ddi_get_lbolt() + ticks;
>> 
>> So you recon that one line will fix the 100+ days overflow David's
>> talking about?
> No, this bug causes that after 28,5 days of uptime this function won't
> delay txg sync threads anymore (ddi_get_lbolt() will be always larger
> than timeout).This may lead to a slowdown of writes.

Thanks Martin.  I see you committed to -head as well.

However, a bit confused with my issue still.  I've been able to recreating it and I can't explain why LBOLT is still coming back negative.   I'm in 8.1-RELEASE in this case, with the patches that I posted before for LBOLT and typedef of clock_t. 

I added some debugging and am seeing LBOLT still return a negative number, which then (timeout-LBOLT) because a large number of ticks to wait.

I put some debugging in txg_delay, without fixing the timeout's type:

	Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c
	===================================================================
	--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c        (revision 3343)
	+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c        (working copy)
	@@ -427,6 +427,7 @@
	 {
			tx_state_t *tx = &dp->dp_tx;
			int timeout = LBOLT + ticks;
	+       /* if ( timeout < 0 ) { printf ("txg_delay: timeout negative timeout=%d, ticks=%d\n", timeout, ticks); } */

			/* don't delay if this txg could transition to quiesing immediately */
			if (tx->tx_open_txg > txg ||
	@@ -439,10 +440,10 @@
					return;
			}

	-       while (LBOLT < timeout &&
	-           tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
	-               (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
	-                   timeout - LBOLT);
	+       while (LBOLT < timeout && tx->tx_syncing_txg < txg-1 && !txg_stalled(dp)) {
	+                       printf ("txg_delay::cv_timeoutwait: timeout negative timeout=%d, ticks=%d %lld \n", timeout, ticks, LBOLT );
	+                       (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - LBOLT);
	+               }

			mutex_exit(&tx->tx_sync_lock);
	 }

After two hours of doing 9 parallel file creates ... I got it deadlocked ... well, probably just a long wait ...

	txg_delay::cv_timeoutwait: timeout negative timeout=-518109823, ticks=1 -9182683359751551616
	txg_delay::cv_timeoutwait: timeout negative timeout=-355729919, ticks=1 -9182562881461551616
	txg_delay::cv_timeoutwait: timeout negative timeout=1272384705, ticks=1 -9182430354322551616
	txg_delay::cv_timeoutwait: timeout negative timeout=1272384705, ticks=1 -9182308872293551616
	txg_delay::cv_timeoutwait: timeout negative timeout=1272384705, ticks=1 -9182180363183551616
	txg_delay::cv_timeoutwait: timeout negative timeout=1272384705, ticks=1 -9182051853654551616

Grabbed a core, and did some more poking :

	(kgdb) up
	#3  0xffffffff804d222b in _cv_timedwait (cvp=0xffffff0029524260, 
		lock=0xffffff00295241d0, timo=1914894593)
		at /usr/src/sys/kern/kern_condvar.c:323
	323             rval = sleepq_timedwait(cvp, 0);
	(kgdb) up
	#4  0xffffffff8106ef7e in txg_delay (dp=0xffffff0029524000, txg=9264, ticks=1)
		at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:445
	445                             (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - LBOLT);
	(kgdb) info locals
	timeout = Variable "timeout" is not available.
	(kgdb) 

As you can see, LBOLT is apparently negative - if I'm getting my printf formatting correct - but regardless the timo value to cv_timedwait() is timo=1914894593 ... which is trying to wait for :

	1914894593 / 1000hz / 60s / 60m /24h  = 22.16 hours

Regardless of setting timeout's type correctly ... why is LBOLT still negative ?  It's not clear to me that fixing timeout's type will fix this issue.  However, I will do so, and re-run this same test. 

My patches earlier show my changes to  sys/cddl/compat/opensolaris/sys/time.h  but in summary :

 37 #define NANOSEC     1000000000
 39 typedef longlong_t  hrtime_t;
 41 #define LBOLT   (gethrtime() * (NANOSEC/hz))

And gerhrtime() still remains unchanged from 8.1:

 51 #ifdef _KERNEL
 52 #define lbolt64 (int64_t)(LBOLT)
 53 
 54 static __inline hrtime_t
 55 gethrtime(void) {
 56 
 57     struct timespec ts;
 58     hrtime_t nsec;
 59 
 60 #if 1
 61     getnanouptime(&ts);
 62 #else
 63     nanouptime(&ts);
 64 #endif
 65     nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
 66     return (nsec);
 67 }
 68 
 69 #define gethrestime_sec()   (time_second)
 70 #define gethrestime(ts)     getnanotime(ts)

---
David P. Discher
dpd at bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com