zfs process hang on pool access

Sun Jul 31 20:29:14 UTC 2011

I've actually found a second issue; my working theory is that it is related to the *fix* of LBOLT, in zio_wait()/txg_delay() when calling _cv_wait()/_cv_timedwait().  This may be aggravated by setting vfs.zfs.txg.timeout=1.  And in fact these functions are using LBOLT with signed 32-bit ints.
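
To illustrate the general class of problem with signed 32-bit tick values (this is only a sketch, not the actual ZFS or FreeBSD kernel code), here is how a deadline comparison can go wrong once the counter wraps. The names tick32_t, tick_add(), expired_naive() and expired_wrapsafe() are made up for this example, and the sketch assumes the usual two's-complement conversion from uint32_t back to int32_t.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a signed 32-bit tick value (not a kernel type). */
typedef int32_t tick32_t;

/*
 * Advance a tick value by delta. The addition is done in unsigned
 * arithmetic so that wrapping past INT32_MAX is well defined instead of
 * being signed-overflow undefined behavior. Converting the result back to
 * int32_t is implementation-defined; this sketch assumes two's complement.
 */
static tick32_t tick_add(tick32_t t, int32_t delta)
{
    return (tick32_t)((uint32_t)t + (uint32_t)delta);
}

/*
 * Naive check: compares absolute values. If the deadline wrapped to a
 * negative number, "now" looks greater and the timeout appears to have
 * expired immediately.
 */
static bool expired_naive(tick32_t now, tick32_t deadline)
{
    return now >= deadline;
}

/*
 * Wrap-tolerant check: compares the signed difference instead. This is
 * correct as long as now and deadline are less than 2^31 ticks apart.
 */
static bool expired_wrapsafe(tick32_t now, tick32_t deadline)
{
    return (int32_t)((uint32_t)now - (uint32_t)deadline) >= 0;
}
```

With now = INT32_MAX - 5 and a deadline 10 ticks later, the deadline wraps to a negative value; the naive check then reports the timeout as already expired, while the difference-based check does not until 10 ticks have actually passed.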

I got some cores, and ideas, and will dig into the debugging this week.  And of course I will post my findings (and pleas for help) here on freebsd-fs.

Rolling back the two patches I posted earlier for the 26+ day and 106+ day bugs seemed to avoid the new issue.

---
David P. Discher
dpd at bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Jul 31, 2011, at 12:50 PM, Steven Hartland wrote:

> Is there a PR related to this so we can track progress? Having to reboot machines
> every 100+ days to ensure they don't break is a bit of a PITA when you've got hundreds
> of machines :(
> 
> ----- Original Message ----- From: "David P Discher" <dpd at bitgravity.com>
> To: "Steven Hartland" <killing at multiplay.co.uk>
> Cc: <freebsd-fs at FreeBSD.org>; "Andriy Gapon" <avg at freebsd.org>
> Sent: Wednesday, July 27, 2011 9:41 PM
> Subject: Re: zfs process hang on pool access
> 
> 
> The way I found this was by breaking into the debugger, doing some backtraces, continuing, breaking in again, doing some more backtraces on the hung processes ... seeing what was going on, then walking through the code.
> 
> Then, once I had specific loops and code locations, asking the higher powers of the FreeBSD kernel world.
>