LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock

John Baldwin jhb at freebsd.org
Tue Apr 22 17:58:14 UTC 2008


On Saturday 19 April 2008 07:38:27 am Aristedes Maniatis wrote:
> 
> On 19/04/2008, at 3:14 AM, John Baldwin wrote:
> > On Thursday 10 April 2008 06:33:40 pm Aristedes Maniatis wrote:
> >>
> >>>> http://www.ish.com.au/s/LOR/1.jpg
> >>>> http://www.ish.com.au/s/LOR/2.jpg
> >>>> http://www.ish.com.au/s/LOR/3.jpg (this overlaps with [2])
> >>>
> >>> These are all garbage in kuickshow. :(
> >>
> >> They work fine for me in Firefox. But don't know what sort of jpegs
> >> the Sony camera saves. Anyhow I've also now resaved them as png (about
> >> twice the size). Please let me know if that worked.
> >>
> >> http://www.ish.com.au/s/LOR/1.png , etc
> >
> > kuickshow had issues still, but FF worked ok.  The specific LOR at the
> > end is real, but a minor one.  Basically, the console driver locks
> > (e.g. "sio", "scrlock") are higher in the order than the various thread
> > locks, so any printf while holding a thread lock will trigger a LOR.
> > The real problem at the bottom of the screen though is a real issue.
> > It's a LOR of two different sleepqueue chain locks.  The problem is that
> > when setrunnable() encounters a swapped out thread it tries to wakeup
> > proc0, but if proc0 is asleep (which is typical) then its thread lock is
> > a sleep queue chain lock, so waking up a swapped out thread from wakeup()
> > will usually trigger this LOR.
> >
> > I think the best fix is to not have setrunnable() kick proc0 directly.
> > Perhaps setrunnable() should return an int and return true if proc0
> > needs to be awakened and false otherwise.  Then the sleepq code (b/c
> > only sleeping threads can be swapped out anyway) can return that value
> > from sleepq_resume_thread() and can call kick_proc0() directly once it
> > has dropped all of its own locks.
> >
> > -- 
> > John Baldwin
> 
> The way you describe it, it almost sounds like this LOR should be  
> happening for everyone, all the time. To try and eliminate the factors  
> which trigger it for us, we tried the following: removed PAE from  
> kernel, disabled PF. Neither of these things made any difference and  
> the error is fairly quickly reproducible (within a couple of hours  
> running various things to load the machine). The one thing we did not  
> test yet is removing ZFS from the picture. Note also that this box ran  
> for years and years on FreeBSD 4.x without a hiccup (non PAE, ipfw  
> instead of pf and no ZFS of course).

There are two things.  1) Most people who run witness (that I know of) don't 
run it on spin locks because of the overhead, so LORs of spin locks are less 
well-reported than LORs of other locks (mutexes, rwlocks, etc.).  2) You have 
to have enough load on the box to swap out active processes to get into this 
situation.  Together, I think those two factors explain why this is not more 
widely reported.
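
For illustration only, here is a minimal user-space sketch of the 
deferred-wakeup idea from the quoted message: setrunnable() reports whether 
proc0 needs a kick, and the sleepq caller acts on that only after dropping 
its chain lock.  This is not kernel code and not a committed fix.  The 
struct layouts, the toy lock, and every *_sketch name are hypothetical 
stand-ins that borrow the kernel function names from the discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel thread; only the fields the sketch needs. */
struct thread {
	bool swapped_out;	/* thread's process is swapped out */
	bool runnable;		/* set once the thread has been marked runnable */
};

/* Toy lock: a flag is enough to check lock ordering in a single-threaded sketch. */
struct toy_lock {
	bool held;
};

static struct toy_lock sleepq_chain = { false };	/* stand-in for one sleepqueue chain lock */
static int proc0_kicks;					/* number of times the swapper was poked */

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

/*
 * Mark the thread runnable.  Instead of waking proc0 here, return nonzero
 * to tell the caller that proc0 must be awakened later.
 */
static int
setrunnable_sketch(struct thread *td)
{
	td->runnable = true;
	return (td->swapped_out ? 1 : 0);
}

/* Wake the swapper.  Must only run with no chain lock held. */
static void
kick_proc0_sketch(void)
{
	assert(!sleepq_chain.held);
	proc0_kicks++;
}

/* Called with the chain lock held; just passes the result upward. */
static int
sleepq_resume_thread_sketch(struct thread *td)
{
	assert(sleepq_chain.held);
	return (setrunnable_sketch(td));
}

/* The wakeup path: resume under the lock, kick proc0 only after unlocking. */
static void
wakeup_sketch(struct thread *td)
{
	int wakeup_swapper;

	toy_lock_acquire(&sleepq_chain);
	wakeup_swapper = sleepq_resume_thread_sketch(td);
	toy_lock_release(&sleepq_chain);
	if (wakeup_swapper)
		kick_proc0_sketch();
}
```

The point of the shape is that kick_proc0_sketch() can never run while the 
caller's chain lock is held, so a second chain lock (proc0's thread lock 
while it sleeps) is never acquired underneath the first one.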

> Since I've ordered a replacement machine to go into production now, I  
> am happy to make this one available for whatever testing would benefit  
> the FreeBSD community to track down the problem.
> 
> If useful, we could upgrade this machine to 7 STABLE branch and use  
> the new tools Robert Watson recently wrote to dump better crash logs.  
> Let me know, but I don't know a lot about them yet apart from what I  
> read on this list.
> 
> Regards
> Ari Maniatis
> 
> 
> 
> -------------------------->
> ish
> http://www.ish.com.au
> Level 1, 30 Wilson Street Newtown 2042 Australia
> phone +61 2 9550 5001   fax +61 2 9550 4001
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
> 
> 
> 



-- 
John Baldwin

