FreeBSD 6.3 deadlock (vm_map?) with DDB output

Tue Jun 24 21:53:51 UTC 2008

On Monday 23 June 2008 03:16:40 pm James Gritton wrote:
> John Baldwin wrote:
> > On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
> >   
> >> John Baldwin wrote:
> >>     
> >>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
> >>>   
> >>>       
> >>>> I've been trying to track down a deadlock on some newish production
> >>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
> >>>> specific (although mundane) hardware configuration, and each of several
> >>>> servers running this hardware deadlocks about once per week.
> >>>>
> >>>> From a (naive) perusal of the attached stack traces, though, I
> >>>> suspect that this is not hardware related.
> >>>>
> >>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
> >>>> desperate for help. So here's my basic understanding of the deadlock:
> >>>>
> >>>> These processes seem to be waiting on the page queue mutex:
> >>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
> >>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
> >>>>  httpd (in trap > trap_pfault > vm_fault)
> >>>>  [g_up] (in g_vfs_done > bufdone)
> >>>>
> >>>> The page queue mutex is held by rsync process:
> >>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
> >>>>
> >>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
> >>>> page queue lock?
> >>>>
> >>>>
> >>>> Giant is enabled in loader.conf due to the needs of the pf firewall
> >>>> when dealing with user credentials lookups. I do not believe that
> >>>> Giant plays into this deadlock. Kernel config attached.
> >>>>
> >>>> Any and all help or info is welcome. Thanks in advance.
> >>>>     
> >>>>         
> >>> Try this change:
> >>>
> >>> jhb         2007-10-27 22:07:40 UTC
> >>>
> >>>   FreeBSD src repository
> >>>
> >>>   Modified files:
> >>>     sys/kern             sched_4bsd.c
> >>>   Log:
> >>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
> >>>   userland preemption directly from hardclock() via sched_clock() when a
> >>>   thread uses up a full quantum instead of using a periodic timeout to cause
> >>>   a userland preemption every so often.  This fixes a potential deadlock
> >>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
> >>>   by a thread pinned or bound to another CPU.  The current thread on that
> >>>   CPU will never be preempted while softclock is blocked.
> >>>
> >>>   Note that ULE already drives its round-robin userland preemption from
> >>>   sched_clock() as well and always enables IPI_PREEMPT.
> >>>
> >>>   MFC after:      1 week
> >>>
> >>>   Revision  Changes    Path
> >>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
> >>>
> >>> We use it at work on 6.x.  W/o this fix, round-robin stops working on
> >>> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.
> >>>   
> >>>       
> >> I've been seeing similar troubles on 6.2 and I'll have to give this a 
> >> try as we upgrade to 6.3.  I notice "MFC after: 1 week" in the log; it's 
> >> been a week - any chance of seeing this fix rolled into 6.x?
> >>     
> >
> > If people confirm it fixes issues I will MFC it.  There was some pushback
> > when I first committed it so I waited on the MFC.
> 
> I can confirm that on 6.3 I can recreate the deadlock without the patch, 
> and can't recreate it with the patch.

Ok, I've merged it to RELENG_[67].

-- 
John Baldwin