Deadlock in state 'sysctl lock'

Thu Feb 22 22:21:21 UTC 2007

Rink Springer wrote:
> Hi people,
>
> At work, one of our SpamAssassin/ClamAV filtering machines just entered
> a deadlock state:
>
> FreeBSD/i386 (xxx.qsp.nl) (cuad0)
>
> login: root
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
>
> After inspection, I believe the following code in
> kern/kern_sysctl.c:userland_sysctl() is the culprit:
>
>         SYSCTL_LOCK();
>
>         do {
>                 req.oldidx = 0;
>                 req.newidx = 0;
>                 error = sysctl_root(0, name, namelen, &req);
>         } while (error == EAGAIN);
>
>         if (req.lock == REQ_WIRED && req.validlen > 0)
>                 vsunlock(req.oldptr, req.validlen);
>
>         SYSCTL_UNLOCK();
>
> Clearly, should sysctl_root() keep returning EAGAIN, this loop never
> exits and the sysctl lock is never released, so every other caller blocks
> in 'sysctl lock'. It appears this is possible.
>
> The only plausible place where a sysctl handler returns EAGAIN seems to be
> kern/kern_proc.c:sysctl_out_proc(). However, that code returns ESRCH
> if the process could not be found in the first place. Since sysctl_root()
> calls the complete handler function on every iteration, each retry does
> a fresh pfind() and returns ESRCH if that fails, rather than the EAGAIN
> it returns later on in the code path.
>
> The machine is a 6.0-STABLE SMP machine of 30-Mar-2006. No debugging
> options are in the kernel as the machine has quite some load. The only
> console messages were a lot of 'calcru' messages.
>
> Any help is very much appreciated. For now, I'd like to propose a change
> to kern/kern_sysctl.c:userland_sysctl(), to ensure this will never keep
> looping on EAGAIN states (preferably, it should trigger a panic or at
> least a KASSERT should such a condition occur). I know this is a
> band-aid for a problem we don't really understand yet, but it may
> ease debugging later on (especially as it will help us understand where
> exactly things go wrong).
>
> Any comments? It looks to me like this deadlock is quite rare (in fact,
> I've never seen it before), but I believe it is serious enough to be
> addressed, even with such a band-aid, until the real solution is presented
> by someone who knows the sysctl internals better than I do.
>
Interesting.  Twice I have had a 6.2 system stuck where sendmail was 
holding the sysctl lock while another process was holding the proctree 
and/or allproc lock, if I remember correctly.

Guy
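
For illustration, the retry cap proposed in the quoted message could be
sketched in userland C roughly as follows. This is not the kernel code:
fake_handler, fake_handler_run() and bounded_retry() are hypothetical
names invented for this sketch, the handler merely simulates
sysctl_root() returning EAGAIN, and a plain assert() stands in for a
KASSERT. The cap value passed in is likewise an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for a sysctl handler: it reports EAGAIN a fixed
 * number of times and then succeeds. It is NOT the kernel's sysctl_root().
 */
struct fake_handler {
        int eagain_left;        /* EAGAIN results still to hand out */
        int calls;              /* how often the handler was invoked */
};

static int
fake_handler_run(struct fake_handler *h)
{
        h->calls++;
        if (h->eagain_left > 0) {
                h->eagain_left--;
                return (EAGAIN);
        }
        return (0);
}

/*
 * Retry while the handler reports EAGAIN, but stop after max_tries calls
 * and hand the last error back to the caller instead of spinning forever
 * (in the kernel case, with the lock held).
 */
static int
bounded_retry(struct fake_handler *h, int max_tries)
{
        int error, tries;

        assert(max_tries > 0);  /* userland analogue of a KASSERT */
        tries = 0;
        do {
                error = fake_handler_run(h);
        } while (error == EAGAIN && ++tries < max_tries);
        return (error);
}
```

With a handler that reports EAGAIN twice, bounded_retry(&h, 5) returns 0
after three calls. With one that keeps reporting EAGAIN, it gives up
after five calls and returns EAGAIN, at which point real code could log
the offending handler or panic rather than loop indefinitely.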