vmstat 'b' (disk busy?) field keeps climbing ...
Kostik Belousov
kostikbel at gmail.com
Sat Jun 24 19:56:29 UTC 2006
On Sat, Jun 24, 2006 at 04:45:49PM -0300, Marc G. Fournier wrote:
> On Sat, 24 Jun 2006, Kostik Belousov wrote:
>
> >On Sat, Jun 24, 2006 at 09:52:03PM +0300, Kostik Belousov wrote:
> >>On Sat, Jun 24, 2006 at 02:57:27PM -0300, Marc G. Fournier wrote:
> >>>On Sat, 24 Jun 2006, Kostik Belousov wrote:
> >>>
> >>>>On Sat, Jun 24, 2006 at 11:55:26AM +0400, Dmitry Morozovsky wrote:
> >>>>>On Sat, 24 Jun 2006, Marc G. Fournier wrote:
> >>>>>
> >>>>>MGF> > 'b' stands for "blocked", not "busy". Judging by your page fault rate
> >>>>>MGF> > and the high number of frees and pages being scanned, you're probably
> >>>>>MGF> > swapping tasks in and out and are waiting on disk. Take a look at
> >>>>>MGF> > "vmstat -s", and consider adding more RAM if this is correct...
> >>>>>MGF>
> >>>>>MGF> is there a way of finding out what processes are blocked?
> >>>>>
> >>>>>Aren't they in 'D' status by ps?
> >>>>Use ps axlww. In this way, at least actual blocking points are shown.
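[A minimal sketch of acting on that advice. The field positions assume BSD-style `ps l` output as seen in this thread, where field 9 is the wait channel (MWCHAN) and field 10 the state (STAT); on other systems the layout can differ, so adjust the field numbers.]

```shell
# List processes blocked in uninterruptible (disk) wait: STAT contains "D".
# Assumed layout: $2 = PID, $9 = wait channel, $10 = state.
ps axlww | awk '$10 ~ /^D/ {print $2, $9, $10}'
```

An empty result means nothing is currently stuck in disk wait, which itself is a useful data point.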
> >>>
> >>>'k, stupid question then ... what am I searching for?
> >>>
> >>># ps axlww | awk '{print $9}' | sort | uniq -c | sort -nr
> >>> 654 select
> >>> 230 lockf
> >>> 166 wait
> >>> 85 -
> >>> 80 piperd
> >>> 71 nanslp
> >>> 33 kserel
> >>> 22 user
> >>> 10 pause
> >>> 9 ttyin
> >>> 5 sbwait
> >>> 3 psleep
> >>> 3 accept
> >>> 2 kqread
> >>> 2 Giant
> >>> 1 vlruwt
> >>> 1 syncer
> >>> 1 sdflus
> >>> 1 ppwait
> >>> 1 ktrace
> >>> 1 MWCHAN
> >>>
> >>>According to vmstat, I'm holding at '4 blocked' for the most part ...
> >>>sbwait is socket related, not disk ... and none of the others look
> >>>right ...
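[The swap theory raised earlier in the thread can be checked directly. A sketch: `vmstat -s` prints cumulative VM counters since boot, and the exact counter names vary between systems.]

```shell
# Take two samples a minute or so apart: swap pagein/pageout counters
# that keep growing confirm memory pressure; flat counters rule it out.
vmstat -s | grep -i -E 'swap|page'
```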
> >>Gazing into my big magic crystal ball, I would say your problems are
> >>not kernel-related. I see only two suspicious points:
> >>
> >>1. A high number of pipe readers and of processes waiting on file locks.
> >>That may be normal for your load.
> >>
> >>2. Two Giant holders/waiters. Is this constant? Are the processes
> >>holding/waiting for Giant always the same ones?
> >>
> >>Anyway, if I were in your shoes, I would start by looking at the applications.
> >>
> >>Ah, and does dmesg show anything?
> >
> >And another question: what are the processes in the state "user"?
> >I have never seen that state. Moreover, a search through the sources
> >does not show what it could be.
>
> Odd, I'm not finding any, but I did get a Giant on a grep of the ps
> listing:
>
> pluto# ps axlww | grep " user "
>     0 93055 46540   0  96  0   348  212 Giant  L+    p4    0:00.00 grep " user "
>
> Not sure where those 'user' entries came from, though ... just ran the above again:
>
> # ps axlww | awk '{print $9}' | sort | uniq -c | sort -nr
> 603 select
> 231 lockf
> 71 nanslp
> 33 -
> 30 kserel
> 23 wait
> 9 ttyin
> 9 sbwait
> 7 pause
> 6 accept
> 4 piperd
> 3 psleep
> 3 kqread
> 3 Giant
> 1 syncer
> 1 sdflus
> 1 ppwait
> 1 pgzero
> 1 ktrace
> 1 MWCHAN
>
> And nothing ...
>
> Got a Giant lock on sshd too?
>
> pluto# ps axlww | grep Giant
>     0   693   556   1  96  0  6096 2080 Giant  Ls    ??    0:02.18 sshd: root@ttyp0 (sshd)
>     0 94334 46540   0  96  0   348  208 -      R+    p4    0:00.00 grep Giant
Everything looks normal; transient Giant acquisition/contention is quite
normal, especially while several kernel subsystems are still Giant-locked.
I strongly suggest moving the point of investigation to the application(s)
themselves. The kernel seems to be innocent.
[A deadlock involving the disk driver/Giant/the filesystem would immediately
show up as a HUGE number of processes in the D state, with a completely
different set of wait channels. All your processes are in select, waiting
on file locks, reading from pipes, or doing something threaded.]
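[The deadlock signature described above can be checked mechanically. A hypothetical helper, assuming BSD-style `ps l` fields (10 = STAT); the threshold of 20 is arbitrary and only illustrative.]

```shell
#!/bin/sh
# Count processes stuck in uninterruptible disk wait ("D" in STAT).
# A large, varied population of D-state processes would implicate the
# disk driver/fs/Giant; a count near zero points at the applications.
blocked=$(ps axlww | awk '$10 ~ /^D/' | wc -l)
echo "uninterruptible (D) processes: $blocked"
if [ "$blocked" -gt 20 ]; then
    echo "suspect a disk driver/fs/Giant deadlock"
else
    echo "kernel looks innocent; check the applications"
fi
```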