massive load average spikes

Wed Aug 11 21:43:47 UTC 2010

> load average is a time averaged thing and in the case of a
> 'thundering herd' problem you will see the LA spike up and
> come down again over time.
>
> Do you see any problem as a result of this? Or is it just curiosity?
>
> you might want to use KTR or ktrace with scheduling events if you
> really want to see the reason for this. It could just be a sampling
> error when some 'tick' coincides with the sampling..
>
>
I have not seen any noticeable performance degradation when the LA spikes like this; the
main nuisance has been Sendmail's behaviour.  I have since set the options "RefuseLA=0"
and "QueueLA=0" to avoid long stretches of SMTP being unavailable while the load average
settles back down.
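
For reference, the "time averaged" behaviour described in the quoted text can be sketched
as a simple exponentially damped average.  This is only an illustration, not the kernel's
actual code: the 5-second sampling interval, the 60-second window and the burst of 50
runnable processes are all assumptions picked for the example.

```python
import math

SAMPLE_INTERVAL = 5.0   # seconds between run-queue samples (assumed)
WINDOW = 60.0           # averaging window for the "1-minute" figure


def simulate_load(run_queue_samples, window=WINDOW, interval=SAMPLE_INTERVAL):
    """Return the damped average after each sample of the run-queue length."""
    decay = math.exp(-interval / window)
    load = 0.0
    history = []
    for runnable in run_queue_samples:
        # Old value fades by `decay`; the new sample contributes the remainder.
        load = load * decay + runnable * (1.0 - decay)
        history.append(load)
    return history


if __name__ == "__main__":
    # Hypothetical workload: idle, one sample that happens to catch
    # 50 runnable processes (a "thundering herd"), then idle again.
    samples = [0] * 3 + [50] + [0] * 12
    for step, load in enumerate(simulate_load(samples)):
        print(f"t={step * SAMPLE_INTERVAL:5.0f}s  load={load:6.2f}")
```

Under these assumptions a single unlucky sample lifts the figure to roughly 4, and it
then takes about a minute to fall to a third of that, even though nothing was runnable
for the rest of the time.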

At this point it is really just a nagging feeling that something is misbehaving and that
it will bite me when I least expect it (it always does!).  I would like to track down the
source of the problem, but I'm not even sure where to begin looking.

I have run ktrace on sendmail and dovecot but did not see anything that stood out,
although I doubt I would recognize the problem in a kdump anyway (too much information!).
I'm not at all familiar with KTR, however.  Is it something that can be run on a
production host, or should it be confined to a dev box?  I have cloned the jail into a
dev environment on identical hardware, but I only see the issue in production.  I'm not
sure whether that is down to insufficient load or just not enough random strangeness
outside of production.
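
One way to cut the kdump output down to something readable might be to count system
calls per command instead of reading the trace line by line.  The sketch below is a
hypothetical helper, not part of ktrace or kdump: it assumes the default kdump line layout
of "pid command CALL name(args)", and the file name summarize_kdump.py is made up.

```python
import re
import sys
from collections import Counter

# Assumed kdump layout: "<pid> <command> CALL  <syscall>(<args>)".
CALL_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+CALL\s+([A-Za-z_]\w*)")


def count_syscalls(lines):
    """Count CALL records per (command, syscall) pair; other records are ignored."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line)
        if match:
            _pid, command, syscall = match.groups()
            counts[(command, syscall)] += 1
    return counts


if __name__ == "__main__":
    if sys.stdin.isatty():
        print("usage: kdump -f ktrace.out | python3 summarize_kdump.py")
    else:
        # Show the 20 most frequent (command, syscall) pairs.
        for (command, syscall), n in count_syscalls(sys.stdin).most_common(20):
            print(f"{n:8d}  {command:<16} {syscall}")
```

Comparing the summary from a trace taken during a spike with one taken during a quiet
period may show whether any call is suddenly much more frequent, which at least narrows
down where to look.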

Any suggestions for how KTR might help pin this down or what to look for?