More data on 7.2-RELEASE "hangs"

Wed May 13 17:44:56 UTC 2009

On Wed, 13 May 2009, John Baldwin wrote:

> Well, you had a whole lot of page faults and other VM activity, plus 500k
> syscalls.  The 'w' is a count of swapped processes, so basically your box is
> swapping a whole lot it seems.  I think your box is just overloaded.

I knew I was going to regret posting that :(

What I posted was what vmstat 5 shows after the issue *starts*, not what 
it normally looks like ... right now, after 10 hours of uptime, with all 
the same processes running, it looks like:

io# vmstat 5 (10 hours uptime now)
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  0 1 0  10477M   301M  3503  13   1   2  3620 286   0   0  331 45491 4566 26  8 66
  0 1 0  10430M   305M   278   7   0   0   550   0  18   0  186 19243 2917 4  3 93
  1 1 0  10474M   295M   511   0   0   0   359   0  91   0  253 11632 3516 7  3 90
  0 1 0  10447M   310M   819   3   0   0  1473   0  14   0  143 29575 2486 8  3 89
  0 1 0  10558M   295M  5008  18  13   5  4128   0 121   0  345 24212 4215 16  7 77

Right now, IO is running ~775 processes ... at the time of the vmstat I 
provided earlier, it was up to 1400 processes ... since there are only 5 
minutes between script runs, something is causing it to go from zero swap 
-> high swap within a very short period of time, but since things get 
badly locked up when it happens, I can't isolate where ...
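
One way to catch the moment it tips over would be to log vmstat output 
continuously and flag the samples where swapping starts. What follows is 
only a minimal sketch: the column order is taken from the vmstat headers 
in this thread (two disks, da0 and pa0), and the names parse_row and 
paging_samples and the threshold values are hypothetical, not anything 
vmstat itself provides.

```python
# Column order as printed by the vmstat headers in this thread (two disks).
# The second "sy" (CPU system time) is renamed cpu_sy to avoid a key clash.
COLUMNS = ["r", "b", "w", "avm", "fre", "flt", "re", "pi", "po", "fr", "sr",
           "da0", "pa0", "in", "sy", "cs", "us", "cpu_sy", "id"]


def parse_row(line):
    """Return a dict for one vmstat data row, or None for header/blank lines."""
    fields = line.split()
    if len(fields) != len(COLUMNS) or not fields[0].isdigit():
        return None
    row = {}
    for name, value in zip(COLUMNS, fields):
        # avm and fre carry a unit suffix such as "301M"; keep them as text.
        row[name] = value if name in ("avm", "fre") else int(value)
    return row


def paging_samples(lines, max_w=0, max_pi=50, max_po=50):
    """Yield rows whose swapped-process count or page-in/page-out rate
    exceeds the given limits (the defaults are arbitrary placeholders)."""
    for line in lines:
        row = parse_row(line)
        if row and (row["w"] > max_w or row["pi"] > max_pi or row["po"] > max_po):
            yield row


# Two data rows copied from the vmstat output in this message.
sample = """\
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  0 1 0  10477M   301M  3503  13   1   2  3620 286   0   0  331 45491 4566 26  8 66
  3 14 1   6817M   114M  641   7   3   1 1036 386   0   0 1109  464 157  5  5 90
"""

flagged = list(paging_samples(sample.splitlines()))
for row in flagged:
    print(row["w"], row["pi"], row["po"], row["fre"])
```

Fed from a running vmstat log, the first flagged sample would give a 
timestamp to line up against the ps snapshots.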

I've got the following two ps outputs at the time of the high paging:

/bin/ps -aucxHl -O jid > ps-long.out
/bin/ps -aux -O jid > ps-short.out

Is there anything in there that I could look at as far as what is putting 
things over the edge?
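
One thing that could be pulled out of those files is resident memory 
summed per jail, to see whether a single jail accounts for the jump. The 
following is a rough sketch only: it locates columns by name from the 
header row rather than assuming a fixed layout, it assumes no column 
before COMMAND contains embedded spaces, the helper name rss_by_group is 
hypothetical, and the sample rows are made up for illustration (they are 
not taken from ps-short.out).

```python
from collections import defaultdict


def rss_by_group(ps_text, group_col="JID", rss_col="RSS"):
    """Sum the RSS column per value of group_col.

    Columns are found by name in the header row, so the exact column
    order of the ps output does not matter.
    """
    lines = [line for line in ps_text.splitlines() if line.strip()]
    header = lines[0].split()
    group_idx = header.index(group_col)
    rss_idx = header.index(rss_col)
    totals = defaultdict(int)
    for line in lines[1:]:
        # Split at most len(header)-1 times so COMMAND keeps its spaces.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip truncated or malformed rows
        totals[fields[group_idx]] += int(fields[rss_idx])
    # Largest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows, for illustration only.
sample_ps = """\
USER   PID JID %CPU %MEM   VSZ   RSS TT STAT STARTED    TIME COMMAND
www    101   3  0.0  1.2 52000 24000 ?? S    7:40AM  0:01.20 httpd -k start
www    102   3  0.0  1.1 52000 22000 ?? S    7:40AM  0:01.10 httpd -k start
pgsql  210   7  0.5  0.4 30000  9000 ?? Ss   7:41AM  0:03.00 postgres: writer process
root     1   0  0.0  0.0   700   300 ?? ILs  7:39AM  0:00.01 /sbin/init
"""

totals = rss_by_group(sample_ps)
for jid, rss in totals:
    print(jid, rss)
```

Running the same summary over a snapshot taken during normal operation 
and over one taken during the high paging would show which jail, if any, 
grew between the two.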

====

As to the 'overloaded server', here is another server, with more running 
on it, but the exact same configuration:

neptune# vmstat 5 (3 days, 18 hours uptime now)
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  0 0 0  12521M   303M  3969  15   5   3  2271 1603   0   0  444 6491 5165 37 19 44
  0 0 0  12464M   309M  3009   1   0  15  2833   0 104   0  296 9378 3689  7  5 88
23 0 0  12476M   297M  3845   3   0   0  2627   0  31   0  279 10545 2986 14  5 81
  0 1 0  12530M   266M  5259   0   1   0  2551   0 145   0  432 18070 4133 45  8 47
  1 0 0  12587M   237M  7049   0   1   0  4484   0 171   0  357 15953 4715 29  7 64

So, normally these servers purr ... and are highly responsive ...

In fact, here is an older 32-bit server, with less RAM, running about 50% 
more processes than neptune:

mercury# vmstat 5
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre  flt  re  pi  po  fr  sr da0 pa0   in   sy  cs us sy id
  3 14 1   6817M   114M  641   7   3   1 1036 386   0   0 1109  464 157  5  5 90
  0 8 0   6817M   224M  596  33   0   5 5667 3850  86   0 1303 5768 3885  6 7 87
  1 10 0   6824M   220M 4332  32   2   0 3228   0  17   0  755 9689 3057  8 7 85
  0 9 0   6798M   219M  430   0   0   0 712   0  12   0 1274 4276 3877  2  2 95
  0 11 0   6830M   205M 1026   4   1   3 481   0  84   0 1503 5586 4370  6 4 89

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy at hub.org                              MSN . scrappy at hub.org
Yahoo . yscrappy               Skype: hub.org        ICQ . 7615664