Intermittent system hangs on 7.2-RELEASE-p1

Wed Aug 26 19:03:15 UTC 2009

I'm trying to troubleshoot an intermittent Apache performance problem,
and I've narrowed it down using to what appears to be a brief
whole-system hang that last from 0.5 - 3 seconds.  They occur every
few minutes.

I took the rather extreme step of doing "ktrace -t cnisuwt -i -d -p 1"
and then I waited for the hang.

This is what I got:

 54937 httpd    1251302859.375313 CALL  shutdown(0x3,<invalid=1>)
 54937 httpd    1251302859.375333 RET   shutdown 0
 54937 httpd    1251302859.375348 CALL  select(0x4,0xbfbfe92c,0,0,0xbfbfe9ac)
 54937 httpd    1251302859.375363 CSW   stop kernel
 54937 httpd    1251302859.376402 CSW   resume kernel
 54937 httpd    1251302859.376439 RET   select 1
 54937 httpd    1251302859.376453 CALL  read(0x3,0xbfbfe9b4,0x200)
 54937 httpd    1251302859.376470 GIO   fd 3 read 0 bytes
 54937 httpd    1251302859.376482 RET   read 0
 54937 httpd    1251302859.376495 CALL  close(0x3)
 54937 httpd    1251302859.376511 RET   close 0
 54937 httpd    1251302859.376525 CALL  sigaction(SIGUSR1,0xbfbfebb0,0xbfbfeb98)
 54937 httpd    1251302859.376538 RET   sigaction 0
 54937 httpd    1251302859.376552 CALL  munmap(0x282ff000,0x11)
 54937 httpd    1251302859.376607 RET   munmap 0
 54937 httpd    1251302859.376633 CALL  accept(0x11,0xbfbfebf0,0xbfbfec10)
 54937 httpd    1251302859.376649 CSW   stop kernel
   796 svscan   1251302859.481064 CSW   resume kernel
 54937 httpd    1251302859.489374 CSW   resume kernel
 54937 httpd    1251302859.489391 STRU  struct sockaddr { AF_INET,
172.17.0.143:61610 }
 98229 httpd    1251302859.601850 CSW   resume kernel
 46517 httpd    1251302859.601900 CSW   resume kernel
 98202 httpd    1251302859.611661 CSW   resume kernel
   837 nrpe2    1251302859.622681 CSW   resume kernel
 54454 httpd    1251302859.655422 CSW   resume kernel
 54454 httpd    1251302859.655443 STRU  struct sockaddr { AF_INET,
172.17.0.131:59011 }
  7182 httpd    1251302859.722381 CSW   resume kernel
 98178 httpd    1251302859.722438 CSW   resume kernel
   858 gmond    1251302859.794996 CSW   resume kernel
   858 gmond    1251302859.794998 GIO   fd 5 wrote 0 bytes
   770 ntpd     1251302860.076501 CSW   resume kernel
 98346 httpd    1251302860.086261 CSW   resume kernel
 65277 httpd    1251302860.086300 CSW   resume kernel
 98514 httpd    1251302860.106849 CSW   resume kernel
  7191 httpd    1251302860.106894 CSW   resume kernel
   796 svscan   1251302861.403335 RET   nanosleep 0
   796 svscan   1251302861.403370 CALL  wait4(0xffffffff,0xbfbfee18,WNOHANG,0)
   796 svscan   1251302861.403405 RET   wait4 0
 54454 httpd    1251302861.403481 RET   accept 3
 98229 httpd    1251302861.403532 RET   select 0
   796 svscan   1251302861.403553 CALL  stat(0x804a3bb,0xbfbfed6c)
   858 gmond    1251302861.403601 GIO   fd 5 read 20 bytes
 54454 httpd    1251302861.403619 CSW   stop user
 46517 httpd    1251302861.403647 RET   select 0
   858 gmond    1251302861.403674 RET   kevent 1
   858 gmond    1251302861.403710 CALL  socket(PF_INET,SOCK_DGRAM,IPPROTO_IP)
 98202 httpd    1251302861.403714 RET   select 0
   858 gmond    1251302861.403752 RET   socket 9
   837 nrpe2    1251302861.403756 RET   select 0

There is a gap between 1251302860.106894 and 1251302861.403335 of over
one second, and the "effective gap" starts around 1251302859.376649
and thus lasts for about two seconds.

This machine runs Apache and during this sample it was being hit every
0.1 seconds with a test request for a simple static file (in addition
to production traffic).  It is a 2-processor machine that is 85-95%
idle; there's nothing in userspace that runs that long without
yielding.  According to systat, it handles 5000+ syscalls every
second.  But according to ktrace, nothing happens at all during the
hang.  This matches user experience.  (The static file request, which
usually completes in <0.01s suddenly takes 2 seconds as observed from
the remote machine issuing the requests.)

Here's the relevant snip from the httpd process handling that static
file at the time of the hang:

 54937 httpd    1251302859.376633 CALL  accept(0x11,0xbfbfebf0,0xbfbfec10)
 54937 httpd    1251302859.376649 CSW   stop kernel
 54937 httpd    1251302859.489374 CSW   resume kernel
 54937 httpd    1251302859.489391 STRU  struct sockaddr { AF_INET,
172.17.0.143:61610 }
 54937 httpd    1251302861.403862 RET   accept 3

It's stuck in accept, but does *not* get context-switched away from
during the delay.  (The earlier context switch corresponds to the 0.1
seconds between requests; there is an Apache instance configured to
handle just the test requests with one child process; that process has
nothing else to do or block on.)

I'll include some other processes below.

I think it's weird that all these processes get context-switched-into
before/during the hang, and I wonder if it's a clue.  The kernel is
obviously still running, since it wakes these processes up, but
nothing is happening.  That and the fact that it happens on multiple
machines (though we've only tested this one) leads me away from
suspecting the hardware.   Are we out of a kernel resource or is
something getting garbage collected / stuck / temporarily locked
inside the kernel somewhere?

This is a production server, so our debugging options are very
limited.  We also can't recreate the problem on a test server.  So KTR
and/or the kernel debugger are probably out.  There's nothing in
syslog or dmesg or on the console.

Could anyone suggest away to gather more information about this, given
our constraints?

More samples:

Here's an Apache "parent" process; its hang is in select().  The
immediately following select() call demonstrates what the middle one
should also have done:

 98178 httpd    1251302858.703733 CALL  select(0,0,0,0,0xbfbfed3c)
 98178 httpd    1251302858.703789 CSW   stop kernel
 98178 httpd    1251302859.722438 CSW   resume kernel
 98178 httpd    1251302861.403908 RET   select 0
 98178 httpd    1251302861.404118 CALL  gettimeofday(0xbfbfec58,0)
 98178 httpd    1251302861.404453 RET   gettimeofday 0
 98178 httpd    1251302861.404478 CALL  wait4(0xffffffff,0xbfbfed80,WNOHANG,0)
 98178 httpd    1251302861.405288 RET   wait4 0
 98178 httpd    1251302861.405576 CALL  select(0,0,0,0,0xbfbfed3c)
 98178 httpd    1251302861.405603 CSW   stop kernel
 98178 httpd    1251302862.417549 CSW   resume kernel
 98178 httpd    1251302862.417581 RET   select 0

Same behavior from nrpe2:

   837 nrpe2    1251302859.107587 CALL
select(0x5,0xbfbfe16c,0,0xbfbfe16c,0xbfbfe20c)
   837 nrpe2    1251302859.107607 CSW   stop kernel
   837 nrpe2    1251302859.622681 CSW   resume kernel
   837 nrpe2    1251302861.403756 RET   select 0

Here's a production Apache child process. Also hangs in accept(), but
the delay does happen inside the context switch:

 55062 httpd    1251302859.044661 CALL  accept(0x12,0xbfbfec00,0xbfbfec20)
 55062 httpd    1251302859.044681 CSW   stop kernel
 55062 httpd    1251302861.792647 CSW   resume kernel
 55062 httpd    1251302861.792662 STRU  struct sockaddr { AF_INET,
172.17.0.131:59093 }
 55062 httpd    1251302861.792880 RET   accept 3

gmond hangs in kevent() on some kind of I/O:

   858 gmond    1251302859.233805 CALL
kevent(0x5,0,0,0x2833d960,0x2,0xbfbfe344)
   858 gmond    1251302859.233820 CSW   stop kernel
   858 gmond    1251302859.794996 CSW   resume kernel
   858 gmond    1251302859.794998 GIO   fd 5 wrote 0 bytes
   858 gmond    1251302861.403601 GIO   fd 5 read 20 bytes
   858 gmond    1251302861.403674 RET   kevent 1

svscan hangs in nanosleep().  It sleeps for 5 seconds, not 7.

   796 svscan   1251302854.429189 CALL  lseek(0x3,0,SEEK_SET,0)   796
svscan   1251302854.429269 RET   lseek 0
   796 svscan   1251302854.429346 CALL  close(0x3)
   796 svscan   1251302854.429709 RET   close 0
   796 svscan   1251302854.429894 CALL  nanosleep(0xbfbfee18,0xbfbfee10)
   796 svscan   1251302854.430033 CSW   stop kernel
   796 svscan   1251302859.481064 CSW   resume kernel
   796 svscan   1251302861.403335 RET   nanosleep 0
   796 svscan   1251302861.403370 CALL  wait4(0xffffffff,0xbfbfee18,WNOHANG,0)
   796 svscan   1251302861.403405 RET   wait4 0
   796 svscan   1251302861.403553 CALL  stat(0x804a3bb,0xbfbfed6c)
   796 svscan   1251302861.404027 NAMI  "."

ntpd shows the same behavior, except it wakes up by alarm.  Again, the
normal behavior (no gap between CSW and RET) is shown right after:

   770 ntpd     1251302859.078305 CALL  select(0x3b,0xbfbfed7c,0,0,0)
   770 ntpd     1251302859.078312 CSW   stop kernel
   770 ntpd     1251302860.076501 CSW   resume kernel
   770 ntpd     1251302861.403952 RET   select -1 errno 4 Interrupted
system call
   770 ntpd     1251302861.404168 PSIG  SIGALRM caught
handler=0x806ad90 mask=0x0 code=0x0
   770 ntpd     1251302861.404191 CALL  sigreturn(0xbfbfea50)
   770 ntpd     1251302861.404553 RET   sigreturn JUSTRETURN
   770 ntpd     1251302861.404890 CALL  select(0x3b,0xbfbfed7c,0,0,0)
   770 ntpd     1251302861.405955 CSW   stop kernel
   770 ntpd     1251302862.077114 CSW   resume kernel
   770 ntpd     1251302862.077171 RET   select -1 errno 4 Interrupted
system call

There were other processes running, mainly more httpds, but I went
through each and every one of them and I don't see anything that I
haven't already shown an example of.

Thanks very much for any advice!