Intermittent system hangs on 7.2-RELEASE-p1
Linda Messerschmidt
linda.messerschmidt at gmail.com
Wed Aug 26 19:03:15 UTC 2009
I'm trying to troubleshoot an intermittent Apache performance problem,
and I've narrowed it down using to what appears to be a brief
whole-system hang that last from 0.5 - 3 seconds. They occur every
few minutes.
I took the rather extreme step of doing "ktrace -t cnisuwt -i -d -p 1"
and then I waited for the hang.
This is what I got:
54937 httpd 1251302859.375313 CALL shutdown(0x3,<invalid=1>)
54937 httpd 1251302859.375333 RET shutdown 0
54937 httpd 1251302859.375348 CALL select(0x4,0xbfbfe92c,0,0,0xbfbfe9ac)
54937 httpd 1251302859.375363 CSW stop kernel
54937 httpd 1251302859.376402 CSW resume kernel
54937 httpd 1251302859.376439 RET select 1
54937 httpd 1251302859.376453 CALL read(0x3,0xbfbfe9b4,0x200)
54937 httpd 1251302859.376470 GIO fd 3 read 0 bytes
54937 httpd 1251302859.376482 RET read 0
54937 httpd 1251302859.376495 CALL close(0x3)
54937 httpd 1251302859.376511 RET close 0
54937 httpd 1251302859.376525 CALL sigaction(SIGUSR1,0xbfbfebb0,0xbfbfeb98)
54937 httpd 1251302859.376538 RET sigaction 0
54937 httpd 1251302859.376552 CALL munmap(0x282ff000,0x11)
54937 httpd 1251302859.376607 RET munmap 0
54937 httpd 1251302859.376633 CALL accept(0x11,0xbfbfebf0,0xbfbfec10)
54937 httpd 1251302859.376649 CSW stop kernel
796 svscan 1251302859.481064 CSW resume kernel
54937 httpd 1251302859.489374 CSW resume kernel
54937 httpd 1251302859.489391 STRU struct sockaddr { AF_INET,
172.17.0.143:61610 }
98229 httpd 1251302859.601850 CSW resume kernel
46517 httpd 1251302859.601900 CSW resume kernel
98202 httpd 1251302859.611661 CSW resume kernel
837 nrpe2 1251302859.622681 CSW resume kernel
54454 httpd 1251302859.655422 CSW resume kernel
54454 httpd 1251302859.655443 STRU struct sockaddr { AF_INET,
172.17.0.131:59011 }
7182 httpd 1251302859.722381 CSW resume kernel
98178 httpd 1251302859.722438 CSW resume kernel
858 gmond 1251302859.794996 CSW resume kernel
858 gmond 1251302859.794998 GIO fd 5 wrote 0 bytes
770 ntpd 1251302860.076501 CSW resume kernel
98346 httpd 1251302860.086261 CSW resume kernel
65277 httpd 1251302860.086300 CSW resume kernel
98514 httpd 1251302860.106849 CSW resume kernel
7191 httpd 1251302860.106894 CSW resume kernel
796 svscan 1251302861.403335 RET nanosleep 0
796 svscan 1251302861.403370 CALL wait4(0xffffffff,0xbfbfee18,WNOHANG,0)
796 svscan 1251302861.403405 RET wait4 0
54454 httpd 1251302861.403481 RET accept 3
98229 httpd 1251302861.403532 RET select 0
796 svscan 1251302861.403553 CALL stat(0x804a3bb,0xbfbfed6c)
858 gmond 1251302861.403601 GIO fd 5 read 20 bytes
54454 httpd 1251302861.403619 CSW stop user
46517 httpd 1251302861.403647 RET select 0
858 gmond 1251302861.403674 RET kevent 1
858 gmond 1251302861.403710 CALL socket(PF_INET,SOCK_DGRAM,IPPROTO_IP)
98202 httpd 1251302861.403714 RET select 0
858 gmond 1251302861.403752 RET socket 9
837 nrpe2 1251302861.403756 RET select 0
There is a gap between 1251302860.106894 and 1251302861.403335 of over
one second, and the "effective gap" starts around 1251302859.376649
and thus lasts for about two seconds.
This machine runs Apache and during this sample it was being hit every
0.1 seconds with a test request for a simple static file (in addition
to production traffic). It is a 2-processor machine that is 85-95%
idle; there's nothing in userspace that runs that long without
yielding. According to systat, it handles 5000+ syscalls every
second. But according to ktrace, nothing happens at all during the
hang. This matches user experience. (The static file request, which
usually completes in <0.01s suddenly takes 2 seconds as observed from
the remote machine issuing the requests.)
Here's the relevant snip from the httpd process handling that static
file at the time of the hang:
54937 httpd 1251302859.376633 CALL accept(0x11,0xbfbfebf0,0xbfbfec10)
54937 httpd 1251302859.376649 CSW stop kernel
54937 httpd 1251302859.489374 CSW resume kernel
54937 httpd 1251302859.489391 STRU struct sockaddr { AF_INET,
172.17.0.143:61610 }
54937 httpd 1251302861.403862 RET accept 3
It's stuck in accept, but does *not* get context-switched away from
during the delay. (The earlier context switch corresponds to the 0.1
seconds between requests; there is an Apache instance configured to
handle just the test requests with one child process; that process has
nothing else to do or block on.)
I'll include some other processes below.
I think it's weird that all these processes get context-switched-into
before/during the hang, and I wonder if it's a clue. The kernel is
obviously still running, since it wakes these processes up, but
nothing is happening. That and the fact that it happens on multiple
machines (though we've only tested this one) leads me away from
suspecting the hardware. Are we out of a kernel resource or is
something getting garbage collected / stuck / temporarily locked
inside the kernel somewhere?
This is a production server, so our debugging options are very
limited. We also can't recreate the problem on a test server. So KTR
and/or the kernel debugger are probably out. There's nothing in
syslog or dmesg or on the console.
Could anyone suggest away to gather more information about this, given
our constraints?
More samples:
Here's an Apache "parent" process; its hang is in select(). The
immediately following select() call demonstrates what the middle one
should also have done:
98178 httpd 1251302858.703733 CALL select(0,0,0,0,0xbfbfed3c)
98178 httpd 1251302858.703789 CSW stop kernel
98178 httpd 1251302859.722438 CSW resume kernel
98178 httpd 1251302861.403908 RET select 0
98178 httpd 1251302861.404118 CALL gettimeofday(0xbfbfec58,0)
98178 httpd 1251302861.404453 RET gettimeofday 0
98178 httpd 1251302861.404478 CALL wait4(0xffffffff,0xbfbfed80,WNOHANG,0)
98178 httpd 1251302861.405288 RET wait4 0
98178 httpd 1251302861.405576 CALL select(0,0,0,0,0xbfbfed3c)
98178 httpd 1251302861.405603 CSW stop kernel
98178 httpd 1251302862.417549 CSW resume kernel
98178 httpd 1251302862.417581 RET select 0
Same behavior from nrpe2:
837 nrpe2 1251302859.107587 CALL
select(0x5,0xbfbfe16c,0,0xbfbfe16c,0xbfbfe20c)
837 nrpe2 1251302859.107607 CSW stop kernel
837 nrpe2 1251302859.622681 CSW resume kernel
837 nrpe2 1251302861.403756 RET select 0
Here's a production Apache child process. Also hangs in accept(), but
the delay does happen inside the context switch:
55062 httpd 1251302859.044661 CALL accept(0x12,0xbfbfec00,0xbfbfec20)
55062 httpd 1251302859.044681 CSW stop kernel
55062 httpd 1251302861.792647 CSW resume kernel
55062 httpd 1251302861.792662 STRU struct sockaddr { AF_INET,
172.17.0.131:59093 }
55062 httpd 1251302861.792880 RET accept 3
gmond hangs in kevent() on some kind of I/O:
858 gmond 1251302859.233805 CALL
kevent(0x5,0,0,0x2833d960,0x2,0xbfbfe344)
858 gmond 1251302859.233820 CSW stop kernel
858 gmond 1251302859.794996 CSW resume kernel
858 gmond 1251302859.794998 GIO fd 5 wrote 0 bytes
858 gmond 1251302861.403601 GIO fd 5 read 20 bytes
858 gmond 1251302861.403674 RET kevent 1
svscan hangs in nanosleep(). It sleeps for 5 seconds, not 7.
796 svscan 1251302854.429189 CALL lseek(0x3,0,SEEK_SET,0) 796
svscan 1251302854.429269 RET lseek 0
796 svscan 1251302854.429346 CALL close(0x3)
796 svscan 1251302854.429709 RET close 0
796 svscan 1251302854.429894 CALL nanosleep(0xbfbfee18,0xbfbfee10)
796 svscan 1251302854.430033 CSW stop kernel
796 svscan 1251302859.481064 CSW resume kernel
796 svscan 1251302861.403335 RET nanosleep 0
796 svscan 1251302861.403370 CALL wait4(0xffffffff,0xbfbfee18,WNOHANG,0)
796 svscan 1251302861.403405 RET wait4 0
796 svscan 1251302861.403553 CALL stat(0x804a3bb,0xbfbfed6c)
796 svscan 1251302861.404027 NAMI "."
ntpd shows the same behavior, except it wakes up by alarm. Again, the
normal behavior (no gap between CSW and RET) is shown right after:
770 ntpd 1251302859.078305 CALL select(0x3b,0xbfbfed7c,0,0,0)
770 ntpd 1251302859.078312 CSW stop kernel
770 ntpd 1251302860.076501 CSW resume kernel
770 ntpd 1251302861.403952 RET select -1 errno 4 Interrupted
system call
770 ntpd 1251302861.404168 PSIG SIGALRM caught
handler=0x806ad90 mask=0x0 code=0x0
770 ntpd 1251302861.404191 CALL sigreturn(0xbfbfea50)
770 ntpd 1251302861.404553 RET sigreturn JUSTRETURN
770 ntpd 1251302861.404890 CALL select(0x3b,0xbfbfed7c,0,0,0)
770 ntpd 1251302861.405955 CSW stop kernel
770 ntpd 1251302862.077114 CSW resume kernel
770 ntpd 1251302862.077171 RET select -1 errno 4 Interrupted
system call
There were other processes running, mainly more httpds, but I went
through each and every one of them and I don't see anything that I
haven't already shown an example of.
Thanks very much for any advice!
More information about the freebsd-hackers
mailing list