8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load

Thu Dec 10 17:09:04 UTC 2009

We have a Nagios server (ports/net-mgmt/nagios) that has a lot of
check_nrpe2 (ports/net-mgmt/nrpe2) checks.

We recently upgraded the server it runs on to 8.0-STABLE (r199975).
The performance has never been great, but now it's really atrocious
and I'm trying to figure out what's going on.

The machine (a dual-core Nehalem) has a load average of 5 - 10 at all
times, and "top" shows 100% CPU usage, 75% "system" CPU usage.  No
process has more than a few % CPU though.

This is due to the large number of very short-lived processes doing
individual Nagios checks that don't live long enough to appear in top.

I investigated in some more detail with ktrace and found that each
execution of check_nrpe2 performs 520 stat() calls.  The bulk of them
look like this:

 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  "/usr/share/nls/C/libc.cat"
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  "/usr/share/nls/libc/C"
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  "/usr/local/share/nls/C/libc.cat"
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  "/usr/local/share/nls/libc/C"
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)

kdump also shows 70 calls to getpid, which seems excessive.  (About 50
of them appear to be in a tight loop.)

The check_nrpe2 program simply opens an SSL socket to a remote server,
sends a short request and gets a short response.  It is a pretty
simple program. (~22k of source)

The calls to getpid() bother me a bit, but I think the NLS is the real problem:

$ kdump -E -t n | fgrep /nls/ | head -1
 81915 check_nrpe2 0.016815 NAMI  "/usr/share/nls/C/libc.cat"
$ kdump -E -t n | fgrep /nls/ | tail -1
 81915 check_nrpe2 0.135663 NAMI  "/usr/local/share/nls/libc/C"
$ kdump -E | tail -1
 81915 check_nrpe2 0.222510 CALL  exit(0x1)
$ kdump -E -t n | fgrep /nls/ | wc
     508    2540   32004

So this program spends over half its life looping over 508 stat()
calls looking for a nonexistent libc.cat file.  And then another chunk
(probably a lot smaller, but not measured) looping over getpid().

Both appear to be related to SSL; if I set up nrpe not to use it, both
excesses go away and the program finishes in about half the time,
using about half the CPU resources.

To confirm that it was SSL-related, I tried:

$ ktrace openssl s_client -connect x2:5666

And I got the exact same stat() & getpid() behavior.

Obviously there is some small CPU overhead associated with SSL.  This
is not about that.  This is about the system overhead induced by
calling stat 500+ times on a directory that doesn't exist.

This gets a little worse.  Because there are several checks running at
any given time, there is a lot of contention to VFS lookup this
handful of paths.  That's an area where FreeBSD has known SMP
performance issues I've seen discussed elsewhere, and this is a
pathological worst case.  The net result, a dual core machine is
brought to its knees by a relatively simple Nagios setup.

Anyway, long story short, why is OpenSSL doing this and how can we make it stop?

Thanks for any suggestions!