Nagios SIGSEGV on FreeBSD 8

Tue Sep 15 03:38:41 UTC 2009

I've been running a FreeBSD 8-BETA2 server for DNS on a network I
recently took over.  No problems.  We needed to get Nagios running on
that network to watch all the hosts in RFC 1918 space.  Taking the easy
route, I just installed the Nagios 3.0.6 port on this 8-BETA2 box.

Nagios runs great until someone acknowledges a down host (adding
a comment).  Later, when the host comes back up, Nagios exits on
a SIGSEGV.  It seems to happen only when we have retention data
(retention.dat) showing the host down.  If we just restart Nagios
without removing the retention.dat file, it segfaults the next time
it tries to mark the host up.  I upgraded to the nagios-devel (Nagios
3.1.2) port and we have the same problem.

I'm not good with gdb, but it looks like there are two threads
running.  I can't tell what the other thread is doing, but the one that
SEGVs seems to be trying to remove the comment associated with the
acknowledgement message.

sudo gdb -c /var/coredumps/nagios-52050.core /usr/local/bin/nagios
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...(no debugging symbols found)...
Core was generated by `nagios'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0807fe8b in get_next_comment_by_host ()
[New Thread 28326280 (LWP 100051)]
[New Thread 28301140 (LWP 100222)]
(gdb) bt
#0  0x0807fe8b in get_next_comment_by_host ()
#1  0x08080940 in delete_host_acknowledgement_comments ()
#2  0x28331180 in ?? ()
#3  0x4aaac053 in ?? ()
#4  0x080cc394 in __JCR_LIST__ ()
#5  0x28342f00 in ?? ()
#6  0x00000000 in ?? ()
#7  0xbfbfe858 in ?? ()
#8  0x08071c15 in handle_host_state ()
Previous frame inner to this frame (corrupt stack?)

Here is the code for get_next_comment_by_host:

comment *get_next_comment_by_host(char *host_name, comment *start){
	comment *temp_comment=NULL;

	if(host_name==NULL || comment_hashlist==NULL)
		return NULL;

	if(start==NULL)
		temp_comment=comment_hashlist[hashfunc(host_name,NULL,COMMENT_HASHSLOTS)];
	else
		temp_comment=start->nexthash;

	for(;temp_comment && compare_hashdata(temp_comment->host_name,NULL,host_name,NULL)<0;temp_comment=temp_comment->nexthash);

	if(temp_comment && compare_hashdata(temp_comment->host_name,NULL,host_name,NULL)==0)
		return temp_comment;

	return NULL;
	}

I don't grok the for loop but I'm not much of a C guy.  I think they
obfuscated a while loop there.  I am guessing that if the hashfunc()
and compare_hashdata() calls were an issue, they would show up in the
backtrace?
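
For reference, here is a minimal sketch of that for loop written out as
a while loop.  The struct is a stripped-down stand-in for the real Nagios
comment type, and compare_names() and find_comment_in_chain() are
hypothetical names; the sketch assumes that compare_hashdata() with NULL
second keys orders host names the way strcmp() does.  It is only meant
to show the shape of the walk, not the actual Nagios implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for the Nagios comment struct. */
typedef struct comment_struct {
	char *host_name;
	struct comment_struct *nexthash;
} comment;

/* Assumption: with NULL second keys, compare_hashdata() orders host
   names like strcmp().  This helper is illustrative only. */
static int compare_names(const char *a, const char *b){
	return strcmp(a, b);
	}

/* The same walk as the for loop, written as a while loop.  The chain
   is assumed to be sorted by host name. */
comment *find_comment_in_chain(comment *temp_comment, const char *host_name){

	/* skip every entry whose host name sorts before host_name */
	while(temp_comment != NULL && compare_names(temp_comment->host_name, host_name) < 0)
		temp_comment = temp_comment->nexthash;

	/* either we ran off the end, or we stopped at the first entry
	   that is >= host_name; only an exact match counts */
	if(temp_comment != NULL && compare_names(temp_comment->host_name, host_name) == 0)
		return temp_comment;

	return NULL;
	}
```

Written this way, the only pointer dereferences in the function itself
are temp_comment->host_name and temp_comment->nexthash, so a stale or
already-freed entry in the chain would be one way for the fault to land
in this frame rather than in a callee.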

The reason I ask here is that I haven't found any reports of similar
issues on the Nagios list or elsewhere on Google.  I suspect the issue
may have to do with threads on FreeBSD 8.  I need more clue to figure
out if my suspicions could be correct.

I must be the first sucker to try to run Nagios on FreeBSD 8.  :-)

Thanks,

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert at lambertfam.org