Random crash and/or reboots

Jack L. Stone jackstone at sage-one.net
Sun Sep 7 08:29:04 PDT 2003


Mail server: 4.8-RELEASE-p3

A while back, on a couple of occasions, I posted a query about some bad
behavior on my mail server. For the past several months, it has been either
crashing and rebooting or just rebooting. It's almost ALWAYS triggered by an
SSH login, but at random and ONLY at the "su" to root. The longest it
usually goes between reboots is about 2+ weeks, but I have also had 2 in a
row right after a reboot, so there is really no pattern. It has never
happened while working directly at the console.

I have replaced every single piece of hardware, e.g., PSU, cables, NICs,
and finally swapped out the whole machine, except for the hard disk that
contains the system. That had to go into the new machine. Since then, I
have also moved the entire system & contents to another new HD. Thus, I
concluded it to be a software problem.

There are no indications of anything in the logs, and no core dumps. It
just stops and reboots, at any random time it picks. Only a couple of times
has it crashed without a remote login.

One tip was that I might have stale NFS mounttabs -- I cleared them out, but
the problem persisted.

The above tip was suggested when I mentioned that on at least a couple of
the occurrences, I managed to get to the console quickly enough to see (in
bright bold) "lockmgr locking against myself" -- or close to that. Googling
that error does turn up mentions of stale mounts, but mostly esoteric
code discussion. No fix found anywhere.

Then, on this list, I saw the thread about others having mysterious reboots,
and one suggestion was to run lsof(8) in a continuous loop so that a log
of open files would be captured when these reboots occurred. I have
captured 6 of these logs. I don't see anything that jumps out as being a
common file problem. I have placed 6 text files at the URLs below, each
containing only the last 300 lines of a log, which should be enough info for
a comparison. (I let the logs grow to 200MB before restarting the lsof loop
each time. Each sample ends at the moment of the crash/reboot and holds the
300 lines logged just before that moment.)
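
For anyone wanting to set up the same kind of capture, a loop like this
could be sketched roughly as follows. This is only an illustration in
Python, not the script actually used here: the log path, the interval and
the rename to ".old" at the size limit are all assumptions, and the only
external command it relies on is plain "lsof" with no options. The fsync
after each snapshot is there so the last entries have a chance of reaching
the disk before a sudden reboot.

```python
import os
import subprocess
import time


def rotate_if_large(log_path, max_bytes):
    """Rename the log to log_path + '.old' once it reaches max_bytes."""
    if os.path.exists(log_path) and os.path.getsize(log_path) >= max_bytes:
        os.replace(log_path, log_path + ".old")
        return True
    return False


def snapshot_loop(log_path, command=("lsof",), max_bytes=200 * 1024 * 1024,
                  interval=1.0, iterations=None):
    """Append the output of `command` to log_path over and over.

    iterations=None loops forever; a number is handy for a dry run.
    Returns the number of snapshots written.
    """
    done = 0
    while iterations is None or done < iterations:
        rotate_if_large(log_path, max_bytes)
        result = subprocess.run(list(command), stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        with open(log_path, "ab") as log:
            # Timestamp header so each snapshot can be told apart later.
            log.write(("=== %s ===\n" % time.ctime()).encode())
            log.write(result.stdout)
            log.flush()
            os.fsync(log.fileno())  # force the snapshot out to disk
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval)
    return done


if __name__ == "__main__":
    # Hypothetical log location; pick a filesystem with room for ~400MB.
    snapshot_loop("/var/log/lsof-loop.log")
```

After a reboot, the tail of the log (or of the ".old" copy, if the rename
had just happened) holds the last snapshots taken before the crash.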

I am at a loss as to what else to try, other than rebuilding the system from
scratch, and that is no assurance of a fix. The one thing unique here is
that it is the mail
server and runs spamd (spamassassin-2.55), spamass-milter-2.0 (which has a
lock file) and procmail-3.22 (which does a lot of locking).

I am suspicious of the locking going on with the above spam-fighting
programs, which may clash when an SSH login & su occurs. I believe a lock is
required for that too...?

I would appreciate anyone's time and effort to look at these files and see
if anything is spotted that I don't see. The most recent is 6-lsof.txt, from
just this morning, and the numbering works backwards from there.

http://sageweb/tmp/1-lsof.txt
http://sageweb/tmp/2-lsof.txt
http://sageweb/tmp/3-lsof.txt
http://sageweb/tmp/4-lsof.txt
http://sageweb/tmp/5-lsof.txt
http://sageweb/tmp/6-lsof.txt
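
If it helps anyone doing the comparison, one way to look for files common
to all six samples is sketched below. It is only a rough illustration and
rests on assumptions: that each line follows the default lsof layout with
the command name as the first field and the file name as the last, and
that names contain no spaces. The script name in the usage line is made up.

```python
from collections import Counter


def parse_lsof_lines(lines):
    """Return the set of (command, name) pairs seen in lsof output lines."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] == "COMMAND":
            continue  # skip blank lines, fragments and the header row
        pairs.add((fields[0], fields[-1]))
    return pairs


def common_entries(snapshots):
    """Given a list of sets, return the entries present in every one."""
    if not snapshots:
        return []
    counts = Counter()
    for pairs in snapshots:
        counts.update(pairs)
    return sorted(p for p, n in counts.items() if n == len(snapshots))


if __name__ == "__main__":
    import sys
    snapshots = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            snapshots.append(parse_lsof_lines(f))
    for command, name in common_entries(snapshots):
        print(command, name)
```

It would be run as something like "python compare_lsof.py 1-lsof.txt ...
6-lsof.txt". Most of what it prints will be files that are open all the
time anyway (shared libraries, daemon log files), so the entries of
interest would be lock files or mailboxes that show up in every sample.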

Much obliged!

Best regards,
Jack L. Stone,
Administrator

SageOne Net
http://www.sage-one.net
jackstone at sage-one.net


More information about the freebsd-questions mailing list