6.0 random freezes

Mon Dec 12 18:49:50 PST 2005

Peter Jeremy said the following on 12/12/05 13:40:
> 
> Define "freezing":  Does it respond to pings?  Can you switch VTYs?
> Do the num-lock/caps-lock LEDs respond?  Do some processes seem to
> freeze before others?
> 
I used the word "freeze" instead of "crash", because the latter often
gets associated with some errors reported by the kernel in system logs
or on the console. In this case there are absolutely no error messages. 
I also have remote logging enabled (to another machine over the
network), but nothing shows up there either.

When it happens, the server appears to respond to pings for the first
few minutes, but then everything goes down and stays down until I go to
the data center.

When I plug in a keyboard, there's no response at all - no LEDs, no
VTYs, no Ctrl-Alt-Esc, etc. You might suspect "hint.atkbd.0.flags" not
being set properly, but it's right (i.e. unchanged; it appears to
default to that on i386 5.x+) and other machines with an identical
configuration do accept a keyboard.

I have no information about processes. The only thing I have is a
real-time CPU load graph. I have a script tailing the output of a
"vmstat cpu 15" and drawing a graph with user/system/idle times.
According to that graph, there are no load spikes or unusual variations
before the crashes. The usual user/system/idle percentages look like
10/7/83.
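
For illustration only, the parsing side of such a graphing script could
be sketched roughly as below. This is not the actual script. It assumes
that the last three columns of each vmstat data line are the
user/system/idle percentages, and the function names, the baseline and
the tolerance are all hypothetical.

```python
def parse_cpu_columns(line):
    """Return (user, system, idle) from one vmstat data line.

    Assumption: the last three whitespace-separated fields are the
    us/sy/id percentages. Header lines are not numeric, so they
    yield None.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        user, system, idle = (int(f) for f in fields[-3:])
    except ValueError:
        return None
    return user, system, idle


def is_unusual(sample, baseline=(10, 7, 83), tolerance=20):
    """Flag a sample whose idle percentage is far from the baseline.

    The baseline is the usual 10/7/83 split; the tolerance is an
    arbitrary illustrative value.
    """
    return abs(sample[2] - baseline[2]) > tolerance
```

A sample close to 10/7/83 would not be flagged, which is all the graph
has shown so far before each freeze.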

> I suggest you add the following to your kernel config:
>  options         KDB                     # Enable kernel debugger support.
>  options         DDB                     # Support DDB.
> 
I just set these along with the DEBUG option below, and got the new
kernel (from 6.0-RELEASE sources dated Dec 9) running on both machines,
so we'll see.

> When it hangs, break into DDB (Ctrl-Alt-Esc on the console or BREAK on
> a serial console).
> 
> As a start, run 'show lockedvnods' and 'ps'.  My guess is that you'll
> see a lock that has a number of waiters - which is probably the
> culprit.  Use 'panic' or 'call doadump' to get a crashdump and then
> you can use kgdb to rummage around once you reboot - see
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebg-gdb.html
> 
I don't have any experience in chasing kernel bugs, so I'm not sure
whether I would be able to get something useful, but I'll try that on 
the next crash. But if there is no keyboard response, I won't be able to
break into DDB or save a dump, right?

I do not know what a serial console is and would need some time to get
familiar with it. Would it give me anything beyond what I can get from
the standard console?

>>< makeoptions   DEBUG=-g             # Build kernel with gdb(1) debug symbols
> 
> I suggest you add this back in.  Without it, you can't debug any crash
> dumps that you manage to get (and add "dumpdev" to your rc.conf).
> 
My bad. I realized that it's harmless, but only weeks after I had put
the box into production. It's back in there now.

The "dumpdev" variable seems to default to AUTO, i.e. it tries to use
the first swap device if that is bigger than the RAM (in my case it is),
so I guess I don't need to touch it.

> Whilst I realise that you can't have production machines freezing on
> schedule, your assistance in providing more information about your
> problem will help make 6.x more stable.
> 
Yes, I know and I will try. Today I already had a couple of crashes
(got lucky, no nasty data corruptions this time), and I cannot afford 
this to continue.

I'm already working on the downgrade, but most likely I will have at 
least one of these 2 machines still running 6.x during the next day or two.

After the downgrade we could possibly set up a test bed and start
hammering it with requests. The problem would be how to trigger the
crash and whether we would be able to reproduce it at all.

Thanks for the prompt reply!

Regards,
Atanas