Crashes with Promise controller

Sat Jun 18 17:52:19 UTC 2011

On Sat, Jun 18, 2011 at 06:49:41PM +0200, Stefan Bethke wrote:
> On 13.06.2011 at 16:22, Christian Baer wrote:
> 
> > I have to slightly explain the word "crash" here: I don't actually have
> > to hard reset the system myself. My box just does a reboot by itself. No
> > filesystem is unmounted cleanly and because the machine isn't really new
> > and powerful fsck takes pretty long.
> 
> I can't help you with your controllers, but anyone in a position to
> help will likely want to know if the box simply resets, or if the
> kernel panics.  And if there are going to be any patches, you most
> certainly will want to get familiar with the debugger to help try
> stuff out.  The handbook has information on how to enable crash dumps
> and getting the kernel debugger going.  If you haven't done so
> already, try and get a serial console going, it helps tremendously to
> be able to cut&paste debugger info instead of trying to hand
> transcribe it.

It may be that the kernel is panicking and auto-rebooting before he can
see the message in question.  I would advocate he put the following
directives in his kernel configuration, rebuild/reinstall the kernel,
and wait for it to happen again.

# Debugging options
options		KDB			# Enable kernel debugger support
options		KDB_TRACE		# Print stack trace automatically on panic
options		DDB			# Support DDB
options		GDB			# Support remote GDB

If after doing this the machine still resets outright, with no panic
message and no drop into the debugger, then that would indicate a
mainboard having issues, or something power-related (keep reading).

As for the behaviour he describes -- this sort of problem can sometimes
turn out to be PSU-load-related (too many drives on a PSU that can't
handle it on a single rail), bad/improper voltages (difficult to track
down given the state of hardware monitoring on mainboards and on
FreeBSD), or "dirty power" / excessive ripple.  Power-related problems
on computers almost always show up as random, abrupt failures that are
usually made worse by heavy system utilisation.  I have no proof
this is Christian's problem, but it's worth considering anyway.

One might be able to detect ("log") potential power loss by looking at
SMART attribute 12 (Power_Cycle_Count) on mechanical HDDs in the system;
if the RAW_VALUE increases after an unexpected reboot, then power is
being lost to the drives.  If not, then it may be a soft reset.  I use
the word "may" because sometimes a very quick brown-out won't cause the
drives to actually "power down" fully (i.e. the attribute never gets
incremented) but the loss of power can be just enough to cause them to
start freaking out.
Computers + power issues = expect random chaos.
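
For what it's worth, the attribute 12 comparison could be scripted
roughly as follows.  This is only a sketch, not something I have run on
Christian's box: it assumes smartmontools is installed, that the drive
prints the usual ten-column "smartctl -A" attribute table, and that the
raw value for attribute 12 is a plain integer.  The function names and
the device path in the usage note are made up for illustration.

```python
import subprocess

POWER_CYCLE_ATTR_ID = "12"  # SMART attribute 12 (Power_Cycle_Count)


def parse_power_cycle_count(smartctl_output):
    """Return the RAW_VALUE of SMART attribute 12, or None if not found."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed row layout: ID#, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw_value (ten columns).
        if len(fields) >= 10 and fields[0] == POWER_CYCLE_ATTR_ID:
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value was not a plain integer
    return None


def read_power_cycle_count(device):
    """Run `smartctl -A` on a device (needs root) and parse attribute 12."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_power_cycle_count(result.stdout)


def power_was_lost(before, after):
    """True only if both readings exist and the counter went up."""
    return before is not None and after is not None and after > before
```

Record read_power_cycle_count("/dev/ada0") (substitute the real device)
somewhere that survives a reboot, then compare it with a fresh reading
after the next incident.  An unchanged value does not rule out a
brown-out, for the reason given above.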

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |