8.0RC2 amd64 - kernel panic running make buildworld
Kai Gallasch
gallasch at free.de
Wed Nov 4 00:17:36 UTC 2009
On Tue, 03 Nov 2009 10:42:40 +0000,
Gavin Atkinson <gavin at FreeBSD.org> wrote:
> On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> > Hi.
> >
> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> >
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs. The same happens
> > when building bigger ports (for example ruby18 or perl58)
> First place I think I'd start is by running memtest86 on the machine
> overnight. This sounds like a possible hardware issue to me; it would
> be good to see if we can confirm that that is the case.
I will do so tomorrow. I have already taken the following actions to
rule out a hardware problem:
- ran several passes with diagnostic software from the manufacturer
- reset BIOS settings to default
- upgraded BIOS to newest release
- booted server from 2 year old backup BIOS
- took out the only pair of RAM modules that was different from the
rest of the modules
- installed FreeBSD 7.2-STABLE on the server to try to reproduce the
kernel panic (no panic with 7.2)
- installed 8.0-BETA4 (crash)
Besides, the server was in production with 7.2 for some time without
showing any such problems.
> > Now my idea was to install the old 8.0-BETA4 and upgrade to RC2
> > through makeworld + buildkernel (gdb+witness). But no luck. When
> > trying to upgrade to RC2 the 8.0-BETA4 also crashes. At least
> > 8.0-BETA4 has debug + witness active in the GENERIC kernel.
> >
> > So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> > problem with the BETA4 already been fixed?
>
> The debug output you included was just lock order reversals, and
> doesn't seem to be related to your crash.
Sorry for causing possible confusion about this. I realized this after
my mail was already out.
> I think 8.0-BETA4 still had the debugger compiled in (you can test by
> pressing ctrl-alt-escape on the console, if you do drop to the
> debugger, give the "c" command to continue).
>
> If the debugger is compiled in, then the spontaneous reboot without
> dropping to the debugger suggests even more that it may be hardware
> related. If you do get to the debugger, a copy of all of the messages
> on screen and the output of the "bt" command would be very useful.
> When you do your kernel recompile, please include full debugging,
> including WITNESS, INVARIANTS, KDB, DDB etc.
In the meantime I managed to install a RELENG_8 world + GENERIC
kernel with all debug options enabled on the crashing server. (I mounted
/usr/src and /usr/obj on another server running 8.0RC1 through NFS and
did buildworld + buildkernel over there.)
So now I have a debug kernel available with dumpdev + dumpdir defined.
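For reference, crash dumps on FreeBSD are normally configured through two
rc.conf variables. A minimal sketch follows; the values shown are
assumptions (commonly used settings), not the ones actually set on this
server:

```shell
# /etc/rc.conf fragment (assumed values, not taken from this server)
dumpdev="AUTO"        # let the system pick a dump device, normally a swap device
dumpdir="/var/crash"  # directory where savecore(8) stores the dump after reboot
```

With settings like these, a panic writes the dump to the dump device and
savecore(8) copies it into dumpdir on the next boot.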
Here are my latest findings on this issue:
- In about 80% of the runs, makeworld crashes the server without a
crashdump being written to dumpdir. The server just reboots.
- In about 20% of the runs, makeworld gets stuck in a process that never
terminates and eats up 100% CPU. This process cannot be killed. When
I restart makeworld, the server reboots again.
- It makes no difference whether I run makeworld with -j1 or -j8; the
result is the same.
> It depends what the bug is to be honest. So far there isn't really
> enough information to determine the cause, and therefore there isn't
> really enough info for a PR.
Mark Atkinson also commented on my mail and gave this hint:
"If vm.pmap.pg_ps_enabled is 1 in 8.0-rc2, you might try rebooting
with vm.pmap.pg_ps_enabled=0 in /boot/loader.conf and try another
buildworld."
So I thought why not and just tried it, and surprise:
Setting vm.pmap.pg_ps_enabled=0 in loader.conf resolves my problem
with 8.0RC2 crashing when doing a makeworld.
After a successful buildworld and buildkernel I rebooted the server
with vm.pmap.pg_ps_enabled=0 commented out, and the problem was there
again. Then I disabled the option again in loader.conf, rebooted, and
ran make buildworld: no problem.
This seems to be deterministic: with vm.pmap.pg_ps_enabled=1 the server
crashes without being able to write crashdumps to dumpdev (at least on
this specific ProLiant DL385 G2 server).
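The workaround itself is a single line in /boot/loader.conf. As a rough
sketch, the helper below sets a tunable in a loader.conf-style file,
replacing an existing or commented-out entry instead of appending a
duplicate. The function name set_loader_tunable and its arguments are
hypothetical, made up for illustration only:

```shell
#!/bin/sh
# set_loader_tunable FILE NAME VALUE  (hypothetical helper, not a FreeBSD tool)
# Rewrites every existing NAME=... line in FILE (commented out or not)
# to NAME="VALUE", or appends such a line if none is present.
set_loader_tunable() {
    file=$1
    name=$2
    value=$3
    # Escape dots so the tunable name is matched literally by the regex.
    pattern=$(printf '%s' "$name" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*#?[[:space:]]*${pattern}=" "$file" 2>/dev/null; then
        # An entry exists: rewrite it in a temporary copy, then move it into place.
        sed -E "s|^[[:space:]]*#?[[:space:]]*${pattern}=.*|${name}=\"${value}\"|" \
            "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        # No entry yet: append one.
        printf '%s="%s"\n' "$name" "$value" >> "$file"
    fi
}
```

Called as `set_loader_tunable /boot/loader.conf vm.pmap.pg_ps_enabled 0`,
it leaves the line vm.pmap.pg_ps_enabled="0" in the file. The setting only
takes effect at the next boot.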
--Kai.
--
You need more time; and you probably always will.
More information about the freebsd-current
mailing list