6.1-STABLE; Fatal trap 12: page fault while in kernel mode; kgdb isn't working??!?

Wed May 31 20:10:19 PDT 2006

David Wolfskill wrote:

> In testing a vendor's product, I managed (as I had been warned might
> happen) to crash the machine on which the product was running.
> 
> It's a moderately-recent 6.1-STABLE:
> 
> mx-out05# uname -a
> FreeBSD mx-out05.lab.example.org 6.1-STABLE FreeBSD 6.1-STABLE #3: Sun May  7 10:06:44 PDT 2006     dhw at mx-out05.lab.example.org:/usr/obj/usr/src/sys/SMP_PAE  i386
> mx-out05# 
> 
> Hardware-wise, it's a dual 3 GHz Xeon box with 4 GB RAM.
> 
> In case it's relevant:
> 
> mx-out05# mount; df; swapinfo
> /dev/aacd0s2a on / (ufs, local, soft-updates)
> devfs on /dev (devfs, local)
> /dev/aacd0s2d on /usr (ufs, local, soft-updates)
> /dev/aacd0s3d on /home (ufs, local, soft-updates)
> /dev/aacd0s3e on /var (ufs, local, soft-updates)
> /dev/aacd1s1d on /var/spool (ufs, local, noatime)
> devfs on /var/named/dev (devfs, local)
> /dev/md0 on /tmp (ufs, local, soft-updates)
> Filesystem    1K-blocks    Used    Avail Capacity  Mounted on
> /dev/aacd0s2a    507630   37008   430012     8%    /
> devfs                 1       1        0   100%    /dev
> /dev/aacd0s2d   2280880 1676226   422184    80%    /usr
> /dev/aacd0s3d   5077038   50950  4619926     1%    /home
> /dev/aacd0s3e   7270492  949650  5739204    14%    /var
> /dev/aacd1s1d  34678048   14136 31889670     0%    /var/spool
> devfs                 1       1        0   100%    /var/named/dev
> /dev/md0        9159102      16  8426358     0%    /tmp
> Device          1K-blocks     Used    Avail Capacity
> /dev/aacd0s3b    16777216        0 16777216     0%
> mx-out05# 
> 
> Yes, swap is ridiculously huge (but note that /tmp is swap-backed).
> So are a few other allocations (huge, that is); in general, I prefer
> to avoid exhausting resources.  :-}
> 
> The crash appears to be quite reproducible by using
> ports/benchmarks/postal.  It's fairly likely that I need to configure
> some resource-consumption constraints so the application doesn't go
> completely berserk.  I note that running postal using the same
> parameters against a similar box running Postfix just chugs along, no
> problem at all.
> 
> Here's a typical complaint as extracted from /var/log/messages:
> 
> May 31 16:02:13 mx-out05 kernel: Fatal trap 12: page fault while in kernel mode
> May 31 16:02:13 mx-out05 kernel: cpuid = 0; apic id = 00
> May 31 16:02:13 mx-out05 kernel: fault virtual address   = 0x0
> May 31 16:02:13 mx-out05 kernel: fault code             = supervisor read, page not present
> May 31 16:02:13 mx-out05 kernel: instruction pointer    = 0x20:0x0
> May 31 16:02:13 mx-out05 kernel: stack pointer          = 0x28:0xf06f8b98
> May 31 16:02:13 mx-out05 kernel: frame pointer          = 0x28:0xf06f8bcc
> May 31 16:02:13 mx-out05 kernel: code segment           = base 0x0, limit 0xf
> May 31 16:02:13 mx-out05 kernel: f
> 
> 
> I did manage to set things up to get a kernel crash dump, and I'm about
> as certain as I can be that the kernel, userland, and crash dump are all
> in sync.
> 
> Still, when I
> 
> cd /usr/obj/usr/src/sys/SMP_PAE/ && kgdb kernel.debug /var/crash/vmcore.0
> 
> I get a repeating:
> kgdb: kvm_read: invalid address (0xc9ff5624)
> kgdb: kvm_read: invalid address (0xc9ff8600)
> kgdb: kvm_read: invalid address (0xc9ff5624)
> kgdb: kvm_read: invalid address (0xc9ff8600)
> 
> The pattern repeats until I interrupt it.
> 
> Now, this box is in a lab; it is for testing (at this time), so I have
> rather more flexibility than I might for a production system.  The
> product was built for FreeBSD 5.x; I have the ports/misc/compat-5x port
> installed, and the product does run -- at least, until I start
> stress-testing it.  :-}
> 
> I could bring the box up to a more recent -STABLE fairly easily; for that
> matter, I could probably bring it up to -CURRENT fairly easily, but I
> have no intent to be running a production service on -CURRENT.  (My
> laptop?  Sometimes.  A production box in a colo?  Uhh... maybe I'm just
> not sufficiently daring, but no thanks.  :-})
> 
> I'd appreciate suggestions (or pointers to same) as to how I might
> proceed to determine what I can do to get the product to run reliably
> in a FreeBSD environment.  (The vendor has suggested either Red Hat or
> Suse Linux as more stable platforms, and has complained about an
> inability to get debugging information from FreeBSD.  I have pointed out
> that there's been some progress of late on getting DTrace ported to
> FreeBSD, and they've seemed at least somewhat interested, but in the
> mean time....)
> 
> Anyway, I'll plan on summarizing off-list responses that are relevant.
> 
> Thanks!
> 
> Peace,
> david

kgdb seems to be more broken than not.  Could you enable KDB+DDB and at
least get a stack trace from the fault?

Scott
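
For anyone following up on the KDB+DDB suggestion, a minimal sketch of the
kernel configuration change might look like the fragment below.  It assumes
the config file is /usr/src/sys/i386/conf/SMP_PAE (inferred from the uname
output above, not confirmed in the thread); KDB_TRACE is an optional extra
that the thread does not mention.

```
# Additions to the kernel config (assumed path: /usr/src/sys/i386/conf/SMP_PAE)
options 	KDB		# kernel debugger framework
options 	DDB		# interactive in-kernel debugger backend
options 	KDB_TRACE	# optional: print a stack trace when the kernel panics
```

After rebuilding and booting the new kernel, the fault should drop to a
"db>" prompt on the console instead of rebooting.  From there, "trace"
prints the stack trace, and "call doadump" can be tried to force a crash
dump before resetting the machine.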