FreeBSD 5.3-[RELEASE-p1|STABLE] SMP crashes

Doug White dwhite at gumbysoft.com
Sun Nov 28 23:09:10 PST 2004


Pruning -smp crosspost since I'm not on that list.  To: address updated
accordingly.

On Sat, 20 Nov 2004, Oliver Hartmann wrote:

> First, please do not reply on this address, your reply will never reach
> me. Please contact me at ohartman at web.de. I can not post into this
> newsgroup via web.de due to SPAM exclusion of several web.de hosts.
>
> As I have reported very often in the past, I still have massive problems
> with SMP enabled on a FreeBSD 5.3-RELEASE-p1 __and__ FreeBSD 5.3-STABLE box.
> The crash is always of the same type: I can 'watch' how the machine
> freezes, and for some lucky moments I am able to switch to the console
> before the box dies for good and see what error message comes up.

The panic caught below appears to have dropped you into ddb.  Could you
run 'tr' and post the output along with the panic output next time you
trigger this?

> This machine is an ASUS CUR-DLS mainboard, utilizing the RCC ServerWorks
> chipset, version 3 for Pentium 3 CPUs. At this moment I use two Intel
> 1GHz CPUs of the same stepping, but prior to this error report I used
> two CPUs with 866 MHz and of different steppings, but it seems to make
> no difference.
>
> I also tried a lot of kernel options, especially those which are
> supposed to be critical (means: I switched them off) and I used a
> GENERIC kernel for a while, but it makes no difference. The crash occurs
> while using a graphical console, Xorg X11 (version 4.7.0 as compiled
> from the ports), fvwm2 (development version, but crash occurs also with
> windowmaker so the GUI seems not to be an issue). I also tried to fix
> the problem by using built in fxp-NIC instead of the 64Bit Intel GBit
> LAN adapter (em0), but it is always the same.

What are you using for disk?  Are you using the built-in ATA controller?

> I will append a mptable -verbose -dmesg output for your information and
> I will add the error message I receive. Most of the time when the crash
> occurs I was putting a lot of graphical load on the machine (working on
> several TIFF files 200MB in size or with Mozilla/FireFox), but this may
> simply trigger or speed up the problem.

Are these operations compute-intensive or I/O-intensive, i.e. are they CPU
bound or I/O bound?

> Sometimes I can not get a 'systat -vmstat 1' output, calling vmstat in
> systat results in 'Alternate system clock has died. Reverting to
> ''pigs'' ...'. This happens very often in SMP, but not in UP.

That's not good.  That may indicate interrupt routing problems, and ASUS is
traditionally bad at writing ACPI code.  You may try disabling ACPI if you
haven't already.
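
For reference, a minimal sketch of disabling ACPI, assuming the stock
loader hint described in acpi(4): add this to /boot/loader.conf and
reboot:

hint.acpi.0.disabled="1"

Remove the line again to re-enable ACPI if it makes no difference.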

> I will add that the UP system (SMP disabled by kern.smp.disable='1' in
> loader.conf) was up for nearly 13 days under the same conditions, whereas
> an SMP box crashes after several minutes to several hours.

Good to know.

> This is the last console error I received:
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 00
> fault virtual address  = 0x1c
> fault code  =  supervisor write, page not present
> instruction pointer  =  0x8:0xc062ac76
> stack pointer  =  0x10:0x4e2d7ac
> frame pointer  =  0x10:0xe4e2d7c4
> code segment  = base 0x0, limit 0xfffff, type 0x1b
>               = DPL 0, pres 1, def32 1, gran 1
> processor eflags  = interrupt enabled, resume, IOPL = 0
> current process = 44 (swi5: clock sio)
> [thread 100042]
> Stopped at      vref +0x16: lock cmpxchgl %edx, 0x1c(%edx)

Hm, null vnode reference: the fault address 0x1c is a small offset from a
NULL pointer.  vref() just increments the usecount on a vnode, but it's
surrounded by mutex operations on that vnode which use that particular
instruction.  Considering that the releases in question are not known to
have these types of problems, I'd say we're looking at a hardware problem.

> What is 'swi5: clock sio'? Is this problem hardware related? Why only in
> SMP? Others seem not to have problems with 5.3 and SMP, maybe this is
> very specific to me due to the RCC based mainboard I use (in the past I
> had a lot of problems with a TYAN 2500 mobo also based on ServerWorks
> chipset in conjunction with FreeBSD 4/5).

I've run Linux on that series of Tyan board (2510 and the later 2518) with
only one problem -- the onboard ATA controller is known to cause data
corruption and should not be used under any circumstances.  The 2518 ships
with an onboard Promise that works. If you are using the onboard ATA
controller I strongly suggest using some other disk interface.

A temporary workaround would be to turn off DMA mode on your disks by
adding this to /boot/loader.conf and rebooting:

hw.ata.ata_dma="0"

This will of course cause a huge performance impact, particularly if your
work is I/O bound.

I'd also check for the usual hardware suspects -- cooling problems
(insufficient heatsinks, broken fans, poorly designed airflow, etc.),
overclocking, bad or incorrect memory, bad processor, bad motherboard.

-- 
Doug White                    |  FreeBSD: The Power to Serve
dwhite at gumbysoft.com          |  www.FreeBSD.org

