nfs-server silent data corruption

Arno J. Klaassen arno at heho.snv.jussieu.fr
Mon Apr 21 21:46:58 UTC 2008


re,


Jeremy Chadwick <koitsu at freebsd.org> writes:

> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway <kris at FreeBSD.ORG> writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.
> > 
> > yop, though I'm still not convinced the memory is bad (the very same
> > Kingston ECC as the 2*1G in use for about half a year already) :
> 
> Can you download and run memtest86 on this system, with the added 2G ECC
> installed?  memtest86 doesn't guarantee showing signs of memory problems,
> but in most cases it'll start spewing errors almost immediately.


it finished in a bit less than 3 hours without a single error or warning.

I feel pretty confident all memory is fine.
 
> One thing I did notice in the motherboard manual below is something
> called "Hammer Configuration".  It appears to default to 800MHz, but
> there's an "Auto" choice.  Does using Auto fix anything?

Nope

> > I added it directly to the 2nd CPU (diagram on page 9 of
> >  http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> > seems to be the interaction between nfe0 and powerd .... :
> 
> That board is the weirdest thing I've seen in years.


;) I agree, I raised my eyebrows the first time I saw that
diagram


> Two separate CPUs using a single (shared) memory controller, two
> separate (and different!) nVidia chipsets, a SMSC I/O controller
> probably used for serial and parallel I/O, two separate nVidia NICs with
> Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
> separate PCI-e busses (each associated with a separate nVidia chipset),
> two separate PCI-X busses... the list continues.

some may say "it's just four wheels, an engine and a steering wheel",
but she looks different from most others
 

> I know you don't need opinions at this point, but what a behemoth.  I
> can't imagine that thing running reliably.

though it does ;) (till the day I decided she deserved a -stable upgrade
and 2 more gigs ...)
 
> >  - if I stop powerd, problems go away
> 
> This would imply that clock frequency stepping is somehow contributing
> to the corruption.  I don't see any BIOS options for controlling
> things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
> usually what handles this.

you can turn it on/off; anyway, the problem *seems* easy to reproduce
when the frequency drops quickly from 2600MHz to 1000MHz ....
I have only inspected a few corrupted copies, but out of 10-200 Mbytes
just 1 byte was 0 instead of \t
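
For locating single-byte flips like this, a minimal sketch in Python that compares a known-good file with the copy read back from the server and reports every differing offset. The function name `diff_bytes` and the two command-line paths are illustrative assumptions, not something used in this thread.

```python
import sys


def diff_bytes(path_a, path_b, chunk_size=1 << 20):
    """Yield (offset, byte_a, byte_b) for each position where the files differ.

    If one file is shorter, a final (offset, None, None) marks where it ends.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                return
            common = min(len(a), len(b))
            # Only walk the chunk byte by byte when it actually differs.
            if a[:common] != b[:common]:
                for i in range(common):
                    if a[i] != b[i]:
                        yield offset + i, a[i], b[i]
            if len(a) != len(b):
                # Length mismatch: report where the shorter file ends and stop.
                yield offset + common, None, None
                return
            offset += common


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python diff_bytes.py original.dat copy_from_nfs.dat
    for off, x, y in diff_bytes(sys.argv[1], sys.argv[2]):
        if x is None:
            print(f"offset {off}: files differ in length")
        else:
            print(f"offset {off}: 0x{x:02x} -> 0x{y:02x}")
```

A tab replaced by a zero byte would show up as `0x09 -> 0x00` at the affected offset, which makes it easy to check whether the corruption always hits the same byte value or the same alignment.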

> >  - if I let powerd run but turn off txcsum and tso4 on the interface,
> >    the problem is a lot harder to reproduce (in case this gives
> >    a hint to anyone)
> 
> Possibly shared interrupts are causing problems?


don't think so; I first had two Promise TX4 cards in this box instead of
the Marvell 8-port card; since I had problems with the TX4 some time
ago I first suspected them. The board is still running memtest86, but
from the dmesg I posted I don't see a shared irq.

>  MSI/MSI-X doing
> something odd?  Have you tried disabling MSI/MSI-X and see if it makes a
> difference?


MSI is disabled, as is PCI-e error reporting (or something like
that)

> 
> I think you mean "MAC LAN Bridge", according to the motherboard manual.
> I'm not even sure what that really does; somehow trunks the two NICs
> together to give you the equivalent of 2000mbit of traffic?  I don't
> know.

probably; I never tried ;) I need the second NIC for a separate
subnet
 
> Does the corruption you see go away if you install a separate NIC (e.g.
> an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
> (should be "MAC LAN: Disable" on both the primary and slave)?

Don't have one available right now (for a 2U server).
I will test if I do not find another solution.

Thanx, Arno


More information about the freebsd-stable mailing list