Constant minor ZFS corruption

Fri Mar 11 00:18:00 UTC 2011

On Fri, Mar 11, 2011 at 09:19:32AM +1000, Stephen McKay wrote:
> On Thursday, 10th March 2011, Chris Forgeron wrote:
> >Lastly, check what Mike Tancsa said about his hardware - All of my
> >gear is quality,  1000W dual redundant power supplies, LSI SAS
> >controllers, ECC registered ram, no overclocking, etc, etc.  You may
> >have a software issue, but it's more likely that ZFS is just exposing
> >some instability in your system. Has your RAM checked out with a Memtest
> >run overnight? We're talking small, intermittent errors here, not big
> >red flags that will be obvious to spot.
> 
> The ASUS PIKE2008 card is LSI based.  Our RAM is ECC.  We're not
> overclocking (in fact I disabled turbo-boost).  We haven't run memtest
> but we have done a few "make buildworld" runs.  All of these completed
> without error.  And with ECC RAM, we should see log messages if anything
> is wrong there anyway.

Specifically with regards to your last sentence: you're making blind
assumptions here.  Let me talk a bit about how ECC RAM errors are
reported to the motherboard and how all of that works.

(Also -- calling John Baldwin to come in here and correct me if I'm
wrong, because over the years I've had to piece all of this together
myself, and I could obviously have parts wrong.  :-) )

When correctable or uncorrectable bit errors (single-bit or multi-bit)
are witnessed on ECC RAM, the memory controller can (doesn't have to!)
throw, on the PCI bus, what's called a PERR or SERR signal.  The BIOS
controls this capability, and what
PERR/SERR can get turned into.  Some BIOSes permit you to tie these
signals to an interrupt (usually some form of NMI).  The operating
system's kernel has to be written to understand this NMI and handle it
appropriately.  So you have the following pieces that are required for
the OS to report an ECC error:

1) Use of ECC RAM,
2) A memory controller on your motherboard (or within the CPU, as with
   the on-die MCH on newer Core iX CPUs and some Xeons) that supports
   throwing PERR# and SERR# signals,
3) A BIOS that can set up an NMI generation on PERR or SERR,
4) An operating system that knows how to handle that NMI.
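
To make pieces 2 and 3 a little more concrete: every PCI device has a
16-bit command register and a 16-bit status register in its
configuration header, and a handful of standard bits in them govern and
record PERR#/SERR#.  The following is only a minimal Python sketch of
decoding those bits.  The function name and the sample register values
are made up for illustration, and it assumes you have already obtained
the two register values by some other means.

```python
# Bit positions from the standard PCI configuration header.
CMD_PARITY_ERROR_RESPONSE = 1 << 6   # command: act on detected parity errors (PERR#)
CMD_SERR_ENABLE = 1 << 8             # command: device is allowed to assert SERR#

STS_MASTER_DATA_PARITY_ERROR = 1 << 8
STS_SIGNALED_SYSTEM_ERROR = 1 << 14  # status: device asserted SERR#
STS_DETECTED_PARITY_ERROR = 1 << 15  # status: device saw a parity error


def describe_pci_error_bits(command, status):
    """Return booleans for the error-related bits of the 16-bit PCI
    command and status registers (hypothetical helper, for illustration)."""
    return {
        "perr_reporting_enabled": bool(command & CMD_PARITY_ERROR_RESPONSE),
        "serr_reporting_enabled": bool(command & CMD_SERR_ENABLE),
        "master_data_parity_error": bool(status & STS_MASTER_DATA_PARITY_ERROR),
        "signaled_system_error": bool(status & STS_SIGNALED_SYSTEM_ERROR),
        "detected_parity_error": bool(status & STS_DETECTED_PARITY_ERROR),
    }


if __name__ == "__main__":
    # Hypothetical register values, not taken from any real machine.
    for name, value in describe_pci_error_bits(0x0146, 0x8010).items():
        print(f"{name:26} {value}")
```

Note that these bits only say whether a device is permitted to raise the
signal and whether it has done so.  They say nothing about whether the
BIOS routes the signal to an NMI, or whether the OS handles that NMI.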

There are a LOT of motherboards out there which "support ECC", but what
they mean to say is "our board works with ECC RAM, but if there's
uncorrected bit errors we didn't implement any mechanisms to tell the
underlying OS, lolz".  Lots of consumer-grade boards that claim to work
with either ECC or non-ECC RAM do this.  You won't find the BIOS tweaks
in there, and Technical Support will just tell you "yes board X works
with ECC".  Lovely situation.

Does FreeBSD support the above?  I have absolutely no idea.  The only
systems I've used which can generate an NMI on PERR or SERR are Tyan
boards (we use them at work), and all those systems run Solaris.
Solaris also has really good MCA support -- more on that next.

Now, there's also another possibility/mechanism, which is MCA (Machine
Check Architecture).

MCA is implemented by the processor itself and covers a vast range of
hardware events (some minor, some major).  MCA will generate an MCE
(Machine Check Exception) when there's any sort of memory error, among
other things.  The OS has to have support for handling MCA, and also has to
provide decent details of the MCE.  Decoding MCEs is tricky, especially
on FreeBSD.  John Baldwin has made some patches for getting Linux's
mcelog working -- well, the log parsing part -- on FreeBSD (but they're
slightly out of date; I can provide more recent patches if need be).
Don't expect direct DMI to work on FreeBSD with mcelog, for example.

So with this situation we now have:

1) CPU has to support MCA,
2) OS has to support MCA and know how to decode MCEs properly,
3) Utilities to decode MCEs correctly.
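
As a rough illustration of what "decoding an MCE" starts with: each
machine-check bank has a 64-bit status register (IA32_MCi_STATUS on
Intel), and its top bits plus the low 16-bit error code are
architecturally defined.  Below is a minimal Python sketch of pulling
those fields apart.  The function names and the sample value are
hypothetical, and the sketch deliberately stops short of interpreting
the model-specific fields, which is the part that tools like mcelog
exist to handle.

```python
def decode_mci_status(status):
    """Decode the architectural fields of a 64-bit IA32_MCi_STATUS value
    (hypothetical helper, for illustration)."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),               # VAL: register holds a logged error
        "overflow": bit(62),            # OVER: an earlier error was overwritten
        "uncorrected": bit(61),         # UC: error was not corrected
        "reporting_enabled": bit(60),   # EN
        "misc_valid": bit(59),          # MISCV
        "addr_valid": bit(58),          # ADDRV
        "context_corrupt": bit(57),     # PCC: processor context corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def summarize(status):
    """One-line summary: corrected vs. uncorrected, plus the MCA code."""
    fields = decode_mci_status(status)
    if not fields["valid"]:
        return "no error logged"
    kind = "uncorrected" if fields["uncorrected"] else "corrected"
    return f"{kind} error, MCA code 0x{fields['mca_error_code']:04x}"


if __name__ == "__main__":
    # Hypothetical value: VAL and EN set, UC clear, made-up error code.
    print(summarize(0x9000000000000090))
```

The corrected/uncorrected distinction is the one that matters here: a
corrected ECC error is exactly the kind of small, intermittent event
that never shows up unless something decodes and logs it.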

FreeBSD 8.x does support MCA (it's enabled by default), and if you skim
the -stable list you'll find people occasionally trying to figure out
why their system is spewing these mysterious MCEs and what they mean.

MCA is only available, however, if your CPU supports it, and my gut
feeling says that other parts of the system (the motherboard) have to
have supporting pieces integrated as well.

So circling back to your very first post, you said you were using:

Asus P7F-E (includes 6 3Gb/s SATA ports)

Oh dear, Asus.  What kind of mission-critical environment uses this
hardware?  :-)  Let's see what the user manual has in it.  Section 4.4.2
has options related to the Northbridge (I'm not sure what that means in
this case; the board supports Core iX CPUs, which have an on-die MCH, so
I don't know what these options control).  All of the items in this
section of the
manual are horribly documented, but ones that catch my eye are:

* DRAM Margin Ranks (Enabled/Disabled)
* MRC Serial Debug Message Level (Disabled/Min/Max/Test)
* Memory ECC Function (Enabled/Disabled)
* Page Policy (Closed/Open)
* Adaptive Page (Disabled/Enabled)
* Data Scramble (Disabled/Enabled)
* Memory Thermal Throttling (Disabled/CLTT/OLTT)

I know what the 3rd and last items do, but not the rest.

There's also something in the Southbridge part of the manual which is
strange: something called "Energy Lake Feature".  It defaults to
Disabled, with a comment "We do not recommend you enable this feature".
This is all I could find:

* Energy Lake technology introduces two main end-user features: the
  "Consumer Electronics" (CE)-like device power behavior, and
  maintaining system state and data integrity during power loss
  events.

* Allow you to configure Intel's Energy Lake power management
  technology. If you are running a Media Center you can install the
  Intel VIIV software to get the correct driver; otherwise disable
  the Energy Lake feature in BIOS (it relates purely to Intel's Quick
  Resume feature, which is generally useless).

Otherwise, I see no mention of MCA, PERR/SERR, or anything else that's
considered useful (by my standards).  I see lots of server-esque options
like BIOS-level serial console, but the rest of the board is extremely
desktop-oriented, which is what Asus is known for.

> We have tried to buy quality hardware.  At least, we didn't deliberately
> skimp (except to build our own box vs buy a big name brand pre-built zfs
> server).

No offence intended -- honestly -- but I question anyone who would buy
an Asus motherboard for a server.  If I was sitting in a meeting room
with infrastructure engineers discussing what to buy and someone said
"We're considering Asus", I would say "This is a joke, right?"  (Note
that for my home Windows workstations, I do use Asus motherboards.)

Sure, the motherboard might not even be the problem.  But I'm just
saying: who knows what's going on here?  I have to question everything.

You followed up with "we're starting to question the PIKE card", which
should in turn make you question exactly why you bought this hardware to
begin with.  My recommendation, if you don't want to spend zillions of
bucks on HP/Compaq or Dell hardware?  Supermicro.  I can't talk about
their storage HBAs, but many other people here can -- the results have
been hit-or-miss.  I tend to stick solely with Intel ICHxx or ESBx
on-board controllers, which FreeBSD works wonderfully with.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |