Constant minor ZFS corruption

Thu Mar 10 23:41:57 UTC 2011

On Fri, Mar 11, 2011 at 09:02:43AM +1000, Stephen McKay wrote:
> On Wednesday, 9th March 2011, Mike Tancsa wrote:
> 
> >On 3/9/2011 7:41 AM, Stephen McKay wrote:
> >> Of the 12 disks, only 1 has been error-free.  I've been doing this for
> >> about 10 days now and there is no pattern that I can see in the errors.
> 
> >After adding a larger case for future expansion, we found the next day
> >we were seeing all sorts of random errors
> >
> >Like
> >
> >Mar  3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48
> >status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2281852580
> >
> >and
> >
> >Mar  4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss
> >04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000

Speaking strictly to Mike here:

I spent some time a while ago trying to figure out the NID_NOT_FOUND
error, and wrote up what I found back when I was contributing to the
Wiki; see section "SATA disk troubleshooting":

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

So, it could be that the LBA being accessed isn't within the permitted
valid range.  I could be completely off my rocker though; I'd need
someone much more familiar with the ATA-7 specification to state up
front what this bit actually defines.
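
For what it's worth, the LBA in that log line can be sanity-checked with
a bit of arithmetic.  A minimal sketch, assuming 512-byte logical sectors
(the log doesn't say) and using the sector count a typical 1 TB drive
reports purely as a comparison point; the actual capacity of ad1 isn't
stated in this thread, and the function names are my own:

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

def lba_byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset on the disk at which the given LBA starts."""
    return lba * sector_size

def lba_in_range(lba, total_sectors):
    """True if the LBA is addressable on a disk with total_sectors sectors."""
    return 0 <= lba < total_sectors

failing_lba = 2281852580          # from the WRITE_DMA48 failure quoted above
example_1tb_sectors = 1953525168  # assumption: typical 1 TB drive, NOT known to be ad1's size

print(lba_byte_offset(failing_lba))                    # 1168308520960 bytes, roughly 1.17 TB
print(lba_in_range(failing_lba, example_1tb_sectors))  # False
```

If ad1 is a 1 TB disk, that LBA is past the end of the disk and the error
would make sense; on a 1.5 TB or 2 TB disk it's well within range and the
explanation has to lie elsewhere.  "smartctl -i /dev/ad1" will show the
real capacity.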

Anyway, despite that, the controller is also reporting timeouts.  What
you haven't shown is what exact model of Silicon Image controller you're
using.  It matters.  There are certain models of SI chipsets that have
very bad, nasty bugs.  Other models of chips do not have these issues:

http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

> Our system does not report any driver errors or disk errors.  We see
> checksum errors from ZFS (mostly in scrubs).  It's like there's an
> invisible pixie sprinkling bad data on our disks while we sleep.

Speaking to Stephen:

With disk bit rot, your "system" (motherboard) won't report any errors.
The controller you're using won't report any errors.  The disks also
won't report any errors.  ZFS, however, *will* report checksum errors.
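
To make that concrete: the reason ZFS can see what nothing else sees is
that it keeps a checksum for every block, stored apart from the block
itself, and verifies it on read.  Here's a toy illustration in Python
using SHA-256 from hashlib.  This is not how ZFS is implemented
internally; the function names and the in-memory "disk" are made up for
the example, and it only shows the general idea:

```python
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def write_block(disk: dict, addr: int, data: bytes) -> bytes:
    """Store data on the fake disk and return its checksum, kept elsewhere."""
    disk[addr] = bytes(data)
    return checksum(data)

def read_block(disk: dict, addr: int, expected: bytes) -> bytes:
    """Return the block, or raise if it no longer matches its checksum."""
    data = disk[addr]
    if checksum(data) != expected:
        raise IOError(f"checksum mismatch at block {addr}")
    return data

disk = {}
saved = write_block(disk, 7, b"A" * 4096)

# Simulate silent corruption: one bit changes, and no layer reports an error.
corrupted = bytearray(disk[7])
corrupted[100] ^= 0x01
disk[7] = bytes(corrupted)

try:
    read_block(disk, 7, saved)
except IOError as exc:
    print(exc)  # checksum mismatch at block 7
```

The disk, controller, and driver all "succeeded" here; only the layer
holding the independent checksum has any way of noticing.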

What if there's a bug in the FreeBSD driver you're using?  What if on
some rare occasion it only writes 4095 bytes of the 4096 it needs to
write?  What if there's an off-by-one bug in the FreeBSD driver where
it's randomly corrupting a piece of data it intends to write to the
disk?  And what about the firmware, which controls all the disk
interaction?

There's also the possibility that there's some wonkiness going on with
the memory controller on your mainboard; maybe it's randomly corrupting
something.  ECC RAM wouldn't necessarily detect this either.  FreeBSD
kmem/KVA, as I understand it, is dedicated solely to the kernel and not
to userland (so a userland app might not sig11, for example).  However,
I would expect the kernel to be freaking out randomly in other ways (e.g.
I would expect the system to be behaving oddly and not just limited to
ZFS or disk I/O).

You get the idea.  The problem could be anywhere.  Welcome to OS,
system, and hardware troubleshooting in 2011, glad to have you on the
team.  ;-)

You're going to need to spend a lot of time debugging this, and some of
it will absolutely involve downtime unless you can afford to build a
complete 100% identical replica system that can reproduce the problem.
If you can reproduce the problem on that system, awesome.  My advice
would be to start (on the replica system) by replacing the controller
entirely.  Use some on-board SATA controller, or invest in an Areca or
"something else".  If the errors stop, that narrows the problem down to
the controller, the controller firmware, or the FreeBSD driver.  That
should help.

> >We narrowed it down to 2 problems.  Failing / Marginal power supply and
> >bad SATA cables. After changing the power supply, we still had a few
> >disks errors.
> 
> If either of these were the cause of our problem, we'd see errors
> logged, right?  Not just invisible corruption?

Simple answer: no.  Long answer: I can't provide one because I'm not an
EE guy, so you'll just have to trust me: problems caused by dirty power
or "bad power" are absolutely crazy.  Given how complex hardware is
these days (numerous ASICs, circuitry components, etc.), absolutely
bizarre and weird things happen when a device doesn't get what it
expects.  That's about all I know, and there's lots of evidence on the
net to validate this fact.  I just wish I could put more absolute faith
into it, but since I don't understand EE/power "stuff", it'll always be
a mystery to me.

I could give you an example of a power-related problem I'm dealing with
at home that would probably blow your mind.  Contact me off-list if you
want the story (every person I've given it to so far has gone "...what?
That makes absolutely no sense.  Did you try...?"  "Yes"  "What
about..."  "Yep"  "...wow").

> We will probably swap the power supply and cables anyway soon, just to
> see what happens, but on other machines where cables or power was the
> problem I saw errors (just like yours) in the logs.

I imagine your controller uses some kind of multi-lane break-out cable.
It's possible that thing is bad.

God I hate to bring this up, because it's really going out on a limb,
but there's always the remote possibility of interference/EMI causing
"weird things" to happen with data flowing across the cable.  However, I
STRONGLY doubt this; SMART attribute 199 (UDMA CRC Error Count) would
absolutely be incrementing whenever this occurred.

If you want to provide me with SMART stats (smartctl -a /dev/disk) for
each of your disks (please be sure to label them and re-provide me
"zpool status" output so I can correlate the checksum errors with the
disks), I will be happy to review them for you.
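
If it helps with gathering those, here's a rough sketch of a Python
script that runs "smartctl -a" on each disk, labels the output, and pulls
out the raw value of attribute 199.  The device names are placeholders,
and the parser assumes the usual ten-column attribute table that smartctl
prints for ATA disks (raw value in the last column); adjust to taste:

```python
import subprocess

def udma_crc_errors(smartctl_output: str):
    """Return the raw value of SMART attribute 199, or None if not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None

def collect(devices):
    """Run 'smartctl -a' on each device; return {device: full output}."""
    reports = {}
    for dev in devices:
        # No check=True: smartctl uses non-zero exit codes as status bits.
        result = subprocess.run(["smartctl", "-a", dev],
                                capture_output=True, text=True)
        reports[dev] = result.stdout
    return reports

def print_reports(devices):
    """Print each disk's full report under a header naming the device."""
    for dev, text in collect(devices).items():
        print(f"===== {dev} (attribute 199 raw: {udma_crc_errors(text)}) =====")
        print(text)

# Example (needs root; device names are hypothetical):
# print_reports(["/dev/ada0", "/dev/ada1"])
```

A non-zero and growing raw value on attribute 199 points at the cable or
connector path for that disk; a flat zero on every disk makes the EMI
theory even less likely.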

> >After almost 5 days of uptime, no problems at all now.  Not one error.
> 
> Well, we've got something to aim for, eh? :-)

I sure hope so.  :-)  Like you, I hate problems of this nature.  And
problems of this nature are exactly why I started spending a *lot* of
time, both at my job and outside of work, getting to understand disks,
ATA/SATA, and storage a bit better.  I honestly don't mean to sound like
a braggart
(despite how direct/pompous I am, I honestly have a very small ego), but
I've more or less become the main guy at my workplace when it comes to
disk/storage problems.

I just got done dealing with two separate cases of desktop-grade SATA
disks in our Citrix Netscaler products (which use FreeBSD) spewing DMA
errors right in the middle of OS upgrades (worst time for it to happen).
I was able to work around the problem using a combo of a sh script, dd,
and smartmontools, allowing upgrades to complete + get production
traffic working again.  We did RMAs on the disks/units later, since the
turnaround time for replacements was way outside of the permitted
maintenance window.  Networking owes me a case of beer.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |