Update on data corruption with Tyan/3Ware

Ted Mittelstaedt tedm at toybox.placo.com
Tue Sep 25 21:24:29 PDT 2007



> -----Original Message-----
> From: owner-freebsd-questions at freebsd.org
> [mailto:owner-freebsd-questions at freebsd.org]On Behalf Of Erik Trulsson
> Sent: Tuesday, September 25, 2007 8:06 AM
> To: Ted Mittelstaedt
> Cc: Chris Boyd; freebsd-questions at freebsd.org; Bart Silverstrim
> Subject: Re: Update on data corruption with Tyan/3Ware
>
>


> > "...We've narrowed the problem down to files that are > 4GB.  Anytime we
> > have a file that's > 4GB, we get inconsistent checksums, can't
> > uncompress it, etc.  Files < 4GB are fine..."
>
> I missed none of that.  I just note that the 3ware driver and card know
> nothing about files.  They have no way of knowing whether the blocks they are
> reading and writing belong to one large file or several small files.
> Therefore if there are problems only with *files* larger than 4GB it seems
> unlikely that the problem is with the card or its driver.
>

I'm sure it seems unlikely, but I've seen many a problem turn out to
come from an "unlikely" source.

> >
> > So as I already stated in the VERY FIRST RESPONSE, Chris needs to
> > go to 3ware and ask them what is going on.  Unless you're going to
> > continue to say that FreeBSD64 has a 4GB filesize limitation?
>
> No, I know very well that FreeBSD does not have any 4GB filesize
> limitation.
> It can have bugs in the filesystem or virtual memory system though.

I think those bugs have been ironed out.  People have been complaining
about 4GB limitations for several years now, and the FreeBSD developers
have been fixing these problems as they come up.  One of the
motivators for going to 64-bit was to support large files like this.

I think if more people were seeing this we would see far more complaints
about it.

> The userland programs reading and writing the file might also have bugs
> for that matter.
>

That is true.  But how many different userland programs have to fail
before you stop blaming userland programs?  In any case this is easy
as pie to eliminate - run the same userland program on a different
system and see if it fails the same way.
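
For what it's worth, here is a minimal sketch of that comparison in
Python.  It only assumes Python and its standard hashlib module are on
both boxes; the function name file_sha256 is my own, not something from
any tool mentioned in this thread.  It reads the file in chunks so a
file larger than 4GB never has to fit in memory.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the same file on the suspect system and on a copy on a
known-good system.  If the digests differ between the two machines, or
differ between two runs on the same machine, the userland program is
probably not the culprit.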

>
> The first things I would check in such a situation is if the same problem
> happens with some other disk controller in the same system.

I wouldn't.  The big reason you buy raid controllers like the 3ware is
because they are supported by the manufacturer.  That's good money you
have paid 3ware and they owe you some time for support.

If 3ware comes back and says "we tested the 9550 on amd64 and there
is no problem with larger than 4GB files" then that is the time to spend
the effort building a test system and checking, or putting a disk in your
existing system and testing, or whatever.  If you find the 3ware controller
is the problem after doing this then you're going to need the audit trail
in order to get them to fix the problem - or you return the card to where
you bought it from and buy a HighPoint card.  Loss of revenue from returns
often speaks the loudest of all.

As it is, simply due to this posting of his, I have gone ahead and added
a >4GB test into the list of tests in the buildsheets for all of my
3ware 9550 servers, and I have a couple myself.  Meaning, the next time
I have to tear down and rebuild any of them (hopefully far in the future)
I will test for this condition before putting the server online.  And if
it fails you better believe 3ware will hear about it and I'll file a PR
and such.  Fortunately I do not deal with files that large on any of
those servers.
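
A rough sketch of what such a >4GB test could look like, in Python.
This is an illustration only, not the actual buildsheet test; the
function names and the 1 MiB block size are my own assumptions.  Each
block is stamped with its own index, so a block that lands at the wrong
offset (for example, an offset wrapping at the 4GB mark) shows up as a
mismatch instead of going unnoticed.

```python
import os
import struct

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block (arbitrary choice)


def make_block(index):
    """Build a block filled with its own 8-byte index, so each block is unique."""
    return struct.pack("<Q", index) * (BLOCK_SIZE // 8)


def write_pattern(path, total_blocks):
    """Write total_blocks patterned blocks to path and force them to disk."""
    with open(path, "wb") as f:
        for i in range(total_blocks):
            f.write(make_block(i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, total_blocks):
    """Return the byte offset of the first bad block, or None if all match."""
    with open(path, "rb") as f:
        for i in range(total_blocks):
            if f.read(BLOCK_SIZE) != make_block(i):
                return i * BLOCK_SIZE
    return None
```

To use it, call write_pattern(path, 5 * 1024) to lay down a 5 GiB file
on the array (the path is whatever scratch file you pick), then call
verify_pattern with the same arguments.  Keep in mind that a read right
after the write may be served from the buffer cache rather than the
disk, so unmount and remount or reboot between the two steps, or use a
file larger than RAM.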

There is always the chance 3ware will come back and say "Oops, you
are right, there's a bug in the driver, here's a fix."

I would feel pretty stupid after having gone to all that trouble to
tear into the system to prove the card is at fault, only to have them
come back and say "yep, we knew about that."

> I would also check the RAM carefully with Memtest86 or similar. (Bad RAM
> can cause all kinds of very strange behaviour.)
>

More wasted time jumping the gun.  If both the 3ware card and another
controller failed this test THEN that is the time to start in with the
memory tests and other kinds of tests.  With bad RAM, many times it takes
days of testing it over and over and over for the RAM to fail once.

And his symptoms are too repeatable anyway.  Bad RAM almost always causes
random strange behavior; it is rarely associated with something as
repeatable as what he is describing.  I wouldn't rule it out of course -
but start with the easy tests first - and the easiest of all is asking
the manufacturer if it is a known problem.

Ted


