ad10: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=11441599

Wed Aug 10 23:47:01 GMT 2005

On Thu, Aug 11, 2005 at 12:46:04AM +0200, Søren Schmidt wrote:
> 
> On 10/08/2005, at 22:51, Karl Denninger wrote:
> >
> >This is the subject of the PR I filed back in February.
> >
> >Again, if you want either a controller shipped to you OR access to a
> >development machine (e.g. ssh in and play) which has the suspect
> >configuration on it, the latter of which is probably the best option
> >(since making it fail is simple) I'm willing to provide either - my
> >only caveat is that if I send hardware I want it back when you're
> >done, and I believe it's reasonable to expect that 6.0 will get HELD
> >in its release cycle until this is resolved.
> 
> I have plenty of the sii3112's around, so that's not needed; however,
> I've not managed to get ahold of a machine in which it fails reliably
> with ATA as is in 6.0.

I have two which reliably fail if you put TWO disks on them in a gmirror
config within minutes of starting a "make buildworld".  With one disk
it takes a bit longer and more effort, but can still be forced to fail.

It appears to require a mix of read and write operations and a fairly heavy
- but not horrifically so - I/O load to make it blow up.

All reads or all writes do NOT fail.  For example, you can do a gmirror
rebuild and it will succeed.  That's all writes (to the new disks) until
complete.  Seconds to minutes after the rebuilds complete if the system 
is under heavy random I/O load it will fail.

From this and other tests I've concluded that a MIX of read and write
operations is required, and the total load must be substantial.  Neither
reads alone nor writes alone appear to provoke it, even with 100%
disk utilization.
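
A minimal sketch of the kind of mixed load described above, for anyone who
wants something smaller than a full buildworld. It is illustrative only and
is not the test from the PR: the function name mixed_io, its parameters and
the scratch-file approach are all assumptions. Unless the scratch file is
larger than RAM, most of the reads will be served from the buffer cache
rather than from the disk.

```python
import os
import random
import sys
import tempfile


def mixed_io(path, file_size=4 * 1024 * 1024, block=64 * 1024,
             ops=200, write_ratio=0.5, seed=0):
    """Issue a random mix of block reads and writes against one file.

    Hypothetical helper, not from the original report.
    Returns (reads, writes, mismatches); a mismatch means a block read
    back differently from what was last written to it.
    """
    rng = random.Random(seed)
    nblocks = file_size // block
    expected = {}  # block index -> bytes last written to that block
    reads = writes = mismatches = 0
    with open(path, "w+b") as f:
        f.truncate(file_size)  # unwritten blocks read back as zeros
        for _ in range(ops):
            idx = rng.randrange(nblocks)
            f.seek(idx * block)
            if rng.random() < write_ratio:
                data = bytes([rng.randrange(256)]) * block
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # push the write towards the device
                expected[idx] = data
                writes += 1
            else:
                got = f.read(block)
                if got != expected.get(idx, b"\0" * block):
                    mismatches += 1
                reads += 1
    return reads, writes, mismatches


if __name__ == "__main__":
    # Pass a path on the disk under test; defaults to a temporary file.
    if len(sys.argv) > 1:
        target = sys.argv[1]
    else:
        target = os.path.join(tempfile.mkdtemp(), "scratch.bin")
    print(mixed_io(target))
```

Setting write_ratio to 0.0 or 1.0 gives the all-reads and all-writes cases
for comparison against the mixed case.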

> >The latter offer (ssh access) has been on the table for several
> >months.  The former I just put on the table as I threw up my hands
> >and bought a 3ware card - which means I now have TWO of the suspect
> >cards and need only one for my own testing (in the sandbox)
> >
> >I'm willing to go WELL out of my way to make it possible for this to
> >get fixed, since there appears to be an issue with access to hardware
> >that breaks reliably.  However, I, and others, would like to know
> >that we're going to see the problem get resolved.
> 
> I've already gone WAY out of my way to try to support the sii3112,  
> and I'm not inclined to waste more of my precious spare time on it.  
> However, if it really is that important to enough people to try to  
> work around the silicon bugs (which very likely isn't possible), get  
> together and get me failing HW on my desk and time to work on it.

Ok, then do the RIGHT THING and document that the SiI chips are declared
BROKEN by FreeBSD and likely to cause people trouble - including irrevocable 
data corruption.

This would have saved me COUNTLESS hours when I first ran into this 
issue.  Indeed, it was not until someone else started posting excerpts 
from commit logs (months after I filed the PR originally!) that I was 
aware FreeBSD developers considered these chipsets "damaged goods".

Where is fair warning in the hardware compatibility guide?

Second, your requirement for <BOTH> hardware <AND TIME> simply can't be 
met.  It is not possible for anyone to manufacture or deliver time.  

Is it thus necessary for us "mere users" to consider this an issue that 
will simply not be addressed?  If so, then just say so up front <AND 
DOCUMENT THAT THE SII CHIPSETS DON'T WORK RIGHT.>

> >Again - this is hardware that is STABLE and works under 4.x - in the
> >case of my specific configuration I ran under 4.x for over a year
> >without a single incident.  With 5.4 and 6.0-BETA I can kill it
> >inside of 2 minutes with nothing more complicated than a
> >"make -j4 buildworld".
> 
> First, you cannot by any degree of the word call the sii3112
> STABLE hardware; it's broken beyond repair or workarounds, and even
> the supplier acknowledges that fact.

Well then how about if FreeBSD officially DECLARES this hardware to be
broken beyond repair and workaround, and simply says "if this doesn't work
for you, don't bitch or complain, because we have nothing further we can 
do about it"?

That is acceptable, although I bet it costs 'ya a fair number of users,
particularly in the small server and workstation markets.  Of course since
it's not "money lost", that may be perfectly OK to the FreeBSD team.

It definitely will change MY focus as a developer of software often run on 
small office and home network machines though.  It HAS TO Soren.  

This isn't a matter of me not wanting to be a FreeBSD evangelist - but
if I try to tell people that half of the machines out there that they might
run FreeBSD on are likely to fail, and if they do my only recommendation is
"sorry, I can't do anything about it other than sell you this hardware", the
obvious next reply is that they will want the software to be made available 
on an operating system that DOESN'T blow up like this.  Linux ends up being
something I have to support of necessity down that road......  (a thing I've
studiously avoided now for five years, by the way.)

I have a 3ware card in my production machine now and the "allegedly broken"
disks are magically just fine.  Guess the disks are fine eh?  Of course I 
lost the functionality that I thought I was getting with the newer ATA code
anyway, since the 3ware software doesn't support hot plug, and I also lost
access to the disk statistics and self-test capabilities that smartmontools
has, since 3ware's board doesn't pass that through cleanly either.

But all this begs the question - why did it work on 4.x, and how come the 
same timing constraints and code paths that worked on 4.x can't / weren't 
incorporated into what's there now?

> Second, you cannot possibly have used gmirror (as in the PR) on 4.x  
> so what was the config back then ?

I didn't NEED gmirror back then.  Attempting to use these disks on a SiI
controller WITHOUT gmirror in 5.4 or even 6.0 is asking to have to reload 
the machine as the errors cause irrevocable data corruption.

I'm not about to subject myself to having to reload a machine a few hundred
times while troubleshooting it, and I suspect you know that is a completely 
unreasonable request.

Gmirror was added to my config in an attempt to stop the crashes during
testing - with at least one disk in the mirror on the ICH5 adapter the 
system (and data) survives.  It turns out that on 5.x this is much more 
"reasonable" to use than vinum, which was severely broken in 5.x (may
be fixed now as "gvinum", I didn't give it another crack after pulling my
hair out for quite a long time with THAT one.)

I assure you that the load profiles that generate BOOMs on 5.4 and 6.0-BETA 
do NOT do so under 4.x with the IDENTICAL hardware in use.  Over a year of heavy
production use of 4.x with ZERO trouble is my evidence for this.

> Third, please get gmirror out of the loop (use atacontrol to create a  
> mirror if need be) and let me know if that changes anything.

Uh, if the abstraction done by GEOM is hardware-independent, and the error
comes from the DRIVER, how can GEOM be involved?  

GEOM (gmirror in this case) prevents me from having to reload the machine
every time it blows up due to data corruption that cannot be fixed.

Never mind that others are reporting irrevocable data loss and crashes -
they aren't mirrored..... I've managed to keep my data intact....

"atacontrol" doesn't help me as there is no rebuild mechanism available 
for "garden variety" controllers (at least the last time I tried it that
did nothing.)  So you can build the array but after the first crash you
have no way to recover.

That's only marginally better than having the crash wipe the sidewalk with
the data on your drive, in terms of troubleshooting effort.

> Fourth, another thing to try is fumbling with BIOS settings; some
> setups have been reported to start working when PCI timings are
> changed. YMMV.
> 
> - Søren

I can play with this.... but if the hardware is the cause and requires
tweaking timing in the PCI BIOS config, how come 4.x works without any 
tweaking on the same hardware?

In short, what's changed in the DRIVER timing that provokes this sort of 
thing, and does it NEED to have changed?

Again, I can easily set up ssh access to a machine that has problems with
this, and the "BOOM"s are VERY repeatable.

From the other postings here, I am by no means an isolated user with an
isolated problem - the issue is fairly widespread.

I suspect, but of course cannot prove, that if you find the issue with my
machine, you will likely fix a lot of other people's issues with similar
problems...... I could be wrong, but I bet not......

In any event the ATA code changes have hurt a LOT of people, Soren, and led to
a huge amount of wasted time.  If it was known that the SiI chipsets simply
were never going to get full support (because they are considered
"unsupportable") then it is only right for the development team to DOCUMENT
THIS rather than letting people find out for themselves the hard way,
pulling their hair out looking for phantom bad disk drives and phantom
problems with cables - neither of which has anything to do with it.

If there is going to be no path out of this mess then just say so and 
we'll realign our expectations of where FreeBSD fits in terms of what
environments it is reasonable to consider it for.

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://genesis3.blogspot.com	Musings Of A Sentient Mind