9.1-stable: ATI IXP600 AHCI: CAM timeout

Oliver Fromme olli at lurza.secnetix.de
Wed May 29 18:44:13 UTC 2013


Ian Lepore wrote:
 > On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
 > > Steven Hartland wrote:
 > > > Have you checked your sata cables and psu outputs?
 > > > 
 > > > Both of these could be the underlying cause of poor signalling.
 > > 
 > > I can't easily check that because it is a cheap rented
 > > server in a remote location.
 > > 
 > > But I don't believe it is bad cabling or PSU anyway, or
 > > otherwise the problem would occur intermittently all the
 > > time if the load on the disks is sufficiently high.
 > > But it only occurs at tags=3 and above.  At tags=2 it does
 > > not occur at all, no matter how hard I hammer on the disks.
 > > 
 > > At the moment I'm inclined to believe that it is either
 > > a bug in the HDD firmware or in the controller.  The disks
 > > aren't exactly new, they're 400 GB Samsung ones that are
 > > several years old.  I think it's not uncommon to have bugs
 > > in the NCQ implementation in such disks.
 > > 
 > > The only thing that puzzles me is the fact that the problem
 > > also disappears completely when I reduce the SATA rev from
 > > II to I, even at tags=32.
 > 
 > It seems to me that you dismiss signaling problems too quickly.
 > Consider the possibilities... A bad cable leads to intermittent errors
 > at higher speeds.  When NCQ is disabled or limited the software handles
 > these errors pretty much transparently.  When NCQ is not limited and
 > there are many outstanding requests, suddenly the error handling in the
 > software breaks down somehow and a minor recoverable problem becomes an
 > in-your-face error.
 > 
 > I'm not saying any of the foregoing is true, just that you should
 > consider the possibility that you're dealing with multiple problems
 > which are only loosely coupled, but together can seem like a single more
 > serious problem.  You don't know enough yet to casually dismiss
 > anything.

Well ...  I also can't dismiss the possibility that there is
a mouse in the machine that is pulling the SATA cables twice
every minute.  :-)

But seriously ...  I don't see how bad cabling could cause
errors at tags=3 and no errors at all at tags=2.  It shouldn't
make a difference for the cables if there are two or three
tags used.  And by the way, it doesn't make a difference at
all whether I use tags=3 or tags=32; the rate of errors is the
same in both cases (about two per minute during buildworld).

I have googled a bit; the Samsung HD401LJ and HD403LJ don't
seem to be innocent ...  There are lots of pages mentioning
problems with NCQ and SATA I vs. II.

Best regards
   Oliver


-- 
Oliver Fromme,  secnetix GmbH & Co. KG,  Marktplatz 29, 85567 Grafing
Handelsregister:  Amtsgericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsreg.: Amtsgericht München,
HRB 125758, Geschäftsführer:  Maik Bachmann,  Olaf Erb,  Ralf Gebhart

FreeBSD-Dienstleistungen/-Produkte + mehr: http://www.secnetix.de/bsd

"A misleading benchmark test can accomplish in minutes
what years of good engineering can never do." -- Dilbert (2009-03-02)

