9.1-stable: ATI IXP600 AHCI: CAM timeout

Wed May 29 22:54:15 UTC 2013

On Wed, May 29, 2013 at 10:09:14AM +0200, Oliver Fromme wrote:
> Hi,
> 
> Yesterday I have downloaded the latest 9.1 snapshot (May 15th)
> from ftp.freebsd.org and installed it on a machine that was
> previously running Linux.  It works fine, except that I get
> many the following when there is heavy disk I/O, e.g. when
> building world or ports:
> 
> ahcich0: Timeout on slot 23 port 0
> ahcich0: is 00000000 cs f07fffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004bc17
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command

The messages above indicate two things:

1) The AHCI driver is reporting an internal timeout when trying to speak
to the underlying device (disk) attached to whatever maps to ahcich0;
this is an "AHCI-level timeout", and the 2nd line shows all of the
AHCI-level status conditions at that time,

2) CAM reports what it was trying to do when that happened,
specifically issue WRITE_FPDMA_QUEUED (an NCQ-based write to ada0),
which timed out after 30 seconds (kern.cam.ada.default_timeout).

> It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally
> often (it's a gmirror), sometimes even at exactly the same
> time so the messages for ada0 and ada1 are interleaved in
> the dmesg output.

Both surprising and not surprising (to me anyway), on numerous levels.

> The worst thing is that the whole system seems to freeze
> completely for about 10 seconds each time it happens.
> Other than that, I haven't seen any ill effects, i.e. no
> processes dying and no panics (so far).  But the system is
> quite unusable because of the freezes.

There isn't much you can do about this.  I get the impression from your
statement this is the first time you've ever encountered an I/O timeout
in your life?  :-)  This is just how it works -- pretty much the entire
I/O subsystem (for the device(s) involved) "stalls" until a response to
the CDB gets received.  It's like this on all OSes, all systems; it's
how I/O works.

The AHCI driver may have different timeout settings; I haven't looked.

The same CDB gets re-submit to the controller 5 times
(kern.cam.ada.retry_count will say 4, but it starts at 0 if I remember
right), in hopes that the I/O transaction will eventually go through.

Repeated device timeouts with no successful responses will eventually
cause CAM or AHCI (I forget which driver/subsystem) to drop the disk.
In your case, this could mean ada0 and ada1 eventually getting dropped,
which would induce a panic since you're using them for your root
filesystem.  (I wonder if there are readers of this thread who are
starting to see why I use a single disk for my main OS drive...)

> I'm pretty sure the hardware has no defects.  The machine
> was running Linux fine until recently.
>
> Are there any known issues with FreeBSD + ATI IXP600?

This is opening a can of worms, which I've discussed in the past.
Please see my posts to freebsd-fs and/or freebsd-stable archives
(another person in this thread mentioned it as well).

Fact: there is still not enough low-level, hard evidence at this time to
determine if the problem is with the AHCI driver, the AMD/ATI IXP600
controller, or Samsung disks.  The situations I have dealt with in the
past always were inconclusive.  There have been reports of problems with
non-Samsung disks as well, but the report count there is extremely low
in comparison.

Fact: You will find complaints on Linux lists about both the controller
and the drives as well (in combo).  Take that to mean whatever you wish.
Use Google and search for "SB600 HD403LJ Linux" or "SB600 Samsung Linux"
and see for yourself.

Fact: Samsung's SpinPoint series has had a troubling past of firmware
bugs.  Things have gotten better on their newer-ish drives, but the
"slightly older" ones, to me, seemed more like a learning experience
for engineers.  I am not picking on Samsung exclusively; all drive
vendors have had problems historically, there is no such thing as a
"reliable" drive vendor in this day and age.  You go with whatever works
for you/whatever your experiences justify.

All that said:

There is some code in sys/dev/ahci/ahci.c that indicates "one-off"
behaviour for the SB600/IXP600, pertaining to NCQ.  However this was
committed a long time ago (r196777 and r196796).  I look at this
code and I can think of one problem with it, but answers to my below
questions will provide what I need.

> The kernel is the default GENERIC from the snapshot, the
> only additional modules loaded are geom_mirror and linux.ko.
> The dmesg messages related to disks are copied below, and
> the full dmesg can be found here:
> http://www.secnetix.de/olli/tmp/dmesg.nox.txt
> 
> Best regards
>    Oliver
> 
> FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013
>     root at snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
> ..
> ahci0: <ATI IXP600 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe7ff800-0xfe7ffbff irq 22 at device 18.0 on pci0
> ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich2: <AHCI channel> at channel 2 on ahci0
> ahcich3: <AHCI channel> at channel 3 on ahci0
> ..
> ..
> (aprobe0:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:15:0): CAM status: Command timeout
> (aprobe0:ahcich0:0:15:0): Error 5, Retries exhausted
> (aprobe1:ahcich1:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe1:ahcich1:0:15:0): CAM status: Command timeout
> (aprobe1:ahcich1:0:15:0): Error 5, Retries exhausted
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
> ada0: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada1: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
> ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada1: Command Queueing enabled
> ada1: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
> ada1: Previously was known as ad6
> ..
> GEOM_MIRROR: Device mirror/gm0 launched (2/2).
> ..
> Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]...
> ..
> ahcich0: Timeout on slot 23 port 0
> ahcich0: is 00000000 cs f07fffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004bc17
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich1: Timeout on slot 12 port 0
> ahcich1: is 00000000 cs ffff8fff ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 0004ee17
> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 80 85 e3 40 04 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> (ada1:ahcich1:0:0:0): Retrying command
> ahcich1: Timeout on slot 2 port 0
> ahcich1: is 00000000 cs 00000000 ss 0000001c rs 0000001c tfd 40 serr 00000000 cmd 0004e417
> ahcich0: Timeout on slot 12 port 0
> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> ahcich0: is 00000000 cs 00000000 ss 00007000 rs 00007000 tfd 40 serr 00000000 cmd 0004ee17
> (ada1:ahcich1:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> pid 40615 (try), uid 0: exited on signal 10 (core dumped)
> ahcich1: Timeout on slot 7 port 0
> ahcich1: is 00000000 cs fffff07f ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004ac17
> ahcich0: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
> Timeout on slot 19 port 0
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> ahcich0: is 00000000 cs ff07ffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004b817
> (ada1:ahcich1:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich1: Timeout on slot 12 port 0
> ahcich1: is 00000000 cs 00000000 ss 0000f000 rs 0000f000 tfd 40 serr 00000000 cmd 0004ef17
> ahcich0: Timeout on slot 24 port 0
> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 d8 78 e4 40 04 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> ahcich0: is 00000000 cs 00000000 ss 0f000000 rs 0f000000 tfd 40 serr 00000000 cmd 0004fb17
> (ada1:ahcich1:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 d8 78 e4 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich1: Timeout on slot 1 port 0
> ahcich1: is 00000000 cs 00000000 ss 0000003e rs 0000003e tfd 40 serr 00000000 cmd 0004e517
> ahcich0: Timeout on slot 13 port 0
> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 30 e0 e4 40 04 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> ahcich0: is 00000000 cs 00000000 ss 0003e000 rs 0003e000 tfd 40 serr 00000000 cmd 0004f117
> (ada1:ahcich1:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 30 e0 e4 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> ..
> ..

It's worth pointing out that all of the events you provided are writes.
In my experience, historically, that has usually been the case.  If a
drive firmware screws around when handling an NCQ write, taking too long
to do something (think firmware bug), this can happen.  If that's the
case, the fact it happens on 2 disks of the same type thus wouldn't
surprise me.

I've mentioned in the past that I know of a few situations where this
can happen, particularly with 4KByte sector drives, depending on how the
user set up the system.  In this case, the Samsung HD403LJ is supposedly
a 512-byte sector drive, but the drive probably complies with an older
ATA specification and thus only provides the logical sector size in ATA
IDENTIFY output, thus the system must assume physical=logical
(camcontrol and smartmontools will both say something to the effect of
"512 bytes logical/physical").

I would appreciate the following:

1. smartctl -x {ada0,ada1} output using a recent version of
smartmontools (6.1 if possible please),

2. camcontrol identify {ada0,ada1} -v output (note the -v),

3. If you are running smartd(8) or not,

4. pciconf -lvbc output.

Anecdotal story:

A lot of people forget the infamous nVidia nForce 4 vs. Maxtor NCQ issue
that circulated "PC enthusiast" sites during the mid-2000s.  Neither
company wanted to own up to the problem, blaming each other instead.
There was never any official statement made as to where the problem was,
only that nVidia updated their nForce 4 controller drivers with some
sort of workaround (details were not disclosed), and Maxtor also quietly
added a document to their website stating that you could get a firmware
from Technical Support that would address the problem as well.  I had
a combination of the two at the time, which is why I remember it.  Still to
this day nobody knows who was really responsible.  I won't get into the
whole political/societal aspects of why vendors always blame one another
rather than solve real problems.

There is no way at this time (in real-time or via loader.conf) to
disable NCQ within the AHCI driver.  It is possible to add an entry to
the AHCI quirks table for your controller that sets AHCI_Q_NONCQ, if you
want to try that.  I can give you a patch for that, but I need to see
the output from the above (4) commands first -- it may not be necessary
to try, depending on the results.

I have probably left out key/important informations within this mail,
which is an indicator of how tired I have grown of seeing it come up.
:-(

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |