ahcich Timeouts SATA SSD

Sun Oct 14 23:03:47 UTC 2012

I originally posted this to the FreeBSD hardware forum and then on
freebsd-questions at the direction of a moderator in the forum.

Based on the kinds of posts I see on freebsd-questions, this list
might be a better place for this issue, as it looks like some sort of
strange interaction or bug between FreeBSD 8.2/9.0 and SATA SSDs.

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI
3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
SSDs are connected to the on-board SATA ports on the motherboard

This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts on the
on-board SATA devices. The only change to the system since it was
built was to add an SSD for swap (32 GB swap device), and the issue
did not start until several months after that was added.

My initial thought was that I might have a bad SSD, so I swapped out
one of the Crucial SSDs, but the problem happened again a few days
later.

I then moved to systematically replacing items such as SATA cables,
memory, motherboard, etc., and the problem continued. For example, I
swapped out the 4 SATA cables for brand new ones and waited to see if
the problem happened again. Once it did, I moved on to replacing the
motherboard with an identical one, waited, and so on.

I could not find an obvious hardware-related explanation for this
behavior, so about a week and a half ago I did a fresh install of
FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver, as
I had found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117

When this happens, the only reliable way to recover the system is a
hard power cycle via IPMI (cutting power rather than hitting reset).
Not every occurrence needs one, but more often than not the on-board
AHCI BIOS does not see the disks after the event until the system has
been fully powered off and on again.
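
For reading the register dumps above: as far as the ahci driver output
can be interpreted, cs, ss and rs are 32-bit masks with one bit per
command slot. That reading is an assumption, although it is consistent
with the log, where "Timeout on slot 31" comes with cs 80000000. A
minimal sh sketch that turns such a mask into slot numbers
(decode_slots is a made-up helper name, not part of any tool):

```shell
#!/bin/sh
# decode_slots MASK
# Print the slot numbers whose bits are set in a hexadecimal slot mask,
# e.g. the cs/ss/rs values from an "ahcichN: is ... cs ... ss ... rs ..."
# line. Assumes bit N of the mask corresponds to command slot N.
decode_slots() {
    mask=$((0x$1))      # parse the argument as a hexadecimal number
    slot=0
    out=""
    while [ "$slot" -lt 32 ]; do
        if [ $(( (mask >> slot) & 1 )) -eq 1 ]; then
            out="${out:+$out }$slot"    # append, space-separated
        fi
        slot=$((slot + 1))
    done
    echo "$out"
}
```

For example, decode_slots e0000000 prints "29 30 31", which lines up
with the first dump: the timeout was reported on slot 29 while ss and
rs still had slots 29 through 31 set.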

I have done a bunch of Google work on this and have seen the issue
reported for both FreeNAS and FreeBSD, but with no clear-cut
resolution in terms of how to address it or what causes it. Some
people had a bad SSD, others had to disable NCQ or power management on
their SSD, others blamed particular brands of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have
the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1
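
The hints above follow one pattern per channel, so on a board with a
different number of AHCI channels they could be generated rather than
typed. A minimal sketch in sh; print_ahci_hints is a hypothetical
helper name, and the values simply mirror the ones above (see ahci(4)
for what each setting means):

```shell
#!/bin/sh
# print_ahci_hints FIRST LAST
# Emit sata_rev and pm_level loader hints for AHCI channels FIRST..LAST.
# The value 1 for both settings is copied from the configuration above;
# it is not a general recommendation.
print_ahci_hints() {
    ch=$1
    while [ "$ch" -le "$2" ]; do
        printf 'hint.ahcich.%d.sata_rev=1\n' "$ch"
        printf 'hint.ahcich.%d.pm_level=1\n' "$ch"
        ch=$((ch + 1))
    done
}
```

Running print_ahci_hints 0 3 produces the same eight lines as above,
grouped per channel rather than per setting; the output can be
reviewed and then added to /boot/loader.conf by hand.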

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

# Limit each on-board SATA disk to a single tagged opening, which
# effectively disables NCQ on that disk.
CAMCONTROL=/sbin/camcontrol

for disk in ada0 ada1 ada2 ada3; do
    $CAMCONTROL tags "$disk" -N 1 > /dev/null
done

exit 0

I went ahead and pulled the Intel SSDs, as they were showing ASR and
hardware reset counts that kept incrementing. Removing both of these
disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or
so of operation before the issue pops up again. If I remove these two
items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for
SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and
ada1 at this point) to see if that makes a difference as I found a
reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If
there is a way to gather more information when this happens, or
specific output I should post, I'm open to trying it.
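
As one low-effort way to gather data, the timeout messages can be
counted per channel from a saved copy of the console output. This is
only a sketch: summarize_timeouts is a made-up name, and it matches
just the two message forms shown earlier in this post ("Timeout on
slot" and "Poll timeout on slot").

```shell
#!/bin/sh
# summarize_timeouts
# Read kernel/console log text on stdin and print one line per AHCI
# channel in the form "<channel> <number of timeout messages>".
summarize_timeouts() {
    # Keep only the timeout messages, ignoring any syslog prefix.
    grep -oE 'ahcich[0-9]+: (Poll timeout|Timeout) on slot [0-9]+' |
        # Strip the colon from the channel name and count per channel.
        awk '{ sub(":", "", $1); n[$1]++ }
             END { for (c in n) print c, n[c] }' |
        sort
}
```

Usage would be something like summarize_timeouts < console.log, where
console.log is a hypothetical capture of the IPMI console; if the OS
disk is the device that drops out, the messages may never reach
/var/log/messages.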

What is driving me crazy is that I can't come up with a concrete
explanation as to why this is happening now and not back when the
system was built. The issue only seems to happen when the system is
idle, and the SSDs do not see much action other than hosting the OS,
scripts, etc., while the actual I/O goes to the drives on the
Intel/LSI controllers.

The system logs do not show anything prior to the event. The OS still
responds to ping requests afterwards, and an active SSH session stays
connected until you attempt to run something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (I plan on doing this next week,
before the next likely occurrence sometime Wednesday morning), as
perhaps there is some odd SATA/SSD interaction with FreeBSD or with
the controller that I'm not aware of (I haven't had this happen with
plain SATA disks and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system

I'm open to suggestions, direction, etc to see if I can nail down what
is going on and put this issue to bed for not only myself but for
anyone else who might run into it in the future.