ahcich Timeouts SATA SSD

Keegan,Nate nate.keegan at prescott-az.gov
Fri Oct 12 16:47:30 UTC 2012


My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI 3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
The SSDs are connected to the on-board SATA ports on the motherboard.

This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts on the on-board SATA devices. The only change to the system since it was built was the addition of an SSD for swap (a 32 GB swap device), and this issue did not appear until several months after that was added.

My initial thought was that I might have a bad SSD drive so I swapped out one of the Crucial SSD drives and the problem happened again a few days later.

I then moved to systematically replacing items such as SATA cables, memory, motherboard, etc and the problem continued. For example, I swapped out the 4 SATA cables with brand new SATA cables and waited to see if the problem happened again. Once it did I moved on to replacing the motherboard with an identical motherboard, waited, etc.

I could not find an obvious hardware-related explanation for this behavior, so about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver, as I found some evidence that this helps.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 00000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 80000003 rs 80000003 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000002 rs 00000002 tfd 80 serr 00000000 cmd 0004c117

When this happens the only way to recover the system is a hard boot via IPMI (yanking the power rather than hitting reset). I cannot say a hard reset is necessary every single time, but more often than not it is, as the on-board AHCI portion of the BIOS does not always see the disks after the event without a full power cycle.

I have done a fair amount of searching on this and have seen the issue reported against both FreeNAS and FreeBSD, but with no clear-cut resolution in terms of cause or fix. Some people had a bad SSD, others had to disable NCQ or power management on their SSD, others saw it only with particular brands of SSD (e.g. Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have the following in my /boot/loader.conf:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1
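For anyone trying to confirm these hints took effect, the negotiated link speed can be checked after reboot (a quick check; device names assumed to match my setup). With sata_rev=1 in place, dmesg should show the ada devices running at SATA 1.x speed:

```shell
# With sata_rev=1 the channels should renegotiate at SATA 1.x;
# dmesg reports the per-device link speed, e.g. "150.000MB/s transfers".
dmesg | grep -E '^(ahcich|ada)[0-9]+:'

# camcontrol identify shows what the drive itself reports.
camcontrol identify ada0
```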

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

# Drop the tag depth on each SSD to a single outstanding command,
# which effectively disables NCQ.
CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0
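For what it's worth, whether the reduced tag depth actually stuck can be checked with camcontrol as well (device name assumed):

```shell
# Show the queue state for the device; with NCQ effectively disabled
# the openings counters should reflect a depth of 1.
camcontrol tags ada0 -v | grep -i openings
```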

I went ahead and pulled the Intel SSDs, as their ASR and hardware reset counters were incrementing. Removing both of these disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or so of operation before the issue pops up again. If I remove these two items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at this point) to see if that makes a difference, as I found a reference to APM being a potential culprit.

I'm looking for insight/help on this as I'm about out of options. If there is a way to gather more information when this happens, post up information, etc I'm open to trying it.
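One thing I can try on the information-gathering front is setting up a kernel crash dump so a panic leaves a vmcore behind (a sketch with default paths; the obvious caveat is that the dump is written to the swap device, which in my case sits on the same wedged controller, so it may never make it to disk):

```shell
# /etc/rc.conf -- write crash dumps to the configured swap device;
# savecore(8) recovers them into /var/crash at the next boot.
dumpdev="AUTO"
dumpdir="/var/crash"
```

After the next panic, kgdb /boot/kernel/kernel /var/crash/vmcore.0 should give a backtrace of where the kernel died.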

What is driving me crazy is that I can't come up with a concrete explanation as to why this happens now and not back when the system was built. The issue only seems to happen when the system is idle, and the SSDs do not see much action beyond hosting the OS, scripts, etc., while the Intel/LSI-attached drives are where the actual I/O is.

The system logs show nothing prior to the event. Afterwards the OS still responds to ping, and an existing SSH session stays connected until you try to run something like 'ls' or 'ps'.

New SSH connections to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch the SSDs for plain SATA disks (I plan to do this next week, before the next likely occurrence sometime Wednesday morning), in case there is some odd SATA/SSD interaction with FreeBSD or the controller that I'm not aware of (I haven't seen this happen with plain SATA disks and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

I'm open to suggestions, direction, etc to see if I can nail down what is going on and put this issue to bed for not only myself but for anyone else who might run into it before I lose what little hair and sanity I have left...heh

- Nate


More information about the freebsd-questions mailing list