problems with AHCI on FreeBSD 8.2

Jeremy Chadwick freebsd at jdc.parodius.com
Wed Feb 15 10:52:06 UTC 2012



On Wed, Feb 15, 2012 at 02:42:05AM -0800, Jeremy Chadwick wrote:
> On Wed, Feb 15, 2012 at 10:19:37AM +0000, Tom Evans wrote:
> > On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick
> > <freebsd at jdc.parodius.com> wrote:
> > > On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> > >> I used to have tons of ahci errors on my 4-disk raidz1 of
> > >> HD154UIs when the rig was built a year or so ago (with 8.0-RELEASE),
> > >> but they disappeared after tuning ZFS.
> > >>
> > >> Sadly, I also got a new timeout a few days ago, followed by smartctl
> > >> errors I still haven't checked, but I guess they could be legit. I
> > >> still have to test/swap cables and give it a try.
> > 
> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> > haven't had a problem with any of them yet (touch wood).
> > 
> > > Further details which pertain to Samsung drives:
> > >
> > > In your case, you run smartd(8), which periodically hits the drive with
> > > SMART requests, pulling attribute data down and parsing it.  I believe
> > > your model is fine for this, but for similar Samsung models, I must
> > > strongly advise against this.  There are well-documented problems with
> > > Samsung firmwares and SMART behaviour which can result in data loss (yes
> > > you read that right).  Please see smartmontools' Wiki page on the matter
> > > for full details.  Just make sure you're running a fixed firmware:
> > >
> > > http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
> > >
> > 
> > Yikes, I have just this week installed a HD204UI. From that page,
> > drives manufactured after December 2010 should not be affected, which
> > is fortunate as the linked firmware page doesn't seem to exist
> > anymore, Samsung no longer seem to offer support for their drives and
> > point you at Seagate, whose site (of course!) only has downloads for
> > current Seagate drives.
> > 
> > 
> > Hmm, reading further on in the thread, there is a patch to mark certain
> > drives as having flaky NCQ; in the patch it is for the SAMSUNG
> > HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
> > use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
> > using 9-STABLE.
> > 
> > I suspect that there may be more going on than 'flaky NCQ', and that
> > perhaps disabling NCQ masks the real issue.
> 
> It could simply be a firmware bug in the drive, which is what some
> others have alluded to (and I'm in agreement with).  I would love to say
> "compare firmware versions on your drives", except there is real
> in-the-field proof that firmware version strings often do not get
> updated/changed between firmwares (at least in the case of some Seagate
> and Western Digital disks).  Furthermore, NCQ can "play differently" with
> different AHCI controllers.
> 
> That said, the disks / firmware versions mentioned by people involved in
> this thread / referenced threads are:
> 
> * Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
> * Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
> * Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
>   - NOTE: In Oscar's case, his drives exhibit other problems.  I
>     would provide a link but the web archive for freebsd-stable does
>     not show my mail which contains analysis of the situation
> * Harald Schmalzbauer -- not provided, but hints at Samsung EG drives
> 
> For this to be thorough, one would need to check which AHCI
> controllers everyone is using and compare those as well.
> 
> I think Scott's theory is probably on-the-ball here, as it pertains to
> tag exhaustion, which would manifest itself in the described fashion:
> 
> http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066177.html
> 
> I'd urge people experiencing this problem to issue the command Scott
> provided on all their Samsung disks and see if the problem goes away
> after that.  If it does, great.  I acknowledge there is no
> loader.conf tunable for doing this, so for now you'd have to write an
> rc.d script (or something similar) that issues it after boot-up.
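Scott's exact command lives only in the linked message and is not reproduced in this thread. As a hypothetical sketch, assuming his suggestion amounts to lowering the number of NCQ tags per disk with camcontrol(8)'s `tags` subcommand, the Python helper below parses `camcontrol devlist` output, groups ada(4) disks by model and firmware string (for the cross-system comparison mentioned earlier), and prints, or with `--apply` runs, one `camcontrol tags <dev> -N <n>` command per Samsung disk. The tag count of 31, the `SAMSUNG` prefix filter, and the helper names are illustrative assumptions, not values from this thread; Python is not in FreeBSD's base system, so this also assumes it is installed from ports. A local rc.d script could invoke it after boot.

```python
import re
import subprocess
import sys
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

# Matches `camcontrol devlist` lines such as:
#   <SAMSUNG HD154UI 1AG01118>   at scbus0 target 0 lun 0 (ada0,pass0)
DEVLIST_RE = re.compile(r"^<(?P<ident>[^>]*)>\s+at\s+\S+.*\((?P<periphs>[^)]*)\)$")


class Disk(NamedTuple):
    device: str    # e.g. "ada0"
    model: str     # e.g. "SAMSUNG HD154UI"
    firmware: str  # e.g. "1AG01118"; may NOT change between firmware builds


def parse_devlist(text: str) -> List[Disk]:
    """Return one Disk per ada(4) device found in `camcontrol devlist` output."""
    disks = []
    for line in text.splitlines():
        m = DEVLIST_RE.match(line.strip())
        if not m:
            continue
        ident = m.group("ident").split()
        if len(ident) < 2:
            continue
        # Assumption: the last word of the identity string is the firmware revision.
        model, firmware = " ".join(ident[:-1]), ident[-1]
        periphs = [p.strip() for p in m.group("periphs").split(",")]
        ada = [p for p in periphs if p.startswith("ada")]
        if ada:  # skip pass/ses-only entries such as enclosures
            disks.append(Disk(ada[0], model, firmware))
    return disks


def group_by_firmware(disks: List[Disk]) -> Dict[Tuple[str, str], List[str]]:
    """Map (model, firmware string) -> device names, for cross-system comparison."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for d in disks:
        groups[(d.model, d.firmware)].append(d.device)
    return dict(groups)


def tag_commands(disks: List[Disk], model_prefix: str = "SAMSUNG",
                 tags: int = 31) -> List[List[str]]:
    """Build hypothetical `camcontrol tags <dev> -N <tags>` commands.

    The default of 31 tags is an illustrative assumption, not a value from the thread.
    """
    return [["camcontrol", "tags", d.device, "-N", str(tags)]
            for d in disks if d.model.startswith(model_prefix)]


if __name__ == "__main__" and sys.platform.startswith("freebsd"):
    out = subprocess.run(["camcontrol", "devlist"], capture_output=True,
                         text=True, check=True).stdout
    disks = parse_devlist(out)
    for (model, fw), devs in sorted(group_by_firmware(disks).items()):
        print(f"{model} {fw}: {', '.join(devs)}")
    apply = "--apply" in sys.argv[1:]  # dry run unless --apply is given
    for cmd in tag_commands(disks):
        print(("running: " if apply else "would run: ") + " ".join(cmd))
        if apply:
            subprocess.run(cmd, check=True)
```

Keep in mind the caveat above that firmware version strings do not always change between firmware builds, so two disks reporting the same string is only weak evidence that they run identical code.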

Sorry, I missed the in-line part of your post at the top where you said:

> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> > haven't had a problem with any of them yet (touch wood).

So you're using the same firmware as the others (or so I'd like to
believe, but see my previous explanation).

It could be some AHCI<->NCQ drive implementation quirk.  There was an
example of this back in the day with Maxtor drives' NCQ implementation
not behaving correctly on nVidia controllers, which Maxtor insisted was
an nVidia problem yet released a drive firmware fix for.  I'm one of the
people this affected (on my desktop system), which is why I remember it.

-- 
| Jeremy Chadwick                              jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |