RELENG_7: zfs mirror causes ata timeout

Tue Jan 8 19:54:01 PST 2008

Quoting Jeremy Chadwick <koitsu at FreeBSD.org>:

> On Tue, Jan 08, 2008 at 05:28:46PM -0500, Stephen M. Rumble wrote:
>> I'm having a bit of trouble with a new machine running the latest RELENG_7
>> code. I have two 500GB WD Caviar GP disks on a mini-itx GM965-based board
>> (MSI "fuzzy") running amd64 with 4GB of ram. The disks are:
>
> Could be related to a PR that I submitted long ago, but was not specific to
> ZFS -- instead, it appeared to be specific to the motherboard I was
> using.  There's also some tidbits posted by others which appeared to
> help them, although performance was impacted:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=103435
>
> Another related PR, which seems to indicate motherboard problems:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=93885

Thanks. I'm not sure they apply, but I'll keep them in mind. The Intel  
chipsets seem to be rather bug-free; at least, I didn't see any  
mention of quirks or workarounds when glancing over code. The problems  
I'm seeing also seem to occur during low utilisation, not high  
(remarkably, keeping the system active seems to postpone issues!). I'm  
not sure PCI bus issues would be a likely culprit and I don't see any  
obviously relevant BIOS settings.

>> ad4: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata2-master SATA150
>> ad6: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata3-master SATA150
>>
>> I've tried different power supplies and cables. I've enabled and disabled
>> spread spectrum clocking and tried both SATA300 and SATA150 rates. I've
>> also tried switching drives between ports so that what was ad4 is ad6 and
>> what was ad6 is ad4. The problems persist, but seem to follow the same
>> drive (ad6 originally, then ad4 when swapped). This seems to indicate a
>> drive problem, but it works great on its own, even when exercising both
>> disks simultaneously. SMART reports no problems and ZFS reports no issues
>> when ad6 is used on its own outside of a zfs mirror. It seems like it's the
>> drive, but it works fine when not in a mirror. I'm stumped. Any ideas?
>
> Have you tried running long SMART tests (smartctl -t long) on both of
> these drives, ditto with an offline test (smartctl -t offline)?
> Statistics that are labelled "Offline" as their type won't get updated
> until an offline test is performed.  It's possible those statistics may
> provide some answers, but no guarantees.

Nope, but I'm going to do that right now!
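
As a rough sketch of the test sequence suggested above, the snippet below only builds the smartctl command lines for one drive and one test type; it does not run anything. The helper name smart_commands is hypothetical, and the /dev/ad4 and /dev/ad6 paths are assumed from the ad4/ad6 device names in this thread. Only one test type is started per drive because starting a second self-test typically aborts the one already running.

```python
def smart_commands(device, test):
    """Return smartctl argument lists for starting one test and reading results.

    `device` is a device node such as /dev/ad4 (assumed path);
    `test` is "long" or "offline", matching `smartctl -t long|offline`.
    """
    if test not in ("long", "offline"):
        raise ValueError("unsupported test type: %r" % (test,))
    return [
        ["smartctl", "-t", test, device],        # start the test in the background
        ["smartctl", "-l", "selftest", device],  # read the self-test log afterwards
        ["smartctl", "-A", device],              # read the vendor attribute table
    ]


if __name__ == "__main__":
    # Print the commands for both disks so they can be reviewed before running.
    for dev in ("/dev/ad4", "/dev/ad6"):
        for argv in smart_commands(dev, "long"):
            print(" ".join(argv))
```

The log and attribute commands are only meaningful once the test has finished, which for a long test on a 500GB drive can take hours.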

>> The only interesting bit of evidence I could find is that when these errors
>> do occur, smartctl reports an increase in the Start_Stop_Count field on
>> ad6. ad4, which appears to work fine, doesn't demonstrate this and has a
>> much lower value.
>
> Start_Stop_Count indicates the drive is actually stopping then spinning
> back up (usually caused by a reset of some kind; equivalent of powering
> down then back up but without the loss of power).  It's possible that
> your drive has actual problems -- this is supported by the fact that the
> problem follows the disk (when moving the disk to another SATA port).
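
One way to watch for the Start_Stop_Count increase described above is to capture the output of "smartctl -A" before and after an error and compare the raw values. The following is a minimal sketch, assuming the usual ten-column attribute table that smartctl prints; the function name raw_attribute and the sample lines (including the raw values 52 and 55) are made up for illustration and are not taken from the drives in this thread.

```python
def raw_attribute(smartctl_output, name="Start_Stop_Count"):
    """Return the RAW_VALUE of a SMART attribute as an int, or None if absent.

    Expects text in the layout printed by `smartctl -A`:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns; the name is the second one.
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])
    return None


# Illustrative samples only; real output has a header and many more rows.
SAMPLE_BEFORE = (
    "  4 Start_Stop_Count        0x0032   100   100   000"
    "    Old_age   Always       -       52\n"
)
SAMPLE_AFTER = (
    "  4 Start_Stop_Count        0x0032   100   100   000"
    "    Old_age   Always       -       55\n"
)

if __name__ == "__main__":
    before = raw_attribute(SAMPLE_BEFORE)
    after = raw_attribute(SAMPLE_AFTER)
    print("Start_Stop_Count rose by", after - before)
```

A jump in the raw value that lines up with the ata timeouts in the system log would support the idea that the drive is resetting or spinning down and back up.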

I'm leaning ever closer to blaming the disk. I still can't explain why
I couldn't make it misbehave on its own, in a ZFS pool or with UFS
filesystems. However, shortly after setting the dubious disk offline
using zpool, I poked at it with 'atacontrol cap' and managed to wedge  
it. Upon issuing the command it sounded like it was spinning up (it  
should never spin down, although these GP drives are supposed to lower  
their RPM while idle) and atacontrol hung. I couldn't kill it and top  
listed the state as 'ata re'. The rest of the system was responsive,  
but the machine wouldn't shut down properly, presumably on account of
that stuck channel.

> Tracking down the source of this problem usually requires a lot of time,
> money, and trial-and-error techniques.  This is what I'd go with:
>
> 1) See if there's a BIOS update.  I know at least in the case of Intel
> manufactured boards BIOS updates have solved weird problems like this in
> the past.

None. BIOS version 1.0 doesn't leave me convinced it's bug-free, though ;)

> 2) Try an Advanced RMA with Western Digital (which guarantees you get a
> brand new drive rather than chancing that they repair the one you send
> them) and see if a new drive helps.

I'll definitely look into that.

> 3) Try replacing the motherboard with a different brand (non-MSI).  I
> have nothing against MSI, but switching vendors usually means that you
> rule out a cross-model h/w bug (e.g. something the vendor does in the
> BIOS or engineering which is suspect).  Try Asus or Gigabyte.  Obviously
> this will cost money and will very likely set you back the cost of the
> motherboard you have currently, but it's a viable option since you've
> already tried replacing SATA cables.

I suppose I could always stick the disks in another box, boot it up,  
and see what happens. Actually, I may just do that next.

> I'm not sure why ZFS would cause something like this to happen vs. UFS.
> I happen to run ZFS at home (same machine as what's mentioned in PR
> 103435, with the replaced motherboard of course) doing very heavy disk
> I/O across two disks, and I have never seen problems of this sort.  That
> doesn't mean there isn't a problem, just that I haven't encountered it
> with ZFS.

I'm not convinced it's any issue with ZFS or FreeBSD. Rather, it seems  
that using a ZFS mirror just makes that drive unhappy posthaste. If I  
didn't want to avoid rebuilding the dataset, I'd try gmirror. I  
probably haven't been patient enough to let the problem exhibit itself  
outside of the mirrored configuration.

[snip working system tease]

Thanks for all your input,
Steve