[Bug 206448] ZFS hang/stall when drives in ATA mode

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Wed Jan 20 21:37:02 UTC 2016


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206448

            Bug ID: 206448
           Summary: ZFS hang/stall when drives in ATA mode
           Product: Base System
           Version: 10.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: danmcgrath.ca at gmail.com
                CC: freebsd-amd64 at FreeBSD.org

Created attachment 165888
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=165888&action=edit
Screenshot of ata console error

I had a Dell PowerEdge R210 amd64 system that was exhibiting some odd
behaviour. A year or two ago one of the system's two 1TB SATA drives dropped
out of the RAID, but surprisingly I simply added it back and it has been fine
ever since. Then this week I installed py27-salt on the servers.

After installing salt everything seemed fine for the first day. After the daily
mails for the machine came in, however, I noticed that the daily periodic run
had got stuck running some smartd checks for the log. I tried to kill the
process but was unable to, which prompted a reboot. After the reboot there were
jails that refused to start, and I suddenly found myself unable to do any
writes to the drive, with only the message "ata2: already connected!" showing
up on the console.

After some digging (thanks to auditd, salt and the system logs), I was able to
narrow the trigger down to some camcontrol inquiry and identify commands that
would reliably trigger the problem.
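
For anyone trying to reproduce this, here is a minimal sketch of the kind of
commands involved. The exact invocations that salt and smartd issued are not
recorded in this report, so the device names (ada0, ada1), the `probe_disks`
helper and the `DRY_RUN` variable are assumptions made for illustration. By
default the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical reproduction helper; device names are assumptions.
# By default the commands are only printed. Set DRY_RUN=0 to actually
# execute them (FreeBSD only, needs root).
probe_disks() {
    for disk in "$@"; do
        for sub in inquiry identify; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "camcontrol $sub $disk"
            else
                camcontrol "$sub" "$disk"
            fi
        done
    done
}

probe_disks ada0 ada1
```

Running it with DRY_RUN=0 on an affected machine is likely to stall I/O, so it
is best tried only on a system that can be rebooted.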

After some more digging I noticed that only this server (out of several
identical or near-identical machines) was showing the problem, and that for
some strange reason /dev/gpt/swap0 (and swap1) device nodes existed only on
this system. Also odd was that when I tried some tests with stopping swap
(`gmirror stop swap`), the moment I stopped the swap mirror it was redetected,
but under different device names (see the screenshot of the console in the
attachments). I also noticed that the dmesg of this system only was showing
some odd "unmapped" messages:

  GEOM_MIRROR: cancelling unmapped because of ada0p2
  GEOM_MIRROR: cancelling unmapped because of ada1p2
  GEOM_MIRROR: Device mirror/swap launched (2/2).

As for the ZFS symptoms, when the console showed the "already attached!"
error, ZFS (this was a ZFS install with the mirrored swap option enabled) would
no longer allow writes (or would allow them only very slowly, in the area of
1 IOPS), and reads would eventually fail (when doing a test with `find /`),
which I assume happened once they ran out of cache entries.

In the end I discovered that the BIOS had the drives set to ATA mode
instead of AHCI or RAID, and correcting this setting seems to have solved the
problem. While I can't know for sure whether this is a "bug" or just a known
limitation of ATA, it almost seems as if camcontrol was briefly
disconnecting the drives when issuing commands to them, which in turn caused
the swap device to switch from ada0p2 to gpt/swap0 and vice versa, possibly
triggering some sort of bug in ZFS.
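
For anyone wanting to check whether a machine is in the same configuration,
here is a rough sketch. It assumes that disks behind a controller in ATA mode
attach through ata(4) channels (ataN, as in the "ata2:" console message), while
AHCI mode uses ahci(4) channels (ahcichN). The `legacy_ata_buses` helper name
is made up for illustration; it simply filters the bus lines of `camcontrol
devlist -v`.

```shell
#!/bin/sh
# Reads `camcontrol devlist -v` output on stdin and prints only the bus
# lines attached through the legacy ata(4) driver,
# e.g. "scbus2 on ata2 bus 0:".
legacy_ata_buses() {
    grep -E '^scbus[0-9]+ on ata[0-9]+ '
}

# Only query the hardware when camcontrol is available (i.e. on FreeBSD).
if command -v camcontrol >/dev/null 2>&1; then
    camcontrol devlist -v | legacy_ata_buses \
        || echo "no legacy ata(4) buses found"
fi
```

Any line printed suggests the controller is running in ATA mode and the BIOS
setting is worth checking.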

Anyway, this is the report, and hopefully it helps fix a possible bug lurking
around the system that could cause problems for other users.

Cheers o/

-- 
You are receiving this mail because:
You are the assignee for the bug.

