ZFS stalled after some mirror disks were lost

Ben RUBSON ben.rubson at gmail.com
Sat Oct 7 13:57:37 UTC 2017


> On 07 Oct 2017, at 15:08, Fabian Keil <freebsd-listen at fabiankeil.de> wrote:
> 
> Ben RUBSON <ben.rubson at gmail.com> wrote:
> 
>> So first, many thanks again to Andriy; we spent almost 3 hours debugging
>> the stalled server to find the root cause of the issue.
>> 
>> Sounds like I will need help from the iSCSI dev team (Edward perhaps?),
>> as the issue seems to be on that side.
> 
> Maybe.
> 
>> Here is Andriy's conclusion after the debug session; I quote him:
>> 
>>> So, it seems that the root cause of all evil is this outstanding zio
>>> (it might not be the only one).
>>> In other words, it looks like the iSCSI stack bailed out without
>>> completing all the outstanding I/O requests that it had.
>>> It should return either success or an error for every request; it
>>> cannot simply drop a request.
>>> And that appears to be what happened here.
>> 
>>> It looks like ZFS is fragile in the face of this type of error.
> 
> Indeed. In the face of other types of errors as well, though.
> 
>>> Essentially, each logical I/O request obtains a configuration lock of
>>> type 'zio' in shared mode to prevent certain configuration changes
>>> from happening while there are any outstanding zios.
>>> If a zio is lost, then this lock is leaked.
>>> Then, the code that deals with vdev failures tries to take this lock
>>> in exclusive mode while holding a few other configuration locks, also
>>> in exclusive mode, so any other thread needing those locks would block.
>>> And there are code paths where a configuration lock is taken while
>>> spa_namespace_lock is held.
>>> And when spa_namespace_lock is never dropped, the system is close
>>> to toast, because all pool lookups would get stuck.
>>> I don't see how this can be fixed in ZFS.
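
For anyone who (like me) needed a picture of that lock chain: below is a
toy userland model of the stall, using plain pthreads. All names are
illustrative stand-ins for the real ZFS primitives (the 'zio' config
lock, spa_namespace_lock); this is not OpenZFS code, only the failure
pattern in miniature.

/* compile with: cc -o stall stall.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t scl_zio = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  spa_namespace = PTHREAD_MUTEX_INITIALIZER;

/* A logical I/O takes the config lock shared.  The backing store then
 * drops the request, so the completion path that would release the
 * shared hold never runs: the hold is leaked. */
static void *
lost_zio(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&scl_zio);
	pause();	/* no completion, no unlock, ever */
	return (NULL);
}

/* The vdev-failure path grabs the namespace lock and then wants the
 * config lock exclusively.  It blocks forever on the leaked shared
 * hold, and keeps spa_namespace held while doing so. */
static void *
vdev_failure(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&spa_namespace);
	printf("vdev failure: waiting for exclusive config lock...\n");
	pthread_rwlock_wrlock(&scl_zio);	/* never returns */
	pthread_mutex_unlock(&spa_namespace);
	return (NULL);
}

/* Every pool lookup needs spa_namespace, so it hangs as well; that is
 * why the whole system stalls, not just the affected pool. */
static void *
pool_lookup(void *arg)
{
	(void)arg;
	printf("pool lookup: waiting for spa_namespace_lock...\n");
	pthread_mutex_lock(&spa_namespace);	/* never returns */
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2, t3;

	pthread_create(&t1, NULL, lost_zio, NULL);
	sleep(1);	/* let the shared hold be taken first */
	pthread_create(&t2, NULL, vdev_failure, NULL);
	sleep(1);	/* let it grab spa_namespace and block */
	pthread_create(&t3, NULL, pool_lookup, NULL);
	sleep(2);
	printf("all three threads are still blocked -> stalled\n");
	return (0);
}

Completing the lost request with either success or an error would drop
the shared hold and unblock the whole chain, which is exactly the
contract Andriy says the transport must honour.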
> 
> While I haven't used iSCSI for a while now, over the years I've seen
> lots of similar issues with ZFS pools located on external USB disks
> and ggate devices (backed by systems with patches for the known data
> corruption issues).
> 
> At least in my opinion, many of the various known spa_namespace_lock
> issues are plain ZFS issues and could be fixed in ZFS if someone was
> motivated enough to spend the time to actually do it (and then jump
> through the various "upstreaming" hoops).
> 
> Tolerable workarounds exist in many cases, though, and sometimes they
> mitigate the issues well enough. Here's an example workaround that
> I've been using for a while now:
> https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0222-ZFS-Optionally-let-spa_sync-wait-until-at-least-one-v.diff
> 
> According to the commit message, the issue was previously mentioned on
> freebsd-current@ in 2014, but I no longer remember all the details and
> didn't look them up.

There's no mention of a code revision in that thread.
It ends with a message from Alexander Motin:
"(...) I've got to conclusion that ZFS in many places
written in a way that simply does not expect errors. In such cases it
just stucks, waiting for disk to reappear and I/O to complete. (...)"
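
That matches the shape of the wait loop in zio_wait() as I remember it
(paraphrased from memory, so details may differ between versions):

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL)	/* cleared only by zio_done() */
		cv_wait(&zio->io_cv, &zio->io_lock);
	mutex_exit(&zio->io_lock);

There is no timeout and no error path in that loop: if the transport
never completes the zio, zio_done() never runs and the waiter sleeps
forever.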

> I'm not claiming that the patch or other workarounds I'm aware of
> would actually help with your ZFS stalls at all, but it's not obvious
> to me that your problems can actually be blamed on the iSCSI code
> either.
> 
> Did you try to reproduce the problem without iSCSI?

No, I would have to pull disks out of their slots (well...), shut down
the SAS2008-IT adapter, or put disks offline (not sure how to do the
latter two).

I will test in the next few hours without GPT labels and GEOM labels,
as I use them and Andriy suspects they could be the culprit.

> Anyway, good luck with your ZFS-on-iSCSI issue(s).

Thank you very much, Fabian, for your help and contribution.
I really hope we'll find the root cause of this issue,
as it's quite annoying in a production environment where HA is expected :/

Ben


