Re: ZFS deadlocks triggered by HDD timeouts

From: Warner Losh <imp_at_bsdimp.com>
Date: Wed, 01 Dec 2021 21:45:52 UTC
On Wed, Dec 1, 2021, 2:36 PM Alan Somers <asomers@freebsd.org> wrote:

> On Wed, Dec 1, 2021 at 1:56 PM Warner Losh <imp@bsdimp.com> wrote:
> >
> >
> >
> > On Wed, Dec 1, 2021 at 1:47 PM Alan Somers <asomers@freebsd.org> wrote:
> >>
> >> On Wed, Dec 1, 2021 at 1:37 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >
> >> >
> >> >
> >> > On Wed, Dec 1, 2021 at 1:28 PM Alan Somers <asomers@freebsd.org> wrote:
> >> >>
> >> >> On Wed, Dec 1, 2021 at 11:25 AM Warner Losh <imp@bsdimp.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 1, 2021, 11:16 AM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >>
> >> >> >> On a stable/13 build from 16-Sep-2021 I see frequent ZFS deadlocks
> >> >> >> triggered by HDD timeouts.  The timeouts are probably caused by
> >> >> >> genuine hardware faults, but they didn't lead to deadlocks in
> >> >> >> 12.2-RELEASE or 13.0-RELEASE.  Unfortunately I don't have much
> >> >> >> additional information.  ZFS's stack traces aren't very informative,
> >> >> >> and dmesg doesn't show anything besides the usual information about
> >> >> >> the disk timeout.  I don't see anything obviously related in the
> >> >> >> commit history for that time range, either.
> >> >> >>
> >> >> >> Has anybody else observed this phenomenon?  Or does anybody have a
> >> >> >> good way to deliberately inject timeouts?  CAM makes it easy enough to
> >> >> >> inject an error, but not a timeout.  If it did, then I could bisect
> >> >> >> the problem.  As it is I can only reproduce it on production servers.
> >> >> >
> >> >> >
> >> >> > What SIM? Timeouts are tricky because they have many sources, some of which are nonlocal...
> >> >> >
> >> >> > Warner
> >> >>
> >> >> mpr(4)
> >> >
> >> >
> >> > Is this just a single drive that's acting up, or is the controller initialized as part of the error recovery?
> >>
> >> I'm not doing anything fancy with mprutil or sas3flash, if that's what
> >> you're asking.
> >
> >
> > No. I'm asking if you've enabled debugging on the recovery messages and see that we enter any kind of
> > controller reset when the timeouts occur.
>
> No.  My CAM setup is the default except that I enabled CAM_IO_STATS
> and changed the following two sysctls:
> kern.cam.da.retry_count=2
> kern.cam.da.default_timeout=10
>
>
> >
> >>
> >> > If a single drive,
> >> > are there multiple timeouts that happen at the same time such that we timeout a request while we're waiting for
> >> > the abort command we send to the firmware to be acknowledged?
> >>
> >> I don't know.
> >
> >
> > OK.
> >
> >>
> >> > Would you be able to run a kgdb script to see
> >> > if you're hitting a situation that I fixed in mpr that would cause I/O to never complete in this rather odd circumstance?
> >> > If you can, and if it is, then there's a change I can MFC :).
> >>
> >> Possibly.  When would I run this kgdb script?  Before ZFS locks up,
> >> after, or while the problematic timeout happens?
> >
> >
> > After the timeouts. I've been doing 'kgdb' followed by 'source mpr-hang.gdb' to run this.
> >
> > What you are looking for is anything with a qfrozen_cnt > 0. The script is imperfect and racy
> > with normal operations (but not in a bad way), so you may need to run it a couple of times
> > to get consistent data. On my systems, there'd be one or two devices with a frozen count > 1
> > and no I/O happened on those drives and processes hung. That might not be any different than
> > a deadlock :)
> >
> > Warner
> >
> > P.S. here's the mpr-hang.gdb script. Not sure if I can make an attachment survive the mailing lists :)
>
> Thanks, I'll try that.  If this is the problem, do you have any idea
> why it wouldn't happen on 12.2-RELEASE?  (I haven't seen it on
> 13.0-RELEASE, but maybe I just don't have enough runtime on that
> version.)
>

9781c28c6d63 was merged to stable/13 as a996b55ab34c on Sept 2nd. I fixed a bug
with that version in current as a8837c77efd0, but haven't merged it. I kinda expect that
this might be the cause of the problem. But in Netflix's fleet we've seen this maybe a
couple of times a week over many thousands of machines, so I've been a little cautious
in merging it to make sure that it's really fixed. So far, the jury is out.

Warner
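
For anyone who wants to check which of the commits named above a given
source tree contains, here is a minimal sketch. It assumes the tree is a
git clone of FreeBSD src; the helper names (has_commit, report_commit)
and the /usr/src path are hypothetical and not from this thread. Note
that an MFC gets a new hash on the stable branch, so a8837c77efd0 itself
would only ever show up in a tree tracking current.

```shell
#!/bin/sh
# Sketch only: helper names and paths are illustrative assumptions.

# has_commit TREE COMMIT
# Succeeds if COMMIT is an ancestor of (or equal to) HEAD in TREE.
# An unknown hash makes git fail, which is treated as "not present".
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null
}

# report_commit TREE COMMIT
# Prints "<commit>: present" or "<commit>: absent".
report_commit() {
    if has_commit "$1" "$2"; then
        echo "$2: present"
    else
        echo "$2: absent"
    fi
}

# Example invocations (not executed here; /usr/src is an assumed path):
#   report_commit /usr/src a996b55ab34c   # the MFC on stable/13
#   report_commit /usr/src a8837c77efd0   # the follow-up fix in current
```

On a stable/13 checkout, "present" for a996b55ab34c means the build
includes the merged change but, per the message above, not yet the
follow-up fix.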