Linux NFSv4 clients are getting (bad sequence-id error!)

Rick Macklem rmacklem at uoguelph.ca
Thu Jul 23 10:41:10 UTC 2015


Ahmed Kamal wrote:
> The rhel6 servers' logs were flooded with errors like: http://paste2.org/EwLGcGF6
> The FreeBSD box was being pounded with 40Mbps of NFS traffic .. probably
> Linux was retrying too hard?! I had to reboot all the PCs, and after the last
> one, nfsd CPU usage dropped immediately to zero.
> 
Btw, it would be interesting to know what triggers these things (overload of
the nfs server resulting in very slow response or ???). Basically Destroy_session
isn't an operation that a client would normally do. I have no idea why the Linux
client would do it. (A session is what achieves the "exactly once" semantics for
the RPCs. It should really be in the RPC layer, but the NFSv4 working group put
it in NFSv4.1 because they didn't want to replace Sun RPC. I can't think of a reason
to destroy a session except on dismount. Maybe if the client thinks the session is
broken for some reason??)
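
(For the curious, the slot bookkeeping a server does for each SEQUENCE op, per RFC 5661
sec. 2.10.6, is conceptually something like the sketch below. The names are invented for
illustration; this is not the actual code in the FreeBSD tree.)

  #include <stdint.h>

  struct slot {
      uint32_t  seqid;          /* last sequence id seen on this slot */
      void     *cached_reply;   /* reply saved for retransmits */
  };

  /* Returns 0 for a retransmit, 1 for a new request, -1 for a misordered
   * one (NFS4ERR_SEQ_MISORDERED in the real protocol). */
  int
  nfs41_sequence(struct slot *sl, uint32_t seqid, void **replyp)
  {
      if (seqid == sl->seqid) {
          /* Retransmit: hand back the cached reply, don't re-execute. */
          *replyp = sl->cached_reply;
          return (0);
      }
      if (seqid == sl->seqid + 1) {
          /* New request: execute it and cache the reply in the slot. */
          sl->seqid = seqid;
          return (1);
      }
      return (-1);
  }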

Maybe run "vmstat -m", "vmstat -z" and "nfsstat -s -e" repeatedly (once/sec,
with timestamps via "date" or similar), so that you can see what was happening just
before the meltdowns.
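
Something along these lines would do it (just a sketch; adjust the interval and the
log path to taste):

  while true; do
      date
      vmstat -m
      vmstat -z
      nfsstat -s -e
      sleep 1
  done > /var/tmp/nfs-stats.log 2>&1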

A raw packet trace from just when the meltdown starts would be useful, but I can't think
of how you'd get one of reasonable size. Maybe have "tcpdump -s 0 -w <file>.pcap host <client-host>"
run for 1sec and then kill/restart it repeatedly with different file names, so you might get
a useful 1sec capture at the critical time?
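
For example (untested; if your tcpdump is new enough, "tcpdump -G 1 -w <file>-%s.pcap ..."
can do the per-second rotation by itself):

  i=0
  while true; do
      tcpdump -s 0 -w /var/tmp/nfs-$i.pcap host <client-host> &
      sleep 1
      kill $!
      i=$((i + 1))
  done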

Anyhow, good luck with it, rick

> On Tue, Jul 21, 2015 at 5:52 AM, Ahmed Kamal <
> email.ahmedkamal at googlemail.com> wrote:
> 
> > More info .. Just noticed nfsd is spinning the cpu at 500% :( I just did
> > the dtrace with:
> >
> > dtrace -n 'profile-1001 { @[stack()] = count(); }'
> > The result is at http://paste2.org/vb8ZdvF2 (scroll to bottom)
> >
> > Since rebooting the nfs server didn't fix it .. I imagine I'd have to
> > reboot all NFS clients .. This would be really sad .. Any advice is most
> > appreciated .. Thanks
> >
> >
> > On Tue, Jul 21, 2015 at 5:26 AM, Ahmed Kamal <
> > email.ahmedkamal at googlemail.com> wrote:
> >
> >> Hi folks,
> >>
> >> I've upgraded a test client to rhel6 today, and I'll keep an eye on it to
> >> see what happens.
> >>
> >> During the process, I made the mistake (I guess) of zfs send | recv to a
> >> locally attached usb disk for backup purposes .. long story short, the
> >> sharenfs property on the received filesystem was causing some nfs/mountd
> >> errors in the logs .. I wasn't too happy with what I got .. I destroyed
> >> the backup datasets and eventually the whole pool .. and then rebooted
> >> the whole nas box .. After reboot my logs are still flooded with
> >>
> >> Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
> >> Jul 21 05:13:07 nas last message repeated 7536 times
> >> Jul 21 05:15:08 nas last message repeated 29664 times
> >>
> >> Not sure what that means .. or how it can be stopped .. Anyway, will keep
> >> you posted on progress.
> >>
> >> On Fri, Jul 17, 2015 at 9:31 PM, Rick Macklem <rmacklem at uoguelph.ca>
> >> wrote:
> >>
> >>> Graham Allan wrote:
> >>> > I'm curious how things are going for you with this?
> >>> >
> >>> > Reading your thread did pique my interest since we have a lot of
> >>> > Scientific Linux (RHEL clone) boxes with FreeBSD NFSv4 servers. I meant
> >>> > to glance through our logs for signs of the same issue, but today I
> >>> > started investigating a machine which appeared to have hung processes,
> >>> > high rpciod load, and high traffic to the NFS server. Of course it is
> >>> > exactly this issue.
> >>> >
> >>> > The affected machine is running SL5, though most of our server nodes
> >>> > are now SL6. I can see errors from most of them, but the SL6 systems
> >>> > appear less affected - I see a stream of the sequence-id errors in
> >>> > their logs, but things in general keep working. The one SL5 machine
> >>> > I'm looking at has a single sequence-id error in today's logs, but
> >>> > then goes into a stream of "state recovery failed" then "Lock reclaim
> >>> > failed". It's probably partly related to the particular workload on
> >>> > this machine.
> >>> >
> >>> > I would try switching our SL6 machines to NFS 4.1 to see if the
> >>> > behaviour changes, but 4.1 isn't supported by our 9.3 servers (is it in
> >>> > 10.1?).
> >>> >
> >>> Btw, I've done some testing against a fairly recent Fedora and haven't
> >>> seen the problem. If either of you guys could load a recent Fedora on a
> >>> test client box, it would be interesting to see if it suffers from this.
> >>> (My experience is that the Fedora distros have more up-to-date Linux NFS
> >>> clients.)
> >>>
> >>> rick
> >>>
> >>> > At the NFS servers, most of the sysctl settings are already tuned
> >>> > from defaults, e.g. vfs.nfsd.tcphighwater=100000,
> >>> > vfs.nfsd.tcpcachetimeo=300, and 128-256 nfsd kernel threads.
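
(For reference, setting those on the FreeBSD server looks something like the below;
the exact names are from memory, so double-check with "sysctl -d vfs.nfsd".)

  # sysctl vfs.nfsd.tcphighwater=100000
  # sysctl vfs.nfsd.tcpcachetimeo=300
  # the nfsd thread count is set at startup, e.g. in /etc/rc.conf:
  #   nfs_server_flags="-u -t -n 256"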
> >>> >
> >>> > Graham
> >>> >
> >>> > On Fri, Jul 03, 2015 at 01:21:00AM +0200, Ahmed Kamal via freebsd-fs wrote:
> >>> > > PS: Today (after adjusting vfs.nfsd.tcphighwater) I didn't get any
> >>> > > screaming reports from users about hung vnc sessions. So maybe, just
> >>> > > maybe, linux clients are able to somehow recover from these bad
> >>> > > sequence messages. I could still see the bad sequence error message
> >>> > > in the logs, though.
> >>> > >
> >>> > > Why isn't the highwater tunable set to something better by default?
> >>> > > I mean, this server is certainly not under a high or unusual load
> >>> > > (it's only 40 PCs mounting from it).
> >>> > >
> >>> > > On Fri, Jul 3, 2015 at 1:15 AM, Ahmed Kamal
> >>> > > <email.ahmedkamal at googlemail.com> wrote:
> >>> > >
> >>> > > > Thanks all .. I understand now we're doing the "right thing" ..
> >>> > > > Although if mounting keeps wedging, I will have to solve it
> >>> > > > somehow! Either using Xin's patch .. or upgrading RHEL to 6.x and
> >>> > > > using NFS4.1.
> >>> > > >
> >>> > > > Regarding Xin's patch, is it possible to build the patched nfsd
> >>> > > > code as a kernel module? I'm looking to minimize my delta to
> >>> > > > upstream.
> >>> > > >
> >>> > > > Also, would adopting Xin's patch and hiding it behind a
> >>> > > > kern.nfs.allow_linux_broken_client sysctl be an option? (I'm
> >>> > > > probably not the last person on earth to hit this.)
> >>> > > >
> >>> > > > Thanks a lot for all the help!
> >>> > > >
> >>> > > > On Thu, Jul 2, 2015 at 11:53 PM, Rick Macklem
> >>> > > > <rmacklem at uoguelph.ca> wrote:
> >>> > > >
> >>> > > >> Ahmed Kamal wrote:
> >>> > > >> > Appreciating the fruitful discussion! Can someone please explain
> >>> > > >> > to me what would happen in the current situation (linux client
> >>> > > >> > doing this skip-by-1 thing, and freebsd not doing it)? What is
> >>> > > >> > the effect of that?
> >>> > > >> Well, as you've seen, the Linux client doesn't function correctly
> >>> > > >> against the FreeBSD server (and probably others that don't support
> >>> > > >> this "skip-by-1" case).
> >>> > > >>
> >>> > > >> > What do users see? Any chances of data loss?
> >>> > > >> Hmm. Mostly it will cause Opens to fail, but I can't guess what
> >>> > > >> the Linux client behaviour is after receiving NFS4ERR_BAD_SEQID.
> >>> > > >> You're the guy observing it.
> >>> > > >>
> >>> > > >> >
> >>> > > >> > Also, I find it strange that netapp have acknowledged this is a
> >>> > > >> > bug on their side, which has been fixed since then!
> >>> > > >> Yea, I think Netapp screwed up. For some reason their server
> >>> > > >> allowed this, then it was fixed to not allow it, and then someone
> >>> > > >> decided that was broken and reversed it.
> >>> > > >>
> >>> > > >> > I also find it strange that I'm the first to hit this :) Is no
> >>> > > >> > one running nfs4 yet!
> >>> > > >> >
> >>> > > >> Well, it seems to be slowly catching on. I suspect that the Linux
> >>> > > >> client mounting a Netapp is the most common use of it. Since it
> >>> > > >> appears that they flip-flopped w.r.t. whose bug this is, it has
> >>> > > >> probably persisted.
> >>> > > >>
> >>> > > >> It may turn out that the Linux client has been fixed, or it may
> >>> > > >> turn out that most servers allowed this "skip-by-1", even though
> >>> > > >> David Noveck (one of the main authors of the protocol) seems to
> >>> > > >> agree with me that it should not be allowed.
> >>> > > >>
> >>> > > >> It is possible that others have bumped into this, but it wasn't
> >>> > > >> isolated (I wouldn't have guessed it, so it was good you pointed
> >>> > > >> to the RedHat discussion) and they worked around it by reverting
> >>> > > >> to NFSv3 or similar. The protocol is rather complex in this area
> >>> > > >> and changed completely for NFSv4.1, so many have also probably
> >>> > > >> moved on to NFSv4.1, where this won't be an issue. (NFSv4.1 uses
> >>> > > >> sessions to provide exactly-once RPC semantics and doesn't use
> >>> > > >> these seqid fields.)
> >>> > > >>
> >>> > > >> This is all just mho, rick
> >>> > > >>
> >>> > > >> > On Thu, Jul 2, 2015 at 1:59 PM, Rick Macklem
> >>> > > >> > <rmacklem at uoguelph.ca> wrote:
> >>> > > >> >
> >>> > > >> > > Julian Elischer wrote:
> >>> > > >> > > > On 7/2/15 9:09 AM, Rick Macklem wrote:
> >>> > > >> > > > > I am going to post to nfsv4 at ietf.org to see what they
> >>> > > >> > > > > say. Please let me know if Xin Li's patch resolves your
> >>> > > >> > > > > problem, even though I don't believe it is correct except
> >>> > > >> > > > > for the UINT32_MAX case. Good luck with it, rick
> >>> > > >> > > > and please keep us all in the loop as to what they say!
> >>> > > >> > > >
> >>> > > >> > > > the general N+2 bit sounds like bullshit to me.. it's
> >>> > > >> > > > always N+1 in a number field that has a bit of slack at
> >>> > > >> > > > wrap time (probably due to some ambiguity in the original
> >>> > > >> > > > spec).
> >>> > > >> > > >
> >>> > > >> > > Actually, since N is the lock op already done, N + 1 is the
> >>> > > >> > > next lock operation in order. Since lock ops need to be
> >>> > > >> > > strictly ordered, allowing N + 2 (which means N + 2 would be
> >>> > > >> > > done before N + 1) makes no sense.
> >>> > > >> > >
> >>> > > >> > > I think the author of the RFC meant that N + 2 or greater
> >>> > > >> > > fails, but it was poorly worded.
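
(To make that concrete: the seqid check the server does per open-owner/lock-owner
is conceptually just the following. A sketch with invented names, not the actual
FreeBSD code; a knob like the allow_linux_broken_client sysctl Ahmed suggested
above would relax the last case.)

  #include <stdint.h>

  #define NFS4ERR_BAD_SEQID 10026  /* the error the Linux clients are seeing */

  /* Returns 0 for a retransmit, 1 for the in-order case, else the NFSv4 error. */
  int
  check_seqid(uint32_t last_seqid, uint32_t new_seqid)
  {
      if (new_seqid == last_seqid)
          return (0);  /* retransmit of op N: replay the cached reply */
      if (new_seqid == last_seqid + 1)
          return (1);  /* N + 1: the only "next" op the RFC allows (unsigned
                          arithmetic makes UINT32_MAX + 1 come out as 0) */
      /* N + 2 or anything else, including the Linux "skip-by-1": */
      return (NFS4ERR_BAD_SEQID);
  }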
> >>> > > >> > >
> >>> > > >> > > I will pass along whatever I get from nfsv4 at ietf.org.
> >>> > > >> > > (There is an archive of it somewhere, but I can't remember
> >>> > > >> > > where. ;-)
> >>> > > >> > >
> >>> > > >> > > rick
> >>> > > >> > >
> >>> > > >> >
> >>> > > >>
> >>> > > >
> >>> > > >
> >>> >
> >>> > --
> >>> > -------------------------------------------------------------------------
> >>> > Graham Allan - allan at physics.umn.edu - gta at umn.edu - (612) 624-5040
> >>> > School of Physics and Astronomy - University of Minnesota
> >>> > -------------------------------------------------------------------------
> >>> >
> >>>
> >>
> >>
> >
> 


More information about the freebsd-fs mailing list