Linux NFSv4 clients are getting (bad sequence-id error!)

Rick Macklem rmacklem at uoguelph.ca
Wed Jul 29 21:00:39 UTC 2015


Ahmed Kamal wrote:
> hmm Thanks Rick ..
> 
> You mentioned the error appears when nfsrvboottime != clientid.lval[0] .. I
> understand nfsrvboottime is the number of seconds since the epoch (1970) .. Can
> you please explain what clientid.lval[0] is, and (if it comes from the
> client?) what guarantees it should equal nfsrvboottime?
> 
The low order 32 bits of nfsrvboottime are assigned to clientid.lval[0] when a
new clientid is generated and then sent to a client. The client simply puts the
clientid in requests (the bits are opaque to the client). Since nfsrvboottime
was assumed by the code to fit in 32 bits and does not change (it is set from
time_seconds once during startup), this check detects whether the clientid the
client holds was acquired from the server before the most recent reboot of the
server. (Since the server didn't reboot between the ExchangeID and Create_session
in the packet trace, the value should be the same as what is in nfsrvboottime.)

However, if nfsrvboottime is 64 bits (which is what time_t is on some arches) and
somehow has non-zero high order 32 bits, the comparison will always fail, because
clientid.lval[0] is a uint32_t and can only hold the low order bits. (The patch
just casts nfsrvboottime to (uint32_t) so the comparison ignores the high order
bits of nfsrvboottime.)
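
To make the failure mode concrete, here is the shape of the check (a simplified
sketch based on the comparison quoted further down, not a verbatim copy of the
server source):

    /* nfsrvboottime is a time_t (64 bits on some arches); clientid.lval[0]
     * is a uint32_t that holds the low order 32 bits handed out earlier. */
    if (nfsrvboottime != clientid.lval[0])
        return (NFSERR_STALECLIENTID);

    /* patched: the cast makes the comparison ignore the high order bits */
    if ((uint32_t)nfsrvboottime != clientid.lval[0])
        return (NFSERR_STALECLIENTID);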

> Apart from trying to understand the problem. Can you send me a small c
> program that runs the same code that computes nfsrvboottime and writes that
> to the terminal window. I would like to avoid testing a kernel patch on
> this system since it runs in production. And last time I rebooted the nfs
> server, I ended up having to reboot all clients (every single workstation)
> so that was painful .. So if we just want to know if the number is bigger
> than 4 billion or not, I think this small app can help us get this value
> right?
> 
Since it is in the kernel and declared "static", I don't know of a way for a C app to read it.
(I just booted with a printf in the kernel for it, to see what it was set to.)
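
For a rough sanity check from userland (my suggestion, not a way to read the
kernel's static copy), an untested sketch like this prints what time_seconds is
near right now, since nfsrvboottime is just a copy of time_seconds taken when
the nfsd starts:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        /* time(3) reads the same wall clock that time_seconds tracks */
        time_t now = time(NULL);

        printf("seconds since the epoch = %jd\n", (intmax_t)now);
        printf("low order 32 bits = %u\n", (uint32_t)now);
        printf("fits in 32 bits: %s\n",
            now == (time_t)(uint32_t)now ? "yes" : "no");
        return (0);
    }

Of course, this can't show whether the kernel's copy was somehow set to
something different when the nfsd started, so it is only a ballpark.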

I just spotted a place in the code that allocates clientids (nfsrv_setclient()) that
looks broken when the entry is on the last hash table element.
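
(To illustrate the class of bug, here is a purely hypothetical sketch, not the
actual nfsrv_setclient() code: a scan that stops one bucket short treats an
entry that hashes to the last chain as "not found".)

    #include <stdint.h>
    #include <stdio.h>

    #define HASHSIZE 4

    struct client {
        struct client *next;            /* hash chain link */
        uint32_t id;
    };

    static struct client *
    find_client(struct client *hash[], uint32_t id)
    {
        int i;

        /* off-by-one: should be i < HASHSIZE, so the last chain is skipped */
        for (i = 0; i < HASHSIZE - 1; i++)
            for (struct client *clp = hash[i]; clp != NULL; clp = clp->next)
                if (clp->id == id)
                    return (clp);
        return (NULL);
    }

    int
    main(void)
    {
        struct client c = { NULL, 42 };
        struct client *hash[HASHSIZE] = { NULL };

        hash[HASHSIZE - 1] = &c;        /* entry lands in the last bucket */
        /* prints "found: no" even though the entry exists */
        printf("found: %s\n", find_client(hash, 42) != NULL ? "yes" : "no");
        return (0);
    }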

I will email you a patch for this. To be honest, the only way this will get resolved
is if you can test the patch(es). I understand that rebooting the server isn't
attractive (and ideally is done only when all clients have unmounted it), but this
is all I can suggest. (Maybe build a kernel with the patches on the server and then
reboot into the new kernel when it crashes or has to be rebooted for some other
reason? There is no rush on my end, so it depends on how problematic the issue is
for you.)

rick


> On Wed, Jul 29, 2015 at 1:35 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> > Ahmed Kamal wrote:
> > > hmm, if I understand you correctly, this time_seconds value is the number
> > > of seconds till the box booted?
> > No, it is the number of seconds when the box booted. Once, it was supposed
> > to be the number of seconds since Jan. 1, 1970, but I don't know if that is
> > still the case. (For my i386, it is about 1.4 billion when it boots, so I'd
> > guess they still use Jan. 1, 1970. 3600*24*365*45 = 1419120000. Yea, I
> > didn't bother with leap years, etc.)
> >
> > Now, I don't know if the clock could somehow set it to a value > 4 billion
> > when the nfsd starts (it copies time_seconds to nfsrvboottime as it starts
> > up), but the current clock code is pretty convoluted stuff, so??
> >
> > rick
> > ps: From the NFSv4 server's point of view, it only needs a number that is
> >     unique and changes every time the server reboots. As such, using the
> >     low order 32 bits of it would be sufficient, even if it exceeds
> >     4 billion. However, the code incorrectly assumes it won't exceed
> >     4*1024*1024*1024 - 1 unless you apply the patch.
> >
> > > If so, I guess this is not really the
> > > cause of what we're seeing as the box is only up for 8 days
> > >
> > > bsd# uptime
> > > 11:28AM  up 8 days,  6:20, 6 users, load averages: 0.94, 0.91, 0.84
> > >
> > > The NFS client box's uptime is
> > > linux# uptime
> > >  11:31:39 up 8 days,  5:51, 11 users,  load average: 87.74, 87.43, 87.35
> > >
> > > and yes the huge load is most likely due to this NFS bug
> > >
> > > On Wed, Jul 29, 2015 at 1:39 AM, Rick Macklem <rmacklem at uoguelph.ca>
> > > wrote:
> > >
> > > > Ahmed Kamal wrote:
> > > > > Hi again Rick,
> > > > >
> > > > > Seems that I'm still being unlucky with nfs :/ I caught one of the newly
> > > > > installed RHEL6 boxes having high CPU usage, and bombarding the BSD NFS
> > > > > box with 10Mbps traffic .. I caught a tcpdump as you mentioned .. You can
> > > > > download it here:
> > > > >
> > > > > https://dl.dropboxusercontent.com/u/51939288/nfs41-high-client-cpu.pcap.bz2
> > > > >
> > > > Ok, the packet trace suggests that the NFSv4 server is broken (it is
> > > > replying with NFS4ERR_STALE_CLIENTID for a recently generated ClientID).
> > > > Now, I can't be sure, but the only explanation I can come up with is...
> > > > - For some arches (I only have i386, so I wouldn't have seen this during
> > > >   testing), time_t is 64 bits (int64_t).
> > > >   --> If time_seconds somehow doesn't fit in the low order 32 bits, then
> > > >       the code would be busted for these arches because nfsrvboottime is
> > > >       set to time_seconds when the server is started and then there are
> > > >       comparisons like:
> > > >       if (nfsrvboottime != clientid.lval[0])
> > > >            return (NFSERR_STALECLIENTID);
> > > >       /* where clientid.lval[0] is a uint32_t */
> > > > Anyhow, if this is what is happening, the attached simple patch should
> > > > fix it. (I don't know how time_seconds would exceed 4 billion, but the
> > > > clock code is pretty convoluted, so I can't say if it can possibly
> > > > happen?)
> > > >
> > > > rick
> > > > ps: Hmm, on i386 time_seconds ends up at 1438126486, so maybe it can
> > > >     exceed 4*1024*1024*1024 - 1 on amd64?
> > > >
> > > > > I didn't restart the client yet .. so if you catch me in the next few
> > > > > hours and want me to run any diagnostics, let me know. Thanks a lot all
> > > > > for helping
> > > > >
> > > > > On Thu, Jul 23, 2015 at 11:59 PM, Rick Macklem <rmacklem at uoguelph.ca>
> > > > > wrote:
> > > > >
> > > > > > Ahmed Kamal wrote:
> > > > > > > Can you please let me know the ultimate packet trace command I'd
> > > > > > > need to run in case of any nfs4 troubles .. I guess this should be
> > > > > > > comprehensive even at the expense of a larger output size (which we
> > > > > > > can trim later).. Thanks a lot for the help!
> > > > > > >
> > > > > > tcpdump -s 0 -w <file>.pcap host <client-host-name>
> > > > > > (<file> refers to a file name you choose and <client-host-name> refers
> > > > > >  to the host name of a client generating traffic.)
> > > > > > --> But you won't be able to allow this to run for long during the
> > > > > >     storm or the file will be huge.
> > > > > >
> > > > > > Then you look at <file>.pcap in wireshark, which knows NFS.
> > > > > >
> > > > > > rick
> > > > > >
> > > > > > > On Thu, Jul 23, 2015 at 11:53 PM, Rick Macklem <rmacklem at uoguelph.ca>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Graham Allan wrote:
> > > > > > > > > For our part, the user whose code triggered the pathological
> > > > > > > > > behaviour on SL5 reran it on SL6 without incident - I still see
> > > > > > > > > lots of sequence-id errors in the logs, but nothing bad happened.
> > > > > > > > >
> > > > > > > > > I'd still like to ask them to rerun again on SL5 to see if the
> > > > > > > > > "accept skipped seqid" patch had any effect, though I think we
> > > > > > > > > expect not. Maybe it would be nice if I could get set up to
> > > > > > > > > capture rolling tcpdumps of the nfs traffic before they run that
> > > > > > > > > though...
> > > > > > > > >
> > > > > > > > > Graham
> > > > > > > > >
> > > > > > > > > On 7/20/2015 10:26 PM, Ahmed Kamal wrote:
> > > > > > > > > > Hi folks,
> > > > > > > > > >
> > > > > > > > > > I've upgraded a test client to rhel6 today, and I'll keep an
> > > > > > > > > > eye on it to see what happens.
> > > > > > > > > >
> > > > > > > > > > During the process, I made the (I guess mistake) of zfs send |
> > > > > > > > > > recv to a locally attached usb disk for backup purposes .. long
> > > > > > > > > > story short, sharenfs property on the received filesystem was
> > > > > > > > > > causing some nfs/mountd errors in logs .. I wasn't too happy
> > > > > > > > > > with what I got .. I destroyed the backup datasets and the
> > > > > > > > > > whole pool eventually .. and then rebooted the whole nas box ..
> > > > > > > > > > After reboot my logs are still flooded with
> > > > > > > > > >
> > > > > > > > > > Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
> > > > > > > > > > Jul 21 05:13:07 nas last message repeated 7536 times
> > > > > > > > > > Jul 21 05:15:08 nas last message repeated 29664 times
> > > > > > > > > >
> > > > > > > > > > Not sure what that means .. or how it can be stopped .. Anyway,
> > > > > > > > > > will keep you posted on progress.
> > > > > > > > >
> > > > > > > > Oh, I didn't see the part about "reboot" before. Unfortunately,
> > > > > > > > it sounds like the client isn't recovering after the session is
> > > > > > > > lost. When the server reboots, the client(s) will get
> > > > > > > > NFS4ERR_BAD_SESSION errors back because the server reboot has
> > > > > > > > deleted all sessions. The NFS4ERR_BAD_SESSION should trigger state
> > > > > > > > recovery on the client.
> > > > > > > > (It doesn't sound like the clients went into recovery, starting
> > > > > > > >  with a Create_session operation, but without a packet trace, I
> > > > > > > >  can't be sure?)
> > > > > > > >
> > > > > > > > rick
> > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > -------------------------------------------------------------------------
> > > > > > > > > Graham Allan - gta at umn.edu - allan at physics.umn.edu
> > > > > > > > > School of Physics and Astronomy - University of Minnesota
> > > > > > > > >
> > > > > > > > > -------------------------------------------------------------------------
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 

