nfs lockd errors after NetApp software upgrade.

Daniel Braniss danny at cs.huji.ac.il
Sun Dec 22 06:18:16 UTC 2019



> On 21 Dec 2019, at 19:32, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Daniel Braniss wrote:
>>> On 20 Dec 2019, at 19:19, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>> 
>>> Adam McDougall wrote:
>>>> Try changing bool_t do_tcp = FALSE; to TRUE in
>>>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>> think this makes it match Linux client behavior. I suspect I ran into
>>>> the same issue as you. I do think I used nolockd as a workaround
>>>> temporarily. I can provide some more details if it works.
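
The source change suggested above can be sketched as a small script. This is only an illustration: it assumes the declaration in the tree reads exactly "bool_t do_tcp = FALSE;" as quoted in the thread, the SRC variable and the flip_do_tcp helper are names made up for this sketch, and the rebuild line is the usual FreeBSD kernel build, which may need adjusting for a custom kernel configuration.

```shell
#!/bin/sh
# Sketch: flip the NLM transport default in a kernel source tree.
# SRC defaults to the path quoted in the thread; override it to try a copy.
SRC="${SRC:-/usr/src/sys/nlm/nlm_prot_impl.c}"

# Rewrite "bool_t do_tcp = FALSE;" to "bool_t do_tcp = TRUE;" in the given file.
# A temporary file is used instead of "sed -i", whose syntax differs between
# BSD and GNU sed; writing back with cat keeps the original file's permissions.
flip_do_tcp() {
    file="$1"
    tmp="$file.tmp.$$"
    sed 's/bool_t do_tcp = FALSE;/bool_t do_tcp = TRUE;/' "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&
        rm -f "$tmp"
}

# Only touch the file when it exists, so the script does nothing elsewhere.
if [ -f "$SRC" ]; then
    flip_do_tcp "$SRC"
    grep -n 'bool_t do_tcp' "$SRC"
    echo "next: cd /usr/src && make buildkernel && make installkernel"
fi
```

Running it against a copy first (SRC=/tmp/nlm_prot_impl.c) shows whether the pattern matches before the real tree is modified.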
>>> If this fixes the problem, please let me know.
>>> 
>>> I'm not sure I'd want to change the default, since it might break things for
>>> others, but I can definitely make it a tunable, so that people don't need to
>>> recompile a kernel to deal with it.
>>> 
>>> 
>> great! I was just about to see how it can be done (tunable) but need to check if it can be done
>> at any time, or just at boot time.
> I haven't looked at the code, but I suspect changing it on the fly could cause problems,
> so I am inclined to make it a tunable (boot time only).
my feelings too.
> 
>> thanks.
>> btw, currently, from several hours of analysing the traffic, it seems that nlm is UDP.
> I assume that means you haven't tried flipping it to TCP yet.
I will soon, but I have my doubts. The problem seems to be caused by multiple events, i.e., it happened once while
I was doing an svn checkout, but I have done that several times since with no issues. So it must be
an aggregation of factors. Other hosts are reporting lock timeouts too.

danny

> 
> Please let us know how it goes, rick
> 
> danny
> 
> 
> rick
> 
> On 12/19/19 9:21 AM, Daniel Braniss wrote:
> 
> 
> On 19 Dec 2019, at 16:09, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Daniel Braniss wrote:
> [stuff snipped]
> all mounts are nfsv3/tcp
> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know when
> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
> can the replay cache have any influence here? I tend to remember way back issues
> with it,
> 
> To me, it looks like a network configuration issue.
> that was/is my gut feelings too, but, as far as we can tell, nothing has changed in the network infrastructure,
> the problems appeared after the NetAPP’s software was updated, it was working fine till then.
> 
> the problems are also happening on freebsd 12.1
> 
> You could capture packets (maybe when a client first starts rpc.statd and rpc.lockd)
> and then look at them in wireshark. I'd disable startup of rpc.lockd and rpc.statd
> at boot for a test client and then run something like:
> # tcpdump -s 0 -w out.pcap host <netapp-host>
> - and then start rpc.statd and rpc.lockd
> Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
> tcpdump). I'd look for things like different reply IP addresses from the Netapp,
> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
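
The capture procedure described above could look roughly like the following. It is a dry run that only prints the commands, since they need root on a test client. The host name netapp.example.org is a placeholder, capture_cmd and print_steps are helper names invented for this sketch, and the "service statd" / "service lockd" names are an assumption about the rc scripts on the client.

```shell
#!/bin/sh
# Sketch of the capture steps; prints the commands instead of running them.
NETAPP_HOST="${NETAPP_HOST:-netapp.example.org}"  # placeholder host name
PCAP="${PCAP:-out.pcap}"

# Build the tcpdump command line.
# -s 0 captures whole packets; -w writes raw packets to a file for wireshark.
capture_cmd() {
    printf 'tcpdump -s 0 -w %s host %s\n' "$1" "$2"
}

# Order matters: the capture must be running before the daemons start,
# so that their first exchanges with the filer are recorded.
print_steps() {
    echo "$(capture_cmd "$PCAP" "$NETAPP_HOST") &"
    echo "service statd onestart   # assumed rc script name for rpc.statd"
    echo "service lockd onestart   # assumed rc script name for rpc.lockd"
}

print_steps
```

The resulting out.pcap can then be opened in wireshark as suggested in the message above.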
> 
> it’s going to be an interesting week end :-(
> 
> the error is also appearing on freebsd-11.2-stable, I’m now checking if it’s also
> happening on 12.1
> btw, the NetApp version is 9.3P17
> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
> try to implement it, because I knew the protocol was badly broken) and I avoid
> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
> and we haven’t had any issues with it for years, so you must have done something good
> 
> cheers,
>     danny
> 
> 
> rick
> 
> cheers,
>      danny
> 
> rick
> 
> Cheers
> 
> Richard
> (NetApp admin)
> 
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss <danny at cs.huji.ac.il> wrote:
> 
> 
> On 18 Dec 2019, at 16:55, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Daniel Braniss wrote:
> 
> Hi,
> The server with the problems is running FreeBSD 11.1 stable, it was working fine for several months,
> but after a software upgrade of our NetAPP server it’s reporting many lockd errors and becomes catatonic,
> ...
> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
> Dec 18 13:11:45 moo-09 last message repeated 7 times
> Dec 18 13:12:55 moo-09 last message repeated 8 times
> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
> Dec 18 13:13:10 moo-09 last message repeated 8 times
> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be handling RPCs slowly and/or intermittently. Note that no one
> tests it with IPv6, so at least make sure you are still using IPv4 for the mounts and
> try and make sure IP broadcast works between client and Netapp. I think the NLM
> and NSM (rpc.statd) still use IP broadcast sometimes.
> 
> we are ipv4 - we have our own class c :-)
> Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
> the NLM is a fundamentally broken protocol which was never published by Sun,
> so I suggest you avoid using it if at all possible.
> well, at the moment the ball is in NetApp's court, and switching to NFSv4 right now is out of the question, it’s
> a production server used by several thousand students.
> 
> 
> - If the locks don't need to be seen by other clients, you can just use the "nolockd"
> mount option.
> or
> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
> should support NFSv4.1, which is a much better protocol than NFSv4.0.
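
As a rough illustration of the two alternatives above, the sketch below prints the corresponding FreeBSD mount commands rather than running them. The export fr-06:/web/www is taken from the log excerpt in this thread, /mnt is a placeholder mount point, and nfs_mount_cmd is a helper name invented here.

```shell
#!/bin/sh
# Sketch: build mount commands for the two alternatives (dry run, prints only).

# $1 selects the alternative, $2 is server:path, $3 is the mount point.
nfs_mount_cmd() {
    case "$1" in
        # Locks stay local to this client and are not visible to other clients.
        nolockd) opts="nfsv3,tcp,nolockd" ;;
        # NFSv4.1 handles locking inside the NFS protocol, without the NLM.
        nfsv41)  opts="nfsv4,minorversion=1" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    printf 'mount -t nfs -o %s %s %s\n' "$opts" "$2" "$3"
}

nfs_mount_cmd nolockd fr-06:/web/www /mnt
nfs_mount_cmd nfsv41 fr-06:/web/www /mnt
```

The same option strings can go into the options column of /etc/fstab once one of the variants has been tested by hand.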
> 
> Good luck with it, rick
> thanks
>     danny
> 
>> any ideas?
> 
> thanks,
>    danny
> 
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
> 
> 


