Date: Sat, 02 Oct 2021 01:39:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=178231

Chris Stephan <firstname.lastname@example.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |email@example.com

--- Comment #5 from Chris Stephan <firstname.lastname@example.org> ---
Seeing same logs as OP:

nfsv4 client/server protocol prob err=10026

To prepare for migrating our next deployment of production servers to NFSv4 over TLS, we have built a development environment on 13.0-RELEASE. All hosts were built from scratch using the 13.0-RELEASE txz's, expanded into custom ZFS datasets on Supermicro X9 series Intel.

The NFS clients are diskless Dell workstations, booting PXE from /srv/tftpboot, with root on MFS, mounting NFS shares from the server on /net. The MFS includes most of base.txz. Minor bits that are better suited for tmpfs or NFS have been removed, mainly from /usr/share/ and /boot. /usr/bin and /usr/sbin are untouched. The method for this build is identical to what was done for our 12.2 deployment, except that the source is now the 13-RELEASE tarball and we use NFSv4 where we previously used NFSv3.

## /etc/fstab entries tried:
192.168.10.101:/ /net nfs nfsv4,rw,hard,tcp 0 0
192.168.10.101:/ /net nfs nfsv4,rw,soft,retrycnt=0,tcp 0 0

All works fine until one of the testers starts Chromium, which, after even minor browsing, causes the whole window system to freeze. Originally we thought this was a bug in Chrome or the X server. After trying to isolate it with DTrace for the last three days, we found we can trigger the event by waiting for the window system to freeze, setting up dtrace on the window manager (fluxbox), and right-clicking on the desktop to open a menu. That triggers the open() call that pulls the menu file from the user's home directory:

`/net/home/<user>/.fluxbox/menu`

NFS never fulfills this request. The window manager locks up, but I can still switch to another VT and troubleshoot from there.

On the NFS server, /usr/sbin/nfsdumpstate shows locks for each of the clients running Chrome. On the clients, /usr/bin/nfsstat shows thousands of timeouts and retries, but they have stopped incrementing by the time everything has locked up. As long as Chrome is not started, those counters stay at 0 even after hours of running. It seems apparent that the NFS client has seized at this point and cannot recover.

Rebooting a client does not clear the locks on the server. clear_locks does not appear to resolve the server side either (but I'm not sure clear_locks works with NFSv4). Any application, CLI or GUI, that accesses the NFS mount locks up and never returns. If I try to umount [-f] /net, the command locks the VT. If I try to read the subtree of /net, the command locks the VT.

Our previous setup uses NFSv3+KRB5i/p on 12.2-RELEASE-p10 and works flawlessly. We have also tried connecting a lab client from our 12.2 cluster as an NFSv4 client to the 13.0 server, and the same thing happens. I am not willing to connect a client via NFSv4 to our 12.2 cluster, because I don't want to risk locking up that server while active testing is going on.

When we change the client mounts to NFSv3, all is well. So this definitely feels like a bug in the NFSv4 client or server.

Anyway, I'm at the point where I feel there are smarter folk than I who might be interested in looking at this. I have a relatively idle cluster ready for all the testing anyone wants to throw at it.

-- 
You are receiving this mail because:
You are the assignee for the bug.