Major issues with nfsv4

Rick Macklem rmacklem at uoguelph.ca
Sun Dec 13 17:09:00 UTC 2020


J David wrote:
[stuff snipped]
>The most common trace (91 times, or over 1/4 of all observed) is:
>
>__mtx_lock_sleep+0xf8 nfscl_nodeleg+0x207 nfs_lookup+0x314
>VOP_LOOKUP_APV+0x75 null_lookup+0x98 VOP_LOOKUP_APV+0x75 lookup+0x451
>namei+0x414 kern_statat+0x72 freebsd11_stat+0x30 amd64_syscall+0x387
>fast_syscall_common+0xf8
This is just waiting on the mutex that protects the open/lock/delegation
state.

Now, is this just caused by heavy load and 130000 opens, or is there
something in nullfs that results in more state handling than needed?
[more stuff snipped]
>It happens these were all Python-based jobs.  Python seems predisposed
>to trigger this, but non-Python jobs trigger it as well.  Heavy use of
>stat() does seem to be a common element regardless of job type.
>
>Here's the output of "nfsstat -E -c" 60 seconds after running it with -z:
>
>Client Info:
>RPC Counts:
>      Getattr      Setattr       Lookup     Readlink         Read        Write
>         1667          111         6376            0           42          153
>       Create       Remove       Rename         Link      Symlink        Mkdir
>          111            7           14            0            0            0
>        Rmdir      Readdir     RdirPlus       Access        Mknod       Fsstat
>            0            0            0         2620            0          160
>       FSinfo     pathConf       Commit      SetClId    SetClIdCf         Lock
>            0            0          113            0            0           48
>        LockT        LockU         Open      OpenCfr
>            0           48          320            0
>   OpenDownGr        Close
>            0          402
--> So, during this time interval, there were 320 opens done and 402 closes.
>    RelLckOwn  FreeStateID    PutRootFH     DelegRet       GetAcl       SetAcl
>            0            3            0            0            0            0
>   ExchangeId   CreateSess  DestroySess  DestroyClId    LayoutGet   GetDevInfo
>            0            0            0            0            0            0
> LayoutCommit LayoutReturn ReclaimCompl    ReadDataS   WriteDataS  CommitDataS
>            0            0            0            0            0            0
>   OpenLayout CreateLayout
>            0            0
>    OpenOwner        Opens    LockOwner        Locks       Delegs     LocalOwn
>        21175       130439           30            6            0            0
So it has accumulated 130439 opens across 21175 different processes.
(An openowner represents a process on the client.)
Are there typically 20000+ processes/jobs running concurrently on this setup?

That implies that any time a file is opened (and any time a vnode's v_usecount
drops to 0), a linear traversal of a 130000-element linked list is done while
holding the mutex that the above common procstat entry is waiting on.
(The nfscl_nodeleg() call to check for a delegation happens whenever lookup
 has a name cache hit.)
-->Also, every VOP_GETATTR() called from stat() will need to acquire the
     state mutex for a short period of time.
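
To make the shape of that concrete, here is a minimal userland sketch of the
pattern being described (build with cc -pthread). The struct, the list handling
and all of the names below are invented for illustration; the real code in
sys/fs/nfsclient is organized quite differently:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct openstate {
    struct openstate *next;
    char fh[32];                        /* stand-in for the NFS file handle */
};

static pthread_mutex_t clstate_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct openstate *open_list;     /* one list for the whole mount */

static void
add_openstate(const char *fh)
{
    struct openstate *os = calloc(1, sizeof(*os));

    if (os == NULL)
        abort();
    snprintf(os->fh, sizeof(os->fh), "%s", fh);
    pthread_mutex_lock(&clstate_mtx);
    os->next = open_list;
    open_list = os;
    pthread_mutex_unlock(&clstate_mtx);
}

/* Roughly what each per-lookup delegation/open check has to do. */
static struct openstate *
find_openstate(const char *fh)
{
    struct openstate *os;

    pthread_mutex_lock(&clstate_mtx);   /* stands in for NFSLOCKCLSTATE() */
    for (os = open_list; os != NULL; os = os->next)
        if (strcmp(os->fh, fh) == 0)
            break;                      /* linear scan, done under the lock */
    pthread_mutex_unlock(&clstate_mtx);
    return (os);
}

int
main(void)
{
    char fh[32];
    int i;

    /* Mimic the 130000 accumulated opens, then look up the oldest one,
     * which sits at the tail and forces a full traversal. */
    for (i = 0; i < 130000; i++) {
        snprintf(fh, sizeof(fh), "fh%d", i);
        add_openstate(fh);
    }
    printf("found fh0: %s\n", find_openstate("fh0") != NULL ? "yes" : "no");
    return (0);
}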

-->So I am not surprised that the mount gets "constipated".
     To be honest, the NFSv4 state handling in the client was never designed
     to handle opens at this scale.
     --> Unlike the NFS server code, which uses hash tables of linked lists, the client
            only uses a single linked list for all open structures.
            I have thought of changing the NFS client code to use hash tables of lists
            and may work on doing so during the coming months, but who knows
            when such a patch might be ready.
            So long as the individual lists stay short, holding the mutex during list
            traversal seems to work ok, even on a heavily loaded NFS server, so the
            same approach should allow the client code to scale to larger numbers
            of open structures.
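
To sketch the lookup side of that idea (again with invented names, extending
the toy model above rather than quoting any real patch), hashing the file
handle into one of many buckets keeps each list that has to be walked under
the state mutex short:

/* Extends the toy model above; still not real nfsclient code. */
#define OPEN_HASH_SIZE  1024

static struct openstate *open_hash[OPEN_HASH_SIZE];

static unsigned int
fh_hash(const char *fh)
{
    unsigned int h = 5381;

    while (*fh != '\0')                 /* simple string hash for the sketch */
        h = h * 33 + (unsigned char)*fh++;
    return (h % OPEN_HASH_SIZE);
}

static struct openstate *
find_openstate_hashed(const char *fh)
{
    struct openstate *os;

    pthread_mutex_lock(&clstate_mtx);   /* the lock itself stays global */
    for (os = open_hash[fh_hash(fh)]; os != NULL; os = os->next)
        if (strcmp(os->fh, fh) == 0)
            break;                      /* ~130000/1024 entries per bucket */
    pthread_mutex_unlock(&clstate_mtx);
    return (os);
}

The contention point (the single state mutex) is unchanged; what shrinks is
the amount of work done while holding it.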

What you are seeing may just be contention for the client state mutex
(called NFSLOCKCLSTATE() in the code), caused by repeated traversals
of the 130000-element linked list.

The other question is...
Is having 130000 opens normal, or is nullfs somehow delaying the
VOP_INACTIVE() call on the underlying NFS vnode enough to cause
these opens to accumulate?
(Due to how these Windows oplock-like opens are defined, they cannot
 be closed/discarded by the NFS client until VOP_INACTIVE() is called on
 the open file's vnode.)

Counting how many processes are in __mtx_lock_sleep in the procstat
output should give us a better idea of whether contention on the NFSv4
client state mutex is causing the problem.
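
For what it's worth, here is a trivial stand-alone counter for that, assuming
the kernel stacks were captured with something like "procstat -kk -a >
procstat.out" (a plain grep -c __mtx_lock_sleep on the same file does the
same job):

/* Reads the saved procstat output on stdin and counts the stack
 * lines that mention __mtx_lock_sleep. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char line[2048];
    int waiters = 0;

    while (fgets(line, sizeof(line), stdin) != NULL)
        if (strstr(line, "__mtx_lock_sleep") != NULL)
            waiters++;
    printf("threads blocked in __mtx_lock_sleep: %d\n", waiters);
    return (0);
}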

rick

>    LocalOpen    LocalLown    LocalLock
>            0            0            0
>Rpc Info:
>     TimedOut      Invalid    X Replies      Retries     Requests
>            0            0            0            0        12247
>Cache Info:
>    Attr Hits  Attr Misses    Lkup Hits  Lkup Misses
>      1110054          858      1002829         6361
>    BioR Hits  BioR Misses    BioW Hits  BioW Misses
>         2000           54          292          153
>   BioRL Hits BioRL Misses    BioD Hits  BioD Misses
>         6911            0          208            0
>    DirE Hits  DirE Misses
>          104            0
>
>This does reflect the whole machine, not just those four processes.
>
>Finally, when I attempted to kill those four processes with ktrace
>running on all of them, the system panicked:
>
>Fatal trap 12: page fault while in kernel mode
>cpuid = 2; apic id = 02
>fault virtual address = 0x10
>fault code = supervisor read data, page not present
>instruction pointer = 0x20:0xffffffff8272d28f
>stack pointer         = 0x28:0xfffffe008a5f24c0
>frame pointer         = 0x28:0xfffffe008a5f25f0
>code segment = base 0x0, limit 0xfffff, type 0x1b
>= DPL 0, pres 1, long 1, def32 0, gran 1
>processor eflags = interrupt enabled, resume, IOPL = 0
>current process = 5627 (node)
>trap number = 12
>panic: page fault
>cpuid = 2
>time = 1607845622
>KDB: stack backtrace:
>db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008a5f2180
>vpanic() at vpanic+0x17b/frame 0xfffffe008a5f21d0
>panic() at panic+0x43/frame 0xfffffe008a5f2230
>trap_fatal() at trap_fatal+0x391/frame 0xfffffe008a5f2290
>trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008a5f22e0
>trap() at trap+0x286/frame 0xfffffe008a5f23f0
>calltrap() at calltrap+0x8/frame 0xfffffe008a5f23f0
>--- trap 0xc, rip = 0xffffffff8272d28f, rsp = 0xfffffe008a5f24c0, rbp = 0xfffffe008a5f25f0 ---
>null_bypass() at null_bypass+0xaf/frame 0xfffffe008a5f25f0
>VOP_ADVLOCK_APV() at VOP_ADVLOCK_APV+0x80/frame 0xfffffe008a5f2620
>closef() at closef+0x8f/frame 0xfffffe008a5f26b0
>fdescfree_fds() at fdescfree_fds+0x3c/frame 0xfffffe008a5f2700
>fdescfree() at fdescfree+0x466/frame 0xfffffe008a5f27c0
>exit1() at exit1+0x488/frame 0xfffffe008a5f2820
>sigexit() at sigexit+0x159/frame 0xfffffe008a5f2b00
>postsig() at postsig+0x2fd/frame 0xfffffe008a5f2bc0
>ast() at ast+0x317/frame 0xfffffe008a5f2bf0
>doreti_ast() at doreti_ast+0x1f/frame 0x7fffffff67a0
>
>The fsck ate the ktrace.out file on reboot.  This system does not have
>kernel debug symbols.  If there's a way to try to figure out what
>null_bypass+0xaf corresponds to using the machine that built this
>kernel, which has the kernel, objects, debug symbols, and source,
>please let me know.  It's built from releng/12.2 r368515.
>
>Thanks!

