Re: Did something change with ZFS and vnode caching?

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Fri, 01 Sep 2023 21:22:23 UTC
On Thu, Aug 31, 2023 at 12:05 PM Garrett Wollman <wollman@bimajority.org> wrote:
>
> <<On Thu, 24 Aug 2023 11:21:59 -0400, Garrett Wollman <wollman@bimajority.org> said:
>
> > Any suggestions on what we should monitor or try to adjust?
I remember you mentioning that you tried increasing kern.maxvnodes,
but have you tried bumping it way up (like 10X what it currently is)?

You could also try decreasing the maximum number of nfsd threads (the
--maxthreads command-line option for nfsd). That would at least limit
the number of vnodes being used by the nfsd.
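
If it helps, here is a rough sketch of checking/bumping kern.maxvnodes
programmatically via sysctlbyname(3) (same effect as setting it with
sysctl(8) from the shell). I'm assuming the OID is a u_long on 13.x,
so verify the width before trusting it; the 10X factor is just the
experiment suggested above.

/*
 * Sketch only: read kern.maxvnodes and bump it 10X.  Assumes the OID
 * is a u_long on 13.x; verify with "sysctl kern.maxvnodes" first.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    u_long cur, want;
    size_t len = sizeof(cur);

    if (sysctlbyname("kern.maxvnodes", &cur, &len, NULL, 0) == -1)
        err(1, "read kern.maxvnodes");
    want = cur * 10;    /* the "way up" experiment */
    if (sysctlbyname("kern.maxvnodes", NULL, NULL, &want,
        sizeof(want)) == -1)
        err(1, "write kern.maxvnodes (needs root)");
    printf("kern.maxvnodes: %lu -> %lu\n", cur, want);
    return (0);
}

(The nfsd thread count itself is just the command-line option; I don't
remember offhand whether the vfs.nfsd.* sysctls let you lower it on a
running server.)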

rick

>
> To bring everyone up to speed: earlier this month we upgraded our NFS
> servers from 12.4 to 13.2 and found that our backup system was
> absolutely destroying NFS performance, which had not happened before.
>
> With some pointers from mjg@ and the thread relating to ZFS
> performance on current@ I built a stable/13 kernel
> (b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of
> our NFS servers for testing, then removed the band-aid on our backup
> system and allowed it to go as parallel as it wanted.
>
> Unfortunately, we do not control the scheduling of backup jobs, so
> it's difficult to tell whether the changes made any difference.  Each
> backup job does a parallel breadth-first traversal of a given
> filesystem, using as many as 150 threads per job (the backup client
> auto-scales itself), and we sometimes see as many as eight jobs
> running in parallel on one file server.  (There are 17, soon to be 18,
> file servers.)
>
> When the performance of NFS's backing store goes to hell, the NFS
> server is not able to put back-pressure on the clients hard enough to
> stop them from writing, and eventually the server runs out of 4k jumbo
> mbufs and crashes.  This at least is a known failure mode, going back
> a decade.  Before it gets to this point, the NFS server also
> auto-scales itself, so it's in competition with the backup client over
> who can create the most threads and ultimately allocate the most
> vnodes.
>
> Last night, while I was watching, the first dozen or so backups went
> fine, with no impact on NFS performance, until the backup server
> decided to schedule two, and then three, parallel scans of
> filesystems containing about 35 million files each.  These tend to
> take an hour or four, depending on how much changed data is identified
> during the scan, but most of the time it's just sitting in a
> readdir()/fstatat() loop with a shared work queue for parallelism.
> (That's my interpretation based on its activity; we do not have source
> code.)
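>
> In other words, my mental model of each scanner thread is something
> like the sketch below (my reconstruction, obviously, not their code;
> the names are made up, and it's single-threaded here where the real
> client drives the queue from many threads):
>
> /*
>  * Hypothetical reconstruction of the scan loop: pop a directory off
>  * a shared FIFO, readdir() it, fstatat() every entry, and push any
>  * subdirectories back on the FIFO.  Every fstatat() of a file not
>  * already cached allocates a vnode, which is where the pressure
>  * comes from.
>  */
> #include <sys/stat.h>
> #include <sys/queue.h>
> #include <dirent.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> struct work {
>     STAILQ_ENTRY(work) link;
>     char path[PATH_MAX];
> };
> static STAILQ_HEAD(, work) workq = STAILQ_HEAD_INITIALIZER(workq);
>
> static void
> enqueue(const char *path)
> {
>     struct work *w;
>
>     if ((w = malloc(sizeof(*w))) == NULL)
>         exit(1);
>     strlcpy(w->path, path, sizeof(w->path));
>     STAILQ_INSERT_TAIL(&workq, w, link);
> }
>
> int
> main(int argc, char **argv)
> {
>     struct work *w;
>     struct dirent *de;
>     struct stat sb;
>     char sub[PATH_MAX];
>     DIR *d;
>
>     enqueue(argc > 1 ? argv[1] : ".");
>     while ((w = STAILQ_FIRST(&workq)) != NULL) {
>         STAILQ_REMOVE_HEAD(&workq, link);
>         if ((d = opendir(w->path)) != NULL) {
>             while ((de = readdir(d)) != NULL) {
>                 if (strcmp(de->d_name, ".") == 0 ||
>                     strcmp(de->d_name, "..") == 0)
>                     continue;
>                 if (fstatat(dirfd(d), de->d_name, &sb,
>                     AT_SYMLINK_NOFOLLOW) == -1)
>                     continue;
>                 if (S_ISDIR(sb.st_mode)) {
>                     snprintf(sub, sizeof(sub), "%s/%s",
>                         w->path, de->d_name);
>                     enqueue(sub);
>                 }
>             }
>             closedir(d);
>         }
>         free(w);
>     }
>     return (0);
> }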
>
> Once these scans were underway, I observed the same symptoms as on
> releng/13.2, with lots of lock contention and the vnlru process
> running almost constantly (95% CPU, so most of a core on this
> 20-core/40-thread server).  From our monitoring, the server was
> recycling about 35k vnodes per second during this period.  I wasn't
> monitoring these statistics before, so I don't have historical
> comparisons.  My working assumption, such as it is, is that the switch
> from OpenSolaris ZFS to OpenZFS in 13.x moved some bottlenecks around
> so that the backup client previously got tangled higher up in the ZFS
> code and now can put real pressure on the vnode allocator.
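>
> (For anyone who wants to watch this themselves: the recycle rate can
> be estimated by differencing the vfs vnode counters once a second,
> roughly as in the sketch below.  I'm assuming vfs.recycles is the
> right counter to difference on stable/13 and that these OIDs are all
> 64 bits wide; adjust the names/types if your branch differs.)
>
> /*
>  * Sample vfs.numvnodes, vfs.freevnodes and vfs.recycles once a second
>  * and print the recycle rate.  Assumes 64-bit counters.
>  */
> #include <sys/types.h>
> #include <sys/sysctl.h>
> #include <err.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static uint64_t
> read64(const char *oid)
> {
>     uint64_t v = 0;
>     size_t len = sizeof(v);
>
>     if (sysctlbyname(oid, &v, &len, NULL, 0) == -1)
>         err(1, "%s", oid);
>     return (v);
> }
>
> int
> main(void)
> {
>     uint64_t cur, prev = read64("vfs.recycles");
>
>     for (;;) {
>         sleep(1);
>         cur = read64("vfs.recycles");
>         printf("numvnodes %ju  freevnodes %ju  recycles/s %ju\n",
>             (uintmax_t)read64("vfs.numvnodes"),
>             (uintmax_t)read64("vfs.freevnodes"),
>             (uintmax_t)(cur - prev));
>         prev = cur;
>     }
> }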
>
> During the hour that the three backup clients were running, I was able
> to run mjg@'s dtrace script and generate a flame graph, which is
> viewable at <https://people.csail.mit.edu/wollman/dtrace-terad.2.svg>.
> This just shows what the backup clients themselves are doing, and not
> what's going on in the vnlru or nfsd processes.  You can ignore all
> the umtx stacks since that's just coordination between the threads in
> the backup client.
>
> On the "oncpu" side, the trace captures a lot of time spent spinning
> in lock_delay(), although I don't see where the alleged call site
> acquires any locks, so there must have been some inlining.  On the
> "offcpu" side, it's clear that there's still a lot of time spent
> sleeping on vnode_list_mtx in the vnode allocation pathway, both
> directly from vn_alloc_hard() and also from vnlru_free_impl() after
> the mutex is dropped and then needs to be reacquired.
>
> In ZFS, there's also a substantial number of waits (shown as
> sx_xlock_hard stack frames), in both the easy case (a free vnode was
> readily available) and the hard case where vn_alloc_hard() calls
> vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode.
> Looking into the implementation, I noted that ZFS uses a 64-entry array
> of hashed locks for this, and I'm wondering if there's an issue with false
> sharing.  Can anyone with ZFS experience speak to that?  If I
> increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt
> something else (other than memory usage)?  Do we even know that the
> low-order 6 bits of ZFS object IDs are actually uniformly distributed?
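>
> That last question at least seems answerable from userland, since on
> ZFS st_ino is (as I understand it) the object number.  Bucketing
> st_ino with what I believe is the same mask ZFS_OBJ_HASH applies
> (obj & (ZFS_OBJ_MTX_SZ - 1)) over a large sample of files should show
> whether the 64 locks are hit evenly.  A throwaway sketch, fed by
> find(1):
>
> /*
>  * Histogram of st_ino & (ZFS_OBJ_MTX_SZ - 1) over file names read
>  * from stdin, e.g. "find /fs -xdev | ./inohist".  The 64 mirrors the
>  * current ZFS_OBJ_MTX_SZ; bump it to preview 128 or 256 buckets.
>  */
> #include <sys/stat.h>
> #include <limits.h>
> #include <stdio.h>
> #include <string.h>
>
> #define NBUCKETS 64
>
> int
> main(void)
> {
>     unsigned long long hist[NBUCKETS] = { 0 }, total = 0;
>     char path[PATH_MAX];
>     struct stat sb;
>     int i;
>
>     while (fgets(path, sizeof(path), stdin) != NULL) {
>         path[strcspn(path, "\n")] = '\0';
>         if (lstat(path, &sb) == -1)
>             continue;
>         hist[sb.st_ino & (NBUCKETS - 1)]++;
>         total++;
>     }
>     for (i = 0; i < NBUCKETS; i++)
>         printf("%2d %llu (%.2f%%)\n", i, hist[i],
>             total ? 100.0 * hist[i] / total : 0.0);
>     return (0);
> }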
>
> -GAWollman
>
>