how to fix an interesting issue with mountd?

Rick Macklem rmacklem at uoguelph.ca
Wed Jun 3 00:50:12 UTC 2020


Peter Eriksson wrote:
>I once reported that we had a server with many thousands (typically 23000 or so
>per server) of ZFS filesystems (and 300+ snapshots per filesystem) where mountd
>was 100% busy reading and updating the kernel (and, while doing that, holding the
>NFS lock for a very long time) every hour (when we took snapshots of all the
>filesystems - the code in the zfs commands sends a lot of SIGHUPs to mountd, it
>seems)….
>
>(Causing NFS users to complain quite a bit)
>
>I have also seen the effect that when there are a lot of updates to filesystems,
>some exports can get “missed” if mountd is bombarded with multiple SIGHUPs -
>but with the new incremental update code in mountd this window (for SIGHUPs to
>get lost) is much smaller (and I now also have a Nagios check that verifies that all
>exports in /etc/zfs/exports are also visible in the kernel).
I just put a patch up in PR#246597, which you might want to try.

>But while we had this problem I also investigated going to a DB based exports
>“file” in order to make the code in the “zfs” commands that reads and updates
>/etc/zfs/exports a lot faster too. As Rick says, there is room for _huge_
>improvements there.
>
>For every change of “sharenfs” per filesystem it would open, read and parse,
>line-by-line, /etc/zfs/exports *two* times and then rewrite the whole file. Now
>imagine doing that recursively for 23000 filesystems… My change to the zfs code
>simply opened a DB file and just did a “put” of a record for the filesystem (and
>then sent mountd a SIGHUP).
Just to clarify, if someone else can put Peter's patch in ZFS, I am willing to
put the required changes in mountd.
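
To make the idea concrete, here is a minimal sketch of what the zfs-side
change might look like, using the db(3) btree API. This is my illustration,
not Peter's actual patch; the file name /etc/zfs/exports.db and the helper
update_export() are made up for the example.

#include <sys/types.h>
#include <db.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>

/* Sketch only: store one export record per filesystem, then poke mountd. */
static int
update_export(const char *fspath, const char *opts, pid_t mountd_pid)
{
        DB *db;
        DBT key, data;

        /* Open (or create) the btree-backed exports database. */
        db = dbopen("/etc/zfs/exports.db", O_RDWR | O_CREAT, 0644,
            DB_BTREE, NULL);
        if (db == NULL)
                return (-1);

        key.data = __DECONST(void *, fspath);
        key.size = strlen(fspath) + 1;
        data.data = __DECONST(void *, opts);
        data.size = strlen(opts) + 1;

        /* One put replaces the record in place; no re-reading and
         * rewriting of the whole exports file. */
        if (db->put(db, &key, &data, 0) != 0) {
                (void)db->close(db);
                return (-1);
        }
        if (db->close(db) != 0)
                return (-1);

        /* Tell mountd to pick up the change. */
        return (kill(mountd_pid, SIGHUP));
}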

>
>(And even worse - when doing the boot-time “zfs share -a” - for each filesystem it
>would open /etc/zfs/exports, read it line by line and check to make sure the
>filesystem isn’t already in the file, then open a tmp file, write out all the old
>filesystems plus the new one, rename it to /etc/zfs/exports, send a SIGHUP, and
>then go on to the next one.. Repeat.  Pretty fast for 1-10 filesystems, not so fast for
>20000+ ones… And it tests the boot disk I/O a bit :-)
>
>
>I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit regarding
>this and I think for Linux they are going the route of directly updating the kernel
>instead of going via some external updater (like mountd).
The problem here is NFSv3, where something (currently mountd) needs to know
about this stuff, so it can do the Mount protocol (used for NFSv3 mounting and
done with Mount RPCs, not NFS ones).

>That probably would be an even better way (for ZFS) but a DB file might be
>useful anyway. It’s a very simple change (especially in mountd - it just opens the
>DB file and reads the records sequentially instead of the text file).
I think what you have, which puts the info in a db file and then SIGHUPs mountd,
is a good start.
Again, if someone else can get this into ZFS, I can put the bits in mountd.
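
On the mountd side, reading the records back is just as simple. Another
sketch under the same assumptions; process_export_line() is a made-up
stand-in for mountd's existing parsing of one exports line:

#include <sys/types.h>
#include <db.h>
#include <fcntl.h>

/* Hypothetical hook into mountd's existing text-line parsing. */
void process_export_line(char *fspath, char *opts);

/* Sketch: walk every record in the db file, in place of fgets()
 * over the text exports file. */
static int
load_exports_db(const char *path)
{
        DB *db;
        DBT key, data;
        int ret;

        db = dbopen(path, O_RDONLY, 0, DB_BTREE, NULL);
        if (db == NULL)
                return (-1);

        /* R_FIRST positions at the first record, R_NEXT walks the rest. */
        for (ret = db->seq(db, &key, &data, R_FIRST); ret == 0;
            ret = db->seq(db, &key, &data, R_NEXT))
                process_export_line(key.data, data.data);

        return (db->close(db));
}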

Thanks for posting this, rick
ps: Do you happen to know how long a reload of exports in mountd is currently
      taking, with the patches done to it last year?

- Peter

On 2 Jun 2020, at 06:30, Rick Macklem <rmacklem at uoguelph.ca> wrote:

Rodney Grimes wrote:
Hi,

I'm posting this one to freebsd-net@ since it seems vaguely similar
to a network congestion problem and thought that network types
might have some ideas w.r.t. fixing it?

PR#246597 - Reports a problem (which, if I understand it correctly, is one) where
  a SIGHUP is posted to mountd and then another SIGHUP is posted to mountd while
  it is reloading exports, and the exports are not reloaded again.
  --> The simple patch in the PR fixes the above problem, but I think it will
         aggravate another one.
For some NFS servers, it can take minutes to reload the exports file(s).
(I believe Peter Eriksson has a server with 80000+ file systems exported.)
r348590 reduced the time taken, but it is still minutes, if I recall correctly.
Actually, my recollection w.r.t. the times was way off.
I just looked at the old PR#237860 and, without r348590, it was 16 seconds
(i.e. seconds, not minutes) and with r348590 that went down to a fraction
of a second (there was no exact number in the PR, but I noted milliseconds in
the commit log entry).

I still think there is a risk of doing the reloads repeatedly.

--> If you apply the patch in the PR and SIGHUPs are posted to mountd as
      often as it takes to reload the exports file(s), it will simply reload the
      exports file(s) over and over and over again, instead of processing
      Mount RPC requests.

So, finally to the interesting part...
- It seems that the code needs to be changed so that it won't "forget"
 SIGHUP(s) posted to it, but it should not reload the exports file(s) too
 frequently.
--> My thoughts are something like:
 - Note that SIGHUP(s) were posted while reloading the exports file(s) and
   do the reload again, after some minimum delay.
   --> The minimum delay might only need to be 1 second to allow some
          RPCs to be processed before the reload happens again.
    Or
   --> The minimum delay could be some fraction of how long a reload takes.
         (The code could time the reload and use that to calculate how long to
          delay before doing the reload again.)
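
To make the first variant concrete, here is a sketch of the coalescing
logic. The names are my own invention, except that get_exportlist() is
meant to stand for mountd's existing reload routine; treat it as a sketch
of the idea, not a finished patch:

#include <signal.h>
#include <time.h>

void get_exportlist(void);      /* mountd's existing reload routine */

/* Set by the handler; cleared by the main loop just before a reload,
 * so a SIGHUP that arrives mid-reload is noticed afterward, not lost. */
static volatile sig_atomic_t got_sighup;
static time_t last_reload;

#define RELOAD_MIN_DELAY 1      /* seconds between reloads */

static void
huphandler(int sig)
{
        (void)sig;
        got_sighup = 1;
}

/* Polled from the RPC service loop. Back-to-back SIGHUPs coalesce into
 * one pending reload, and reloads are spaced at least RELOAD_MIN_DELAY
 * apart so some Mount RPCs get processed in between. */
static void
check_reload(void)
{
        if (got_sighup && time(NULL) - last_reload >= RELOAD_MIN_DELAY) {
                got_sighup = 0;         /* a SIGHUP during reload re-arms us */
                get_exportlist();
                last_reload = time(NULL);
        }
}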

Any ideas or suggestions? rick
ps: I've actually known about this for some time, but since I didn't have a good
    solution...

Build a system that allows adding and removing entries from the
in-mountd exports data so that you do not have to do a full
reload every time one is added or removed?

Build a system that uses 2 exports tables, the active one, and the
one that is being loaded, so that you can process RPCs and reloads
at the same time.
Well, r348590 modified mountd so that it built a new set of linked list
structures from the modified exports file(s) and then compared them with
the old ones, only doing updates to the kernel exports for changes.

It still processes the entire exports file each time, to produce mountd's
in-memory linked lists (using hash tables and a binary tree).
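
The compare step is essentially a merge over two sorted lists. A rough
sketch of the idea (the struct layout and the kernel_*() functions are
illustrative stand-ins for mountd's real structures and kernel-update
calls, not the actual r348590 code):

#include <stdio.h>
#include <string.h>

struct export {
        char *ex_path;          /* filesystem path (the sort key) */
        char *ex_opts;          /* export options */
        struct export *ex_next;
};

/* Stand-ins for mountd's real kernel-update calls. */
static void kernel_add_export(const struct export *e)    { printf("add %s\n", e->ex_path); }
static void kernel_delete_export(const struct export *e) { printf("del %s\n", e->ex_path); }
static void kernel_change_export(const struct export *e) { printf("chg %s\n", e->ex_path); }

/* Walk the old and new lists (both sorted by path) in step and touch
 * the kernel only where they differ. */
static void
apply_export_diff(struct export *oldl, struct export *newl)
{
        int cmp;

        while (oldl != NULL || newl != NULL) {
                if (oldl == NULL)
                        cmp = 1;        /* rest of newl is additions */
                else if (newl == NULL)
                        cmp = -1;       /* rest of oldl is deletions */
                else
                        cmp = strcmp(oldl->ex_path, newl->ex_path);

                if (cmp < 0) {
                        kernel_delete_export(oldl);
                        oldl = oldl->ex_next;
                } else if (cmp > 0) {
                        kernel_add_export(newl);
                        newl = newl->ex_next;
                } else {
                        if (strcmp(oldl->ex_opts, newl->ex_opts) != 0)
                                kernel_change_export(newl);
                        oldl = oldl->ex_next;
                        newl = newl->ex_next;
                }
        }
}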

Peter did send me a patch to use a db frontend, but he felt the only
performance improvements would be related to ZFS.
Since ZFS is something I avoid like the plague, I never pursued it.
(If anyone willing to do ZFS stuff wants to pursue this,
just email me and I can send you the patch.)
Here's a snippet of what he said about it.
>It looks like a very simple patch to create and even though it wouldn’t really
>improve the speed for the work that mountd does, it would make possible really
>drastic speed improvements in the zfs commands. They (the zfs commands) currently
>read the text-based exports file multiple times when you do work with zfs
>filesystems (mounting/sharing/changing share options etc). Using a db based
>exports file for the zfs exports (b-tree based probably) would allow the zfs code
>to be much faster.

At this point, I am just interested in fixing the problem in the PR, rick
