how to fix an interesting issue with mountd?

Peter Eriksson pen at lysator.liu.se
Tue Jun 2 13:12:49 UTC 2020


I once reported that we had servers with many thousands of ZFS filesystems (typically 23000 or so per server, and 300+ snapshots per filesystem) where mountd was 100% busy reading and updating the kernel exports (and, while doing that, holding the NFS lock for a very long time) every hour, when we took snapshots of all the filesystems (the code in the zfs commands seems to send a lot of SIGHUPs to mountd)...

(Causing NFS users to complain quite a bit)

I have also seen that when there are a lot of updates to filesystems, some exports can get “missed” if mountd is bombarded with multiple SIGHUPs - but with the new incremental update code in mountd this window (for SIGHUPs to get lost) is much smaller. (I now also have a Nagios check that verifies that all exports in /etc/zfs/exports are also visible in the kernel.)
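The consistency check mentioned above can be sketched roughly like this (this is not the actual Nagios plugin; all names are invented, and in the real check the kernel-visible paths would come from something like "showmount -e localhost" rather than being passed in):

```python
# Hypothetical sketch: compare the paths configured in /etc/zfs/exports
# against the set of paths the kernel actually exports.

def parse_exports(text):
    """Return the set of exported paths from exports(5)-style text."""
    paths = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first whitespace-separated token on each line is the path.
        paths.add(line.split()[0])
    return paths

def missing_exports(configured_text, kernel_paths):
    """Paths present in the exports file but not visible in the kernel."""
    return sorted(parse_exports(configured_text) - set(kernel_paths))
```

A Nagios check would then alert whenever missing_exports() returns a non-empty list.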


But while we had this problem I also investigated moving to a DB-based exports “file”, in order to make the code in the “zfs” commands that reads and updates /etc/zfs/exports a lot faster too. As Rick says, there is room for _huge_ improvements there.

For every change of “sharenfs” on a filesystem it would open, read and parse /etc/zfs/exports line by line *two* times, and then rewrite the whole file. Now imagine doing that recursively for 23000 filesystems... My change to the zfs code simply opened a DB file and did a “put” of a record for the filesystem (and then sent mountd a SIGHUP).

(And even worse - when doing the boot-time “zfs share -a”, for each filesystem it would open /etc/zfs/exports, read it line by line to check that the filesystem isn’t already in the file, then open a tmp file, write out all the old filesystems plus the new one, rename it to /etc/zfs/exports, send a SIGHUP, and then go on to the next one. Repeat. Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it tests the boot disk I/O a bit :-)
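The arithmetic behind that slowdown: rewriting the whole file once per filesystem touches 1 + 2 + ... + n lines in total, i.e. O(n^2), while one DB put per filesystem is O(n). A quick illustration:

```python
# Total write cost of the rewrite-per-filesystem pattern vs. one DB put each.

def lines_touched_rewrite(n):
    """Adding filesystem i rewrites an i-line file, so sum 1..n lines total."""
    return n * (n + 1) // 2

def lines_touched_db(n):
    """One record written per filesystem."""
    return n

# For n = 20000: about 200 million line writes vs. 20000 record puts.
```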
 

I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit in this regard, and I think for Linux they are going the route of updating the kernel directly instead of going via an external updater (like mountd). That would probably be an even better way (for ZFS), but a DB-based exports file might be useful anyway. It’s a very simple change (especially in mountd - it just opens the DB file and reads the records sequentially instead of the text file).

- Peter

> On 2 Jun 2020, at 06:30, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Rodney Grimes wrote:
>>> Hi,
>>> 
>>> I'm posting this one to freebsd-net@ since it seems vaguely similar
>>> to a network congestion problem and thought that network types
>>> might have some ideas w.r.t. fixing it?
>>> 
>>> PR#246597 - Reports a problem (which if I understand it is) where a sighup
>>>   is posted to mountd and then another sighup is posted to mountd while
>>>   it is reloading exports and the exports are not reloaded again.
>>>   --> The simple patch in the PR fixes the above problem, but I think will
>>>          aggravate another one.
>>> For some NFS servers, it can take minutes to reload the exports file(s).
>>> (I believe Peter Eriksson has a server with 80000+ file systems exported.)
>>> r348590 reduced the time taken, but it is still minutes, if I recall correctly.
> Actually, my recollection w.r.t. the times was way off.
> I just looked at the old PR#237860 and, without r348590, it was 16 seconds
> (seconds, not minutes) and with r348590 that went down to a fraction
> of a second (there was no exact number in the PR, but I noted milliseconds in
> the commit log entry).
> 
> I still think there is a risk of doing the reloads repeatedly.
> 
>>> --> If you apply the patch in the PR and sighups are posted to mountd as
>>>       often as it takes to reload the exports file(s), it will simply reload the
>>>       exports file(s) over and over and over again, instead of processing
>>>       Mount RPC requests.
>>> 
>>> So, finally to the interesting part...
>>> - It seems that the code needs to be changed so that it won't "forget"
>>>  sighup(s) posted to it, but it should not reload the exports file(s) too
>>>  frequently.
>>> --> My thoughts are something like:
>>>  - Note that sighup(s) were posted while reloading the exports file(s) and
>>>    do the reload again, after some minimum delay.
>>>    --> The minimum delay might only need to be 1second to allow some
>>>           RPCs to be processed before reload happens again.
>>>     Or
>>>    --> The minimum delay could be some fraction of how long a reload takes.
>>>          (The code could time the reload and use that to calculate how long to
>>>           delay before doing the reload again.)
>>> 
>>> Any ideas or suggestions? rick
>>> ps: I've actually known about this for some time, but since I didn't have a good
>>>     solution...
>> 
>> Build a system that allows adding and removing entries from the
>> in mountd exports data so that you do not have to do a full
>> reload every time one is added or removed?
>> 
>> Build a system that used 2 exports tables, the active one, and the
>> one that was being loaded, so that you can process RPC's and reloads
>> at the same time.
> Well, r348590 modified mountd so that it built a new set of linked list
> structures from the modified exports file(s) and then compared them with
> the old ones, only doing updates to the kernel exports for changes.
> 
> It still processes the entire exports file each time, to produce the in mountd
> memory linked lists (using hash tables and a binary tree).
> 
> Peter did send me a patch to use a db frontend, but he felt the only
> performance improvements would be related to ZFS.
> Since ZFS is something I avoid like the plague I never pursued it.
> (If anyone willing to do ZFS stuff wants to pursue this,
> just email and I can send you the patch.)
> Here's a snippet of what he said about it.
>> It looks like a very simple patch to create and even though it wouldn’t really
>> improve the speed for the work that mountd does it would make possible really
>> drastic speed improvements in the zfs commands. They (the zfs commands) currently
>> read through the text-based exports file multiple times when you do work with zfs
>> filesystems (mounting/sharing/changing share options etc). Using a db based
>> exports file for the zfs exports (b-tree based probably) would allow the zfs code
>> to be much faster.
> 
> At this point, I am just interested in fixing the problem in the PR, rick
> 
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
