Re: native inotify implementation
- Reply: Mark Johnston : "Re: native inotify implementation"
- Reply: Vladimir Kondratyev : "Re: native inotify implementation"
- Reply: Anthony Pankov : "Re: native inotify implementation"
- In reply to: Mark Johnston : "native inotify implementation"
Date: Sat, 17 May 2025 16:00:48 UTC
On 12 May 2025, at 21:58, Mark Johnston <markj@freebsd.org> wrote:
>
> This work was largely motivated by a race condition in EVFILT_VNODE: in
> order to get events for a particular file, you first have to open it, by
> which point you may have missed the event(s) you care about. For
> instance, if some upload service adds files to a directory, and you want
> to know when a new file has finished uploading, you'd have to watch the
> directory to get new file events, scan the directory to actually find
> the new file(s), open them, and then wait for NOTE_CLOSE (which might
> never arrive if the upload had already finished). Aside from that, the
> need to hold each monitored file open is also a problem for large
> directory hierarchies as it's easy to exhaust file descriptor limits.

My experience as a user was that NOTE_CLOSE was unreliable. I tried using it to detect when uploads had finished, but I never saw it (on ZFS). Producing a working reduced test case for this has been on my todo list for a while, but I solved my problem by writing my own sftp-server implementation that stored the received ‘file’ in a shared memory object and passed it to another process, so I didn't end up depending on this. (A minimal sketch of the pattern I was trying is at the end of this message.)

The only way I found on FreeBSD to determine that a file was no longer open for writing was via libprocstat, which required root. Linux apparently has an API for this, but I didn't try it.

> My initial solution was a new kqueue filter, EVFILT_FSWATCH, which lets
> one watch for all file events under a mountpoint. The consumer would
> allocate a ring buffer with space to store paths and event metadata,
> register that with the kernel, and the kernel would write entries to the
> buffer, using reverse lookups to find a path for each event vnode. This
> prototype worked, but got somewhat hairy and I decided it would be
> better to simply implement an existing interface: inotify already exists
> and is commonly used, and has a somewhat simpler model, as it merely
> watches for events within a particular directory.

I think the design is worth discussing a bit. I've used the Linux inotify implementation, and the fact that it doesn't see changes made through hard links or bind mounts is quite problematic for several use cases. From skimming your code, it looks as if it might have the same limitation? This means, for example, that a jailed application that watches its config files would miss notifications if they are modified via their original location rather than via the nullfs mount. With containers, this is likely to become more important: if the same volume is mounted in two containers (nullfs mounted in two jails), one should be able to watch for changes made by the other.

I had pondered an implementation using two layers of Bloom filters to track vnodes that are watched by any filter, and then specific filters, which would track inode numbers and do the name lookup on the slow path after matching, but I suspect there are some details that would make this hard.

The approach on XNU is a lot more scalable than the one on Linux and seems to be similar to your original proposal here. The kernel has a single fsevents device node that tells a userspace daemon which directories contain files that have been modified. When a process wants to watch for events in a tree, it notifies the userspace daemon, which maintains all of the state. There is a ring buffer between the daemon and the kernel.
If the userspace daemon can't keep up with kernel events, it sees non-sequential message numbers and falls back to examining the modification times of files in all watched paths to determine whether any files were modified in the period where it missed messages. The benefit of the XNU approach is that filesystem watching never applies backpressure to the kernel. I'm not sure how such an approach would work in a jail.

David
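P.S. For illustration, here is a minimal sketch of the EVFILT_VNODE / NOTE_CLOSE pattern I was trying to use, assuming the file already exists and can be opened by the time we register the filter (the path is made up, and I haven't re-tested this exact snippet):

#include <sys/types.h>
#include <sys/event.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct kevent ev;
	int fd, kq;

	kq = kqueue();
	if (kq == -1)
		err(1, "kqueue");

	/*
	 * The file has to exist and be opened before it can be watched,
	 * which is the race described in the quoted mail.
	 */
	fd = open("/upload/incoming.dat", O_RDONLY);
	if (fd == -1)
		err(1, "open");

	EV_SET(&ev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
	    NOTE_CLOSE | NOTE_CLOSE_WRITE, 0, NULL);
	if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent: register");

	/*
	 * Block until the writer closes the file; if it had already
	 * closed before we registered, no event will ever arrive.
	 */
	if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
		err(1, "kevent: wait");

	if (ev.fflags & (NOTE_CLOSE | NOTE_CLOSE_WRITE))
		printf("file was closed\n");

	close(fd);
	close(kq);
	return (0);
}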