Re: native inotify implementation
Date: Sat, 17 May 2025 16:56:24 UTC
On Sat, May 17, 2025 at 05:00:48PM +0100, David Chisnall wrote:
> On 12 May 2025, at 21:58, Mark Johnston <markj@freebsd.org> wrote:
> >
> > This work was largely motivated by a race condition in EVFILT_VNODE: in
> > order to get events for a particular file, you first have to open it, by
> > which point you may have missed the event(s) you care about. For
> > instance, if some upload service adds files to a directory, and you want
> > to know when a new file has finished uploading, you'd have to watch the
> > directory to get new file events, scan the directory to actually find
> > the new file(s), open them, and then wait for NOTE_CLOSE (which might
> > never arrive if the upload had already finished). Aside from that, the
> > need to hold each monitored file open is also a problem for large
> > directory hierarchies as it's easy to exhaust file descriptor limits.
>
> My experience as a user was that NOTE_CLOSE was unreliable. I tried using
> it to detect when uploads had finished but I never saw it (on ZFS). I
> have had producing a working reduced test case for this on my todo list
> for a while, but I solved my problem by writing my own sftp-server
> implementation that stored the received ‘file’ in a shared memory object
> and passed it to another process, so didn’t end up depending on this.

I don't quite follow: was your problem perhaps due to the above-mentioned
race, or are you sure it was something else? There is at least one related
bug that I see: fds opened with O_PATH do not have VOP_OPEN/VOP_CLOSE
invoked on the underlying vnode, so they don't raise NOTE_OPEN/NOTE_CLOSE
kevent notifications. I doubt this is related to your problem though (and
I'm not certain that the current behaviour is really incorrect).
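For concreteness, the racy pattern described above looks roughly like this
(a minimal sketch: the paths are placeholders, the directory scan is
omitted, and NOTE_CLOSE_WRITE is used since it's the writer's close we
care about):

#include <sys/types.h>
#include <sys/event.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
        struct kevent ev;
        int dirfd, filefd, kq;

        if ((kq = kqueue()) == -1)
                err(1, "kqueue");

        /* Watch the upload directory for new entries. */
        if ((dirfd = open("/uploads", O_RDONLY | O_DIRECTORY)) == -1)
                err(1, "open");
        EV_SET(&ev, dirfd, EVFILT_VNODE, EV_ADD | EV_CLEAR, NOTE_WRITE,
            0, NULL);
        if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
                err(1, "kevent");
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
                err(1, "kevent");

        /*
         * A scan of the directory (omitted) finds the new file. If the
         * uploader has already closed it by the time we open it below,
         * the NOTE_CLOSE_WRITE we wait for never arrives: that is the
         * race. Note also that the file must stay open for as long as
         * we watch it.
         */
        if ((filefd = open("/uploads/newfile", O_RDONLY)) == -1)
                err(1, "open");
        EV_SET(&ev, filefd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
            NOTE_CLOSE_WRITE, 0, NULL);
        if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
                err(1, "kevent");
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
                err(1, "kevent");
        printf("upload finished\n");
        return (0);
}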
> The only way that I found on FreeBSD to determine that a file was no
> longer open for writing was via libprocstat, which required root. Linux
> has an API for this, apparently, but I didn’t try it.
>
> > My initial solution was a new kqueue filter, EVFILT_FSWATCH, which lets
> > one watch for all file events under a mountpoint. The consumer would
> > allocate a ring buffer with space to store paths and event metadata,
> > register that with the kernel, and the kernel would write entries to the
> > buffer, using reverse lookups to find a path for each event vnode. This
> > prototype worked, but got somewhat hairy and I decided it would be
> > better to simply implement an existing interface: inotify already exists
> > and is commonly used, and has a somewhat simpler model, as it merely
> > watches for events within a particular directory.
>
> I think it’s worth discussing the design a bit. I’ve used the Linux
> inotify implementation a bit and the fact that it doesn’t see changes
> made through hard links or bind mounts is quite problematic for several
> use cases. From skimming your code, it looks as if it might have the same
> limitation? This means, for example, that a jailed application that
> watches its config files would miss notifications if they are modified
> via their original location rather than the nullfs mount. With
> containers, this is likely to be more important: if the same volume is
> mounted in two containers (nullfs mounted in two jails), one should be
> able to watch for changes made by the other.

Right now I don't expect inotify to work properly with nullfs mounts, but
I do plan to address that. That consideration is why my patch introduces
new VOPs instead of operating directly on the passed vnode: for nullfs
vnodes, I want to ensure that inotify watches are associated with the
lower vnode (i.e., the "real" vnode). So, this limitation of the Linux
implementation won't apply to FreeBSD (unless it turns out to be very
painful to make that work, but I don't expect that).

> I had pondered an implementation using two layers of bloom filters to
> track vnodes that are watched by any filter, and then specific filters,
> which would track inode numbers and do the name lookup on the slow path
> after matching, but I suspect there are some details that would make this
> hard.
>
> The approach on XNU is a lot more scalable than on Linux and seems to be
> similar to your original proposal here. The kernel has a single fsevents
> device node that tells a userspace daemon the directories that contain
> files that have been modified. When a process wants to watch for events
> in a tree, it notifies the userspace daemon, which maintains all of the
> state. There is a ring buffer between the daemon and the kernel. If the
> userspace daemon can’t keep up with kernel events, it sees non-sequential
> message numbers and falls back to examining the modification times of
> files in all watched paths to determine if any files were modified in the
> period where it missed messages.

Yes, the EVFILT_FSWATCH PoC I did was inspired by having read XNU's
fsevents implementation. I think that design is better in some ways, but
we can't implement it in a way that's usefully compatible with XNU, and
designing a new interface is quite tricky. Compatibility with inotify,
which is used widely enough that we have a userspace implementation in the
ports tree, seems a lot more valuable.

> The benefit of the XNU approach is that filesystem watching never
> backpressures the kernel.
>
> I’m not sure how such an approach would work in a jail.

Me neither. I guess you'd have to have a daemon in each jail, but this
would scale poorly if you're doing all the filtering in userspace. I saw
that XNU has a fairly small limit on the number of watchers, which makes
sense if you have a single daemon consuming events from the kernel, but
that doesn't work so well with jails/containers each subscribing
individually.
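To illustrate the lower-vnode idea (a hypothetical sketch only, not the
code under review: the VOP name and argument layout here are invented for
the example), nullfs would forward the new operation to the lower vnode,
so a watch placed through a nullfs mount and one placed on the original
path land on the same vnode:

/*
 * Hypothetical sketch: forward the (invented) inotify watch VOP to the
 * lower vnode. NULLVPTOLOWERVP() resolves a nullfs vnode to the
 * underlying "real" vnode it stacks on top of.
 */
static int
null_inotify_add_watch(struct vop_inotify_add_watch_args *ap)
{
        return (VOP_INOTIFY_ADD_WATCH(NULLVPTOLOWERVP(ap->a_vp),
            ap->a_watch));
}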
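Going back to the motivating example: with inotify, the upload case needs
no per-file descriptors at all. A rough sketch against the Linux interface
(/uploads is again a placeholder): one watch on the directory delivers
close-after-write events for every file inside it without ever opening
those files, which is what closes the EVFILT_VNODE race:

#include <sys/inotify.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        ssize_t n;
        int fd;

        if ((fd = inotify_init1(IN_CLOEXEC)) == -1)
                err(1, "inotify_init1");
        /* One watch on the directory covers every file within it. */
        if (inotify_add_watch(fd, "/uploads", IN_CLOSE_WRITE) == -1)
                err(1, "inotify_add_watch");

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
                for (char *p = buf; p < buf + n;) {
                        const struct inotify_event *ie =
                            (const struct inotify_event *)(void *)p;

                        /* ie->name is the entry that was closed. */
                        if (ie->len > 0)
                                printf("finished upload: %s\n", ie->name);
                        p += sizeof(*ie) + ie->len;
                }
        }
        return (0);
}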