Re: native inotify implementation

From: David Chisnall <theraven_at_freebsd.org>
Date: Sun, 18 May 2025 10:35:50 UTC
On 17 May 2025, at 17:56, Mark Johnston <markj@freebsd.org> wrote:
> 
> On Sat, May 17, 2025 at 05:00:48PM +0100, David Chisnall wrote:
>>> On 12 May 2025, at 21:58, Mark Johnston <markj@freebsd.org> wrote:
>>> 
>>> This work was largely motivated by a race condition in EVFILT_VNODE: in
>>> order to get events for a particular file, you first have to open it, by
>>> which point you may have missed the event(s) you care about.  For
>>> instance, if some upload service adds files to a directory, and you want
>>> to know when a new file has finished uploading, you'd have to watch the
>>> directory to get new file events, scan the directory to actually find
>>> the new file(s), open them, and then wait for NOTE_CLOSE (which might
>>> never arrive if the upload had already finished).  Aside from that, the
>>> need to hold each monitored file open is also a problem for large
>>> directory hierarchies as it's easy to exhaust file descriptor limits.
>> 
>> My experience as a user was that NOTE_CLOSE was unreliable.  I tried using it to detect when uploads had finished but I never saw it (on ZFS).  I have had producing a working reduced test case for this on my todo list for a while, but I solved my problem by writing my own sftp-server implementation that stored the received ‘file’ in a shared memory object and passed it to another process, so didn’t end up depending on this.
> 
> I don't quite follow: was your problem perhaps due to the
> above-mentioned race, or are you sure it was something else?
> 
> There is at least one related bug that I see: fds opened with O_PATH do
> not have VOP_OPEN/VOP_CLOSE invoked on the underlying vnode, so they
> don't raise NOTE_OPEN/NOTE_CLOSE kevent notifications.  I doubt this is
> related to your problem though (and I'm not certain that the current
> behaviour is really incorrect).

I am not sure whether it was this race. I had a program watching for NOTE_CLOSE_WRITE to tell when sftp-server had finished uploading a file, and it never saw the notification. It did see writes, so I fell back to waiting a second after the last write and then using libprocstat to check whether any descriptors for the file were still open, but this required root.
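
For illustration only, the "wait a second after the last write" half of that fallback might look roughly like the sketch below. The name quiet_for and its parameters are made up for this example. It only compares the modification time against a clock, so it cannot tell whether a writer still holds the file open; that was the part that needed libprocstat.

```python
import os
import time


def quiet_for(path, quiet_seconds=1.0, now=None):
    """Return True if `path` was last modified at least `quiet_seconds` ago.

    `now` is injectable so the check can be exercised without sleeping.
    This is a heuristic: a slow or stalled writer looks the same as a
    finished one.
    """
    if now is None:
        now = time.time()
    return now - os.stat(path).st_mtime >= quiet_seconds
```

A caller would poll quiet_for() after each write notification and only treat the upload as complete once it returns True and no writer is known to remain.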

>> The only way that I found on FreeBSD to determine that a file was no longer open for writing was via libprocstat, which required root.  Linux has an API for this, apparently, but I didn’t try it.

Paging some of this back in, Linux has non-advisory locks, so you can open a file with a read lock that will fail if there is an open write file descriptor (and cause attempts to open for writing to fail until the lock is dropped). This is a nice, race-free way of checking. FreeBSD has no equivalent that I could see.

>>> My initial solution was a new kqueue filter, EVFILT_FSWATCH, which lets
>>> one watch for all file events under a mountpoint.  The consumer would
>>> allocate a ring buffer with space to store paths and event metadata,
>>> register that with the kernel, and the kernel would write entries to the
>>> buffer, using reverse lookups to find a path for each event vnode.  This
>>> prototype worked, but got somewhat hairy and I decided it would be
>>> better to simply implement an existing interface: inotify already exists
>>> and is commonly used, and has a somewhat simpler model, as it merely
>>> watches for events within a particular directory.
>> 
>> 
>> I think it’s worth discussing the design a bit.  I’ve used the Linux inotify implementation a bit and the fact that it doesn’t see changes made through hard links or bind mounts is quite problematic for several use cases.  From skimming your code, it looks as if it might have the same limitation?  This means, for example, that a jailed application that watches its config files would miss notifications if they are modified via their original location rather than the nullfs mount.  With containers, this is likely to be more important: if the same volume is mounted in two containers (nullfs mounted in two jails), one should be able to watch for changes made by the other.  
> 
> Right now I don't expect inotify to work properly with nullfs mounts,
> but I do plan to address that.  That consideration is why my patch
> introduces new VOPs instead of operating directly on the passed vnode:
> for nullfs vnodes, I want to ensure that inotify watches are associated
> with the lower vnode (i.e., the "real" vnode).  So, this limitation of
> the Linux implementation won't apply to FreeBSD (unless it turns out to
> be very painful to make that work, but I don't expect that).

Great! Does this also work for hard links?

One of the use cases that GNOME folks were complaining about when I looked at this 10-15 years ago was watching a music collection that used hard links to index by different categories in the filesystem. Something that wrote an update to metadata via one link would not trigger an inotify notification for a watch on another.
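
To make the hard-link case concrete, here is a small Python sketch (the directory and file names are invented for the example). It shows that two names in different directories refer to one (device, inode) pair, so a write through one name changes the file seen through the other, even though a watch keyed on the other directory has no reason to hear about it.

```python
import os
import tempfile


def same_file(path_a, path_b):
    """Return True if both names refer to the same device and inode."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    """Create a file, hard-link it into a second directory, write via the link."""
    with tempfile.TemporaryDirectory() as root:
        by_artist = os.path.join(root, "by-artist")
        by_genre = os.path.join(root, "by-genre")
        os.mkdir(by_artist)
        os.mkdir(by_genre)

        original = os.path.join(by_artist, "track.flac")
        alias = os.path.join(by_genre, "track.flac")
        with open(original, "w") as f:
            f.write("v1")
        os.link(original, alias)  # second name, same inode

        # Modify the file through the alias only.
        with open(alias, "a") as f:
            f.write("+tags")

        # The change is visible through the original name.
        with open(original) as f:
            content = f.read()
        return same_file(original, alias), content
```

A watcher that tracks files by (st_dev, st_ino) rather than by containing directory would be able to associate the write with both names.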

>> I had pondered an implementation using two layers of bloom filters to track vnodes that are watched by any filter, and then specific filters, which would track inode numbers and do the name lookup on the slow path after matching, but I suspect there are some details that would make this hard.  
>> 
>> The approach on XNU is a lot more scalable than on Linux and seems to be similar to your original proposal here.  The kernel has a single fsevents device node that tells a userspace daemon the directories that contain files that have been modified.  When a process wants to watch for events in a tree, it notifies the userspace daemon, which maintains all of the state.  There is a ring buffer between the daemon and the kernel.  If the userspace daemon can’t keep up with kernel events, it sees non-sequential message numbers and falls back to examining the modification times of files in all watched paths to determine if any files were modified in the period where it missed messages.
> 
> Yes, the EVFILT_FSWATCH PoC I did was inspired by having read XNU's
> fsevents implementation.  I think that design is better in some ways,
> but we can't implement it in a way that's usefully compatible with XNU,
> and designing a new interface is quite tricky.  Compatibility with
> inotify, which is used widely enough that we have a userspace
> implementation in the ports tree, seems a lot more valuable.
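
As a rough illustration of the fallback described for the XNU design above (a gap in message numbers triggers a scan of modification times), here is a minimal Python sketch. The function names and the idea of passing a "since" timestamp are assumptions made for this example and are not taken from XNU. It only finds files that still exist and were modified in the window; deletions and renames would need separate handling.

```python
import os


def missed_events(last_seen, received):
    """Return True if `received` does not directly follow `last_seen`."""
    return received != last_seen + 1


def rescan(watched_dirs, since):
    """Return sorted paths under `watched_dirs` with mtime >= `since`.

    This is the slow path taken after a gap in sequence numbers is seen.
    """
    changed = []
    for top in watched_dirs:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.stat(path).st_mtime >= since:
                        changed.append(path)
                except FileNotFoundError:
                    # The file vanished between the walk and the stat.
                    pass
    return sorted(changed)
```

A consumer would remember the timestamp of the last message it processed in order, and on a gap call rescan() with that timestamp before resuming normal event handling.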

I agree that compatibility with inotify is useful, but the API has a lot of known problems that it would be good to avoid.

>> The benefit of the XNU approach is that filesystem watching never backpressures the kernel.
>> 
>> I’m not sure how such an approach would work in a jail.
> 
> Me neither.  I guess you'd have to have a daemon in each jail, but this
> would scale poorly if you're doing all the filtering in userspace.  I
> saw that XNU has a fairly small limit on the number of watchers, which
> makes sense if you have a single daemon consuming events from the
> kernel, but that doesn't work so well with jails/containers each
> subscribing individually.

I’m not sure what restrictions the XNU-level implementation enforces, but on macOS there is exactly one consumer of the kernel interface, and it is responsible for fan-out.

I would expect a jailed instance of the daemon to communicate with the instance in its parent jail.

Unfortunately, this then decays to a special case of the ‘we need a useful broadcast / multicast message bus with sensible access control that can fan out to relays for jails and users’ problem that remains the blocker for most of the things I want to build for FreeBSD.

David