Re: kqueue extensibility (Was: native inotify implementation)

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Mon, 07 Jul 2025 14:43:34 UTC
On Mon, Jul 07, 2025 at 10:31:18AM -0400, Mark Johnston wrote:
> On Sun, Jul 06, 2025 at 12:42:22AM +0300, Vadim Goncharov wrote:
> > On Sat, 5 Jul 2025 12:30:18 -0400
> > Mark Johnston <markj@freebsd.org> wrote:
> > 
> > > On Sat, Jul 05, 2025 at 03:49:46AM +0300, Vadim Goncharov wrote:
> > > > On Sat, 17 May 2025 11:18:34 -0400
> > > > Mark Johnston <markj@freebsd.org> wrote:
> > > >   
> > > > > On Fri, May 16, 2025 at 11:02:33AM -0500, Jake Freeland wrote:  
> > > > > > On Mon May 12, 2025 at 3:58 PM CDT, Mark Johnston wrote:    
> > > > > > > For the past while I've been hacking on a native implementation of
> > > > > > > Linux's inotify.  Functionality-wise, this is similar to but not
> > > > > > > quite equivalent to the EVFILT_VNODE kqueue filter.  While we
> > > > > > > already have a userspace implementation of inotify built on top of
> > > > > > > kqueue, it shares the limitations of EVFILT_VNODE, and my version
> > > > > > > can also be used in the Linuxulator.  (Please let me know if you're
> > > > > > > interested in working on that and testing it out.)  
> > > > [...]  
> > > > > > > This work was largely motivated by a race condition in EVFILT_VNODE:
> > > > > > > in order to get events for a particular file, you first have to open
> > > > > > > it, by which point you may have missed the event(s) you care about.
> > > > > > > For instance, if some upload service adds files to a directory, and
> > > > > > > you want to know when a new file has finished uploading, you'd have
> > > > > > > to watch the directory to get new file events, scan the directory to
> > > > > > > actually find the new file(s), open them, and then wait for
> > > > > > > NOTE_CLOSE (which might never arrive if the upload had already
> > > > > > > finished).  Aside from that, the need to hold each monitored file
> > > > > > > open is also a problem for large directory hierarchies as it's easy
> > > > > > > to exhaust file descriptor limits.
> > > > > > >
> > > > > > > My initial solution was a new kqueue filter, EVFILT_FSWATCH, which
> > > > > > > lets one watch for all file events under a mountpoint.  The consumer
> > > > > > > would allocate a ring buffer with space to store paths and event
> > > > > > > metadata, register that with the kernel, and the kernel would write
> > > > > > > entries to the buffer, using reverse lookups to find a path for each
> > > > > > > event vnode.  This prototype worked, but got somewhat hairy and I
> > > > > > > decided it would be better to simply implement an existing
> > > > > > > interface: inotify already exists and is commonly used, and has a
> > > > > > > somewhat simpler model, as it merely watches for events within a
> > > > > > > particular directory.    
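
For illustration only, a minimal sketch of the upload case above written
against the Linux inotify API, on the assumption that the native
implementation exposes the same <sys/inotify.h> calls.  The dir_watch
structure and the function names are made up for this example.

```c
#include <sys/inotify.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper state: one inotify descriptor plus a read buffer. */
struct dir_watch {
	_Alignas(struct inotify_event) char buf[4096];
	size_t	off, len;	/* unread span of buf */
	int	fd;
};

/* Watch `dir` for files that are closed after being open for writing. */
int
dir_watch_open(struct dir_watch *w, const char *dir)
{
	w->off = w->len = 0;
	w->fd = inotify_init1(IN_CLOEXEC);
	if (w->fd == -1) {
		perror("inotify_init1");
		return (-1);
	}
	if (inotify_add_watch(w->fd, dir, IN_CLOSE_WRITE) == -1) {
		perror("inotify_add_watch");
		close(w->fd);
		return (-1);
	}
	return (0);
}

/*
 * Block until the next IN_CLOSE_WRITE event and return the file name it
 * carries, or NULL on error.  The pointer refers to w->buf and is only
 * valid until the next call.
 */
const char *
dir_watch_next(struct dir_watch *w)
{
	struct inotify_event *ev;
	ssize_t n;

	for (;;) {
		if (w->off == w->len) {
			n = read(w->fd, w->buf, sizeof(w->buf));
			if (n <= 0)
				return (NULL);
			w->off = 0;
			w->len = (size_t)n;
		}
		ev = (struct inotify_event *)(void *)(w->buf + w->off);
		w->off += sizeof(*ev) + ev->len;
		if ((ev->mask & IN_CLOSE_WRITE) != 0 && ev->len > 0)
			return (ev->name);
	}
}
```

The watch is on the directory itself, so no descriptor is held per file
and a file that is created and closed after dir_watch_open() returns is
still reported.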
> > > > > > 
> > > > > > I've found that more and more developers are blindly using
> > > > > > Linux-specific interfaces these days, so +1 for natively supporting
> > > > > > another one.
> > > > > > 
> > > > > > The more support we have for these, the easier porting/Linux emulation
> > > > > > > is. I think the benefits of this far outweigh the cost of maintaining
> > > > > > the code.    
> > > > > 
> > > > > I think so too.  My perspective is that we should implement widely used
> > > > > Linux interfaces as part of the larger goal of making existing software
> > > > > usable on FreeBSD.  This is more important than the purity of the
> > > > > kernel's interfaces or architecture, at least up to a certain point.
> > > > > 
> > > > > The whole purpose of an OS is to let users run the programs they want to
> > > > > run, without getting in the way (too much).  
> > > > 
> > > > > Yes, and no. While it's often useful in the short term, such an
> > > > > approach leaves FreeBSD without unique features, so it becomes yet another
> > > > > "Linux, just poorer", with the obvious question "why choose it?". It's
> > > > > understandable that in some cases it is simple to implement a compatible
> > > > > API, but an alternative like "have a more general solution with a
> > > > > compatibility shim layer via which their API is implemented" is better,
> > > > > when possible.  
> > > 
> > > Sure, but so far there is no clear description of a more general
> > > solution, and the shortcomings of EVFILT_VNODE have been known for a
> > > long time.
> > 
> > I'm not blaming anyone (an interface better than _vnode is a win for FreeBSD
> > now). I also see the limitations of kqueue; however, it's very attractive and
> > much better than the Linux zoo of epoll() + xxxfd() kludges. So let's talk
> > about extending it below...
> > 
> > > There's also nothing precluding this inotify implementation from being
> > > extended or replaced, just so long as a compatible implementation can be
> > > provided in libc.
> > > 
> > > > It's too late for this particular topic since the commit has landed, but
> > > > for the future we should think about how to extend kqueue to do more.  
> > > 
> > > As I mentioned in my original email, that's what I tried to do first.
> > > It is immediately more complicated than inotify since kevent() doesn't
> > > have a good way to return arbitrary data (particularly file names and
> > > paths) to userspace.  It is possible if we make kevent() write to a user
> > > pointer embedded in the knote, but it's not simple.  I note that XNU
> > 
> > Yeah, that's the way Windows took in its WM_* messages.
> > 
> > > also does not use kqueue for this purpose, and I'm skeptical that it's
> > > the right substrate for a file monitoring interface.
> > 
> > Well, this is one problem with it (another is below). Let's discuss an idea,
> > kind of brainstorming (not necessarily final)... David talked about a message
> > bus, and while it's doubtful that kqueue() is the right place for it, this
> > leads to the idea of variable-length messages (X11, for example, also has an
> > extension of this kind). How could this be implemented? Suppose there is a
> > flag in `flags` which says that this is not a complete `struct kevent` but the
> > start of a series of them, e.g. a number in one of the fields says 3 for three
> > structs in total (192 bytes). Then the application knows it must read 2 more
> > `struct kevent`s (if it has not already), which must be placed consecutively
> > in the array, so that a cast to `struct longXXXevent *` can be performed.
> > `struct longXXXevent` starts with fields identical to `struct kevent`, but
> > these are not repeated, i.e. the remaining two kevents would be raw data
> > instead of real kevents. Such a longXXXevent most certainly contains a
> > char[0] or uint8_t[0] as its last field for the variable-length data.
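
To make the proposed layout concrete, here is a rough userspace sketch of
how a consumer might walk such an array.  Everything in it is an
assumption for illustration: struct kev is a stand-in that mirrors the
64-byte struct kevent of 64-bit FreeBSD, KEV_LONG is a hypothetical flag
bit, and the choice of the data field to carry the slot count is
arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct kevent (64 bytes on LP64), not the real definition. */
struct kev {
	uintptr_t	ident;
	int16_t		filter;
	uint16_t	flags;
	uint32_t	fflags;
	int64_t		data;
	void		*udata;
	uint64_t	ext[4];
};

#if UINTPTR_MAX == UINT64_MAX
_Static_assert(sizeof(struct kev) == 64, "three slots must be 192 bytes");
#endif

#define	KEV_LONG	0x0400	/* hypothetical "first slot of a series" flag */

/* Header slot followed by raw payload occupying the following slots. */
struct long_event {
	struct kev	hdr;		/* hdr.data = total slots, header included */
	uint8_t		payload[];	/* (hdr.data - 1) * sizeof(struct kev) bytes */
};

/*
 * Number of array slots used by the event at evs[i] (1 for a plain event).
 * Returns 0 if the series is malformed or its tail lies beyond nevs, i.e.
 * the event was split and the rest has to come from the next call.
 * The caller guarantees i < nevs.
 */
size_t
event_slots(const struct kev *evs, size_t nevs, size_t i)
{
	size_t n = 1;

	if ((evs[i].flags & KEV_LONG) != 0) {
		if (evs[i].data < 1)
			return (0);
		n = (size_t)evs[i].data;
	}
	return (n <= nevs - i ? n : 0);
}

/* Payload of a long event that occupies `slots` consecutive entries. */
const uint8_t *
event_payload(const struct kev *ev, size_t slots, size_t *lenp)
{
	const struct long_event *lev = (const struct long_event *)(const void *)ev;

	*lenp = (slots - 1) * sizeof(struct kev);
	return (lev->payload);
}
```

A consumer loop would advance its index by event_slots() instead of by
one, and would treat a return value of 0 as the split case that needs
the memmove() handling mentioned below.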
> 
> I don't think this idea really works.  Normally, kevent() doesn't copy
> in the eventlist, it's an output-only parameter.  So now we'd have to
> copy in the event list, see which event structures are "extended", and
> be careful when copying out.  It could be implemented but IMO it's not
> in the spirit of the interface.
> 
> In my EVFILT_FSWATCH prototype, I used the "ident" field of the kevent
> to store a user pointer; when the filter's f_touch function is called,
> it copies out any pending data to that pointer and activates the knote.
> When the application returns from kevent() it has to process all the
> data that was written, then it can call kevent() again and get fresh
> event descriptions.
> 
> But, what's the real advantage of this approach over defining a new fd
> type that you can read() from to get data?  It saves an extra system
> call, but is that the primary goal?

I think this does not matter much.  What is fatal with kqueue in many
situations is that a kqueue fd is not passable, and that cannot be fixed
by a simple patch to the API.
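
For readers unfamiliar with the term, "passable" refers to sending a
descriptor to another process over a Unix-domain socket with SCM_RIGHTS.
A minimal sketch of that mechanism follows; send_fd() and recv_fd() are
names made up for the example.  It works for ordinary descriptors such as
sockets and pipes, while a kqueue descriptor is expected to be rejected
by sendmsg() on FreeBSD.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	struct cmsghdr	hdr;
	char		buf[CMSG_SPACE(sizeof(int))];
};

/* Send `fd` over the connected Unix-domain socket `sock`; 0 on success. */
int
send_fd(int sock, int fd)
{
	union fd_cmsg u;
	char byte = 0;		/* at least one data byte must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive one descriptor from `sock`; returns it, or -1 on failure. */
int
recv_fd(int sock)
{
	union fd_cmsg u;
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return (fd);
}
```

A dedicated fd type that can be read() from, as suggested in the quoted
text, could in principle be handed to another process this way, which a
kqueue cannot.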

> 
> > Of course, the application must indicate to the kernel that it's prepared to
> > receive such "train wagons" of events, and must be ready to memmove() head and
> > tail events if one got split between kevent() calls.
> > 
> > For the kernel, it's mostly straightforward to split long structs into several
> > kevents, post them to the queue and forget about them, except for the problem
> > of automatically deleting unread events from the kqueue, e.g. on close(),
> > especially when a split one has been partially read (though I think it's
> > enough to track only the head: if the app began reading, then let it finish
> > and be prepared for races).
> > 
> > > > [E.g. I'd want to have notifications for my protocol with multiple streams
> > > > inside one socket (think like QUIC), but it does not fit nicely into
> > > > current struct kevent or socket API (multiple socket buffers with separate
> > > > reading)]  
> > 
> > Another problem is the fixed set of filters. It is not possible for a KLD to
> > register its own EVFILT_XXX so that software from ports could be used on a
> > GENERIC kernel without recompiling. Probably khelp(9) could be a solution
> > here, but I'm not familiar with this subsystem, and it seems it is not very
> > straightforward to add such support.
> 
> Well, the EVFILT number is exposed to userspace, so it needs to be
> stable to preserve the ABI.  I can imagine some dynamic filter registry,
> where the application uses sysctl() to resolve the filter name to an ID
> before first use.
> 
> But, if your KLD can create its own fds, then you can define your own
> behaviour for standard fd-based filters, i.e., EVFILT_READ etc..  So
> rather than creating new filter types, it seems more attractive to
> define new functionality using file descriptors and just use existing
> filters.