Re: kqueue extensibility (Was: native inotify implementation)

From: Mark Johnston <markj_at_freebsd.org>
Date: Mon, 07 Jul 2025 14:31:18 UTC
On Sun, Jul 06, 2025 at 12:42:22AM +0300, Vadim Goncharov wrote:
> On Sat, 5 Jul 2025 12:30:18 -0400
> Mark Johnston <markj@freebsd.org> wrote:
> 
> > On Sat, Jul 05, 2025 at 03:49:46AM +0300, Vadim Goncharov wrote:
> > > On Sat, 17 May 2025 11:18:34 -0400
> > > Mark Johnston <markj@freebsd.org> wrote:
> > >   
> > > > On Fri, May 16, 2025 at 11:02:33AM -0500, Jake Freeland wrote:  
> > > > > On Mon May 12, 2025 at 3:58 PM CDT, Mark Johnston wrote:    
> > > > > > For the past while I've been hacking on a native implementation of
> > > > > > Linux's inotify.  Functionality-wise, this is similar to but not
> > > > > > quite equivalent to the EVFILT_VNODE kqueue filter.  While we
> > > > > > already have a userspace implementation of inotify built on top of
> > > > > > kqueue, it shares the limitations of EVFILT_VNODE, and my version
> > > > > > can also be used in the Linuxulator.  (Please let me know if you're
> > > > > > interested in working on that and testing it out.)  
> > > [...]  
> > > > > > This work was largely motivated by a race condition in EVFILT_VNODE:
> > > > > > in order to get events for a particular file, you first have to open
> > > > > > it, by which point you may have missed the event(s) you care about.
> > > > > > For instance, if some upload service adds files to a directory, and
> > > > > > you want to know when a new file has finished uploading, you'd have
> > > > > > to watch the directory to get new file events, scan the directory to
> > > > > > actually find the new file(s), open them, and then wait for
> > > > > > NOTE_CLOSE (which might never arrive if the upload had already
> > > > > > finished).  Aside from that, the need to hold each monitored file
> > > > > > open is also a problem for large directory hierarchies as it's easy
> > > > > > to exhaust file descriptor limits.
> > > > > >
> > > > > > My initial solution was a new kqueue filter, EVFILT_FSWATCH, which
> > > > > > lets one watch for all file events under a mountpoint.  The consumer
> > > > > > would allocate a ring buffer with space to store paths and event
> > > > > > metadata, register that with the kernel, and the kernel would write
> > > > > > entries to the buffer, using reverse lookups to find a path for each
> > > > > > event vnode.  This prototype worked, but got somewhat hairy and I
> > > > > > decided it would be better to simply implement an existing
> > > > > > interface: inotify already exists and is commonly used, and has a
> > > > > > somewhat simpler model, as it merely watches for events within a
> > > > > > particular directory.    
> > > > > 
> > > > > I've found that more and more developers are blindly using
> > > > > Linux-specific interfaces these days, so +1 for natively supporting
> > > > > another one.
> > > > > 
> > > > > The more support we have for these, the easier porting/Linux emulation
> > > > > > is. I think the benefits of this far outweigh the cost of maintaining
> > > > > the code.    
> > > > 
> > > > I think so too.  My perspective is that we should implement widely used
> > > > Linux interfaces as part of the larger goal of making existing software
> > > > usable on FreeBSD.  This is more important than the purity of the
> > > > kernel's interfaces or architecture, at least up to a certain point.
> > > > 
> > > > The whole purpose of an OS is to let users run the programs they want to
> > > > run, without getting in the way (too much).  
> > > 
> > > > Yes and no. While that's often useful in the short term, such an
> > > > approach leaves FreeBSD without unique features, so it becomes yet
> > > > another "Linux, just poorer", with the obvious follow-up "why choose
> > > > it?". It's understandable that in some cases it is simpler to implement
> > > > a compatible API, but an alternative like "have a more general solution
> > > > with a compatibility shim layer via which their API is implemented" is
> > > > better, when possible.  
> > 
> > Sure, but so far there is no clear description of a more general
> > solution, and the shortcomings of EVFILT_VNODE have been known for a
> > long time.
> 
> I'm not blaming anyone (an interface better than _vnode is a win for FreeBSD
> now). I also see the limitations of kqueue; however, it's very attractive and
> much better than the Linux zoo of epoll() + xxxfd() kludges. So let's talk
> about extending it below...
> 
> > There's also nothing precluding this inotify implementation from being
> > extended or replaced, just so long as a compatible implementation can be
> > provided in libc.
> > 
> > > It's too late for this particular topic, as the commit has landed, but
> > > for the future we should think about how to extend kqueue to do more.  
> > 
> > As I mentioned in my original email, that's what I tried to do first.
> > It is immediately more complicated than inotify since kevent() doesn't
> > have a good way to return arbitrary data (particularly file names and
> > paths) to userspace.  It is possible if we make kevent() write to a user
> > pointer embedded in the knote, but it's not simple.  I note that XNU
> 
> Yeah, that's the way Windows took with its WM_* messages.
> 
> > also does not use kqueue for this purpose, and I'm skeptical that it's
> > the right substrate for a file monitoring interface.
> 
> Well, this is one problem with it (another below). Let's discuss an idea,
> as a kind of brainstorming (not necessarily final)... David talked about a
> message bus, and while it's doubtful that kqueue() is the right place for it,
> this leads to the idea of variable-length messages (X11, for example, also
> has an extension of this kind). How could this be implemented? Suppose there
> is a flag for `flags` which says this is not a complete `struct kevent` but
> the head of a series of them, e.g. a number in one of the fields indicates 3
> for three structs in total (192 bytes). Then the application knows it must
> read 2 more `struct kevent`s (if it has not already), which must be placed
> consecutively in the array, so that a cast to `struct longXXXevent *` can
> then be performed. `struct longXXXevent` begins with fields identical to
> `struct kevent`, but these are not repeated, i.e. the remaining two kevents
> would be raw data instead of real kevents. Such a longXXXevent would most
> likely have char[0] or uint8_t[0] as its last field for variable-length data.

I don't think this idea really works.  Normally, kevent() doesn't copy
in the eventlist, it's an output-only parameter.  So now we'd have to
copy in the event list, see which event structures are "extended", and
be careful when copying out.  It could be implemented but IMO it's not
in the spirit of the interface.
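
For concreteness, the consumer side of that proposal might look roughly
like the following.  This is only a sketch under stated assumptions:
struct kev is a stand-in that mirrors the LP64 layout of struct kevent so
that it builds anywhere, and XEV_LONG and the use of ext[0] as the slot
count are made up for illustration; neither exists in FreeBSD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in with the field layout of struct kevent on LP64 (64 bytes). */
struct kev {
	uintptr_t	ident;
	short		filter;
	unsigned short	flags;
	unsigned int	fflags;
	int64_t		data;
	void		*udata;
	uint64_t	ext[4];
};

#define	XEV_LONG	0x0800	/* HYPOTHETICAL: head of a multi-slot event */

/* Slots occupied by the event at ev; the count in ext[0] is an assumption. */
static size_t
kev_slots(const struct kev *ev)
{
	if ((ev->flags & XEV_LONG) == 0 || ev->ext[0] == 0)
		return (1);
	return ((size_t)ev->ext[0]);
}

/* Copy the raw payload (the slots after the head) out of the eventlist. */
static size_t
kev_copy_payload(const struct kev *ev, void *dst, size_t dstlen)
{
	size_t len = (kev_slots(ev) - 1) * sizeof(*ev);

	assert(dst != NULL);
	if (len > dstlen)
		len = dstlen;
	memcpy(dst, ev + 1, len);
	return (len);
}

/* Count logical events in an eventlist, stepping over payload slots. */
static size_t
kev_count_logical(const struct kev *list, size_t nslots)
{
	size_t i, n;

	for (i = 0, n = 0; i < nslots; i += kev_slots(&list[i]))
		n++;
	return (n);
}
```

A four-slot eventlist holding one three-slot event and one ordinary event
would be walked as two logical events.  Nothing here handles a series that
is cut off by the end of the eventlist; the caller would have to check that
itself before touching the payload.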

In my EVFILT_FSWATCH prototype, I used the "ident" field of the kevent
to store a user pointer; when the filter's f_touch function is called,
it copies out any pending data to that pointer and activates the knote.
When the application returns from kevent() it has to process all the
data that was written, then it can call kevent() again and get fresh
event descriptions.

But, what's the real advantage of this approach over defining a new fd
type that you can read() from to get data?  It saves an extra system
call, but is that the primary goal?

> Of course, the application must indicate to the kernel that it's prepared to
> receive such "train wagons" of events, and it must be ready to memmove() the
> head and tail events if a series gets split between kevent() calls.
> 
> For the kernel, it is mostly straightforward to split long structs into
> several kevents, post them to the queue and forget about them, except for
> the problem of automatically deleting unread events from the kqueue, e.g. on
> close(), especially when a split event has been partially read (though I
> think it's enough to track the head only: if the app began reading, then let
> it finish and be prepared for races).
> 
> > > [E.g. I'd want to have notifications for my protocol with multiple streams
> > > inside one socket (think like QUIC), but it does not fit nicely into
> > > current struct kevent or socket API (multiple socket buffers with separate
> > > reading)]  
> 
> Another problem is the fixed set of filters. It is not possible for a KLD to
> register its own EVFILT_XXX so that software from ports could be used on a
> GENERIC kernel without recompiling. Probably khelp(9) could be a solution
> here, but I'm not familiar with this subsystem and it seems it is not very
> straightforward to add such support.

Well, the EVFILT number is exposed to userspace, so it needs to be
stable to preserve the ABI.  I can imagine some dynamic filter registry,
where the application uses sysctl() to resolve the filter name to an ID
before first use.

But, if your KLD can create its own fds, then you can define your own
behaviour for standard fd-based filters, i.e., EVFILT_READ etc.  So
rather than creating new filter types, it seems more attractive to
define new functionality using file descriptors and just use existing
filters.