Re: kqueue extensibility (Was: native inotify implementation)

From: Vadim Goncharov <vadimnuclight_at_gmail.com>
Date: Sat, 12 Jul 2025 17:58:11 UTC
On Mon, 7 Jul 2025 10:31:18 -0400
Mark Johnston <markj@freebsd.org> wrote:

> On Sun, Jul 06, 2025 at 12:42:22AM +0300, Vadim Goncharov wrote:
> > On Sat, 5 Jul 2025 12:30:18 -0400
> > Mark Johnston <markj@freebsd.org> wrote:
> >   
> > > On Sat, Jul 05, 2025 at 03:49:46AM +0300, Vadim Goncharov wrote:  
> > > > On Sat, 17 May 2025 11:18:34 -0400
> > > > Mark Johnston <markj@freebsd.org> wrote:
> > > >     
> > > > > On Fri, May 16, 2025 at 11:02:33AM -0500, Jake Freeland wrote:    
> > > > > > On Mon May 12, 2025 at 3:58 PM CDT, Mark Johnston wrote:      
> > > > > > > For the past while I've been hacking on a native implementation
> > > > > > > of Linux's inotify.  Functionality-wise, this is similar to but
> > > > > > > not quite equivalent to the EVFILT_VNODE kqueue filter.  While we
> > > > > > > already have a userspace implementation of inotify built on top
> > > > > > > of kqueue, it shares the limitations of EVFILT_VNODE, and my
> > > > > > > version can also be used in the Linuxulator.  (Please let me
> > > > > > > know if you're interested in working on that and testing it
> > > > > > > out.)    
> > > > [...]    
> > > > > > > This work was largely motivated by a race condition in
> > > > > > > EVFILT_VNODE: in order to get events for a particular file, you
> > > > > > > first have to open it, by which point you may have missed the
> > > > > > > event(s) you care about. For instance, if some upload service
> > > > > > > adds files to a directory, and you want to know when a new file
> > > > > > > has finished uploading, you'd have to watch the directory to get
> > > > > > > new file events, scan the directory to actually find the new
> > > > > > > file(s), open them, and then wait for NOTE_CLOSE (which might
> > > > > > > never arrive if the upload had already finished).  Aside from
> > > > > > > that, the need to hold each monitored file open is also a
> > > > > > > problem for large directory hierarchies as it's easy to exhaust
> > > > > > > file descriptor limits.
> > > > > > >
> > > > > > > My initial solution was a new kqueue filter, EVFILT_FSWATCH,
> > > > > > > which lets one watch for all file events under a mountpoint.
> > > > > > > The consumer would allocate a ring buffer with space to store
> > > > > > > paths and event metadata, register that with the kernel, and the
> > > > > > > kernel would write entries to the buffer, using reverse lookups
> > > > > > > to find a path for each event vnode.  This prototype worked, but
> > > > > > > got somewhat hairy and I decided it would be better to simply
> > > > > > > implement an existing interface: inotify already exists and is
> > > > > > > commonly used, and has a somewhat simpler model, as it merely
> > > > > > > watches for events within a particular directory.      
> > > > > > 
> > > > > > I've found that more and more developers are blindly using
> > > > > > Linux-specific interfaces these days, so +1 for natively supporting
> > > > > > another one.
> > > > > > 
> > > > > > The more support we have for these, the easier porting/Linux
> > > > > > emulation is. I think the benefits of this far outweighs the cost
> > > > > > of maintaining the code.      
> > > > > 
> > > > > I think so too.  My perspective is that we should implement widely
> > > > > used Linux interfaces as part of the larger goal of making existing
> > > > > software usable on FreeBSD.  This is more important than the purity
> > > > > of the kernel's interfaces or architecture, at least up to a certain
> > > > > point.
> > > > > 
> > > > > The whole purpose of an OS is to let users run the programs they
> > > > > want to run, without getting in the way (too much).    
> > > > 
> > > > Yes, and no. While it's often useful in short-term perspective, such
> > > > approach leaves FreeBSD without unique features so it becomes yet
> > > > another "Linux, just poorer" with obvious then "why choose it?". It's
> > > > understandable that in some cases it is simple to implement compatible
> > > > API, but an alternative like "have more general solution with a
> > > > compatibility shim layer via which their API is implemented" is better,
> > > > when possible.    
> > > 
> > > Sure, but so far there is no clear description of a more general
> > > solution, and the shortcomings of EVFILT_VNODE have been known for a
> > > long time.  
> > 
> > I am not to blame (better than _vnode interface is a win for FreeBSD now),
> > I also see limitations of kqueue - however, it's very attractive and much
> > better than Linux zoo mess of epoll() + xxxfd() kludges. So let's talk
> > about extending it below...
> >   
> > > There's also nothing precluding this inotify implementation from being
> > > extended or replaced, just so long as a compatible implementation can be
> > > provided in libc.
> > >   
> > > > It's late in which particular topic as commit was landed, but for
> > > > future we should think how to extend kqueue to be able more.    
> > > 
> > > As I mentioned in my original email, that's what I tried to do first.
> > > It is immediately more complicated than inotify since kevent() doesn't
> > > have a good way to return arbitrary data (particularly file names and
> > > paths) to userspace.  It is possible if we make kevent() write to a user
> > > pointer embedded in the knote, but it's not simple.  I note that XNU  
> > 
> > Yeah, that's the way Windows took in its WM_* messages.
> >   
> > > also does not use kqueue for this purpose, and I'm skeptical that it's
> > > the right substrate for a file monitoring interface.  
> > 
> > Well, this is one problem of it (other below), let's discuss an idea,
> > kinda brain-storming (not necessarily final)... David talked about
> > message-bus, and while it's doubtful the kqueue() is the right place for
> > it, this induces to idea of variable-length messages (also e.g. X11 has an
> > extension of such kind). How this could be implemented? Suppose there is a
> > flag for `flags` which tells this is not complete `struct kevent` but a
> > series for it - e.g. a number in one of fields indicate 3 for three
> > structs in total (192 bytes). Then application knows it must read 2 more
> > `struct kevent`'s (if did not already) which must be placed consecutively
> > in array, so that then a cast to `struct longXXXevent *` can be performed.
> > Then, `struct longXXXevent` contains first fields identical to `struct
> > kevent` but these are not repeated - e.g. rest two kevents would be raw
> > data instead of real kevents. Such longXXXevent most certainly contains
> > char[0] or uint8_t[0] as its last field for variable-length data.  
> 
> I don't think this idea really works.  Normally, kevent() doesn't copy
> in the eventlist, it's an output-only parameter.  So now we'd have to
> copy in the event list, see which event structures are "extended", and
> be careful when copying out.  It could be implemented but IMO it's not
> in the spirit of the interface.

By "copy in" do you mean the userland -> kernel direction? I was thinking
only of the opposite direction, kernel -> userland.
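
To make the kernel -> userland direction concrete, here is a minimal sketch
of how a consumer might walk an event array that contains such multi-slot
events. Everything in it is an assumption for illustration: `struct
mock_kevent` merely mirrors the 64-byte layout of `struct kevent` with
fixed-width types, and `MOCK_EV_LONG`, `struct mock_longevent`, `nslots` and
`paylen` are hypothetical names, not existing kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct kevent (64 bytes on 64-bit FreeBSD); NOT <sys/event.h>. */
struct mock_kevent {
    uint64_t ident;
    int16_t  filter;
    uint16_t flags;
    uint32_t fflags;
    int64_t  data;
    uint64_t udata;
    uint64_t ext[4];
};
static_assert(sizeof(struct mock_kevent) == 64, "one slot must be 64 bytes");

#define MOCK_EV_LONG 0x0800u /* hypothetical flag: head of a multi-slot event */

/* Hypothetical long event: same leading fields, then length info and raw data. */
struct mock_longevent {
    uint64_t ident;
    int16_t  filter;
    uint16_t flags;
    uint32_t fflags;
    int64_t  data;
    uint64_t udata;
    uint32_t nslots;    /* total 64-byte slots, including the head */
    uint32_t paylen;    /* number of valid payload bytes */
    uint8_t  payload[]; /* variable-length data spills into the next slots */
};
#define LONGEVENT_HDR offsetof(struct mock_longevent, payload)

/* Number of 64-byte slots needed to carry paylen bytes of payload. */
static size_t longevent_slots(size_t paylen)
{
    return (LONGEVENT_HDR + paylen + sizeof(struct mock_kevent) - 1) /
        sizeof(struct mock_kevent);
}

/* Producer side: lay one long event over consecutive slots; -1 if no room. */
static int longevent_pack(struct mock_kevent *slots, size_t avail,
    uint64_t ident, const void *payload, uint32_t paylen)
{
    size_t need = longevent_slots(paylen);
    struct mock_longevent hdr;

    if (need > avail)
        return -1;
    memset(slots, 0, need * sizeof(*slots));
    memset(&hdr, 0, sizeof(hdr));
    hdr.ident = ident;
    hdr.flags = MOCK_EV_LONG;
    hdr.nslots = (uint32_t)need;
    hdr.paylen = paylen;
    memcpy(slots, &hdr, LONGEVENT_HDR);
    if (paylen != 0)
        memcpy((unsigned char *)slots + LONGEVENT_HDR, payload, paylen);
    return (int)need;
}

/* Consumer side: return a pointer to the payload bytes and their length. */
static const unsigned char *longevent_payload(const struct mock_kevent *head,
    uint32_t *paylen)
{
    const unsigned char *raw = (const unsigned char *)head;

    memcpy(paylen, raw + offsetof(struct mock_longevent, paylen),
        sizeof(*paylen));
    return raw + LONGEVENT_HDR;
}

/*
 * Walk n returned slots. Returns the index of the first slot not consumed:
 * n if every event was complete, or the index of a head whose tail has not
 * arrived yet (the caller would memmove() it to the front and read more).
 */
static size_t longevent_consume(const struct mock_kevent *evs, size_t n,
    size_t *nevents)
{
    size_t i = 0;

    *nevents = 0;
    while (i < n) {
        size_t span = 1;
        if (evs[i].flags & MOCK_EV_LONG) {
            uint32_t nslots;
            memcpy(&nslots, (const unsigned char *)&evs[i] +
                offsetof(struct mock_longevent, nslots), sizeof(nslots));
            if (nslots == 0 || nslots > n - i)
                break; /* malformed, or split across two reads */
            span = nslots;
        }
        (*nevents)++;
        i += span;
    }
    return i;
}
```

Note that nothing here requires copying the eventlist into the kernel; the
only extra work on the consumer side is skipping `nslots` entries and
handling a head whose tail is still missing.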

> In my EVFILT_FSWATCH prototype, I used the "ident" field of the kevent

Why not the newer "extension" fields (ext[])?

> to store a user pointer; when the filter's f_touch function is called,
> it copies out any pending data to that pointer and activates the knote.
> When the application returns from kevent() it has to process all the
> data that was written, then it can call kevent() again and get fresh
> event descriptions.

I'm not familiar with how the kernel -> userland queue is implemented in the
kernel. I can only say here that the Win32 API has always had pointers in its
WM_* messages too, but pointers into userspace are error-prone...

> But, what's the real advantage of this approach over defining a new fd
> type that you can read() from to get data?  It saves an extra system
> call, but is that the primary goal?

Unified event handling loops, of course - e.g. for the aforementioned Windows
WM_* messages this is even more important. See also below for musctp_*.

> > Of course, application must indicate to kernel it's prepared to receive
> > such "train wagons" of events and must be ready to memmove() head and tail
> > events if it got split between kevent() calls.
> > 
> > For kernel, it mostly straightforward to split long structs to several
> > kevents and post them to queue and forget, except the problem of
> > automatically deleting unread events from kqueue e.g. on close(),
> > especially when split one are partially read (though I think it's enough
> > to track head only, if app began reading, then let's it finish and be
> > prepared for races). 
> > > > [E.g. I'd want to have notifications for my protocol with multiple
> > > > streams inside one socket (think like QUIC), but it does not fit
> > > > nicely into current struct kevent or socket API (multiple socket
> > > > buffers with separate reading)]    
> > 
> > Another problem is fixed set of filters. It is not possible for a KLD to
> > register its own EVFILT_XXX so that software from ports could be used on a
> > GENERIC kernel without recompiling. Probably khelp(9) could be a solution
> > here, but I'm not familiar with this subsystem and seems it is not very
> > straightforward to add such support.  
> 
> Well, the EVFILT number is exposed to userspace, so it needs to be
> stable to preserve the ABI.  I can imagine some dynamic filter registry,
> where the application uses sysctl() to resolve the filter name to an ID
> before first use.

Sounds reasonable.
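
For illustration, a minimal sketch of what the application side of such a
registry lookup might look like. No such registry exists today, so all names
here are assumptions: the node names under `kern.kqueue.filter.`, the IDs,
and `resolve_filter()` are invented, and `mock_sysctlbyname()` is a userland
stub with the same argument shape as sysctlbyname(3) so that the sketch runs
anywhere.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mock registry standing in for the kernel; names and numbers are invented. */
struct mock_filter {
    const char *name;
    short       id;
};
static const struct mock_filter mock_registry[] = {
    { "kern.kqueue.filter.musctp",  -64 },
    { "kern.kqueue.filter.fswatch", -65 },
};

/* Same argument shape as sysctlbyname(3), but served from the table above. */
static int mock_sysctlbyname(const char *name, void *oldp, size_t *oldlenp,
    const void *newp, size_t newlen)
{
    size_t i;

    (void)newp;   /* read-only lookup: the "new value" arguments are unused */
    (void)newlen;
    for (i = 0; i < sizeof(mock_registry) / sizeof(mock_registry[0]); i++) {
        if (strcmp(name, mock_registry[i].name) != 0)
            continue;
        if (oldp == NULL || oldlenp == NULL || *oldlenp < sizeof(short)) {
            errno = ENOMEM;
            return -1;
        }
        memcpy(oldp, &mock_registry[i].id, sizeof(short));
        *oldlenp = sizeof(short);
        return 0;
    }
    errno = ENOENT;
    return -1;
}

/*
 * Resolve a filter name to its number once, before first use.
 * Returns 0 if the filter is not registered; the existing EVFILT_*
 * constants are small negative numbers, so 0 is free as a sentinel.
 */
static short resolve_filter(const char *name)
{
    short id = 0;
    size_t len = sizeof(id);

    if (mock_sysctlbyname(name, &id, &len, NULL, 0) != 0)
        return 0;
    return id;
}
```

The application would call `resolve_filter()` once at startup, cache the
result, and then use it as the `filter` member of its kevents.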

> But, if your KLD can create its own fds, then you can define your own
> behaviour for standard fd-based filters, i.e., EVFILT_READ etc..  So
> rather than creating new filter types, it seems more attractive to
> define new functionality using file descriptors and just use existing
> filters.

OK, let's take as an example the protocol I'm designing: an L4 protocol
[muSCTP] which, like QUIC or SCTP, supports multiple independent streams
inside one connection, but aims to overcome many QUIC/SCTP problems and
limitations. Ideally it would live in the kernel (in contrast to QUIC) yet
be built from ports and usable on a GENERIC kernel. For example, an
application must be able to read each stream independently of the others
(SCTP does not allow even such an elementary thing):

    int musctp_read(int fd, StreamID which_stream, void *buf, size_t len);

So, for this, kevent() must be able to report not only the fd, but also the
stream on which the event happened, along with the bytes available, and
maybe other metadata too. For example, ideally I would take an idea from
[SST] and make `StreamID` not just an `int` but a real path in a tree of
streams, and report not only the bytes available but also e.g. a Message ID
and a priority.

All of this won't necessarily fit into the standard `struct kevent`, so what
is available - or will be available - in the API influences the protocol
architecture here. Of course, there are other problems too, e.g. the kernel
socket structure assumes only two `sockbuf`s (inbound and outbound), not
multiple streams...
So knowing to what degree kqueue() could be extended would be beneficial.
One of the goals is to provide a simple-to-use modern API, and given that
e.g. SCTP's API is already very big and a PITA by itself, additional
descriptors would be too stone-age and fragile.
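
To put numbers on the size constraint, here is a rough sketch of how
fixed-size per-stream metadata could ride in the existing ext[] fields. It
assumes, as kqueue(2) describes, that only ext[0] and ext[1] are
filter-defined while ext[2] and ext[3] are passed through for the
application. `struct mock_kevent` is a portable stand-in for `struct
kevent`, and `struct stream_event`, its field widths, and the 16-bit path
components are hypothetical choices, not the muSCTP design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct kevent (64 bytes on 64-bit FreeBSD); NOT <sys/event.h>. */
struct mock_kevent {
    uint64_t ident;
    int16_t  filter;
    uint16_t flags;
    uint32_t fflags;
    int64_t  data;
    uint64_t udata;
    uint64_t ext[4];
};

/* Hypothetical per-stream readiness report; field widths are assumptions. */
struct stream_event {
    uint32_t stream_id;
    uint32_t msg_id;
    uint8_t  priority;
    int64_t  bytes_avail;
};

/* Bytes a filter may define (ext[0], ext[1]) and bytes the fields above use. */
#define EXT_FILTER_BYTES (2 * sizeof(uint64_t))
#define EXT_FIXED_BYTES  (sizeof(uint32_t) + sizeof(uint32_t) + sizeof(uint8_t))

/* Kernel side (conceptually): describe one readable stream in one kevent. */
static void stream_event_encode(struct mock_kevent *kev, int fd,
    const struct stream_event *se)
{
    memset(kev, 0, sizeof(*kev));
    kev->ident = (uint64_t)fd;
    kev->filter = -1; /* value of EVFILT_READ on FreeBSD */
    kev->data = se->bytes_avail;
    kev->ext[0] = ((uint64_t)se->stream_id << 32) | se->msg_id;
    kev->ext[1] = se->priority;
}

/* Application side: recover the per-stream metadata from the kevent. */
static struct stream_event stream_event_decode(const struct mock_kevent *kev)
{
    struct stream_event se;

    se.stream_id = (uint32_t)(kev->ext[0] >> 32);
    se.msg_id = (uint32_t)(kev->ext[0] & 0xffffffffu);
    se.priority = (uint8_t)(kev->ext[1] & 0xffu);
    se.bytes_avail = kev->data;
    return se;
}

/* Would a path-like StreamID of `depth` 16-bit components still fit? */
static int stream_path_fits(size_t depth)
{
    return depth * sizeof(uint16_t) <= EXT_FILTER_BYTES - EXT_FIXED_BYTES;
}
```

With these widths only 7 bytes remain after the fixed fields, so a path of
more than three components already does not fit in a single kevent.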


[muSCTP] https://github.com/nuclight/musctp
[SST]    https://pdos.csail.mit.edu/archive/uia/sst/

-- 
WBR, @nuclight