Re: kqueue extensibility (Was: native inotify implementation)
- In reply to: Vadim Goncharov : "Re: kqueue extensibility (Was: native inotify implementation)"
Date: Sun, 13 Jul 2025 16:19:59 UTC
On Sat, Jul 12, 2025 at 08:58:11PM +0300, Vadim Goncharov wrote:
> On Mon, 7 Jul 2025 10:31:18 -0400
> Mark Johnston <markj@freebsd.org> wrote:
>
> > On Sun, Jul 06, 2025 at 12:42:22AM +0300, Vadim Goncharov wrote:
> > > On Sat, 5 Jul 2025 12:30:18 -0400
> > > Mark Johnston <markj@freebsd.org> wrote:
> > > > also does not use kqueue for this purpose, and I'm skeptical that
> > > > it's the right substrate for a file monitoring interface.
> > >
> > > Well, this is one problem with it (another is below); let's discuss an
> > > idea, a kind of brainstorming (not necessarily final)... David talked
> > > about a message bus, and while it's doubtful that kqueue() is the
> > > right place for it, this leads to the idea of variable-length messages
> > > (X11, for example, has an extension of this kind). How could this be
> > > implemented? Suppose there is a bit for `flags` which says this is not
> > > a complete `struct kevent` but the start of a series - e.g. a number
> > > in one of the fields indicates 3 for three structs in total (192
> > > bytes). Then the application knows it must read 2 more `struct
> > > kevent`s (if it did not already), which must be placed consecutively
> > > in an array, so that a cast to `struct longXXXevent *` can be
> > > performed. `struct longXXXevent` has its first fields identical to
> > > `struct kevent`, but these are not repeated - i.e. the remaining two
> > > kevents would be raw data instead of real kevents. Such a longXXXevent
> > > would almost certainly have char[0] or uint8_t[0] as its last field
> > > for the variable-length data.
> >
> > I don't think this idea really works. Normally, kevent() doesn't copy
> > in the eventlist, it's an output-only parameter. So now we'd have to
> > copy in the event list, see which event structures are "extended", and
> > be careful when copying out. It could be implemented but IMO it's not
> > in the spirit of the interface.
>
> By "copy in" you mean the userland -> kernel direction? I thought only
> about the opposite, kernel -> userland.

Right.
You mentioned setting a flag in the kevent structure to mark it as
"long", but how does the kernel know that the flag is set? To do so, it
must first copy in the event list.

> > In my EVFILT_FSWATCH prototype, I used the "ident" field of the kevent
>
> Why not the newer "extension" fields?

Because those fields are uint64_t, apparently to mimic XNU's struct
kevent64, and I don't want to stuff a pointer into a uint64_t. That'd be
incompatible with CHERI, at least. There were no fds involved, so the
userspace address seemed like a reasonable unique identifier.

> > to store a user pointer; when the filter's f_touch function is called,
> > it copies out any pending data to that pointer and activates the knote.
> > When the application returns from kevent() it has to process all the
> > data that was written, then it can call kevent() again and get fresh
> > event descriptions.
>
> I'm not familiar with how the kernel -> userland queue is implemented in
> the kernel. I could just note here that the Win32 API has always had
> pointers in its WM_* messages, but pointers in userspace are
> error-prone...
>
> > But, what's the real advantage of this approach over defining a new fd
> > type that you can read() from to get data? It saves an extra system
> > call, but is that the primary goal?
>
> Unified event handling loops, of course - e.g. for the aforementioned
> Windows WM_* messages it is even more important. See also below for
> musctp_*.

Why can't the new fd type be used with kqueue? It's possible to wait for
data to arrive on an inotify fd using kevent(), of course, so inotify can
be integrated into a kevent()-based event loop.

> > > Of course, the application must indicate to the kernel that it's
> > > prepared to receive such "train wagons" of events, and must be ready
> > > to memmove() head and tail events if a train gets split between
> > > kqueue() calls.
> > > For the kernel it is mostly straightforward to split long structs
> > > into several kevents, post them to the queue and forget, except for
> > > the problem of automatically deleting unread events from the kqueue
> > > e.g. on close(), especially when a split one is partially read
> > > (though I think it's enough to track the head only; if the app began
> > > reading, let it finish and be prepared for races).
> > >
> > > > > [E.g. I'd want to have notifications for my protocol with multiple
> > > > > streams inside one socket (think like QUIC), but it does not fit
> > > > > nicely into the current struct kevent or socket API (multiple
> > > > > socket buffers with separate reading)]
> > >
> > > Another problem is the fixed set of filters. It is not possible for a
> > > KLD to register its own EVFILT_XXX so that software from ports could
> > > be used on a GENERIC kernel without recompiling. Probably khelp(9)
> > > could be a solution here, but I'm not familiar with this subsystem
> > > and it seems it is not very straightforward to add such support.
> >
> > Well, the EVFILT number is exposed to userspace, so it needs to be
> > stable to preserve the ABI. I can imagine some dynamic filter registry,
> > where the application uses sysctl() to resolve the filter name to an ID
> > before first use.
>
> Sounds reasonable.
>
> > But, if your KLD can create its own fds, then you can define your own
> > behaviour for standard fd-based filters, i.e., EVFILT_READ etc. So
> > rather than creating new filter types, it seems more attractive to
> > define new functionality using file descriptors and just use existing
> > filters.
>
> OK, let the example be the protocol I'm designing - an L4 protocol
> [muSCTP] which, like QUIC or SCTP, supports multiple independent streams
> inside one connection, but aims to overcome many QUIC/SCTP problems and
> limitations, ideally living in-kernel (in contrast to QUIC) while
> remaining compilable from ports on a GENERIC kernel.
> For example, the application must be able to read each stream
> independently of the others (SCTP does not allow even such an elementary
> thing):
>
> int musctp_read(int fd, StreamID which_stream, void *buf, size_t len)
>
> So, for this, kqueue() must be able to provide not only the fd but also
> the stream on which the event happened, of course with the bytes
> available - but maybe other metadata, too. For example, ideally I would
> take an idea from [SST] and make `StreamID` not just an `int` but a real
> path in a streams tree, and include not only the bytes available but
> e.g. a message ID and priority.
>
> All of this won't necessarily fit into the standard `struct kevent`, so
> what is available - or will be available - in the API influences the
> protocol architecture here. Of course, there are other problems as well,
> e.g. the kernel socket structure assumes only two `sockbuf`s (inbound
> and outbound), not multiple streams...
>
> So knowing to what degree kqueue() could be extended would be
> beneficial. As one of the goals is to provide a simple-to-use modern
> API, additional descriptors - given that e.g. SCTP's API is already very
> big and a PITA by itself - would be too stone-age and fragile.

Is it impossible to assign different fds to different streams?