Re: kqueue extensibility (Was: native inotify implementation)
- In reply to: Mark Johnston : "Re: kqueue extensibility (Was: native inotify implementation)"
Date: Mon, 07 Jul 2025 14:43:34 UTC
On Mon, Jul 07, 2025 at 10:31:18AM -0400, Mark Johnston wrote:
> On Sun, Jul 06, 2025 at 12:42:22AM +0300, Vadim Goncharov wrote:
> > On Sat, 5 Jul 2025 12:30:18 -0400
> > Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Sat, Jul 05, 2025 at 03:49:46AM +0300, Vadim Goncharov wrote:
> > > > On Sat, 17 May 2025 11:18:34 -0400
> > > > Mark Johnston <markj@freebsd.org> wrote:
> > > >
> > > > > On Fri, May 16, 2025 at 11:02:33AM -0500, Jake Freeland wrote:
> > > > > > On Mon May 12, 2025 at 3:58 PM CDT, Mark Johnston wrote:
> > > > > > > For the past while I've been hacking on a native implementation of
> > > > > > > Linux's inotify. Functionality-wise, this is similar to but not
> > > > > > > quite equivalent to the EVFILT_VNODE kqueue filter. While we
> > > > > > > already have a userspace implementation of inotify built on top of
> > > > > > > kqueue, it shares the limitations of EVFILT_VNODE, and my version
> > > > > > > can also be used in the Linuxulator. (Please let me know if you're
> > > > > > > interested in working on that and testing it out.)
> > > > [...]
> > > > > > > This work was largely motivated by a race condition in EVFILT_VNODE:
> > > > > > > in order to get events for a particular file, you first have to open
> > > > > > > it, by which point you may have missed the event(s) you care about.
> > > > > > > For instance, if some upload service adds files to a directory, and
> > > > > > > you want to know when a new file has finished uploading, you'd have
> > > > > > > to watch the directory to get new file events, scan the directory to
> > > > > > > actually find the new file(s), open them, and then wait for
> > > > > > > NOTE_CLOSE (which might never arrive if the upload had already
> > > > > > > finished). Aside from that, the need to hold each monitored file
> > > > > > > open is also a problem for large directory hierarchies as it's easy
> > > > > > > to exhaust file descriptor limits.
> > > > > > >
> > > > > > > My initial solution was a new kqueue filter, EVFILT_FSWATCH, which
> > > > > > > lets one watch for all file events under a mountpoint. The consumer
> > > > > > > would allocate a ring buffer with space to store paths and event
> > > > > > > metadata, register that with the kernel, and the kernel would write
> > > > > > > entries to the buffer, using reverse lookups to find a path for each
> > > > > > > event vnode. This prototype worked, but got somewhat hairy and I
> > > > > > > decided it would be better to simply implement an existing
> > > > > > > interface: inotify already exists and is commonly used, and has a
> > > > > > > somewhat simpler model, as it merely watches for events within a
> > > > > > > particular directory.
> > > > > >
> > > > > > I've found that more and more developers are blindly using
> > > > > > Linux-specific interfaces these days, so +1 for natively supporting
> > > > > > another one.
> > > > > >
> > > > > > The more support we have for these, the easier porting/Linux emulation
> > > > > > is. I think the benefits of this far outweigh the cost of maintaining
> > > > > > the code.
> > > > >
> > > > > I think so too. My perspective is that we should implement widely used
> > > > > Linux interfaces as part of the larger goal of making existing software
> > > > > usable on FreeBSD. This is more important than the purity of the
> > > > > kernel's interfaces or architecture, at least up to a certain point.
> > > > >
> > > > > The whole purpose of an OS is to let users run the programs they want to
> > > > > run, without getting in the way (too much).
> > > >
> > > > Yes, and no. While it's often useful in short-term perspective, such
> > > > approach leaves FreeBSD without unique features so it becomes yet another
> > > > "Linux, just poorer" with obvious then "why choose it?".
> > > > It's understandable that in some cases it is simple to implement
> > > > compatible API, but an alternative like "have more general solution with a
> > > > compatibility shim layer via which their API is implemented" is better,
> > > > when possible.
> > >
> > > Sure, but so far there is no clear description of a more general
> > > solution, and the shortcomings of EVFILT_VNODE have been known for a
> > > long time.
> >
> > I am not to blame (better than _vnode interface is a win for FreeBSD now), I
> > also see limitations of kqueue - however, it's very attractive and much better
> > than Linux zoo mess of epoll() + xxxfd() kludges. So let's talk about
> > extending it below...
> >
> > > There's also nothing precluding this inotify implementation from being
> > > extended or replaced, just so long as a compatible implementation can be
> > > provided in libc.
> > >
> > > > It's late in which particular topic as commit was landed, but for future we
> > > > should think how to extend kqueue to be able more.
> > >
> > > As I mentioned in my original email, that's what I tried to do first.
> > > It is immediately more complicated than inotify since kevent() doesn't
> > > have a good way to return arbitrary data (particularly file names and
> > > paths) to userspace. It is possible if we make kevent() write to a user
> > > pointer embedded in the knote, but it's not simple. I note that XNU
> >
> > Yeah, that's the way Windows took in its WM_* messages.
> >
> > > also does not use kqueue for this purpose, and I'm skeptical that it's
> > > the right substrate for a file monitoring interface.
> >
> > Well, this is one problem of it (other below), let's discuss an idea,
> > kinda brain-storming (not necessary final)... David talked about message-bus,
> > and while it's doubtful the kqueue() is the right place for it, this induces
> > to idea of variable-length messages (also e.g. X11 has an extension of such
> > kind). How this could be implemented?
> > Suppose there is a flag for `flags` which tells this is not complete
> > `struct kevent` but a series for it - e.g. a number in one of fields
> > indicate 3 for three structs in total (192 bytes).
> > Then application knows it must read 2 more `struct kevent`'s (if did not
> > already) which must be placed consecutively in array, so that then a cast to
> > `struct longXXXevent *` can be performed. Then, `struct longXXXevent` contains
> > first fields identical to `struct kevent` but these are not repeated - e.g.
> > rest two kevents would be raw data instead of real kevents. Such longXXXevent
> > most certainly contains char[0] or uint8_t[0] as its last field for
> > variable-length data.
>
> I don't think this idea really works. Normally, kevent() doesn't copy
> in the eventlist, it's an output-only parameter. So now we'd have to
> copy in the event list, see which event structures are "extended", and
> be careful when copying out. It could be implemented but IMO it's not
> in the spirit of the interface.
>
> In my EVFILT_FSWATCH prototype, I used the "ident" field of the kevent
> to store a user pointer; when the filter's f_touch function is called,
> it copies out any pending data to that pointer and activates the knote.
> When the application returns from kevent() it has to process all the
> data that was written, then it can call kevent() again and get fresh
> event descriptions.
>
> But, what's the real advantage of this approach over defining a new fd
> type that you can read() from to get data? It saves an extra system
> call, but is that the primary goal?

I think this does not matter much. What is fatal with kqueue in many
situations is that a kqueue fd is not passable, and this cannot be fixed
by a simple patch to the API.

> > Of course, application must indicate to kernel it's prepared to receive such
> > "train wagons" of events and must be ready to memmove() head and tail events if
> > it got split between kqueue() calls.
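
The "train wagons" layout quoted above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct kev` is a local stand-in that mirrors the field order of `struct kevent` (64 bytes on LP64, hence 192 bytes for three slots), and `EV_LONG_SKETCH`, the use of `ext[0]`/`ext[1]`, and all `long_event_*` names are hypothetical, not part of any existing kqueue ABI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the field order of FreeBSD's struct kevent,
 * declared here so the sketch compiles without <sys/event.h>. */
struct kev {
    uintptr_t ident;
    int16_t   filter;
    uint16_t  flags;
    uint32_t  fflags;
    int64_t   data;
    void     *udata;
    uint64_t  ext[4];
};

/* Hypothetical flag and field usage; nothing here exists in a real ABI. */
#define EV_LONG_SKETCH 0x0400u          /* slot starts a multi-slot event */
#define LONG_NSLOTS(k) ((k)->ext[0])    /* total slots, header included */
#define LONG_PAYLEN(k) ((k)->ext[1])    /* payload length in bytes */

/* Header slot followed by raw bytes occupying the slots after it. */
struct long_event {
    struct kev hdr;
    uint8_t    payload[];               /* C99 flexible array member */
};

/* One header slot plus enough whole slots to hold len payload bytes. */
static size_t
long_event_slots(size_t len)
{
    return 1 + (len + sizeof(struct kev) - 1) / sizeof(struct kev);
}

/* Producer side (stands in for the kernel). Returns the number of slots
 * used, or 0 if the array is too small. */
static size_t
long_event_pack(struct kev *slots, size_t nslots, uintptr_t ident,
    const void *payload, size_t len)
{
    size_t need = long_event_slots(len);

    if (need > nslots)
        return 0;
    memset(slots, 0, need * sizeof(*slots));
    slots[0].ident = ident;
    slots[0].flags = EV_LONG_SKETCH;
    LONG_NSLOTS(&slots[0]) = need;
    LONG_PAYLEN(&slots[0]) = len;
    memcpy(slots + 1, payload, len);    /* tail slots carry raw data */
    return need;
}

/* Consumer side. Returns a view of the long event, or NULL if the slot is
 * an ordinary event or the tail has not been read yet. */
static const struct long_event *
long_event_view(const struct kev *slots, size_t navail)
{
    if ((slots[0].flags & EV_LONG_SKETCH) == 0)
        return NULL;
    if (LONG_NSLOTS(&slots[0]) > navail)
        return NULL;
    return (const struct long_event *)(const void *)slots;
}
```

A consumer that gets NULL from `long_event_view()` for a flagged head slot would have to keep that slot and fetch the tail with another kevent() call, which is the memmove() bookkeeping described in the quoted text.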
> >
> > For kernel, it's mostly straightforward to split long structs to several kevents
> > and post them to queue and forget, except the problem of automatically
> > deleting unread events from kqueue e.g. on close(), especially when split one
> > are partially read (though I think it's enough to track head only, if app
> > began reading, then let it finish and be prepared for races).
> >
> > > > [E.g. I'd want to have notifications for my protocol with multiple streams
> > > > inside one socket (think like QUIC), but it does not fit nicely into
> > > > current struct kevent or socket API (multiple socket buffers with separate
> > > > reading)]
> >
> > Another problem is fixed set of filters. It is not possible for a KLD to
> > register its own EVFILT_XXX so that software from ports could be used on a
> > GENERIC kernel without recompiling. Probably khelp(9) could be a solution
> > here, but I'm not familiar with this subsystem and seems it is not very
> > straightforward to add such support.
>
> Well, the EVFILT number is exposed to userspace, so it needs to be
> stable to preserve the ABI. I can imagine some dynamic filter registry,
> where the application uses sysctl() to resolve the filter name to an ID
> before first use.
>
> But, if your KLD can create its own fds, then you can define your own
> behaviour for standard fd-based filters, i.e., EVFILT_READ etc.. So
> rather than creating new filter types, it seems more attractive to
> define new functionality using file descriptors and just use existing
> filters.
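
A minimal sketch of the fd-based alternative quoted above: any descriptor that supports readiness notification can be watched with the existing EVFILT_READ filter, with no new filter number involved. In the usage note a pipe stands in for a descriptor created by a KLD; `wait_readable` is a hypothetical helper name, and the poll() branch is there only so the sketch still compiles on systems without kqueue.

```c
#include <stddef.h>
#include <unistd.h>
#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#define SKETCH_HAVE_KQUEUE 1
#else
#include <poll.h>
#endif

/* Wait until fd is readable. Returns 1 if readable, 0 on timeout,
 * -1 on error. */
static int
wait_readable(int fd, int timeout_ms)
{
#ifdef SKETCH_HAVE_KQUEUE
    struct kevent change, event;
    struct timespec ts;
    int kq, n;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    kq = kqueue();
    if (kq == -1)
        return -1;
    /* Register the fd with the standard read filter and wait once. */
    EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, NULL);
    n = kevent(kq, &change, 1, &event, 1, &ts);
    close(kq);
    if (n == 1 && (event.flags & EV_ERROR) != 0)
        return -1;              /* registration failed */
    return n;
#else
    /* Fallback for systems without kqueue; same contract. */
    struct pollfd p;

    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    return poll(&p, 1, timeout_ms);
#endif
}
```

With a pipe, `wait_readable(p[0], 10)` times out while the pipe is empty and reports readiness once a byte has been written to `p[1]`; the data itself is then fetched with an ordinary read().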