kqueue extensibility (Was: native inotify implementation)
- Reply: Mark Johnston : "Re: kqueue extensibility (Was: native inotify implementation)"
- In reply to: Mark Johnston : "Re: native inotify implementation"
Date: Sat, 05 Jul 2025 21:42:22 UTC

On Sat, 5 Jul 2025 12:30:18 -0400 Mark Johnston <markj@freebsd.org> wrote:

> On Sat, Jul 05, 2025 at 03:49:46AM +0300, Vadim Goncharov wrote:
> > On Sat, 17 May 2025 11:18:34 -0400
> > Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Fri, May 16, 2025 at 11:02:33AM -0500, Jake Freeland wrote:
> > > > On Mon May 12, 2025 at 3:58 PM CDT, Mark Johnston wrote:
> > > > > For the past while I've been hacking on a native implementation of
> > > > > Linux's inotify. Functionality-wise, this is similar to but not
> > > > > quite equivalent to the EVFILT_VNODE kqueue filter. While we
> > > > > already have a userspace implementation of inotify built on top of
> > > > > kqueue, it shares the limitations of EVFILT_VNODE, and my version
> > > > > can also be used in the Linuxulator. (Please let me know if you're
> > > > > interested in working on that and testing it out.)
> > [...]
> > > > > This work was largely motivated by a race condition in EVFILT_VNODE:
> > > > > in order to get events for a particular file, you first have to open
> > > > > it, by which point you may have missed the event(s) you care about.
> > > > > For instance, if some upload service adds files to a directory, and
> > > > > you want to know when a new file has finished uploading, you'd have
> > > > > to watch the directory to get new file events, scan the directory to
> > > > > actually find the new file(s), open them, and then wait for
> > > > > NOTE_CLOSE (which might never arrive if the upload had already
> > > > > finished). Aside from that, the need to hold each monitored file
> > > > > open is also a problem for large directory hierarchies as it's easy
> > > > > to exhaust file descriptor limits.
> > > > >
> > > > > My initial solution was a new kqueue filter, EVFILT_FSWATCH, which
> > > > > lets one watch for all file events under a mountpoint.
> > > > > The consumer would allocate a ring buffer with space to store
> > > > > paths and event metadata, register that with the kernel, and the
> > > > > kernel would write entries to the buffer, using reverse lookups to
> > > > > find a path for each event vnode. This prototype worked, but got
> > > > > somewhat hairy and I decided it would be better to simply implement
> > > > > an existing interface: inotify already exists and is commonly used,
> > > > > and has a somewhat simpler model, as it merely watches for events
> > > > > within a particular directory.
> > > >
> > > > I've found that more and more developers are blindly using
> > > > Linux-specific interfaces these days, so +1 for natively supporting
> > > > another one.
> > > >
> > > > The more support we have for these, the easier porting/Linux emulation
> > > > is. I think the benefits of this far outweigh the cost of maintaining
> > > > the code.
> > >
> > > I think so too. My perspective is that we should implement widely used
> > > Linux interfaces as part of the larger goal of making existing software
> > > usable on FreeBSD. This is more important than the purity of the
> > > kernel's interfaces or architecture, at least up to a certain point.
> > >
> > > The whole purpose of an OS is to let users run the programs they want to
> > > run, without getting in the way (too much).
> >
> > Yes, and no. While it's often useful from a short-term perspective, such
> > an approach leaves FreeBSD without unique features, so it becomes yet
> > another "Linux, just poorer", with the obvious follow-up "why choose it?".
> > It's understandable that in some cases it is simple to implement a
> > compatible API, but an alternative like "have a more general solution
> > with a compatibility shim layer via which their API is implemented" is
> > better, when possible.
>
> Sure, but so far there is no clear description of a more general
> solution, and the shortcomings of EVFILT_VNODE have been known for a
> long time.

I'm not blaming you (an interface better than EVFILT_VNODE is a win for
FreeBSD right now), and I also see the limitations of kqueue. However, it
is very attractive and much better than the Linux zoo of epoll() + xxxfd()
kludges. So let's talk about extending it below...

> There's also nothing precluding this inotify implementation from being
> extended or replaced, just so long as a compatible implementation can be
> provided in libc.
>
> > It's late for this particular topic, as the commit has landed, but for
> > the future we should think about how to extend kqueue to do more.
>
> As I mentioned in my original email, that's what I tried to do first.
> It is immediately more complicated than inotify since kevent() doesn't
> have a good way to return arbitrary data (particularly file names and
> paths) to userspace. It is possible if we make kevent() write to a user
> pointer embedded in the knote, but it's not simple.

Yeah, that's the way Windows took with its WM_* messages.

> I note that XNU also does not use kqueue for this purpose, and I'm
> skeptical that it's the right substrate for a file monitoring interface.

Well, that is one problem with it (another is below). Let's discuss an
idea, as a kind of brainstorming (not necessarily final)... David talked
about a message bus, and while it's doubtful that kqueue is the right place
for it, this leads to the idea of variable-length messages (X11, for
example, has an extension of this kind). How could this be implemented?

Suppose there is a bit in `flags` which says that this is not a complete
`struct kevent` but the head of a series of them, with a number in one of
the fields giving the total count, e.g. 3 for three structs in total
(192 bytes). Then the application knows it must read 2 more `struct
kevent`s (if it has not already), which must be placed consecutively in the
array so that a cast to `struct longXXXevent *` can be performed.
`struct longXXXevent` begins with fields identical to those of
`struct kevent`, but these are not repeated: the remaining two kevents
would be raw data instead of real kevents.
Such a longXXXevent most certainly contains char[0] or uint8_t[0] as its
last field for variable-length data. Of course, the application must tell
the kernel that it is prepared to receive such "train wagons" of events,
and it must be ready to memmove() the head and tail events together if a
train gets split between kevent() calls.

For the kernel, it is mostly straightforward to split long structs into
several kevents, post them to the queue and forget about them. The
exception is the problem of automatically deleting unread events from the
kqueue, e.g. on close(), especially when a split train has been partially
read (though I think it's enough to track the head only: if the app has
begun reading, then let it finish and be prepared for races).

> > [E.g. I'd want to have notifications for my protocol with multiple streams
> > inside one socket (think like QUIC), but it does not fit nicely into
> > current struct kevent or socket API (multiple socket buffers with separate
> > reading)]

Another problem is the fixed set of filters. It is not possible for a KLD
to register its own EVFILT_XXX so that software from ports could be used on
a GENERIC kernel without recompiling. Perhaps khelp(9) could be a solution
here, but I'm not familiar with that subsystem, and it seems that adding
such support is not very straightforward.

--
WBR, @nuclight
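
To make the "train wagon" idea above concrete, here is a minimal userspace sketch in C. Nothing in it is an existing kernel interface. `struct kev` is a stand-in that mirrors the 64-byte layout of `struct kevent` on 64-bit FreeBSD rather than including <sys/event.h>, so that it builds anywhere. `EV_LONG_HYP`, `struct long_event`, the three `train_*` functions, and the choice of `ext[0]`/`ext[1]` for the slot count and payload length are all hypothetical names and assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct kevent (64-bit FreeBSD layout); NOT the real header. */
struct kev {
    uint64_t ident;
    int16_t  filter;
    uint16_t flags;
    uint32_t fflags;
    int64_t  data;
    uint64_t udata;
    uint64_t ext[4];
};
_Static_assert(sizeof(struct kev) == 64, "one slot is 64 bytes");

/* Hypothetical flag bit: "this kevent is the head of a train". The value
 * is arbitrary and not checked against the real EV_* flag space. */
#define EV_LONG_HYP 0x0800

/* Hypothetical long event: the usual fields once, then raw payload that
 * occupies the following slots of the same array. */
struct long_event {
    struct kev head;     /* head.ext[0] = total slots, head.ext[1] = bytes */
    uint8_t    payload[]; /* C99 flexible array instead of char[0] */
};

/* Total slots (head included) needed for a payload of len bytes. */
size_t train_slots(size_t len)
{
    return 1 + (len + sizeof(struct kev) - 1) / sizeof(struct kev);
}

/* "Kernel" side: lay out the head plus raw payload in consecutive slots.
 * Returns the number of slots used, or 0 if the train does not fit. */
size_t train_pack(struct kev *out, size_t nout, uint64_t ident,
                  const void *payload, size_t len)
{
    size_t n = train_slots(len);

    if (n > nout)
        return 0;
    memset(out, 0, n * sizeof(*out));
    out[0].ident = ident;
    out[0].flags = EV_LONG_HYP;
    out[0].ext[0] = n;
    out[0].ext[1] = len;
    if (len > 0)
        memcpy(&out[1], payload, len);
    return n;
}

/* Consumer side: view a head slot as a long event. Returns NULL if the
 * slot is not a train head or if the tail has not been read yet, i.e.
 * the train was split between two reads and must be reassembled first. */
const struct long_event *train_view(const struct kev *ev, size_t navail)
{
    if (navail == 0 || (ev[0].flags & EV_LONG_HYP) == 0)
        return NULL;
    if (ev[0].ext[0] > navail)
        return NULL;
    return (const struct long_event *)ev;
}
```

With a 100-byte payload, `train_slots()` gives three slots, which matches the 192-byte example in the text. A consumer that has read only two of them gets NULL from `train_view()` and would have to keep the head, read the rest, and memmove() the pieces together before casting. The sketch deliberately leaves out the hard part mentioned above: what the kernel does with a partially read train on close().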