Re: native inotify implementation

In reply to: Mark Johnston : "Re: native inotify implementation"
From: David Chisnall <theraven_at_FreeBSD.org>
Date: Sun, 18 May 2025 17:19:09 UTC
On 18 May 2025, at 16:51, Mark Johnston <markj@freebsd.org> wrote:
> 
>> Paging some of this back in, Linux has non-advisory locks, so you can open a file with a read lock that will fail if there is an open write file descriptor (and cause attempts to open for writing to fail until the lock is dropped). This is a nice way of doing this in a race-free way. FreeBSD has no equivalent that I could see.
> 
> What's the Linux interface you're referring to?

Linux mandatory locking (non-POSIX) exposed via fcntl.

> Yes, this is one explicit incompatibility my implementation has.  Not
> because I thought it'd be useful, but rather just as a natural
> consequence of the way this implementation uses the name cache to find
> watched directories which contain an accessed vnode.

It is useful!  Now I’m paging in even older things, from when I tried to implement Spotlight-like filesystem indexing 15+ years ago…

> That is, if directories D1, D2 contain hard links L1, L2 of the same
> file, and D1 is watched, accesses via L1 or L2 will result in inotify
> events being published, whereas on Linux accesses via L2 would be
> ignored.  When D1 == D2, any access will generate two events, one for L1 and one
> for L2, which is perhaps not ideal.

The first of these is desirable.  The second is not ideal but more desirable than the Linux version.

> What else, aside from the questions of hard links and nullfs/bind
> mounts?

Those were the big ones.  The others I recall were:

Backpressure when too many things are using the notify APIs can result in performance issues.  I’m not sure whether Linux addressed this; I think there are some cgroup-related limits you can put on a few things to reduce the worst case.

The way that the notify structure is communicated to userspace is annoying.  It’s a variable-sized structure but the file API is record-oriented, so allocating receive space ends up either wasting space or requiring extra system calls.  A kevent that told you both how many records and how many bytes are pending would be nice.
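
To make the variable-size problem concrete, here is a minimal sketch of walking a buffer of records.  It assumes the Linux struct inotify_event layout (int wd, then uint32_t mask, cookie and len, followed by len bytes of NUL-padded name).  parse_events and make_record are hypothetical names used only for illustration, and make_record merely fabricates records so that the parser can be exercised without a real inotify descriptor.

```python
import struct

# Fixed header of Linux's struct inotify_event (assumed layout):
#   int wd; uint32_t mask; uint32_t cookie; uint32_t len;
# "=" selects native byte order with standard sizes and no padding.
HEADER = struct.Struct("=iIII")


def parse_events(buf):
    """Split a buffer, as read from an inotify fd, into (wd, mask, cookie, name)."""
    events = []
    offset = 0
    while offset + HEADER.size <= len(buf):
        wd, mask, cookie, name_len = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        # The name occupies name_len bytes and is NUL-padded; name_len may be 0.
        raw = buf[offset:offset + name_len]
        name = raw.split(b"\0", 1)[0].decode("utf-8", errors="surrogateescape")
        events.append((wd, mask, cookie, name))
        offset += name_len
    return events


def make_record(wd, mask, cookie, name):
    """Build one record for demonstration purposes (hypothetical helper)."""
    raw = b""
    if name:
        raw = name.encode("utf-8") + b"\0"
        raw += b"\0" * (-len(raw) % 16)  # pad the name; the padding width is arbitrary here
    return HEADER.pack(wd, mask, cookie, len(raw)) + raw
```

Because the reader cannot know the length of the next record in advance, it has to size the buffer for the worst case (the Linux manual page suggests the header size plus NAME_MAX + 1 to be sure of receiving at least one event), which is the wasted space described above.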

The cookie for rename events requires state tracking in an otherwise stateless API.  It would be nice if records could provide both the to and from names.
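
As a rough illustration of the state tracking involved, the sketch below pairs the two halves of a rename by cookie.  The mask values are the Linux IN_MOVED_FROM and IN_MOVED_TO constants; pair_renames and the (mask, cookie, name) tuple shape are assumptions made for the example, not part of any real API.

```python
IN_MOVED_FROM = 0x00000040  # Linux inotify mask bit: file moved out of a watched directory
IN_MOVED_TO = 0x00000080    # Linux inotify mask bit: file moved into a watched directory


def pair_renames(events):
    """Pair rename halves by cookie.

    events is an iterable of (mask, cookie, name) tuples.  Returns a tuple
    (renames, pending): renames is a list of (old_name, new_name), where
    old_name is None if the source was never seen (for example, a move in
    from an unwatched directory); pending maps cookies to names that were
    moved out and never matched.
    """
    pending = {}
    renames = []
    for mask, cookie, name in events:
        if mask & IN_MOVED_FROM:
            # Consumer-side state: remember the old name until its partner arrives.
            pending[cookie] = name
        elif mask & IN_MOVED_TO:
            renames.append((pending.pop(cookie, None), name))
    return renames, pending
```

A real consumer also has to decide how long to keep an unmatched entry in pending before treating it as a move out of the watched tree, which is exactly the kind of bookkeeping that a record carrying both names would avoid.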

Various aspects of the API are racy.  For some use cases, it would be nice to have a more completion-ports-style interface.  For example, for filesystem indexing, the API that I actually want is ‘when changes to a file are finished, give me an open read-only file descriptor to the file’.  I think Linux may now let you build something like that with io_uring, but I’m not sure it’s actually plumbed through.

The atomic update pattern (create file, write, close file, then rename) generates some events that a filesystem indexer needs to track in other use cases but needs to discard in this flow.  This isn’t really a limitation of inotify specifically, but a flow for atomic FS writes that plays nicely with both capsicum and inotify would be nice.
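
For reference, a minimal sketch of the pattern in question follows.  The atomic_write name and the ".tmp-" prefix are made up for the example, and the fsync call is an addition that the description above does not mention.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of path with data (bytes) via a temporary file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # assumption: the caller wants durability before the rename
        # The file is closed here; the rename atomically replaces any existing target.
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

On Linux, a watcher on the directory would typically see create, modify and close events for the temporary name, followed by a pair of move events; an indexer only cares about the final name, so everything attached to the temporary name is noise it has to recognise and drop.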

I vaguely remember something about the interaction between dropping privileges and inotify that caused problems for locate integration but I don’t remember the details.

David