From: David Chisnall
Date: Sat, 17 May 2025 17:00:48 +0100
Subject: Re: native inotify implementation
To: Mark Johnston
Cc: freebsd-hackers
Message-Id: <213C2EEE-B7AE-4A8C-8A0B-FFD0EE3D8462@FreeBSD.org>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On 12 May 2025, at 21:58, Mark Johnston <markj@freebsd.org> wrote:

> This work was largely motivated by a race condition in EVFILT_VNODE: in
> order to get events for a particular file, you first have to open it, by
> which point you may have missed the event(s) you care about.  For
> instance, if some upload service adds files to a directory, and you want
> to know when a new file has finished uploading, you'd have to watch the
> directory to get new file events, scan the directory to actually find
> the new file(s), open them, and then wait for NOTE_CLOSE (which might
> never arrive if the upload had already finished).  Aside from that, the
> need to hold each monitored file open is also a problem for large
> directory hierarchies as it's easy to exhaust file descriptor limits.

My experience as a user was that NOTE_CLOSE was unreliable.  I tried using it to detect when uploads had finished but I never saw it (on ZFS).  Producing a working reduced test case for this has been on my todo list for a while, but I solved my problem by writing my own sftp-server implementation that stored the received ‘file’ in a shared memory object and passed it to another process, so I didn’t end up depending on this.

The only way that I found on FreeBSD to determine that a file was no longer open for writing was via libprocstat, which required root.  Linux has an API for this, apparently, but I didn’t try it.

> My initial solution was a new kqueue filter, EVFILT_FSWATCH, which lets
> one watch for all file events under a mountpoint.  The consumer would
> allocate a ring buffer with space to store paths and event metadata,
> register that with the kernel, and the kernel would write entries to the
> buffer, using reverse lookups to find a path for each event vnode.  This
> prototype worked, but got somewhat hairy and I decided it would be
> better to simply implement an existing interface: inotify already exists
> and is commonly used, and has a somewhat simpler model, as it merely
> watches for events within a particular directory.

I think it’s worth discussing the design a bit.  I’ve used the Linux inotify implementation a bit and the fact that it doesn’t see changes made through hard links or bind mounts is quite problematic for several use cases.  From skimming your code, it looks as if it might have the same limitation?  This means, for example, that a jailed application that watches its config files would miss notifications if they are modified via their original location rather than the nullfs mount.  With containers, this is likely to be more important: if the same volume is mounted in two containers (nullfs mounted in two jails), one should be able to watch for changes made by the other.
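
To make the hard-link case concrete, here is a minimal sketch, assuming a POSIX filesystem that supports hard links.  It does not use inotify at all; it only shows that a write made through a second name never touches a path under the watched directory, even though both names identify the same (device, inode) pair.  The directory and file names (watched, elsewhere, app.conf) are hypothetical.

```python
import os
import tempfile


def file_identity(path):
    """Return the (device, inode) pair that identifies the underlying file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def hard_link_demo():
    with tempfile.TemporaryDirectory() as root:
        watched = os.path.join(root, "watched")      # directory a watcher would register
        elsewhere = os.path.join(root, "elsewhere")  # directory nobody watches
        os.mkdir(watched)
        os.mkdir(elsewhere)

        original = os.path.join(watched, "app.conf")
        alias = os.path.join(elsewhere, "app.conf.link")
        with open(original, "w") as f:
            f.write("v1\n")
        os.link(original, alias)  # second name for the same inode

        # Modify the file through the alias: no path under watched/ is used.
        with open(alias, "a") as f:
            f.write("v2\n")

        same_file = file_identity(original) == file_identity(alias)
        with open(original) as f:
            content_via_original = f.read()
        return same_file, content_via_original
```

A watcher keyed on the directory name has nothing to match against here, whereas one keyed on file identity would see that the modified object is the watched one.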

I had pondered an implementation using two layers of bloom filters to track vnodes that are watched by any filter, and then specific filters, which would track inode numbers and do the name lookup on the slow path after matching, but I suspect there are some details that would make this hard.
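
The two-layer idea can be sketched roughly as follows.  This is a userspace toy under stated assumptions, not a kernel design: the class names, the filter sizes, and the use of an exact set as a stand-in for the slow-path name lookup are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over non-negative integer keys (inode numbers)."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray((bits + 7) // 8)

    def _positions(self, key):
        data = key.to_bytes(8, "little")
        for i in range(self.hashes):
            # A different salt per round gives independent hash functions.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class WatchIndex:
    """Layer 1: is this inode watched by anyone?  Layer 2: by which watcher?"""

    def __init__(self):
        self.any_watch = BloomFilter()
        self.per_watcher = {}  # watcher id -> (BloomFilter, exact set of inodes)

    def watch(self, watcher, inode):
        self.any_watch.add(inode)
        bloom, exact = self.per_watcher.setdefault(watcher, (BloomFilter(), set()))
        bloom.add(inode)
        exact.add(inode)

    def dispatch(self, inode):
        """Return the watchers to notify when this inode is modified."""
        if not self.any_watch.might_contain(inode):
            return []  # fast path: definitely unwatched
        hits = []
        for watcher, (bloom, exact) in self.per_watcher.items():
            # The exact set stands in for the slow path that would confirm
            # the match (and look up the name) after a Bloom filter hit.
            if bloom.might_contain(inode) and inode in exact:
                hits.append(watcher)
        return hits
```

One known property of plain Bloom filters is that entries cannot be removed, so dropping a watch would need a counting variant or a rebuild of the affected filters.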

The approach on XNU is a lot more scalable than on Linux and seems to be similar to your original proposal here.  The kernel has a single fsevents device node that tells a userspace daemon the directories that contain files that have been modified.  When a process wants to watch for events in a tree, it notifies the userspace daemon, which maintains all of the state.  There is a ring buffer between the daemon and the kernel.  If the userspace daemon can’t keep up with kernel events, it sees non-sequential message numbers and falls back to examining the modification times of files in all watched paths to determine if any files were modified in the period where it missed messages.
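
The gap-detection fallback described above can be modelled in a few lines.  This is a toy sketch, not the real fsevents protocol: the message shape (sequence number, directory, timestamp), the WatchDaemon class, and the checkpoint bookkeeping are assumptions made for illustration.

```python
import os
import tempfile


def modified_since(directories, checkpoint):
    """Fallback scan: watched directories holding a file modified after checkpoint."""
    changed = set()
    for directory in directories:
        for dirpath, _dirnames, filenames in os.walk(directory):
            for name in filenames:
                if os.stat(os.path.join(dirpath, name)).st_mtime > checkpoint:
                    changed.add(directory)
    return changed


class WatchDaemon:
    """Hypothetical userspace consumer of kernel 'directory changed' messages."""

    def __init__(self, watched, checkpoint):
        self.watched = set(watched)
        self.expected_seq = 0
        self.checkpoint = checkpoint  # time up to which the event stream is complete

    def handle(self, seq, directory, timestamp):
        """Process one message and return the watched directories now dirty."""
        dirty = {directory} & self.watched
        if seq != self.expected_seq:
            # Non-sequential number: messages were dropped, so rescan by mtime.
            dirty |= modified_since(self.watched, self.checkpoint)
        self.expected_seq = seq + 1
        self.checkpoint = timestamp
        return dirty


def gap_demo():
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a")
        b = os.path.join(root, "b")
        for directory, mtime in ((a, 100), (b, 200)):
            os.mkdir(directory)
            path = os.path.join(directory, "data")
            with open(path, "w") as f:
                f.write("x")
            os.utime(path, (mtime, mtime))  # pin mtimes so the demo is deterministic
        daemon = WatchDaemon([a, b], checkpoint=50)
        in_order = daemon.handle(0, a, 150)   # expected sequence number
        after_gap = daemon.handle(2, a, 250)  # message 1 was dropped
        return in_order == {a}, after_gap == {a, b}
```

In the demo, the dropped message concerned directory b; the daemon never sees it, but the mtime scan after the gap still marks b as dirty.  The producer never has to wait for the consumer, at the cost of a full rescan when messages are lost.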

The benefit of the XNU approach is that filesystem watching never backpressures the kernel.  I’m not sure how such an approach would work in a jail.

David
