Impact of having a large number of open file descriptors

Tue Jun 3 10:38:12 UTC 2008

On Mon, 2 Jun 2008, Garance A Drosihn wrote:

> I remember a discussion of changes to MacOS10 in Leopard which made it 
> easier to implement features such as Spotlight and TimeMachine. The 
> description starts here, I think:
>
> http://arstechnica.com/reviews/os/mac-os-x-10-5.ars/7
>
> the section on file-system events.
>
> The idea I thought was interesting was to save the metadata on a directory 
> basis, instead of saving it on the file.  So, if file /some/dir/fname was 
> changed, then they'd record that *some* file under /some/dir has changed.
>
> So when your userland process comes along later on, it still has to scan all 
> files in that directory to see which file(s) actually changed.  But that's a 
> lot less work than scanning all files in the filesystem, and it also means 
> there is much less data that has to be kept track of.
>
> I have no idea how easy it would be to implement something similar on 
> FreeBSD, but the strategy seemed like a pretty neat idea.

fsevents allows user processes to subscribe, effectively on a per-filesystem 
basis, to namespace and file close operations.  The implementation is split 
into two parts: a kernel component, which captures events with possible 
coalescing, and a user daemon, fseventsd, which listens on a special device 
and then provides scope narrowing and persistence for subscriptions. 
Applications talk to fseventsd, using Mach ports, I believe, and fseventsd is 
responsible for tracking subscriptions, filtering events, and so on.
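
To make the split concrete, here is a minimal sketch of directory-level
coalescing followed by per-subscriber scope narrowing.  It is an illustration
only, not Apple's implementation: the function names, the choice to coalesce
on the parent directory, and the prefix-match filter are all assumptions made
for this example.

```python
import posixpath


def coalesce_by_directory(changed_paths):
    """Collapse per-file change events into the set of parent directories.

    Hypothetical stand-in for the kernel-side coalescing step.
    """
    return sorted({posixpath.dirname(p) for p in changed_paths})


def filter_for_subscriber(dirs, watched_root):
    """Keep only directories at or below the subscriber's watched root.

    Hypothetical stand-in for the scope narrowing done in the user daemon.
    """
    root = watched_root.rstrip("/") or "/"
    prefix = "/" if root == "/" else root + "/"
    return [d for d in dirs if d == root or d.startswith(prefix)]


events = ["/some/dir/fname", "/some/dir/other", "/var/log/messages"]
dirs = coalesce_by_directory(events)
print(filter_for_subscriber(dirs, "/some"))
```

A consumer that receives "/some/dir" still has to rescan that directory to
find out which files actually changed, as described in the quoted message.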

I'm aware of several limitations that should be considered very carefully 
before adopting this code:

(1) The user<->kernel interface is essentially a firehose, and available only
     to privileged processes.  fseventsd performs checks in user space to see
     whether each consumer is allowed access to each event, which can lead to
     confusing and potentially quite incorrect results.

(2) The kernel code requires a reliable conversion from vnode to path, which
     we don't have.  Events are expressed in terms of paths, and coalescing
     in particular depends on them.

(3) The user daemon requires synchronous hooks into the file system umount
     event because fseventsd stores its events journal in the file system root,
     so it must close the journal before the file system can be unmounted.  In Mac OS
     X, this is satisfied by having the disk arbitration daemon, which performs
     unmounts, first send a message to fseventsd and wait for it to finish up.
     I've seen a number of occasions where the disk unmount process has become
     non-trivially stalled due to fseventsd, so there's a potential robustness
     question.

(4) As I understand it, events frequently come down to "file system X
     changed" in practice, which could be captured by a far simpler mechanism.
     I've not done any measurements to confirm whether this is the case, but
     it's not impossible to imagine on a busy system.
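
     One way such a collapse could arise is a bounded event buffer that gives
     up on detail once it overflows.  The sketch below assumes that
     mechanism purely for illustration; the class name, the capacity
     parameter, and the overflow marker are hypothetical and are not taken
     from the fsevents code.

```python
class BoundedEventQueue:
    """Hypothetical bounded buffer of changed directories."""

    OVERFLOW = "<entire file system changed>"

    def __init__(self, capacity):
        self.capacity = capacity
        self.dirs = set()
        self.overflowed = False

    def record(self, directory):
        # Once detail has been lost, further events add no information.
        if self.overflowed:
            return
        self.dirs.add(directory)
        if len(self.dirs) > self.capacity:
            # Too many distinct directories: drop the detail and remember
            # only that something on this file system changed.
            self.dirs.clear()
            self.overflowed = True

    def drain(self):
        """Return pending events and reset the queue."""
        result = [self.OVERFLOW] if self.overflowed else sorted(self.dirs)
        self.dirs.clear()
        self.overflowed = False
        return result


queue = BoundedEventQueue(capacity=2)
for d in ["/a", "/b", "/c"]:
    queue.record(d)
busy = queue.drain()

queue.record("/a")
quiet = queue.drain()
print(busy, quiet)
```

     If measurements showed that most deliveries look like the overflow
     case, a single per-file-system "changed" flag would carry the same
     information.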

I think there's also considerable overlap with other kernel event systems, 
such as audit, and we might benefit from thinking seriously about enhancing 
those event systems rather than introducing a new one.  The design of fsevents 
is pretty much entirely dictated by the needs of Spotlight and later Time 
Machine.  In particular, it's not clear to me that the persistence 
requirements, which are a large part of the fsevents design, are important to 
us... or are they?

Robert N M Watson
Computer Laboratory
University of Cambridge