lockless file descriptor lookup

Thu May 21 13:33:46 UTC 2009

On Thursday 21 May 2009 5:37:09 am Bruce Evans wrote:
> On Wed, 20 May 2009, John Baldwin wrote:
> 
> > On Wednesday 20 May 2009 2:59:52 pm Jeff Roberson wrote:
> >> On Thu, 14 May 2009, Bruce Evans wrote:
> >>> Anyway, you probably need atomics that have suitable memory barriers.
> >>> Memory barriers must affect the compiler and make it perform refreshes
> >>> for them to work, so you shouldn't need any volatile casts.  E.g., all
> >>> atomic store operations (including cmpset) have release semantics even
> >>> if they aren't spelled with "_rel" or implemented using inline asm.
> >>> On amd64 and i386, they happen to be implemented using inline asm with
> >>> "memory" clobbers.  The "memory" clobbers force refreshes of all
> >>> non-local variables.
> 
> Actually, it is the "acquire" operations that happen to be implemented
> with "memory" clobbers on amd64 and i386.  "release" semantics are
> (completely?) automatic on amd64 and i386 so no "memory" clobbers are
> used for them (except IIRC in old versions).

However, that may be a bug as when I removed them I did so because the CPUs
did not need them.  They may still be needed to prevent the compiler from 
breaking things.  Specifically, I was under the (possibly mistaken) 
impression that '__asm __volatile()' was sufficient to prevent GCC from 
reordering an atomic operation with other operations.  However, I'm not sure 
that is the case based on some discussions I had with ups@ about a year ago.  
I think that __volatile may only ensure that the compiler may not optimize 
the operation out, but doesn't prevent it from moving it around.

> >> So I think I need an _acq memory barrier on the atomic cmpset of the
> >> refcount to prevent speculative loading of the fd_ofiles array pointer by
> >> the processor and the volatile in the second dereference as I have it
> >> now to prevent caching of the pointer by the compiler.  What do you 
think?
> 
> I thought that it was a _rel barrier that was needed due to my misreading
> of the "memory" clobbers corrected above.  Perhaps both _acq and _rel
> are needed in cases like yours where a single cmpset corresponds to a
> (lock, unlock) pair.  On amd64 and i386, plain atomic_cmpset already
> has both (_acq via the explicit "memory" clobber, and _rel implicitly),
> but the man page doesn't say that this is generic.  It only says that
> all stores have _rel semantics, and it uses an explicit _aqu suffixes
> in examples of how to use cmpset to implement locking (the examples
> are rotted copies of locking in sys/mutex.h).  Since a
> successful plain cmpset does a store, this implicitly says that plain
> cmpset's have _rel semantics and cmpset_acq has both _acq and _rel
> semantics.

Ah, I think the manpage is confusing.  The sentence "The atomic_store() 
functions always have release semantics." refers to the fact that there are 
not any "atomic_store_acq_*() or atomic_store_*()" functions.  That the only 
store operations provided by the atomic(9) API include a "_rel" memory 
barrier.  It does not mean that all store operations imply "_rel" semantics.  
Similarly for the statement about all atomic_load() operations and "_acq" 
semantics.  I can probably update that part of the manpage to be clearer.  
Thus, given that, plain atomics and atomic_acq's do not have _rel semantics.

In Jeff's case I think he only needs _acq semantics.  He does not need prior 
memory store operations to be drained before the atomic_cmpset() is 
performed.  Rather, he needs the compiler and the CPU to not reorder the read 
of fd_ofiles before performing the atomic_cmpset().  An _acq barrier should 
be sufficient for this.

> Mutex locking has always been careful to use an explicit _acq suffix,
> but most code in /sys isn't.  In a /sys tree deated ~March 30, there
> are 280 lines matching atomic_cmpset but only 72 lines matching
> atomic_cmpset_acq and 47 lines matching atomic_cmpset_rel.  Excluding
> the implementation (atomic.h), there are 153 lines matching atomic_cmpset,
> 35 matching atomic_cmpset_acq and 12 matching atomic_cmpset_rel; this
> gives 106 lines that are probably missing an _acq or a _rel suffix.
> No one replied to my previous mails about this.  I would require
> explicit suffix by not supporting plain cmpset, or not support the
> _rel suffix for stores since because stores are always _rel, it is hard
> to tell if an atomic store without the suffix really wants non-_rel or
> is sloppy.  Despite the proliferation of interfaces, there is no
> _acq_rel suffix to indicate that cmpset_acq is also _rel.

Not all places that do atomics need memory barriers.  Only if the atomic 
operations on an item in memory need to be ordered with respect to other 
memory access (e.g. with respect to the data a lock protects, or in this 
specific case fd_ofiles needs to be read after the cmpset to f_count).  There 
are no atomic stores without a _rel suffix.  (Well, actually, there are an 
absolute ton of them, but they are not encoded as atomic_*(), instead they 
look like 'x = y' :).)

> >> The references prior to the atomic increment have no real ordering
> >> requirements.  Only the ones afterwards need to be strict so that we can
> >> verify the results.
> 
> Most references are in a loop, so "before" and "after" are sort of the 
saeme:
> 
> %	for (;;) {
> %		fp = fdp->fd_ofiles[fd];
> %		if (fp == NULL)
> %			break;
> %		count = fp->f_count;
> %		if (count == 0)
> %			continue;
> %		if (atomic_cmpset_int(&fp->f_count, count, count + 1) != 1)
> %			continue;
> 
> I think we do depend on both _acq and _rel semantics here -- the missing
> _acq to volatilize everything, and the implicit _rel just (?) to force
> the memory copy of f_count to actually be incremented, as is required
> for an atomic store to actually work.

No, you do not need the _rel for f_count.  The atomic operation is always 
required to perform the actual "atomic operation" atomically.  Memory 
barriers are not supposed to control ordering/timing of the atomic ops 
themselves.  The atomic op is always synchronous, and the memory barriers are 
solely to order other memory accesses with respect to the atomic operation.

Specifically, a _rel would only be needed to ensure that an earlier store 
operation completed before the f_count update.  In this case there aren't any 
earlier stores.  Also, the prior reads all must be satisifed before the 
atomic op can be performed since they are dependencies of reading 'count'.

> I agree.
> 
> Please look at whether some of the ~106 other plain cmpset's need and
> _acq prefix or should have a _rel prefix for clarity.  You should be
> able to do this much faster than me, having written some of them :-).
> E.g., the one in sio.c is for implementing a lock so it shuld use _acq
> (though it might work without _acq since the lock is only used once),
> but the ones in sx.h and kern_sx.c might be correct since they are
> mostly for "trylock"-type operations.

Well, even trylock operations should use _acq since you need to not read data 
a lock protects until you have acquired the lock.  Many of the plain 
atomic_cmpset's are ok though such as the ones in sys/refcount.h.

I looked at (mtx, rw, sx) and found atomic_cmpset() used without memory 
barriers in the following places:

- unlocking a read/shared lock.  Releasing an exclusive lock requires a _rel 
barrier to drain any writes to the locked data.  However, none of the locked 
data should be modified under a read lock, so no barrier is needed here.
- setting contested flags.  This is when a waiter sets a flag to force 
a "hard" unlock in the owning thread so that the waiter gets woken up.  No 
memory barrier is needed here as the waiting thread will have to succesfully 
complete some other atomic_cmpset_acq() before it obtains the lock and that 
_acq provides sufficient protection.
- upgrading a read/shared lock to a write/exclusive lock.  No _acq barrier is 
needed in these cases since the previous read/shared lock acquisition already 
had an _acq barrier and a successful upgrade is fully "atomic" in that there 
is no window in between releasing the shared lock and acquiring the write 
lock where another thread could obtain a write lock and modify the data.

All the other atomic operations in those three primitives use appropriate 
memory barriers.

-- 
John Baldwin