SMP question w.r.t. reading kernel variables

Wed Apr 20 19:49:17 UTC 2011

On Wed, Apr 20, 2011 at 7:42 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> > On Tue, Apr 19, 2011 at 12:00:29PM +0000,
> > freebsd-hackers-request at freebsd.org wrote:
> > > Subject: Re: SMP question w.r.t. reading kernel variables
> > > To: Rick Macklem <rmacklem at uoguelph.ca>
> > > Cc: freebsd-hackers at freebsd.org
> > > Message-ID: <201104181712.14457.jhb at freebsd.org>
> >
> > [John Baldwin]
> > > On Monday, April 18, 2011 4:22:37 pm Rick Macklem wrote:
> > > > > On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote:
> > ...
> > > > All of this makes sense. What I was concerned about was memory
> > > > cache
> > > > consistency and whet (if anything) has to be done to make sure a
> > > > thread
> > > > doesn't see a stale cached value for the memory location.
> > > >
> > > > Here's a generic example of what I was thinking of:
> > > > (assume x is a global int and y is a local int on the thread's
> > > > stack)
> > > > - time proceeds down the screen
> > > > thread X on CPU 0 thread Y on CPU 1
> > > > x = 0;
> > > >                                      x = 0; /* 0 for x's location
> > > >                                      in CPU 1's memory cache */
> > > > x = 1;
> > > >                                      y = x;
> > > > --> now, is "y" guaranteed to be 1 or can it get the stale cached
> > > > 0 value?
> > > >     if not, what needs to be done to guarantee it?
> > >
> > > Well, the bigger problem is getting the CPU and compiler to order
> > > the
> > > instructions such that they don't execute out of order, etc. Because
> > > of that,
> > > even if your code has 'x = 0; x = 1;' as adjacent threads in thread
> > > X,
> > > the 'x = 1' may actually execute a good bit after the 'y = x' on CPU
> > > 1.
> >
> > Actually, as I recall the rules for C, it's worse than that. For
> > this (admittedly simplified scenario), "x=0;" in thread X may never
> > execute unless it's declared volatile, as the compiler may optimize it
> > out and emit no code for it.
> >
> >
> > > Locks force that to sychronize as the CPUs coordinate around the
> > > lock cookie
> > > (e.g. the 'mtx_lock' member of 'struct mutex').
> > >
> > > > Also, I see cases of:
> > > >      mtx_lock(&np);
> > > >      np->n_attrstamp = 0;
> > > >      mtx_unlock(&np);
> > > > in the regular NFS client. Why is the assignment mutex locked? (I
> > > > had assumed
> > > > it was related to the above memory caching issue, but now I'm not
> > > > so sure.)
> > >
> > > In general I think writes to data that are protected by locks should
> > > always be
> > > protected by locks. In some cases you may be able to read data using
> > > "weaker"
> > > locking (where "no locking" can be a form of weaker locking, but
> > > also a
> > > read/shared lock is weak, and if a variable is protected by multiple
> > > locks,
> > > then any singe lock is weak, but sufficient for reading while all of
> > > the
> > > associated locks must be held for writing) than writing, but writing
> > > generally
> > > requires "full" locking (write locks, etc.).
> >
> Oops, I now see that you've differentiated between writing and reading.
> (I mistakenly just stated that you had recommended a lock for reading.
>  Sorry about my misinterpretation of the above on the first quick read.)
>
> > What he said. In addition to all that, lock operations generate
> > "atomic" barriers which a compiler or optimizer is prevented from
> > moving code across.
> >
> All good and useful comments, thanks.
>
> The above example was meant to be contrived, to indicate what I was
> worried about w.r.t. memory caches.
> Here's a somewhat simplified version of what my actual problem is:
> (Mostly fyi, in case you are interested.)
>
> Thread X is doing a forced dismount of an NFS volume, it (in dounmount()):
> - sets MNTK_UNMOUNTF
> - calls VFS_SYNC()/nfs_sync()
>  - so this doesn't get hung on an unresponsive server it must test
>    for MNTK_UNMOUNTF and return an error it is set. This seems fine,
>    since it is the same thread and in a called function. (I can't
>    imagine that the optimizer could move setting of a global flag
>    to after a function call which might use it.)
> - calls VFS_UNMOUNT()/nfs_unmount()
>  - now the fun begins...
>  after some other stuff, it calls nfscl_umount() to get rid of the
>  state info (opens/locks...)
>  nfscl_umount() - synchronizes with other threads that will use this
>    state (see below) using the combination of a mutex and a
>    shared/exclusive sleep lock. (Because of various quirks in the
>    code, this shared/exclusive lock is a locally coded version and
>    I happenned to call the shared case a refcnt and the exclusive
>    case just a lock.)
>
> Other threads that will use state info (open/lock...) will:
> -call nfscl_getcl()
>  - this function does two things that are relevant
>  1 - it allocates a new clientid, as required, while holding the mutex
>      - this case needs to check for MNTK_UNMOUNTF and return error, in
>        case the clientid has already been deleted by nfscl_umount() above.
>      (This happens before #2 because the sleep lock is in the clientd
> structure.)
> --> it must see the MNTK_UNMOUNTF set if it happens after (in a temporal
> sense)
>  being set by dounmount()
>  2 - while holding the mutex, it acquires the shared lock
>      - if this happens before nfscl_umount() gets the exclusive lock, it is
>        fine, since acquisition of the exclusive lock above will wait for
> its
>        release
>      however
>      - if it acquires the shared lock after nfscl_umount() has had the
>        exclusive lock or competes with nfscl_umount() for the lock, it
>        could be in trouble.
>        - my planned fix for this is to test for MNTK_UNMOUNTF as it tries
>          to get the shared lock (and fail with an error return if it is
> set)
>          before if goes to sleep waiting for the exclusive lock to be
>          released. (Othere things acquire this exclusive lock, so waiting
> for
>          it usually makes sense, except this case.)
> --> again, I think this is fine so long as it sees MNTK_UNMOUNTF set when
> the
>    the test occurs after (in a temporal sense) it was set by the forced
> dismount
>    thread, since that thread set it before calling
> VFS_UNMOUNT()...nfscl_umount().
>
> Now, in practice, I don't believe there is a problem, but since I have yet
> to
> see anyone state what the rules/assumptions for memory cache consistency
> are
> w.r.t. SMP, I thought I'd pursue it.
>
>
In essence, the rules are as follows.

Take any given memory location X.  Suppose a series of writes, X1, X2, X3,
.. are performed on this location.  Once a particular processor has read the
effect of say, X2, any future read on the same location by the same
processor can't roll back in time to read the effect of an earlier write
such as X1.  That future read must return at least the result of X2, but you
should not expect that the effect of a write, say X3, is seen
instantaneously by another processor.

Now, if you consider two memory locations X and Y simultaneously, this is
where things become complicated.  While one processor may see a write X1
before it sees a write Y1, i.e., it reads the effects of X1 and then Y0,
another processor may see these writes occurring in the opposite order,
i.e., it reads Y1 and then X0.  In fact, in some machines, this can happen
even if X1 and Y1 are performed by the same processor.

Memory barriers if used correctly allow you, the programmer, to control the
order in which writes to X and Y are seen.  For example, suppose one
processor performs a write X1 followed by a "release" write Y1, then another
processor performs an "acquire" read on location Y.   If this other
processor sees the effect of the "release" write Y1, then it is guaranteed
to see the effect of the write X1 when it later reads X.

This is why mtx_lock() is implemented using an "acquire" read-modify-write
operation and mtx_unlock() is implemented using a "release" operation.
However, a write doesn't have to be between the "acquire" and the "release"
to be ordered by the "release".  In other words, suppose that your Thread X
does

1. Set MNTK_UNMOUNTF
2. Acquire a standard FreeBSD mutex "m".
3. Update some data structures.
4. Release mutex "m".

Then, other threads that acquire "m" after step 4 has occurred will see
MNTK_UNMOUNTF as set.  But, other threads that beat thread X to step 2 may
or may not see MNTK_UNMOUNTF as set.

The question that I have about your specific scenario is concerned with
VOP_SYNC().  Do you care if another thread performing nfscl_getcl() after
thread X has performed VOP_SYNC() doesn't see MNTK_UNMOUNTF as set?  Another
relevant question is "Does VOP_SYNC() acquire and release the same mutex as
nfscl_umount() and  nfscl_getcl()?"

Regards,
Alan