svn commit: r213648 - head/sys/kern

John Baldwin jhb at freebsd.org
Wed Oct 13 13:31:18 UTC 2010


On Tuesday, October 12, 2010 5:08:04 pm Gennady Proskurin wrote:
> On Sat, Oct 09, 2010 at 12:48:50PM +0300, Andriy Gapon wrote:
> > on 09/10/2010 12:33 Bruce Evans said the following:
> > > On Sat, 9 Oct 2010, Andriy Gapon wrote:
> > > 
> > >> Log:
> > >>  panic_cpu variable should be volatile
> > >>
> > >>  This is to prevent caching of its value in a register when it is checked
> > >>  and modified by multiple CPUs in parallel.
> > >>  Also, move the variable  into the scope of the only function that uses it.
> > >>
> > >>  Reviewed by:    jhb
> > >>  Hint from:    mdf
> > >>  MFC after:    1 week
> > > 
> > > I doubt that this is either necessary or sufficient.  Most variables
> > > aren't volatile while they are locked by a mutex.  But panic() cannot use
> > > normal mutex locking, so it should access this variable using atomic
> > > ops with memory barriers.  The compiler part of the memory barriers
> > > effectively makes _all_ variables temporarily volatile.  However, 2 of
> > > the 4 accesses to this variable don't use an atomic op.
> > > 
> > > % #ifdef SMP
> > > %     /*
> > > %      * We don't want multiple CPU's to panic at the same time, so we
> > > %      * use panic_cpu as a simple spinlock.  We have to keep checking
> > > %      * panic_cpu if we are spinning in case the panic on the first
> > > %      * CPU is canceled.
> > > %      */
> > > %     if (panic_cpu != PCPU_GET(cpuid))
> > > 
> > > This access doesn't use an atomic op.
> > > 
> > > %         while (atomic_cmpset_int(&panic_cpu, NOCPU,
> > > %             PCPU_GET(cpuid)) == 0)
> > > 
> > > This access uses an atomic op.  Not all atomic ops use memory barriers,
> > > at least on i386.  However, this one does, at least on i386.
> > > 
> > >   (I'm always confused about what atomic_any() without acq or release
> > >   means.  Do they mean that you don't want a memory barrier (this is
> > >   what they mostly give, at least on i386), and if so, what use are
> > >   they?  There are far too many atomic ops, for far too many never-used
> > >   widths, with alternative spellings to encourage using a wrong one.
> > >   cmpset is especially confusing since you can spell it as cmpset,
> > >   cmpset_acq or cmpset_rel and always get the barrier, at least on
> > >   i386.  At least cmpset only supports 1 width, at least on i386.)
> > > 
> > > %             while (panic_cpu != NOCPU)
> > > 
> > > This access doesn't use an atomic op.
> > > 
> > > %                 ; /* nothing */
> > > % #endif
> > > % ...
> > > % #ifdef RESTARTABLE_PANICS
> > > %     /* See if the user aborted the panic, in which case we continue. */
> > > %     if (panicstr == NULL) {
> > > % #ifdef SMP
> > > %         atomic_store_rel_int(&panic_cpu, NOCPU);
> > > 
> > > This access uses an atomic op with a memory barrier, at least on i386.
> > > Now its rel semantics are clear.
> > > 
> > > panicstr is non-volatile and never accessed by an atomic op, so it probably
> > > strictly needs to be declared volatile even more than panic_cpu.  I think
> > 
> > I agree about panicstr.
> > But I am not sure if we have places in the code (beyond the panic
> > function itself) where volatile vs. non-volatile would make any actual
> > difference.  But still.
> > 
> > > the only thing that makes it work now is that it is bogusly public, and
> > > compilers can't analyze the whole program yet -- if they could, then they
> > > would see that it is only set in panic().
> > > 
> > > % #endif
> > > %         return;
> > > %     }
> > > % #endif
> > > % #endif
> > > 
> > > Now, why don't the partial memory barriers prevent caching the variable?
> > > 
> > > %     if (panic_cpu != PCPU_GET(cpuid))
> > > %         while (atomic_cmpset_int(&panic_cpu, NOCPU,
> > > %             PCPU_GET(cpuid)) == 0)
> > > %             while (panic_cpu != NOCPU)
> > > %                 ; /* nothing */
> > > 
> > > The very first access can't reasonably use a cached value.  atomic_cmpset()
> > > can change panic_cpu, so even without the memory barrier, panic_cpu must
> > > be reloaded for the third access the first time.  But then when the third
> > > access is repeated in the second while loop, the missing atomic op with
> > > barrier makes things very broken.  panic_cpu isn't changed by the loop,
> > > and the compiler thinks that it isn't changed by anything else either, so
> > > the compiler may reduce the loop to:
> > > 
> > > %             if (panic_cpu != NOCPU)
> > > %                 for (;;)
> > > %                     ; /* nothing */
> > 
> > 
> > Yes, it's exactly the last loop that had the problem.
> > On amd64 with -O2:
> >         .loc 1 544 0
> >         movl    panic_cpu(%rip), %eax
> > .LVL134:
> >         .p2align 4,,7
> > .L210:
> >         cmpl    $255, %eax
> >         jne     .L210
> >         jmp     .L225
> > 
> > > except I've seen claims that even an endless for loop can be optimized
> > > to nothing.  Declaring panic_cpu as volatile prevents the compiler doing
> > > this, but I think it is insufficient.  We really do want to see panic_cpu
> > > changed by other CPUs, and what is the point of atomic_load_acq*() if not
> > > to use for this -- if declaring things volatile were sufficient, then we
> > > could just use *(volatile *)&var.  atomic_load/store_acq/rel*() on i386
> > 
> > I discussed this with kib and the idea is that an atomic operation is not
> > needed in that place and volatile is sufficient, because we don't need
> > barrier semantics there and only want to ensure that the variable is
> > reloaded from memory.
> 
> If I understand correctly, without an acquire barrier the variable is not
> guaranteed to be loaded from memory; it can end up with some stale value
> from the CPU's cache or pipeline.  If an incorrect value is checked in the
> first "if" statement, it can lead to skipping all the atomic
> synchronization.  The same goes for writing to the variable: without a
> barrier the new value may be written to some local CPU cache and not be
> seen by readers until it is flushed from the cache.

No, a barrier does _not_ force any cache flushes.  The point of the volatile
is to serve as a compiler barrier.  A memory barrier only enforces an
ordering of memory operations in the CPU; it does not flush any caches or
force a CPU to post writes to memory.  It is weaker than that. :)  For
example, the sparcv9 manuals explicitly state something along the lines of:
any write may be held in the CPU's store buffer for an unspecified amount
of time.  Barriers do not alter that, nor would it really be useful for
them to do so.
What barriers allow you to do is order the operations on a lock cookie (such
as mtx_lock in mutexes) with respect to operations on other memory locations
(e.g. the data a lock protects).  By using a specific set of barriers and
protocols for accessing the lock cookie, you can ensure that the protected
data is not accessed without holding the lock.  However, for a standalone
word, memory barriers do not buy you anything.  And in fact, if one CPU has
written to a variable and that write is still sitting in the store buffer,
'atomic_load_acq' on another CPU may still return a stale value.  All that
FreeBSD requires is that 'atomic_cmpset' will not report success using stale
data.  It can either fail or block waiting for the write in a store buffer in
another CPU to drain.  We typically handle the 'fail' case by continuing to
loop until we observe a new state or successfully perform 'atomic_cmpset'
(for an example, see the setting of MTX_CONTESTED in mtx_lock).
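The loop-until-cmpset pattern described above can be sketched in userland
C11 (a hedged analogue, not the actual kernel code; the NOCPU constant and
the panic_lock()/panic_unlock() names are stand-ins chosen for this sketch):

```c
/*
 * Userland sketch of the panic_cpu spinlock pattern: a plain volatile
 * read spins until the lock looks free, and only the compare-exchange
 * decides the winner, so it never "succeeds" on stale data.
 */
#include <stdatomic.h>

#define NOCPU (-1)		/* stand-in for the kernel's NOCPU */

static volatile atomic_int panic_cpu = NOCPU;

static void
panic_lock(int cpuid)
{
	int expected = NOCPU;

	/* If this CPU already holds the lock, a recursive panic proceeds. */
	if (atomic_load_explicit(&panic_cpu, memory_order_relaxed) != cpuid) {
		while (!atomic_compare_exchange_strong(&panic_cpu, &expected,
		    cpuid)) {
			expected = NOCPU;
			while (panic_cpu != NOCPU)
				; /* volatile forces a reload each pass */
		}
	}
}

static void
panic_unlock(void)
{
	/* Release store, analogous to atomic_store_rel_int(). */
	atomic_store_explicit(&panic_cpu, NOCPU, memory_order_release);
}
```

Without the volatile qualifier the inner while loop is exactly the spot the
compiler can collapse into an infinite loop, as shown in the amd64 assembly
quoted earlier in the thread.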

Currently we do not have 'atomic_load()' and 'atomic_store()' variants without
memory barriers since 'x = y' is atomic (in that it is performed as a single
operation).  However, there are cases where 'x = y' needs a compiler memory
barrier (but not a CPU one).  (e.g. if 'y' can be updated by a signal handler
in userland, or an interrupt in the kernel.)  That is what 'volatile' is for.
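The signal-handler case can be illustrated with a small userland sketch
(hedged: the wait_for_alarm() helper is invented for this example); the
store in the handler is a plain assignment, so no CPU barrier is involved,
but 'volatile' keeps the compiler from caching the flag in a register:

```c
/* Spin on a flag that only a signal handler sets. */
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

static void
handler(int sig)
{
	(void)sig;
	got_signal = 1;	/* plain store; atomic, no memory barrier needed */
}

/*
 * Install the handler, arm a one-second alarm, and spin until the
 * handler's store becomes visible; returns the flag value.
 */
static int
wait_for_alarm(void)
{
	signal(SIGALRM, handler);
	alarm(1);
	while (!got_signal)
		; /* 'volatile' forces a reload on every iteration */
	return (got_signal);
}
```

Dropping 'volatile' here lets the compiler hoist the load out of the loop,
turning it into the same never-terminating loop the panic_cpu code had.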

-- 
John Baldwin


More information about the svn-src-all mailing list