svn commit: r213648 - head/sys/kern

John Baldwin jhb at freebsd.org
Wed Oct 13 20:03:11 UTC 2010


On Wednesday, October 13, 2010 12:21:44 pm Gennady Proskurin wrote:
> Thank you and Bruce for explanation.
> The key point here (which I did not understand) is that
> "something == PCPU_GET(cpuid))" may be true only if "something" is set by the
> current cpu, in which case its value is not stale.

Yes.  mtx_owned() (and mtx_assert()) depend on a similar "feature" in that
stale values don't hurt.

> On Wed, Oct 13, 2010 at 09:05:04AM -0400, John Baldwin wrote:
> > On Tuesday, October 12, 2010 5:08:04 pm Gennady Proskurin wrote:
> > > On Sat, Oct 09, 2010 at 12:48:50PM +0300, Andriy Gapon wrote:
> > > > on 09/10/2010 12:33 Bruce Evans said the following:
> > > > > On Sat, 9 Oct 2010, Andriy Gapon wrote:
> > > > > 
> > > > >> Log:
> > > > >>  panic_cpu variable should be volatile
> > > > >>
> > > > >>  This is to prevent caching of its value in a register when it is checked
> > > > >>  and modified by multiple CPUs in parallel.
> > > > >>  Also, move the variable  into the scope of the only function that uses it.
> > > > >>
> > > > >>  Reviewed by:    jhb
> > > > >>  Hint from:    mdf
> > > > >>  MFC after:    1 week
> > > > > 
> > > > > I doubt that this is either necessary or sufficient.  Most variables
> > > > > aren't volatile while they are locked by a mutex.  But panic() cannot use
> > > > > normal mutex locking, so it should access this variable using atomic
> > > > > ops with memory barriers.  The compiler part of the memory barriers
> > > > > effectively makes _all_ variables temporarily volatile.  However, 2 of
> > > > > the 4 accesses to this variable don't use an atomic op.
> > > > > 
> > > > > % #ifdef SMP
> > > > > %     /*
> > > > > %      * We don't want multiple CPU's to panic at the same time, so we
> > > > > %      * use panic_cpu as a simple spinlock.  We have to keep checking
> > > > > %      * panic_cpu if we are spinning in case the panic on the first
> > > > > %      * CPU is canceled.
> > > > > %      */
> > > > > %     if (panic_cpu != PCPU_GET(cpuid))
> > > > > 
> > > > > This access doesn't use an atomic op.
> > > > > 
> > > > > %         while (atomic_cmpset_int(&panic_cpu, NOCPU,
> > > > > %             PCPU_GET(cpuid)) == 0)
> > > > > 
> > > > > This access uses an atomic op.  Not all atomic ops use memory barriers,
> > > > > at least on i386.  However, this one does, at least on i386.
> > > > > 
> > > > >   (I'm always confused about what atomic_any() without acq or release
> > > > >   means.  Do they mean that you don't want a memory barrier (this is
> > > > >   what they mostly give, at least on i386), and if so, what use are
> > > > >   they?  There are far too many atomic ops, for far too many never-used
> > > > >   widths, with alternative spellings to encourage using a wrong one.
> > > > >   cmpset is especially confusing since you can spell it as cmpset,
> > > > >   cmpset_acq or cmpset_rel and always get the barrier, at least on
> > > > >   i386.  At least cmpset only supports 1 width, at least on i386.)
> > > > > 
> > > > > %             while (panic_cpu != NOCPU)
> > > > > 
> > > > > This access doesn't use an atomic op.
> > > > > 
> > > > > %                 ; /* nothing */
> > > > > % #endif
> > > > > % ...
> > > > > % #ifdef RESTARTABLE_PANICS
> > > > > %     /* See if the user aborted the panic, in which case we continue. */
> > > > > %     if (panicstr == NULL) {
> > > > > % #ifdef SMP
> > > > > %         atomic_store_rel_int(&panic_cpu, NOCPU);
> > > > > 
> > > > > This access uses an atomic op with a memory barrier, at least on i386.
> > > > > Now its rel semantics are clear.
> > > > > 
> > > > > panicstr is non-volatile and never accessed by an atomic op, so it probably
> > > > > strictly needs to be declared volatile even more than panic_cpu.  I think
> > > > 
> > > > I agree about panicstr.
> > > > But I am not sure if we have places in the code (beyond the panic
> > > > function itself) where volatile vs. non-volatile would make any actual
> > > > difference.  But still.
> > > > 
> > > > > the only thing that makes it work now is that it is bogusly public, and
> > > > > compilers can't analyze the whole program yet -- if they could, then they
> > > > > would see that it is only set in panic().
> > > > > 
> > > > > % #endif
> > > > > %         return;
> > > > > %     }
> > > > > % #endif
> > > > > % #endif
> > > > > 
> > > > > Now, why don't the partial memory barriers prevent caching the variable?
> > > > > 
> > > > > %     if (panic_cpu != PCPU_GET(cpuid))
> > > > > %         while (atomic_cmpset_int(&panic_cpu, NOCPU,
> > > > > %             PCPU_GET(cpuid)) == 0)
> > > > > %             while (panic_cpu != NOCPU)
> > > > > %                 ; /* nothing */
> > > > > 
> > > > > The very first access can't reasonably use a cached value.  atomic_cmpset()
> > > > > can change panic_cpu, so even without the memory barrier, panic_cpu must
> > > > > be reloaded for the third access the first time.  But then when the third
> > > > > access is repeated in the second while loop, the missing atomic op with
> > > > > barrier makes things very broken.  panic_cpu isn't changed by the loop,
> > > > > and the compiler thinks that it isn't changed by anything else either, so
> > > > > the compiler may reduce the loop to:
> > > > > 
> > > > > %             if (panic_cpu != NOCPU)
> > > > > %                 for (;;)
> > > > > %                     ; /* nothing */
> > > > 
> > > > 
> > > > Yes, it's exactly the last loop that had the problem.
> > > > On amd64 with -O2:
> > > >         .loc 1 544 0
> > > >         movl    panic_cpu(%rip), %eax
> > > > .LVL134:
> > > >         .p2align 4,,7
> > > > .L210:
> > > >         cmpl    $255, %eax
> > > >         jne     .L210
> > > >         jmp     .L225
> > > > 
> > > > > except I've seen claims that even an endless for loop can be optimized
> > > > > to nothing.  Declaring panic_cpu as volatile prevents the compiler doing
> > > > > this, but I think it is insufficient.  We really do want to see panic_cpu
> > > > > changed by other CPUs, and what is the point of atomic_load_acq*() if not
> > > > > to use for this -- if declaring things volatile were sufficient, then we
> > > > > could just use *(volatile *)&var.  atomic_load/store_acq/rel*() on i386
> > > > 
> > > > I discussed this with kib and the idea is that an atomic operation is not
> > > > needed in that place and volatile is sufficient, because we don't need
> > > > barrier semantics there and only want to ensure that the variable is
> > > > reloaded from memory.
> > > 
> > > If I understand correctly, without an acquire barrier the variable is not
> > > guaranteed to be loaded from memory; it can end up with some stale value
> > > from the CPU's cache or pipeline.  If an incorrect value is checked in the
> > > first "if" statement, it can lead to skipping all the atomic
> > > synchronization.  The same applies to writes: without a barrier the new
> > > value may be written to some local CPU cache and not be seen by readers
> > > until it is flushed from the cache.
> > 
> > No, a barrier does _not_ force any cache flushes.  The point of the volatile
> > is to serve as a compiler barrier.  A memory barrier only enforces an ordering
> > of memory operations in the CPU; it does not flush any caches or force a CPU
> > to post writes to memory.  It is weaker than that. :)  For example, the
> > sparcv9 manuals explicitly state something to the effect that any write may
> > be held in the CPU's store buffer for an unspecified amount of time.
> > Barriers do not alter that, nor would it really be useful for them to do so.
> > What barriers allow you to do is order the operations on a lock cookie (such
> > as mtx_lock in mutexes) with respect to operations on other memory locations
> > (e.g. the data a lock protects).  By using a specific set of barriers and
> > protocols for accessing the lock cookie, you can ensure that the protected
> > data is not accessed without holding the lock.  However, for a standalone
> > word, memory barriers do not buy you anything.  And in fact, if one CPU has
> > written to a variable and that write is still sitting in the store buffer,
> > 'atomic_load_acq' on another CPU may still return a stale value.  All that
> > FreeBSD requires is that 'atomic_cmpset' will not report success using stale
> > data.  It can either fail or block waiting for the write in a store buffer in
> > another CPU to drain.  We typically handle the 'fail' case by continuing to
> > loop until we observe a new state or successfully perform 'atomic_cmpset'
> > (for an example, see setting of MTX_CONTESTED in mtx_lock).
> > 
> > Currently we do not have 'atomic_load()' and 'atomic_store()' variants without
> > memory barriers since 'x = y' is atomic (in that it is performed as a single
> > operation).  However, there are cases where 'x = y' needs a compiler memory
> > barrier (but not a CPU one).  (e.g. if 'y' can be updated by a signal handler
> > in userland, or an interrupt in the kernel.)  That is what 'volatile' is for.
> > 
> > -- 
> > John Baldwin

-- 
John Baldwin

