svn commit: r313037 - in head/sys: amd64/include kern mips/include net powerpc/include sparc64/include

Sun Feb 5 18:58:15 UTC 2017

Hmm, it's a good idea to consider the possibility of a barrier issue.  It
wouldn't be the first time we've had such a problem on a weakly-ordered
architecture. That said, I don't see a problem in this case.
smp_rendezvous_cpus() takes a spinlock and then issues
atomic_store_rel_int() to ensure the rendezvous params are visible to
other cpus.  The latter corresponds to lwsync on powerpc, which AFAIK
should be sufficient to ensure visibility of prior stores.
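
To make the ordering argument concrete, here is a minimal user-space sketch
of the release/acquire pairing, written with C11 atomics rather than the
kernel's atomic(9) primitives. The release store plays the role of
atomic_store_rel_int(). The struct layout and the names publish_params(),
consume_params() and params_ready are hypothetical and exist only for this
illustration; this is not the smp_rendezvous_cpus() code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the rendezvous parameters; not the kernel layout. */
struct rendezvous_params {
	void (*action)(void *);
	void *arg;
};

static struct rendezvous_params params;	/* plain, non-atomic data */
static atomic_int params_ready;		/* publication flag, starts at 0 */

/* Initiating CPU: fill in the parameters, then publish with a release store. */
static void
publish_params(void (*action)(void *), void *arg)
{
	params.action = action;
	params.arg = arg;
	/* Release: the two stores above may not be reordered after this one. */
	atomic_store_explicit(&params_ready, 1, memory_order_release);
}

/* Target CPU: copy the parameters out once the flag has been observed. */
static bool
consume_params(struct rendezvous_params *out)
{
	/* Acquire: pairs with the release store in publish_params(). */
	if (atomic_load_explicit(&params_ready, memory_order_acquire) == 0)
		return (false);
	*out = params;
	return (true);
}
```

A reader that observes the flag set through the acquire load is guaranteed
to see the parameter stores that preceded the release store. The sketch says
nothing about how the target CPUs are actually notified in the kernel.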

For now I'm going with the simpler explanation that I made a bad assumption
in the powerpc get_pcpu() and there is some context in which the read of
sprg0 doesn't return a consistent pointer value.  Unfortunately I don't see
where that might be right now.

On the mips side, Kurt/Alexander, can you test the attached patch?  It
contains a simple fix to ensure get_pcpu() returns a consistent per-CPU
pointer.
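
For anyone following along, the idea of a pointer that means the same thing
on every CPU can be modeled in user space as below. This is only a sketch
and is not the attached patch. It follows the pcpu_find(curcpu) suggestion
from my earlier mail quoted further down. MAXCPU, pcpu_storage,
pcpu_model_init() and get_pcpu_model() are hypothetical names, the struct is
a stripped-down placeholder, and curcpu is passed in explicitly because a
user-space model has no notion of the current CPU.

```c
#include <stddef.h>

#define MAXCPU 4	/* hypothetical CPU count for this model */

struct pcpu {
	unsigned int pc_cpuid;
	int pc_rm_queue;	/* placeholder for the real per-CPU rmlock queue */
};

static struct pcpu pcpu_storage[MAXCPU];
static struct pcpu *cpuid_to_pcpu[MAXCPU];

/* Give every CPU its own globally valid pcpu address. */
static void
pcpu_model_init(void)
{
	for (unsigned int i = 0; i < MAXCPU; i++) {
		pcpu_storage[i].pc_cpuid = i;
		cpuid_to_pcpu[i] = &pcpu_storage[i];
	}
}

/* Look up a CPU's pcpu by id; NULL for an out-of-range id. */
static struct pcpu *
pcpu_find(unsigned int cpuid)
{
	return (cpuid < MAXCPU ? cpuid_to_pcpu[cpuid] : NULL);
}

/* Model of get_pcpu(): the result is distinct for each CPU. */
static struct pcpu *
get_pcpu_model(unsigned int curcpu)
{
	return (pcpu_find(curcpu));
}
```

Because each CPU gets a distinct address, a pointer such as
&get_pcpu_model(n)->pc_rm_queue still refers to CPU n's queue when another
CPU dereferences it, which is the property the rmlock tracker lists rely on.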

On Sat, Feb 4, 2017 at 1:34 PM, Svatopluk Kraus <onwahe at gmail.com> wrote:

> Probably not related. But when I took a short look at the patch to see
> what could go wrong, I came across the following comment in
> _rm_wlock(): "Assumes rm->rm_writecpus update is visible on other CPUs
> before rm_cleanIPI is called." There is no explicit barrier to ensure
> it. However, there might be some barriers inside
> smp_rendezvous_cpus(). I have no idea what could happen if this
> assumption is not met. Note that rm_cleanIPI() is affected by the
> patch.
>
>
>
> On Sat, Feb 4, 2017 at 9:39 PM, Jason Harmening
> <jason.harmening at gmail.com> wrote:
> > Can you post an example of such a panic?  Only 2 MI pieces were changed,
> > netisr and rmlock.  I haven't seen problems on my own amd64/i386/arm
> > testing of this, so a backtrace might help to narrow down the cause.
> >
> > On Sat, Feb 4, 2017 at 12:22 PM, Andreas Tobler <andreast at freebsd.org>
> > wrote:
> >>
> >> On 04.02.17 20:54, Jason Harmening wrote:
> >>>
> >>> I suspect this broke rmlocks for mips because the rmlock implementation
> >>> takes the address of the per-CPU pc_rm_queue when building tracker
> >>> lists.  That address may be later accessed from another CPU and will
> >>> then translate to the wrong physical region if the address was taken
> >>> relative to the globally-constant pcpup VA used on mips.
> >>>
> >>> Regardless, for mips get_pcpu() should be implemented as
> >>> pcpu_find(curcpu) since returning an address that may mean something
> >>> different depending on the CPU seems like a big POLA violation if
> >>> nothing else.
> >>>
> >>> I'm more concerned about the report of powerpc breakage.  For powerpc
> >>> we simply take each pcpu pointer from the pc_allcpu list (which is the
> >>> same value stored in the cpuid_to_pcpu array) and pass it through the
> >>> ap_pcpu global to each AP's startup code, which then stores it in
> >>> sprg0.  It should be globally unique and won't have the
> >>> variable-translation issues seen on mips.  Andreas, are you certain
> >>> this change was responsible for the breakage you saw, and was it the
> >>> same sort of hang observed on mips?
> >>
> >>
> >> I'm really sure. 313036 booted fine and allowed me to execute heavy
> >> compilation jobs, np. 313037 on the other hand gave me various patterns
> >> of panics. Some occurred during startup, but I also succeeded in
> >> getting into multiuser, and then the panic happened during port
> >> building.
> >>
> >> I have no deeper insight into where pcpu data is used. Justin mentioned
> >> netisr?
> >>
> >> Andreas
> >>
> >
>