powerpc64 or 32-bit powerpc context: FreeBSD lwsync use vs. th->th_generation handling (and related th-> fields) looks broken to me

Mark Millard marklmi at yahoo.com
Fri Apr 19 04:36:31 UTC 2019


First I review lwsync behavior below. The review is based on a
comparison/contrast paper covering the powerpc vs. arm memory models.
It sets context for the later material specific to powerpc64 or
32-bit powerpc FreeBSD.

"For a write before a read, separated by a lwsync, the barrier will ensure that the write is
committed before the read is satisfied but lets the read be satisfied before the write has
been propagated to any other thread."

(By contrast, sync guarantees that the write has propagated to all threads before the
read in question is satisfied, the read having been separated from the write by the
sync.)

Another wording in case it helps (from the same paper):

"The POWER lwsync does *not* ensure that writes before the barrier have propagated to
any other thread before sequent actions, though it does keep writes before and after
an lwsync in order as far as [each thread is] concerned". (Original used plural form:
"all threads are". I tired to avoid any potential implication of cross (hardware)
"thread" ordering constraints for seeing the updates when lwsync is used.)


Next I note FreeBSD powerpc64 and 32-bit powerpc details
that happen to involve lwsync, though lwsync is not the
only issue:

atomic_store_rel_int(&th->th_generation, ogen);

and:

gen = atomic_load_acq_int(&th->th_generation);

with:

static __inline void                                            \
atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)       \
{                                                               \
                                                                \
        powerpc_lwsync();                                       \
        *p = v;                                                 \
}

and:

static __inline u_##TYPE                                        \
atomic_load_acq_##TYPE(volatile u_##TYPE *p)                    \
{                                                               \
        u_##TYPE v;                                             \
                                                                \
        v = *p;                                                 \
        powerpc_lwsync();                                       \
        return (v);                                             \
}                                                               \

also:

static __inline void
atomic_thread_fence_acq(void)
{

        powerpc_lwsync();
}
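
(As far as I can tell, powerpc_lwsync() is, at least for
powerpc64, just an inline-asm wrapper around the lwsync
instruction, something like:

#define powerpc_lwsync()        __asm __volatile("lwsync" : : : "memory")

so atomic_store_rel is lwsync-then-plain-store and
atomic_load_acq is plain-load-then-lwsync. I have not checked
whether 32-bit powerpc substitutes a heavier barrier here.)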



Now I list a simpler-than-full-context example to
try to make things clearer . . .

Here is a sequence, listed in an overall time
order, omitting other activity, spanning the two
distinct cpus (N!=M):


(Presume th->th_generation==ogen-1 initially, then:)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M below)

cpu M: gen = atomic_load_acq_int(&th->th_generation);


For the above sequence:

There is no barrier between the store and the later
load at all. This is important below.
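
Spelled out with the fence placement from the macro
expansions above (taking powerpc_lwsync() to be lwsync):

cpu N: lwsync                            (from atomic_store_rel_int)
cpu N: store th->th_generation=ogen
       <- nothing here orders N's store vs. M's load ->
cpu M: load  gen=th->th_generation
cpu M: lwsync                            (from atomic_load_acq_int)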


So, if I have that much right . . .

Now for more actual "load side" context:
(Presume, for simplicity, that there is only one
timehands instance instead of 2 or more timehands. So
th does not vary below and is the same on both cpus
in the later example sequence of activity.)

        do {
                th = timehands;
                gen = atomic_load_acq_int(&th->th_generation);
                *bt = th->th_offset;
                bintime_addx(bt, th->th_scale * tc_delta(th));
                atomic_thread_fence_acq();
        } while (gen == 0 || gen != th->th_generation);
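
For reference, the matching "store side" (tc_windup() style)
update is, paraphrased from my reading of the source (not
quoted verbatim; details may differ):

        th->th_generation = 0;          /* mark update in progress */
        atomic_thread_fence_rel();      /* release fence */
        th->th_offset = . . .;          /* update the th-> fields */
        th->th_scale = . . .;
        . . .
        if (++ogen == 0)                /* generation 0 is reserved */
                ogen = 1;
        atomic_store_rel_int(&th->th_generation, ogen);

which is why the reader loop above retries on gen == 0 as
well as on gen != th->th_generation.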

For ease of reference, I again show a specific
sequence in time, and from cpu N I only show the
&th->th_generation activity.

(Presume timehands->th_generation==ogen-1 initially
and that M!=N:)

cpu M: th = timehands;
       (Could be after the "cpu N" lines.)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M)

cpu M: gen = atomic_load_acq_int(&th->th_generation);
cpu M: *bt = th->th_offset;
cpu M: bintime_addx(bt, th->th_scale * tc_delta(th));
cpu M: atomic_thread_fence_acq();
cpu M: gen != th->th_generation
       (evaluated to false or to true)

So here:

A) gen ends up with: gen==ogen-1 || gen==ogen
   (either is allowed because of the lack of
   any barrier between the store and the
   involved load).

B) When gen==ogen: there was no barrier
   before the load into gen that guarantees
   how the other th-> field-values were staged
   relative to that load.

C) When gen==ogen: gen!=th->th_generation
   evaluating to false does not guarantee that
   the *bt=. . . and bintime_addx(. . .)
   activities were based on a coherent set of
   th-> field-values. (See the spelled-out
   scenario below.)
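
To spell out the scenario behind (C) (presuming, per (B),
that cpu M's load of gen carries no guarantee that cpu N's
earlier th-> field writes have already propagated to cpu M):

cpu N: stores the new th-> field-values, lwsync, stores
       th->th_generation=ogen
cpu M: load of th->th_generation is satisfied with ogen
       (the new value)
cpu M: loads of th->th_offset and th->th_scale are satisfied
       with stale values (writes not yet propagated to cpu M)
cpu M: lwsync; the re-read of th->th_generation again gives
       ogen, so the loop exits, validating a *bt that was
       computed from stale th-> field-values.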


If I'm correct about (C), then the likes of the
binuptime and sbinuptime implementations appear
to be broken on powerpc64 and 32-bit powerpc,
unless extra guarantees are always present.

So have I found at least a powerpc64/32-bit-powerpc
FreeBSD implementation problem?
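
For illustration only (not a claimed fix): if (C) is real,
the kind of stronger ordering involved would be sync on the
store side, something like:

        __asm __volatile("sync" : : : "memory");  /* instead of lwsync */
        th->th_generation = ogen;

since sync (unlike lwsync, per the quoted paper) guarantees
that the prior th-> field writes have propagated to all
threads before cpu N's later memory operations, including
the th_generation store becoming visible.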


Note: While I'm still testing, I've seen problems
on the two 970MP based 2-socket/2-cores-each G5
PowerMac11,2's that I've so far not seen on three
2-socket/1-core-each PowerMacs: two such 7455 G4
PowerMac3,6's and one such 970 G5 PowerMac7,2.
The two PowerMac11,2's are far more tested at
this point. But proving that any test failure is
specifically because of (C) is problematic.


Note: arm apparently has no equivalent of lwsync,
just of sync (a.k.a. hwsync and sync 0). If I
understand correctly, PowerPC/Power has the weakest
memory model of the modern tier-1/tier-2
architectures, so powerpc64 and 32-bit powerpc might
be broken for memory-model handling even when
everything else is working.



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


