powerpc64 or 32-bit powerpc context: FreeBSD lwsync use vs. th->th_generation handling (and related th-> fields) [Correction]

Mark Millard marklmi at yahoo.com
Fri Apr 19 05:17:54 UTC 2019


[I caught my mental mistake.]

On 2019-Apr-18, at 21:36, Mark Millard <marklmi at yahoo.com> wrote:

> First I review below lwsync behavior. It is based on a comparison/contrast
> paper for the powerpc vs. arm memory models. It sets context for later
> material specific to powerpc64 or 32-bit powerpc FreeBSD.
> 
> "For a write before a read, separated by a lwsync, the barrier will ensure that the write is
> committed before the read is satisfied but lets the read be satisfied before the write has
> been propagated to any other thread."
> 
> (By contrast, sync guarantees that the write has propagated to all threads before the
> read in question is satisfied, the read having been separated from the write by the
> sync.)
> 
> Another wording in case it helps (from the same paper):
> 
> "The POWER lwsync does *not* ensure that writes before the barrier have propagated to
> any other thread before sequent actions, though it does keep writes before and after
> an lwsync in order as far as [each thread is] concerned". (Original used plural form:
> "all threads are". I tired to avoid any potential implication of cross (hardware)
> "thread" ordering constraints for seeing the updates when lwsync is used.)
> 
> 
> Next I note FreeBSD powerpc64 and 32-bit powerpc details
> that happen to involve lwsync, though lwsync is not the
> only issue:
> 
> atomic_store_rel_int(&th->th_generation, ogen);
> 
> and:
> 
> gen = atomic_load_acq_int(&th->th_generation);
> 
> with:
> 
> static __inline void                                            \
> atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)       \
> {                                                               \
>                                                                \
>        powerpc_lwsync();                                       \
>        *p = v;                                                 \
> }
> 
> and:
> 
> static __inline u_##TYPE                                        \
> atomic_load_acq_##TYPE(volatile u_##TYPE *p)                    \
> {                                                               \
>        u_##TYPE v;                                             \
>                                                                \
>        v = *p;                                                 \
>        powerpc_lwsync();                                       \
>        return (v);                                             \
> }                                                               \
> 
> also:
> 
> static __inline void
> atomic_thread_fence_acq(void)
> {
> 
>        powerpc_lwsync();
> }
> 
> 
> 
> First I list a simpler-than-full-context example to
> try to make things clearer . . .
> 
> Here is a sequence, listed in overall time order
> across the distinct cpus, omitting other activity
> (N!=M):
> 
> 
> (Presume th->th_generation==ogen-1 initially, then:)
> 
> cpu N: atomic_store_rel_int(&th->th_generation, ogen);
>       (same th value as for cpu M below)
> 
> cpu M: gen = atomic_load_acq_int(&th->th_generation);
> 
> 
> For the above sequence:
> 
> There is no barrier between the store and the later
> load at all. This is important below.
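
To make the barrier placement explicit, the two
operations expand to roughly the following instruction
sequences (a sketch in the style of the traces below;
mnemonics abbreviated):

cpu N:  lwsync                            (from atomic_store_rel_int)
        store ogen -> th->th_generation

cpu M:  load  th->th_generation -> gen
        lwsync                            (from atomic_load_acq_int)

The lwsyncs bracket the pair only from the outside;
nothing orders cpu N's store against cpu M's later load.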
> 
> 
> So, if I have that much right . . .
> 
> Now for more actual "load side" context:
> (Presume, for simplicity, that there is only one 
> timehands instance instead of 2 or more timehands. So
> th does not vary below and is the same on both cpus
> in the later example sequence of activity.)
> 
>        do {
>                th = timehands;
>                gen = atomic_load_acq_int(&th->th_generation);
>                *bt = th->th_offset;
>                bintime_addx(bt, th->th_scale * tc_delta(th));
>                atomic_thread_fence_acq();
>        } while (gen == 0 || gen != th->th_generation);
> 
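For reference, the matching "store side" in tc_windup()
looks roughly like the following (paraphrased from
sys/kern/kern_tc.c, with the field updates and
unrelated details elided):

static void
tc_windup(void)
{
        struct timehands *th, *tho;
        u_int ogen;

        tho = timehands;
        th = tho->th_next;
        ogen = th->th_generation;
        th->th_generation = 0;          /* mark update-in-progress */
        atomic_thread_fence_rel();      /* order the 0 before the field writes */

        /* . . . update th->th_offset, th->th_scale, and the rest . . . */

        if (++ogen == 0)
                ogen = 1;               /* never publish generation 0 */
        atomic_store_rel_int(&th->th_generation, ogen);

        timehands = th;                 /* go live with the updated timehands */
}

The gen == 0 test in the reader's while condition pairs
with the th->th_generation = 0 marking here.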
> For ease of reference: I again show a specific
> sequence in time, and I show only the
> &th->th_generation activity from cpu N, for
> simplicity.
> 
> (Presume timehands->th_generation==ogen-1 initially
> and that M!=N:)
> 
> cpu M: th = timehands;
>       (Could be after the "cpu N" lines.)
> 
> cpu N: atomic_store_rel_int(&th->th_generation, ogen);
>       (same th value as for cpu M)
> 
> cpu M: gen = atomic_load_acq_int(&th->th_generation);
> cpu M: *bt = th->th_offset;
> cpu M: bintime_addx(bt, th->th_scale * tc_delta(th));
> cpu M: atomic_thread_fence_acq();
> cpu M: gen != th->th_generation
>       (evaluated to false or to true)
> 
> So here:
> 
> A) gen ends up with: gen==ogen-1 || gen==ogen
>   (either is allowed because of the lack of
>   any barrier between the store and the
>   involved load).
> 
> B) When gen==ogen: there was no barrier
>   before the assignment to gen to guarantee
>   other th-> field-value staging relationships.

(B) is just wrong: seeing the new value (ogen)
does guarantee something about the other th->
field-value staging relationships seen, given the
lwsync before the store and the lwsync after the
load.
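
In C11 terms the pair behaves like a release store
matched with an acquire load. A minimal sketch of that
guarantee (my names, standing in for the th-> fields;
not the kernel code):

#include <stdatomic.h>

static unsigned scale;                  /* stands in for th_scale etc. */
static _Atomic unsigned generation;     /* stands in for th_generation */

/* store side, like atomic_store_rel_int(): */
static void
publish(unsigned s, unsigned ogen)
{

        scale = s;                      /* plain field write . . . */
        atomic_store_explicit(&generation, ogen,
            memory_order_release);      /* . . . ordered before this store */
}

/* load side, like atomic_load_acq_int(): */
static unsigned
snapshot(unsigned ogen)
{
        unsigned g;

        g = atomic_load_explicit(&generation,
            memory_order_acquire);      /* load first, then the barrier */
        if (g == ogen)
                return (scale);         /* guaranteed to see the new scale */
        return (0);                     /* the real loop would retry instead */
}

If snapshot() observes ogen, the write to scale that was
sequenced before the release store is visible to the
read of scale after the acquire load.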

> C) When gen==ogen: gen!=th->th_generation false
>   does not guarantee the *bt=. . . and
>   bintime_addx(. . .) activities were based
>   on a coherent set of th-> field-values.

Without (B), (C) does not follow: the release store
publishes the earlier th-> field writes, the acquire
load (and the later atomic_thread_fence_acq) orders the
field reads, and tc_windup sets th_generation to 0
before touching the other fields. So a re-read
generation that is non-zero and still equal to gen
implies the loop body used a coherent snapshot.

> If I'm correct about (C) then the likes of the
> binuptime and sbinuptime implementations appear
> to be broken on powerpc64 and 32-bit powerpc
> unless there are extra guarantees always present.
> 
> So have I found at least a powerpc64/32-bit-powerpc
> FreeBSD implementation problem?

No: I did not find a problem.

> Note: While I'm still testing, I've seen problems
> on the two 970MP based 2-socket/2-cores-each G5
> PowerMac11,2's that I've so far not seen on three
> 2-socket/1-core-each PowerMacs, two such 7455 G4
> PowerMac3,6's and one such 970 G5 PowerMac7,2.
> The two PowerMac11,2's are far more tested at
> this point. But proving that any test failure is
> specifically because of (C) is problematic.
> 
> 
> Note: arm apparently has no equivalent of lwsync,
> just of sync (a.k.a. hwsync and sync 0). If I
> understand correctly, PowerPC/Power has the weakest
> memory model of the modern tier-1/tier-2
> architectures and, so, the powerpc ports might be
> broken in their memory-model handling even when
> everything else is working.
> 



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


