[Bug 265974] SMR has several missing barriers

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 21 Aug 2022 14:47:25 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=265974

            Bug ID: 265974
           Summary: SMR has several missing barriers
           Product: Base System
           Version: CURRENT
          Hardware: arm64
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: pierre@habouzit.net

Created attachment 236039
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=236039&action=edit
tarball of litmus tests and pdfs

1. smr_enter() ordering is incorrect on arm64 (and probably other weak memory
order systems)

```C
static inline void
smr_enter(smr_t smr)
{
[...]
        /* This is an add because we do not have atomic_store_acq_int */
        atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
}
```

the comment here has a point, there's no atomic_store_acq_int because that
would be a StoreLoad barrier which `atomic_add_acq_int` isn't sufficient for.
Here is a litmus test that proves why this code above isn't correct. Let's
assume 2 threads running something akin to this:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_synchronize();

thread 1:
    smr_enter();
    load(&global);


it should NOT be possible for thread 2 to load `0` and for smr_synchronize() to
think thread 1 was not in a critical section pinning `123`

Moreover, smr_poll() absolutely needs a full memory barrier on entry, the
`atomic_load_acq_int` performed by `smr_poll` aren't sufficient.


```
AArch64 MP
{
    global = 0;
    wr_seq = 123;
    p1_rd_seq = 0;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                     | P1                         ;
 MOV      X8, #2        | LDR        X8, [X1]        ;
 STR      X8, [X0]      | STR        X8, [X2]        ;
 LDADDL   X8, X9, [X1]  | DMB        SY              ;
 DMB      SY            | LDR        X10, [X0]       ;
 LDR      X10, [X2]     |                            ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

This litmus test above shows that the condition is unfortunately reachable, the
one below passes (adding a full barrier in smr_scan() and adding a full barrier
in smr_enter()):

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;
    p1_rd_seq = 0;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                     | P1                         ;
 MOV      X8, #2        | LDR        X8, [X1]        ;
 STR      X8, [X0]      | STR        X8, [X2]        ;
 LDADDL   X8, X9, [X1]  | DMB        SY              ;
 DMB      SY            | LDR        X10, [X0]       ;
 LDR      X10, [X2]     |                            ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```


Note that I think the code works on Intel today.

2. similarly smr_poll_scan() needs a full barrier after the scan _before_ it
updates the global rd_seq, this is about serializing the fast path of smr_poll
with CPUs that weren't in a critical section (while the one on entry of
smr_poll() is about synchronizing with the CPUs inside an active SMR critical
section and was demonstrated in (1)).

I confess that litmus test is much more painful to write and is left as an
exercise to the reader ;)


3. smr_deferred_advance() is similarly incorrect as it doesn't guarantee proper
visibility in this case:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_deferred_advance(); // returns 125 but didn't update wr_seq

thread 1:
    smr_synchronize(); // 123 -> 125
    load(&global);

We should never hit a case when thread1 didn't observe the global being 2 but
thread 1 returned 125.

Here is a modelization of this sequence with the added `dmb sy` from (1) in
smr_poll:

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0                     | P1                         ;
 STR      X2, [X0]      | LDADDL     X2, X9, [X1]    ;
                        | DMB        SY              ;
 LDR      X9, [X1]      | LDR        X10, [X0]       ;
 ADD      X9, X9, X2    |                            ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

this unfortunately tells us that the condition is reachable, and intuitively it
makes sense because there's nothing in "P0" ordering anything. The fix is that
when a deferred advance is made, a full barrier must be issued (it has to be a
StoreLoad between the store to the global and the load of the write sequence),
and indeed this litmus test passes:

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0                     | P1                         ;
 STR      X2, [X0]      | LDADDL     X2, X9, [X1]    ;
 DMB      SY            | DMB        SY              ;
 LDR      X9, [X1]      | LDR        X10, [X0]       ;
 ADD      X9, X9, X2    |                            ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

Attached is a folder with the litmus test and nice PDF renderings of why on
arm64 those barriers are needed.

The tentative fix IMO is something like this, though I don't use nor even know
how to build FreeBSD so it might have mistakes:

```diff
diff --git a/sys/kern/subr_smr.c b/sys/kern/subr_smr.c
index cbbf185fee79..f7c7ccef8d6e 100644
--- a/sys/kern/subr_smr.c
+++ b/sys/kern/subr_smr.c
@@ -307,8 +307,10 @@ static smr_seq_t
 smr_deferred_advance(smr_t smr, smr_shared_t s, smr_t self)
 {

-       if (++self->c_deferred < self->c_limit)
+       if (++self->c_deferred < self->c_limit) {
+               atomic_thread_fence_seq_cst();
                return (smr_shared_current(s) + SMR_SEQ_INCR);
+       }
        self->c_deferred = 0;
        return (smr_default_advance(smr, s));
 }
@@ -427,6 +429,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t
s_rd_seq,
        CRITICAL_ASSERT(curthread);
        counter_u64_add_protected(poll_scan, 1);

+       atomic_thread_fence_seq_cst();
+
        /*
         * The read sequence can be no larger than the write sequence at
         * the start of the poll.
@@ -450,6 +454,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t
s_rd_seq,
                        rd_seq = SMR_SEQ_MIN(rd_seq, c_seq);
        }

+       atomic_thread_fence_seq_cst();
+
        /*
         * Advance the rd_seq as long as we observed a more recent value.
         */
diff --git a/sys/sys/smr.h b/sys/sys/smr.h
index c110be9a66c2..3d543d1979ae 100644
--- a/sys/sys/smr.h
+++ b/sys/sys/smr.h
@@ -133,7 +133,12 @@ smr_enter(smr_t smr)
         * is detected and handled there.
         */
        /* This is an add because we do not have atomic_store_acq_int */
+#if __x86_64__ || __i386__
        atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+#else
+       atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+       atomic_thread_fence_seq_cst();
+#endif
 }

 /*
```

-- 
You are receiving this mail because:
You are the assignee for the bug.