svn commit: r280759 - head/sys/netinet

Bruce Evans brde at optusnet.com.au
Mon Mar 30 01:54:03 UTC 2015


On Sun, 29 Mar 2015, Ian Lepore wrote:

> On Mon, 2015-03-30 at 00:07 +0300, Gleb Smirnoff wrote:
>> On Sun, Mar 29, 2015 at 08:16:46AM -0700, Adrian Chadd wrote:
>> A> On 29 March 2015 at 01:13, Hans Petter Selasky <hps at selasky.org> wrote:
>> A> > On 03/28/15 20:16, Gleb Smirnoff wrote:
>> A> >>
>> A> >> +uint16_t
>> A> >> +ip_newid(void)
>> A> >> +{
>> A> >> +
>> A> >> +       counter_u64_add(V_ip_id, 1);
>> A> >> +       return (htons((*(uint64_t *)zpcpu_get(V_ip_id)) & 0xffff));
>> A> >> +}
>> A> >
>> A> > Technically you would need to enter a critical section here, so that the
>> A> > current process doesn't get swapped to a different CPU between the counter
>> A> > add and the zpcpu_get.
>> A>
>> A> +10000 here.
>> A>
>> A> You can get preempted and shifted to another CPU at almost any point
>> A> in time you're not explicitly asking not to be. It's.. frustrating.
>>
>> What happens in case of the race is that one CPU will use counter of the
>> other CPU, reusing the ID that was just used by the other CPU. And this
>> isn't a fatal race, it is just reuse of ID. The probability of such event
>> is much lower than probability of a collision due to overflowing uint16_t.

It's more confusing (and probably more harmless) than that.  When there is
a context switch, there are lots of atomic ops and delays which might
sync the memory system.  Otherwise, the reader on a different CPU might
see stale data instead of the value that the same thread just wrote.
Then the read is non-atomic except possibly on 64-bit systems.  It may
see a mixture of old and new data.  But only 16 bits are used, and on
non-exotic systems these bits will be in the same memory access unit, so
the result may be new or old but not a mixture.  Code that runs
unsynchronized can never depend on getting an in-order result.  This is
correct iff the order is unimportant.
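
To make hps' suggestion concrete, here is a minimal sketch of the
critical-section variant, assuming the same counter_u64/zpcpu interface
as in the quoted patch (untested; it only pins the add and the read to
one CPU, it doesn't change how IDs are allocated):

uint16_t
ip_newid(void)
{
	uint64_t id;

	critical_enter();		/* no preemption, hence no migration */
	counter_u64_add(V_ip_id, 1);
	id = *(uint64_t *)zpcpu_get(V_ip_id);
	critical_exit();
	return (htons(id & 0xffff));
}

Whether the extra critical section is worth it is exactly what is being
argued below: the race only causes occasional reuse of another CPU's ID.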

>> For example, if we emit 1 Mpps, then we overflow the ID counter 15 times
>> per second. Every ID is reused 15 times within this small interval of time,
>> which of course is larger than TTL. And we find this kinda acceptable.
>>
>> To illustrate the probability of 1 instruction race, I would suggest to
>> look at PCPU_ADD() implementation for all arches except amd64/i386 and
>> find out that it is prone to this race. Hitting the race in PCPU_ADD()
>> will usually lead to mangling vm_meter statistics. Did we ever observe
>> this in practice? No.

Ugh, PCPU_ADD() is broken on i386 too.  It violates its own stated
requirement that the update be atomic with respect to interrupts.  i386
can only naturally support up to 32-bit atomic accesses, but broken MI
code assumes that all arches support 64-bit accesses.  i386 version:

X /*
X  * Adds a value of the per-cpu counter name.  The implementation
X  * must be atomic with respect to interrupts.
X  */

But this implementation is not atomic with respect to interrupts.

X #define	__PCPU_ADD(name, val) do {					\
X 	__pcpu_type(name) __val;					\
X 	struct __s {							\
X 		u_char	__b[MIN(sizeof(__val), 4)];			\
X 	} __s;								\
X 									\
X 	__val = (val);							\
X 	if (sizeof(__val) == 1 || sizeof(__val) == 2 ||			\
X 	    sizeof(__val) == 4) {					\
X 		__s = *(struct __s *)(void *)&__val;			\
X 		__asm __volatile("add %1,%%fs:%0"			\
X 		    : "=m" (*(struct __s *)(__pcpu_offset(name)))	\
X 		    : "r" (__s));					\

This part uses asm partly to ensure that a suitable atomic instruction is
used.

X 	} else								\
X 		*__PCPU_PTR(name) += __val;				\

This part handles 64-bit integers.  Since i386 is 32 bits and SSE is
unavailable, the compiler has to do a bunch of non-atomic 32-bit accesses,
and without a compiler barrier it can do these in ways that are especially
bad for atomicity.

X } while (0)
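
If the 64-bit fallback were to honour the stated requirement, it would
have to run with interrupts disabled, roughly as in this rewrite of the
quoted else branch (sketch only; assumes i386's intr_disable() and
intr_restore(); it is slow and still does nothing for readers on other
CPUs):

	} else {							\
		register_t __is;					\
									\
		__is = intr_disable();	/* also prevents migration */	\
		*__PCPU_PTR(name) += __val;				\
		intr_restore(__is);					\
	}								\
} while (0)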

PCPU_INC() is similarly broken.  PCPU_GET() and PCPU_SET() do similar
non-atomic accesses, but never claimed that they were atomic with
respect to interrupts.  PCPU_PTR() is racy as designed.

vmmeter uses a *(type_t *)&pcpu->pc_bar hack like the above.  This has
smaller atomicity problems because type_t is always int (vmmeter only
supports u_int).  Perhaps the i386 pcpu accesses are not so broken
after all, with most MI code only doing 32-bit accesses.

As you know, I don't like the implementation of counter(9), but it
avoids some of the above problems.  On amd64, it uses a simple asm
with addq; PCPU_ADD() would do the same except for it not having a
compiler barrier.  On i386, to avoid the above bug it uses a slow
method using cmpxchg8b if available, else a slower method using
critical sections.  Most other arches use the critical section
method.  Even sparc64.  sparc64's PCPU_*() is even sloppier.  The
access methods are sloppy too, as for vmmeter.  kib wrote XXX
comments about only some of them.  i386 seems to be more careful --
it uses a complicated rendezvous method or critical section where
needed.
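
The cmpxchg8b method amounts to a CAS loop on the per-CPU 64-bit slot.
Roughly (this is a sketch of the idea, not the actual
sys/i386/include/counter.h code, which uses %fs-relative inline asm and
omits the lock prefix because it only needs atomicity against interrupts
on the owning CPU; the compiler builtin below emits a locked cmpxchg8b):

static __inline void
counter_64_add_sketch(uint64_t *p, uint64_t inc)
{
	uint64_t old;

	do {
		old = *p;	/* may be a torn read; the CAS catches that */
	} while (__sync_val_compare_and_swap(p, old, old + inc) != old);
}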

> I was under the impression that PCPU_ADD(), like all pcpu r-m-w
> accesses, requires the calling code to have taken steps to ensure the
> code doesn't migrate during the sequence.  If callers aren't doing so,
> that seems like a bug in the calling code, not the PCPU_ADD() code.

No, PCPU_ADD() is supposed to be atomic enough, and the caller is
only supposed to use it when it doesn't matter on which CPU the
counter is incremented.  It is normal operation for the addition
to land on a different CPU than the one that called PCPU_ADD().
Similarly for counter(9).  It doesn't matter where pure counters
are accumulated, especially since there are almost no ways to see
the counts for individual CPUs.
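
For completeness, the only value anyone normally sees is the sum over
all CPUs, roughly what counter_u64_fetch() amounts to.  A sketch,
assuming the zpcpu_get_cpu() accessor and tolerating the usual slightly
stale per-CPU reads:

static uint64_t
counter_fetch_sketch(counter_u64_t c)
{
	uint64_t sum;
	int cpu;

	sum = 0;
	CPU_FOREACH(cpu)
		sum += *(uint64_t *)zpcpu_get_cpu(c, cpu);
	return (sum);
}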

Bruce

