rdr pass for proto tcp sometimes creates states with expire time zero and so breaking connections

Wed Nov 14 07:06:07 UTC 2018

On Tue, Nov 13, 2018 at 11:17:47PM +0100, Kristof Provost wrote:
> On 13 Nov 2018, at 22:01, Andreas Longwitz wrote:
> >>
> >>     Are there any hints why the counter pf_default_rule->states_cur
> >>     could get a negative value ?
> >>
> >> I’m afraid I have no idea right now.
> >>
> >
> > OK, in the meantime I did some more research and I am now quite sure 
> > the
> > problem with the bogus pf_default_rule->states_cur counter is not a
> > problem in pf. I am convinced it is a problem in counter(9) on i386
> > server. The critical code is the machine instruction cmpxchg8b used in
> > /sys/i386/include/counter.h.
> >
> I’m always happy to hear problems aren’t my fault :)
> 
> >> From intel instruction set reference manual:
> > Zhis instruction can be used with a LOCK prefix allow the instruction 
> > to
> > be executed atomically.
> >
> > We have two other sources in kernel using cmpxchg8b:
> >   /sys/i386/include/atomic.h   and
> >   /sys/cddl/contrib/opensolaris/common/atomic/i386/opensolaris_atomic.S
> >
> > Both make use of the LOCK feature, in atomic.h a detailed explanation 
> > is
> > given. Because counter.h lacks the LOCK prefix I propose the following
> > patch to get around the leak:
> >
> > --- counter.h.orig       2015-07-03 16:45:36.000000000 +0200
> > +++ counter.h   2018-11-13 16:07:20.329053000 +0100
> > @@ -60,6 +60,7 @@
> >         "movl   %%edx,%%ecx\n\t"
> >         "addl   (%%edi),%%ebx\n\t"
> >         "adcl   4(%%edi),%%ecx\n\t"
> > +       "lock   \n\t"
> >         "cmpxchg8b %%fs:(%%esi)\n\t"
> >         "jnz    1b"
> >         :
> > @@ -76,6 +77,7 @@
> >         __asm __volatile(
> >         "movl   %%eax,%%ebx\n\t"
> >         "movl   %%edx,%%ecx\n\t"
> > +       "lock   \n\t"
> >         "cmpxchg8b      (%2)"
> >         : "=a" (res_lo), "=d"(res_high)
> >         : "SD" (p)
> > @@ -121,6 +123,7 @@
> >         "xorl   %%ebx,%%ebx\n\t"
> >         "xorl   %%ecx,%%ecx\n\t"
> >  "1:\n\t"
> > +       "lock   \n\t"
> >         "cmpxchg8b      (%0)\n\t"
> >         "jnz    1b"
> >         :
> >
> That looks very plausible. I’m somewhat out of my depth here, so I’d 
> like the authors of the counter code to take a look at it.

No, it does not look correct.  The only atomicity guarantee that is required
from the counter.h inc and zero methods are atomicity WRT context switches.
The instructions are always executed on the CPU which owns the PCPU element
in the counter array, and since the update is executed as single instruction,
it does not require more expensive cache line lock AKA LOCK prefix.  This
is the main feature of the counters on x86.

It might read bogus value when fetching the counter but counter.h KPI only
guarantee is that the readouts are mostly correct.  If you have systematically
wrong value always read, there is probably something different going on.