rdr pass for proto tcp sometimes creates states with expire time zero and so breaking connections

Thu Jan 24 16:49:58 UTC 2019

after some more long term research I have an educated guess whats going
on in this problem.

The problem only occurs on i386.

If I replace the counter_u64_fetch() call in pf_state_expires() by
the value of V_pf_status.states, then pf works without problems, the
expire time zero problem is gone:

--- pf.c.1st    2018-08-14 10:17:41.000000000 +0200
+++ pf.c        2019-01-19 17:49:18.000000000 +0100
@@ -1542,7 +1542,7 @@
        start = state->rule.ptr->timeout[PFTM_ADAPTIVE_START];
        if (start) {
                end = state->rule.ptr->timeout[PFTM_ADAPTIVE_END];
-               states = counter_u64_fetch(state->rule.ptr->states_cur);
+               states = V_pf_status.states;
        } else {
                start = V_pf_default_rule.timeout[PFTM_ADAPTIVE_START];
                end = V_pf_default_rule.timeout[PFTM_ADAPTIVE_END];

The use of counter_u64_fetch() looks a little bit at random for me. For
all states not associated with the pf_default_rule the value of
pf_status.states is used and for me this value is ok for all rules.

Further the counter(9) framework was created for quick and lockless
write of counters, but fetching is more expansive. So I suggest to let
pf_state_expires() work without a counter fetch.

Further I can confirm that the counter states_cur of the pf_default_rule
remains correct, when the patch given above is active. Without the patch
the counter on my main firewall machine gets slowly negative. I have
verified this with a lot of live DTrace and kgdb script debugging.

>> OK, in the meantime I did some more research and I am now quite sure the
>> problem with the bogus pf_default_rule->states_cur counter is not a
>> problem in pf. I am convinced it is a problem in counter(9) on i386
>> server. The critical code is the machine instruction cmpxchg8b used in
>> /sys/i386/include/counter.h.
>> 
>> From intel instruction set reference manual:
>> Zhis instruction can be used with a LOCK prefix allow the instruction to
>> be executed atomically.
>> 
>> We have two other sources in kernel using cmpxchg8b:
>>   /sys/i386/include/atomic.h   and
>>   /sys/cddl/contrib/opensolaris/common/atomic/i386/opensolaris_atomic.S
> 
> A single CPU instruction is atomic by definition, with regards to the CPU.
> A preemption can not happen in a middle of instruction. What the "lock"
> prefix does is memory locking to avoid unlocked parallel access to the
> same address by different CPUs.
> 
> What is special about counter(9) is that %fs:%esi always points to a
> per-CPU address, because %fs is unique for every CPU and is constant,
> so no other CPU may write to this address, so lock prefix isn't needed.
> 
> Of course a true SMP i386 isn't a well tested arch, so I won't assert
> that counter(9) doesn't have bugs on this arch. However, I don't see
> lock prefix necessary here.

I think the problem is the cmpxchg8b instruction used in
counter_u64_fetch(), because this machine instruction always writes to
memory, also when we only want to read and have (EDX:EAX) = (ECX:EBX):

    TEMP64 <- DEST
    IF (EDX:EAX = TEMP64)
       THEN
          ZF <- 1
          DEST <- ECX:EBX
       ELSE
          ZF <- 0
          EDX:EAX <- TEMP64
          DEST <- TEMP64
    FI

If one CPU increments the counter in pf_create_state() and another does
the fetch, then both CPU's may run the xmpxschg8b at once with the
chance that both read the same memory value in TEMP64 and the fetching
CPU is the second CPU that writes and so the increment is lossed. Thats
what I see without the above patch two or three times a week.

Andreas