rdr pass for proto tcp sometimes creates states with expire time zero, breaking connections

Sat Oct 27 21:48:38 UTC 2018

>     In the problem I have reported for states of "rdr pass" rules I see
>     start=6000, end=12000, timeout=86400 and (obviously erroneous, probably
>     negative) states=0xffffffd0.
> 
> I have no idea how that can happen. Just to make sure I understand: you
> know that states is negative here because of a printf() or SDT addition
> in pf_expire_states(), right?

I did not change the kernel; I use DTrace on my firewall server
fwextern. In pf.conf I have changed all production "rdr pass" rules to
an rdr rule plus an extra filter rule. Only one "rdr pass" rule is left
for testing:

  rdr pass on $if_internet proto tcp from 31.17.172.227 to $ip_internet
      port 8022 -> 10.0.0.254

Now I start the following DTrace script pfcounter.d, which will be
active when a SYN on port 8022 arrives:

#!/usr/sbin/dtrace -s

fbt::pf_normalize_tcp:entry
/((*(args[2]->m_hdr.mh_data + 33)) & 0x02) == 0x02 &&
 htons(*(short *)(args[2]->m_hdr.mh_data + 22)) == 8022/
       /* SYN + port 8022 */
{ self->flag1 = 1; }

fbt::pf_test:return
/self->flag1/ { self->flag1 = 0; }

fbt::pfioctl:entry
/args[1] == 3221767193 && ((struct pfioc_states *)args[2])->ps_len != 0/
       /* DIOCGETSTATES  &&  len != 0 */
{ self->flag2 = 1; }

fbt::counter_u64_fetch:entry
/self->flag2/ { }
fbt::counter_u64_fetch:return
/self->flag2/
{ printf("        returncode (states_cur)=%d / 0x%x", args[1], args[1]); }

fbt::pfioctl:return
/self->flag2/
{ self->flag2 = 0; }
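
The number 3221767193 in the pfioctl predicate is easier to check once
it is split into its fields. The following is a minimal sketch, assuming
the usual BSD ioctl encoding from <sys/ioccom.h> (direction in the top
bits, parameter length in bits 16..28, group character in bits 8..15,
command number in bits 0..7). The helper name decode_ioctl is made up
for illustration.

```python
# Decode a BSD-style ioctl request number into its fields.
# The bit layout is an assumption taken from FreeBSD's <sys/ioccom.h>.
IOC_OUT = 0x40000000       # kernel copies parameters out to userland
IOC_IN = 0x80000000        # kernel copies parameters in from userland
IOCPARM_MASK = 0x1FFF      # 13 bits of parameter length


def decode_ioctl(request):
    """Return the direction, length, group and number of an ioctl request."""
    return {
        "copy_in": bool(request & IOC_IN),
        "copy_out": bool(request & IOC_OUT),
        "length": (request >> 16) & IOCPARM_MASK,
        "group": chr((request >> 8) & 0xFF),
        "number": request & 0xFF,
    }


fields = decode_ioctl(3221767193)
print(hex(3221767193), fields)
```

The result is 0xc0084419: an in/out request in group 'D' with command
number 25 and a parameter length of 8 bytes, which is consistent with
DIOCGETSTATES being defined as _IOWR('D', 25, struct pfioc_states).
Since the length is part of the number, the constant would differ on a
machine where that structure has another size.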

Now, on my remote test client (IP 31.17.172.227), I run the command

   ssh -p 8022 fwextern sleep 20

This creates a state on fwextern for the "rdr pass" rule with an expire
time of zero. I must be quick to run "pfctl -vss" on fwextern to see
this state, and the output of the DTrace script shows the "negative"
value of the counter:

=== root@fwextern (pts/0) -> ./pfcounter.d
dtrace: script './pfcounter.d' matched 6 probes
CPU     ID                    FUNCTION:NAME
  3  17624          counter_u64_fetch:entry
  3  17625         counter_u64_fetch:return         returncode (states_cur)=4294967248 / 0xffffffd0

If I run the ssh command on the test client twice, the counter is one
less negative than before:

=== root@fwextern (pts/0) -> ./pfcounter.d
dtrace: script './pfcounter.d' matched 6 probes
CPU     ID                    FUNCTION:NAME
  3  17624          counter_u64_fetch:entry
  3  17625         counter_u64_fetch:return         returncode (states_cur)=4294967249 / 0xffffffd1
  3  17624          counter_u64_fetch:entry
  3  17625         counter_u64_fetch:return         returncode (states_cur)=4294967249 / 0xffffffd1
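
To make the "negative" reading of these values explicit, here is a small
sketch that reinterprets them as 32-bit two's-complement integers. That
the counter should be read as a 32-bit quantity that wrapped below zero
is my assumption; the helper name as_signed32 is made up for
illustration.

```python
def as_signed32(value):
    """Reinterpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# The two values printed by the DTrace script.
for raw in (4294967248, 4294967249):
    print(f"{raw} / {raw:#x} -> {as_signed32(raw)}")
```

Read this way, the two values are -48 and -47, so each new state raises
the counter by one as expected; only the starting point is wrong.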

Because of "sleep 20" the ssh command does not return and must be
killed. I have observed the problem on two of my firewall servers; the
pf rules were never reloaded since boot. I think some unknown event in
the past must have triggered the negative counter value.
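
The link between a wrapped counter and an expire time of zero can be
sketched with the adaptive timeout scaling described in pf.conf(5):
above adaptive.start, timeouts are scaled by
(adaptive.end - states) / (adaptive.end - adaptive.start), and they
become zero once adaptive.end is reached. The code below follows that
description only; it is not the kernel source, the function name
scaled_timeout is made up, and treating the counter as an unsigned
number is an assumption. The values start=6000, end=12000 and
timeout=86400 are the ones quoted at the top of this mail.

```python
def scaled_timeout(timeout, states, start, end):
    """Adaptive timeout scaling as described in pf.conf(5) (sketch only)."""
    if end and start < end and states > start:
        if states < end:
            # Linear scaling between adaptive.start and adaptive.end.
            return timeout * (end - states) // (end - start)
        # At or beyond adaptive.end every timeout becomes zero.
        return 0
    return timeout


print(scaled_timeout(86400, 100, 6000, 12000))         # below adaptive.start
print(scaled_timeout(86400, 9000, 6000, 12000))        # halfway between
print(scaled_timeout(86400, 0xFFFFFFD0, 6000, 12000))  # wrapped counter
```

With a counter of 0xffffffd0 taken as unsigned, the state count looks
far above adaptive.end, so the timeout comes out as zero, which would
match the states I see.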

I will try to add a statement to the kernel that recognizes the problem
and go back to the "rdr pass" rules, so that the next time the problem
occurs we have more information than now.

Kind regards,
Andreas