connect: not permitted by pf state lookup failures on heavier load

Gergely CZUCZY phoemix at harmless.hu
Thu Jul 26 09:16:18 UTC 2007


Hello,

Recently I've been experimenting with a CARP+pfsync+Pound application-level proxy.
At a high connection rate I noticed some failed connections, and the
application-level proxy marked the backend web servers DEAD, i.e. unreachable.

Pound sits on the gateway, accepts connections from the outside world, and makes
connections to the backend servers.

The state table grew to about 32K states in total. At a very high rate, when
Pound tried to reach a backend server with connect(2), it received an
"Operation not permitted" error, which was quite strange. Sometimes there
were also "broken pipes", likewise clustered around those state lookup failures.

On further digging I set pf's debug level to misc and noticed state-table
lookup failures just before Pound's connect(2) error messages.
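As a diagnostic aside (my addition, not part of the original report): pf keeps
counters that make this kind of failure visible without trawling the kernel log.
The pfctl(8) invocations below are a sketch of what one would run as root on the
gateway while the benchmark is going:

```shell
# Global pf counters, including state-mismatch (the "BAD state" events)
# and memory allocation failures:
pfctl -s info

# Hard pool limits (states, frags, src-nodes) to compare against usage:
pfctl -s memory

# Current number of entries in the state table:
pfctl -s states | wc -l
```

Watching state-mismatch and the state count together shows whether the
failures correlate with the table approaching its limit or with stale states
being matched.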

It looks like this:
Jul 26 10:46:54 lvs1 kernel: pf: BAD state: TCP 192.168.4.55:80 192.168.4.55:80 192.168.4.251:42688 [lo=3773866253 high=3773932711 win=2003 modulator=155307840 wscale=5] [lo=9137549 high=9201645 win=33304 modulator=2788154389 wscale=1] 9:9 S seq=3822349776 ack=9137549 len=0 ackskew=0 pkts=35:42 dir=in,fwd
Jul 26 10:46:54 lvs1 kernel: pf: State failure on: 1       | 5

There are also lots of operation timeouts and "connection reset by peer"
errors. When I disable pf there are far fewer of them.

The pf.conf is the following:
--- BEGIN pf.conf ---
if_ext="em0"
if_vvv="fxp0"
if_sync="em1"

ip_pub="192.168.4.55"
ip_vvv="10.0.0.254"

ip_vvv1="10.0.0.1"
ip_vvv2="10.0.0.2"
ip_vvv3="10.0.0.3"

table <vvv> {$ip_vvv1, $ip_vvv2, $ip_vvv3}

# Options: tune the behavior of pf, default values are given.
set timeout { interval 5, frag 30 }
#set timeout { tcp.first 120, tcp.opening 30, tcp.established 86400 }
set timeout { tcp.closing 900, tcp.finwait 30, tcp.closed 60 }
#set timeout { udp.first 60, udp.single 30, udp.multiple 60 }
#set timeout { icmp.first 20, icmp.error 10 }
#set timeout { other.first 60, other.single 30, other.multiple 60 }
set timeout { adaptive.start 30000, adaptive.end 90000 }
set limit { states 100000, frags 2000 }
#set loginterface none
set block-policy return
set require-order yes
set fingerprints "/etc/pf.os"
set debug misc

set skip on lo0

#scrub in all

rdr on $if_ext proto tcp from any to $ip_pub port 10001 -> $ip_vvv1 port 22
rdr on $if_ext proto tcp from any to $ip_pub port 10002 -> $ip_vvv2 port 22
rdr on $if_ext proto tcp from any to $ip_pub port 10003 -> $ip_vvv3 port 22

block in log on $if_ext all

pass in quick on {$if_ext,$if_vvv} proto vrrp
pass out quick on {$if_ext,$if_vvv} proto vrrp

pass out quick on $if_ext proto udp from any to 192.168.4.200 port 123 keep state

pass in quick on $if_ext proto tcp from any to $if_ext:0 port 22 flags S/SA synproxy state (no-sync)
pass in quick on $if_ext proto tcp from any to $ip_pub port 80 flags S/SA modulate state (no-sync)

pass out quick on $if_ext proto udp from $if_ext:0 to port 53 keep state (no-sync)
pass out quick on $if_ext proto udp from any to port 53 keep state

pass out quick on $if_ext proto tcp from $if_ext:0 to port 80 flags S/SA keep state (no-sync)
pass out quick on $if_ext proto tcp from any to port 80 flags S/SA keep state

pass in quick on $if_ext proto tcp from any to <vvv> port 22 flags S/SA synproxy state

#pass out quick on $if_vvv proto tcp from ($if_vvv) to <vvv> port 80 flags S/SA keep state (no-sync)
pass out quick on $if_vvv proto tcp from ($if_vvv) to {$ip_vvv1,$ip_vvv2,$ip_vvv3} port 80 flags S/SA keep state (no-sync)
--- END pf.conf ---
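A side note on the adaptive settings above (my calculation, not from the
original post): per pf.conf(5), once the state count exceeds adaptive.start,
pf scales all timeout values linearly, reaching zero at adaptive.end. A small
sketch of that formula with the values from this pf.conf:

```python
def adaptive_scale(states, start=30000, end=90000):
    """Linear timeout scaling factor pf applies (per pf.conf(5)):
    below adaptive.start timeouts are unscaled; above it they shrink
    linearly, reaching 0 at adaptive.end."""
    if states <= start:
        return 1.0
    if states >= end:
        return 0.0
    return (end - states) / (end - start)

# With the ~32K states observed here, scaling has barely kicked in:
factor = adaptive_scale(32000)
print(round(factor, 3))  # 0.967, e.g. tcp.established 86400s -> 83520s
```

So at 32K states the timeouts are still at ~97% of their configured values;
adaptive scaling alone should not be aggressively expiring states at this load.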

I've played around with the TCP timeouts, scrubbing, and the adaptive settings,
and swapped the last two rules (table vs. individual addresses), but it all led
nowhere; nothing changed.

I'm testing this proxy with around 10-15 instances of ab (ApacheBench, from the
apache port), each with 8 or 16 concurrent connections and 500 requests per run,
in an infinite loop.
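For concreteness, the load-generator setup described above might look roughly
like this (my reconstruction; the target URL is the public VIP from the
pf.conf, and the instance count and concurrency are the figures mentioned):

```shell
# Hypothetical sketch of the test harness: 15 backgrounded ab instances,
# each looping forever over 500 requests at concurrency 16.
for i in $(seq 1 15); do
    ( while true; do
          ab -n 500 -c 16 http://192.168.4.55/ >/dev/null 2>&1
      done ) &
done
wait
```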

Here's an hour's worth of the messages log. pf wasn't enabled for the whole
hour, only for somewhat more than half of it:
http://phoemix.harmless.hu/messages-pffail.0.bz2

The question is: what can cause this high rate of connection failures?
What have I done wrong? What is happening here? I've never seen such
behaviour from pf.
How can it be fixed so that pf behaves stably under heavier load?

Sincerely,

Gergely Czuczy
mailto: gergely.czuczy at harmless.hu

-- 
Weenies test. Geniuses solve problems that arise.
