RELENG_8 pf stack issue (state count spiraling out of control)
Vlad Galu
dudu at dudu.ro
Tue May 3 05:22:51 UTC 2011
On Tue, May 3, 2011 at 3:58 AM, Jeremy Chadwick <freebsd at jdc.parodius.com> wrote:
> (Please keep me CC'd as I'm not subscribed to freebsd-pf. And apologies
> for cross-posting, but the issue is severe enough that I wanted to make
> it known on -stable)
>
> The below issue I'm describing is from a machine running 8.2-PRERELEASE
> (RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011.
>
> Please read the story in full, as I have taken the time to describe
> everything I did, plus log output, as well as induce a panic via "call
> doadump" from ddb so I have a capture of the system at the time. I also
> have a theory as to what caused the problem, but how to trigger it is
> unknown; it may be a rare race condition.
>
>
> This morning I woke up to find a report from one of our users that he
> could not connect to a specific TCP port (not SSH) on one of our
> servers. I also found that I couldn't SSH into the same box. Serial
> console was working fine, and the serial console log showed no sign of
> any problems.
>
> I started to debug the issue of me not being able to SSH into the
> machine, and within a few minutes became seriously concerned: pfctl
> indicated we had reached the maximum permitted number of state table
> entries (10,000).
>
> ============================================================
> # pfctl -s info
> Status: Enabled for 76 days 06:49:10 Debug: Urgent
>
> Interface Stats for em0 IPv4 IPv6
> Bytes In 8969748840 0
> Bytes Out 8296135477 0
> Packets In
> Passed 128211763 0
> Blocked 621379 0
> Packets Out
> Passed 138483868 0
> Blocked 2579 0
>
> State Table Total Rate
> current entries 10000
> searches 267316807 40.6/s
> inserts 4440553 0.7/s
> removals 4430553 0.7/s
> Counters
> match 5067474 0.8/s
> bad-offset 0 0.0/s
> fragment 324 0.0/s
> short 0 0.0/s
> normalize 32 0.0/s
> memory 336946 0.1/s
> bad-timestamp 0 0.0/s
> congestion 0 0.0/s
> ip-option 0 0.0/s
> proto-cksum 1611 0.0/s
> state-mismatch 509 0.0/s
> state-insert 0 0.0/s
> state-limit 0 0.0/s
> src-limit 0 0.0/s
> synproxy 0 0.0/s
>
> # pfctl -s memory
> states hard limit 10000
> src-nodes hard limit 10000
> frags hard limit 5000
> tables hard limit 1000
> table-entries hard limit 100000
> ============================================================
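>
> (Those hard limits come straight from pf.conf's "set limit" directive,
> by the way; raising them would look something like the following --
> values purely illustrative, not what we actually run:
>
>     set limit states 20000
>     set limit src-nodes 20000
>     set limit table-entries 200000
> )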
>
> The above is mainly for em0 (our WAN interface); our LAN interface (em1)
> was not impacted because we use "set skip on em1". And it's a good
> thing too: this machine provides lots of LAN-based services that would
> otherwise have been impacted. We also use "set skip on lo0".
>
> I immediately went to look at our monitoring graphs, which monitor pf
> state (specifically state table entries), polled via bsnmpd(8). This
> data is even more frightening:
>
> http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png
> http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png
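>
> (The graphs are fed by polling the state count; the same number can be
> pulled without SNMP with a one-liner along these lines:
>
>     # pfctl -si | awk '/current entries/ { print $3 }'
> )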
>
> Something was spiraling out of control, starting at approx.
> 2011/05/01 (Sun) at 12:30 PDT. The situation became dire at approx.
> 19:45 PDT the same day, but I wasn't aware of it until said user brought
> the issue to my attention.
>
> You can see from the network I/O graphs (taken from SNMP polling our
> switch, NOT from the host/box itself) that there was no DoS attack or
> anything like that occurring -- this was something within FreeBSD
> itself. More evidence of that will become apparent.
>
> http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png
> http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png
>
> The first thing I did was "/etc/rc.d/pf reload". The command hung, and
> Ctrl-C/SIGINT did nothing. I was able to Ctrl-Z/SIGSTOP it, then use
> kill %1, but the reload process did not truly die (despite csh stating
> "Terminated"); the only way to kill it was kill -9.
>
> Daemons which used the network -- including ones that only used em1 --
> would not shut down; this included postfix, mysqld, and some inet-based
> services, all of which I was forced to kill -9. Things like bsnmpd,
> however, did shut down cleanly.
>
> Equally uncomfortable, "shutdown -r now" did not reboot the system:
> wall(1)'s announcement was shown, but the actual stopping of services
> never began.
>
> The next thing I tried was "/etc/rc.d/pf stop", which worked. Then I
> did "/etc/rc.d/pf start", which also worked. However, what I saw next
> surely indicated a bug in the pf layer somewhere -- "pfctl -s states"
> and "pfctl -s info" disagreed on the state count:
>
> ============================================================
> # pfctl -s info
> Status: Enabled for 0 days 00:00:16 Debug: Urgent
>
> Interface Stats for em0 IPv4 IPv6
> Bytes In 3459 0
> Bytes Out 0 0
> Packets In
> Passed 0 0
> Blocked 29 0
> Packets Out
> Passed 0 0
> Blocked 0 0
>
> State Table Total Rate
> current entries 10000
> searches 29 1.8/s
> inserts 0 0.0/s
> removals 0 0.0/s
> Counters
> match 29 1.8/s
> bad-offset 0 0.0/s
> fragment 0 0.0/s
> short 0 0.0/s
> normalize 0 0.0/s
> memory 18 1.1/s
> bad-timestamp 0 0.0/s
> congestion 0 0.0/s
> ip-option 0 0.0/s
> proto-cksum 0 0.0/s
> state-mismatch 0 0.0/s
> state-insert 0 0.0/s
> state-limit 0 0.0/s
> src-limit 0 0.0/s
> synproxy 0 0.0/s
>
> # pfctl -s state | wc -l
> 0
> ============================================================
>
> The "pf uptime" shown above, by the way, matches system uptime.
>
> I then attempted "pfctl -F state", but nothing changed (looked the same
> as above).
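>
> (For reference, the relevant pfctl flush modifiers:
>
>     pfctl -F states     # flush the state table
>     pfctl -F info       # zero the counters shown by -s info
>     pfctl -F all        # flush everything
>
> pfctl accepts unambiguous prefixes, so "-F state" means "-F states".)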
>
> Since I could not reboot the box, I was forced to drop to ddb via the
> serial console. I ran "ps" and a few similar commands, then "call
> doadump" to induce a kernel panic, and finally "reboot" (which worked).
>
> Once the machine came back up, savecore(8) ran, wrote the data out, and
> everything has been fine since. /var/crash/core.txt.0 is ~68KBytes and
> I do not feel comfortable sharing its content publicly, but will be
> happy to hand it to developer(s) who are interested. Relevant tidbits I
> can discern:
>
> ------------------------------------------------------------------------
> ps -axl
>
> UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
> 0 422 0 0 -16 0 0 0 pftm DL ?? 1362773081:04.00 [pfpurge]
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------
> vmstat -z
>
> ITEM            SIZE   LIMIT   USED  FREE  REQUESTS  FAILURES
> pfsrctrpl:       152,  10000,     0,    0,        0,        0
> pfrulepl:        912,      0,    40,   88,      806,        0
> pfstatepl:       392,  10000, 10000,    0,  4440553,   341638
> pfaltqpl:        240,      0,     0,    0,        0,        0
> pfpooladdrpl:     88,      0,     0,    0,        0,        0
> pfrktable:      1296,   1002,     4,   20,      112,        0
> pfrkentry:       216, 100008,   603,  891,    15384,        0
> pfrkentry2:      216,      0,     0,    0,        0,        0
> pffrent:          32,   5050,     0,  303,     1620,        0
> pffrag:           80,      0,     0,  135,      807,        0
> pffrcache:        80,  10035,     0,    0,        0,        0
> pffrcent:         24,  50022,     0,    0,        0,        0
> pfstatescrub:     40,      0,     0,    0,        0,        0
> pfiaddrpl:       120,      0,     0,    0,        0,        0
> pfospfen:        112,      0,   696,   30,    18096,        0
> pfosfp:           40,      0,   407,   97,    10582,        0
> ------------------------------------------------------------------------
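>
> The pfstatepl row looks like the smoking gun: USED is pinned at the
> 10000 LIMIT, and the 341638 allocation FAILURES roughly line up with
> the "memory" counter pf reported. A quick way to watch just that zone:
>
>     # vmstat -z | egrep 'ITEM|pfstatepl'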
>
> You can see evidence here of processes failing to exit or do what they
> should:
>
> ------------------------------------------------------------------------
> fstat
>
> USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
> root shutdown 91155 root / 2 drwxr-xr-x 512 r
> root shutdown 91155 wd / 2 drwxr-xr-x 512 r
> root shutdown 91155 text / 47195 -r-sr-x--- 15912 r
> root shutdown 91155 0 /dev 38 crw------- ttyu0 rw
> root shutdown 91155 1 /dev 38 crw------- ttyu0 rw
> root shutdown 91155 2 /dev 38 crw------- ttyu0 rw
> root sh 91129 root / 2 drwxr-xr-x 512 r
> root sh 91129 wd / 2 drwxr-xr-x 512 r
> root sh 91129 text / 44 -r-xr-xr-x 134848 r
> root sh 91129 0 /dev 38 crw------- ttyu0 rw
> root sh 91129 1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888 0 rw
> root sh 91129 2 /dev 20 crw-rw-rw- null w
> root shutdown 91115 root / 2 drwxr-xr-x 512 r
> root shutdown 91115 wd /storage 5 drwx------ 37 r
> root shutdown 91115 text / 47195 -r-sr-x--- 15912 r
> root shutdown 91115 0 /dev 38 crw------- ttyu0 rw
> root shutdown 91115 1 /dev 38 crw------- ttyu0 rw
> root shutdown 91115 2 /dev 38 crw------- ttyu0 rw
> root shutdown 91115 3* local dgram ffffff008ff92960
> root sh 90818 root / 2 drwxr-xr-x 512 r
> root sh 90818 wd / 70659 drwxr-xr-x 512 r
> root sh 90818 text / 44 -r-xr-xr-x 134848 r
> root sh 90818 0 /dev 38 crw------- ttyu0 rw
> root sh 90818 1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60 0 rw
> root sh 90818 2 /dev 20 crw-rw-rw- null w
> root csh 90802 root / 2 drwxr-xr-x 512 r
> root csh 90802 wd / 2 drwxr-xr-x 512 r
> root csh 90802 text / 51 -r-xr-xr-x 358752 r
> root csh 90802 15 /dev 38 crw------- ttyu0 rw
> root csh 90802 16 /dev 38 crw------- ttyu0 rw
> root csh 90802 17 /dev 38 crw------- ttyu0 rw
> root csh 90802 18 /dev 38 crw------- ttyu0 rw
> root csh 90802 19 /dev 38 crw------- ttyu0 rw
> ------------------------------------------------------------------------
>
> No indication of mbuf exhaustion, putting further focus on the pf stack:
>
> ------------------------------------------------------------------------
> netstat -m
>
> 2054/1786/3840 mbufs in use (current/cache/total)
> 2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max)
> 2048/896 mbuf+clusters out of packet secondary zone in use (current/cache)
> 0/320/320/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
> 4609K/4554K/9164K bytes allocated to network (current/cache/total)
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
> ------------------------------------------------------------------------
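>
> (The "denied" counters are the quick tell here:
>
>     # netstat -m | grep denied
>     0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>     0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>     0 requests for sfbufs denied
>
> all zero, so mbufs are off the hook.)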
>
> Here's one piece of core.txt.0 which makes no sense to me -- the "rate"
> column. I have a very hard time believing that was the interrupt rate
> of all the relevant devices at the time (way too high); the total column
> I could believe. Since vmstat -i derives each rate by dividing the
> total by system uptime, my guess is that the uptime read out of the
> crash dump was bogus.
>
> ------------------------------------------------------------------------
> vmstat -i
>
> interrupt total rate
> irq4: uart0 54768 912
> irq6: fdc0 1 0
> irq17: uhci1+ 172 2
> irq23: uhci3 ehci1+ 2367 39
> cpu0: timer 13183882632 219731377
> irq256: em0 260491055 4341517
> irq257: em1 127555036 2125917
> irq258: ahci0 225923164 3765386
> cpu2: timer 13183881837 219731363
> cpu1: timer 13002196469 216703274
> cpu3: timer 13183881783 219731363
> Total 53167869284 886131154
> ------------------------------------------------------------------------
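>
> Doing the division on those numbers supports the bogus-uptime theory:
> dividing each total by its "rate" gives ~60 in every row, e.g.
>
>     irq256: em0     260491055 /   4341517 =~ 60
>     cpu0: timer   13183882632 / 219731377 =~ 60
>
> i.e. it looks as though vmstat computed the rates against an uptime of
> about 60 seconds read from the dump, rather than the real 76 days.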
>
> Here's what a normal "vmstat -i" shows from the command-line:
>
> # vmstat -i
> interrupt total rate
> irq4: uart0 518 0
> irq6: fdc0 1 0
> irq23: uhci3 ehci1+ 145 0
> cpu0: timer 19041199 1999
> irq256: em0 614280 64
> irq257: em1 168529 17
> irq258: ahci0 355536 37
> cpu2: timer 19040462 1999
> cpu1: timer 19040458 1999
> cpu3: timer 19040454 1999
> Total 77301582 8119
>
> We graph many aspects of this box, including CPU load, memory/swap
> usage, etc., and none of them shows any sign that the interrupt rate on
> those devices was even remotely out of control. (I would expect CPU
> usage to be through the roof given the above data.)
>
> I have since rebuilt/reinstalled world/kernel on the machine with the
> latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May 2 14:44:18 PDT
> 2011), hoping whatever this was has since been fixed.
>
> As for what I think may have triggered it (though I have no hard
> evidence): on April 29th, I changed our pf.conf and did "/etc/rc.d/pf
> reload". The pf.conf change was a single line:
>
> Old: scrub on em0 all
> New: scrub in on em0 all
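>
> (For context: a scrub rule with no direction applies to packets in both
> directions, so the net effect of the change was simply to stop
> normalizing outbound packets on em0:
>
>     scrub on em0 all        # normalizes in + out
>     scrub in on em0 all     # normalizes inbound only
> )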
>
> Why the problem took approximately 3 days to appear is unknown. It's
> the only change we've made to the system (truly/honestly), and it was a
> change to pf.conf.
>
> If anyone has advice (or has seen the above problem), or is interested
> in debugging it -- as I said, I have a vmcore -- I'm happy to assist in
> any way I can. I would hate for someone else to get bitten by this, and
> I really am hoping it's something that has been fixed between February
> and now.
>
>
I'm seeing this as well. You could change your scrub rules so that you
specifically avoid TCP reassembly (which creates states).
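
Something along these lines (an untested sketch) keeps fragment handling
but makes sure the stateful "reassemble tcp" normalization is not in
play:

    scrub in on em0 all no-df fragment reassemble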
--
Good, fast & cheap. Pick any two.