RELENG_8 pf stack issue (state count spiraling out of control)

Tue May 3 06:10:02 UTC 2011

On Tue, May 3, 2011 at 8:01 AM, Jeremy Chadwick <freebsd at jdc.parodius.com>wrote:

> On Tue, May 03, 2011 at 07:22:10AM +0200, Vlad Galu wrote:
> > On Tue, May 3, 2011 at 3:58 AM, Jeremy Chadwick <
> freebsd at jdc.parodius.com>wrote:
> >
> > > (Please keep me CC'd as I'm not subscribed to freebsd-pf.  And
> apologies
> > > for cross-posting, but the issue is severe enough that I wanted to make
> > > it known on -stable)
> > >
> > > The below issue I'm describing is from a machine running 8.2-PRERELEASE
> > > (RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011.
> > >
> > > Please read the story in full, as I have taken the time to describe
> > > everything I did, plus log output, as well as induce a panic via "call
> > > doadump" from ddb so I have a capture of the system at the time.  I
> also
> > > have a theory as to what caused the problem, but how to trigger it is
> > > unknown; it may be a rare race condition.
> > >
> > >
> > > This morning I woke up to find a report from one of our users that he
> > > could not connect to a specific TCP port (not SSH) on one of our
> > > servers.  I also found that I couldn't SSH into the same box.  Serial
> > > console was working fine, and the serial console log showed no sign of
> > > any problems.
> > >
> > > I started to debug the issue of me not being able to SSH into the
> > > machine and within a few minutes became immediately concerned: pfctl
> > > indicated we had reached the maximum number permitted state table
> > > entries (10,000).
> > >
> > > ============================================================
> > > # pfctl -s info
> > > Status: Enabled for 76 days 06:49:10          Debug: Urgent
> > >
> > > Interface Stats for em0               IPv4             IPv6
> > >  Bytes In                      8969748840                0
> > >  Bytes Out                     8296135477                0
> > >  Packets In
> > >    Passed                       128211763                0
> > >    Blocked                         621379                0
> > >  Packets Out
> > >    Passed                       138483868                0
> > >    Blocked                           2579                0
> > >
> > > State Table                          Total             Rate
> > >  current entries                    10000
> > >  searches                       267316807           40.6/s
> > >  inserts                          4440553            0.7/s
> > >  removals                         4430553            0.7/s
> > > Counters
> > >  match                            5067474            0.8/s
> > >  bad-offset                             0            0.0/s
> > >  fragment                             324            0.0/s
> > >  short                                  0            0.0/s
> > >  normalize                             32            0.0/s
> > >  memory                            336946            0.1/s
> > >  bad-timestamp                          0            0.0/s
> > >  congestion                             0            0.0/s
> > >  ip-option                              0            0.0/s
> > >  proto-cksum                         1611            0.0/s
> > >  state-mismatch                       509            0.0/s
> > >  state-insert                           0            0.0/s
> > >  state-limit                            0            0.0/s
> > >  src-limit                              0            0.0/s
> > >  synproxy                               0            0.0/s
> > >
> > > # pfctl -s memory
> > > states        hard limit    10000
> > > src-nodes     hard limit    10000
> > > frags         hard limit     5000
> > > tables        hard limit     1000
> > > table-entries hard limit   100000
> > > ============================================================
> > >
> > > The above is mainly for em0 (our WAN interface); our LAN interface
> (em1)
> > > was not impacted because we use "set skip on em1".  And it's a good
> > > thing too: we have lots of LAN-based services that this machine
> provides
> > > that would have been impacted.  We also use "set skip on lo0".
> > >
> > > I immediately went to look at our monitoring graphs, which monitor pf
> > > state (specifically state table entries), polled via bsnmpd(8).  This
> > > data is even more frightening:
> > >
> > > http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png
> > > http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png
> > >
> > > Literally something was spiraling out of control, starting at approx.
> > > 2011/05/01 (Sun) at 12:30 PDT.  The situation became dire at approx.
> > > 19:45 PDT the same day, but I wasn't aware of it until said user
> brought
> > > an issue to my attention.
> > >
> > > You can see from the network I/O graphs (taken from SNMP polling our
> > > switch, NOT from the host/box itself) that there was no DoS attack or
> > > anything like that occurring -- this was something within FreeBSD
> > > itself.  More evidence of that will become apparent.
> > >
> > > http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png
> > > http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png
> > >
> > > The first thing I did was "/etc/rc.d/pf reload".  This command hung.
> > > Any attempt to send Ctrl-C/SIGINT did nothing.  I was able to
> > > Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did
> > > not truly die (despite csh stating "Terminated").  The only way to kill
> > > it was to kill -9.
> > >
> > > Attempts to shut down any daemons which utilised the network --
> > > including things that only used em1 -- would not shut down.  This
> > > included things like postfix, mysqld, and some inet-based services.  I
> > > was forced to kill -9 them.  Things like bsnmpd, however, did shut
> down.
> > >
> > > Equally as uncomfortable, "shutdown -r now" did not reboot the system.
> > > That is to say, wall(1)'s announcement was shown, but the actual
> > > stopping of services did not begin.
> > >
> > > The next thing I tried was "/etc/rc.d/pf stop", which worked.  Then I
> > > did "/etc/rc.d/pf start", which also worked.  However, what I saw next
> > > surely indicated a bug in the pf layer somewhere -- "pfctl -s states"
> > > and "pfctl -s info" disagreed on the state count:
> > >
> > > ============================================================
> > > # pfctl -s info
> > > Status: Enabled for 0 days 00:00:16           Debug: Urgent
> > >
> > > Interface Stats for em0               IPv4             IPv6
> > >  Bytes In                            3459                0
> > >  Bytes Out                              0                0
> > >  Packets In
> > >    Passed                               0                0
> > >    Blocked                             29                0
> > >  Packets Out
> > >    Passed                               0                0
> > >    Blocked                              0                0
> > >
> > > State Table                          Total             Rate
> > >  current entries                    10000
> > >  searches                              29            1.8/s
> > >  inserts                                0            0.0/s
> > >  removals                               0            0.0/s
> > > Counters
> > >  match                                 29            1.8/s
> > >  bad-offset                             0            0.0/s
> > >  fragment                               0            0.0/s
> > >  short                                  0            0.0/s
> > >  normalize                              0            0.0/s
> > >  memory                                18            1.1/s
> > >  bad-timestamp                          0            0.0/s
> > >  congestion                             0            0.0/s
> > >  ip-option                              0            0.0/s
> > >  proto-cksum                            0            0.0/s
> > >  state-mismatch                         0            0.0/s
> > >  state-insert                           0            0.0/s
> > >  state-limit                            0            0.0/s
> > >  src-limit                              0            0.0/s
> > >  synproxy                               0            0.0/s
> > >
> > > # pfctl -s state | wc -l
> > >       0
> > > ============================================================
> > >
> > > The "pf uptime" shown above, by the way, matches system uptime.
> > >
> > > I then attempted "pfctl -F state", but nothing changed (looked the same
> > > as above).
> > >
> > > Since I could not reboot the box, I was forced to drop to ddb via
> serial
> > > console.  I did some commands like "ps" and the like, and then "call
> > > doadump" to induce a kernel panic, and then "reboot" (which worked).
> > >
> > > Once the machine came back up, savecore(8) ran, wrote the data out, and
> > > everything has been fine since.  /var/crash/core.txt.0 is ~68KBytes and
> > > I do not feel comfortable sharing its content publicly, but will be
> > > happy to hand it to developer(s) who are interested.  Relevant tidbits
> I
> > > can discern:
> > >
> > >
> ------------------------------------------------------------------------
> > > ps -axl
> > >
> > >  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME
> COMMAND
> > >    0   422     0   0 -16  0     0     0 pftm   DL    ??
>  1362773081:04.00
> > > [pfpurge]
> > >
> ------------------------------------------------------------------------
> > >
> > >
> ------------------------------------------------------------------------
> > > vmstat -z
> > >
> > > ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS
> > >  FAILURES
> > > pfsrctrpl:                152,    10000,        0,        0,        0,
> > >    0
> > > pfrulepl:                 912,        0,       40,       88,      806,
> > >    0
> > > pfstatepl:                392,    10000,    10000,        0,  4440553,
> > > 341638
> > > pfaltqpl:                 240,        0,        0,        0,        0,
> > >    0
> > > pfpooladdrpl:              88,        0,        0,        0,        0,
> > >    0
> > > pfrktable:               1296,     1002,        4,       20,      112,
> > >    0
> > > pfrkentry:                216,   100008,      603,      891,    15384,
> > >    0
> > > pfrkentry2:               216,        0,        0,        0,        0,
> > >    0
> > > pffrent:                   32,     5050,        0,      303,     1620,
> > >    0
> > > pffrag:                    80,        0,        0,      135,      807,
> > >    0
> > > pffrcache:                 80,    10035,        0,        0,        0,
> > >    0
> > > pffrcent:                  24,    50022,        0,        0,        0,
> > >    0
> > > pfstatescrub:              40,        0,        0,        0,        0,
> > >    0
> > > pfiaddrpl:                120,        0,        0,        0,        0,
> > >    0
> > > pfospfen:                 112,        0,      696,       30,    18096,
> > >    0
> > > pfosfp:                    40,        0,      407,       97,    10582,
> > >    0
> > >
> ------------------------------------------------------------------------
> > >
> > > You can see evidence of processes not exiting/doing what they should do
> > > here:
> > >
> > >
> ------------------------------------------------------------------------
> > > fstat
> > >
> > > USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
> > > root     shutdown   91155 root /             2 drwxr-xr-x     512  r
> > > root     shutdown   91155   wd /             2 drwxr-xr-x     512  r
> > > root     shutdown   91155 text /         47195 -r-sr-x---   15912  r
> > > root     shutdown   91155    0 /dev         38 crw-------   ttyu0 rw
> > > root     shutdown   91155    1 /dev         38 crw-------   ttyu0 rw
> > > root     shutdown   91155    2 /dev         38 crw-------   ttyu0 rw
> > > root     sh         91129 root /             2 drwxr-xr-x     512  r
> > > root     sh         91129   wd /             2 drwxr-xr-x     512  r
> > > root     sh         91129 text /            44 -r-xr-xr-x  134848  r
> > > root     sh         91129    0 /dev         38 crw-------   ttyu0 rw
> > > root     sh         91129    1* pipe ffffff01e78fc9e0 <->
> ffffff01e78fc888
> > >      0 rw
> > > root     sh         91129    2 /dev         20 crw-rw-rw-    null  w
> > > root     shutdown   91115 root /             2 drwxr-xr-x     512  r
> > > root     shutdown   91115   wd /storage      5 drwx------      37  r
> > > root     shutdown   91115 text /         47195 -r-sr-x---   15912  r
> > > root     shutdown   91115    0 /dev         38 crw-------   ttyu0 rw
> > > root     shutdown   91115    1 /dev         38 crw-------   ttyu0 rw
> > > root     shutdown   91115    2 /dev         38 crw-------   ttyu0 rw
> > > root     shutdown   91115    3* local dgram ffffff008ff92960
> > > root     sh         90818 root /             2 drwxr-xr-x     512  r
> > > root     sh         90818   wd /         70659 drwxr-xr-x     512  r
> > > root     sh         90818 text /            44 -r-xr-xr-x  134848  r
> > > root     sh         90818    0 /dev         38 crw-------   ttyu0 rw
> > > root     sh         90818    1* pipe ffffff0043f1ecb8 <->
> ffffff0043f1eb60
> > >      0 rw
> > > root     sh         90818    2 /dev         20 crw-rw-rw-    null  w
> > > root     csh        90802 root /             2 drwxr-xr-x     512  r
> > > root     csh        90802   wd /             2 drwxr-xr-x     512  r
> > > root     csh        90802 text /            51 -r-xr-xr-x  358752  r
> > > root     csh        90802   15 /dev         38 crw-------   ttyu0 rw
> > > root     csh        90802   16 /dev         38 crw-------   ttyu0 rw
> > > root     csh        90802   17 /dev         38 crw-------   ttyu0 rw
> > > root     csh        90802   18 /dev         38 crw-------   ttyu0 rw
> > > root     csh        90802   19 /dev         38 crw-------   ttyu0 rw
> > >
> ------------------------------------------------------------------------
> > >
> > > No indication of mbuf exhaustion, putting further focus on the pf
> stack:
> > >
> > >
> ------------------------------------------------------------------------
> > > netstat -m
> > >
> > > 2054/1786/3840 mbufs in use (current/cache/total)
> > > 2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max)
> > > 2048/896 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> > > 0/320/320/12800 4k (page size) jumbo clusters in use
> > > (current/cache/total/max)
> > > 0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
> > > 0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
> > > 4609K/4554K/9164K bytes allocated to network (current/cache/total)
> > > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> > > 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> > > 0 requests for sfbufs denied
> > > 0 requests for sfbufs delayed
> > > 0 requests for I/O initiated by sendfile
> > > 0 calls to protocol drain routines
> > >
> ------------------------------------------------------------------------
> > >
> > > Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> > > column.  I have a very hard time believing that was the interrupt rate
> > > of all the relevant devices at the time (way too high).  Maybe this
> data
> > > becomes wrong only during a coredump?  The total column I could
> believe.
> > >
> > >
> ------------------------------------------------------------------------
> > > vmstat -i
> > >
> > > interrupt                          total       rate
> > > irq4: uart0                        54768        912
> > > irq6: fdc0                             1          0
> > > irq17: uhci1+                        172          2
> > > irq23: uhci3 ehci1+                 2367         39
> > > cpu0: timer                  13183882632  219731377
> > > irq256: em0                    260491055    4341517
> > > irq257: em1                    127555036    2125917
> > > irq258: ahci0                  225923164    3765386
> > > cpu2: timer                  13183881837  219731363
> > > cpu1: timer                  13002196469  216703274
> > > cpu3: timer                  13183881783  219731363
> > > Total                        53167869284  886131154
> > >
> ------------------------------------------------------------------------
> > >
> > > Here's what a normal "vmstat -i" shows from the command-line:
> > >
> > > # vmstat -i
> > > interrupt                          total       rate
> > > irq4: uart0                          518          0
> > > irq6: fdc0                             1          0
> > > irq23: uhci3 ehci1+                  145          0
> > > cpu0: timer                     19041199       1999
> > > irq256: em0                       614280         64
> > > irq257: em1                       168529         17
> > > irq258: ahci0                     355536         37
> > > cpu2: timer                     19040462       1999
> > > cpu1: timer                     19040458       1999
> > > cpu3: timer                     19040454       1999
> > > Total                           77301582       8119
> > >
> > > We graph many aspects of this box, including CPU load, memory/swap
> > > usage, etc. and none show any sign that the interrupt rate on all of
> > > those devices was even remotely out of control.  (I would expect to see
> > > CPU through the roof given the above data)
> > >
> > > I have since rebuilt/reinstalled world/kernel on the machine with the
> > > latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May  2 14:44:18 PDT
> > > 2011), hoping whatever this was may have been fixed.
> > >
> > > As for what I think may have triggered it, but I have no hard evidence
> > > of such: on April 29th, I changed our pf.conf and did "/etc/rc.d/pf
> > > reload".  The pf.conf change was a single line:
> > >
> > > Old: scrub on em0 all
> > > New: scrub in on em0 all
> > >
> > > Why it took the problem approximately 3 days to start is unknown.  It's
> > > the only change we've made to the system (truly/honestly), and it was a
> > > change to pf.conf.
> > >
> > > If anyone has advice (or has seen the above problem), or is interested
> > > in debugging it -- as I said, I have a vmcore -- I'm happy to assist in
> > > any way I can.  I would hate for someone else to get bit by this, and
> > > really am hoping its something that has been fixed between February and
> > > now.
> > >
> > >
> > I'm seeing this as well.  You could change your scrub rules so that you
> > specifically avoid TCP reassembly (that creates states).
>
> Thank you very much.  This helps -- and I'm also glad someone else has
> seen this behaviour.  That confirms it's not specific to my equipment,
> which is good.
>
> Regarding scrubbing and TCP reassembly (option "reassemble tcp"
> according to pf.conf): I wasn't under the impression this option was
> enabled by default.  This got me wondering what the defaults actually
> are (pf.conf(5) is somewhat vague in this regard, but it does state
> that "fragment reassemble" is enabled by default).  The rule I use:
>
>  scrub in on em0 all
>
> Appears to get evaluated into this (per "pfctl -s rules -v"):
>
>  scrub in on em0 all fragment reassemble
>

Ah, ok, my bad then. I used to have "reassemble tcp" enabled and disabled it
this morning. However, if your symptoms are there without it, that's bad
news.
I disabled tcp reassembly to make sure there are no other states than the
ones I allow via explicit keep/modulate/synproxy state rules. Disabling
scrubbing altogether seems like a good next step.

> Did you mean to tell me to disable the "fragment reassemble" option
> (which is different from "reassemble tcp")?  If so, how do I do that?
> It looks like I could use a "no scrub" rule, except I can't find any
> examples of this on the web or in the docs.  Or is it just better to
> remove use of scrub entirely until whatever this is gets fixed?
>
> --
> | Jeremy Chadwick                                   jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.               PGP 4BD6C0CB |
>
>

-- 
Good, fast & cheap. Pick any two.