RELENG_8 pf stack issue (state count spiraling out of control)

Jeremy Chadwick freebsd at jdc.parodius.com
Tue May 3 07:32:12 UTC 2011


On Tue, May 03, 2011 at 09:00:42AM +0200, Daniel Hartmeier wrote:
> I read those graphs differently: the problem doesn't arise slowly,
> but rather seems to start suddenly at 13:00.
> 
> Right after 13:00, traffic on em0 drops, i.e. the firewall seems
> to stop forwarding packets completely.
> 
> Yet, at the same time, the states start to increase, almost linearly
> at about one state every two seconds, until the limit of 10,000 is
> reached. Reaching the limit seems to be only a side-effect of a
> problem that started at 13:00.
> 
> > Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> > column.  I have a very hard time believing that was the interrupt rate
> > of all the relevant devices at the time (way too high).  Maybe this data
> > becomes wrong only during a coredump?  The total column I could believe.
> > 
> > ------------------------------------------------------------------------
> > vmstat -i
> > 
> > interrupt                          total       rate
> > irq4: uart0                        54768        912
> > irq6: fdc0                             1          0
> > irq17: uhci1+                        172          2
> > irq23: uhci3 ehci1+                 2367         39
> > cpu0: timer                  13183882632  219731377
> > irq256: em0                    260491055    4341517
> > irq257: em1                    127555036    2125917
> > irq258: ahci0                  225923164    3765386
> > cpu2: timer                  13183881837  219731363
> > cpu1: timer                  13002196469  216703274
> > cpu3: timer                  13183881783  219731363
> > Total                        53167869284  886131154
> > ------------------------------------------------------------------------
> 
> I find this suspect as well, but I don't have an explanation yet.
> 
> Are you using anything non-GENERIC related to timers, like changing
> HZ or enabling polling?

HZ is standard (1000 is the default I believe), and I do not use
polling.
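
(For reference, here's a quick way to double-check both of those -- just
a sketch, assuming a stock RELENG_8 kernel and that em0 is the interface
in question:)

    # kernel tick rate; GENERIC should report 1000
    sysctl kern.hz
    # sanity-check the running clock rates as well
    sysctl kern.clockrate
    # if DEVICE_POLLING were compiled in and enabled on the NIC,
    # POLLING would appear in the interface options line
    ifconfig em0 | grep -i options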

> Are you sure the problem didn't start right at 13:00, and cause complete
> packet loss for the entire period, and that it grew gradually worse
> instead?

It's hard to discern from the graphs, but I can tell you exactly what I
saw TCP-wise, since I did have some already-established TCP connections
to the box (i.e. connections which already had ESTABLISHED states
according to pfctl -s state) when it began exhibiting issues.

Any traffic which already had state entries in pf's state table
continued to flow, bidirectionally.  New inbound connections to the box
via em0 got no response and timed out (and, as indicated by pfctl, such
packets were being dropped because the state limit had been reached).
Outbound connections from the box via em0 to the outside world also got
no response and timed out.  I will show you evidence of the latter.
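
(For anyone wanting to check the same thing, the state count versus the
configured limit can be eyeballed with something like the following --
field names may differ slightly between pf versions:)

    # number of entries currently in the state table
    pfctl -s info | grep 'current entries'
    # configured hard limit on states (the 10,000 mentioned above)
    pfctl -s memory | grep states
    # count the states directly, for comparison
    pfctl -s state | wc -l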

The first indication of a problem in syslog is the following message
from named -- this is the first time in my entire life I've ever seen
this message, but it seems to indicate some kind of internal watchdog
fired within named itself.  The log I'm looking at, by the way, is
/var/log/all.log -- yes, I do turn that on (for reasons exactly like
this).  This box is a public secondary nameserver, so keep that in
mind too.  Anyway:

May  1 12:50:14 isis named[728]: *** POKED TIMER ***

Seconds later, I see unexpected RCODE messages, lame server messages,
etc. -- all of which indicate packets are flowing to some degree ("the
usual" badly-configured nameservers on the Internet).

A few minutes later:

May  1 12:53:15 isis named[728]: *** POKED TIMER ***
May  1 12:53:54 isis named[728]: *** POKED TIMER ***

With more of the usual unexpected RCODE/SERVFAIL messages after that.
The next message:

May  1 13:28:55 isis named[728]: *** POKED TIMER ***
May  1 13:29:13 isis named[728]: *** POKED TIMER ***
May  1 13:30:11 isis last message repeated 3 times

Then more RCODE/SERVFAIL and something called "FORMERR" but that could
be normal as well.  Remember, all from named.

This "cycle" of behaviour continued, with the number of POKED TIMER
messages gradually increasing more and more as time went on.  By 16:07
on May 1st, these messages were arriving usually in "bursts" of 5 or 6.

Things finally "exploded", from named's perspective, here (with slaved
zones X'd out):

May  1 19:23:21 isis named[728]: *** POKED TIMER ***
May  1 19:28:59 isis named[728]: zone XXXXXXXX/IN: refresh: failure trying master x.x.x.x#53 (source x.x.x.x#0): operation canceled
May  1 19:35:32 isis named[728]: host unreachable resolving 'dns2.djaweb.dz/A/IN': 213.179.160.66#53
May  1 19:35:32 isis named[728]: host unreachable resolving 'dns2.djaweb.dz/A/IN': 193.0.12.4#53
May  1 19:35:32 isis named[728]: host unreachable resolving 'dns2.djaweb.dz/A/IN': 193.194.64.242#53
May  1 19:35:32 isis named[728]: host unreachable resolving 'dns2.djaweb.dz/A/IN': 192.134.0.49#53

Many other slaved zones reported the exact same error.  The hostnames
which couldn't be resolved have no relevance -- honestly there is no
consistency in any of them, and given what our service does, the
randomness is expected.  (Read: I just picked consecutive log lines
above; don't focus on dns2.djaweb.dz like it's somehow responsible for
this.)

To me, this indicates a serious IP stack, firewall, or network
exhaustion situation.  The "host unreachable" messages became more and
more regular, along with the occasional POKED TIMER.  So what I'm saying
is that the "most chatty" of all the daemons on the machine during this
time was named, which does not surprise me given the machine's role.

I finally logged in on May 2nd at 13:41 and tried to put an end to the
madness.

Other daemons on the machine which do their own (non-syslog) logging
show no signs of issues, I/O errors (network or otherwise), etc..
Again, this doesn't surprise me, because nobody with existing
connections to the machine was complaining.  Only when I was told by a
user that *new* connections were failing did I begin to investigate.

So, it seems likely that network I/O mostly stopped because the IP
stack stopped handling new I/O due to whatever was going on with pf,
while at the same time pf's state counter started gradually
incrementing for reasons unknown -- an indicator that something bad was
happening, almost certainly within pf itself or somewhere within the
kernel.  I'm inclined to blame pf, because existing/ESTABLISHED
stateful entries continued to be processed okay.
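
(If this happens again, a trivial loop like the one below -- just a
sketch -- would log the state count over time directly, rather than
having to infer the growth rate from the graphs:)

    # log the pf state count once a minute with a timestamp
    while :; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$(pfctl -s info | awk '/current entries/ {print $3}')"
        sleep 60
    done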

Therefore, to me, the graphs indicate that the problem (and its impact)
started at around 12:50 PDT, and that something very bad continued to
happen within the pf stack until I managed to intervene.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |


