IPsec crash in TCP, also NFS DRC patches (was: Re: Limits on jumbo mbuf cluster allocation)

Garrett Wollman wollman at freebsd.org
Sat Mar 30 02:16:42 UTC 2013


<<On Tue, 12 Mar 2013 23:48:00 -0400 (EDT), Rick Macklem <rmacklem at uoguelph.ca> said:

> The patch includes a lot of drc2.patch and drc3.patch, so don't try
> and apply it to a patched kernel. Hopefully it will apply cleanly to
> vanilla sources.

> The patch has been minimally tested.

Well, it's taken a long time, but I was finally able to get some
testing.  The user whose OpenStack cluster jobs had eaten previous
file servers killed this one, too, but not in a way that's
attributable to the NFS code.  He was able to put on a fairly heavy
load from about 630 virtual machines in our cluster without the server
even getting particularly busy.  Another cluster job, however,
repeatedly panicked the server.  Thankfully, there's a backtrace:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present

instruction pointer     = 0x20:0xffffffff8074ee11
stack pointer           = 0x28:0xffffff9a469ee6d0
frame pointer           = 0x28:0xffffff9a469ee710
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq260: ix0:que 3)
trap number             = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x1ce
trap_fatal() at trap_fatal+0x290
trap_pfault() at trap_pfault+0x21c
trap() at trap+0x365
calltrap() at calltrap+0x8
--- trap 0xc, rip = 0xffffffff8074ee11, rsp = 0xffffff9a469ee6d0, rbp = 0xffffff9a469ee710 ---
ipsec_getpolicybysock() at ipsec_getpolicybysock+0x31
ipsec46_in_reject() at ipsec46_in_reject+0x24
ipsec4_in_reject() at ipsec4_in_reject+0x9
tcp_input() at tcp_input+0x498
ip_input() at ip_input+0x1de
netisr_dispatch_src() at netisr_dispatch_src+0x20b
ether_demux() at ether_demux+0x14d
ether_nh_input() at ether_nh_input+0x1f4
netisr_dispatch_src() at netisr_dispatch_src+0x20b
ixgbe_rxeof() at ixgbe_rxeof+0x1cb
ixgbe_msix_que() at ixgbe_msix_que+0xa8
intr_event_execute_handlers() at intr_event_execute_handlers+0x104
ithread_loop() at ithread_loop+0xa6
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff9a469eecf0, rbp = 0 ---

ipsec_setspidx_inpcb() is inlined here; the fault is on the line:

        error = ipsec_setspidx(m, &inp->inp_sp->sp_in->spidx, 1);

where inp->inp_sp is being dereferenced:

0xffffffff8074ee02 <ipsec_getpolicybysock+34>:  mov    0xf0(%rdx),%rax
0xffffffff8074ee09 <ipsec_getpolicybysock+41>:  mov    $0x1,%edx
0xffffffff8074ee0e <ipsec_getpolicybysock+46>:  mov    %rcx,%r15
0xffffffff8074ee11 <ipsec_getpolicybysock+49>:  mov    (%rax),%rsi <-- FAULT!
0xffffffff8074ee14 <ipsec_getpolicybysock+52>:  add    $0x34,%rsi
0xffffffff8074ee18 <ipsec_getpolicybysock+56>:  callq  0xffffffff8074e6f0 <ipsec_setspidx>

(inp is in %rdx here).  The crash occurs when the clients are making
about 200 connections per second.  (We're not sure if this is by
design or if it's a broken NAT implementation on the OpenStack nodes.
My money is on a broken NAT, because we were also seeing lots of data
being sent on apparently-closed connections.  The kernel was also
logging many [ECONNABORTED] errors when nfsd tried to accept() new
client connections.  A capture is available if anyone wants to look at
this in more detail, although obviously not from the traffic that
actually caused the crash.)
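
For anyone unfamiliar with that failure mode: ECONNABORTED from
accept() refers to a single nascent connection that was torn down
while it sat on the listen queue, not to the listening socket itself,
so a server normally just discards it and keeps accepting.  A minimal
user-space sketch of that pattern (illustrative only, not the nfsd
code; the function and parameter names here are made up):

        #include <sys/socket.h>
        #include <errno.h>
        #include <unistd.h>

        /* Accept connections forever on the listening socket "lsock". */
        static int
        serve(int lsock)
        {
                for (;;) {
                        int s = accept(lsock, NULL, NULL);
                        if (s == -1) {
                                /* One aborted/interrupted attempt; retry. */
                                if (errno == ECONNABORTED || errno == EINTR)
                                        continue;
                                return (-1);    /* real error on the listener */
                        }
                        /* ... hand the connection off to a worker ... */
                        close(s);
                }
        }

With clients churning connections at roughly 200/second behind a NAT,
a steady stream of those errors is not surprising.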

inp_sp is declared thus:
        struct  inpcbpolicy *inp_sp;    /* (s) for IPSEC */

The locking code "(s)" is supposed to indicate "protected by another
subsystem's locks", but there is no locking at all in
ipsec_getpolicybysock(), so that seems to be a misstatement at best.
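
To make the failure concrete, here is a simplified sketch of the shape
of the faulting path.  This is not the actual sys/netipsec source: the
structures are stubbed down to just the members named above, and the
guard is only an illustration of the kind of check (or locking) that
is missing, not a proposed fix:

        #include <errno.h>
        #include <stddef.h>

        /* Stubbed-down placeholder types, not the real definitions. */
        struct mbuf;
        struct secpolicyindex { int stub; };
        struct secpolicy { struct secpolicyindex spidx; };
        struct inpcbpolicy { struct secpolicy *sp_in; };
        struct inpcb { struct inpcbpolicy *inp_sp; /* (s) for IPSEC */ };

        int ipsec_setspidx(struct mbuf *, struct secpolicyindex *, int);

        static int
        setspidx_inpcb_sketch(struct mbuf *m, struct inpcb *inp)
        {
                /*
                 * Hypothetical guard: without something like this (and
                 * without whatever locking the "(s)" annotation was meant
                 * to promise), a PCB whose inp_sp is NULL, or is being
                 * torn down under heavy connection churn, faults exactly
                 * as in the trap above.
                 */
                if (inp->inp_sp == NULL || inp->inp_sp->sp_in == NULL)
                        return (EINVAL);

                /* The unguarded dereference corresponding to the faulting line. */
                return (ipsec_setspidx(m, &inp->inp_sp->sp_in->spidx, 1));
        }

Even with a guard like that, a racing teardown could still free inp_sp
between the check and the use, which is presumably what the "(s)"
annotation was supposed to rule out.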

I quickly installed a new kernel with IPSEC disabled (we have no need
for it), and it has survived further abuse so far.  The user's test
jobs were winding down around the same time, though, so I can't say
for certain that it wouldn't have crashed somewhere else.
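
(In case anyone wants to do the same: the usual way is just to rebuild
from a kernel config that doesn't enable IPsec; assuming a custom
config named MYKERNEL -- the name here is made up -- roughly:

        # comment out or remove the IPsec option in the config file
        #options        IPSEC

        # then, from /usr/src:
        make buildkernel KERNCONF=MYKERNEL
        make installkernel KERNCONF=MYKERNEL

and reboot.)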

-GAWollman


