Re: 12.2 Splay Tree ipfw potential panic source

From: Karl Denninger <karl_at_denninger.net>
Date: Sat, 10 Jul 2021 02:41:00 UTC
On 7/9/2021 18:06, Karl Denninger wrote:
> On 7/9/2021 16:17, Ryan Stone wrote:
>> On Thu, Jul 8, 2021 at 8:54 PM Karl Denninger <karl@denninger.net> 
>> wrote:
>>> I will see if I can get at least a panic backtrace, although the
>>> impacted box is a pcEngines firewall that boots off an SD card.
>> Have you checked whether netdump supports your NICs?  You should be
>> able to get a full vmcore off if so.
>
> Yes; the box in question is in heavy production and I will not be able 
> to get an isolated period of time to pull a core (assuming the remote 
> dump works) until sometime this weekend.
>
> Will advise once I (hopefully) have it.
>
Ok, so I have good news and bad news.

I have the trap and it is definitely in libalias which appears to come 
about as a result of a NAT translation attempt.

Fatal trap 18: integer divide fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff8275b7cc
stack pointer           = 0x28:0xfffffe0017b6b310
frame pointer           = 0x28:0xfffffe0017b6b320
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_1)
trap number             = 18
panic: integer divide fault
cpuid = 1
time = 1625883072
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0017b6b020
vpanic() at vpanic+0x17b/frame 0xfffffe0017b6b070
panic() at panic+0x43/frame 0xfffffe0017b6b0d0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe0017b6b130
trap() at trap+0x67/frame 0xfffffe0017b6b240
calltrap() at calltrap+0x8/frame 0xfffffe0017b6b240
--- trap 0x12, rip = 0xffffffff8275b7cc, rsp = 0xfffffe0017b6b310, rbp = 0xfffffe0017b6b320 ---
HouseKeeping() at HouseKeeping+0x1c/frame 0xfffffe0017b6b320
LibAliasInLocked() at LibAliasInLocked+0x2f/frame 0xfffffe0017b6b3e0
LibAliasIn() at LibAliasIn+0x46/frame 0xfffffe0017b6b410
ipfw_nat() at ipfw_nat+0x234/frame 0xfffffe0017b6b460
ipfw_chk() at ipfw_chk+0x1350/frame 0xfffffe0017b6b670
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe0017b6b760
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe0017b6b7f0
ip_input() at ip_input+0x427/frame 0xfffffe0017b6b8a0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0017b6b8f0
ether_demux() at ether_demux+0x138/frame 0xfffffe0017b6b920
ether_nh_input() at ether_nh_input+0x33b/frame 0xfffffe0017b6b980
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe0017b6b9d0
ether_input() at ether_input+0x4b/frame 0xfffffe0017b6ba00
iflib_rxeof() at iflib_rxeof+0xad6/frame 0xfffffe0017b6bae0
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0017b6bb20
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe0017b6bb80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe0017b6bbb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0017b6bbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0017b6bbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 7m23s
netdump: overwriting mbuf zone pointers
netdump in progress. searching for server...
netdumping to 192.168.10.100 (ac:1f:6b:ad:d8:cb)
Dumping 190 out of 1882 MB:. . . . . . . . . . . . .
** DUMP FAILED (ERROR 60) **

Now the bad news -- as you can see, the attempted remote dump fails, 
possibly because the network stack is hosed at that point. I get a 
69632-byte file (exactly, and repeatably) on the remote machine where 
the dump is set to go; it looks like the first piece is indeed received, 
but nothing more, and then the panicked unit reboots.

On the server (remote) end I have this in the "info" file:

Dump from IpGw [192.168.10.200]
Dump incomplete: client timed out

So it looks like the server received the first part and replied, but the 
crashed box never sent anything else.

-rw-------   1 root  wheel      2 Jul  9 22:11 bounds.IpGw
-rw-------   1 root  wheel     66 Jul  9 22:10 info.IpGw.0
-rw-------   1 root  wheel      0 Jul  9 22:11 info.IpGw.1
-rw-------   1 root  wheel  69632 Jul  9 22:00 vmcore.IpGw.0
-rw-------   1 root  wheel  69632 Jul  9 22:11 vmcore.IpGw.1

Without a complete core I can't give you a good traceback.  I may be 
able to attach a local dump device to this unit sometime over the 
weekend -- not sure yet, as it is in production use.

This is an extremely reliable panic -- uptime is only a few minutes 
before it blows up.

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/