[Bug 251840] panic in iflib_netdump_poll -> _iflib_fl_refill

Mon Dec 14 17:53:17 UTC 2020

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=251840

            Bug ID: 251840
           Summary: panic in iflib_netdump_poll -> _iflib_fl_refill
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: mammoottym at yahoo.com

When there is heavy network traffic on a system, most of the CPUs can be
processing rx packets in the interrupt task. If the node then crashes or we
break into the debugger, most of the CPUs will be stopped while running
_task_fn_rx. Running netdump in this state makes it go through the same
queues that were only partially processed in _task_fn_rx. This can cause
multiple issues, as explained below.

I was able to reproduce some of the issues easily by panicking the node
while running multiple iperf threads.

===========================================================

1960 _iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count)
1961 {
....
1972 sd_m = fl->ifl_sds.ifsd_m;
1973 sd_map = fl->ifl_sds.ifsd_map;
....
1976 pidx = fl->ifl_pidx;
1977 idx = pidx;
1978 frag_idx = fl->ifl_fragidx;
1979 credits = fl->ifl_credits;
....
1982 n = count;
....
1997 while (n--) {
....
2007 bit_ffc(fl->ifl_rx_bitmap, fl->ifl_size, &frag_idx);
2008 MPASS(frag_idx >= 0);
2009 if ((cl = sd_cl[frag_idx]) == NULL) {
2010 cl = m_cljget(NULL, M_NOWAIT, fl->ifl_buf_size);
....
2015 MPASS(sd_map != NULL);
....
2025 sd_cl[frag_idx] = cl;
....
2031 }
....
2035 MPASS(sd_m[frag_idx] == NULL);
2036 m = m_gethdr(M_NOWAIT, MT_NOINIT);
....
2039 sd_m[frag_idx] = m;
2040 bit_set(fl->ifl_rx_bitmap, frag_idx);
....
2049 credits++;
2051 idx++;
....
2060 if (n == 0 || i == IFLIB_MAX_RX_REFRESH) {
....
2064 fl->ifl_pidx = idx;
2065 fl->ifl_credits = credits;
2066 }
2067 }
===============================================================

The above function, _iflib_fl_refill(), is called to refill the rxq free
buffer list with new packet buffers. The number of new buffers to fill is
passed in as the input parameter count. Callers make sure that count does not
exceed the queue's capacity by checking
'count < (fl->if_use - fl->ifl_credits)'. A bitmap indicates which entries are
free to allocate.

As shown in the code snippet above, the allocation is done in a while loop,
lines 1997-2067. After finding an available entry in the bitmap, a buffer is
allocated for it and the bit in the map is set to indicate that the entry is
no longer available to fill. This is repeated n times. Once all the required
buffers have been allocated, fl->ifl_credits is bumped by the number of
buffers allocated, at line 2065.

Suppose one of the CPUs is running the above loop. It has done some
allocations and set the bits in the bitmap to indicate that those entries are
no longer available, but it has not yet completed the loop and updated the
ifl_credits field. Now suppose the node crashes, stops that CPU, and runs
netdump. Netdump will poll for packets and, in the Rx context, will check the
rxq and conclude from the stale fl->ifl_credits field that there are free
buffer list entries to fill. While looking for available bits in the bitmap,
it might find that there are none left, resulting in the assertion failure
MPASS(frag_idx >= 0) at line 2008. We may be able to work around this
particular issue by modifying the code, but that might impact performance
during normal reception. This was the assertion failure found by the cert
team when they opened this bug.
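
The first scenario can be reproduced in a user-space toy model. This is only
a sketch: struct toy_fl, toy_ffc(), toy_refill(), toy_room() and FL_SIZE are
hypothetical names made up for illustration, not iflib code, and the model
writes credits back only once at the end of the loop, which is simpler than
the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FL_SIZE 8	/* hypothetical ring size, for illustration only */

/* Toy stand-in for the free list; not the real iflib structure. */
struct toy_fl {
	bool bitmap[FL_SIZE];	/* true = slot already holds a buffer */
	int  credits;		/* slots the list believes are filled */
};

/* Return the first clear slot, or -1 if every slot is taken. */
static int
toy_ffc(const struct toy_fl *fl)
{
	for (int i = 0; i < FL_SIZE; i++)
		if (!fl->bitmap[i])
			return (i);
	return (-1);
}

/*
 * Fill up to count slots.  If stop_after >= 0, return after that many
 * iterations WITHOUT publishing credits, modelling a CPU stopped by NMI.
 * Returns 0 normally, -1 where MPASS(frag_idx >= 0) would have fired.
 */
static int
toy_refill(struct toy_fl *fl, int count, int stop_after)
{
	int credits = fl->credits;	/* local copy, as in the real loop */

	assert(count >= 0 && count <= FL_SIZE);
	for (int done = 0; done < count; done++) {
		if (done == stop_after)
			return (0);	/* stopped: credits never written back */
		int idx = toy_ffc(fl);
		if (idx < 0)
			return (-1);	/* bitmap full, credits said otherwise */
		fl->bitmap[idx] = true;
		credits++;
	}
	fl->credits = credits;
	return (0);
}

/* Room a caller would compute by looking at credits alone. */
static int
toy_room(const struct toy_fl *fl)
{
	return (FL_SIZE - fl->credits);
}
```

Starting from a zeroed struct toy_fl, toy_refill(&fl, FL_SIZE, 6) marks six
slots and returns with credits still 0. A second caller then sees
toy_room(&fl) == 8, asks for eight buffers with toy_refill(&fl, 8, -1), and
gets -1 on the third iteration because only two bits were still clear.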
-----------------------------------------------------------------------------

#0 wbinvd () at ./machine/cpufunc.h:417
#1 cpustop_handler () at /b/mnt/src/sys/x86/x86/mp_x86.c:1424
#2 0xffffffff80a42214 in ipi_nmi_handler () at
/b/mnt/src/sys/x86/x86/mp_x86.c:1363
#3 0xffffffff809cc693 in trap (frame=0xfffffe9d4edfaf30)
at /b/mnt/src/sys/amd64/amd64/trap.c:210
#4 nmi_calltrap () at /b/mnt/src/sys/amd64/amd64/exception.S:792
#5 item_ctor (zone=0xfffffea265fab600, uz_flags=<optimized out>,
size=<optimized out>,
udata=<optimized out>, flags=1, item=0xfffff80057b18800)
at /b/mnt/src/sys/vm/uma_core.c:3269
#6 0xffffffff8095a68d in cache_alloc_item (zone=<optimized out>,
cache=<optimized out>,
udata=<optimized out>, flags=<optimized out>, bucket=<optimized out>)
at /b/mnt/src/sys/vm/uma_core.c:3395
#7 uma_zalloc_arg (zone=0xfffffea265fab600, udata=0x0, flags=1)
at /b/mnt/src/sys/vm/uma_core.c:3491
#8 0xffffffff80610da0 in m_cljget (m=0x0, how=1, size=2048)
at /b/mnt/src/sys/kern/kern_mbuf.c:1020
#9 0xffffffff80763111 in _iflib_fl_refill (ctx=0xfffff802f1c56c00,
fl=0xfffff802f1c56400,
count=<optimized out>) at /b/mnt/src/sys/net/iflib.c:2010
#10 0xffffffff8076296b in __iflib_fl_refill_lt (ctx=<optimized out>, max=24,
fl=<optimized out>)
at /b/mnt/src/sys/net/iflib.c:2108
#11 iflib_rxeof (rxq=<optimized out>, budget=24) at
/b/mnt/src/sys/net/iflib.c:2802
#12 0xffffffff8075e5b9 in _task_fn_rx (context=0xfffffea2665bffa0)
at /b/mnt/src/sys/net/iflib.c:3778
...

#0 kdb_enter (why=0xffffffff80b7e201 "panic", msg=<optimized out>)
at /b/mnt/src/sys/kern/subr_kdb.c:483
#1 0xffffffff80634c27 in panic_finish () at
/b/mnt/src/sys/kern/kern_shutdown.c:1154
#2 0xffffffff806345be in panic (fmt=<optimized out>) at
/b/mnt/src/sys/kern/kern_shutdown.c:947
#3 0xffffffff807634d3 in _iflib_fl_refill (ctx=0xfffff802f1c56c00,
fl=0xfffff802f1c56400,
count=<optimized out>) at /b/mnt/src/sys/net/iflib.c:2008
#4 0xffffffff8076285b in __iflib_fl_refill_lt (ctx=<optimized out>, max=24,
fl=<optimized out>)
at /b/mnt/src/sys/net/iflib.c:2108
#5 iflib_rxeof (rxq=0xfffffea2665bffa0, budget=24) at
/b/mnt/src/sys/net/iflib.c:2741
#6 0xffffffff80761ff4 in iflib_netdump_poll (ifp=<optimized out>,
count=<optimized out>)
at /b/mnt/src/sys/net/iflib.c:6591

----------------------------------------------------------------------

A second scenario that can happen: suppose one of the CPUs is running the
above loop and has done the allocation and assignment, sd_m[frag_idx] = m, at
line 2039, but has not yet set the bit at line 2040. Now suppose the node
crashes, stops that CPU, and runs netdump. Netdump, in the Rx poll path, will
come to this function, see that this bit is still available in the bitmap,
and check the assertion MPASS(sd_m[frag_idx] == NULL) at line 2035 before
doing the memory allocation. This assertion will fail because the allocation
and assignment were done before the CPU got the NMI.
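
The window between the store to sd_m[] and the bit_set() can also be shown in
a toy model. Again this is only a sketch under assumptions: struct toy_fl,
toy_fill_one() and the enum values below are hypothetical names for
illustration, and a plain bool array stands in for the real bitstring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FL_SIZE 4	/* hypothetical ring size, for illustration only */

/* Toy stand-in for the free list; not the real iflib structure. */
struct toy_fl {
	bool  bitmap[FL_SIZE];	/* true = slot marked as filled */
	void *sd_m[FL_SIZE];	/* per-slot mbuf pointer, NULL when empty */
};

enum toy_stop { RUN_TO_END, STOP_BEFORE_BIT_SET };

enum toy_result {
	TOY_OK = 0,
	TOY_NO_SLOT = -1,	/* where MPASS(frag_idx >= 0) would fire */
	TOY_SLOT_BUSY = -2	/* where MPASS(sd_m[frag_idx] == NULL) would fire */
};

/*
 * Fill one slot with m.  With STOP_BEFORE_BIT_SET the function returns
 * after storing the pointer but before marking the bitmap, modelling a CPU
 * that took the NMI between lines 2039 and 2040.
 */
static enum toy_result
toy_fill_one(struct toy_fl *fl, void *m, enum toy_stop stop)
{
	int idx = -1;

	assert(m != NULL);
	for (int i = 0; i < FL_SIZE; i++) {
		if (!fl->bitmap[i]) {
			idx = i;
			break;
		}
	}
	if (idx < 0)
		return (TOY_NO_SLOT);
	if (fl->sd_m[idx] != NULL)
		return (TOY_SLOT_BUSY);	/* bit clear, but pointer already set */
	fl->sd_m[idx] = m;
	if (stop == STOP_BEFORE_BIT_SET)
		return (TOY_OK);	/* stopped: bitmap never updated */
	fl->bitmap[idx] = true;
	return (TOY_OK);
}
```

With a zeroed struct toy_fl, a first call using STOP_BEFORE_BIT_SET leaves
sd_m[0] set and bitmap[0] clear. A second call with RUN_TO_END picks slot 0
again and returns TOY_SLOT_BUSY, which corresponds to the panic at line 2035.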

----------------------------------------------------------------------

[Switching to thread 730 (Thread 100048)]
#0 wbinvd () at ./machine/cpufunc.h:417
417 in ./machine/cpufunc.h
(gdb) bt
#0 wbinvd () at ./machine/cpufunc.h:417
#1 cpustop_handler () at /b/mnt/src/sys/x86/x86/mp_x86.c:1424
#2 0xffffffff80a42214 in ipi_nmi_handler () at
/b/mnt/src/sys/x86/x86/mp_x86.c:1363
#3 0xffffffff809cc693 in trap (frame=0xffffffff813e6920 <nmi0_stack+3888>)
 at /b/mnt/src/sys/amd64/amd64/trap.c:210
#4 nmi_calltrap () at /b/mnt/src/sys/amd64/amd64/exception.S:792
#5 0xffffffff80763370 in _bit_mask (_bit=<optimized out>) at
/b/mnt/src/sys/sys/bitstring.h:104
#6 bit_set (_bitstr=0xfffff802f1d23e00, _bit=<optimized out>) at
/b/mnt/src/sys/sys/bitstring.h:148
#7 _iflib_fl_refill (ctx=0xfffff802f1cf6800, fl=0xfffff802f1cf6400,
count=<optimized out>)
 at /b/mnt/src/sys/net/iflib.c:2040
#8 0xffffffff80762a2b in __iflib_fl_refill_lt (ctx=<optimized out>, max=24,
fl=<optimized out>)
 at /b/mnt/src/sys/net/iflib.c:2108
#9 iflib_rxeof (rxq=<optimized out>, budget=24) at
/b/mnt/src/sys/net/iflib.c:2802
#10 0xffffffff8075e679 in _task_fn_rx (context=0xfffffe9deef3c9c0) at
/b/mnt/src/sys/net/iflib.c:3778
....

[Switching to thread 1 (Thread 100659)]
#0 kdb_enter (why=0xffffffff80b7e20f "panic", msg=<optimized out>) at
/b/mnt/src/sys/kern/subr_kdb.c:483
483 kdb_why = KDB_WHY_UNSET;
(gdb) bt
#0 kdb_enter (why=0xffffffff80b7e20f "panic", msg=<optimized out>) at
/b/mnt/src/sys/kern/subr_kdb.c:483
#1 0xffffffff80634c67 in panic_finish () at
/b/mnt/src/sys/kern/kern_shutdown.c:1154
#2 0xffffffff806345fe in panic (fmt=<optimized out>) at
/b/mnt/src/sys/kern/kern_shutdown.c:947
#3 0xffffffff807635ba in _iflib_fl_refill (ctx=0xfffff802f1cf6800,
fl=0xfffff802f1cf6400,
count=<optimized out>) at /b/mnt/src/sys/net/iflib.c:2035
#4 0xffffffff80762a2b in __iflib_fl_refill_lt (ctx=<optimized out>, max=24,
fl=<optimized out>)
 at /b/mnt/src/sys/net/iflib.c:2108
#5 iflib_rxeof (rxq=<optimized out>, budget=24) at
/b/mnt/src/sys/net/iflib.c:2802
#6 0xffffffff807620b4 in iflib_netdump_poll (ifp=<optimized out>,
count=<optimized out>)
 at /b/mnt/src/sys/net/iflib.c:6594

------------------------------------------------------------------------------

In order to avoid multithreading problems in the shutdown path, we stop the
other CPUs and allow only one CPU to run netdump. We also disable interrupts
on the single running CPU. That's why netdump works by polling the rx queues
for received packets.

Since the same network stack is used by netdump to communicate with the netdump
server, netdump expects the stack to be in a sane state for netdump to perform
tx/rx.

After a panic we can't trust the system 100%; netdump runs as best effort.
Unfortunately, it is possible to hit these sorts of issues, given that the
panic can happen while CPUs are in the network stack.

Since on panic only a single CPU keeps running, with all the others stopped
through NMI, other kinds of breakage are possible too: for example, a CPU may
be stopped while the thread running on it owns a spinlock that is needed to
complete netdump.
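
The spinlock hazard can be illustrated with C11 atomics in user space. This
is a sketch of the failure mode only, not a proposed fix and not the kernel's
spin mutex code: struct toy_spin, toy_try_lock_bounded() and toy_unlock() are
hypothetical names. The spin is bounded so that the demonstration terminates;
an unbounded spin in the same state would never return.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock built on a C11 atomic_flag; not a kernel mutex. */
struct toy_spin {
	atomic_flag held;
};

/*
 * Try to take the lock, giving up after max_spins attempts.
 * Returns true if the lock was acquired, false if it stayed held.
 */
static bool
toy_try_lock_bounded(struct toy_spin *l, int max_spins)
{
	assert(max_spins >= 0);
	for (int i = 0; i < max_spins; i++) {
		/* test_and_set returns the previous value: false = was free */
		if (!atomic_flag_test_and_set_explicit(&l->held,
		    memory_order_acquire))
			return (true);
	}
	return (false);
}

/* Release the lock. */
static void
toy_unlock(struct toy_spin *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

If one caller takes the lock and is then "stopped" without ever calling
toy_unlock(), every later toy_try_lock_bounded() call returns false no matter
how large max_spins is, because nothing is left running that could release
the lock.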

This bug is opened to investigate further and see if we can address some of
these issues.
