LOR: "taskqueue_drain with the following non-sleepable locks held" with if_em

Tue May 7 23:06:26 UTC 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 05/07/13 15:03, Garrett Cooper wrote:
> Saw the following LOR on a CURRENT build as of yesterday with an 
> almost idle machine processing ARP requests:
> 
> root at wf220:/mnt # taskqueue_drain with the following non-sleepable locks held:
> exclusive rw lle (lle) r = 0 (0xfffffe001450b410) locked @ /usr/src/sys/netinet/in.c:1484
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffff848d4f7690
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xffffff848d4f7740
> witness_warn() at witness_warn+0x4a8/frame 0xffffff848d4f7800
> taskqueue_drain() at taskqueue_drain+0x3a/frame 0xffffff848d4f7840
> set_timeout() at set_timeout+0x4a/frame 0xffffff848d4f7860
> netevent_callback() at netevent_callback+0x16/frame 0xffffff848d4f7870
> arpintr() at arpintr+0x9b5/frame 0xffffff848d4f7930
> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f79a0
> ether_demux() at ether_demux+0x130/frame 0xffffff848d4f79d0
> ether_nh_input() at ether_nh_input+0x369/frame 0xffffff848d4f7a30
> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f7aa0
> em_rxeof() at em_rxeof+0x30e/frame 0xffffff848d4f7b10
> em_msix_rx() at em_msix_rx+0x33/frame 0xffffff848d4f7b40
> intr_event_execute_handlers() at intr_event_execute_handlers+0x80/frame 0xffffff848d4f7b70
> ithread_loop() at ithread_loop+0x128/frame 0xffffff848d4f7bb0
> fork_exit() at fork_exit+0x71/frame 0xffffff848d4f7bf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffff848d4f7bf0
> --- trap 0, rip = 0, rsp = 0xffffff848d4f7cb0, rbp = 0 ---
> root at wf220:/mnt # uname -a
> FreeBSD wf220.west.isilon.com 10.0-CURRENT FreeBSD 10.0-CURRENT #1: Tue May  7 08:04:59 PDT 2013 root at wf220.west.isilon.com:/usr/obj/usr/src/sys/ISI-GENERIC  amd64
> 
> I've seen this issue before for a few weeks/months, so it's nothing
> new (but probably should be fixed...). Thanks!

This has nothing to do with em(4); it looks like a bug in our Linux
compatibility wrapper.  In the InfiniBand code,
_handle_arp_update_event() calls netevent_callback() with
NETEVENT_NEIGH_UPDATE, where a cancel_delayed_work() causes the drain.
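
To make the failure mode concrete, here is a minimal userland sketch of
the kind of check that fires in the trace above: the ARP path holds the
lle write lock, and a routine that may sleep is entered underneath it.
Every name below (model_thread, model_wlock, model_sleep_check,
model_drain) is hypothetical and only stands in for the real kernel
machinery; WITNESS tracks far more state than a single counter.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-thread bookkeeping, not the real WITNESS data. */
struct model_thread {
	int nonsleepable_held;	/* non-sleepable locks currently held */
};

/* Model of taking a non-sleepable lock (e.g. an rw write lock). */
void
model_wlock(struct model_thread *td)
{
	td->nonsleepable_held++;
}

/* Model of dropping that lock again. */
void
model_wunlock(struct model_thread *td)
{
	assert(td->nonsleepable_held > 0);
	td->nonsleepable_held--;
}

/*
 * Called on entry to anything that may sleep.  Warns and returns the
 * number of offending locks; returns 0 when sleeping would be fine.
 */
int
model_sleep_check(const struct model_thread *td, const char *what)
{
	if (td->nonsleepable_held > 0)
		fprintf(stderr, "%s with %d non-sleepable lock(s) held\n",
		    what, td->nonsleepable_held);
	return (td->nonsleepable_held);
}

/* Stand-in for a draining cancel: it may sleep, so it checks first. */
int
model_drain(const struct model_thread *td)
{
	return (model_sleep_check(td, "model_drain"));
}
```

Calling model_drain() between model_wlock() and model_wunlock() trips
the warning, which is the shape of what the backtrace shows; calling it
with no lock held is silent.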

Looking at the Linux code, cancel_delayed_work() does not wait for the
work item, so we just shouldn't do the drain in our wrapper
(sys/ofed/include/linux/workqueue.h).  I think we need something like
this:

Index: sys/ofed/include/linux/workqueue.h
===================================================================
- --- sys/ofed/include/linux/workqueue.h	(revision 250337)
+++ sys/ofed/include/linux/workqueue.h	(working copy)
@@ -184,9 +184,9 @@
 {

 	callout_stop(&work->timer);
- -	if (work->work.taskqueue &&
- -	    taskqueue_cancel(work->work.taskqueue, &work->work.work_task, NULL))
- -		taskqueue_drain(work->work.taskqueue, &work->work.work_task);
+	if (work->work.taskqueue)
+		return (taskqueue_cancel(work->work.taskqueue,
+		    &work->work.work_task, NULL) != 0);
 	return 0;
 }
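
For anyone who wants the intended semantics spelled out, below is a
small userland model of the two flavours of cancel.  It is only a
sketch: the names (model_work, model_cancel, model_cancel_sync) are made
up, and the return convention is the Linux-style "true if a pending item
was removed", which is an assumption of the model and not a statement
about what taskqueue_cancel() returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a work item; not the kernel structures. */
enum work_state { WORK_IDLE, WORK_PENDING, WORK_RUNNING };

struct model_work {
	enum work_state state;
	int drains;	/* times the caller had to wait for the handler */
};

/*
 * Non-blocking cancel: remove the item if it is still pending and
 * never wait.  A handler that is already running is left alone.
 */
bool
model_cancel(struct model_work *w)
{
	assert(w != NULL);
	if (w->state == WORK_PENDING) {
		w->state = WORK_IDLE;
		return (true);
	}
	return (false);
}

/*
 * Synchronous cancel: like model_cancel(), but also waits for a
 * running handler.  The counter stands in for the sleep, which is
 * why this flavour must not be used with non-sleepable locks held.
 */
bool
model_cancel_sync(struct model_work *w)
{
	bool was_pending = model_cancel(w);

	if (w->state == WORK_RUNNING) {
		w->drains++;
		w->state = WORK_IDLE;
	}
	return (was_pending);
}
```

With a running item, model_cancel() returns false and leaves it running
without waiting, while model_cancel_sync() waits once.  The wrapper
called from the ARP path wants the first behaviour.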



I've added Jeff to Cc.

Cheers,
- -- 
Xin LI <delphij at delphij.net>    https://www.delphij.net/
FreeBSD - The Power to Serve!           Live free or die
-----BEGIN PGP SIGNATURE-----

iQEcBAEBCgAGBQJRiYjwAAoJEG80Jeu8UPuzOC0H+wbTxVq3nPOuQqZynOLcxHVj
L19b1D8opm8hl3AwXfvbOyCbEEenoHJm0FjBd+5eas+9ol1kuRoOyBKVnoZRr2vO
7hcFt/iA7WAQKrZR7ReLUebjLcIymjzDRO6ztZCPMwSzIg1CzypY4KdJhlW438te
DvAkzYbgy1YG4C8Uxjg7wR7PR4SY1UgLFYPMeNyvwCCJmSEN/RQB1qrOaJovFks5
C53j713BIHOI0H4G3IhKJd9ujPhVrfQperItlJ4Lg7y0Ix5HlLFdSNRkpzvNrXN4
TN6Xb/atMo1EIiDReqx8Mpus52yUOl3oHXkKzTRZpGM3mW0vLIieajCK0JGBd6c=
=tU/S
-----END PGP SIGNATURE-----