14.0-RELEASE and 15-CURRENT have high interrupt count/latency in Xen domu (xcp-ng host)
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sun, 10 Mar 2024 00:27:03 UTC
On an xcp-ng VM running 14.0-RELEASE (also tested on p5 and 15-CURRENT), top shows a lot of CPU time in interrupts (always a double-digit percentage, usually around 70% but it varies all over the place). Based on dtrace, the problem appears to be in ata_interrupt and I’d guess it’s mutex-related.
What else do I want to do to investigate this issue?
All of below is on an instance with 1GB RAM and 2 vCPUs; the host is xcp-ng 8.2 with all available patches.
dtrace -n '::*interrupt*:entry { self->ts = timestamp; } ::*interrupt*:return /self->ts/ { @[probefunc] = quantize(timestamp - self->ts); self->ts = 0; }'
run for one minute on an otherwise idle VM produces:
zio_delay_interrupt
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@ 2
2048 |@@@ 1
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@ 6
16384 |@@@@@@@@@@ 3
32768 | 0
ata_interrupt
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@ 3
262144 |@@@@@@@@@@@@@@@@@ 19
524288 |@@@@@@@@ 9
1048576 |@@@@@@@@@@@@ 13
2097152 | 0
On CURRENT, that dtrace invocation returns nothing at all; dtrace -l shows an entry but not return for ata_interrupt so that appears to be an fbt issue.
The disk does appear to be using the PV driver:
xbd0: 32768MB <Virtual Block Device> at device/vbd/768 on xenbusb_front0
xbd0: attaching as ada0
xbd0: features: write_barrier
xbd0: synchronize cache commands enabled.
There are two mutex warnings that show up on CURRENT, but they seem unrelated:
Invoking IPv6 network device address event may sleep with the following non-sleepable locks held:
exclusive sleep mutex xnrx_1 (netfront receive lock) r = 0 (0xfffffe00496d2d48) locked @ /usr/src/sys/dev/xen/netfront/netfront.c:679
stack backtrace:
#0 0xffffffff80bc9a35 at witness_debugger+0x65
#1 0xffffffff80bcab99 at witness_warn+0x3e9
#2 0xffffffff80d793c4 at in6_update_ifa+0xcb4
#3 0xffffffff80da4f80 at in6_ifadd+0x1e0
#4 0xffffffff80da16d9 at nd6_ra_input+0x1029
#5 0xffffffff80d73de0 at icmp6_input+0x870
#6 0xffffffff80d8c4ae at ip6_input+0xc8e
#7 0xffffffff80cadbbd at netisr_dispatch_src+0xad
#8 0xffffffff80c9065a at ether_demux+0x16a
#9 0xffffffff80c91d73 at ether_nh_input+0x403
#10 0xffffffff80cadbbd at netisr_dispatch_src+0xad
#11 0xffffffff80c90ab9 at ether_input+0xd9
#12 0xffffffff809c6ebc at xn_rxeof+0x5dc
#13 0xffffffff809c8238 at xn_intr+0x48
#14 0xffffffff80b0d556 at ithread_loop+0x266
#15 0xffffffff80b09972 at fork_exit+0x82
#16 0xffffffff8102e06e at fork_trampoline+0xe
lock order reversal: (sleepable after non-sleepable)
1st 0xfffffe00496d2d48 xnrx_1 (netfront receive lock, sleep mutex) @ /usr/src/sys/dev/xen/netfront/netfront.c:679
2nd 0xffffffff81abe830 in6_multi_sx (in6_multi_sx, sx) @ /usr/src/sys/netinet6/in6_mcast.c:1217
lock order netfront receive lock -> in6_multi_sx attempted at:
#0 0xffffffff80bc95f3 at witness_checkorder+0xb63
#1 0xffffffff80b5f0fd at _sx_xlock+0x5d
#2 0xffffffff80d81b11 at in6_joingroup+0x31
#3 0xffffffff80d7977d at in6_update_ifa+0x106d
#4 0xffffffff80da4f80 at in6_ifadd+0x1e0
#5 0xffffffff80da16d9 at nd6_ra_input+0x1029
#6 0xffffffff80d73de0 at icmp6_input+0x870
#7 0xffffffff80d8c4ae at ip6_input+0xc8e
#8 0xffffffff80cadbbd at netisr_dispatch_src+0xad
#9 0xffffffff80c9065a at ether_demux+0x16a
#10 0xffffffff80c91d73 at ether_nh_input+0x403
#11 0xffffffff80cadbbd at netisr_dispatch_src+0xad
#12 0xffffffff80c90ab9 at ether_input+0xd9
#13 0xffffffff809c6ebc at xn_rxeof+0x5dc
#14 0xffffffff809c8238 at xn_intr+0x48
#15 0xffffffff80b0d556 at ithread_loop+0x266
#16 0xffffffff80b09972 at fork_exit+0x82
#17 0xffffffff8102e06e at fork_trampoline+0xe
(The "non-sleepable locks held” backtrace shows up a second time but without a lock reversal.)
systat -vmstat on RELEASE:
3 users Load 0.13 0.19 0.22 Mar 10 00:21:55
Mem usage: 89%Phy 78%Kmem VN PAGER SWAP PAGER
Mem: REAL VIRTUAL in out in out
Tot Share Tot Share Free count
Act 47944K 20684K 406M 24480K 107M pages
All 50736K 23368K 446M 57372K ioflt Interrupts
Proc: cow 69 total
r p d s w Csw Trp Sys Int Sof Flt zfod atkbd0 1
31 91 2 67 36 1 ozfod ata1 15
%ozfod uhci0 23
0.0%Sys 78.9%Intr 0.0%User 0.0%Nice 21.1%Idle daefr 17 cpu0:xen
| | | | | | | | | | | prcfr 17 cpu1:xen
+++++++++++++++++++++++++++++++++++++++ totfr cpu0:r
dtbuf react cpu0:itlb
Namei Name-cache Dir-cache 62482 maxvn pdwak 4 cpu0:b
Calls hits % hits % 17168 numvn 5 pdpgs cpu1:r
7 7 100 16002 frevn intrn cpu1:itlb
772M wire 6 cpu1:b
Disks ada0 cd0 pass0 13M act 13 xen_et0:c0
KB/t 0.00 0.00 0.00 60M inact 11 xen_et0:c1
tps 0 0 0 0 laund xenstore0
MB/s 0.00 0.00 0.00 107M free xbd0 2113
%busy 0 0 0 0 buf 1 xn0 2114
xn0 2115
xn0 2116
xn0 2117
and on CURRENT:
4 users Load 0.15 0.12 0.16 Mar 10 00:23:11
Mem usage: 35%Phy 8%Kmem VN PAGER SWAP PAGER
Mem: REAL VIRTUAL in out in out
Tot Share Tot Share Free count
Act 103M 14608K 454M 16848K 626M pages
All 105M 16908K 486M 49620K ioflt Interrupts
Proc: cow 274 total
r p d s w Csw Trp Sys Int Sof Flt zfod atkbd0 1
29 272 1 49 137 1 ozfod uhci0 23
%ozfod 114 cpu0:xen
0.0%Sys 56.0%Intr 0.0%User 0.0%Nice 44.0%Idle daefr 23 cpu1:xen
| | | | | | | | | | | prcfr 2 cpu0:r
++++++++++++++++++++++++++++ totfr cpu0:itlb
6 dtbuf react 12 cpu0:b
Namei Name-cache Dir-cache 62833 maxvn pdwak 1 cpu1:r
Calls hits % hits % 920 numvn 10 pdpgs cpu1:itlb
7 7 100 126 frevn intrn 17 cpu1:b
217M wire 100 xen_et0:c0
Disks ada0 31M act 4 xen_et0:c1
KB/t 0.00 84M inact xenstore0
tps 0 0 laund xbd0 2113
MB/s 0.00 626M free 1 xn0 2114
%busy 0 80M buf xn0 2115
xn0 2116
xn0 2117
—
Brad Ackerman
brad@facefault.org