Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Sun, 06 Mar 2022 12:41:17 UTC
On 2022.03.03. 12:39, Roger Pau Monné wrote:
> On Wed, Mar 02, 2022 at 07:26:18PM +0200, Ze Dupsys wrote:
>> Today I managed to crash the lab Dom0 with:
>> xen_cmdline="dom0_mem=6144M dom0_max_vcpus=2 dom0=pvh,verbose=1
>> console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1
>> reboot=no"
> Hm, it's weird that reboot=no doesn't work for you. Does noreboot
> instead make a difference?
>
>> I ran 'vmstat -m | sort -k 2 -r' every 120 seconds; the latest output
>> was as in the attachment, and the panic had the same fingerprint as the
>> one with the "rman_is_region_manager" line already reported.
>
>
>> The scripts I ran in parallel were generally the same as those attached
>> in the bug report, just a bit modified.
>> 1) ./libexec.sh zfs_volstress_fast_4g (this just creates new ZVOLs, and
>> instead of 2GB it writes 4GB into each created ZVOL with dd if=/dev/zero)
>> 2) ./test_vm1_zvol_3gb.sh (this loops commands: start first DomU, write
>> 3GB into its /tmp, restart DomU, remove /tmp, repeat)
>> 3) ./test_vm2_zvol_5_on_off.sh (this loops: start second DomU, which has 5
>> disks attached, turn off DomU, repeat)
> Right. So the trigger for this seems to be related to creating (and
> destroying) VMs in a loop?
>
> Do you still see the same if you only execute steps 1 and 4 from the
> repro described above?
>
>> 4) monitoring: sleep 120 seconds, print 'vmstat | sort' to the serial output, repeat.
>>
>> Around domain id 108 the system started to behave suspiciously: xl list
>> showed the DomUs as created, but they did not really start up, the script
>> timed out on the ssh connection, and there was no VNC. When I did xl
>> destroy manually, then xl create, the system panicked.
> Could you also add the output of `top -n1` to see where memory is
> going?
>
> I'm quite sure we have a leak in some of the backends, maybe the
> bounce buffer used by blkback.
>
> Thanks, Roger.
Hello,

About changing the Xen command line option "reboot=no" to "noreboot": this
did not change anything; the physical machine still reboots on kernel panic.
Maybe I have misunderstood what this option is for. I somehow thought that
reboot=no means that if Dom0 crashes/panics the physical machine is not
rebooted, but in reality it does reboot. At this moment this is not that
important for me, since I am recording all console output from serial
anyway, so messages are not lost as long as they are flushed out in a timely
manner. Maybe the kernel itself calls reset through Xen after the panic?
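
If it is the FreeBSD kernel doing the reset, then (this is an assumption on
my side, untested under a Xen dom0) the kernel's own panic reboot timer
would explain it, and disabling that timer should make Dom0 sit at the panic
instead of resetting:

  # -1 disables the automatic reboot after panic (default is 15 seconds);
  # can also be set as a tunable in /etc/sysctl.conf
  sysctl kern.panic_reboot_wait_time=-1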

If I execute only steps 1 and 4, it seems that the machine does not crash; I
tested 8 hours with 8GB Dom0 RAM and 4 hours with 2GB Dom0 RAM. So creating
a lot of ZVOLs and monitoring alone does not seem to be the cause.

Yesterday was a happy and a sad day at the same time. The happy thing is
that I think I have understood one panic reason and located somewhat better
where one problem might be; at the same time, I think the 2 bugs I have
located cause only one specific panic, while the 2 other panic reasons are
due to different circumstances.

What follows is pure speculation about two things, somewhat related; I do
not know Xen or kernel internals, so take this with a grain of salt.

Knowing that this seemed to be a RAM problem, and thinking about how the ZFS
sysctl tuning values differ, I started to look at "sysctl -a" diffs for a
Dom0 with 2GB RAM and with 8GB RAM. By pure luck, because I was comparing a
Dom0 that had been running longer against a recently restarted one, I
noticed diff lines like these:
..
+dev.xbbd.202.%driver: xbbd
+dev.xbbd.202.%location:
+dev.xbbd.202.%parent: xenbusb_back0
+dev.xbbd.202.%pnpinfo:
+dev.xbbd.202.xenbus_connection_state: Closed
+dev.xbbd.202.xenbus_dev_type: vbd
+dev.xbbd.202.xenbus_peer_domid: 309
+dev.xbbd.202.xenstore_path: backend/vbd/309/51728
+dev.xbbd.202.xenstore_peer_path: /local/domain/309/device/vbd/51728
+dev.xbbd.203.%desc: Backend Virtual Block Device
+dev.xbbd.203.%driver: xbbd
+dev.xbbd.203.%location:
+dev.xbbd.203.%parent: xenbusb_back0
+dev.xbbd.203.%pnpinfo:
+dev.xbbd.203.xenbus_connection_state: Closed
..
although at the time there were at most 2 DomUs actually running in the
system. This seems to be the reason why this felt like a memory leak: after
a long time of starting/stopping VMs, the sysctl tree most probably hits
some sort of RAM limit (which most probably is auto-calculated from total
RAM somehow) and the system panics. I can reproduce this sysctl growth just
by starting/stopping VMs in a loop, although stressing ZVOLs at the same
time speeds it up, for reasons unknown to me. What I do not know, though, is
what exactly triggers this, because there are times when the xbbd.$number
instances do not grow beyond the total number of disk devices attached to
all DomUs. I do not know what would be the best way to continue with this.
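
In case it is useful, this is roughly how the growth can be tracked between
the monitoring samples (sysctl -N prints only the variable names; the exact
grep pattern is my guess at matching one line per xbbd instance):

  # count xbbd instances in the sysctl device tree; with only 2 DomUs
  # running this should stay near the number of disks attached to them
  sysctl -aN | grep -c '^dev\.xbbd\.[0-9][0-9]*\.%driver$'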

Then I caught another sysctl variable that is growing due to Xen:
"kern.msgbuf: Contents of kernel message buffer". I do not know how this
variable grows or by which component it is managed, but in the VM start/stop
case it grows and contains lines with a pattern like this:
..
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
..
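
I sample that variable periodically to watch the growth; a rough sketch
(the grep pattern is just my guess at matching the xnb messages):

  # total size of the message buffer contents, and how many xnb error
  # lines have accumulated in it (reading kern.msgbuf may need root,
  # depending on security.bsd.unprivileged_read_msgbuf)
  sysctl -n kern.msgbuf | wc -c
  sysctl -n kern.msgbuf | grep -c 'xnb('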

The lines in that variable just keep growing and growing; they are never
flushed or trimmed. Each time I get the same message on the serial output,
it has one more error section appended compared to the previous serial
output message, and the sysctl variable grows the same way. Thus at some
point the serial output and the sysctl contain a large block of those errors
while a VM is starting. So at some point the value of this sysctl could
reach the maximum allowed/available, and that makes the system panic. I do
not know the reason for those errors, but if there were a patch to suppress
them, this could be "solved". Another diff chunk might be related to this:
+dev.xnb.1.xenstore_peer_path: /local/domain/7/device/vif/0
+dev.xnb.1.xenbus_peer_domid: 7
+dev.xnb.1.xenbus_connection_state: InitWait
+dev.xnb.1.xenbus_dev_type: vif
+dev.xnb.1.xenstore_path: backend/vif/7/0
+dev.xnb.1.dump_rings:
+dev.xnb.1.unit_test_results: xnb_rxpkt2rsp_empty:1765 Assertion Error: nr_reqs == 0
+xnb_rxpkt2rsp_empty:1767 Assertion Error: memcmp(&rxb_backup, &xnb_unit_pvt.rxb, sizeof(rxb_backup)) == 0
+xnb_rxpkt2rsp_empty:1769 Assertion Error: memcmp(&rxs_backup, xnb_unit_pvt.rxs, sizeof(rxs_backup)) == 0
+52 Tests Passed
+1 Tests FAILED

What was suspicious about this is that I am using 13.0-RELEASE-p7, not a
development version of Xen or anything, yet there is a
dev.xnb.1.unit_test_results node reporting driver self-test failures. (The
"dev." prefix itself is just the normal device sysctl tree, but unit test
results do not look like something a release build should export.) Could the
Xen version from ports, or the netback driver, have been accidentally
compiled with some development flags turned on?
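
One way to check what the running kernel was built with (assuming it was
built with options INCLUDE_CONFIG_FILE, as GENERIC is) would be:

  # dump the running kernel's build configuration and look for debug options
  sysctl -n kern.conftxt | grep -i debug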

So now it feels like I understand this panic case:
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x22710028
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c45892
stack pointer           = 0x28:0xfffffe009656a930
frame pointer           = 0x28:0xfffffe009656a930
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 17633 (devmatch)
trap number             = 12
panic: page fault
cpuid = 1
time = 1646139877
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff80c62011 at rman_is_region_manager+0x241
#8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c1949c at kernel_sysctl+0x43c
#10 0xffffffff80c19b13 at userland_sysctl+0x173
#11 0xffffffff80c1995f at sys___sysctl+0x5f
#12 0xffffffff8108baac at amd64_syscall+0x10c
#13 0xffffffff8106243e at Xfast_syscall+0xfe

The functions kernel_sysctl and userland_sysctl show that this is sysctl
related, and based on the previous observations we know that there may be
far too many variables. Xfast_syscall, as far as I can tell, is the FreeBSD
amd64 syscall entry point rather than a Xen function, and the current
process was devmatch; so the panic seems to happen when a userland process
walks the sysctl tree and hits one of those stale device entries. This is
still partly my assumption.
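
If that reading is right, then just walking the whole sysctl tree in a loop
while VMs are being started/stopped should be able to trigger the same fault
without devmatch; a trivial sketch:

  # hammer the same code path devmatch takes: full sysctl tree walks
  while true; do
      sysctl -a > /dev/null 2>&1
  done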

So those were my conclusions.

What do you think? How should we proceed with this? Should I try to somehow
build Xen from the git sources?

For what it's worth, the system was crashed with 8GB Dom0 RAM and 16GB swap
space on a dedicated HDD, not the one ZFS is on.

xen_cmdline="dom0_mem=8192M dom0_max_vcpus=2 dom0=pvh,verbose=1 
console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all 
sync_console=1 noreboot"

"top -n 1" output before panic:
last pid: 18906;  load averages:  1.39,  2.62,  2.90  up 0+02:22:00  19:54:50
42 processes:  1 running, 41 sleeping
CPU:  3.0% user,  0.0% nice, 27.7% system,  2.3% interrupt, 67.0% idle
Mem: 872K Active, 4176K Inact, 4356K Laundry, 7656M Wired, 48M Buf, 238M Free
ARC: 5183M Total, 6320K MFU, 4752M MRU, 219M Anon, 192M Header, 12M Other
      4674M Compressed, 4746M Uncompressed, 1.02:1 Ratio
Swap: 16G Total, 40M Used, 16G Free

   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME WCPU COMMAND
  1010 root          1  23    0    13M  3044K select   1   1:34 0.98% xenstored

At this point I do not think the "vmstat -m" output before the panic is
useful. What I can tell, though, is that grepping the vmstat output for the
"sysctloid" malloc type shows that its RAM consumption is steadily growing
most of the time.
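
For reference, this is roughly how I watch it, matching the 120-second
monitoring interval:

  # print the sysctloid malloc-type line from vmstat every 120 seconds
  while true; do
      vmstat -m | grep sysctloid
      sleep 120
  done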

The sad part about all this, though, is that I do not believe these are the
bugs that cause the panic on my pre-production machine. Maybe the sysctl
variable "kern.msgbuf" overgrows there too, but other than that, I normally
do not start/stop VMs often enough to create that many xbbd sysctl
variables... Anyway, we could try to deal with this first and then see if it
fixes anything.

Best wishes!