Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Wed, 09 Mar 2022 08:42:52 UTC
On Sun, Mar 06, 2022 at 02:41:17PM +0200, Ze Dupsys wrote:
> On 2022.03.03. 12:39, Roger Pau Monné wrote:
> > On Wed, Mar 02, 2022 at 07:26:18PM +0200, Ze Dupsys wrote:
> > > Today managed to crash lab Dom0 with:
> > > xen_cmdline="dom0_mem=6144M dom0_max_vcpus=2 dom0=pvh,verbose=1
> > > console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1
> > > reboot=no"
> > Hm, it's weird that reboot=no doesn't work for you. Does noreboot
> > instead make a difference?
> > 
> > > I wrote ' vmstat -m | sort -k 2 -r' each 120 seconds, the latest one was as
> > > in attachment, panic was with the same fingerprint as the one with
> > > "rman_is_region_manager" line already reported.
> > 
> > 
> > > The scripts i ran in parallel generally were the same as attached in bug
> > > report, just a bit modified.
> > > 1) ./libexec.sh zfs_volstress_fast_4g (this just creates new ZVOLs and
> > > instead of 2GB, it writes 4BG in each ZVOL created dd if=/dev/zero)
> > > 2)  ./test_vm1_zvol_3gb.sh (this loops commands: start first DomU, write
> > > 3GB in it's /tmp, restart DomU, removes /tmp, repeat)
> > > 3) ./test_vm2_zvol_5_on_off.sh (this loops: start second DomU, which has 5
> > > disks attached, turn off DomU, repeat)
> > Right. So the trigger for this seem to be related to creating (and
> > destroying) VMs in a loop?
> > 
> > Do you still see the same if you only execute steps 1 and 4 from the
> > repro described above?
> > 
> > > 4) monitoring, sleep 120 seconds, print vmstat | sort in serial output.
> > > 
> > > Around dom id 108, system started to behave suspiciously, xl list showed
> > > DomUs created, but they did not really start up, script timeout-ed for ssh
> > > connection, no vnc. When i did xl destroy manually, and xl create, system
> > > panic happened.
> > Could you also add the output of `top -n1` to see where memory is
> > going?
> > 
> > I'm quite sure we have a leak in some of the backends, maybe the
> > bounce buffer used by blkback.
> > 
> > Thanks, Roger.
> Hello,
> 
> About XEN command line option changing "reboot=no" to "noreboot"; this did
> not change anything. Physical machine reboots on kernel panic. Maybe i have
> misunderstood the use for this option. I somehow thought that reboot=no
> means that if Dom0 crashes/panics, then physical machine is not rebooted,
> but in reality it does reboot.

noreboot (or reboot=no) prevent Xen from rebooting when Xen crashes.
If it's FreeBSD dom0 that crashes it's dom0 the one that's triggering
the reboot. Can you try to set the sysctl kern.panic_reboot_wait_time
= -1 on FreeBSD?

> At this moment though, this is not important
> for, since i am recording all console output from serial anyways, so
> messages are not lost as long as they are flushed out on a timely manner.
> Maybe kernel itself after panic calls reset to XEN?

Yes, I think that's the case.

> If i execute steps 1 and 4, it seems that machine does not crash; tested 8
> hours Dom0 8GB RAM, 4 hours Dom0 2GB RAM. So creating a lot of ZVOLs and
> monitoring it does not seem to be the cause.
> 
> Yesterday was a happy and a sad day at the same time. The happy thing is
> that i think that i have understood one panic reason and somewhat better
> located where  one problem might be, but at the same time, i think i have
> located 2 bugs that cause one specific panic, but 2 other panic reasons are
> due to different circumstances.
> 
> Further is my pure speculation about two things, somewhat related; i do not
> know XEN or kernel internals, thus take this with grain of salt.
> 
> Knowing that this seemed to be RAM problem and thinking how ZFS sysctl
> tuning values differ, i started to look at "sysctl -a" diffs for Dom0 with
> 2GB RAM and 8GB RAM, by pure luck because comparing Dom0 that was running
> longer and just recently restarted, i noticed there are diff lines like so:
> ..
> +dev.xbbd.202.%driver: xbbd
> +dev.xbbd.202.%location:
> +dev.xbbd.202.%parent: xenbusb_back0
> +dev.xbbd.202.%pnpinfo:
> +dev.xbbd.202.xenbus_connection_state: Closed
> +dev.xbbd.202.xenbus_dev_type: vbd
> +dev.xbbd.202.xenbus_peer_domid: 309
> +dev.xbbd.202.xenstore_path: backend/vbd/309/51728
> +dev.xbbd.202.xenstore_peer_path: /local/domain/309/device/vbd/51728
> +dev.xbbd.203.%desc: Backend Virtual Block Device
> +dev.xbbd.203.%driver: xbbd
> +dev.xbbd.203.%location:
> +dev.xbbd.203.%parent: xenbusb_back0
> +dev.xbbd.203.%pnpinfo:
> +dev.xbbd.203.xenbus_connection_state: Closed
> ..
> although actually in system at the time there were only 2 DomUs running max.
> This seems to be the reason why this felt like memory leak, since after long
> time of start/stop VMs, most probably sysctl hit some sort of RAM limit
> (which most probably is auto-calculated from total RAM somehow) and panics.

Those dev.xbbd.XXX sysctls reference a specific blkback instance, and
should be gone once the guest is shutdown. Do you mean that you have a
range of dev.xbbd.[0-203] entries in your sysctl?

Note that in dev.xbbd.XXX XXX matches the domain ID of the guest
that's using the backend.

> I can repeat this sysctl growing with just starting/stopping VMs in loop
> although stressing ZVOL at the same time speeds this up for unknown reasons
> to me. What i do not know though is what exactly triggers this, because
> there are times when xbbd.$number does not grow larger than total number of
> disk devices attached to all DomUs. I do not know what would be the best way
> to continue with this.

$number should be the domain ID, it will grown as long as you keep
creating new guests, but that's not really a problem.

> Then i caught another sysctl variable that is growing due to XEN,
> "kern.msgbuf: Contents of kernel message buffer". I do not know how this
> variable grows or by which component it is managed, but in VM start/stop
> case it grows and contains lines with pattern like so:
> ..
> xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
> xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
> xnb(xnb_dump_txreq:299): netif_tx_request index =0
> xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
> xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
> xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
> xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
> xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
> xnb(xnb_dump_txreq:299): netif_tx_request index =1
> xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
> xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
> xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
> xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
> xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
> ..
> 
> Those lines in that variable just keep growing and growing, it is not that
> they are flushed, trimmed or anything. Each time i get the same message on
> serial output, it has one more section of error appended to "same-previous"
> serial output message and sysctl variable as well. Thus at some point serial
> output and sysctl contains a large block of those errors while VM is
> starting. So at some point the value of this sysctl could be reaching max
> allowed/available and this makes the system panic.  I do not know the reason
> for those errors, but actually if there was a patch to suppress them, this
> could be "solved". Another diff chunk might be related to this:
> +dev.xnb.1.xenstore_peer_path: /local/domain/7/device/vif/0
> +dev.xnb.1.xenbus_peer_domid: 7
> +dev.xnb.1.xenbus_connection_state: InitWait
> +dev.xnb.1.xenbus_dev_type: vif
> +dev.xnb.1.xenstore_path: backend/vif/7/0
> +dev.xnb.1.dump_rings:
> +dev.xnb.1.unit_test_results: xnb_rxpkt2rsp_empty:1765 Assertion Error:
> nr_reqs == 0
> +xnb_rxpkt2rsp_empty:1767 Assertion Error: memcmp(&rxb_backup,
> &xnb_unit_pvt.rxb, sizeof(rxb_backup)) == 0
> +xnb_rxpkt2rsp_empty:1769 Assertion Error: memcmp(&rxs_backup,
> xnb_unit_pvt.rxs, sizeof(rxs_backup)) == 0
> +52 Tests Passed
> +1 Tests FAILED

So you have failed tests for netback. Maybe the issue is with
netback rather than blkback.

> What was suspicious about this is that i am using 13.0-RELEASE-p7, it is not
> a DEV version of XEN or anything, but there are sysctl variables with "dev"
> prefix. Could it be that ports have been accidentally compiled XEN version
> with some development flags turned on?

'dev' here means device, not developer.

> So now it felt for me that i understand this panic case:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 02
> fault virtual address   = 0x22710028
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80c45892
> stack pointer           = 0x28:0xfffffe009656a930
> frame pointer           = 0x28:0xfffffe009656a930
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 17633 (devmatch)
> trap number             = 12
> panic: page fault
> cpuid = 1
> time = 1646139877
> KDB: stack backtrace:
> #0 0xffffffff80c57525 at kdb_backtrace+0x65
> #1 0xffffffff80c09f01 at vpanic+0x181
> #2 0xffffffff80c09d73 at panic+0x43
> #3 0xffffffff8108b1a7 at trap+0xbc7
> #4 0xffffffff8108b1ff at trap+0xc1f
> #5 0xffffffff8108a85d at trap+0x27d
> #6 0xffffffff81061b18 at calltrap+0x8
> #7 0xffffffff80c62011 at rman_is_region_manager+0x241
> #8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
> #9 0xffffffff80c1949c at kernel_sysctl+0x43c
> #10 0xffffffff80c19b13 at userland_sysctl+0x173
> #11 0xffffffff80c1995f at sys___sysctl+0x5f
> #12 0xffffffff8108baac at amd64_syscall+0x10c
> #13 0xffffffff8106243e at Xfast_syscall+0xfe
> 
> The function "kernel_sysctl" shows that this is sysctl related, and based on
> previous knowledge we know that there might be too many variables. If i
> assume that "Xfast_syscall" is a XEN function, then that shows that panic
> came from XEN call that wanted to do something to sysctl variable, and panic
> happened. This is all just my assumption.

No, Xfast_syscall is not a Xen function. That's the syscall entry
point into the kernel.

> 
> So those were my conclusions.
> 
> What do you think? How should we proceed with this? Should i try to somehow
> build XEN from git sources?

I think that's unlikely to make any difference. I would think the
problem is with FreeBSD rather than Xen. Can you paste the config file
you are using the create the domain(s)?

I've tried myself to trigger this crash by creating and destroying a
guest in a loop but didn't manage to trigger it. I was able to create
(and shutdown) 1328 guests successfully on a ZFS dom0 using 4GB of
RAM.

Do you have some kind of configuration that differs from the default
for ZFS for example? (ie: setting a different ARC size)

Are you ballooning the amount of memory assigned to dom0?

Thanks, Roger.