Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Thu, 10 Mar 2022 09:34:26 UTC
On 2022.03.09. 10:42, Roger Pau Monné wrote:
> On Sun, Mar 06, 2022 at 02:41:17PM +0200, Ze Dupsys wrote:
>> If I execute steps 1 and 4, it seems that the machine does not crash;
>> tested 8 hours with an 8GB RAM Dom0, 4 hours with a 2GB RAM Dom0. So
>> creating a lot of ZVOLs and monitoring them does not seem to be the cause.
>>
>> Yesterday was a happy and a sad day at the same time. The happy thing is
>> that I think I have understood one panic reason and located one problem
>> somewhat better; at the same time, I think I have found 2 bugs that cause
>> one specific panic, while 2 other panic reasons are due to different
>> circumstances.
>>
>> What follows is pure speculation about two things, somewhat related; I do
>> not know XEN or kernel internals, so take this with a grain of salt.
>>
>> Knowing that this seemed to be a RAM problem, and thinking about how the
>> ZFS sysctl tuning values differ, I started to look at "sysctl -a" diffs
>> for a Dom0 with 2GB RAM and with 8GB RAM. By pure luck, comparing a Dom0
>> that had been running longer against one just recently restarted, I
>> noticed there are diff lines like so:
>> ..
>> +dev.xbbd.202.%driver: xbbd
>> +dev.xbbd.202.%location:
>> +dev.xbbd.202.%parent: xenbusb_back0
>> +dev.xbbd.202.%pnpinfo:
>> +dev.xbbd.202.xenbus_connection_state: Closed
>> +dev.xbbd.202.xenbus_dev_type: vbd
>> +dev.xbbd.202.xenbus_peer_domid: 309
>> +dev.xbbd.202.xenstore_path: backend/vbd/309/51728
>> +dev.xbbd.202.xenstore_peer_path: /local/domain/309/device/vbd/51728
>> +dev.xbbd.203.%desc: Backend Virtual Block Device
>> +dev.xbbd.203.%driver: xbbd
>> +dev.xbbd.203.%location:
>> +dev.xbbd.203.%parent: xenbusb_back0
>> +dev.xbbd.203.%pnpinfo:
>> +dev.xbbd.203.xenbus_connection_state: Closed
>> ..
>> although at the time there were at most 2 DomUs running in the system.
>> This seems to be the reason why this felt like a memory leak: after a long
>> time of starting/stopping VMs, sysctl most probably hits some sort of RAM
>> limit (most probably auto-calculated from total RAM somehow) and the
>> system panics.
> Those dev.xbbd.XXX sysctls reference a specific blkback instance, and
> should be gone once the guest is shut down. Do you mean that you have a
> range of dev.xbbd.[0-203] entries in your sysctl?
Yes, I have that range of entries, even when all DomUs are destroyed. A 
snippet from the logs - started yesterday, crashed today:
cat 2022_03_10__1646887233_sysctl__001.log | grep xbbd | sed -Ee 's/(xbbd\.[0-9]*)\..*/\1/g' | sort | uniq
dev.xbbd.%parent:
dev.xbbd.0
dev.xbbd.1
dev.xbbd.10
dev.xbbd.100
dev.xbbd.101
dev.xbbd.102
dev.xbbd.103
dev.xbbd.104
dev.xbbd.105
dev.xbbd.106
dev.xbbd.107
dev.xbbd.108
dev.xbbd.109
dev.xbbd.11
..
dev.xbbd.52
dev.xbbd.520
dev.xbbd.521
dev.xbbd.522
dev.xbbd.523
dev.xbbd.524
..
dev.xbbd.99
irq804: xbbd520:133 @cpu0(domain0): 436527
irq805: xbbd521:135 @cpu0(domain0): 4777
irq806: xbbd522:137 @cpu0(domain0): 16589254
irq807: xbbd523:139 @cpu0(domain0): 6412
irq808: xbbd524:141 @cpu0(domain0): 103083

And those entries keep growing. Yesterday I tried to follow Brian's 
email about a working setup with FreeBSD 12.1. With the given scripts, 
an 8GB RAM Dom0 still crashed; it took longer, but in the morning I saw 
that the machine had rebooted, with a panic message in the serial logs. 
It seems that "kern.panic_reboot_wait_time=-1" did not disable 
rebooting either. In the monitoring logs I see that approx. 3 minutes 
after the crash, the test machine was up and running again (rebooted).


> Note that in dev.xbbd.XXX XXX matches the domain ID of the guest
> that's using the backend.
On the setups I have tested, this does not seem to be the case. If each 
VM has more than 1 disk, I have as many xbbd sysctl instances as the 
total count of disks attached to VMs. But at some point, after 
rebooting machines or something, this number starts to grow on each 
create. It seems to me that the domain ID is inside the sysctl, as in 
"dev.xbbd.5.xenbus_peer_domid", and if I grep those values, they show 
domids of nonexistent VMs.

I do not know how sysctls are created, but is there some test suite 
that creates an awful lot of sysctl variables and tries to crash the 
system that way? I suppose there are no ways to limit the sysctl 
variable count, or any protection mechanisms for such cases. Then I 
could see whether the panic message is the same. Or a test that 
allocates a sysctl variable with an insanely large value, or keeps 
appending to it.
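As a quick sanity check, the leftover instances can be counted from a 
saved "sysctl -a" dump. The helpers below are only a sketch (not from my 
actual scripts); they read sysctl-style lines on stdin, so they can be 
run against the saved log files shown above:

```shell
#!/bin/sh
# Sketch: filters for a saved "sysctl -a" dump. Field names are taken
# from the dev.xbbd output shown earlier in this mail.

# Count distinct dev.xbbd.N instances still present in the dump.
count_xbbd_instances() {
    grep -E '^dev\.xbbd\.[0-9]+\.' |
        sed -E 's/^dev\.xbbd\.([0-9]+)\..*/\1/' |
        sort -n | uniq | wc -l
}

# List the peer domids recorded in the instances; any domid that does
# not appear in 'xl list' output belongs to a stale, leaked entry.
list_peer_domids() {
    awk -F': ' '/^dev\.xbbd\.[0-9]+\.xenbus_peer_domid/ { print $2 }' |
        sort -n | uniq
}
```

Usage would be e.g. `count_xbbd_instances < 2022_03_10__1646887233_sysctl__001.log`.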

>> Then I caught another sysctl variable that grows due to XEN:
>> "kern.msgbuf: Contents of kernel message buffer". I do not know how this
>> variable grows or which component manages it, but in the VM start/stop
>> case it grows and contains lines with a pattern like so:
>> ..
>> xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
>> xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
>> xnb(xnb_dump_txreq:299): netif_tx_request index =0
>> xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
>> xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
>> xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
>> xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
>> xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
>> xnb(xnb_dump_txreq:299): netif_tx_request index =1
>> xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
>> xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
>> xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
>> xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
>> xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
>> ..
>>
>> Those lines just keep growing in that variable; they are never flushed,
>> trimmed or anything. Each time I get the same message on serial output,
>> it has one more error section appended compared to the previous message,
>> and the same goes for the sysctl variable. Thus at some point the serial
>> output and the sysctl contain a large block of those errors while a VM is
>> starting, and eventually the value of this sysctl could reach the maximum
>> allowed/available, making the system panic. I do not know the reason for
>> those errors, but if there were a patch to suppress them, this could be
>> "solved". Another diff chunk might be related to this:
>> +dev.xnb.1.xenstore_peer_path: /local/domain/7/device/vif/0
>> +dev.xnb.1.xenbus_peer_domid: 7
>> +dev.xnb.1.xenbus_connection_state: InitWait
>> +dev.xnb.1.xenbus_dev_type: vif
>> +dev.xnb.1.xenstore_path: backend/vif/7/0
>> +dev.xnb.1.dump_rings:
>> +dev.xnb.1.unit_test_results: xnb_rxpkt2rsp_empty:1765 Assertion Error:
>> nr_reqs == 0
>> +xnb_rxpkt2rsp_empty:1767 Assertion Error: memcmp(&rxb_backup,
>> &xnb_unit_pvt.rxb, sizeof(rxb_backup)) == 0
>> +xnb_rxpkt2rsp_empty:1769 Assertion Error: memcmp(&rxs_backup,
>> xnb_unit_pvt.rxs, sizeof(rxs_backup)) == 0
>> +52 Tests Passed
>> +1 Tests FAILED
> So you have failed tests for netback. Maybe the issue is with
> netback rather than blkback.
I am not sure where the problem is; maybe there are two or more 
problems. Should the network-related and disk-related issues be 
discussed separately?


>> What was suspicious about this is that I am using 13.0-RELEASE-p7, not a
>> DEV version of XEN or anything, yet there are sysctl variables with a
>> "dev" prefix. Could the ports have accidentally compiled the XEN version
>> with some development flags turned on?
> 'dev' here means device, not developer.
Okay. Understood.


>> So those were my conclusions.
>>
>> What do you think? How should we proceed with this? Should i try to somehow
>> build XEN from git sources?
> I think that's unlikely to make any difference. I would think the
> problem is with FreeBSD rather than Xen. Can you paste the config file
> you are using to create the domain(s)?
Yet again, as I said, I do not know where the problem is. My wording 
might be off when pointing at Xen, since it is the FreeBSD Dom0 that 
panics, not Xen. It's just that I do not know the boundaries and 
responsibilities of the components clearly enough, and by mentioning 
Xen I often mean the whole setup, or the FreeBSD modules that interact 
with Xen.


> I've tried myself to trigger this crash by creating and destroying a
> guest in a loop but didn't manage to trigger it. I was able to create
> (and shutdown) 1328 guests successfully on a ZFS dom0 using 4GB of
> RAM.
No, just turning a single VM on/off does not crash the system; I had 
tried that as well before reporting the bug. I attached all the /bin/sh 
scripts and configs that crash the system to the tar archive in bug 
report #261059. Those scripts are really simple. I've been trying to 
minimize/trim everything off since then, but the simplest requirement 
for a crash is executing 3 processes in parallel:

1) a loop that creates ZFS volumes of at least 2GB and writes data into 
them with dd; /dev/zero is a good source. I've never run out of space, 
so the crash is not due to exhausted space in the ZFS pool.
2) a loop that creates VM1 with a single disk, connects to that VM 
through ssh, writes 3GB of data into /tmp, reboots, starts the VM, 
removes the file from /tmp, then reboots.
3) a loop that starts/stops VM2, which has 5 disks.
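For reference, loop 3 is little more than the following sketch (not my 
exact script; the xl command is passed as a parameter so the loop can 
be dry-run, and the config/guest names below match my VM2 naming only 
by assumption):

```shell
#!/bin/sh
# Sketch of loop 3: repeatedly create and shut down a guest.
# Usage: stress_vm_cycle XL_CMD CONFIG GUEST_NAME ITERATIONS
# XL_CMD is a parameter, e.g. "xl" for a real run or "echo" for a dry run.
stress_vm_cycle() {
    xl_cmd=$1; conf=$2; name=$3; n=$4
    i=0
    while [ "$i" -lt "$n" ]; do
        $xl_cmd create "$conf"              # start the guest
        $xl_cmd shutdown -w -F "$name"      # wait for it to power off
        i=$((i + 1))
    done
}

# A real run would look roughly like:
# stress_vm_cycle xl xen-vm2-zvol.conf xen-vm2-zvol 1000
```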

As for panic timing: less RAM for Dom0 means a faster panic, with the 
same panic message. With 8GB RAM on my machine I have to wait approx. 
10 hours for the panic, with 4GB around 6 hours, and on a 
resource-starved 2GB Dom0 even less, but the panic messages are more or 
less the same. I have crashed it with step 1 omitted, but it takes 
longer. Running just the on/off loops for both VMs does not crash the 
system, so it seems that some sort of ZFS load (inside the VM or in 
Dom0) is essential for the panic to occur. I have managed to crash it 
with a VM that has 2 disks, so it does not seem to be due to a 4-disk 
threshold of emulated/virtio disks; it's just that with 5 disks it 
takes less time until the crash. Usually the panic occurs once 
xbbd.$value reaches $value > 230.


> Do you have some kind of configuration that differs from the default
> for ZFS for example? (ie: setting a different ARC size)
I have no specific settings; everything is basically as described in the manual pages.
# cat /etc/sysctl.conf
vm.max_user_wired=-1
vm.max_wired=-1
kern.panic_reboot_wait_time=-1
vfs.zfs.vol.mode=2 # minimize problems with bug #262189 (default value still crashes system)

# /etc/rc.conf
cloned_interfaces="bridge0"
ifconfig_bridge0=".."
zfs_enable="YES"
xencommons_enable="YES"
ifconfig_em0=".."
defaultrouter=".."
sshd_* # configs
sendmail_* # configs that should not influence anything


# cat /boot/loader.conf
autoboot_delay="1"
beastie_disable="YES"
autoboot_delay="5"
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="9600"
console="comconsole,vidconsole"

xen_kernel="/boot/xen"
xen_cmdline="dom0_mem=8192M dom0_max_vcpus=2 dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1 noreboot"
hw.usb.no_boot_wait=1


# cat xen-vm1-zvol.conf
memory = 1024
vcpus=1
name = "xen-vm1-zvol"
type = "hvm"
boot = "dc"
hdtype="ahci"

vif = [ "type=vif,bridge=bridge0" ]
disk = [ "backendtype=phy, format=raw, vdev=xvda, target=/dev/zvol/zpool/vmdk/dev/vm1-root" ]

vnc=1
vnclisten="0.0.0.0:7"
serial = ["pty"]

The VM2 config has the same shape, just with 5 disks and different 
names, paths and values.

> Are you ballooning the amount of memory assigned to dom0?
No; unless ballooning happens by default, I am not using it.


I saw Brian's message about the power-off panic case he experienced, 
but there was nothing I could add to it. The way I power off VMs is 
with the command:
xl shutdown -w -F $vm_name

I intend to change the testing scripts from 'xl shutdown' to an ssh 
'init 0' plus an 'xl list' wait loop, to rule out the 'xl shutdown' 
command as a possible cause of the problem.
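Roughly like the sketch below (not my final script; the ssh and xl 
commands are passed as parameters so the logic can be exercised without 
a hypervisor, and the guest address is hypothetical):

```shell
#!/bin/sh
# Sketch of the planned replacement for 'xl shutdown -w -F': ask the
# guest itself to power off over ssh, then poll 'xl list' until the
# domain no longer exists.
# Usage: poweroff_and_wait SSH_CMD XL_CMD GUEST_NAME GUEST_ADDR
poweroff_and_wait() {
    ssh_cmd=$1; xl_cmd=$2; name=$3; addr=$4
    $ssh_cmd "root@$addr" 'init 0'
    # 'xl list <name>' exits nonzero once the domain is gone.
    while $xl_cmd list "$name" >/dev/null 2>&1; do
        sleep 5
    done
}

# A real run would look roughly like:
# poweroff_and_wait ssh xl xen-vm1-zvol 192.0.2.10
```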

With the given scripts/setup I can crash Xen + FreeBSD 12.1, 12.2 or 
13.0 RELEASE as Dom0.

There are some things I do not know whether to mention, since I cannot 
reliably repeat them. There are times when, after a while, the OOM 
killer starts to kill qemu-system-i386 processes, or the ssh session 
that initiated the 'xl create' command, although it seems there is 
plenty of RAM available; usually not long after that, if the testing 
script is started again or xl create/destroy is run manually, the 
system crashes. Sometimes it feels that after a crash, the next crash 
happens sooner. There was a case when an 8GB RAM system crashed approx. 
8 minutes after booting and starting those scripts, when it had 
already crashed previously. I cannot think of any reason for that... 
can ZFS be left in a dirty state if Xen crashes? To rule that out, I 
installed FreeBSD on a separate disk (UFS) and always recreate the 
testing pool from scratch.

If you have no time to set up the system, I would be willing to help 
with that: I can drop an SSH pub key and then connect at some arranged 
time; the setup takes around 30-40 mins. Nothing much is necessary, 
just some HDDs to test on, a ZFS pool, and FreeBSD images, either ISOs 
or prepared VMs. Other than that, I have exposed all the config changes 
I have made; in general it's just installed the way described in the 
manual, and most things are at defaults.

I somehow do not feel that my hardware is at fault, since I have tested 
on three different machines from different manufacturers. But maybe 
they all have some common BIOS setting that causes this problem; I am 
not sure.