[Bug 261059] Kernel panic XEN + ZFS volume.

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 09 Jan 2022 13:21:17 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261059

            Bug ID: 261059
           Summary: Kernel panic XEN + ZFS volume.
           Product: Base System
           Version: 13.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: zedupsys@gmail.com

Created attachment 230842
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=230842&action=edit
all config and test script files

Broadly described, the problem is simple: the whole system reboots uncontrollably at
unexpected/unwanted times.

The Xen virtualization toolstack is used. FreeBSD runs as Dom0 (PVH) and hosts
FreeBSD VMs as DomU (HVM). Dom0 uses the ZFS file system, and the disks for the
DomUs are exposed as block devices backed by ZFS volumes.

I have not been able to narrow down which area is causing the crash. At first I
thought this was a Xen-related problem, but the more I tested, the more it started
to feel ZFS-related as well; some sort of concurrency issue. While investigating I
created some scripts (attached), which, at least on my testing hardware, crash the
system most of the time when run.

Based on my observations, the most effective way to crash the system is to run
three scripts in parallel as root:
1) one that creates 2GB ZFS volumes and copies data from an IMG file onto each ZVOL
with dd,
2) one that turns VM1 on/off,
3) one that turns VM2 on/off, where VM2 has at least 5 disks.

This is not the only way, but it crashes the system faster than the others.


System hardware:
CPU: Intel(R) Xeon(R) CPU X3440  @ 2.53GHz
RAM: 16GB ECC
HDD: 2x WDC WD2003FYYS 2TB


The system was installed from FreeBSD-13.0-RELEASE-amd64-dvd1.iso with all defaults
except the IP address and some basic configuration. The ZFS pool was created
automatically with the name sys. The Xen toolstack was installed with pkg install,
and freebsd-update has been run.


root@lab-01 > uname -a
FreeBSD lab-01.b7.abj.lv 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24
07:33:27 UTC 2021    
root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC 
amd64


root@lab-01 > freebsd-version
13.0-RELEASE-p5


root@lab-01 > zpool status
  pool: sys
 state: ONLINE
  scan: resilvered 3.70M in 00:00:03 with 0 errors on Fri Jan  7 11:06:07 2022
config:

        NAME          STATE     READ WRITE CKSUM
        sys           ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            gpt/sys0  ONLINE       0     0     0
            gpt/sys1  ONLINE       0     0     0

errors: No known data errors


root@lab-01 > pkg info
argp-standalone-1.3_4          Standalone version of arguments parsing
functions from GLIBC
ca_root_nss-3.69_1             Root certificate bundle from the Mozilla Project
curl-7.79.1                    Command line tool and library for transferring
data with URLs
edk2-xen-x64-g202102           EDK2 Firmware for xen_x64
gettext-runtime-0.21           GNU gettext runtime libraries and programs
glib-2.70.1,2                  Some useful routines of C programming (current
stable version)
indexinfo-0.3.1                Utility to regenerate the GNU info page index
libevent-2.1.12                API for executing callback functions on events
or timeouts
libffi-3.3_1                   Foreign Function Interface
libiconv-1.16                  Character set conversion library
libnghttp2-1.44.0              HTTP/2.0 C Library
libssh2-1.9.0_3,3              Library implementing the SSH2 protocol
libxml2-2.9.12                 XML parser library for GNOME
lzo2-2.10_1                    Portable speedy, lossless data compression
library
mpdecimal-2.5.1                C/C++ arbitrary precision decimal floating point
libraries
pcre-8.45                      Perl Compatible Regular Expressions library
perl5-5.32.1_1                 Practical Extraction and Report Language
pixman-0.40.0_1                Low-level pixel manipulation library
pkg-1.17.5                     Package manager
python38-3.8.12                Interpreted object-oriented programming language
readline-8.1.1                 Library for editing command lines as they are
typed
seabios-1.14.0                 Open source implementation of a 16bit X86 BIOS
tmux23-2.3_1                   Terminal Multiplexer (old stable version 2.3)
vim-8.2.3458                   Improved version of the vi editor (console
flavor)
xen-kernel-4.15.0_1            Hypervisor using a microkernel design
xen-tools-4.15.0_2             Xen management tools
yajl-2.1.0                     Portable JSON parsing and serialization library
in ANSI C
zsh-5.8                        The Z shell


root@lab-01 > cat /boot/loader.conf
zfs_load="YES"
vfs.root.mountfrom="zfs:sys"

beastie_disable="YES"
autoboot_delay="5"

boot_multicons="YES"
boot_serial="YES"
comconsole_speed="9600"
console="comconsole,vidconsole"

xen_kernel="/boot/xen"
xen_cmdline="dom0_mem=2048M cpufreq=dom0-kernel dom0_max_vcpus=2 dom0=pvh
console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all"

hw.usb.no_boot_wait=1


root@lab-01 > cat /etc/rc.conf
hostname="lab-01.b7.abj.lv"

cloned_interfaces="bridge10"

create_args_bridge10="name xbr0"
cloned_interfaces_sticky="YES"

ifconfig_xbr0="inet 10.63.0.1/16"

zfs_enable="YES"
sshd_enable="YES"

xencommons_enable="YES"


Besides the default ZFS dataset mounted at /, I have created a parent dataset for
the VM ZVOLs and one for working files in the /service directory.
root@lab-01 > zfs list
NAME           USED  AVAIL     REFER  MOUNTPOINT
sys           98.6G  1.66T     1.99G  /
sys/service   96.6G  1.66T     96.6G  /service
sys/vmdk        48K  1.66T       24K  none
sys/vmdk/dev    24K  1.66T       24K  none

# zfs create -o mountpoint=none sys/vmdk
# etc.
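
The full set of create commands was roughly the following (reconstructed from the
zfs list output above; exact options may have differed):

# zfs create -o mountpoint=/service sys/service
# zfs create -o mountpoint=none sys/vmdk
# zfs create sys/vmdk/dev

(sys/vmdk/dev inherits mountpoint=none from its parent.)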

I run the scripts from the /service/crash directory, so on a fresh system the
attachments can simply be placed there. The scripts need an SSH key, so create one
with ssh-keygen.
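
For example, assuming the default key location under /root/.ssh (the exact key type
and path the scripts expect are in lib.sh):

# ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa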


Attached file descriptions:
lib.sh - reusable functions for the tests and for VM preparation; used by the test
scripts and manually.
libexec.sh - a wrapper that takes its first argument as the name of a function from
lib.sh to call; used for manual function calls.
test_vm1_zvol_on_off.sh - in a loop: boot VM1, sleep, power VM1 off.
test_vm2_zvol_on_off.sh - in a loop: boot VM2, sleep, power VM2 off.
test_vm2_zvol_5_on_off.sh - turns on/off VM2, which has 5 HDDs.
test_vm1_zvol_3gb.sh - turns VM1 on/off and writes/removes a 3GB file in VM1:/tmp.
xen-vm1-zvol.conf - Xen config file for VM1.
xen-vm2-zvol.conf - Xen config file for VM2.
xen-vm2-zvol-5.conf - Xen config file for VM2 with 5 HDDs.


To create the VMs, with all the attached files in /service/crash, run as root:
./libexec.sh vm1_img_create
./libexec.sh vm2_img_create

These commands create the VM1 and VM2 disk images, set the internal IP as defined
in lib.sh and copy the SSH key from the host's /root/.ssh into the VM disks. The VM
image is downloaded from
https://download.freebsd.org/ftp/releases/VM-IMAGES/13.0-RELEASE/amd64/Latest/FreeBSD-13.0-RELEASE-amd64.raw.xz,
so a network connection is necessary, or the file FreeBSD-13.0-RELEASE-amd64.raw.xz
must already be placed in the folder /service/crash/cache.

Then, to convert the IMGs to ZVOLs:
./libexec.sh vm1_img_to_zvol
./libexec.sh vm2_img_to_zvol

Sometimes at this point dd fails with an error that /dev/zvol/sys/vmdk/dev/vm1-root
is not accessible. There seems to be some ZFS bug here, but I could not reproduce
it reliably enough to file a separate report. If it happens, just reboot the
system; the device node will show up, then rerun the command.
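
An untested workaround idea (instead of rebooting) would be to wait for the device
node to appear before rerunning, if it ever shows up without a reboot:

# while [ ! -c /dev/zvol/sys/vmdk/dev/vm1-root ]; do sleep 1; done
# ./libexec.sh vm1_img_to_zvol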

Create the dummy disks for VM2's data:
./libexec.sh vmdk_empty_create vm2-data1.img 2G
./libexec.sh vmdk_empty_create vm2-data2.img 2G
./libexec.sh vmdk_empty_create vm2-data3.img 2G
./libexec.sh vmdk_empty_create vm2-data4.img 2G

./libexec.sh vm2_data_to_zvol

Now that everything is prepared, just test the VMs with:
xl create xen-vm1-zvol.conf

To see that the VM boots, run:
xl console xen-vm1-zvol

It is necessary to connect over SSH manually once, to ensure that the connection
works and that SSH updates /root/.ssh/known_hosts.
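
For example (the VM's IP address is whatever lib.sh configures; 10.63.0.10 below is
only a placeholder):

# ssh root@10.63.0.10 uname -a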

Before the tests start, the expected ZFS layout is:
root@lab-01 #1> zfs list
NAME                     USED  AVAIL     REFER  MOUNTPOINT
sys                      142G  1.62T     1.99G  /
sys/service              111G  1.62T      111G  /service
sys/vmdk                28.9G  1.62T       24K  none
sys/vmdk/dev            28.9G  1.62T       24K  none
sys/vmdk/dev/vm1-root   10.3G  1.62T     5.07G  -
sys/vmdk/dev/vm2-data1  2.06G  1.62T       12K  -
sys/vmdk/dev/vm2-data2  2.06G  1.62T     2.00G  -
sys/vmdk/dev/vm2-data3  2.06G  1.62T     2.00G  -
sys/vmdk/dev/vm2-data4  2.06G  1.62T       12K  -
sys/vmdk/dev/vm2-root   10.3G  1.62T     5.07G  -

And the device directory:
# ls -la /dev/zvol/sys/vmdk/dev/
total 1
dr-xr-xr-x  2 root  wheel      512 Jan  9 14:27 .
dr-xr-xr-x  3 root  wheel      512 Jan  9 14:27 ..
crw-r-----  1 root  operator  0x72 Jan  9 14:27 vm1-root
crw-r-----  1 root  operator  0x70 Jan  9 14:27 vm2-data1
crw-r-----  1 root  operator  0x71 Jan  9 14:27 vm2-data2
crw-r-----  1 root  operator  0x75 Jan  9 14:27 vm2-data3
crw-r-----  1 root  operator  0x73 Jan  9 14:27 vm2-data4
crw-r-----  1 root  operator  0x74 Jan  9 14:27 vm2-root

For me, ZVOLs are sometimes missing from the /dev/zvol directory (vm2-data1 or
vm2-data3), even though zfs list shows them, so an init 6 is needed before the
tests can be started.
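
A quick way to cross-check which volumes are missing their device nodes (not part
of the attached scripts, just a sanity check):

# zfs list -t volume -H -o name | sed 's|^|/dev/zvol/|' | xargs ls -l

Any missing node shows up as a "No such file or directory" error from ls.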


Once the environment is ready, run the following commands from three different SSH
sessions:
1) cd /service/crash; ./libexec.sh zfs_volstress
2) cd /service/crash; ./test_vm1_zvol_on_off.sh
3) cd /service/crash; ./test_vm2_zvol_5_on_off.sh
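
Since tmux is installed anyway (see the pkg info above), the three loops can also
be started from a single session, roughly like this:

# tmux new-session -d -s crash 'cd /service/crash && ./libexec.sh zfs_volstress'
# tmux split-window -t crash 'cd /service/crash && ./test_vm1_zvol_on_off.sh'
# tmux split-window -t crash 'cd /service/crash && ./test_vm2_zvol_5_on_off.sh'
# tmux attach -t crash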

Sometimes it crashes fast (within 2 minutes); sometimes it takes longer, around 30
minutes.

My observations so far:

1. ZVOLs are acting weird; for example, at some point I see output like this:

./libexec.sh: creating sys/stress/data1 2G
dd: /dev/zvol/sys/stress/data1: No such file or directory
./libexec.sh: creating sys/stress/data2 2G
4194304+0 records in
4194304+0 records out
2147483648 bytes transferred in 70.178650 secs (30600241 bytes/sec)
./libexec.sh: creating sys/stress/data3 2G
4194304+0 records in
4194304+0 records out
2147483648 bytes transferred in 73.259213 secs (29313496 bytes/sec)
./libexec.sh: creating sys/stress/data4 2G
dd: /dev/zvol/sys/stress/data4: Operation not supported
./libexec.sh: creating sys/stress/data5 2G
dd: /dev/zvol/sys/stress/data5: Operation not supported
./libexec.sh: creating sys/stress/data6 2G

To me this seems like unexpected behaviour, since each zfs create has returned
before the corresponding dd is run; from the user's perspective nothing is done in
parallel. See the zfs_volstress function in the lib.sh file.
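
For reference, what zfs_volstress does per volume is roughly equivalent to the
following (simplified; the source image path is a placeholder, the real loop is in
lib.sh):

# zfs create -V 2G sys/stress/data1
# dd if=/service/crash/cache/source.img of=/dev/zvol/sys/stress/data1

which, to me, suggests that the freshly created ZVOL's device node is not yet
usable when dd starts.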


2. Often, but not always, there are problems starting VM2 before the system crash.
Output:

libxl: error: libxl_device.c:1111:device_backend_callback: Domain 53:unable to
add device with path /local/domain/0/backend/vbd/53/51712
libxl: error: libxl_create.c:1613:domcreate_launch_dm: Domain 53:unable to add
disk devices
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 53:Non-existant
domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 53:Unable to
destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 53:Destruction of
domain failed
./test_vm2_zvol_single_hdd_on_off.sh: waiting VM to be ready

Sometimes the script ./test_vm2_zvol_single_hdd_on_off.sh must be restarted,
because it is not smart about waiting for VM2 to start.


3. It is not necessary for VM2 to have 5 disks to crash the system; even running
1) cd /service/crash; ./libexec.sh zfs_volstress
2) cd /service/crash; ./test_vm1_zvol_on_off.sh
3) cd /service/crash; ./test_vm2_zvol_on_off.sh

will crash the system eventually, but it takes much longer; for me it sometimes
takes 2-3 hours.


4. If only test_vm1_zvol_on_off and test_vm2_zvol_on_off are running, the system
does not seem to crash, or maybe I did not wait long enough; I waited a whole day.
Thus the ZFS load seems essential to provoke the panic.


5. It is possible to crash the system with only 2 scripts:
1) cd /service/crash; ./test_vm1_zvol_3gb.sh (this writes 3GB data inside
VM1:/tmp)
2) cd /service/crash; ./test_vm2_zvol_5_on_off.sh

Writing larger files inside VM1 tends to provoke the panic sooner; with 1GB files I
could not reproduce the case often enough.
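
The in-VM write that test_vm1_zvol_3gb.sh performs is essentially just a large file
written to and removed from /tmp over SSH; something along these lines (the exact
command is in the attached script, <vm1-ip> is a placeholder):

# ssh root@<vm1-ip> 'dd if=/dev/zero of=/tmp/big.dat bs=1m count=3072 && rm /tmp/big.dat'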

The problem is that there is little info when the system crashes. I am open to
advice on how to capture more useful data; below are some incomplete fragments from
the serial output that seemed interesting to me:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x30028
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c45832
stack pointer           = 0x28:0xfffffe00967ec930
frame pointer           = 0x28:0xfffffe00967ec930
cod


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff80c45832
stack pointer           = 0x28:0xfffffe009666b930
frame pointer           = 0x28:0xfffffe009666b930
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0,


Fatal trap 12: page fault w


(d2) Booting from Hard Disk...
(d2) Booting from 0000:7c00
(XEN) d1v0: upcall vector 93
(XEN) d2v0: upcall vector 93
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
xbbd4: Error 12 Unable to allocate request bounce buffers
xbbd4: Fatal error. Transitioning to Closing State
panic: pmap_growkernel: no memory to grow kernel
cpuid = 0
time = 1641731595
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff81073eed at pmap_growkernel+0x27d
#4 0xffffffff80f2da88 at vm_map_insert+0x248
#5 0xffffffff80f301e9 at vm_map_find+0x549
#6 0xffffffff80f2bf16 at kmem_init+0x226
Loading /boot/loader.conf.local



I am interested in solving this. This is a testing machine, so I can run modified
tests at any time. But I am somewhat out of ideas about what could be done to get
more verbose output, so that at least complete messages are written to the serial
output before the automatic reboot happens.
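
A standard way to capture more data would probably be to enable kernel crash dumps
(I have not verified how well this works under a Xen Dom0 PVH):

# sysrc dumpdev=AUTO
# service dumpon start

After the next panic, savecore(8) should place a vmcore in /var/crash, which could
then be inspected with kgdb or crashinfo(8).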

As for "panic: pmap_growkernel: no memory to grow kernel", for me it seemed
that it should be enough that Dom0 has 8GB RAM, and each VM 1GB. But i do not
claim that i am XEN expert and maybe this could be clasified as
misconfiguration of system. If so, i am open to pointers what could be done to
make system more stable.

The same scripts can crash 12.1-RELEASE as well; tested.

-- 
You are receiving this mail because:
You are the assignee for the bug.