[Bug 265196] talos linux vms hang on reboot at the com ports, need to reboot the host to clear it up

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 21 Jul 2022 18:47:42 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=265196

--- Comment #23 from John Baldwin <jhb@FreeBSD.org> ---
So in the case that bhyvectl hangs, from the procstat -kk output, bhyvectl is
waiting because some other process has the /dev/vmm/<vmname> file still open. 
For that case, you can try using 'sudo procstat -af | grep <vmname>' to see
which processes still have it open and are preventing bhyvectl from exiting.

For the case where you had a bhyve exit of 4 and an error of 'vm_open: No such
file or directory', that may be a race with the async destroy used on 13.1
for bhyve (since fixed in 14 and stable/13 so that bhyvectl now sleeps
waiting for the --destroy request to finish before returning).

The return value of 134 is due to abort() and is the triple fault case you
have logs of in bhyve.log.  A triple fault isn't a crash of bhyve, that is a
bit of an old-school way to reboot an x86 computer.  It's perhaps a bit odd
that a Linux guest would use that to reboot vs more conventional means. 
However, you shouldn't have to reboot the host machine just because the guest
exits due to a triple fault.  You should be able to restart the VM again
without rebooting the host.  Here I use "host" to mean the FreeBSD machine
running bhyve VMs.
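
As a rough sketch of how a wrapper script could interpret these exit values:
a status above 128 is the shell's convention for "killed by signal (status -
128)", so 134 corresponds to signal 6, which is SIGABRT on FreeBSD.  The
function name describe_bhyve_exit is hypothetical, and the meanings given for
statuses 0-4 are my reading of the EXIT STATUS section of bhyve(8); check them
against the man page for your release.

```shell
#!/bin/sh
# Sketch only: map a bhyve exit status to a short description.
# describe_bhyve_exit is a hypothetical helper, not part of bhyve.
describe_bhyve_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means the process died from signal N.
        echo "killed by signal $((status - 128))"
        return
    fi
    # Assumed meanings, taken from bhyve(8) EXIT STATUS; verify locally.
    case "$status" in
        0) echo "rebooted" ;;
        1) echo "powered off" ;;
        2) echo "halted" ;;
        3) echo "triple fault" ;;
        4) echo "exited due to an error" ;;
        *) echo "unknown status $status" ;;
    esac
}

describe_bhyve_exit 134   # signal 6 is SIGABRT, i.e. abort()
describe_bhyve_exit 4     # the 'vm_open' failure case
```

A restart loop around bhyve could use a check like this to decide whether to
start the VM again or to stop and leave the logs for inspection.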

Looking again, it seems like the talos upgrade is perhaps trying to use kexec
to upgrade instead of a real reboot, and that the second Linux kernel is
perhaps crashing (and not trying to use a triple fault to reboot).  Given the
turnaround times for VM booting, you don't really need kexec for VMs.  If you
want to debug this you will have to debug the crash that happens in the second
Linux kernel.  It may be that there is something bhyve isn't emulating quite
right that results in the triple fault, but it will be hard to know what that
is from the bhyve side.  I would see if there's a way to configure talos to not
use kexec and just use "plain" reboots for upgrades instead.
