Re: Some kind of race condition in adding and removing domu's causes vm zombies

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Tue, 28 Jun 2022 11:38:59 UTC
On Thu, Jun 23, 2022 at 06:30:56PM -0700, Brian Buhrow wrote:
> 	hello.  I don't have a lot more details on the issue, but under xen-4.15 and xen-4.16 with
> FreeBSD-12 and FreeBSD-13 it's pretty easy to end up with zombie domUs that are unkillable
> and unrestartable.  Even worse, the block devices associated with these not-quite-gone domUs
> are unusable by other domUs without an entire system reboot.
> 
> 	How to reproduce:
> 
> 1.  Shut down a VM that's currently running.  I'm using NetBSD, but FreeBSD domUs will
> demonstrate this behavior as well.
> 
> 
> 2.  If auto-restart is set in the domU's conf file, the domU will restart with a new domain id.
> 
> 3.  Just as the newly restarted domU is coming up, issue:
> xl destroy <domid-of-newly-started-domain>
> 
> You may see output like the following:
> 
> root# xl destroy 20
> libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device
> with path /local/domain/0/backend/vbd/20/768
> libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device
> with path /local/domain/0/backend/vif/20/0
> libxl: error: libxl_domain.c:1530:devices_destroy_cb: Domain 20:libxl__devices_destroy failed
> 
> Now, issue:
> #xl list
> (null)                                      20     0     1     --p--d    2083.7
> 
> The workaround I've found for this issue is to shut down the domU with the -h flag, causing the
> system to wait for a final keypress on the console before rebooting.  Then, while it's waiting,
> issue the xl destroy command on the old, waiting, domain ID.
> 
> This workaround will prevent the issue, but in my view I shouldn't be able to wedge the
> destruction process in this way such that the entire machine needs to be restarted.  Being able
> to do this makes the system rather fragile.

Hm, I don't seem to be able to reproduce this on HEAD.  Could you give
a HEAD kernel a try and see whether you can reproduce it? (Keeping the
same userland should be fine.)

Thanks, Roger.