From nobody Fri Jun 24 01:30:56 2022
Message-Id: <202206240130.25O1Uul3002911@nfbcal.org>
From: Brian Buhrow <buhrow@nfbcal.org>
Date: Thu, 23 Jun 2022 18:30:56 -0700
To: freebsd-xen@freebsd.org
Cc: buhrow@nfbcal.org
Subject: Some kind of race condition in adding and removing domu's causes vm zombies
List-Archive: https://lists.freebsd.org/archives/freebsd-xen

hello.  I don't have a lot more details on the issue, but under Xen 4.15 and Xen 4.16 with FreeBSD 12 and FreeBSD 13, it's pretty easy to end up with zombie domUs that are unkillable and unrestartable.  Even worse, the block devices associated with these not-quite-gone domUs are unusable with other domUs without an entire system reboot.

How to reproduce:

1.  Shut down a VM that's currently running.  I'm using NetBSD, but FreeBSD domUs will demonstrate this behavior as well.

2.  If auto-restart is set in the domU's conf file, the domU will restart with a new domain ID.

3.  Just as the newly restarted domU is coming up, issue an xl destroy against it.

You may see output like the following:

root# xl destroy 20
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vbd/20/768
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vif/20/0
libxl: error: libxl_domain.c:1530:devices_destroy_cb: Domain 20:libxl__devices_destroy failed

Now, issue:

# xl list
(null)                                      20     0     1     --p--d    2083.7

The workaround I've found for this issue is to shut the domU down with the -h flag, causing the system to wait for a final keypress on the console before rebooting.  Then, while it's waiting, issue the xl destroy command on the old, waiting domain ID.
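The destroy step of that workaround can be sketched as a small dom0 helper.  Everything here is my own illustration, not something from the original setup: the function name and the "mydomu" domain are placeholders, and I'm assuming the second column of "xl list" holds the numeric domain ID, as in the listing above.

```shell
# Rough sketch (hypothetical names): look up the *current* numeric ID for
# a domain name via "xl list" and destroy exactly that ID, so a freshly
# auto-restarted instance (which comes up with a new ID) is never targeted.
destroy_domu_by_id() {
    name="$1"
    # "xl list" prints: Name  ID  Mem  VCPUs  State  Time(s);
    # take column 2 (the numeric domain ID) of the row matching the name.
    id=$(xl list | awk -v n="$name" '$1 == n { print $2 }')
    if [ -z "$id" ]; then
        echo "no domain named $name" >&2
        return 1
    fi
    xl destroy "$id"
}
```

So, after halting the guest from inside with its own shutdown -h (leaving it parked at the "press any key" console prompt), one would run something like "destroy_domu_by_id mydomu" in dom0 while the old domain is still waiting.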
This workaround prevents the issue, but it's my view that I shouldn't be able to wedge the destruction process in this way to the point that the entire machine needs to be restarted.  Being able to do this makes the system rather fragile.
-thanks
-Brian