Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Mon, 21 Mar 2022 11:14:16 UTC
On Fri, Mar 18, 2022 at 08:48:08PM +0200, Ze Dupsys wrote:
> On 2022.03.18. 17:24, Roger Pau Monné wrote:
> > On Tue, Mar 15, 2022 at 08:51:57AM +0200, Ze Dupsys wrote:
> > > On 2022.03.14. 11:19, Roger Pau Monné wrote:
> > > > On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote:
> > > > > ..
> > > > > 
> > > > > Why do those lines starting with "xnb(xnb_detach:1330):" not have any
> > > > > message? Could it be that there is a bad pointer to the message buffer
> > > > > that cannot be printed? And then sometimes a panic happens because an
> > > > > access goes outside the allowed memory region?
> > > > Some messages in netback are just "\n", likely leftovers from debug.
> > > Okay, found the lines, it is as you say. So this will not be an easy one.
> > > 
> > > 
> > > > Can you try to stress the system again but this time with guests not
> > > > having any network interfaces? (so that netback doesn't get used in
> > > > dom0).
> > > I'll try to come up with something. At the moment all commands to VMs are
> > > given through ssh.
> > > 
> > > 
> > > > Then if you could rebuild the FreeBSD dom0 kernel with the above patch
> > > > we might be able to get a bit more info about blkback shutdown.
> > > I rebuilt 13.1 STABLE, commenting out the #undef and adding a #define, so
> > > line numbers will differ by a single line. For this test i did not remove
> > > network interfaces, and i also added DPRINTF messages to the xnb_detach
> > > function, since i hoped to maybe catch something there by printing
> > > pointers. I did not like that xnb_detach checks neither for a NULL return
> > > from device_get_softc nor for the device_t argument, so i thought maybe
> > > those crashes are related to that. But i guess this will not be so easy,
> > > and maybe it is safe to assume that "device_t dev" is always valid in that
> > > context.
> > > 
> > > So i ran the stress test. The system did not crash; as often happens when
> > > more debugging info is printed, the characteristics change. But it did
> > > leak sysctl xbbd variables. I'll attach all collected log files. The
> > > sysctl and xl list commands differ in timing a little bit. xl list _02 is
> > > when all VMs are turned off. Sysctl only lists keys without values, so as
> > > not to trigger xnb tests while reading all the values.
> > 
> > So I've been staring at this for a while, and I'm not yet sure I
> > figured out exactly what's going on, but can you give a try to the
> > patch below?
> > 
> > Thanks, Roger.
> > ---8<---
> > diff --git a/sys/xen/xenbus/xenbusb.c b/sys/xen/xenbus/xenbusb.c
> > index e026f8203ea1..a8b75f46b9cc 100644
> > --- a/sys/xen/xenbus/xenbusb.c
> > +++ b/sys/xen/xenbus/xenbusb.c
> > @@ -254,7 +254,7 @@ xenbusb_delete_child(device_t dev, device_t child)
> >   static void
> >   xenbusb_verify_device(device_t dev, device_t child)
> >   {
> > -	if (xs_exists(XST_NIL, xenbus_get_node(child), "") == 0) {
> > +	if (xs_exists(XST_NIL, xenbus_get_node(child), "state") == 0) {
> >   		/*
> >   		 * Device tree has been removed from Xenbus.
> >   		 * Tear down the device.
> > 
> 
> Hi, Roger,
> 
> Yes, i will run it, but results will likely be on Monday.
> 
> I have done some more digging, but since i do not know how things work
> internally, i have a lot of unknowns and assumptions. I did a lot of DPRINTF
> stuff in various places; i will attach logs, maybe they give a better
> picture.
> 
> For easier log reading i filter a single log pair with these commands:
> sed -i '' -e 's/0xfffff80/0x/g' *
> cat *xen_serial_* | grep -E 'xbb|talkv|xs_read|xs_gather|xenbusb_add_device' \
>     > xbb.log
> cat *sysctl* | grep -E '^dev.xbb' | \
>     grep -Ev "\.%location:|\.%pnpinfo:|\.%parent" > xbb_sysctl.log
> 
> Then i search for the leaked sysctl *.xenstore_path and look that up in
> xbb.log. As far as i can tell, the leaked xbb's are the ones that have no
> xbb_close_backend call, for whatever reason. Often the pointers for leaked
> xbb's are reused ones from previously closed xbb instances.
> 
> 
> My suspicion falls on these blkback.c, xbb_attach_disk lines:
> ..
>   error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
>       NULL, &xbb->dev_name, NULL);
>   if (error != 0)
>     return;
> ..
> 
> The thing is that for those lost xbb's, this is where error = 2, which is
> ENOENT (if i found the correct code).
> 
> Which is weird, because i assume that the xbb_attach_disk function is called
> after the "hot-plug script" has finished execution. This is set up in
> xbb_attach(device_t dev):
> ..
>   KASSERT(!xbb->hotplug_done, ("Hotplug scripts already executed"));
>   watch_path = xs_join(xenbus_get_node(xbb->dev), "physical-device-path");
>   xbb->hotplug_watch.callback_data = (uintptr_t)dev;
>   xbb->hotplug_watch.callback = xbb_attach_disk;
> ..
> 
> So i added debugging messages to the file /usr/local/etc/xen/scripts/block
> after the line "xenstore-write $path/physical-device-path $params", but did
> not see anything suspicious.
> 
> One of my ideas is that maybe for some reason the callback is called too
> early, because i would expect that when the callback is called, the path
> $path/physical-device-path should already exist (it is created by the
> script). But then how can xs_gather fail with ENOENT there? Does something
> manage to remove $path/physical-device-path in the short window between the
> xenstore-write and the callback firing?
> 
> Another thing is that the function xbb_attach_disk just returns if "error =
> xs_gather" yields != 0. I would expect it to handle the error gracefully: it
> should flag the xbb with an error and then shut down or something, but since
> i do not know the internals, it is hard to judge the correct behaviour in
> that context.
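
A minimal sketch of what a more graceful failure path could look like at
that point (illustrative only, not the actual blkback code; it assumes the
xenbus_dev_fatal() helper from xenbusvar.h is appropriate here):

  error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
      NULL, &xbb->dev_name, NULL);
  if (error != 0) {
          /* Report the failure and move the device towards Closing
           * instead of silently leaving it half-attached. */
          xenbus_dev_fatal(xbb->dev, error,
              "reading physical-device-path from xenstore");
          return;
  }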
> 
> Another interesting behaviour is that xbb_shutdown is called twice for each
> instance. Once from some function and then from xbb_detach.

This is expected. On a proper domain shutdown the backend will be closed
when the frontend switches to the Closing state. Hence the call from
xbb_detach is for cases where the backend hasn't been closed properly
(ie: domain destroy or stuck frontend).
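
Purely as an illustration of why the double call is normally harmless (the
struct and flag names below are made up, not the real blkback fields): a
shutdown routine reachable from both paths usually guards itself with a
flag so the second invocation becomes a no-op.

  struct xbb_sketch {
          unsigned int flags;
  #define XBBF_SHUTDOWN_DONE 0x01
  };

  static int
  xbb_shutdown_sketch(struct xbb_sketch *xbb)
  {
          /* Reached once when the frontend switches to Closing, and
           * possibly again from detach for the destroy/stuck cases. */
          if ((xbb->flags & XBBF_SHUTDOWN_DONE) != 0)
                  return (0);
          xbb->flags |= XBBF_SHUTDOWN_DONE;

          /* ... tear down rings, close the backing device, etc. ... */
          return (0);
  }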

> 
> My problem with getting more useful info is that i do not understand the
> flow of VM creation, including disk allocation.
> 
> To my understanding they just "magically" appear and get detected by
> xbb_probe. :)
> 
> I tried to understand the function xs_gather called by xbb_attach,
> implemented in dev/xen/xenstore/xenstore.c, but it just writes a packet
> somewhere; i assumed that dev/xen/xenstore/xenstore_dev.c is the one that
> handles the other end of that packet, so i looked at "case XS_READ:" in that
> file, but it calls xs_dev_request_and_reply which is implemented in
> dev/xen/xenstore/xenstore.c, so for me it feels like going in circles. So at
> the moment i have not found which component could be the one that sends
> ENOENT.
> 
> Could you give the big picture of how a VM block device is created? I mean,
> is it the xen tools writing a request into xenstore_dev, but how does that
> then activate the xbb probe? Which component is responsible for allocating
> memory for the xbb instance?

PV devices are all handled in xenstore (a.k.a. xenbus), which is a kind
of key-value database that contains the information about devices.

The FreeBSD kernel sets a watch on certain xenstore paths, so that
when new entries are added a specific function is called in order to
detect whether devices are being added or removed.
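
Roughly, the registration follows the hotplug_watch fragment quoted above.
A hedged sketch (field names follow xenstorevar.h; treat the exact
callback signature as an assumption on my part):

  #include <xen/xenstore/xenstorevar.h>

  /* Invoked by the xenstore watch machinery whenever the watched path
   * (or anything below it) changes. */
  static void
  node_changed_cb(struct xs_watch *watch, const char **vec, unsigned int len)
  {
          /* vec[XS_WATCH_PATH] is the path that fired; from here the
           * driver can xs_read()/xs_gather() the keys it cares about. */
  }

  static int
  register_node_watch(struct xs_watch *watch, char *node)
  {
          watch->node = node;                /* xenstore path to monitor */
          watch->callback = node_changed_cb; /* runs in the watch thread */
          watch->callback_data = 0;          /* cookie handed back to us */
          return (xs_register_watch(watch));
  }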

> 
> I added printf in dev/xen/xenstore/xenstore.c
> ..
> xs_talkv(struct xs_transaction t, enum xsd_sockmsg_type request_type,
>     const struct iovec *iovec, u_int num_vecs, u_int *len, void **result)
> {
> ..
>   if (msg.type == XS_ERROR) {
>     error = xs_get_error(ret);
>     // .. printf here ..
>     free(ret, M_XENSTORE);
> ..
> And it prints error 22 a lot (EINVAL, if numbered as in errno.h) and
> sometimes error 2 (this is the one i suppose i saw at the mentioned
> xs_gather call). The trouble is that i cannot locate where the xs_talkv
> errors are generated, and when all error messages are printed on serial, it
> starts to influence overall system behaviour, since serial cannot keep up
> with the message generation rate and as a result the sysctl leak does not
> trigger as fast.
> 
> Could i create a static variable in xenstore.c and set it to 1 from
> blkback.c when i want logging and to 0 when i don't? Or is xenstore a
> global component, so that when blkback.c sets its logging to 1, all other
> concurrent messages will be logged as well?
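
For what it's worth, one possible shape for such a switch (all names
invented for illustration; and since xs_talkv is shared by every xenstore
consumer, the flag would indeed gate logging for all of them, not just
blkback):

  /* dev/xen/xenstore/xenstore.c side -- deliberately not static: */
  int xs_debug_errors = 0;

  static void
  xs_maybe_log_error(int request_type, int error)
  {
          if (xs_debug_errors)
                  printf("xs_talkv: request type %d -> error %d\n",
                      request_type, error);
  }

  /* dev/xen/blkback/blkback.c side: */
  extern int xs_debug_errors;

  static void
  xbb_set_xs_debug(int enable)
  {
          xs_debug_errors = enable;  /* affects every xenstore caller */
  }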
> 
> I would not expect xs_talkv to encounter as many error 22 cases as it does.
> Could this be relevant? Or could it be that it's just by design that some
> functions test whether a key exists before creating a new one?
> 
> I stared at the xs_talkv function for a while. It calls xs_write_store
> twice, so i thought maybe there could be a case when a message is not
> written "atomically"? Thus error 22. Yes, it does
> sx_xlock(&xs.request_mutex), but the function xs_write_store does
> mtx_lock(&xs.ring_lock) and mtx_unlock(&xs.ring_lock) in an interesting
> manner. OTOH if this really was even a bit wrong, then the system would be
> far more unstable, maybe even unusable, right?
> 
> As for now, it seems that the third process that was creating zfs volumes
> is actually not important. What is important is load on the system, not
> load on the ZVOLs, because i was running the VM1 and VM2 on/off stress
> without the zfs create/dd script, but was building a kernel in parallel,
> and sysctl variables were still leaked. The kernel was building on UFS on
> ada0, the VM disks on ZVOLs on ada1.
> 
> For today i guess this is it.
> 
> Thanks for looking into this and sorry i can not produce more useful error
> logs. The earlier-dated ones have no error code prints. The sysctl
> variables are captured when all VMs are down.

I think the problem is not likely with the xenstore implementation
(ie: xs_talkv) but rather a race in how the FreeBSD kernel detects
and manages addition and removal of devices that hang off xenbus.

I'm afraid there's too much data below for me to parse it.

Let me know whether the proposed patch made any difference.

Thanks, Roger.