Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Thu, 24 Mar 2022 16:26:22 UTC

On Thu, Mar 24, 2022 at 06:01:24PM +0200, Ze Dupsys wrote:
> On 2022.03.24. 15:12, Roger Pau Monné wrote:
> > On Mon, Mar 21, 2022 at 06:56:05PM +0100, Roger Pau Monné wrote:
> > > On Mon, Mar 21, 2022 at 05:35:15PM +0100, Roger Pau Monné wrote:
> > > > On Mon, Mar 21, 2022 at 04:07:48PM +0200, Ze Dupsys wrote:
> > > > > On 2022.03.21. 13:14, Roger Pau Monné wrote:
> > > > > > I think the problem is not likely with the xenstore implementation
> > > > > > (ie: xs_talkv) but rather a race with how the FreeBSD kernel detects
> > > > > > and manages addition and removal of devices that hang off xenbus.
> > > > > > 
> > > > > > I'm afraid there's too much data below for me to parse it.
> > > > > 
> > > > > Understood. Sounds trickier than i thought. What could i do to make the
> > > > > data more useful?
> > > > 
> > > > I have another patch for you to try. This will make the system a bit
> > > > chatty, let's see what you get.
> > > 
> > > Forgot to mention: when testing the patch attached to the previous
> > > email there's no need to push the system until you get a panic. Just
> > > detecting when you have stale xbbd entries in sysctl would be enough,
> > > or alternatively when you start to see entries in the output of
> > > `xenstore-ls -fp` like:
> > > 
> > > /local/domain/0/backend/vbd/XX/XXXXX = ""   (n0)
> > > /local/domain/0/backend/vbd/XX/XXXXX/feature-barrier = "1"   (n0)
> > > /local/domain/0/backend/vbd/XX/XXXXX/feature-flush-cache = "1"   (n0)
> > > /local/domain/0/backend/vbd/XX/XXXXX/max-ring-page-order = "5"   (n0)
> > > 
> > > Note the lack of a '/local/domain/0/backend/vbd/XX/XXXXX/state' node.
> > > 
> > > At that point I would request that you attach the output of
> > > `xenstore-ls -fp` together with the full serial log since the system
> > > booted.
> > > 
> > > You might not need a lot of iterations to trigger that state.
> > 
> > Hello,
> > 
> > Sorry to pester, but do you have any update on this?
> > 
> > I'm quite sure there are races with xenbus device attach/detach, and
> > the earlier we can get this sorted out the better.
> 
> Hello,
> 
> Yes, i agree.
> 
> Sorry that i could not write sooner. The logs i have gathered so far do
> not seem to be helpful: there is only garbage before the panic (i'll attach
> them anyway). I ran a few tests and 2 of them panicked the kernel, so i
> could not gather `xenstore-ls -fp` output.
> 
> In all test cases i did not notice any sysctl variable leaks; all
> logged sysctls seemed to be normal. Without the patch the system did not
> panic at the first sysctl variable leak, so maybe the given patch fixed the
> sysctl variable leaks, but it definitely did not solve the panic problem.
> 
> I am thinking of rewriting the testing scripts so that information about
> the system is gathered in a better way.
> 
> Would getting rid of ZFS help with debugging the problem? I'll try to
> decrease the HDD count for VMs as well, so that there is less noise in logs.

Hm, TBH I'm not sure what the problem is. I had guessed it was some
kind of leak due to stale backends not being properly cleaned up, but it
might be something else.
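
For reference, the stale-backend check described earlier in this thread
(vbd backend directories in `xenstore-ls -fp` output that have no 'state'
node) could be automated roughly as below. This is only a sketch: the
function name is made up for illustration, and it assumes that every line
has the `path = "value"   (perms)` shape shown in the quoted example and
that the domain and device path components are numeric.

```python
import re

# Matches a vbd backend directory and, optionally, one node directly below it.
# Assumption: domain IDs and device IDs are purely numeric path components.
VBD_RE = re.compile(r"^(/local/domain/\d+/backend/vbd/\d+/\d+)(?:/([^/]+))?$")


def stale_vbd_backends(lines):
    """Return the sorted vbd backend paths that have no 'state' child node.

    `lines` is any iterable of `xenstore-ls -fp` output lines.
    """
    children = {}
    for line in lines:
        # Everything before the first " = " is the xenstore path.
        path = line.split(" = ", 1)[0].strip()
        match = VBD_RE.match(path)
        if not match:
            continue
        backend, child = match.groups()
        nodes = children.setdefault(backend, set())
        if child is not None:
            nodes.add(child)
    return sorted(b for b, nodes in children.items() if "state" not in nodes)
```

Calling `stale_vbd_backends()` on the lines of a saved `xenstore-ls -fp`
dump returns the suspicious backend paths, so a test script could stop and
collect logs as soon as the list is non-empty.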

This seems to be a fairly common trace for your panics:

#0 0xffffffff80c74605 at kdb_backtrace+0x65
#1 0xffffffff80c26611 at vpanic+0x181
#2 0xffffffff80c26483 at panic+0x43
#3 0xffffffff810c1b97 at trap+0xba7
#4 0xffffffff810c1bef at trap+0xbff
#5 0xffffffff810c1243 at trap+0x253
#6 0xffffffff81098c58 at calltrap+0x8
#7 0xffffffff80c7f251 at rman_is_region_manager+0x241
#8 0xffffffff80c36e71 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c362bc at kernel_sysctl+0x3ec
#10 0xffffffff80c36933 at userland_sysctl+0x173
#11 0xffffffff80c3677f at sys___sysctl+0x5f
#12 0xffffffff810c249c at amd64_syscall+0x10c
#13 0xffffffff8109956b at Xfast_syscall+0xfb

Could you give me the output of executing the following on dom0:

$ addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80c7f251
$ addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80c36e71
$ addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80c362bc
$ addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80c36933
$ addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80c3677f

That would give us a more accurate trace.
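
If it is easier to script this for future panics, the addresses can be
pulled out of the trace and turned into a single addr2line invocation
(addr2line accepts several addresses at once). The following is a minimal
sketch, not a tested tool: the helper names and the default frame range
(#7 to #11, matching the commands above) are my own choices, and it assumes
that the trace lines look exactly like the ones pasted above.

```python
import re

# One frame line looks like: "#7 0xffffffff80c7f251 at rman_is_region_manager+0x241"
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\S+)")


def parse_frames(trace):
    """Return a list of (frame_number, address, symbol+offset) tuples."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def addr2line_command(frames, kernel_debug, first=7, last=11):
    """Build an addr2line argument list for frames numbered first..last."""
    addrs = [addr for num, addr, _sym in frames if first <= num <= last]
    return ["addr2line", "-e", kernel_debug] + addrs
```

The returned list can be passed to `subprocess.run()` on dom0, with
`/usr/lib/debug/boot/kernel/kernel.debug` as `kernel_debug`.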

Thanks, Roger.