Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Fri, 15 Apr 2022 09:06:12 UTC
On 2022.04.14. 10:39, Roger Pau Monné wrote:
> ..
> Thanks. I will groom those patches in order to prepare them for
> commit. Regardless of whether there are other issues still lurking I
> think those changes are worth committing now.

Hi,

So the tests are still running with the 3 patches, and the system behaves a lot better than 
without them. I think I will stop them soon, since they have proved the point.

uptime
11:22AM  up 1 day, 19:18, 8 users, load averages: 2.17, 2.04, 2.12


There has been a problem though: at this stage xl list shows a (null) VM.

xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1023     4     r-----  211752.0
(null)                                     346     0     1     --ps-d      61.7
xen-vm2nonic-zvol-5                        557  1024     1     r-----      47.2
xen-vm1nonic-zvol                          558  1024     1     -b----      42.0

I have no idea why it went into that state. I have been collecting vmstat -m output since 
the start, and after filtering it I think I got some potentially useful hints.

Pictures in url: https://file.fm/u/k67uhj436#/

In general we can see that xbbd with the 3 patches does not seem to leak memory; I did not 
see any component whose memory usage was growing without bound. The images plot the InUse 
values from vmstat -m (y axis bytes, x axis Unix timestamp), but I don't know if maybe I 
should have looked at MemUse instead.
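For reference, collecting these samples with timestamps only needs a small loop; something 
along these lines would do (interval and log file name are arbitrary choices):

while true; do
    ts=$(date +%s)
    # prepend the sample timestamp to every vmstat -m line, skipping the header
    vmstat -m | awk -v ts="$ts" 'NR > 1 { print ts, $0 }' >> /var/log/vmstat-m.log
    sleep 60   # sample interval, arbitrary
done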

We can see that solaris takes quite a chunk, which I guess is expected for ZFS. The 
interesting thing is a spike in newblk: looking closer, it happens at the same time as the 
jsegdep spike, and there is a smaller but simultaneous spike for jseg. At the same time as 
the newblk spike there is a slight dip for solaris.

If there are specific types I should pay more attention to, I can look at them.
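For a quick look at just the types mentioned above, filtering by name is enough, either on 
live output or on the timestamped log from the loop above:

vmstat -m | grep -E 'newblk|jseg|solaris'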

I would like to speculate that the "..pmap_growkernel.." panic happens when one of these 
spikes is high enough, and that in this run it was just a lucky coincidence that the system 
had enough memory and did not panic. Or maybe this is where the (null) VM appeared. 
Unfortunately I did not log the output of xl list with timestamps.
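For the next run, something as simple as this should be enough to correlate xl list with 
the vmstat samples (again, interval and log file name are arbitrary):

while true; do
    # record a timestamp header, then the current domain list
    echo "=== $(date +%s)" >> /var/log/xl-list.log
    xl list >> /var/log/xl-list.log
    sleep 60
done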

Currently xenstore-ls -fp does not contain any entry for domain 346, so I suppose the disks 
have been freed, and I don't see any suspicious sysctl variables either. So I do not know 
what state this (null) VM is in or why it is not being cleaned up. Is there a useful command 
in this case to collect more info about the (null) VM?

Thanks.