Re: What's going on with vnets and epairs w/ addresses?

From: Bjoern A. Zeeb <bz_at_freebsd.org>
Date: Tue, 20 Dec 2022 20:50:09 UTC
On Tue, 20 Dec 2022, Mark Johnston wrote:

> On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:
>> On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff <glebius@freebsd.org> wrote:
>>>
>>>   Zhenlei,
>>>
>>> On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
>>> Z> I managed to reproduce this issue on CURRENT/14 with this small snippet:
>>> Z>
>>> Z> -------------------------------------------
>>> Z> #!/bin/sh
>>> Z>
>>> Z> # test jail name
>>> Z> n="test_ref_leak"
>>> Z>
>>> Z> jail -c name=$n path=/ vnet persist
>>> Z> # The following line triggers a jail pr_ref leak
>>> Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
>>> Z>
>>> Z> jail -R $n
>>> Z>
>>> Z> # wait a moment
>>> Z> sleep 1
>>> Z>
>>> Z> jls -j $n
>>> Z>
>>> Z> After DDB debugging and tracing, it seems that it is triggered by a combination of [1] and [2].
>>> Z>
>>> Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915
>>> Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b
>>> Z>
>>> Z>
>>> Z> In [1] the per-VNET uma zone is shared with the global one.
>>> Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
>>> Z>
>>> Z> In [2] the unref of `inp->inp_cred` was moved into inpcb_dtor(), and is thus deferred by uma_zfree_smr().
>>> Z>
>>> Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is not called immediately,
>>> Z> thus leaking the `inp->inp_cred` ref and hence `prison->pr_ref`.
>>> Z>
>>> Z> And it is also not possible to free up the cache via the per-VNET SYSUNINITs tcp_destroy / udp_destroy / rip_destroy.
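
To spell out the mechanism for those not staring at the code: the shape is
roughly the following (an illustrative sketch only; names and structure are
simplified from the actual sys/netinet code):

    /*
     * Illustrative sketch only -- simplified from sys/netinet.
     */
    static void
    inpcb_dtor(void *mem, int size __unused, void *arg __unused)
    {
            struct inpcb *inp = mem;

            /* Drops the ucred ref and, transitively, prison->pr_ref. */
            crfree(inp->inp_cred);
    }

    void
    in_pcbfree_sketch(struct inpcbinfo *pcbinfo, struct inpcb *inp)
    {
            /*
             * The inpcb goes into a per-CPU SMR bucket.  inpcb_dtor()
             * (and with it crfree()) only runs once that cached item
             * is eventually drained or reused; on a network-idle
             * machine that can be arbitrarily far in the future, so
             * the jail reference lingers.
             */
            uma_zfree_smr(pcbinfo->ipi_zone, inp);
    }
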
>>>
>>> This is a known issue and I'd prefer not to call it a problem. The "leak" of a jail
>>> happens only if the machine is idle wrt networking activity.
>>>
>>> Getting back to the problem that started this thread - the epair(4)s not immediately
>>> popping back to prison0. IMHO, the problem again lies in the design of if_vmove and
>>> epair(4) in particular. if_vmove should not exist; instead we should do a full
>>> if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove doesn't
>>> carry any useful information. With Alexander melifaro@ we discussed better options
>>> for creating or attaching interfaces to jails than if_vmove. Until they are ready,
>>> the easiest workaround for the annoying epair(4) comeback problem is to
>>> remove the epair manually before destroying a jail, as I did in 80fc25025ff.
>>>
>>
>> It still behaved much better prior to eb93b99d6986, for which you and
>> Mark were going to work on a solution to allow the cred "leak" to close
>> up much more quickly. CC markj@, since I think it's been six months
>> since the last time I inquired about it, making this a good time to do
>> so again...
>
> I spent some time trying to see if we could fix this in UMA/SMR and
> talked to Jeff about it a bit.  At this point I don't think it's the
> right approach, at least for now.  Really we have a composability
> problem where different layers are using different techniques to signal
> that they're done with a particular piece of memory, and they just
> aren't compatible.
>
> One thing I tried is to implement a UMA function which walks over all
> SMR zones and synchronizes all cached items (so that their destructors
> are called).  This is really expensive, at minimum it has to bind to all

A semi-unrelated question -- do we have any documentation around SMR
in the tree which is not in subr_smr.c?

(I have to admit I find it highly confusing that the acronym is more
easily found as "Shingled Magnetic Recording (SMR)" in a different
header file).


> CPUs in the system so that it can flush per-CPU buckets.  If
> jail_deref() calls that function, the bug goes away at least in my
> limited testing, but its use is really a layering violation.
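
To make the cost concrete, what is described above would have to look
roughly like the sketch below.  This is illustrative only: there is no
public KPI to iterate SMR zones, zone_smr() and uma_zone_foreach_smr()
are made up here, and the real experiment lived inside UMA.

    /*
     * Illustrative sketch of the experiment described above; not an
     * existing or proposed KPI.
     */
    static void
    uma_smr_flush_zone(uma_zone_t zone)
    {
            /*
             * Wait for all current SMR readers to drain first ...
             * zone_smr() is hypothetical; the handle is UMA-private.
             */
            smr_synchronize(zone_smr(zone));

            /*
             * ... then flush the per-CPU buckets so the deferred
             * destructors (inpcb_dtor() and its crfree()) finally
             * run.  Visiting every CPU's cache safely is what makes
             * this so expensive on large machines.
             */
            uma_zone_reclaim(zone, UMA_RECLAIM_DRAIN_CPU);
    }

    /*
     * jail_deref() would then call something like
     * uma_zone_foreach_smr(uma_smr_flush_zone); -- hypothetical
     * iterator, and the layering violation mentioned above.
     */
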
>
> We could, say, periodically scan cached UMA/SMR items and invoke their
> destructors, but for most SMR consumers this is unnecessary, and again
> there's a layering problem: the inpcb layer shouldn't "know" that it has
> to do that for its zones, since it's the jail layer that actually cares.
>
> It also seems kind of strange that dying jails still occupy a slot in
> the jail namespace.  I don't really understand why the existence of a
> dying jail prevents creation of a new jail with the same name, but
> presumably there's a good reason for it?

You can create a new jail, but if you have (physical) resources tied to
the old one which are not released, then you are stuck (physical
network interfaces, for example).


> Now my inclination is to try and fix this in the inpcb layer, by not
> accessing the inp_cred at all in the lookup path until we hold the inpcb
> lock, and then releasing the cred ref before freeing a PCB to its zone.
> I think this is doable based on a few observations:
> - When doing an SMR-protected lookup, we always lock the returned inpcb
>  before handing it to the caller.  So we could in principle perform
>  inp_cred checking after acquiring the lock but before returning.
> - If there are no jailed PCBs in a hash chain in_pcblookup_hash_locked()
>  always scans the whole chain.
> - If we match only one PCB in a lookup, we can probably(?) return that
>  PCB without dereferencing the cred pointer at all.  If not, then the
>  scan only has to keep track of a fixed number of PCBs before picking
>  which one to return.  So it looks like we can perform a lockless scan
>  and keep track of matches on the stack, then lock the matched PCBs and
>  perform prison checks if necessary, without making the common case
>  more expensive.
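
If I read the proposal right, the lookup would be restructured along the
following lines.  A heavily simplified sketch: struct lookup_key and the
match/validation helpers are placeholders, not existing netinet
functions, and the hash-chain and SMR details are elided.

    /*
     * Heavily simplified sketch of the proposed two-phase lookup.
     * ports_match(), exact_local_match(), still_valid() and
     * struct lookup_key are placeholders.
     */
    static struct inpcb *
    lookup_sketch(struct inpcbhead *head, const struct lookup_key *key)
    {
            struct inpcb *inp, *exact = NULL, *wild = NULL;

            /*
             * Phase 1: lockless SMR scan.  Match on addresses/ports
             * only and remember candidates on the stack; inp_cred is
             * not touched here (no lock is held, and it is a
             * frequent cache miss).
             */
            CK_LIST_FOREACH(inp, head, inp_hash) {
                    if (!ports_match(inp, key))
                            continue;
                    if (exact_local_match(inp, key))
                            exact = inp;
                    else
                            wild = inp;
            }
            inp = (exact != NULL) ? exact : wild;
            if (inp == NULL)
                    return (NULL);

            /*
             * Phase 2: lock the candidate, re-validate it (it may
             * have been freed and reused since the unlocked scan),
             * and only now dereference inp_cred for prison checks.
             */
            INP_RLOCK(inp);
            if (!still_valid(inp, key) ||
                prison_check_ip4(inp->inp_cred, &key->laddr) != 0) {
                    INP_RUNLOCK(inp);
                    return (NULL);          /* or restart the scan */
            }
            return (inp);
    }
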
>
> In fact there is a parallel thread on freebsd-jail which reports that
> this inp_cred access is a source of frequent cache misses.  I was
> surprised to see that the scan calls prison_flag() before even checking
> the PCB's local address.  So if the hash chain is large then we're
> potentially performing a lot of unnecessary memory accesses (though
> presumably it's common for most of the PCBs to be sharing a single
> cred?).  In particular we can perhaps solve two problems at once.

I haven't heard back since I sent the test program there; I hope that
can be solved independently first, and any optimisations can then follow.


> Any thoughts?  Are there some fundamental reasons this can't work?
>

-- 
Bjoern A. Zeeb                                                     r15:7