From nobody Tue Dec 20 20:50:09 2022 X-Original-To: freebsd-jail@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Nc7xf6DxYz1GyGF for ; Tue, 20 Dec 2022 20:50:14 +0000 (UTC) (envelope-from bz@freebsd.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Nc7xf4Swjz4Lbt; Tue, 20 Dec 2022 20:50:14 +0000 (UTC) (envelope-from bz@freebsd.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1671569414; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=q/Q/Vp7ojc0jFeICZS5HMz+J9jG6pn78B9j0KYAmZIo=; b=wqVyt5R2cXReagpVrZQb+gZ0iDu/TLssli0qWlZ+bdv0oJAePUjLkK0A8vR7oVl36kAGj/ 4aRmAJ9G9rT7FubG3x7GrAxZioxjZd3nStzsgbOBMXy0CY5VIu222kZJ9S+3L7GkEKtDlJ v1e2l+okNbNkZaleY1AQtTSHfqTTY4A8ID1KlYqN7ur5YO+LGC0WsaYsZFptXXutchLvdn Bl8XM0MxmTIW8VosebYhyVbeHwdKsDEUZHmzLF1ytw4QtN9Yi/UA/det5+fGiItRTDsZbN x43YtLLWqlJyNgrg7KdKNDWXKkdsEFpuvZS/hioNmr/d+uiaspXGZa03ZZ11+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1671569414; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=q/Q/Vp7ojc0jFeICZS5HMz+J9jG6pn78B9j0KYAmZIo=; b=VFRZEHvSIVd0f8djJ2JsW0UoVEmBSQYxd7V0+JTqMZt30+1lYvDgIDXzhPr1Qn9pxks5kz 2NLBD6DNcp7CU5p16FYLNlZyTVBdBy6qASiJTgLQO9PtCbtZ7eaUY3e3XUcwES7Dg1vhs2 u/PFKXhcRrQ7UhsIl13HAz4Oi95ZPxjy8ZEW3crfq3eX6O8AlkLhNrVl2yiKiBVVeqC6N9 WHs7eS255MQk710DW8ExTHpxuIvRckoSTvDCk9HFWqOEtuXgmEUJDsEdXapSCarkaueHAr GhNXC5LkF6i0qZa8c0DnqwgJlhfD4FvesixfmVAJ2YnzbWKoIFE/a4+auJZQUw== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1671569414; a=rsa-sha256; cv=none; b=txdgrRhEuSeudVrKrmoCsunqKi1N7Bj4Tdna9x2HhLqSxn/wN/NIgdnL9lIvPJvQtmCIBQ 1JOZvDeBKHbAOjdFSUJzYj3MVTgmq3uQSDLxq+suCsdD9ouxx1lKrU/RNMq1ncaqx/odLT 4m+lZ6C64LnWNMBBItI8n6k91G4f6J0QloWxJgrQUdqyFzijzkolSymDtlfJ0qRa4YBHNw cPoiVdAAFfYKj51En0Aix88Bg5nu1fpdURdfNCHRYIiY9GtHSfBlxqfrcT0jaQ1mBkx+pr APVV+KPjqN3oR/Uzxw7PwYjHUg2XP0rJ+0kISB18xy5aLwf0zWAfUUCwxFVwRw== Received: from mx1.sbone.de (cross.sbone.de [195.201.62.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "mx1.sbone.de", Issuer "SBone.DE" (not verified)) (Authenticated sender: bz/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id 4Nc7xf1tpgzG79; Tue, 20 Dec 2022 20:50:14 +0000 (UTC) (envelope-from bz@freebsd.org) Received: from mail.sbone.de (mail.sbone.de [IPv6:fde9:577b:c1a9:4902:0:7404:2:1025]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.sbone.de (Postfix) with ESMTPS id B90D98D4A228; Tue, 20 Dec 2022 20:50:12 +0000 (UTC) Received: from content-filter.t4-02.sbone.de (content-filter.t4-02.sbone.de [IPv6:fde9:577b:c1a9:4902:0:7404:2:2742]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by mail.sbone.de (Postfix) with ESMTPS id E6E885C3A833; Tue, 20 Dec 2022 20:50:11 +0000 (UTC) X-Virus-Scanned: amavisd-new at sbone.de Received: from mail.sbone.de ([IPv6:fde9:577b:c1a9:4902:0:7404:2:1025]) by content-filter.t4-02.sbone.de (content-filter.t4-02.sbone.de [IPv6:fde9:577b:c1a9:4902:0:7404:2:2742]) (amavisd-new, port 10024) with ESMTP id euS0YaqQHCtP; Tue, 20 Dec 2022 20:50:10 +0000 (UTC) Received: from strong-iwl0.sbone.de (strong-iwl0.sbone.de [IPv6:fde9:577b:c1a9:4902:b66b:fcff:fef3:e3d2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by mail.sbone.de (Postfix) with ESMTPSA id E98E75C3A830; Tue, 20 Dec 2022 20:50:09 +0000 (UTC) Date: Tue, 20 Dec 2022 20:50:09 +0000 (UTC) From: "Bjoern A. Zeeb" To: Mark Johnston cc: Kyle Evans , Gleb Smirnoff , Zhenlei Huang , "freebsd-jail@freebsd.org" Subject: Re: What's going on with vnets and epairs w/ addresses? In-Reply-To: Message-ID: References: <5r22os7n-ro15-27q-r356-rps331o06so5@mnoonqbm.arg> <150A60D6-6757-46DD-988F-05A9FFA36821@FreeBSD.org> X-OpenPGP-Key-Id: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842 List-Id: Discussion about FreeBSD jail(8) List-Archive: https://lists.freebsd.org/archives/freebsd-jail List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-jail@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-ThisMailContainsUnwantedMimeParts: N On Tue, 20 Dec 2022, Mark Johnston wrote: > On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote: >> On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff wrote: >>> >>> Zhenlei, >>> >>> On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote: >>> Z> I managed to repeat this issue on CURRENT/14 with this small snip: >>> Z> >>> Z> ------------------------------------------- >>> Z> #!/bin/sh >>> Z> >>> Z> # test jail name >>> Z> n="test_ref_leak" >>> Z> >>> Z> jail -c name=$n path=/ vnet persist >>> Z> # The following line trigger jail pr_ref leak >>> Z> jexec $n ifconfig lo0 inet 127.0.0.1/8 >>> Z> >>> Z> jail -R $n >>> Z> >>> Z> # wait a moment >>> Z> sleep 1 >>> Z> >>> Z> jls -j $n >>> Z> >>> Z> After DDB debugging and tracing , it seems that is triggered by a combine of [1] and [2] >>> Z> >>> Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915 >>> Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b >>> Z> >>> Z> >>> Z> In [1] the per-VNET uma zone is shared with the global one. >>> Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;` >>> Z> >>> Z> In [2] unref `inp->inp_cred` is deferred called in inpcb_dtor() by uma_zfree_smr() . >>> Z> >>> Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is not called immediately , >>> Z> thus leaking `inp->inp_cred` ref and hence `prison->pr_ref`. >>> Z> >>> Z> And it is also not possible to free up the cache by per-VNET SYSUNINIT tcp_destroy / udp_destroy / rip_destroy. >>> >>> This is known issue and I'd prefer not to call it a problem. The "leak" of a jail >>> happens only if machine is idle wrt the networking activity. >>> >>> Getting back to the problem that started this thread - the epair(4)s not immediately >>> popping back to prison0. IMHO, the problem again lies in the design of if_vmove and >>> epair(4) in particular. The if_vmove shall not exist, instead we should do a full >>> if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove doesn't >>> carry any useful information. With Alexander melifaro@ we discussed better options >>> for creating or attaching interfaces to jails that if_vmove. Until they are ready >>> the most easy workaround to deal with annoying epair(4) come back problem is to >>> remove it manually before destroying a jail, like I did in 80fc25025ff. >>> >> >> It still behaved much better prior to eb93b99d6986, which you and Mark >> were going to work on a solution for to allow the cred "leak" to close >> up much more quickly. CC markj@, since I think it's been six months >> since the last time I inquired about it, making this a good time to do >> it again... > > I spent some time trying to see if we could fix this in UMA/SMR and > talked to Jeff about it a bit. At this point I don't think it's the > right approach, at least for now. Really we have a composability > problem where different layers are using different techniques to signal > that they're done with a particular piece of memory, and they just > aren't compatible. > > One thing I tried is to implement a UMA function which walks over all > SMR zones and synchronizes all cached items (so that their destructors > are called). This is really expensive, at minimum it has to bind to all A semi-unrelated question -- do we have any documentation around SMR in the tree which is not in subr_smr.c? (I have to admit I find it highly confusing that the acronym is more easily found as "Shingled Magnetic Recording (SMR)" in a different header file). > CPUs in the system so that it can flush per-CPU buckets. If > jail_deref() calls that function, the bug goes away at least in my > limited testing, but its use is really a layering violation. > > We could, say, periodically scan cached UMA/SMR items and invoke their > destructors, but for most SMR consumers this is unnecessary, and again > there's a layering problem: the inpcb layer shouldn't "know" that it has > to do that for its zones, since it's the jail layer that actually cares. > > It also seems kind of strange that dying jails still occupy a slot in > the jail namespace. I don't really understand why the existence of a > dying jail prevents creation of a new jail with the same name, but > presumably there's a good reason for it? You can create a new jail but if you have (physical) resources tied to the old one which are not released, then you are stuck (physical network interfaces for example). > Now my inclination is to try and fix this in the inpcb layer, by not > accessing the inp_cred at all in the lookup path until we hold the inpcb > lock, and then releasing the cred ref before freeing a PCB to its zone. > I think this is doable based on a few observations: > - When doing an SMR-protected lookup, we always lock the returned inpcb > before handing it to the caller. So we could in principle perform > inp_cred checking after acquiring the lock but before returning. > - If there are no jailed PCBs in a hash chain in_pcblookup_hash_locked() > always scans the whole chain. > - If we match only one PCB in a lookup, we can probably(?) return that > PCB without dereferencing the cred pointer at all. If not, then the > scan only has to keep track of a fixed number of PCBs before picking > which one to return. So it looks like we can perform a lockless scan > and keep track of matches on the stack, then lock the matched PCBs and > perform prison checks if necessary, without making the common case > more expensive. > > In fact there is a parallel thread on freebsd-jail which reports that > this inp_cred access is a source of frequent cache misses. I was > surprised to see that the scan calls prison_flag() before even checking > the PCB's local address. So if the hash chain is large then we're > potentially performing a lot of unnecessary memory accesses (though > presumably it's common for most of the PCBs to be sharing a single > cred?). In particular we can perhaps solve two problems at once. I haven't heard back after I sent the test program there; I hope that can be solved independently first and any optimisations can then come. > Any thoughts? Are there some fundamental reasons this can't work? > -- Bjoern A. Zeeb r15:7