From: James Gritton
To: freebsd-jail@freebsd.org, freebsd-net
Cc: Zhenlei Huang
Subject: Re: Propose a new stage `vnet_shutdown` before `vnet_destroy`
Date: Sun, 18 Dec 2022 09:44:14 -0800
Message-ID: <1c9dbf6d26b9525243dd6b3ffafa23cb@freebsd.org>
List-Id: Discussion about FreeBSD jail(8)
List-Archive: https://lists.freebsd.org/archives/freebsd-jail

On 2022-12-18 00:01, Zhenlei Huang wrote:
> I'm currently working on a route nexthop caching feature for tunneling
> interfaces such as if_gif, if_gre, if_vxlan, and potentially if_wg. I
> have run into a nasty bug related to the VNET life cycle. More
> precisely, I'd like to call `rib_unsubscribe()` to unsubscribe from
> route events when the tunnel interface is deleted (gif_delete_tunnel).
>
> When a VNET shuts down, the VNET SYSUNINITs are called and the routing
> vnet subsystem is destroyed before the interface goes down, which
> causes a page fault. I do not want to check the `vnet.vnet_shutdown`
> state, as it looks messed up.
>
> I have recently been reviewing the life cycle of prisons and got some
> inspiration.
>
> When a jail / prison is submitted for destruction (by the jail_remove
> syscall), SIGKILL is sent to the prison's processes. I think that is
> the correct order in which to destroy a jail / prison. To summarize,
> the life cycle of a jail / prison is:
>
> on jail create: PRISON_STATE_INVALID -> create VNET ->
> PRISON_STATE_ALIVE -> setup network resources, ifnet, if addresses,
> routing, etc.
> -> create / attach (network) processes
> on jail destroy: jexec kill processes (1) by user -> mark it as
> PRISON_STATE_DYING -> send SIGKILL to processes by kernel (2) ->
> destroy VNET (when the prison's last pr_ref is released) -> DEAD
>
> Step (2) is a cleanup by the kernel, since (1) may not have been done
> by the user.
>
> That leads to an idea about the life cycle of a VNET.
>
> On jail destroy, the network resources are cleaned up by
> vnet_destroy (SYSUNINIT). The ordering of the network components'
> SYSUNINITs is then a hack, because circular network resource
> dependencies are possible. For example, routing table entries (nhop)
> hold references to an ifnet, and an ifnet can hold a reference to a
> route nhop (cache), as I encountered.
>
> Just like the kernel's cleanup of processes, we could introduce a new
> stage `vnet_shutdown` that cleans up network resources. When a jail /
> prison is going to die, after the kernel has cleaned up its processes
> it calls `vnet_shutdown` to clean up network resources; vnet_destroy
> will then go smoothly, as no circular network resource dependencies
> remain.
>
> The life cycle of a prison becomes:
>
> on jail create: PRISON_STATE_INVALID -> create VNET ->
> PRISON_STATE_ALIVE -> setup network resources, ifnet, if addresses,
> routing, etc. -> create / attach (network) processes
> on jail destroy: jexec kill processes (1) by user -> mark it as
> PRISON_STATE_DYING -> send SIGKILL to processes by kernel (2) ->
> vnet_shutdown cleanup network resources -> destroy VNET (when the
> prison's last pr_ref is released) -> DEAD
>
> This idea is still immature and I hope to hear more opinions about it.

This is absolutely the direction things need to go. Vnet isn't the only
thing that can have these problems, though it's been the biggest
offender. There could also be cycles that involve more than one
subsystem, which could be helped by broad application of this idea.

There's a function in kern_jail.c ready for this: prison_cleanup.
It's called in the "mark PRISON_STATE_DYING" stage of things. That's
before the "send SIGKILL" part of your sequence, but otherwise fits.

- Jamie