Re: Propose a new stage `vnet_shutdown` before `vnet_destroy`

From: Zhenlei Huang <zlei_at_FreeBSD.org>
Date: Fri, 06 Jan 2023 10:14:06 UTC

> On Dec 19, 2022, at 1:44 AM, James Gritton <jamie@freebsd.org> wrote:
> 
> On 2022-12-18 00:01, Zhenlei Huang wrote:
>> I'm currently working on a route nexthop caching feature for
>> tunneling interfaces such as if_gif, if_gre, if_vxlan, and
>> potentially if_wg. I encountered a nasty bug related to the VNET
>> life cycle.
>> More precisely, I'd like to call `rib_unsubscribe()` to unsubscribe
>> from route events when the interface tunnel is deleted
>> (gif_delete_tunnel).
>> During VNET shutdown, however, the VNET SYSUNINITs are run and the
>> routing vnet subsystem is destroyed before the interface goes down,
>> which causes a page fault. I do not want to check the
>> `vnet.vnet_shutdown` state, as that looks messy.
>> I have recently been reviewing the life cycle of prisons and got
>> some inspiration.
>> When a jail / prison is submitted for destruction (by the
>> jail_remove syscall), SIGKILL is sent to the prison's processes. I
>> think that is the correct order in which to destroy a jail / prison.
>> To summarize, the life cycle of a jail / prison is:
>> on jail create: PRISON_STATE_INVALID -> create VNET ->
>> PRISON_STATE_ALIVE -> set up network resources (ifnet, interface
>> addresses, routing, etc.) -> create / attach (network) processes
>> on jail destroy: user kills processes via jexec (1) -> mark the
>> prison as PRISON_STATE_DYING -> kernel sends SIGKILL to processes (2)
>> -> destroy VNET (once the prison's last pr_ref is dropped) -> dead
>> Step (2) is a cleanup by the kernel, since (1) may not have been
>> done by the user.
>> This leads to an idea about the life cycle of VNET.
>> On jail destroy, the network resources are cleaned up by
>> vnet_destroy (SYSUNINIT). The ordering of the network components'
>> SYSUNINITs then becomes hacky, because circular dependencies between
>> network resources are possible.
>> For example, routing table entries (nhop) hold a reference to an
>> ifnet, and the ifnet holds a reference to a route nhop (cache), which
>> is the case I ran into.
>> Just like the kernel's cleanup of processes, we can introduce a new
>> stage `vnet_shutdown` that cleans up network resources.
>> When a jail / prison is going to die, after the kernel has cleaned up
>> its processes it calls `vnet_shutdown` to clean up network resources;
>> vnet_destroy will then go smoothly, as no circular network resource
>> dependency remains by that point.
>> The life cycle of a prison becomes:
>> on jail create: PRISON_STATE_INVALID -> create VNET ->
>> PRISON_STATE_ALIVE -> set up network resources (ifnet, interface
>> addresses, routing, etc.) -> create / attach (network) processes
>> on jail destroy: user kills processes via jexec (1) -> mark the
>> prison as PRISON_STATE_DYING -> kernel sends SIGKILL to processes (2)
>> -> vnet_shutdown cleans up network resources -> destroy VNET (once
>> the prison's last pr_ref is dropped) -> dead
>> This idea is still immature and I hope to hear more opinions on it.
> 
> This is absolutely the direction things need to go.  Vnet isn't the
> only thing that can have these problems, though it's been the biggest
> offender.  There could also be cycles that involve more than one
> subsystem, which could be helped by broad application of this idea.
> 
> There's a function in kern_jail.c ready for this: prison_cleanup.
> It's called in the "mark PRISON_STATE_DYING" stage of things.  That's
> before the "send SIGKILL" part of your sequence, but otherwise fits.
> 

Submitted to Phabricator for review:

https://reviews.freebsd.org/D37956
https://reviews.freebsd.org/D37957
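
For readers following along, here is a toy userland sketch of why a
separate shutdown stage helps with the ifnet <-> nhop reference cycle
described in the quoted proposal. It is not the code in the reviews
above: every name (toy_vnet, toy_ifnet, toy_nhop and the toy_*
functions) is hypothetical, and plain integer counters stand in for the
kernel's reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel's struct ifnet / nhop types. */
struct toy_nhop;

struct toy_ifnet {
	int refs;                  /* references held on this interface */
	struct toy_nhop *nh_cache; /* cached nexthop, holds one nhop ref */
	bool freed;
};

struct toy_nhop {
	int refs;                  /* references held on this nexthop */
	struct toy_ifnet *ifp;     /* outgoing interface, holds one if ref */
	bool freed;
};

struct toy_vnet {
	struct toy_ifnet ifn;
	struct toy_nhop nh;
};

void toy_if_rele(struct toy_ifnet *ifp);

/* Drop one nexthop reference; on the last one, release its interface. */
void toy_nhop_rele(struct toy_nhop *nh)
{
	assert(nh->refs > 0);
	if (--nh->refs == 0) {
		nh->freed = true;
		if (nh->ifp != NULL) {
			toy_if_rele(nh->ifp);
			nh->ifp = NULL;
		}
	}
}

/* Drop one interface reference; on the last one, release its cache. */
void toy_if_rele(struct toy_ifnet *ifp)
{
	assert(ifp->refs > 0);
	if (--ifp->refs == 0) {
		ifp->freed = true;
		if (ifp->nh_cache != NULL) {
			toy_nhop_rele(ifp->nh_cache);
			ifp->nh_cache = NULL;
		}
	}
}

/* Each object has two refs: one from its subsystem, one from its peer. */
void toy_vnet_init(struct toy_vnet *v)
{
	v->ifn.refs = 2;
	v->ifn.nh_cache = &v->nh;
	v->ifn.freed = false;
	v->nh.refs = 2;
	v->nh.ifp = &v->ifn;
	v->nh.freed = false;
}

/* Proposed extra stage: drop cached references while all is still valid. */
void toy_vnet_shutdown(struct toy_vnet *v)
{
	if (v->ifn.nh_cache != NULL) {
		toy_nhop_rele(v->ifn.nh_cache);
		v->ifn.nh_cache = NULL;
	}
}

/* Subsystem teardown in a fixed order; true if nothing was left behind. */
bool toy_vnet_destroy(struct toy_vnet *v)
{
	toy_nhop_rele(&v->nh);  /* routing subsystem drops its reference */
	toy_if_rele(&v->ifn);   /* interface subsystem drops its reference */
	return (v->nh.freed && v->ifn.freed);
}
```

In this model, calling toy_vnet_destroy() alone leaves both objects
holding each other, whereas calling toy_vnet_shutdown() first lets the
destroy step release everything regardless of subsystem order.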


> - Jamie

Best regards,
Zhenlei