pf and vimage

Fri Aug 21 12:42:04 UTC 2009

Thanks, very useful!
Do you have an "official" page where we can look for updates?
What do you think of putting it on the FreeBSD Wiki?

Fabien

On 20 Aug 2009, at 18:17, Julian Elischer wrote:

> There were some people looking at adding vnet support to pf.
> Since we last discussed it, the rules of the game have
> changed significantly for the better. With the addition
> of some new facilities in FreeBSD, the work needed to virtualize
> a module has decreased significantly.
>
>
> The following doc gives the new rules.
>
>
> August 17 2009
> Julian Elischer
>
> ===================
> Vimage: what is it?
> ===================
>
> Vimage is a framework in the BSD kernel which allows a co-operating
> module to operate on multiple independent instances of its state so
> that it can participate in a virtual machine / virtual environment
> scenario. It refers to a part of the Jail infrastructure in FreeBSD.
> For historical reasons "Virtual network stack enabled jails"(1) are
> also known as "vimage enabled jails"(2) or "vnet enabled jails"(3).
> The currently correct term is the last of these, which is a
> contraction of the first. In the future other parts of the system may
> be virtualized using the same technology, and the term to cover all
> such components would be "VIMAGE enhanced modules".
>
> The implementation approach taken by the vimage framework is a
> redefinition of selected global state variables to evaluate to
> constructs that allow the virtualized state to be stored and
> resolved in appropriate instances of 'jail' specific container
> storage regions.  The code operating on virtualized state has to
> conform to a set of rules described further below, among other
> things so that all the changes can be compiled conditionally,
> i.e. permitting the virtualized code to fall back to operating on
> global state.
>
> The rest of this document will discuss NETWORK virtualization
> though the concepts may be true in the future for other parts of the
> system.
>
> The most visible change throughout the existing code is typically the
> replacement of direct references to global variables with macros;
> foo_bar thus becomes V_foo_bar.  V_foo_bar macros will resolve back to
> the foo_bar global in default kernel builds, and alternatively to the
> logical equivalent of some_base_pointer->_foo_bar for
> "options VIMAGE" kernel configs.
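
The two resolutions of V_foo_bar can be sketched in a small userland
model. This is only an illustration of the idea, not the kernel's
actual macro definitions: struct vnet_model, cur_model, MODEL_VIMAGE
and bump_foo_bar() are all hypothetical names invented here.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_VIMAGE 1	/* remove to model a default (non-VIMAGE) build */

/* Hypothetical per-instance container standing in for per-vnet storage. */
struct vnet_model {
	int _foo_bar;
};

/* Hypothetical stand-in for the "some_base_pointer" of the text. */
static struct vnet_model *cur_model;

#ifdef MODEL_VIMAGE
/* Virtualized build: the macro resolves through the current instance. */
#define V_foo_bar	(cur_model->_foo_bar)
#else
/* Default build: the macro is just another name for the plain global. */
static int foo_bar;
#define V_foo_bar	foo_bar
#endif

/* Code written against V_foo_bar compiles unchanged in both builds. */
static int
bump_foo_bar(void)
{
#ifdef MODEL_VIMAGE
	assert(cur_model != NULL);	/* no instance selected: a bug */
#endif
	return (++V_foo_bar);
}
```

With the model switched on, pointing cur_model at different instances
makes the same bump_foo_bar() call operate on different counters.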
>
> Prepending of "V_" prefixes to variable references helps in
> visual discrimination between global and virtualized state.
> It is also possible to use an alternative syntax, of VNET(foo_bar) to
> achieve the same thing. The developers felt that V_foo_bar was less
> visually distracting while still providing enough clues to the reader
> that the variable is virtualized. In fact the V_foo_bar macro is
> locally defined near the definition of foo_bar to be an alias for
> VNET(foo_bar) so the two are not only equivalent, they are the same.
>
> The framework also extends the sysctl infrastructure to support
> access to virtualized state through the introduction of the
> SYSCTL_VNET family of macros; those also automatically fall back to
> their standard SYSCTL counterparts in default kernel builds.
>
> Transparent libkvm(3) lookups are provided for virtualized variables,
> which permits userland binaries such as netstat to operate unmodified
> on "options VIMAGE" kernels, though this may have some security
> implications.
>
> Vnets are associated with jails.  In 8.0, every process is associated
> with a jail, usually the default (null) jail, and jails currently
> hang off of a process's ucred.  This relationship defines a process's
> administrative affinity to a vnet and thus indirectly to all of its
> state. All network interfaces and sockets hold pointers back to their
> associated vnets.  This relationship is obviously entirely
> independent from proc->ucred->jail bindings.  Hence, when a process
> opens a socket, the socket will get bound to a vnet instance hanging
> off of proc->ucred->jail->vnet, but once such a socket->vnet binding
> gets established, it cannot be changed for the entire socket
> lifetime.
>
> The mapping from a thread to a vnet should always be done via the
> TD_TO_VNET macro, as the path may change in the future as we get more
> experience with using the system.
>
> Certain classes of network interfaces (Ethernet in particular) can be
> reassigned from one vnet to another at any time.  By definition all
> vnets are independent and can communicate only if they are explicitly
> provided with communication paths. Currently netgraph is mainly used
> to establish inter-vnet datapaths, though other paths are being
> explored, such as the 'epair' back-to-back virtual interface pair, in
> which the different sides may exist in different jails.
>
> In network traffic processing the vnet affinity is defined either by
> the inbound interface or by the socket / pcb -> vnet binding.
> However, there are many functions in the network stack that cannot
> implicitly fetch the vnet context from their standard arguments.
> Instead of explicitly extending the argument lists of such functions
> with a struct vnet *, the concept of a "current vnet", a per-thread
> variable, was introduced, which can be fetched efficiently via the
> curvnet macro.  The correct network context has to be set on entry to
> the network stack (socket operations, packet reception, or
> timer-driven functions) and cleared on exit.  This must be done via
> the provided CURVNET_SET() / CURVNET_RESTORE() family of macros,
> which allow for "stacking" of curvnet context setting and provide
> additional debugging info in INVARIANTS kernel configs.  In most
> cases, however, a developer writing virtualized code will not have to
> set / restore the curvnet context unless the code includes
> timer-driven events, given that those are inherently
> vnet-contextless on entry.
>
> The current rule is that when not in networking code, the 'curvnet'
> macro will return NULL and evaluating a V_xxx (or VNET(xxx)) macro
> will result in a kernel page-fault error. While this is not strictly
> necessary, it aids in debugging and assurance of program correctness.
> Note this does NOT mean that TD_TO_VNET(curthread) is invalid.
> A thread is always associated with a vnet; only the efficient
> "curvnet" access method is disabled, along with the ability to resolve
> virtualized symbols.
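
The "stacking" of context setting, and the rule that the context is
NULL outside networking code, can be modelled in userland roughly as
follows. This is a sketch under assumptions: the MODEL_ macros only
imitate the save/set/restore shape of the CURVNET_SET() /
CURVNET_RESTORE() pair, and struct vnet_model, model_curvnet,
inner_id() and outer_check() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct vnet_model { int id; };		/* hypothetical vnet stand-in */

static struct vnet_model *model_curvnet;	/* NULL outside "network code" */

/* Save the previous context in a local, then switch to the new one. */
#define MODEL_CURVNET_SET(v)						\
	struct vnet_model *saved_model_vnet = model_curvnet;		\
	model_curvnet = (v)
/* Put back whatever context was active before the matching SET. */
#define MODEL_CURVNET_RESTORE()						\
	model_curvnet = saved_model_vnet

/* Inner entry point: sets its own context, nested inside the caller's. */
static int
inner_id(struct vnet_model *v)
{
	int id;

	MODEL_CURVNET_SET(v);
	assert(model_curvnet != NULL);	/* virtualized state is reachable */
	id = model_curvnet->id;
	MODEL_CURVNET_RESTORE();
	return (id);
}

/* Outer entry point: its context must survive the nested call. */
static int
outer_check(struct vnet_model *outer, struct vnet_model *inner)
{
	int ok;

	MODEL_CURVNET_SET(outer);
	ok = (inner_id(inner) == inner->id) && (model_curvnet == outer);
	MODEL_CURVNET_RESTORE();
	/* Called from outside "network code", we are back to no context. */
	return (ok && model_curvnet == NULL);
}
```

Because each SET keeps its saved value in a local variable of the
enclosing function, nested SET/RESTORE pairs unwind in the right order
without any explicit stack.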
>
>
> Converting / virtualizing existing code
> =======================================
>
> There are several steps needed in virtualisation.
>
> 1/ Decide whether the module needs to be virtualised.
>
>   If the module is a driver for specific hardware, it makes sense that
>   there be only one instance of the driver as there is only one piece
>   of physical hardware.  There are changes in the networking code to
>   allow physical (or virtual) interfaces to be moved between vnets.
>   This generally requires NO changes to the network drivers of the
>   classes covered (e.g. ethernet). Currently, if your module does not
>   have any networking facet, the answer is "no" by default.
>
> 2/ If the module is to be virtualised, decide which attributes of the
>   module should be virtualised.
>
>   For example, it may make sense that there be a single central pool
>   of "struct foo" and a single uma zone for them to come from, with a
>   single lock guarding it. It might also make sense for the
>   "foo_debug" sysctl to control all the instances at once, while on
>   the other hand the "foo_mode" sysctl might make better sense if it
>   were controllable on a virtual system by virtual system basis.
>
> 3/ Work out what global variables and structures are to be
>   virtualised to achieve the behaviour required for part #2.
>
> 4/ Work out, for all the code paths through the module, how the
>   thread entering the module can divine which virtual environment it
>   is on.
>
>   Some examples:
>   * Since interfaces are all assigned to one vnet or another, an
>     incoming packet has a pointer to the receive interface, which in
>     turn has a pointer back to the vnet. Often "curvnet" will already
>     have been set by the time your code is called anyhow.
>   * Similarly, on any request from outside the kernel (direct or
>     indirect), the current thread has a way to get to the current
>     virtual environment instance via TD_TO_VNET(curthread).  For
>     existing sockets the vnet context must be obtained via
>     so->so_vnet, since the thread's vnet might change after socket
>     creation.
>   * Timer initiated actions usually have a (void *) argument which
>     points to some private structure for the module. It should be
>     possible to add a pointer to the appropriate module instance into
>     whatever structure that points to.
>   * Sometimes an action (timer triggered, or triggered by module load
>     or unload) simply has to check all the vimage or module
>     instances.  There are macro (pairs) for this which will iterate
>     through all the vnet instances (see sample code below).
>
>   This covers most of the cases; however, in some cases it may still
>   be required for the module to stash away the virtual environment
>   instance somewhere, and make associated changes in the code.
>
> 5/ Decide which parts of the initialization and teardown are per jail
>   and which parts are global, and separate out the code accordingly.
>   Global initialization is done using the SYSINIT facility.
>   Per jail initialization is done using VNET_SYSINIT().
>   Per jail teardown is done using VNET_SYSUNINIT().
>   Global teardown is done using SYSUNINIT().
>   In addition, the modevent handler is called with various event
>   types before any of these are called. The modevent handler may veto
>   load or teardown.  On shutdown, only the modevent handler is called,
>   so it may have to simulate the calling of the other handlers if
>   clean shutdown is a requirement of your module (see sample code
>   below). Don't forget to unregister event handlers, and destroy
>   locks and condition variables.
>
> 6/ Add the code described below to the files that make up the module.
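
The timer case from step 4 (stashing the instance in the private
structure that the timer's (void *) argument points to) might look
like the following userland model. Everything here is illustrative:
struct vnet_model, model_curvnet, struct mymod_softc and
mymod_timeout() are hypothetical, and the plain pointer assignments
stand in for the CURVNET_SET() / CURVNET_RESTORE() pair.

```c
#include <assert.h>
#include <stddef.h>

struct vnet_model { int expired; };	/* hypothetical vnet stand-in */

static struct vnet_model *model_curvnet;	/* hypothetical "curvnet" */

/* Private per-instance structure handed to the timer as its (void *) arg. */
struct mymod_softc {
	struct vnet_model *sc_vnet;	/* stashed when the instance is created */
	int sc_ticks;
};

/* Timer handler: no context on entry, so recover it from the argument. */
static void
mymod_timeout(void *arg)
{
	struct mymod_softc *sc = arg;
	struct vnet_model *saved = model_curvnet;

	assert(sc->sc_vnet != NULL);
	model_curvnet = sc->sc_vnet;	/* enter the instance's context */
	sc->sc_ticks++;
	model_curvnet->expired++;	/* touch "virtualized" state */
	model_curvnet = saved;		/* leave it as we found it */
}
```

Each timer firing then updates only the instance recorded in its own
softc, regardless of which thread happens to run the handler.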
>
> Details:  (VNET implementation details)
>
> Firstly the file <net/vnet.h> must be included. Depending on what
> code you use you may find you also need one or more of: <sys/proc.h>,
> <sys/ucred.h> and <sys/jail.h>. These requirements may change slightly
> as the ABI settles.
>
> Having decided which variables need to be virtualized, the definition
> of those variables needs to be modified to use the VNET_DEFINE() macro.
> For example:
>
> static int foo = 3;
> struct bar thebar = { 1,2,3 };
>
> would become:
>
> static VNET_DEFINE(int, foo) = 3;
> VNET_DEFINE(struct bar, thebar) = { 1,2,3 };
>
> extern int foo;
> in an include file might become:
> VNET_DECLARE(int, foo);
>
> Normal rules regarding 'static/extern' apply. The initial values that
> you give in this way will be stored and used as the initial values for
> EACH NEW INSTANCE of these variables as new jails/vnets are created.
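
The "initial values are reused for each new instance" behaviour can be
pictured as copying a template image whenever an instance is created.
The following is a userland sketch of that idea only; struct
mymod_vars, mymod_template and mymod_instance_alloc() are hypothetical
and say nothing about how the kernel actually lays out the storage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bar { int a, b, c; };

/* Hypothetical layout of the virtualized variables of one module. */
struct mymod_vars {
	int foo;
	struct bar thebar;
};

/* The initializers live once, in a template image. */
static const struct mymod_vars mymod_template = {
	.foo = 3,
	.thebar = { 1, 2, 3 },
};

/* Creating an instance copies the template; later writes stay private. */
static struct mymod_vars *
mymod_instance_alloc(void)
{
	struct mymod_vars *v = malloc(sizeof(*v));

	if (v != NULL)
		memcpy(v, &mymod_template, sizeof(*v));
	return (v);
}
```

Changing foo in one instance leaves both the template and every other
instance at the original value of 3.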
>
> As mentioned above, accesses to virtualized symbols are achieved via
> macros, which generally have the same name as the original symbol but
> with a "V_" prepended; thus the head of the interface list, called
> 'ifnet', is replaced wherever used with "V_ifnet".  We do this by
> adding the following lines after the definitions above:
>
> #define V_foo			VNET(foo)
> #define V_thebar		VNET(thebar)
>
> --- side-note ---
> In SCTP, because the code is shared with other OS's, they are
> replaced with a macro MODULE_GLOBAL(modulename, symbol).
> (this may simplify in light of recent changes).
> --------------
>
> In addition, should any of your values need to be changed  or viewed
> via sysctl, the following SYSCTL definitions would be needed:
>
> SYSCTL_VNET_PROC(_net_inet, OID_AUTO, thebar,
>    CTLTYPE_?? | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(thebar), 0,
>    thebar, "?", "the bar is open");
> {[XXX] robert fix this is possible ^^^}
> SYSCTL_VNET_INT(_net_inet, OID_AUTO, foo,
>    CTLFLAG_RW, &VNET_NAME(foo), 0, "size of foo");
>
>
> In the current version of vimage, when VIMAGE is not compiled into
> the kernel, the macros evaluate to a direct reference to the one and
> only symbol/variable, so that there is no speed penalty for those not
> using vnets.
>
> When VIMAGE is compiled in, the macro will evaluate to an access to an
> offset into a data structure that is accessed on a per-vnet basis. The
> vnet used for this is always curvnet. For this reason an attempt to
> access such a variable while curvnet is not valid will result in an
> exception.
>
> To ensure that curvnet has a valid value when needed one needs to
> add the following code on all entry code paths into the networking
> code:
> int
> my_func(int arg)
> {
>        CURVNET_SET(TD_TO_VNET(curthread));
>        do_my_network_stuff(arg);
>        CURVNET_RESTORE();
>        return (0);
> }
>
> The initial value is usually something like "TD_TO_VNET(curthread)",
> which in turn is a macro that derives the vnet affinity from the
> current thread.  It could also be (m->m_ifp->if_vnet) if we were
> receiving an mbuf, or so->so_vnet if we had a socket involved.
>
> Usually, when a packet enters the system it is carried through the
> processing path via a single thread, and that thread will set its
> virtual environment reference to that indicated by the packet on
> picking up that new packet.  This means that in the normal inbound
> processing path as well as the outgoing processing path the current
> thread can be used to indicate the current virtual environment, and
> curvnet will always be valid once most user supplied code is reached.
> In timer events, it is sometimes necessary to add an "outer loop" to
> iterate through all the possible vnets if there is just one timer for
> all instances.
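
The "outer loop" for a single shared timer can be sketched as below.
This is a userland model with hypothetical names (struct vnet_model,
model_vnet_head, model_curvnet, mymod_age_one(), mymod_global_timer());
it also leaves out the locking of the vnet list, which real code must
not forget.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vnet stand-in, kept on a simple singly linked list. */
struct vnet_model {
	struct vnet_model *next;
	int age;			/* "virtualized" per-instance state */
};

static struct vnet_model *model_vnet_head;	/* list of all instances */
static struct vnet_model *model_curvnet;	/* hypothetical "curvnet" */

/* Per-instance work; relies on the current context being set. */
static void
mymod_age_one(void)
{
	assert(model_curvnet != NULL);
	model_curvnet->age++;
}

/* One global timer: loop over every instance, setting context for each. */
static int
mymod_global_timer(void)
{
	struct vnet_model *iter, *saved;
	int visited = 0;

	for (iter = model_vnet_head; iter != NULL; iter = iter->next) {
		saved = model_curvnet;
		model_curvnet = iter;	/* enter this instance */
		mymod_age_one();
		model_curvnet = saved;	/* and leave it again */
		visited++;
	}
	return (visited);
}
```

The per-instance function stays unaware of the loop; it only ever sees
one valid context at a time.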
>
> When a new loadable module is virtualised the module definitions
> and initializers need to be examined. The following example
> illustrates what is needed in the case that you are not loading a new
> protocol or domain (for that see later).
>
> ============= sample skeleton code ==========
>
> /* init on boot or module load */
> static int
> mymod_init(void)
> {
>        int error = 0;
>
>        /* global (once only) initialization goes here */
>        return (error);
> }
>
> /****************
> * Stuff that must be initialized for every instance
> * (including the first of course).
> */
> static int
> mymod_vnet_init(const void *unused)
> {
>        return (0);
> }
>
> /**********************
> * Called for the removal of the last instance only on module unload.
> */
> static void
> mymod_uninit(void)
> {
> }
>
> /***********************
> * Called for the removal of each instance.
> */
> static int
> mymod_vnet_uninit(const void *unused)
> {
>        return (0);
> }
>
> static int
> mymod_modevent(module_t mod, int type, void *unused)
> {
>        int err = 0;
>
>        switch (type) {
>        case MOD_LOAD:
> 		/* check that loading is ok */
>                break;
>
>        case MOD_UNLOAD:
> 		/* check that unloading is ok */
>                break;
>
>        case MOD_QUIESCE:
> 		/* warning: try stop processing */
> 		/* maybe sleep 1 mSec or something to let threads get out */
>                break;
>
>        case MOD_SHUTDOWN:
> 		/*
> 		 * this is called once  but you may want to shut down
> 		 * things in each jail, or something global.
> 		 * In that case it's up to us to simulate the SYSUNINIT()
> 		 * or the VNET_SYSUNINIT()
> 		 */
> 		{
> 			VNET_ITERATOR_DECL(vnet_iter);
> 			VNET_LIST_RLOCK();
> 			VNET_FOREACH(vnet_iter) {
> 				CURVNET_SET(vnet_iter);
> 				mymod_vnet_uninit(NULL);
> 				CURVNET_RESTORE();
> 			}
> 			VNET_LIST_RUNLOCK();
> 		}
> 		/* you may need to shutdown something global. */
> 		mymod_uninit();
>                break;
>
>        default:
>                err = EOPNOTSUPP;
>                break;
>        }
>        return err;
> }
>
> static moduledata_t mymodmod = {
>        "mymod",
>        mymod_modevent,
>        0
> };
>
> /* define execution order using constants from /sys/sys/kernel.h */
> #define MYMOD_MAJOR_ORDER      SI_SUB_PROTO_BEGIN         /* for example */
> #define MYMOD_MODULE_ORDER     (SI_ORDER_ANY + 64)        /* not fussy */
> #define MYMOD_SYSINIT_ORDER    (MYMOD_MODULE_ORDER + 1)   /* a bit later */
> #define MYMOD_VNET_ORDER       (MYMOD_MODULE_ORDER + 2)   /* later still */
>
> DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER, MYMOD_MODULE_ORDER);
> /* depend on ipfw version (exactly) 2 */
> MODULE_DEPEND(mymod, ipfw, 2, 2, 2);
> MODULE_VERSION(mymod, 1);
>
> SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>   mymod_init, NULL);
> SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
>   mymod_uninit, NULL);
>
> VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>   mymod_vnet_init, NULL);
> VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
>   mymod_vnet_uninit, NULL);
>
>
> ========== end sample code =======
>
> On BOOT, the order of evaluation will be:
>  In a NON-VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
>     their order declarations. {good foot shooting material if you get
>     it wrong!}
>
>  In a VIMAGE kernel where the module is compiled in:
>     MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by
>     their order declarations.  AND in addition, the VNET_SYSINIT is
>     repeated once for every existing or new jail/vnet.
>
> On loading a vnet enabled kernel module after boot:
>      MODEVENT("event = load");
>      SYSINIT()
>      VNET_SYSINIT() for every existing jail
>        AND in addition, VNET_SYSINIT being called for each new jail
>        created.
>
> On unloading of module:
>      MODEVENT("event = MOD_QUIESCE")
>      MODEVENT("event = MOD_UNLOAD")
>      VNET_SYSUNINIT called for every jail/vnet
>      SYSUNINIT
>
> On system shutdown:
>      MODEVENT(shutdown)
>
> NOTICE that while the order of the SYSINIT and VNET_SYSINIT is
> reversed from that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTs do not
> follow this rule, and thus it is dangerous to initialise and
> uninitialise things which are order dependent using MODEVENTs.
>
> Or, put another way:
> Since MODEVENT is called first during module load, it would, by the
> assumption that everything is reversed, be easy to assume that
> MODEVENT is called AFTER the SYSUNINITs during unload.  This is in
> fact not the case (and I have the scars to prove it).
>
> It might make some sense if the "QUIESCE" was called before the
> SYSINIT/SYSUNINIT and the UNLOAD called after, with a millisecond
> sleep between them, but this is not the case either.
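
The load and unload sequences listed above can be replayed in a tiny
userland model that simply records the order of calls. It does nothing
but restate the lists in this document; order_log, log_step(),
model_load() and model_unload() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

static char order_log[256];		/* records the call order */

/* Append one step name plus a separating space to the log. */
static void
log_step(const char *name)
{
	strncat(order_log, name, sizeof(order_log) - strlen(order_log) - 1);
	strncat(order_log, " ", sizeof(order_log) - strlen(order_log) - 1);
}

/* Load: MODEVENT first, then SYSINIT, then VNET_SYSINIT per vnet. */
static void
model_load(int nvnets)
{
	log_step("MOD_LOAD");
	log_step("SYSINIT");
	for (int i = 0; i < nvnets; i++)
		log_step("VNET_SYSINIT");
}

/* Unload: the MODEVENTs again come first, NOT last as a mirror would. */
static void
model_unload(int nvnets)
{
	log_step("MOD_QUIESCE");
	log_step("MOD_UNLOAD");
	for (int i = 0; i < nvnets; i++)
		log_step("VNET_SYSUNINIT");
	log_step("SYSUNINIT");
}
```

Reading the log after a load followed by an unload shows that only the
SYSINIT / VNET_SYSINIT part is mirrored; the modevent calls lead in
both directions.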
>
> Since initial values are copied into the virtualized variables
> on each new instantiation, it is quite possible to have modules for
> which some of the above methods are not needed, and they may be left
> out (but not the modevent).
>
> Sometimes there is a need to iterate through the vnets.
> See the modevent shutdown handler (above) for an example of how to do
> this.  Don't forget the locks.
>
> In the case where you are loading a new protocol, or domain (protocol
> family), there are some "shortcuts" in place to allow you to
> maintain a bit more source compatibility with older revisions of
> FreeBSD. It must be added that the sample code above works just fine
> for protocols; however, protocols also have an additional
> initialization vector via the protocol structure, which has a
> pr_init() entry.
> When a protocol is registered using pf_proto_register(), the pr_init()
> for the protocol is called once for every existing vnet. In addition,
> it will be called for each new vnet. The pr_destroy() method will be
> called as well on vnet teardown. The pf_proto_register() function can
> be called either from a modevent handler or from the SYSINIT() if you
> have one, and pf_proto_unregister() called from the SYSUNINIT or the
> unload modevent handler.
>
> If you are adding a whole new protocol domain (protocol family), then
> you should add the VNET_DOMAIN_SET(domainname) (e.g. inet, inet6)
> macro. These use VNET_SYSINIT internally to indirectly call the
> dom_init() and pr_init() functions for each vnet (and the equivalent
> for teardown).  In this case one needs to be absolutely sure that
> both your domain and protocol initializers can be called multiple
> times, once for each vnet. One can still add SYSINITs for once only
> initialization, or use the modevent handler. I prefer to do as much
> explicitly in the SYSINITS and VNET_SYSINITS as then you have no
> surprises.
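
An initializer that is safe to call once per vnet keeps its once-only
work separate from its per-instance work. A minimal userland sketch of
that split, with hypothetical names throughout (struct vnet_model,
myproto_global_init(), myproto_vnet_init()), and not taken from any
real protocol:

```c
#include <assert.h>
#include <stddef.h>

struct vnet_model { int table_ready; };	/* hypothetical vnet stand-in */

static int global_inits;	/* counts once-only setup, for illustration */
static int global_done;

/* Once-only work: in the document's scheme this belongs in a SYSINIT. */
static void
myproto_global_init(void)
{
	if (global_done)
		return;
	global_done = 1;
	global_inits++;
}

/*
 * Per-vnet initializer: called once per instance, so it must only touch
 * state belonging to the instance it is given.
 */
static void
myproto_vnet_init(struct vnet_model *v)
{
	assert(v != NULL);
	myproto_global_init();		/* harmless when repeated */
	v->table_ready = 1;
}
```

Calling myproto_vnet_init() for a second or third instance sets up
that instance only and leaves the global setup count at one.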
>
> Finally:
> The command to make a new jail with a new vnet:
> jail -c host.hostname=test path=/ vnet command=/bin/tcsh
> jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
> (children.max allows hierarchical jail creation).
> Note that the command must come last.
>
>