pf and vimage

Thu Aug 20 16:31:55 UTC 2009

There were some people looking at adding vnet support to pf.
Since we last discussed it, the rules of the game have
changed significantly for the better. With the addition
of some new facilities in FreeBSD, the work needed to virtualize
a module has decreased significantly.

The following doc gives the new rules.

-------------- next part --------------
August 17 2009
Julian Elischer

===================
Vimage: what is it?
===================

Vimage is a framework in the BSD kernel which allows a co-operating module
to operate on multiple independent instances of its state so that it can
participate in a virtual machine / virtual environment scenario. It is
part of the Jail infrastructure in FreeBSD. For historical reasons
"Virtual network stack enabled jails"(1) are also known as "vimage enabled
jails"(2) or "vnet enabled jails"(3).  The currently correct term is the
last of these, which is a contraction of the first. In the future other
parts of the system may be virtualized using the same technology, and the
term covering all such components would be "VIMAGE enhanced modules".

The implementation approach taken by the vimage framework is a redefinition
of selected global state variables to evaluate to constructs that allow for
the virtualized state to be stored and resolved in appropriate instances of
'jail' specific container storage regions.  The code operating on virtualized
state has to conform to a set of rules described further below, among other
things so that all the changes are conditionally compilable, i.e.
permitting the virtualized code to fall back to operation on global state.

The rest of this document will discuss NETWORK virtualization 
though the concepts may be true in the future for other parts of the
system.

The most visible change throughout the existing code is typically replacement
of direct references to global variables with macros; foo_bar thus becomes
V_foo_bar.  V_foo_bar macros will resolve back to the foo_bar global in
default kernel builds, and alternatively to the logical equivalent of
some_base_pointer->_foo_bar for "options VIMAGE" kernel configs.

Prepending of "V_" prefixes to variable references helps in
visual discrimination between global and virtualized state.
It is also possible to use an alternative syntax, of VNET(foo_bar) to
achieve the same thing. The developers felt that V_foo_bar was less
visually distracting while still providing enough clues to the reader
that the variable is virtualized. In fact the V_foo_bar macro is
locally defined near the definition of foo_bar to be an alias for
VNET(foo_bar) so the two are not only equivalent, they are the same.
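
As a rough userland illustration of the aliasing described above (this is
NOT the kernel's actual <net/vnet.h>), the sketch below mocks the two kinds
of build. The names MOCK_VIMAGE, MOCK_VNET, mock_vnet and mock_curvnet are
invented for this example; only foo_bar and V_foo_bar come from the text.

```c
/*
 * Userland mock of the V_foo_bar aliasing; NOT the real <net/vnet.h>.
 * MOCK_VIMAGE, MOCK_VNET, mock_vnet and mock_curvnet are hypothetical names.
 */
#define MOCK_VIMAGE     1       /* 0 models a default (non-VIMAGE) build */

#if MOCK_VIMAGE
struct mock_vnet {
        int _foo_bar;           /* this instance's private copy of foo_bar */
};
static struct mock_vnet *mock_curvnet;  /* stand-in for "some_base_pointer" */
#define MOCK_VNET(n)    (mock_curvnet->_##n)
#else
static int foo_bar;             /* the one and only global */
#define MOCK_VNET(n)    (n)
#endif

/* Defined next to foo_bar, so the two spellings are literally the same. */
#define V_foo_bar       MOCK_VNET(foo_bar)

/* Code using the variable is written once and compiles in either build. */
static int
bump_foo_bar(void)
{
        return (++V_foo_bar);
}
```

With MOCK_VIMAGE set to 1, calling bump_foo_bar() after pointing
mock_curvnet at different instances modifies each instance's copy
independently; with 0 it simply increments the single global.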

The framework also extends the sysctl infrastructure to support access to
virtualized state through introduction of the SYSCTL_VNET family of macros;
those also automatically fall back to their standard SYSCTL counterparts
in default kernel builds.

Transparent libkvm(3) lookups are provided to virtualized variables
which permits userland binaries such as netstat to operate unmodified
on "options VIMAGE" kernels, though this may have some security implications.

Vnets are associated with jails.  In 8.0, every process is associated with
a jail, usually the default (null) jail, and jails currently hang off of
a process's ucred.  This relationship defines a process's administrative
affinity to a vnet and thus indirectly to all of its state. All network
interfaces and sockets hold pointers back to their associated vnets.
This relationship is obviously entirely independent from proc->ucred->jail
bindings.  Hence, when a process opens a socket, the socket will get bound
to a vnet instance hanging off of proc->ucred->jail->vnet, but once such a
socket->vnet binding gets established, it cannot be changed for the entire
socket lifetime.

The mapping from a thread to a vnet should always be done via the
TD_TO_VNET macro, as the path may change in the future as we get more
experience with using the system.

Certain classes of network interfaces (Ethernet in particular) can be
reassigned from one vnet to another at any time.  By definition all vnets
are independent and can communicate only if they are explicitly
provided with communication paths. Currently netgraph is the main way to
establish inter-vnet datapaths, though other paths are being explored,
such as the 'epair' back-to-back virtual interface pair, in which
the two sides may exist in different jails.

In network traffic processing the vnet affinity is defined either by the
inbound interface or by the socket / pcb -> vnet binding.  However, there
are many functions in the network stack that cannot implicitly fetch
the vnet context from their standard arguments.  Instead of explicitly
extending argument lists of such functions with a struct vnet *,
the concept of a "current vnet", a per-thread variable, was introduced,
which can be fetched efficiently via the curvnet macro.  The correct
network context has to be set on entry to the network stack (socket
operations, packet reception, or timer-driven functions) and cleared on exit.
This must be done via the provided CURVNET_SET() / CURVNET_RESTORE() family of
macros, which allow for "stacking" of curvnet context setting and provide
additional debugging info in INVARIANTS kernel configs.  In most cases
however a developer writing virtualized code will not have to set /
restore the curvnet context unless the code would include timer-driven
events, given that those are inherently vnet-contextless on entry.
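
To make the "stacking" behaviour concrete, here is a minimal userland
sketch, assuming the set macro saves the previous context in a local
variable and the restore macro puts it back. The MOCK_ macros, struct
mock_vnet and the two entry functions are hypothetical stand-ins, not the
kernel definitions, and the INVARIANTS debugging support is not modelled.

```c
#include <stddef.h>

/* Userland mock; the MOCK_ names and struct mock_vnet are hypothetical. */
struct mock_vnet {
        int id;
};

static struct mock_vnet *mock_curvnet;  /* NULL outside the "network stack" */

/* Remember the caller's context in a local, then switch to the new one. */
#define MOCK_CURVNET_SET(v)                                     \
        struct mock_vnet *saved_vnet = mock_curvnet;            \
        mock_curvnet = (v)
/* Put back whatever was current when the matching SET ran. */
#define MOCK_CURVNET_RESTORE()                                  \
        mock_curvnet = saved_vnet

/* Innermost entry point: runs with its own context, then unwinds. */
static int
inner_entry(struct mock_vnet *v)
{
        int seen;

        MOCK_CURVNET_SET(v);
        seen = mock_curvnet->id;
        MOCK_CURVNET_RESTORE();
        return (seen);
}

/* Outer entry point that calls into another one with a different vnet. */
static int
outer_entry(struct mock_vnet *a, struct mock_vnet *b)
{
        int inner_saw, outer_intact;

        MOCK_CURVNET_SET(a);
        inner_saw = inner_entry(b);
        outer_intact = (mock_curvnet == a);     /* inner restore worked */
        MOCK_CURVNET_RESTORE();
        return (outer_intact && inner_saw == b->id);
}
```

Because the set macro declares a local variable, it has to sit with the
declarations at the top of a block, and every return path must pass through
the matching restore.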

The current rule is that when not in networking code, the 'curvnet' macro
evaluates to NULL, and evaluating a V_xxx (or VNET(xxx)) macro will
result in a kernel page-fault error. While this is not strictly
necessary, it aids in debugging and assurance of program correctness.
Note this does NOT mean that TD_TO_VNET(curthread) is invalid.
A thread is always associated with a vnet; only the efficient
"curvnet" access method is disabled, along with the ability to resolve
virtualized symbols.

Converting / virtualizing existing code
=======================================

There are several steps needed in virtualisation.

1/ Decide whether the module needs to be virtualised.

   If the module is a driver for specific hardware, it makes sense that
   there be only one instance of the driver as there is only one piece of
   physical hardware.  There are changes in the networking code to allow
   physical (or virtual) interfaces to be moved between vnets.  This
   generally requires NO changes to the network drivers of the classes
   covered (e.g. ethernet). Currently, if your module does not have any
   networking facet, the answer is "no" by default.

2/ If the module is to be virtualised, decide which attributes of the 
   module should be virtualised. 

   For example, it may make sense that there be a single central pool
   of "struct foo" and a single uma zone for them to come from, with a single
   lock guarding it. It might also make sense if the "foo_debug" sysctl
   controls all the instances at once, while on the other hand, the
   "foo_mode" sysctl might make better sense if it were controllable 
   on a virtual system by virtual system basis.

3/ Work out what global variables and structures are to be virtualised to 
   achieve the behaviour required for part #2.

4/ Work out for all the code paths through the module, how the thread entering
   the module can divine which virtual environment it is on.

   Some examples:
   * Since interfaces are all assigned to one vnet or another, an incoming
     packet has a pointer to the receive interface, which in turn has a
     pointer back to the vnet. Often "curvnet" will already have been set
     by the time your code is called anyhow.
   * Similarly, on any request from outside the kernel, (direct or indirect)
     the current thread has a way to get to the current virtual environment
     instance via TD_TO_VNET(curthread).  For existing sockets the vnet
     context must be used via so->so_vnet since the thread's vnet might
     change after socket creation.
   * Timer initiated actions usually have a (void *) argument which points to 
     some private structure for the module. It should be possible to add 
     a pointer to the appropriate module instance into whatever structure
     that points to.
   * Sometimes an action (timer triggered, or triggered by module load or
     unload) simply has to check all the vimage or module instances.
     There are macros (used in pairs) for this which will iterate through
     all the VNET or module instances (see sample code below).

   This covers most of the cases, however in some cases it may still be
   required for the module to stash away the virtual environment instance
   somewhere, and make associated changes in the code.
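
   As a hedged userland sketch of the timer case in the list above: the
   private structure handed to the handler carries a pointer to its vnet,
   and the handler switches context on entry. struct mymod_instance, its
   fields, mymod_timeout() and the mock_ names are all hypothetical; in the
   kernel the context switch would be done with CURVNET_SET() and
   CURVNET_RESTORE().

```c
#include <stddef.h>

/* Userland mock; every name here is hypothetical. */
struct mock_vnet {
        int id;
};

static struct mock_vnet *mock_curvnet;  /* timers start with no context */

/* Per-instance private state; the vnet pointer is stashed at creation. */
struct mymod_instance {
        struct mock_vnet *vnet;
        int fired;              /* how many times the timer ran */
        int fired_in;           /* id of the context it ran in */
};

/* Timer handler: the only clue to "which vnet" is the (void *) argument. */
static void
mymod_timeout(void *arg)
{
        struct mymod_instance *inst = arg;
        struct mock_vnet *saved = mock_curvnet;

        mock_curvnet = inst->vnet;      /* in-kernel: CURVNET_SET(inst->vnet) */
        inst->fired++;
        inst->fired_in = mock_curvnet->id;
        mock_curvnet = saved;           /* in-kernel: CURVNET_RESTORE() */
}
```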

5/ Decide which parts of the initialization and teardown are per jail and
   which parts are global, and separate out the code accordingly.
   Global initialization is done using the SYSINIT facility.
   Per jail initialization is done using VNET_SYSINIT().
   Per jail teardown is done using VNET_SYSUNINIT().
   Global teardown is done using SYSUNINIT().
   In addition, the modevent handler is called with various event types before
   any of these are called. The modevent handler may veto load or teardown.
   On Shutdown, only the modevent handler is called so it may have to simulate
   the calling of the other handlers if clean shutdown is a requirement
   of your module. (see sample code below). Don't forget to unregister 
   event handlers, and destroy locks and condition variables.

6/ Add the code described below to the files that make up the module.

Details:  (VNET implementation details)

Firstly the file <net/vnet.h> must be included. Depending on what
code you use you may find you also need one or more of: <sys/proc.h>, 
<sys/ucred.h> and <sys/jail.h>. These requirements may change slightly
as the ABI settles.

Having decided which variables need to be virtualized, the definition
of those variables needs to be modified to use the VNET_DEFINE() macro.
For example: 

static int foo = 3;
struct bar thebar = { 1,2,3 };

would become:

static VNET_DEFINE(int, foo) = 3;
VNET_DEFINE(struct bar, thebar) = { 1,2,3 };

extern int foo;
in an include file would become:
VNET_DECLARE(int, foo);

Normal rules regarding 'static/extern' apply. The initial values that you
give in this way will be stored and used as the initial values for 
EACH NEW INSTANCE of these variables as new jails/vnets are created.
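
The effect of those initial values can be pictured with the userland sketch
below, which assumes a single template holding the initializers and a plain
copy of it for every new instance. struct mock_vnet_data, mock_template and
mock_vnet_create() are hypothetical; the kernel's own bookkeeping for
virtualized variables is not shown here.

```c
#include <stdlib.h>
#include <string.h>

/* Userland mock; all names are hypothetical. */
struct mock_bar {
        int a, b, c;
};

/* One slot per virtualized variable of the module. */
struct mock_vnet_data {
        int foo;
        struct mock_bar thebar;
};

/* The initializers written in the source live in a read-only template. */
static const struct mock_vnet_data mock_template = { 3, { 1, 2, 3 } };

/* Each new instance starts life as a byte copy of the template. */
static struct mock_vnet_data *
mock_vnet_create(void)
{
        struct mock_vnet_data *d;

        d = malloc(sizeof(*d));
        if (d != NULL)
                memcpy(d, &mock_template, sizeof(*d));
        return (d);
}
```

Changing foo in one instance leaves both the template and every other
instance untouched, so a later mock_vnet_create() again starts with foo
set to 3.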

As mentioned above, accesses to virtualized symbols are achieved via macros,
which generally are of the same name as the original symbol but with a "V_"
prepended, thus the head of the interface list, called 'ifnet', is replaced
wherever used with "V_ifnet".  We do this by adding the following
lines after the definitions above:

#define V_foo			VNET(foo)
#define V_thebar		VNET(thebar)

--- side-note ---
In SCTP, because the code is shared with
other OS's they are replaced with a macro MODULE_GLOBAL(modulename, symbol).
(this may simplify in light of recent changes).
--------------

In addition, should any of your values need to be changed or viewed
via sysctl, the following SYSCTL definitions would be needed:

SYSCTL_VNET_PROC(_net_inet, OID_AUTO, thebar,
    CTLTYPE_?? | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(thebar), 0,
    thebar, "?", "the bar is open");
{[XXX] robert fix this if possible ^^^}
SYSCTL_VNET_INT(_net_inet, OID_AUTO, foo,
    CTLFLAG_RW, &VNET_NAME(foo), 0, "size of foo");

In the current version of vimage, when VIMAGE is not compiled into
the kernel, the macros evaluate to a direct reference to the one and only
symbol/variable, so that there is no speed penalty for those not using vnets.

When VIMAGE is compiled in, the macro will evaluate to an access to an offset
into a data structure that is accessed on a per-vnet basis. The vnet
used for this is always curvnet. For this reason an attempt to access
such a variable while curvnet is not valid will result in an exception.

To ensure that curvnet has a valid value when needed one needs to
add the following code on all entry code paths into the networking code:

int
my_func(int arg)
{
        CURVNET_SET(TD_TO_VNET(curthread));
        do_my_network_stuff(arg);
        CURVNET_RESTORE();
        return (0);
}

The value handed to CURVNET_SET() is usually something like
"TD_TO_VNET(curthread)", which in turn is a macro that derives the vnet
affinity from the current thread.  It could also be (m->m_ifp->if_vnet) if
we were receiving an mbuf, or so->so_vnet if we had a socket involved.

Usually, when a packet enters the system it is carried through the processing
path via a single thread, and that thread will set its virtual environment
reference to that indicated by the packet on picking up that new packet.
This means that in the normal inbound processing path as well as the
outgoing processing path the current thread can be used to indicate the
current virtual environment, and curvnet will always be valid by the time
most user supplied code is reached. In timer events, it is sometimes
necessary to add an "outer loop" to iterate through all the possible vnets
if there is just one timer for all instances.
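
A userland sketch of such an "outer loop" follows, assuming the vnets sit
on a simple linked list. MOCK_VNET_FOREACH, struct mock_vnet, the expired
counter and the mymod_ functions are hypothetical; the comments mark where
the kernel iteration, locking and context macros used in the shutdown
handler of the sample code would go.

```c
#include <stddef.h>

/* Userland mock; all names are hypothetical. */
struct mock_vnet {
        struct mock_vnet *next;
        int expired;            /* per-vnet state touched by the timer */
};

static struct mock_vnet *mock_vnet_head;        /* list of all instances */
static struct mock_vnet *mock_curvnet;

#define MOCK_VNET_FOREACH(it)                                   \
        for ((it) = mock_vnet_head; (it) != NULL; (it) = (it)->next)

/* Per-vnet work; like V_xxx accesses, it relies on the current context. */
static void
mymod_expire(void)
{
        mock_curvnet->expired++;
}

/* The single, global timer handler: visit every vnet in turn. */
static int
mymod_global_timer(void)
{
        struct mock_vnet *it, *saved;
        int visited = 0;

        /* in-kernel: VNET_LIST_RLOCK() */
        MOCK_VNET_FOREACH(it) {
                saved = mock_curvnet;   /* in-kernel: CURVNET_SET(it) */
                mock_curvnet = it;
                mymod_expire();
                mock_curvnet = saved;   /* in-kernel: CURVNET_RESTORE() */
                visited++;
        }
        /* in-kernel: VNET_LIST_RUNLOCK() */
        return (visited);
}
```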

When a new loadable module is virtualised, the module definitions
and initializers need to be examined. The following example illustrates
what is needed in the case that you are not loading a new protocol or domain
(for that case, see later).

============= sample skeleton code ==========

/* init on boot or module load */
static int
mymod_init(void)
{
        int error = 0;

        /* one-time (global) setup goes here; set error on failure */
        return (error);
}

/****************
 * Stuff that must be initialized for every instance
 * (including the first of course).
 */
static int
mymod_vnet_init(const void *unused)
{
        return (0);
}

/**********************
 * Called for the removal of the last instance only on module unload.
 */
static void
mymod_uninit(void)
{
}

/***********************
 * Called for the removal of each instance.
 */
static int
mymod_vnet_uninit(const void *unused)
{
        return (0);
}

/* Module event handler; it may veto a load or an unload. */
static int
mymod_modevent(module_t mod, int type, void *unused)
{
        int err = 0;

        switch (type) {
        case MOD_LOAD:
		/* check that loading is ok */
                break;

        case MOD_UNLOAD:
		/* check that unloading is ok */
                break;

        case MOD_QUIESCE:
		/* warning: try stop processing */
		/* maybe sleep 1 mSec or something to let threads get out */
                break;

        case MOD_SHUTDOWN:
		/*
		 * this is called once  but you may want to shut down
		 * things in each jail, or something global.
		 * In that case it's up to us to simulate the SYSUNINIT()
		 * or the VNET_SYSUNINIT()
		 */
		{
			VNET_ITERATOR_DECL(vnet_iter);
			VNET_LIST_RLOCK();
			VNET_FOREACH(vnet_iter) {
				CURVNET_SET(vnet_iter); 
				mymod_vnet_uninit(NULL);
				CURVNET_RESTORE();
			}
			VNET_LIST_RUNLOCK();
		}
		/* you may need to shutdown something global. */
		mymod_uninit(); 
                break;

        default:
                err = EOPNOTSUPP;
                break;
        }
        return (err);
}

static moduledata_t mymodmod = {
        "mymod",
        mymod_modevent,
        0
};

/* define execution order using constants from /sys/sys/kernel.h */
#define MYMOD_MAJOR_ORDER      SI_SUB_PROTO_BEGIN         /* for example */
#define MYMOD_MODULE_ORDER     (SI_ORDER_ANY + 64)        /* not fussy */
#define MYMOD_SYSINIT_ORDER    (MYMOD_MODULE_ORDER + 1)   /* a bit later */
#define MYMOD_VNET_ORDER       (MYMOD_MODULE_ORDER + 2)   /* later still */

DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER, MYMOD_MODULE_ORDER);
MODULE_DEPEND(mymod, ipfw, 2, 2, 2); /* depend on ipfw version (exactly) 2 */
MODULE_VERSION(mymod, 1);

SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
   mymod_init, NULL);
SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
   mymod_uninit, NULL);

VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
   mymod_vnet_init, NULL);
VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
   mymod_vnet_uninit, NULL);

========== end sample code =======

On BOOT, the order of evaluation will be:
  In a NON-VIMAGE kernel where the module is compiled in:
     MODEVENT, SYSINIT and VNET_SYSINIT all run, with order defined by their
     order declarations. {good foot shooting material if you get it wrong!}

  In a VIMAGE kernel where the module is compiled in:
     MODEVENT, SYSINIT and VNET_SYSINIT all run, with order defined by their
     order declarations.  AND in addition, the VNET_SYSINIT is
     repeated once for every existing or new jail/vnet.

On loading a vnet enabled kernel module after boot:
      MODEVENT("event = load");
      SYSINIT()
      VNET_SYSINIT() for every existing jail
        AND in addition, VNET_SYSINIT being called for each new jail created.

On unloading of module:
      MODEVENT("event = MOD_QUIESCE")
      MODEVENT("event = MOD_UNLOAD")
      VNET_SYSUNINIT called for every jail/vnet
      SYSUNINIT

On system shutdown:
      MODEVENT(shutdown)

NOTICE that while the order of the SYSINIT and VNET_SYSINIT is reversed from
that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTS do not follow
this rule and thus it is dangerous to initialise and uninitialise
things which are order dependent using MODEVENTs.

Or, put another way:
since MODEVENT is called first during module load, it would, by the
assumption that everything is reversed, be easy to assume that MODEVENT
is called AFTER the SYSUNINITs during unload.  This is in fact not
the case (and I have the scars to prove it).

It might make some sense if the "QUIESCE" were called before the
SYSINIT/SYSUNINIT and the UNLOAD called after, with a millisecond
sleep between them, but this is not the case either.
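
The asymmetry can be seen by simply replaying the load and unload sequences
listed above. The userland sketch below only records the step names in the
order this document gives for them; it does not call any kernel facility,
and mock_record(), mock_module_load() and mock_module_unload() are
hypothetical helpers.

```c
#include <stddef.h>

/* Userland replay of the orderings listed above; names are hypothetical. */
#define MOCK_MAX_STEPS  32

static const char *mock_steps[MOCK_MAX_STEPS];
static size_t mock_nsteps;

static void
mock_record(const char *what)
{
        if (mock_nsteps < MOCK_MAX_STEPS)
                mock_steps[mock_nsteps++] = what;
}

/* Loading after boot: modevent first, then global, then per-vnet init. */
static void
mock_module_load(int nvnets)
{
        int i;

        mock_record("MOD_LOAD");
        mock_record("SYSINIT");
        for (i = 0; i < nvnets; i++)
                mock_record("VNET_SYSINIT");
}

/* Unloading: the modevents are again FIRST, not last. */
static void
mock_module_unload(int nvnets)
{
        int i;

        mock_record("MOD_QUIESCE");
        mock_record("MOD_UNLOAD");
        for (i = 0; i < nvnets; i++)
                mock_record("VNET_SYSUNINIT");
        mock_record("SYSUNINIT");
}
```

Reading mock_steps[] after a load followed by an unload shows the init
handlers mirrored by the uninit handlers, while the modevent entries lead
both halves.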

Since initial values are copied into the virtualized variables
on each new instantiation, it is quite possible to have modules for which
some of the above methods are not needed, and they may be left out.
(but not the modevent).

Sometimes there is a need to iterate through the vnets.
See the modevent shutdown handler (above) for an example of how to do this.
Don't forget the locks.

In the case where you are loading a new protocol, or domain (protocol family)
there are some "shortcuts" that are in place to allow you to maintain a bit
more source compatibility with older revisions of FreeBSD. It must be
added that the sample code above works just fine for protocols; however,
protocols also have an additional initialization vector via the
protocol structure, which has a pr_init() entry.
When a protocol is registered using pf_proto_register(), the pr_init()
for the protocol is called once for every existing vnet. In addition,
it will be called for each new vnet. The pr_destroy() method will be called
as well on vnet teardown. The pf_proto_register() function can be called
either from a modevent handler or from the SYSINIT() if you have one, and
pf_proto_unregister() called from the SYSUNINIT or the unload
modevent handler.
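
The calling pattern for pr_init() and pr_destroy() described above can be
sketched in userland as follows. This is only a model of the behaviour as
stated in this document: struct mock_protosw, mock_proto_register(),
mock_vnet_create(), mock_vnet_destroy() and the myproto_ names are
hypothetical, vnets are reduced to small integer ids, and at most one
protocol is tracked.

```c
#include <stddef.h>

/* Userland mock; all names are hypothetical. */
#define MOCK_MAX_VNETS  8

struct mock_protosw {
        void (*pr_init)(void);          /* per-vnet setup */
        void (*pr_destroy)(void);       /* per-vnet teardown */
};

static int mock_curvnet = -1;           /* id of current vnet, -1 if none */
static int mock_vnet_alive[MOCK_MAX_VNETS];
static const struct mock_protosw *mock_proto;

/* Run fn with the given vnet as current context, then restore. */
static void
mock_call_in_vnet(void (*fn)(void), int id)
{
        int saved = mock_curvnet;

        mock_curvnet = id;
        fn();
        mock_curvnet = saved;
}

/* Registration: pr_init() runs once for every vnet that already exists. */
static void
mock_proto_register(const struct mock_protosw *p)
{
        int id;

        mock_proto = p;
        for (id = 0; id < MOCK_MAX_VNETS; id++)
                if (mock_vnet_alive[id])
                        mock_call_in_vnet(p->pr_init, id);
}

/* New vnet: pr_init() runs for it if a protocol is registered. */
static void
mock_vnet_create(int id)
{
        if (id < 0 || id >= MOCK_MAX_VNETS)
                return;
        mock_vnet_alive[id] = 1;
        if (mock_proto != NULL)
                mock_call_in_vnet(mock_proto->pr_init, id);
}

/* Vnet teardown: pr_destroy() runs before the vnet goes away. */
static void
mock_vnet_destroy(int id)
{
        if (id < 0 || id >= MOCK_MAX_VNETS || !mock_vnet_alive[id])
                return;
        if (mock_proto != NULL)
                mock_call_in_vnet(mock_proto->pr_destroy, id);
        mock_vnet_alive[id] = 0;
}

/* Example protocol whose initializers are safe to call once per vnet. */
static int myproto_state[MOCK_MAX_VNETS];

static void
myproto_init(void)
{
        myproto_state[mock_curvnet] = 1;
}

static void
myproto_destroy(void)
{
        myproto_state[mock_curvnet] = 0;
}

static const struct mock_protosw myproto = { myproto_init, myproto_destroy };
```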

If you are adding a whole new protocol domain (protocol family), then
you should add the VNET_DOMAIN_SET(domainname) macro (e.g. inet, inet6).
This uses VNET_SYSINIT internally to indirectly call the
dom_init() and pr_init() functions for each vnet (and the equivalent for
teardown).  In this case one needs to be absolutely sure that both your
domain and protocol initializers can be called multiple times, once for
each vnet. One can still add SYSINITs for once-only initialization,
or use the modevent handler. I prefer to do as much as possible explicitly
in the SYSINITs and VNET_SYSINITs, as then you have no surprises.

Finally:
The command to make a new jail with a new vnet:
jail -c host.hostname=test path=/ vnet command=/bin/tcsh
jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
(children.max allows hierarchical jail creation).
Note that the command must come last.