Vimage howto

Mon Dec 8 00:47:54 PST 2008

Well, not completely, but I've had a number of questions over the
last few months about what it is. Since Marko and I have written
the following "how to virtualize your module" document, I've been
directing people to it. After another couple of questions I think
this could do with wider distribution.

It is available at:

http://perforce.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/vimage/porting_to_vimage.txt

but I include it here for popular enjoyment.

Please contact me or Marko if you have any questions or suggestions on 
this.
-------------- next part --------------

===================
Vimage: what is it?
===================

Vimage is a framework in the BSD kernel which allows a co-operating module
to operate on multiple independent instances of its state so that it can
participate in a virtual machine / virtual environment scenario.

The implementation approach taken by the vimage framework is a replacement
of selected global state variables with constructs that allow for the
virtualized state to be stored and resolved in appropriate instances of
module-specific container structures.  The code operating on virtualized state
has to conform to a set of rules described further below, among other things
in order to allow for all the changes to be conditionally compilable, i.e.
permitting the virtualized code to fall back to operation on global state.

The most visible change throughout the existing code is typically replacement
of direct references to global variables with macros; foo_bar thus becomes
V_foo_bar.  V_foo_bar macros will resolve back to foo_bar global in default
kernel builds, and alternatively to some_base_pointer->_foo_bar for "options
VIMAGE" kernel configs.  Prepending of "V_" prefixes to variable references
helps in visual discrimination between global and virtualized state.  The
framework extends the sysctl infrastructure to support access to virtualized
state through introduction of the SYSCTL_V family of macros; those also
automatically fall back to their standard SYSCTL counterparts in default
kernel builds.  Transparent kldsym(2) lookups are provided to virtualized
variables explicitly marked for visibility to kldsym interface, which permits
userland binaries such as netstat to operate unmodified on "options VIMAGE"
kernels, though this may have wide security implications.
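
The following is a minimal userland sketch of the macro pattern just
described, not the kernel's actual definitions.  The names struct vnet_foo,
INIT_VNET_FOO and foo_bump() are hypothetical stand-ins chosen for
illustration; building with -DVIMAGE selects the virtualized variant, and
the default build falls back to a plain global.

```c
#include <stddef.h>

#ifdef VIMAGE
/* Hypothetical per-instance container for the module's state. */
struct vnet_foo {
	int	_foo_bar;
};
/* V_foo_bar resolves through a base pointer that must be in scope. */
#define	V_foo_bar		(vnet_foo->_foo_bar)
#define	INIT_VNET_FOO(base)	struct vnet_foo *vnet_foo = (base)
#else
/* Default build: one global, and V_foo_bar is just another name for it. */
static int foo_bar;
#define	V_foo_bar		foo_bar
#define	INIT_VNET_FOO(base)	/* evaluates to nothing */
#endif

/* The function body is written once and is identical in both builds. */
static int
foo_bump(void *base)
{
	INIT_VNET_FOO(base);

	(void)base;		/* unused in the default build */
	V_foo_bar++;
	return (V_foo_bar);
}
```

In the default build foo_bump(NULL) simply increments the global; in the
-DVIMAGE build the caller has to pass a pointer to the instance it wants to
operate on.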

The vimage struct is currently primarily a placeholder for pointers to
module-specific struct instances; currently V_NET (networking), V_CPU
(CPU scheduling), and V_PROCG (jail-style interprocess protection) major
module classes are defined.  Each vimage module may or may not be further
split into minor or submodules; the networking subsystem (vimage id V_NET;
struct vnet) in particular is organized in submodules such as VNET_MOD_NET
(mandatory shared infrastructure: routing tables, interface lists etc.);
VNET_MOD_INET (IPv4 state including transport protocols); VNET_MOD_INET6,
VNET_MOD_IPSEC, VNET_MOD_IPFW, VNET_MOD_NETGRAPH etc.  The speciality of
VNET submodules is in that they not only provide storage for virtualized
data, but also enforce ordering of initialization and cleanup.  Hence, not
all submodules must necessarily allocate private storage for their specific
data; they may be defined solely to support proper initialization
ordering.
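
To illustrate the ordering role of submodules, here is a small userland
sketch that works out an attach order from "depends on" links.  The ids, the
struct modinfo layout and attach_order() are all hypothetical; the real
framework keeps its own registration data, and this only shows the idea
that a submodule is attached after the one it depends on, whether or not
it has private storage (struct_size of 0).

```c
#include <stddef.h>

/* Hypothetical submodule ids, for this sketch only. */
enum { MOD_NONE = -1, MOD_NET, MOD_INET, MOD_IPFW, MOD_MAX };

struct modinfo {
	int		 id;		/* this submodule */
	int		 dependson;	/* attach after this one, or MOD_NONE */
	size_t		 struct_size;	/* 0: ordering only, no private storage */
	const char	*name;
};

/*
 * Fill order[] with indexes into mods[] so that every submodule appears
 * after the one it depends on.  Returns the number of entries written,
 * which is less than n if a dependency is missing or circular.
 */
static size_t
attach_order(const struct modinfo *mods, size_t n, size_t *order)
{
	int attached[MOD_MAX] = { 0 };
	size_t done = 0, i;
	int progress = 1;

	while (done < n && progress) {
		progress = 0;
		for (i = 0; i < n; i++) {
			if (attached[mods[i].id])
				continue;
			if (mods[i].dependson != MOD_NONE &&
			    !attached[mods[i].dependson])
				continue;	/* dependency not ready yet */
			attached[mods[i].id] = 1;
			order[done++] = i;
			progress = 1;
		}
	}
	return (done);
}
```

Cleanup would walk the same list in reverse, so that a submodule is torn
down before the one it depends on.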

Each process is associated with a vimage, and vimages currently hang off of
ucred-s.  This relationship defines a process's administrative affinity
to a vimage and thus indirectly to all of its modules (NET, CPU, PROCG)
as well as to any submodules.  All network interfaces and sockets hold
pointers back to their parent vnets; this relationship is obviously entirely
independent from proc->ucred->vimage bindings.  Hence, when a process
opens a socket, the socket will get bound to a vnet instance hanging off of
proc->ucred->vimage->vnet, but once such a socket->vnet binding gets
established, it cannot be changed for the entire socket lifetime.  Certain
classes of network interfaces (Ethernet in particular) can be assigned
from one vnet to another at any time.  By definition all vnets are
independent and can communicate only if they are explicitly provided
with communication paths; currently only netgraph can be used to establish
inter-vnet datapaths.
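
The difference between the two bindings can be sketched in userland as
follows.  The structures are heavily simplified stand-ins (the field names
follow the proc->ucred->vimage->vnet chain used above, not the real kernel
layouts), and sock_create() / sock_vnet() are hypothetical helpers.

```c
#include <stddef.h>

struct vnet   { int id; };
struct vimage { struct vnet *vnet; };
struct ucred  { struct vimage *vimage; };
struct proc   { struct ucred *ucred; };
struct socket { struct vnet *so_vnet; };

/* Bind the new socket to the creating process's vnet, exactly once. */
static void
sock_create(struct socket *so, const struct proc *p)
{
	so->so_vnet = p->ucred->vimage->vnet;
}

/* Later operations use the socket's own binding, not the process's. */
static struct vnet *
sock_vnet(const struct socket *so)
{
	return (so->so_vnet);
}
```

If the process is later associated with a different vimage, sock_vnet()
still returns the vnet recorded at creation time.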

In network traffic processing the vnet affinity is defined either by the
inbound interface or by the socket / pcb -> vnet binding.  However, there
are many functions in the network stack that cannot implicitly fetch
the vnet context from their standard arguments.  Instead of explicitly
extending argument lists of such functions with a struct vnet *,
a per-thread variable td_vnet was introduced, which can be fetched via
the curvnet macro (#define curvnet curthread->td_vnet).  The curvnet
context has to be set on entry to the network stack (socket operations,
packet reception, or timer-driven functions) and cleared on exit.  This
must be done via provided CURVNET_SET() / CURVNET_RESTORE() family of
macros, which allow for "stacking" of curvnet context setting and provide
additional debugging info in INVARIANTS kernel configs.  In most cases
however a developer writing virtualized code will not have to set /
restore the curvnet context unless the code would include timer-driven
events, given that those are inherently vnet-contextless on entry.
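
The "stacking" behaviour can be sketched as below.  These are simplified
userland stand-ins written for this example, not the definitions from the
vimage headers: curvnet is a plain static instead of curthread->td_vnet,
there is no INVARIANTS debugging, and inner() / outer() are hypothetical.

```c
#include <stddef.h>

struct vnet { int id; };

/* Stand-in for curthread->td_vnet; this sketch has a single thread. */
static struct vnet *curvnet;

/* Remember the previous context in a local, then switch. */
#define	CURVNET_SET(arg)				\
	struct vnet *saved_vnet = curvnet;		\
	curvnet = (arg)
/* Put back whatever was current when the matching SET ran. */
#define	CURVNET_RESTORE()	curvnet = saved_vnet

static int
inner(struct vnet *vn)
{
	int seen;

	CURVNET_SET(vn);	/* nested set; saved_vnet is local here */
	seen = curvnet->id;
	CURVNET_RESTORE();
	return (seen);
}

static int
outer(struct vnet *a, struct vnet *b)
{
	int seen_inner, ok;

	CURVNET_SET(a);
	seen_inner = inner(b);
	/* After inner() returns, the outer context is current again. */
	ok = (seen_inner == b->id && curvnet == a);
	CURVNET_RESTORE();
	return (ok);
}
```

Because each CURVNET_SET() keeps its saved value in the enclosing block,
a set / restore pair can be nested inside another without losing the outer
context.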

Converting / virtualizing existing code
=======================================

There are several steps needed in virtualisation.

1/ decide whether the module needs to be virtualised.

   if the module is a driver for specific hardware, it makes sense that
   there be only one instance of the driver as there is only one piece of
   physical hardware.  There are changes in the networking code to allow
   physical (or virtual) interfaces to be moved between vnets.  This
   generally requires NO changes to the network drivers of the classes
   covered (e.g. ethernet).

2/ decide if your module is part of one of the major module groups.
   These are currently V_NET, V_PROCG and V_CPU.

   The reader will note that the descriptions below use the acronym VNET
   a lot.  The vimage system is currently broken into a number of
   subsections, one of which is the "VNET" group.  The idea of these
   subsections is that they might be individually selected as
   virtualizable in a particular virtual machine instance.

   As an example, in a virtualization, one might want to allocate a couple
   of processors to it but keep the same filesystem and network setup, or
   alternatively share processors but have virtualised networking.

3/ If the module is to be virtualised, decide which attributes of the 
   module should be virtualised. 

   For example, it may make sense that there be a single central pool
   of "struct foo" and a single uma zone for them to come from, with a single
   lock guarding it. It might also make sense if the "foo_debug" sysctl
   controls all the instances at once, while on the other hand, the
   "foo_mode" sysctl might make better sense if it were controllable
   on a virtual system by virtual system basis.

4/ Work out what global variables and structures are to be virtualised to 
   achieve the behaviour required for part #3.

5/ Work out, for all the code paths through the module, how the path entering
   the module can divine which virtual environment it is on.

   Some examples:
   * Since interfaces are all assigned to one vnet or another, an incoming
     packet has a pointer to the receive interface, which in turn has a
     pointer back to the vnet. Often "curvnet" will already have been set
     by the time your code is called anyhow.
   * Similarly, on any request from outside the kernel, (direct or indirect)
     the current thread has a way to get to the current virtual environment
     instance via td->ucred->vimage.  For existing sockets the vnet context
     must be used via so->so_vnet since td->ucred->vimage might change after
     socket creation.
   * Timer initiated actions usually have a (void *) argument which points to 
     some private structure for the module. It should be possible to add 
     a pointer to the appropriate module instance into whatever structure
     that points to.
   * Sometimes an action (timer triggered, or triggered by module load or
     unload) simply has to check all the vimage or module instances.
     There are macros (used in pairs) for this which will iterate through
     all the VNET or VPROCG instances.

   This covers most of the cases, however in some cases it may still be
   required for the module to stash away the virtual environment instance
   somewhere, and make associated changes in the code.

6/ Add the code described below to the files that make up the module
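
To make steps 3 and 4 concrete, the split between shared and per-instance
state for the "foo_debug" / "foo_mode" example might look like the sketch
below.  The rule implemented by foo_should_log() is an assumption made up
for this illustration; only the split itself comes from the text above.

```c
#include <stddef.h>

/* Shared by every instance: one knob controls them all. */
static int foo_debug;

/* Per-instance state: each virtual environment gets its own copy. */
struct vnet_foo {
	int	_foo_mode;
};

/*
 * Hypothetical rule: an instance logs only when the shared debug knob
 * is on and its own mode is non-zero.
 */
static int
foo_should_log(const struct vnet_foo *vf)
{
	return (foo_debug != 0 && vf->_foo_mode != 0);
}
```

Only _foo_mode would then need a V_ macro and a slot in the module's
virtualisation structure; foo_debug stays an ordinary global.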

Details:

temp. note: for module FOO add a definition for VNET_MOD_FOO in sys/vimage.h.
This will eventually be dynamically assigned.

For now these instructions refer mainly to VNET and not VCPU, VPROCG etc.

Symbols defined in other modules that have been virtualised will have been
moved to a module-specific virtualisation structure. It will be defined in a 
.h file for just this purpose. If a module will never export virtualised
symbols beyond its borders, then this structure may well just be in a common
include file for that module. As an example, common networking
(but not protocol) variables have been moved to a file called net/vnet.h, but
the gre module has simply added the virtualisation structure to if_gre.h as 
no code outside the gre interface will access those values.

Accesses to virtualised symbols are achieved via macros, which generally
are of the same name as the original symbol but with a "V_" prepended,
thus the head of the interface list, called 'ifnet', is replaced wherever
used with "V_ifnet".  In SCTP, because the code is shared with other OSes,
they are replaced with a macro MODULE_GLOBAL(modulename, symbol).
In the current version of vimage, when VIMAGE is not compiled into
the kernel, the macros evaluate to a direct reference to the symbol.
In future versions they will evaluate to the entry in question within a
global instance of the virtualisation structure, which will still result in
a single direct memory reference, so that the speed will be as it is now.

When VIMAGE is compiled in, the macro will evaluate to an access to an
element in a structure pointed to by a local variable.
For this reason, it is necessary to also add, at the beginning of
these functions another macro that will instantiate this local variable
and point it at the correct place.
As an example, prior to using the "V_ifnet" structure in a program block,
we must add the following macro at the head of a code block enclosing the
references to set up the module-specific base pointer variable:

  INIT_VNET_NET(initial_value); /* initial value is usually curvnet */

When VIMAGE is not defined, this will evaluate to nothing but when it
IS defined, it will evaluate to:

  struct vnet_net *vnet_net = (initial_value);

The initial value is usually something like "curvnet" which in turn
is a macro that derives the vnet affinity from the current thread.
It could also be (m->m_ifp->if_vnet) if we were receiving an mbuf.

In the case where it is just one function in a module calling
another (static), the porter might decide to simply pass the local
variable as an argument, rather than to reevaluate it in the function,
but should be prepared to cope with the fact that the code might be
compiled in the "no-VIMAGE" manner (in which case the argument would be 
marked as "unused"). 

Usually, when a packet enters the system it is carried through the processing 
path via a single thread, and that thread will set its virtual environment
reference to that indicated by the packet on picking up that new packet.
This means that in the normal inbound processing path as well as the
outgoing process path the current thread can be used to indicate the
current virtual environment. In the case of timer initiated events, best
practice would also be to set the current virtual module reference to the
one determined by whatever means is appropriate, so that any functions
called can rely on the current thread being a good reference for the correct
virtual module.
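
A timer-initiated entry point following that practice might look roughly
like this userland sketch.  struct foo_timer_arg, foo_timeout() and
foo_current_id() are hypothetical, curvnet is a plain static standing in
for the per-thread value, and the save / restore is written out by hand
rather than with the framework's macros.

```c
#include <stddef.h>

struct vnet { int id; };

/* Stand-in for curthread->td_vnet. */
static struct vnet *curvnet;

/* Hypothetical private structure handed to the timer as its void * arg. */
struct foo_timer_arg {
	struct vnet	*fta_vnet;	/* instance this timer belongs to */
	int		 fta_fired;	/* id of the vnet seen when it ran */
};

/* Code deeper in the module that relies on curvnet being valid. */
static int
foo_current_id(void)
{
	return (curvnet != NULL ? curvnet->id : -1);
}

/* Timer entry point: no vnet context on entry, so take it from the arg. */
static void
foo_timeout(void *arg)
{
	struct foo_timer_arg *fta = arg;
	struct vnet *saved = curvnet;

	curvnet = fta->fta_vnet;
	fta->fta_fired = foo_current_id();
	curvnet = saved;
}
```

The pointer in fta_vnet is stashed when the timer is armed, at a point
where the vnet context is still known.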

When a new VNET submodule is defined for virtualisation, the following
structure defining macro is used to define it to the framework. 

#define VNET_MOD_DECLARE(m_name_uc, m_name_lc, m_iattach, m_idetach,    \
    m_dependson, m_symmap)                                              \
        static const struct vnet_modinfo vnet_##m_name_lc##_modinfo = { \
                .vmi_id                 = VNET_MOD_##m_name_uc,         \
                .vmi_dependson          = VNET_MOD_##m_dependson,       \
                .vmi_name               = #m_name_lc,                   \
                .vmi_iattach            = m_iattach,                    \
                .vmi_idetach            = m_idetach,                    \
                .vmi_struct_size        =                               \
                        sizeof(struct vnet_##m_name_lc),                \
                .vmi_symmap             = m_symmap                      \
        };

The ID we allocated in the temporary first step in "Details" is
the first entry here; eventually this should be automatically done
by module name. The DEPENDSON field tells us the order that modules
should be initialised in a new virtual environment. This may later need
to be changed to a list of text module names for dynamic calculation.
The rest of the fields are self explanatory, with the exception of the
symmap entry.
The symmap allows us to intercept calls by libkvm to the
linker when it is looking up symbols and to redirect them
dynamically. This allows, for example, "netstat -r" to find the
routing tables for THIS virtual environment.
(of course that won't work for core dumps). (XXX *needs thought *)
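
The redirection idea can be sketched as a table of symbol names and
structure offsets.  This is a simplified userland illustration: struct
symmap, the one-argument SYMMAP() macro and symmap_lookup() are hypothetical
and are not the framework's VNET_SYMMAP machinery.

```c
#include <stddef.h>
#include <string.h>

struct vnet_foo {
	int	_foo_counter;
	int	_foo_mode;
};

/* One entry per exported symbol: its global name and struct offset. */
struct symmap {
	const char	*name;
	size_t		 offset;
};

#define	SYMMAP(sym)	{ #sym, offsetof(struct vnet_foo, _##sym) }
#define	SYMMAP_END	{ NULL, 0 }

static const struct symmap foo_symmap[] = {
	SYMMAP(foo_counter),
	SYMMAP(foo_mode),
	SYMMAP_END
};

/* Resolve a symbol name to its address inside one instance, or NULL. */
static void *
symmap_lookup(const struct symmap *map, void *base, const char *name)
{
	for (; map->name != NULL; map++)
		if (strcmp(map->name, name) == 0)
			return ((char *)base + map->offset);
	return (NULL);
}
```

A lookup for "foo_counter" thus yields a different address depending on
which instance is passed as base, while the caller keeps using the
original global name.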

As an example of virtualising a dummy module named FOO,
the following code might be added to a special vfoo.h or at least to
the existing foo.h file:

========================================================

#ifndef _DIR_VFOO_H_
#define _DIR_VFOO_H_

#include <dir/foo.h> /* for struct foo_bar */

#define INIT_VNET_FOO(vnet) \
	INIT_FROM_VNET(vnet, VNET_MOD_FOO, \
	    struct vnet_foo, vnet_foo)

#define VNET_FOO(sym)      VSYM(vnet_foo, sym)

#if (defined(VIMAGE) || defined(FUTURE))
struct vnet_foo {
	int		_foo_counter;
	struct foo_bar	_foo_barx;
};
#endif

/* Symbol translation macros */
#define V_foo_counter		VNET_FOO(foo_counter)
#define V_foo_barx		VNET_FOO(foo_barx)

#endif /* !_DIR_VFOO_H_ */
=========================================================

Each time the foo module is initialised for a new virtual environment,
the foo_bar structure must be initialised, so new constructor and destructor
functions are defined for the module. These will be called when a new
virtual environment is created or destroyed. The constructor must be called
once for the base machine when the system is booted, even when options VIMAGE
is not defined.

==================== in module foo.c ======
#include "opt_vimage.h"
[...]
#include <sys/vimage.h>
[...]
#include <dir/vfoo.h>
[...]

#ifndef VIMAGE
 /* initially the globals would have been here,
  * and for now we will leave them here when not using VIMAGE.
  * In the future we will instead have a static version of the structure.
  */
# if defined(FUTURE)
    struct vnet_foo vnet_foo_globals;
# else /* !FUTURE */
    int foo_counter = 0;
    struct foo_bar foo_barx = {};
# endif /* !FUTURE */
#endif /* !VIMAGE */

[...]

#if (defined(VIMAGE) || defined(FUTURE))
static vnet_attach_fn vnet_foo_iattach;
static vnet_detach_fn vnet_foo_idetach;
#endif

#ifdef VIMAGE
/* If we have symbols we need to divert for libkvm
 * then put them in here. We may not need to do anything if
 * the symbols are not used by libkvm.
 */
static struct vnet_symmap vnet_foo_symmap[] = {
        VNET_SYMMAP(foo, foo_counter),
        VNET_SYMMAP(foo, foo_barx),
        VNET_SYMMAP_END
};
/*
 * Declare our module and state that we want to be done after the 
 * loopback interface is initialised for the virtual environment.
 */
VNET_MOD_DECLARE(FOO, foo, vnet_foo_iattach,
    vnet_foo_idetach, LOIF, vnet_foo_symmap)
#endif /* VIMAGE */

[...]

/* a pre-existing 'foo' function that will be converted. */
void
foo_work(void)
{
	INIT_VNET_FOO(curvnet);	/* Add this at the front */

	V_foo_counter++;	/* add "V_" to the front of the symbol */
	[...]
	V_foo_barx.mumble = V_foo_counter;  /* and here too */
	[...]
}

/*
 * A function which on entry has no idea of which vnet it is on
 * and needs to look at them all for some reason.
 * NOTE! if this code is running in a thread that
 * does nothing else, or otherwise doesn't care about which
 * vnet it is on then the steps that save and restore the previous vnet
 * need not be done. (Marked with XXX)
 */
void
foo_tick(void)
{
	VNET_ITERATOR_DECL(vnet_iter);
	[...]

	[...]
	VNET_LIST_RLOCK();
	VNET_LIST_FOREACH(vnet_iter) {
		CURVNET_SET(vnet_iter);		/* XXX */
		INIT_VNET_NET(vnet_iter);
		[...]
		/*
		 * Do work, including calling code that assumes
		 * we have curvnet set.
		 */
		[...]
		CURVNET_RESTORE();		/* XXX */
	}
	VNET_LIST_RUNLOCK();
	[...]
}

#if (defined(VIMAGE) || defined(FUTURE))
static int vnet_foo_iattach(const void *unused)
{
	INIT_VNET_FOO(curvnet);

	V_foo_counter = 0;
	bzero (&V_foo_barx, sizeof (V_foo_barx));
	return 0;
}
#endif

#ifdef VIMAGE
static int vnet_foo_idetach(const void *unused)
{
	INIT_VNET_FOO(curvnet);

	/* prove we are ready to remove the module */
	/* code here to do work required */
	return 0;
}
#endif /* VIMAGE */

/*
 * Handle loading and unloading for this code.
 * The only thing we need to link into is the NETISR structure.
 */
static int
foo_mod_event(module_t mod, int event, void *data)
{
	int error = 0;

	switch (event) {
	case MOD_LOAD:
		/* Initialize everything. */
		/* put your code here */
#ifdef VIMAGE
		/* This will do the work for each virtual environment. */
		vnet_mod_register(&vnet_foo_modinfo);
#else /* !VIMAGE */
#ifdef FUTURE
		/* otherwise do the initialisation directly */
		vnet_foo_iattach(NULL);
#else /* !FUTURE */
		/* otherwise the initialisation is done statically */
#endif /* !FUTURE */
#endif /* !VIMAGE */
		break;
	case MOD_UNLOAD:
		/* You can't unload it because an interface may be using it. */
		/* this needs work */
		/* Should refuse to unload if any virtual environments */
		/* are still using this. */
		/* MARKO, fill in here */
		error = EBUSY;
		break;
	default:
		error = EOPNOTSUPP;
		break;
	}
	return (error);
}