vkernel & GSoC, some questions
dillon at apollo.backplane.com
Sun Mar 16 23:24:46 UTC 2008
Basically DragonFly has a syscall API that allows a userland process
to create and completely control any number of VM spaces, including
the ability to pass execution control to a VM space and get it back,
and control memory mappings within that VM space (and in the virtual
kernel process itself) on a page-by-page basis, so only 'invalid' PTEs
are passed through to the virtual kernel by the real kernel and the
real kernel caches page mappings with real hardware pmaps. Any
exception that occurs within a running VM space is routed back to the
virtual kernel process by the real kernel. Any real signal (e.g. the
vkernel's 'clock' interrupt) or exception that occurs also forces control
to return to the vkernel process.
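
The control flow described above can be sketched as a toy Python model. This is only an illustration of the "pass control in, get it back on an exception" loop. It does not call the DragonFly syscalls, and every name in it (ToyVMSpace, vkernel_loop, the event tuples) is made up for the example.

```python
# Toy model only: the real interface is a set of DragonFly syscalls, not
# this Python class. Every name below is hypothetical.

class ToyVMSpace:
    """A VM context whose page mappings the 'vkernel' fills in lazily."""

    def __init__(self, program):
        # program is a list of ("touch", page) or ("syscall", name) steps
        self.program = list(program)
        self.pc = 0
        self.valid_pages = set()   # stands in for the valid PTEs

    def run(self):
        """Run until something needs the vkernel, then return the reason."""
        while self.pc < len(self.program):
            op, arg = self.program[self.pc]
            if op == "touch":
                if arg not in self.valid_pages:
                    return ("fault", arg)   # invalid PTE: hand control back
                self.pc += 1                # valid mapping: vkernel not involved
            elif op == "syscall":
                self.pc += 1
                return ("syscall", arg)     # syscalls always go to the vkernel
            else:
                raise ValueError("unknown op: %r" % (op,))
        return ("exit", None)


def vkernel_loop(vm):
    """Pass control to the VM space, handle whatever comes back, repeat."""
    events = []
    while True:
        reason, arg = vm.run()
        events.append((reason, arg))
        if reason == "fault":
            vm.valid_pages.add(arg)         # install the mapping, then resume
        elif reason == "exit":
            return events


vm = ToyVMSpace([("touch", 1), ("touch", 1), ("syscall", "write"), ("touch", 2)])
events = vkernel_loop(vm)
print(events)
```

Note that the second touch of page 1 never reaches the loop: once a mapping is valid, the access is resolved without involving the vkernel, which is the point of only passing 'invalid' PTEs through.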
A DragonFly virtual kernel is just a user process which uses this feature
to manipulate VM contexts (i.e. for processes running under the vkernel
itself), providing a complete emulation environment that is opaque to
userland. The vkernel itself is not running in an emulated environment,
it is a 'real' (and singular) user process running on the machine.
These VM contexts are managed by the real kernel as pure VM contexts,
NOT as threads or processes or anything else. Since the VM context in
the real kernel basically has one VM entry (representing the software
emulated mmap of the entire address space), and since pmap's use
throw-away PTEs, the real-kernel overhead is minimal and there is no
real limit to the number of virtualized processes the virtual kernel
can control, nor any other resource limitations within the real kernel.
One can even run a virtual kernel inside a virtual kernel... not sure why
anyone would want to do it, but it works! I can even thrash the virtual
kernel without it having any effect whatsoever on the real kernel or
The ENTIRE operational overhead rests solely in operations which must
perform a context switch. Cpu-bound programs will run at full speed and
I/O bound programs aren't too bad either. Context-switch-heavy programs
suffer as they do in a hardware virtualized environment. Make no
mistake about that: run any sort of kernel in a hardware virtualized
environment that it wasn't designed to run in and you are going to have
horrible performance, as many people trying to simply 'move' their
existing machines to virtualized environments have found out the hard
way. I could probably shave off a microsecond from our virtual
kernel syscall path, but it isn't a priority for me... I'm using a
code efficient but performance inefficient implementation to pass
contextual information between the emulated VM context and the
virtual kernel, and it's a fairly expensive copy op that would benefit
greatly if it were converted to shared memory or if I simply cached the
userland page in the real kernel to avoid the copyout/lookup/pmap op.
I could probably also parallelize the real I/O backend for the 'disk'
better, but it isn't a priority for me either.
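
To put the "overhead is all in the context switches" point into numbers, here is a back-of-the-envelope model in Python. The 5 microsecond extra cost per round trip and the workload sizes are invented for illustration and are not measurements of any real vkernel.

```python
# Hypothetical cost model: native runtime plus a fixed extra cost for
# every round trip through the vkernel. All numbers are made up.

def virtualized_runtime(native_seconds, switches, extra_us_per_switch):
    """Return the estimated runtime in seconds under the vkernel."""
    return native_seconds + switches * extra_us_per_switch * 1e-6


EXTRA_US = 5.0  # assumed extra microseconds per context switch

# Same 10 seconds of native work, very different numbers of switches.
cpu_bound = virtualized_runtime(10.0, 1_000, EXTRA_US)
syscall_heavy = virtualized_runtime(10.0, 5_000_000, EXTRA_US)

print("cpu-bound:     %.3f s" % cpu_bound)
print("syscall-heavy: %.3f s" % syscall_heavy)
```

With these made-up inputs the cpu-bound job is essentially unaffected while the syscall-heavy job takes several times longer, which is the shape of the behavior described above.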
SMP is supported the same as it is supported in a real kernel, the
virtual kernel simply creates an LWP for each 'cpu' (for all intents
and purposes you can think of it as forking once for each cpu). All
the LWPs have access to the same pool of VM contexts and thus the
virtual kernel can schedule its processes to any of the LWPs on a whim.
It just uses the same process scheduler that the real kernel does...
nearly all the code in the virtual kernel is the same, in fact, the
vkernel 'platform' is only 700K of source code.
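
The "one LWP per cpu, all sharing one pool of VM contexts" arrangement can be pictured with a small Python sketch. Python threads stand in for LWPs here and a queue stands in for the scheduler's run queue. The names run_cpus and cpu_loop are hypothetical, and the real vkernel uses the kernel's own process scheduler rather than anything like this.

```python
import queue
import threading

# Toy model: threads play the role of the per-'cpu' LWPs, and any of them
# may pick up any runnable VM context from the shared pool.

def run_cpus(ncpus, contexts):
    """Run every context once on whichever 'cpu' grabs it first."""
    runnable = queue.Queue()
    for ctx in contexts:
        runnable.put(ctx)

    ran_on = {}                 # context name -> cpu id that ran it
    lock = threading.Lock()

    def cpu_loop(cpu_id):
        while True:
            try:
                ctx = runnable.get_nowait()
            except queue.Empty:
                return          # nothing left to schedule on this 'cpu'
            with lock:
                ran_on[ctx] = cpu_id

    workers = [threading.Thread(target=cpu_loop, args=(i,)) for i in range(ncpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return ran_on


ran_on = run_cpus(2, ["ctx%d" % i for i in range(6)])
print(sorted(ran_on.items()))
```

Which 'cpu' ends up running which context varies from run to run, since no context is tied to a particular LWP.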
There are some minor (and admittedly not very well developed) shims to
reduce the load on the real machine when you do things like run a
vkernel simulating many cpu's on a machine which only has a few
physical cpu's. Spinning in a thread vs on a hard cpu is not the best
thing in the world to do, after all. In any case, this means that
generally speaking SMP performance in a virtual kernel will scale as
DragonFly's own SMP performance is improved. Right now the vkernels
can be built SMP but it isn't recommended... those kinds of builds
are best used to test SMP work and not for real applications.
Insofar as virtual kernels versus machine emulation and performance goes,
people need to realize that *NO* machine emulation technology is going
to perform well for any task requiring a lot of context switching or a lot
of non-MMU-resolvable page faults. No matter WHAT technology you use,
at some point any real I/O operation will have to pass through the real
kernel, period. For example, a syscall-heavy process running under a
virtual kernel will perform just about as badly as a syscall-heavy
process running under something like VMWare. Hardware virtualized MMU
support isn't quite advanced enough to solve the performance bottleneck
for any virtualization technology that I am aware of. The only reason
VMWare is perceived to have better performance in certain cases is
simply because they have invested a ridiculous number of man-hours on
instruction rewriting, plus targeted optimizations which do not stand
the test of time (work with particular software and do not generally
survive the evolution of that software without retargeting the
optimization). It's like the assembly-vs-C arguments we had in the
mid-80's. It isn't a good precedent.
Hardware virtualization is still the only real avenue for true cross-
platform emulation, but it isn't ultimately going to be the best solution
for same-platform emulation.
Frankly a virtualized kernel such as DragonFly's kernel and user mode
linux (which uses a similar but slightly different context switch
handling model) is a better development path than machine emulation
for SAME-OS kernels, because the virtualized kernel is explicitly
designed to operate in that environment, allowing all the context-
transitional interfaces to be customized far better than what you can
do with any hardware virtualization technology, not to mention
that a virtual kernel is actually better positioned to use hardware
virtualization technologies than a hardware emulated kernel is. Sounds
nuts, but it's true.
Hardware virtualization technologies currently have far more eyeballs
writing insanely complex instruction rewriting code which is why they
are perceived as having a performance benefit at the moment, but the
development path is extremely inelegant and there is far more room for
optimization in a virtualized kernel environment than there is in a
hardware emulated environment. The virtualized kernel environment can
take advantage of the same hardware features as the hardware emulated
environment, after all, but a hardware emulated environment cannot take
advantage of all the direct syscall features available to a virtual kernel.
Types of optimizations we can do to improve virtual kernel technologies,
which also apply to hardware emulated kernels:
* Prefetch more pages to avoid excessive invalid page exceptions. Right
now most prefetching is turned off, resulting in fairly horrible
performance for malloc-intensive programs.
* Improve the system call context switching path. Right now it uses
excessive copyin/copyout ops. What it really needs to do is use an
up-call mechanic that allows the register and FP context to be thrown
away in the vkernel process (cutting out half the copyin/copyout's).
* Do a better job bundling I/O (buffer cache interactions in the vkernel
require very different optimizations vs buffer cache interactions in
a real kernel).
* Asynchronize real I/O better. Right now, I admit, I'm basically just
using a write() to the disk file. True asynchronization requires
creating some 'I/O' LWPs outside of the SMP model, and I haven't done
that yet. Right now I have a few LWPs inside the SMP model to
parallelize I/O but it doesn't work very well.
Not really a big list, and nothing earthshattering.
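
As a rough illustration of the first item in the list (prefetching pages to cut down on invalid page exceptions), the following Python sketch counts faults for a sequential access pattern with and without a prefetch window. The function name count_faults and the window size of 7 are made up for the example and say nothing about how the vkernel's prefetch is actually implemented.

```python
# Toy fault counter: each fault maps the faulting page plus `prefetch`
# pages after it. Names and numbers are hypothetical.

def count_faults(accesses, prefetch):
    """Count invalid-page exceptions for a sequence of page accesses."""
    mapped = set()
    faults = 0
    for page in accesses:
        if page not in mapped:
            faults += 1
            # map the faulting page and the next `prefetch` pages
            mapped.update(range(page, page + 1 + prefetch))
    return faults


sequential = list(range(64))   # e.g. a freshly allocated region touched in order

no_prefetch = count_faults(sequential, 0)
with_prefetch = count_faults(sequential, 7)
print(no_prefetch, with_prefetch)   # prints "64 8"
```

Each avoided fault is one less round trip through the real kernel and back into the vkernel, which is where the cost sits.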