vkernel & GSoC, some questions

Sun Mar 16 23:24:46 UTC 2008

    Basically DragonFly has a syscall API that allows a userland process
    to create and completely control any number of VM spaces.  That
    includes the ability to pass execution control to a VM space and get
    it back, and to control memory mappings within that VM space (and in
    the virtual kernel process itself) on a page-by-page basis.  Only
    faults on 'invalid' PTEs are passed through to the virtual kernel by
    the real kernel, and the real kernel caches page mappings with real
    hardware pmaps.  Any exception that occurs within a running VM space
    is routed back to the virtual kernel process by the real kernel.  Any
    real signal (e.g. the vkernel's 'clock' interrupt) or exception that
    occurs also forces control to return to the vkernel process.
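
    The control flow described above can be sketched as a toy model in C.
    This is an illustration only: run_context(), vkernel_loop(),
    struct vm_context and the RET_* codes are made-up names, not the
    DragonFly syscall API, and the 'VM context' here is just a scripted
    list of events rather than a real address space.  It only shows the
    shape of the loop: hand control to a context, get it back along with
    a reason, handle that reason, repeat.

```c
/* Toy model only: none of these names are DragonFly APIs. */
#include <assert.h>
#include <stddef.h>

/* Why control came back from the VM context to the vkernel. */
enum ret_reason { RET_SYSCALL, RET_PAGEFAULT, RET_SIGNAL, RET_DONE };

/* A "VM context": here just a scripted list of events it will raise. */
struct vm_context {
    const enum ret_reason *script;  /* events the guest will trigger */
    size_t len;                     /* number of scripted events */
    size_t pc;                      /* next event to raise */
};

/* Counters kept by the vkernel side of the model. */
struct vk_stats { int syscalls, faults, signals; };

/* Stand-in for "pass execution control to a VM space and get it back". */
static enum ret_reason run_context(struct vm_context *ctx)
{
    assert(ctx != NULL);
    if (ctx->pc >= ctx->len)
        return RET_DONE;
    return ctx->script[ctx->pc++];
}

/* vkernel side: run the context, handle whatever sent control back, repeat. */
static struct vk_stats vkernel_loop(struct vm_context *ctx)
{
    struct vk_stats st = { 0, 0, 0 };

    for (;;) {
        switch (run_context(ctx)) {
        case RET_SYSCALL:   st.syscalls++; break; /* would emulate the syscall */
        case RET_PAGEFAULT: st.faults++;   break; /* would install a mapping */
        case RET_SIGNAL:    st.signals++;  break; /* e.g. clock tick */
        case RET_DONE:      return st;
        }
    }
}
```

    Feeding it a script of one page fault, two syscalls and one signal
    leaves the counters at faults=1, syscalls=2, signals=1.  Every one of
    those round trips is a context switch in the real thing.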

    A DragonFly virtual kernel is just a user process which uses this feature
    to manipulate VM contexts (i.e. for processes running under the vkernel
    itself), providing a complete emulation environment that is opaque to
    userland.  The vkernel itself is not running in an emulated environment,
    it is a 'real' (and singular) user process running on the machine.
    These VM contexts are managed by the real kernel as pure VM contexts,
    NOT as threads or processes or anything else.  Since the VM context in
    the real kernel basically has one VM entry (representing the software
    emulated mmap of the entire address space), and since pmaps use
    throw-away PTEs, the real-kernel overhead is minimal and there is no
    real limit to the number of virtualized processes the virtual kernel
    can control, nor any other resource limitations within the real kernel.

    One can even run a virtual kernel inside a virtual kernel... not sure why
    anyone would want to do it, but it works!  I can even thrash the virtual
    kernel without it having any effect whatsoever on the real kernel or
    system.

    The ENTIRE operational overhead rests solely in operations which must
    perform a context switch.  Cpu-bound programs will run at full speed and
    I/O bound programs aren't too bad either.  Context-switch-heavy programs
    suffer as they do in a hardware virtualized environment.  Make no
    mistake about that: run any sort of kernel in a hardware virtualized
    environment that it wasn't designed to run in and you are going to
    have horrible performance, as many people trying to simply 'move'
    their existing machines to virtualized environments have found out
    the hard way.  I could probably shave off a microsecond from our
    virtual kernel syscall path, but it isn't a priority for me... I'm
    using a code-efficient but performance-inefficient implementation
    to pass contextual information between the emulated VM context and
    the virtual kernel, and it's a fairly expensive copy op that would
    benefit greatly if it were converted to shared memory or if I simply
    cached the userland page in the real kernel to avoid the
    copyout/lookup/pmap op.
    I could probably also parallelize the real I/O backend for the 'disk'
    better, but it isn't a priority for me either.

    SMP is supported the same as it is supported in a real kernel; the
    virtual kernel simply creates a LWP for each 'cpu' (for all intents
    and purposes you can think of it as forking once for each cpu).  All
    the LWPs have access to the same pool of VM contexts and thus the
    virtual kernel can schedule its processes to any of the LWPs on a whim.
    It just uses the same process scheduler that the real kernel does...
    nearly all the code in the virtual kernel is the same.  In fact, the
    vkernel 'platform' is only 700K of source code.
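
    As a rough sketch of the "one LWP per cpu, one shared pool of VM
    contexts" arrangement: the C below uses POSIX threads as a stand-in
    for LWPs, and a mutex-protected index as a stand-in for the
    scheduler.  struct ctx_pool, cpu_main(), run_vcpus(), NCTX and
    MAXCPUS are hypothetical names for illustration, and the real vkernel
    uses the kernel's process scheduler, not a simple counter.

```c
/* Sketch only: pthreads stand in for LWPs; all names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCTX     8    /* size of the shared pool of VM contexts */
#define MAXCPUS 16    /* upper bound on emulated cpus */

struct ctx_pool {
    pthread_mutex_t lock;
    int next;               /* next context nobody has claimed yet */
    int ran_on[NCTX];       /* which 'cpu' ran each context, -1 = none */
    int run_count[NCTX];    /* how many times each context was run */
};

struct cpu_arg {
    struct ctx_pool *pool;
    int cpu_id;
};

/* Body of one emulated cpu: claim a context from the shared pool, "run" it. */
static void *cpu_main(void *p)
{
    struct cpu_arg *arg = p;
    struct ctx_pool *pool = arg->pool;

    for (;;) {
        int idx;

        pthread_mutex_lock(&pool->lock);
        idx = (pool->next < NCTX) ? pool->next++ : -1;
        pthread_mutex_unlock(&pool->lock);
        if (idx < 0)
            return NULL;            /* pool drained */
        /* idx was claimed under the lock, so only this thread touches it */
        pool->ran_on[idx] = arg->cpu_id;
        pool->run_count[idx]++;
    }
}

/* Start one thread per 'cpu' and wait; returns contexts claimed, or -1. */
static int run_vcpus(struct ctx_pool *pool, int ncpus)
{
    pthread_t tid[MAXCPUS];
    struct cpu_arg args[MAXCPUS];
    int i, started = 0, failed = 0;

    assert(pool != NULL && ncpus > 0 && ncpus <= MAXCPUS);
    pthread_mutex_init(&pool->lock, NULL);
    pool->next = 0;
    for (i = 0; i < NCTX; i++) {
        pool->ran_on[i] = -1;
        pool->run_count[i] = 0;
    }
    for (i = 0; i < ncpus; i++) {
        args[i].pool = pool;
        args[i].cpu_id = i;
        if (pthread_create(&tid[i], NULL, cpu_main, &args[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&pool->lock);
    return failed ? -1 : pool->next;
}
```

    Calling run_vcpus(&pool, 4) runs each of the NCTX contexts exactly
    once, on whichever 'cpu' happened to claim it.  Which cpu gets which
    context is not deterministic, which is the point: any LWP can pick
    up any context.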

    There are some minor (and admittedly not very well developed) shims to
    reduce the load on the real machine when you do things like run a
    vkernel simulating many cpu's on a machine which only has a few
    physical cpu's.  Spinning in a thread vs on a hard cpu is not the best
    thing in the world to do, after all.  In any case, this means that
    generally speaking SMP performance in a virtual kernel will scale as
    DragonFly's own SMP performance is improved.  Right now the vkernels
    can be built SMP but it isn't recommended... those kinds of builds
    are best used to test SMP work and not for real applications.

    --

    Insofar as virtual kernels versus machine emulation and performance goes,
    people need to realize that *NO* machine emulation technology is going
    to perform well for any task requiring a lot of context switching or a
    lot of non-MMU-resolvable page faults.  No matter WHAT technology you use,
    at some point any real I/O operation will have to pass through the real
    kernel, period.  For example, a syscall-heavy process running under a
    virtual kernel will perform just about as badly as a syscall-heavy
    process running under something like VMWare.  Hardware virtualized MMU
    support isn't quite advanced enough to solve the performance bottleneck
    for any virtualization technology that I am aware of.  The only reason
    VMWare is perceived to have better performance in certain cases is
    simply because they have invested a ridiculous number of man-hours on
    instruction rewriting, plus targeted optimizations which do not stand
    the test of time (they work with particular software and do not
    generally survive the evolution of that software without retargeting
    the optimization).  It's like the assembly-vs-C arguments we had in the
    mid-80's.  It isn't a good precedent.

    Hardware virtualization is still the only real avenue for true cross-
    platform emulation, but it isn't ultimately going to be the best solution
    for same-platform emulation.

    Frankly a virtualized kernel such as DragonFly's virtual kernel or
    User Mode Linux (which uses a similar but slightly different context
    switch handling model) is a better development path than machine
    emulation for SAME-OS kernels, because the virtualized kernel is
    explicitly designed to operate in that environment, allowing all the
    context-transitional interfaces to be customized far better than what
    you can do with any hardware virtualization technology, not to mention
    that a virtual kernel is actually better positioned to use hardware
    virtualization technologies than a hardware emulated kernel is.  Sounds
    nuts, but it's true.

    Hardware virtualization technologies currently have far more eyeballs
    writing insanely complex instruction rewriting code which is why they
    are perceived as having a performance benefit at the moment, but the
    development path is extremely inelegant and there is far more room for
    optimization in a virtualized kernel environment than there is in a
    hardware emulated environment.  The virtualized kernel environment can
    take advantage of the same hardware features as the hardware emulated
    environment, after all, but a hardware emulated environment cannot take
    advantage of all the direct syscall features available to a virtual
    kernel.

    --

    Types of optimizations we can do to improve virtual kernel technologies,
    which also apply to hardware emulated kernels:

    * Prefetch more pages to avoid excessive invalid page exceptions.  Right
      now most prefetching is turned off, resulting in fairly horrible
      performance for malloc-intensive programs.

    * Improve the system call context switching path.  Right now it uses
      excessive copyin/copyout ops.  What it really needs to do is use an
      up-call mechanic that allows the register and FP context to be thrown
      away in the vkernel process (cutting out half the copyin/copyouts).

    * Do a better job bundling I/O (buffer cache interactions in the vkernel
      require very different optimizations vs buffer cache interactions in
      a real kernel).

    * Asynchronize real I/O better.  Right now, I admit, I'm basically just
      using a write() to the disk file.  True asynchronization requires
      creating some 'I/O' LWPs outside of the SMP model, and I haven't done
      that yet.  Right now I have a few LWPs inside the SMP model to
      parallelize I/O but it doesn't work very well.

    Not really a big list, and nothing earthshattering.
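
    To put a rough number on the prefetch item in the list above, here is
    a toy model in C that counts invalid-page exceptions for a sequential
    walk over a region.  It is a sketch under stated assumptions, not the
    vkernel's fault path: count_faults(), NPAGES and the 'window'
    parameter are hypothetical, and a real fault handler would also have
    to decide whether the neighbouring pages are worth mapping at all.

```c
/* Toy model only: count_faults() and NPAGES are made up for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 64   /* size of the modelled region, in pages */

/*
 * Touch pages 0..ntouch-1 in order.  Each touch of an unmapped page costs
 * one fault, and the handler maps `window` consecutive pages starting at
 * the faulting page (window == 1 means no prefetch).  Returns the number
 * of faults taken.
 */
static int count_faults(size_t ntouch, size_t window)
{
    bool mapped[NPAGES] = { false };
    int faults = 0;

    assert(ntouch <= NPAGES && window >= 1);
    for (size_t page = 0; page < ntouch; page++) {
        if (mapped[page])
            continue;               /* already mapped: no exception */
        faults++;                   /* round trip to the vkernel */
        for (size_t i = page; i < page + window && i < NPAGES; i++)
            mapped[i] = true;       /* map the faulting page plus prefetch */
    }
    return faults;
}
```

    With no prefetch, count_faults(64, 1) takes 64 faults; with a window
    of 8 pages, count_faults(64, 8) takes 8.  The model ignores the cost
    of mapping pages that are never touched, so it only shows the best
    case for sequential access.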

						-Matt