FreeBSD mail list etiquette

Matthew Dillon dillon at apollo.backplane.com
Sat Oct 25 15:13:14 PDT 2003


    Sheesh, you think you guys (*ALL* you guys) have enough time on your
    hands?  There are better places to direct all that brainpower.

    I don't really need to defend DragonFly... I believe it stands on its
    own very well not only with what we have already accomplished but with
    what we are about to accomplish.  Jeffrey is very close to decoupling
    the NETIF and network protocol drivers from Giant and Hiten has been
    playing with the APICs in regards to distributing interrupts to
    particular CPUs (something which DragonFly is particularly good at due
    to the way the light weight kernel threading system works).  As soon as
    I get this namecache mess rewritten (and assuming David Rhodus doesn't 
    keep pulling obscure panics out of his hat :-), but to be fair our NFS
    is already gobs faster than 4.x)... I am going to start cleaning up
    loose ends in the networking code and we will have the critical path
    entirely decoupled and mostly (or completely) mutexless.

    We are taking a somewhat different approach to BGL removal than 5.x.
    Instead of haphazardly locking up subsystems with mutexes we are
    instead locking up subsystems by moving them into their own threads,
    then scaling through the use of multiple threads, and leaving everything
    that hasn't been locked up under the BGL.  That way we are able to skip
    the intermediate step of determining where all the contention is,
    because the only contention will be the BGL'd areas which haven't been
    converted yet and we will simply assume contention.  This way we can
    focus on optimizing the critical path, which will get us 80% of the
    scalability we need, and tackle the other things like, say, the route
    table, after we have the topology in place and can see clearly what needs
    to be done for it (e.g. like using RCU and passive IPI messaging instead
    of mutexes for updates).
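
    To make the threading idea concrete, here is a minimal userland
    sketch in plain C with pthreads (hypothetical names throughout, not
    the actual lwkt message API): callers never touch the subsystem's
    data directly, they queue a message to it, and because the subsystem
    thread processes messages serially the data itself needs no mutexes
    at all.  The only lock left is on the queue itself, and in the
    kernel version even that can become a lock-free per-cpu queue.

	#include <pthread.h>
	#include <stdlib.h>

	struct subsys_msg {
		struct subsys_msg *next;
		void (*handler)(void *arg); /* work run in the subsystem thread */
		void *arg;
	};

	static struct subsys_msg *msg_head;
	static struct subsys_msg **msg_tailp = &msg_head;
	static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER; /* queue only */
	static pthread_cond_t  q_cv = PTHREAD_COND_INITIALIZER;

	/* Producer side: any thread may queue work for the subsystem. */
	void
	subsys_sendmsg(void (*handler)(void *), void *arg)
	{
		struct subsys_msg *msg = malloc(sizeof(*msg));

		msg->next = NULL;
		msg->handler = handler;
		msg->arg = arg;
		pthread_mutex_lock(&q_lock);
		*msg_tailp = msg;
		msg_tailp = &msg->next;
		pthread_cond_signal(&q_cv);
		pthread_mutex_unlock(&q_lock);
	}

	/*
	 * Consumer side: the subsystem thread.  All subsystem state is
	 * private to this thread, so handlers run with no further locking.
	 */
	void *
	subsys_thread(void *dummy)
	{
		(void)dummy;
		for (;;) {
			pthread_mutex_lock(&q_lock);
			while (msg_head == NULL)
				pthread_cond_wait(&q_cv, &q_lock);
			struct subsys_msg *msg = msg_head;
			msg_head = msg->next;
			if (msg_head == NULL)
				msg_tailp = &msg_head;
			pthread_mutex_unlock(&q_lock);
			msg->handler(msg->arg);	/* serialized, no contention */
			free(msg);
		}
		return (NULL);
	}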

    So, for example, take the TCP stack.  It's already mostly in its own
    thread simply by virtue of being a software interrupt.  Softints,
    like interrupts, are threads in DragonFly.  After the first lockup
    phase external APIs such as mbuf allocation and freeing, and
    route table lookups, will still be under the BGL, but PCBs and packet
    manipulation will be serialized in the protocol thread(s) and require no
    mutexes or locks whatsoever.  Then we will move most of the mbuf API out
    of the BGL simply by adding a per-cpu layer (and since there is no
    cpu-hopping preemption we can depend on the per-cpu globaldata area
    without getting and releasing mutexes, which would just waste cycles
    when the whole idea is for there to be no contention in the first
    place).  But just like our current slab allocator, things that
    miss the per-cpu globaldata cache will either use the BGL to access
    the kernel_map or will queue the operation (if it does not need to be
    synchronous) for later execution.  After all, who cares if free() 
    can't release a chunk of memory to the kernel_map instantly for reuse?
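
    Here is a hedged sketch of that per-cpu fast path, with hypothetical
    names (this is not the actual DragonFly slab code, and it handles a
    single size class to stay short): the common case touches only the
    current cpu's cache and needs no locks at all, and only a miss falls
    back to the BGL-protected global path.

	#include <stddef.h>

	#define NCPUS		4
	#define CACHE_SLOTS	32

	struct percpu_cache {
		void	*freelist[CACHE_SLOTS];
		int	count;
	};

	static struct percpu_cache pcpu_cache[NCPUS];

	/* assumed to exist in the surrounding kernel (hypothetical) */
	extern int	mycpu(void);	/* stable: no cpu-hopping preemption */
	extern void	bgl_acquire(void);
	extern void	bgl_release(void);
	extern void	*global_alloc(size_t size); /* BGL'd kernel_map path */
	extern void	global_free(void *obj);

	void *
	pcpu_alloc(size_t size)
	{
		struct percpu_cache *pc = &pcpu_cache[mycpu()];

		if (pc->count > 0)		/* fast path: no mutexes at all */
			return (pc->freelist[--pc->count]);
		bgl_acquire();			/* miss: fall back to the BGL */
		void *obj = global_alloc(size);
		bgl_release();
		return (obj);
	}

	void
	pcpu_free(void *obj)
	{
		struct percpu_cache *pc = &pcpu_cache[mycpu()];

		if (pc->count < CACHE_SLOTS) {	/* fast path: cache locally */
			pc->freelist[pc->count++] = obj;
			return;
		}
		bgl_acquire();			/* full: hand back under the BGL */
		global_free(obj);		/* (or queue it for later, since */
		bgl_release();			/*  the release need not be sync) */
	}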

    It's a lot easier lockup path than the direction 5.x is going, and
    a whole lot more maintainable IMHO because most of the coding doesn't
    have to worry about mutexes or LORs or anything like that.  

    If I were to recommend anything to the folks working on FreeBSD-current,
    it would be:

	* get rid of priority borrowing, and stop depending on it to fix
	  all your woes with interrupt threads accessing mutexes that
	  non-interrupt threads might also be accessing in the critical
	  path.  Fix the interrupt code instead.

	* get rid of *NON*-interrupt thread preemption while in the kernel.

	* get rid of preemptive cpu migration, even across normal blocks
	  inside the kernel unless you tell the API otherwise with a flag
	  that it is ok.

	* formalize critical sections to use just the counter mechanism
	  (similar to spls in 4.x), which it almost does now, and require
	  that hardware interrupts conform to the mechanism on all
	  architectures (see the first sketch after this list).

	* Port our IPI messaging code (which isn't optimized yet, but works
	  and can theoretically be very nicely optimized; see the second
	  sketch after this list).

	* separate the userland scheduler from the kernel thread scheduler
	  using a designated P_CURPROC approach, which completely fixes the
	  priority inversion issues that, I might add, ULE only 'fake fixes'
	  right now.  Make the kernel thread scheduler a fixed priority
	  scheduler (e.g. highest priority being interrupts, then softints,
	  then threads operating in the kernel, then user associated
	  threads operating in the kernel, then user associated threads
	  operating in userland).  Fix the userland scheduler API to
	  conform to the designated P_CURPROC approach, where the userland
	  scheduler is responsible for maintaining a single user process's
	  thread or threads on each cpu in the system at a time.
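
    First, a sketch of the counter-based critical section mentioned
    above (hypothetical names, not FreeBSD's or DragonFly's actual
    implementation): entering just bumps a per-thread counter, and a
    hardware interrupt that arrives while the count is nonzero is
    recorded and replayed when the outermost exit runs.

	struct thread {
		int	td_critcount;	/* nesting depth, nonzero = critical */
	};

	extern struct thread	*curthread;
	extern int		ints_pending;	/* deferred while critical */
	extern void		run_pending_ints(void);	/* replay them now */

	static inline void
	crit_enter(void)
	{
		curthread->td_critcount++; /* per-thread: no atomic op needed */
	}

	static inline void
	crit_exit(void)
	{
		if (--curthread->td_critcount == 0 && ints_pending)
			run_pending_ints(); /* deferred interrupts run here */
	}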
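
    Second, a sketch of the IPI messaging idea (hypothetical names; the
    real code is more careful about memory ordering and queue-full
    handling).  Each (sender, target) cpu pair gets a single-producer/
    single-consumer ring, so no locks are needed anywhere: the sender
    queues a function and argument and pokes the target, and a "passive"
    update can even skip the hardware IPI and let the target drain the
    ring at its next poll point.

	#define NCPUS		4
	#define IPIQ_SIZE	64	/* power of two */

	struct ipi_msg {
		void	(*func)(void *arg);
		void	*arg;
	};

	struct ipiq {
		struct ipi_msg	q[IPIQ_SIZE];
		volatile int	windex;	/* written only by the sending cpu */
		volatile int	rindex;	/* written only by the target cpu */
	};

	/* one ring per (sender, target) pair: one producer, one consumer */
	static struct ipiq ipiq[NCPUS][NCPUS];

	extern int	mycpu(void);
	extern void	cpu_send_ipi(int cpu);	/* raise the hardware IPI */

	/*
	 * Runs on the sending cpu.  Sketch only: assumes the ring is not
	 * full and ignores the memory barriers a real port would need.
	 */
	void
	ipi_send(int target, void (*func)(void *), void *arg)
	{
		struct ipiq *iq = &ipiq[mycpu()][target];
		int w = iq->windex & (IPIQ_SIZE - 1);

		iq->q[w].func = func;
		iq->q[w].arg = arg;
		iq->windex++;
		cpu_send_ipi(target);	/* a passive update could omit this */
	}

	/* Runs on the target cpu, from its IPI vector or a poll point. */
	void
	ipi_process(void)
	{
		int me = mycpu();

		for (int src = 0; src < NCPUS; src++) {
			struct ipiq *iq = &ipiq[src][me];

			while (iq->rindex != iq->windex) {
				struct ipi_msg *m =
				    &iq->q[iq->rindex & (IPIQ_SIZE - 1)];

				m->func(m->arg);
				iq->rindex++;
			}
		}
	}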

    If you did the above you would be a lot happier.  Once the schedulers
    are separated I would also make the kernel thread scheduler per-cpu
    and remove *ALL* mutex dependencies from it, which in turn will allow
    you to trivially integrate BGL requirements with a per-thread lock
    counter and directly integrate it into the kernel thread scheduler,
    which is what I do in DragonFly (see kern/lwkt_thread.c).  It actually
    optimizes the use of the BGL such that you can avoid doing BGL operations
    when switching between threads with the same BGL locked/not-locked state.
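
    A sketch of that switch-time optimization (hypothetical names; the
    real version is in kern/lwkt_thread.c): each thread carries its own
    BGL hold count, and the switch code only touches the giant lock when
    the outgoing and incoming threads differ in locked/not-locked state.

	struct thread {
		int	td_mpcount;	/* this thread's BGL hold count */
		/* ... saved context, run queue linkage, ... */
	};

	extern void	bgl_acquire(void);
	extern void	bgl_release(void);
	extern void	cpu_switch_context(struct thread *from,
			    struct thread *to);

	void
	thread_switch(struct thread *from, struct thread *to)
	{
		if (from->td_mpcount != 0 && to->td_mpcount == 0)
			bgl_release();	/* only the outgoing thread held it */
		else if (from->td_mpcount == 0 && to->td_mpcount != 0)
			bgl_acquire();	/* only the incoming thread needs it */
		/* both locked or both unlocked: no BGL operation at all */
		cpu_switch_context(from, to);
	}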

						-Matt


