FreeBSD mail list etiquette

Matthew Dillon dillon at
Sat Oct 25 21:01:50 PDT 2003

:>     It's a lot easier lockup path than the direction 5.x is going, and
:>     a whole lot more maintainable IMHO because most of the coding doesn't
:>     have to worry about mutexes or LORs or anything like that.  
:You still have to be pretty careful, though, with relying on implicit
:synchronization, because while it works well deep in a subsystem, it can
:break down on subsystem boundaries.  One of the challenges I've been
:bumping into recently when working with Darwin has been the split between
:their Giant kernel lock, and their network lock.  To give a high level
:summary of the architecture, basically they have two Funnels, which behave
:similarly to the Giant lock in -STABLE/-CURRENT: when you block, the lock
:is released, allowing other threads to enter the kernel, and regained when
:the thread starts to execute again. They then have fine-grained locking
:for the Mach-derived components, such as memory allocation, VM, et al. 

    I recall a presentation at BSDCon that mentioned that... yours I think.

    The interfaces we are contemplating for the NETIF (at the bottom)
    and UIPC (at the top) are different.  We probably won't need to use
    any mutexes to queue incoming packets to the protocol thread, we will
    almost certainly use an async IPI message to queue a message holding the
    packet if the protocol thread is on a different cpu.  On the same cpu
    it's just a critical section to interlock the queueing operation against
    the protocol thread.  Protocol packet output to NETIF would use the
    same methodology... asynch IPI message if the NETIF is on another cpu,
    critical section if it is on the current cpu.
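
    As a concrete illustration of the two paths, here is a minimal user-space
    sketch in C.  All of the names (pkt_enqueue_local, ipi_send_packet, etc.)
    are hypothetical, not actual DragonFly interfaces, and the critical
    section is simulated by a simple nesting count.

```c
#include <assert.h>
#include <stddef.h>

struct packet { int id; struct packet *next; };

/* Per-cpu packet queue: only its owning cpu ever touches head/tail. */
struct pktqueue { struct packet *head, *tail; int crit_count; };

static void crit_enter(struct pktqueue *q) { q->crit_count++; }
static void crit_exit(struct pktqueue *q)  { q->crit_count--; }

/* Same-cpu path: a critical section interlocks the enqueue against the
 * protocol thread; no mutex because no other cpu can interfere. */
static void pkt_enqueue_local(struct pktqueue *q, struct packet *p)
{
    crit_enter(q);
    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
    crit_exit(q);
}

/* Cross-cpu path: queue an IPI message to the owning cpu instead of
 * locking its queue; the owner drains the FIFO and enqueues locally. */
#define IPI_FIFO_SIZE 64
struct ipi_fifo { struct packet *msg[IPI_FIFO_SIZE]; unsigned windex, rindex; };

static int ipi_send_packet(struct ipi_fifo *f, struct packet *p)
{
    if (f->windex - f->rindex == IPI_FIFO_SIZE)
        return -1;                 /* full: real code would interrupt */
    f->msg[f->windex % IPI_FIFO_SIZE] = p;
    f->windex++;                   /* async: no interrupt generated */
    return 0;
}

static void ipi_drain(struct ipi_fifo *f, struct pktqueue *q)
{
    while (f->rindex != f->windex) {
        pkt_enqueue_local(q, f->msg[f->rindex % IPI_FIFO_SIZE]);
        f->rindex++;
    }
}
```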

    The protocol itself will change from a softint to a normal thread, or
    perhaps a thread at softint priority.  The softint is already a thread
    but we would separate each protocol into its own thread and have an
    ability to create several threads for a single protocol (like TCP) when
    necessary to take advantage of multiple cpus.

    On the UIPC side we have a choice of using a mutex to lock the socket
    buffer, or passing a message to the protocol thread responsible for
    the socket buffer (aka PCB).  There are tradeoffs for both situations
    since if this is related to a write() it winds up being a synchronous
    message.  Another option is to COW the memory but that might be too
    complex.  Smaller writes could simply copyin() the data as an option,
    or we could treat the socket buffer as a FIFO which would allow the
    system call UIPC interface to append to it without holding any locks
    (other than a memory barrier after the copy and before updating the
    index), then simply send a kick-off message to the protocol thread
    telling it that more data is present.
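
    The lock-free FIFO idea can be sketched as a single-producer,
    single-consumer ring buffer.  This is a user-space illustration with
    hypothetical names; C11 release/acquire atomics stand in for the kernel
    memory barrier mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>

#define SB_SIZE 4096u                /* power of two */

struct sockbuf_fifo {
    char buf[SB_SIZE];
    _Atomic unsigned windex;         /* advanced only by the write() side */
    _Atomic unsigned rindex;         /* advanced only by the protocol thread */
};

/* write() path: append without holding any lock.  The release store is
 * the barrier after the copy and before updating the index. */
static int sb_append(struct sockbuf_fifo *sb, const char *data, unsigned len)
{
    unsigned w = atomic_load_explicit(&sb->windex, memory_order_relaxed);
    unsigned r = atomic_load_explicit(&sb->rindex, memory_order_acquire);
    unsigned i;

    if (SB_SIZE - (w - r) < len)
        return -1;                   /* no room; caller blocks or retries */
    for (i = 0; i < len; i++)
        sb->buf[(w + i) & (SB_SIZE - 1)] = data[i];
    atomic_store_explicit(&sb->windex, w + len, memory_order_release);
    return 0;                        /* then: kick-off message to protocol */
}

/* Protocol thread: drain after receiving the kick-off message. */
static unsigned sb_consume(struct sockbuf_fifo *sb, char *out, unsigned max)
{
    unsigned r = atomic_load_explicit(&sb->rindex, memory_order_relaxed);
    unsigned w = atomic_load_explicit(&sb->windex, memory_order_acquire);
    unsigned n = w - r, i;

    if (n > max) n = max;
    for (i = 0; i < n; i++)
        out[i] = sb->buf[(r + i) & (SB_SIZE - 1)];
    atomic_store_explicit(&sb->rindex, r + n, memory_order_release);
    return n;
}
```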

:Deep in a particular subsystem -- say, the network stack, all works fine. 
:The problem is at the boundaries, where structures are shared between
:multiple compartments.  I.e., process credentials are referenced by both
:"halves"  of the Darwin BSD kernel code, and are insufficiently protected
:in the current implementation (they have a write lock, but no read lock,
:so it looks like it should be possible to get stale references with
:pointers accessed in a read form under two different locks). Similarly,
:there's the potential for serious problems at the surprisingly frequently
:occurring boundaries between the network subsystem and remainder of the
:kernel: file descriptor related code, fifos, BPF, et al.  By making use of
:two large subsystem locks, they do simplify locking inside the subsystem,
:but it's based on a web of implicit assumptions and boundary
:synchronization that carries most of the risks of explicit locking.

    Yes.  I'm not worried about BPF, and ucred is easy since it is
    already 95% of the way there, though messing with ucred's ref count
    will require a mutex or an atomic bus-locked instruction even in 
    DragonFly!  The route table is our big issue.  TCP caches routes so we
    can still BGL the route table and achieve 85% of the scalable
    performance so I am not going to worry about the route table initially.

    An example with ucred would be to passively queue it to a particular cpu
    for action.  Let's say instead of using an atomic bus-locked instruction
    to manipulate ucred's ref count, we instead send a passive IPI to the
    cpu 'owning' the ucred, and that ucred is otherwise read-only.  A 
    passive IPI, which I haven't implemented yet, is simply queueing an
    IPI message but not actually generating an interrupt on the target cpu
    unless the CPU->CPU software IPI message FIFO is full, so it doesn't
    actually waste any cpu cycles and multiple operations can be executed
    in-batch by the target.  Passive IPIs can be used for things
    that do not require instantaneous action and both bumping and releasing
    ref counts can take advantage of it.  I'm not saying that is how
    we will deal with ucred, but it is a definite option.
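
    Since the passive IPI isn't implemented yet, the following is purely an
    illustrative user-space model of the idea, with hypothetical names:
    refcount deltas are queued to the owning cpu's FIFO without generating an
    interrupt, and the owner applies them in batch; only a full FIFO forces
    an actual interrupt.

```c
#include <assert.h>

struct ucred { int refcount; };      /* read-only except on owning cpu */

#define PIPI_SIZE 8
struct pipi_msg { struct ucred *cr; int delta; };
struct pipi_fifo {
    struct pipi_msg msg[PIPI_SIZE];
    unsigned windex, rindex;
    int interrupts_sent;             /* incremented only on overflow */
};

/* Owning cpu: drain the FIFO, applying queued deltas in batch. */
static void pipi_drain(struct pipi_fifo *f)
{
    while (f->rindex != f->windex) {
        struct pipi_msg *m = &f->msg[f->rindex % PIPI_SIZE];
        m->cr->refcount += m->delta;
        f->rindex++;
    }
}

/* Any cpu: queue a +1/-1 without generating an interrupt.  Only a full
 * FIFO forces an interrupt (simulated here by draining synchronously). */
static void pipi_ref(struct pipi_fifo *f, struct ucred *cr, int delta)
{
    if (f->windex - f->rindex == PIPI_SIZE) {
        f->interrupts_sent++;        /* real code: IPI the owning cpu */
        pipi_drain(f);
    }
    f->msg[f->windex % PIPI_SIZE] = (struct pipi_msg){ cr, delta };
    f->windex++;
}
```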

:It's also worth noting that there have been some serious bugs associated
:with a lack of explicit synchronization in the non-concurrent kernel model
:used in RELENG_4 (and a host of other early UNIX systems relying on a
:single kernel lock).  These have to do with unexpected blocking deep in a
:function call stack, where it's not anticipated by a developer writing
:source code higher in the stack, resulting in race conditions.  In the

    I've encountered this with softupdates, so I know what you mean.  
    softupdates (at least in 4.x) is extremely sensitive to blocking in
    places where it doesn't expect blocking to happen.  My free() code was
    occasionally (and accidentally) blocking in an interrupt thread waiting
    on kernel_map (I've already removed kmem_map from DragonFly), and this
    was enough to cause softupdates to panic in its IO completion rundown
    once in a blue moon due to assumptions on its lock 'lk'.

    Synchronization is a bigger problem in 5.x than it is in DragonFly because
    in DragonFly most of the work is shoved over to the cpu that 'owns' the
    data structure via an async IPI.  e.g. when you want to schedule thread X
    on cpu 1 and thread X is owned by cpu 2, cpu 1 will send an asynch
    IPI to cpu 2 and cpu 2 will actually do the scheduling.  If the cpuid 
    changes during message transit, cpu 2 will simply chase the owning cpu,
    forwarding the message along.  It doesn't matter if the cpuid is out of synch,
    in fact!  You don't even need a memory barrier.  Same goes for the slab 
    allocator...  DragonFly does not mess with the slab allocated by another
    cpu, it forwards the free() request to the other cpu instead.
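
    The owner-forwarding rule for free() might be sketched like this
    (hypothetical names, single-threaded simulation): a cpu that doesn't own
    the chunk forwards the request, chasing the owner until the message
    arrives at the cpu that does.

```c
#include <assert.h>

#define NCPU 4

struct chunk { int owner_cpu; int freed; };

struct cpu_slab { int id; };

/* Only the owning cpu ever manipulates its slab structures. */
static void slab_free_local(struct cpu_slab *c, struct chunk *ch)
{
    assert(ch->owner_cpu == c->id);
    ch->freed = 1;
}

/* free() from any cpu: if we don't own the chunk, forward the request
 * via (simulated) async IPI; the loop models re-checking ownership on
 * arrival and chasing the owner if it moved while in transit. */
static int slab_free(struct cpu_slab cpus[NCPU], int curcpu, struct chunk *ch)
{
    int hops = 0;

    while (ch->owner_cpu != curcpu) {
        curcpu = ch->owner_cpu;      /* async IPI: message arrives here */
        hops++;                      /* on arrival, ownership is re-checked */
    }
    slab_free_local(&cpus[curcpu], ch);
    return hops;
}
```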

    For a protocol, a protocol thread will own a PCB, so the PCB will be
    'owned' by the cpu the protocol thread is on.  Any manipulation of the
    PCB must occur on that cpu or otherwise be very carefully managed
    (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
    Our intention is to encapsulate most operations as messages to the
    protocol thread owning the PCB.

:past, there have been a number of exploitable security vulnerabilities due
:to races opened up in low memory conditions, during paging, etc.  One
:solution I was exploring was using the compiler to help track the
:potential for functions to block, similar to the const qualifier, combined
:with blocking/non-blocking assertions evaluated at compile-time.  However,
:some of our current APIs (M_NOWAIT, M_WAITOK, et al) make that approach
:somewhat difficult to apply, and would have to be revised to use a
:compiler solution.  These potential weaknesses very much exist in an
:explicit model, but with explicit locking, we have a clearer notion of how
:to express assertions.

    DragonFly is using its LWKT messaging API to abstract blocking versus
    non-blocking.  In particular, if a client sends a message using an
    asynch interface it isn't supposed to block, but can return EASYNC if it
    wound up queueing the message due to not being able to execute it
    synchronously without blocking.  If a client sends a message using a
    synchronous messaging interface then the client is telling the
    messaging subsystem that it is ok to block.

    This, combined with the fact that we use critical sections and per-cpu
    globaldata caches that do not require mutexes to access, allows code
    to easily determine whether something might or might not block, and
    the message structure is a convenient placeholder with which to queue
    and return EASYNC deep in the kernel if something would otherwise
    block when it isn't supposed to.
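
    A rough sketch of that contract, with illustrative names and an
    illustrative EASYNC value (this is not the actual LWKT API): the async
    send runs the message in the caller's context when that cannot block,
    otherwise it queues the message and returns EASYNC; the synchronous send
    is allowed to block, simulated here by draining the target port.

```c
#include <assert.h>
#include <stddef.h>

#define EASYNC 512                  /* illustrative value only */

struct lwkt_msg;
struct lwkt_port {
    struct lwkt_msg *head, *tail;
    int busy;                       /* inline execution would block */
};
struct lwkt_msg {
    struct lwkt_msg *next;
    int (*func)(struct lwkt_msg *); /* the operation being requested */
    int result;
    int done;
};

/* Async send: never blocks.  Runs the message in the caller's context
 * when possible, else queues it and reports EASYNC to the caller. */
static int lwkt_sendmsg_async(struct lwkt_port *port, struct lwkt_msg *msg)
{
    if (!port->busy) {
        msg->result = msg->func(msg);
        msg->done = 1;
        return msg->result;
    }
    msg->next = NULL;
    if (port->tail) port->tail->next = msg; else port->head = msg;
    port->tail = msg;
    return EASYNC;                  /* queued; completes later */
}

/* Target thread: drain queued messages. */
static void lwkt_port_drain(struct lwkt_port *port)
{
    struct lwkt_msg *m;

    port->busy = 0;
    while ((m = port->head) != NULL) {
        port->head = m->next;
        if (port->head == NULL) port->tail = NULL;
        m->result = m->func(m);
        m->done = 1;
    }
}

/* Sync send: the caller declares blocking is ok; wait for completion
 * (simulated here by draining the port ourselves). */
static int lwkt_sendmsg(struct lwkt_port *port, struct lwkt_msg *msg)
{
    int error = lwkt_sendmsg_async(port, msg);

    if (error == EASYNC) {
        lwkt_port_drain(port);      /* real code: block until replied */
        error = msg->result;
    }
    return error;
}

/* Example operation: trivially succeeds. */
static int demo_op(struct lwkt_msg *m) { (void)m; return 0; }
```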

    We also have the asynch IPI mechanism and a few other mechanisms at
    our disposal and these cover a surprisingly large number of situations
    in the system.  90% of the 'not sure if we might block' problem
    is related to scheduling or memory allocation and neither of those
    subsystems needs to use extraneous mutexes, so managing the blocking
    conditions is actually quite easy.

:In -CURRENT, we make use of thread-based serialization in a number of
:places to avoid explicit synchronization costs (such as in GEOM for
:processing work queues), and we should make more use of this practice. 
:I'm particularly interested in the use of interface interrupt threads
:performing direct dispatch as a means to maintain interface ordering of
:packets coming in network interfaces while allowing parallelism in network
:processing (you'll find this in use in Sam's netperf branch currently).
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert at      Network Associates Laboratories

    I definitely think that -current should explore a greater role for 
    threading subsystems.  Remember that many operations can be done
    asynchronously and thus do not actually require synchronous context
    switches or blocking.  A GEOM strategy routine is a good example, since
    it must perform I/O and I/O *ALWAYS* blocks or takes an interrupt
    at some point.  However, you need to be careful because not all 
    operations truly need to be run in a threaded subsystem's thread
    context.  This is why DragonFly's LWKT messaging subsystem uses the
    Amiga's BeginIo abstraction for dispatching a message, which allows
    the target port to execute messages synchronously in the context of
    the caller if it happens to be possible to do so without blocking.

    The advantage of this is that we can start out by always queueing the
    message (thereby guaranteeing that queue mode operation will always
    be acceptable), and then later on we can optimize particular messages
    (such as read()'s that are able to lock and access the VM object's
    page cache without blocking, in order to avoid switching to a 
    filesystem thread unnecessarily).

    I'm sure we will hit issues but so far it has been smooth sailing.

					Matthew Dillon 
					<dillon at>

More information about the freebsd-hackers mailing list