FreeBSD mail list etiquette
dillon at apollo.backplane.com
Sat Oct 25 21:01:50 PDT 2003
:> It's a lot easier lockup path than the direction 5.x is going, and
:> a whole lot more maintainable IMHO because most of the coding doesn't
:> have to worry about mutexes or LORs or anything like that.
:You still have to be pretty careful, though, with relying on implicit
:synchronization, because while it works well deep in a subsystem, it can
:break down on subsystem boundaries. One of the challenges I've been
:bumping into recently when working with Darwin has been the split between
:their Giant kernel lock, and their network lock. To give a high level
:summary of the architecture, basically they have two Funnels, which behave
:similarly to the Giant lock in -STABLE/-CURRENT: when you block, the lock
:is released, allowing other threads to enter the kernel, and regained when
:the thread starts to execute again. They then have fine-grained locking
:for the Mach-derived components, such as memory allocation, VM, et al.
I recall a presentation at BSDCon that mentioned that... yours I think.
The interfaces we are contemplating for the NETIF (at the bottom)
and UIPC (at the top) are different. We probably won't need to use
any mutexes to queue incoming packets to the protocol thread, we will
almost certainly use an async IPI message to queue a message holding the
packet if the protocol thread is on a different cpu. On the same cpu
it's just a critical section to interlock the queueing operation against
the protocol thread. Protocol packet output to NETIF would use the
same methodology... asynch IPI message if the NETIF is on another cpu,
critical section if it is on the current cpu.
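
As a rough illustration of the two queueing paths described above, here
is a single-threaded C simulation. Every name in it (cpu_sim,
queue_packet, ipi_drain, the queue sizes) is hypothetical and not taken
from DragonFly. The critical section is modeled as a plain counter and
the IPI message queue as a plain array, which is only valid because
nothing here actually runs concurrently.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 2
#define QLEN 8

/* Hypothetical per-cpu state for a single-threaded simulation. */
struct cpu_sim {
    int   crit_depth;      /* > 0 while inside a critical section */
    void *protoq[QLEN];    /* packets waiting for the protocol thread */
    int   protoq_len;
    void *ipiq[QLEN];      /* async IPI messages sent by other cpus */
    int   ipiq_len;
};

struct cpu_sim cpus[NCPU];

static void crit_enter(struct cpu_sim *c) { c->crit_depth++; }
static void crit_exit(struct cpu_sim *c)  { c->crit_depth--; }

/*
 * Queue a packet for the protocol thread living on target_cpu.
 * Returns 0 for the same-cpu path, 1 for the IPI path, -1 if full.
 */
int queue_packet(int cur_cpu, int target_cpu, void *pkt)
{
    struct cpu_sim *t = &cpus[target_cpu];

    if (cur_cpu == target_cpu) {
        int rc = -1;

        /* Same cpu: only need to interlock against the local thread. */
        crit_enter(t);
        if (t->protoq_len < QLEN) {
            t->protoq[t->protoq_len++] = pkt;
            rc = 0;
        }
        crit_exit(t);
        return rc;
    }

    /* Different cpu: hand the packet over as an async message. */
    if (t->ipiq_len == QLEN)
        return -1;
    t->ipiq[t->ipiq_len++] = pkt;
    return 1;
}

/* Runs on the target cpu: move pending IPI messages to the protocol queue. */
int ipi_drain(int cpu)
{
    struct cpu_sim *c = &cpus[cpu];
    int moved = 0;

    crit_enter(c);
    while (moved < c->ipiq_len && c->protoq_len < QLEN)
        c->protoq[c->protoq_len++] = c->ipiq[moved++];
    /* Keep any messages that did not fit, preserving their order. */
    for (int i = moved; i < c->ipiq_len; i++)
        c->ipiq[i - moved] = c->ipiq[i];
    c->ipiq_len -= moved;
    crit_exit(c);
    return moved;
}
```

The point of the split is that the protocol queue is only ever touched
by the cpu that owns it, so neither path needs a mutex on that queue.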
The protocol itself will change from a softint to a normal thread, or
perhaps a thread at softint priority. The softint is already a thread
but we would separate each protocol into its own thread and have an
ability to create several threads for a single protocol (like TCP) when
necessary to take advantage of multiple cpus.
On the UIPC side we have a choice of using a mutex to lock the socket
buffer, or passing a message to the protocol thread responsible for
the socket buffer (aka PCB). There are tradeoffs for both situations
since if this is related to a write() it winds up being a synchronous
message. Another option is to COW the memory but that might be too
complex. Smaller writes could simply copyin() the data as an option,
or we could treat the socket buffer as a FIFO which would allow the
system call UIPC interface to append to it without holding any locks
(other than a memory barrier after the copy and before updating the
index), then simply send a kick-off message to the protocol thread
telling it that more data is present.
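
The FIFO option can be sketched as a single-producer, single-consumer
ring buffer using C11 atomics. This is an illustration only: the names
(sockbuf_fifo, sb_append, sb_drain) and the tiny buffer size are made
up, and it assumes exactly one writer (the system call side) and one
reader (the protocol thread). The release store on the index plays the
role of the memory barrier after the copy; the kick-off message to the
protocol thread is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SB_SIZE 16   /* hypothetical size; must be a power of two */

struct sockbuf_fifo {
    unsigned char  data[SB_SIZE];
    _Atomic size_t windex;   /* advanced only by the producer */
    _Atomic size_t rindex;   /* advanced only by the consumer */
};

/* Producer side: copy in up to len bytes, return how many fit. */
size_t sb_append(struct sockbuf_fifo *sb, const void *src, size_t len)
{
    const unsigned char *s = src;
    size_t w = atomic_load_explicit(&sb->windex, memory_order_relaxed);
    size_t r = atomic_load_explicit(&sb->rindex, memory_order_acquire);
    size_t space = SB_SIZE - (w - r);

    if (len > space)
        len = space;
    for (size_t i = 0; i < len; i++)
        sb->data[(w + i) & (SB_SIZE - 1)] = s[i];

    /* Barrier after the copy and before publishing the new index. */
    atomic_store_explicit(&sb->windex, w + len, memory_order_release);
    return len;
}

/* Consumer side: copy out up to len bytes, return how many were read. */
size_t sb_drain(struct sockbuf_fifo *sb, void *dst, size_t len)
{
    unsigned char *d = dst;
    size_t r = atomic_load_explicit(&sb->rindex, memory_order_relaxed);
    size_t w = atomic_load_explicit(&sb->windex, memory_order_acquire);
    size_t avail = w - r;

    if (len > avail)
        len = avail;
    for (size_t i = 0; i < len; i++)
        d[i] = sb->data[(r + i) & (SB_SIZE - 1)];

    /* Publish the freed space only after the data has been read. */
    atomic_store_explicit(&sb->rindex, r + len, memory_order_release);
    return len;
}
```

Because each index has a single writer, neither side ever needs to take
a lock; a short append simply reports that the buffer is full.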
:Deep in a particular subsystem -- say, the network stack, all works fine.
:The problem is at the boundaries, where structures are shared between
:multiple compartments. I.e., process credentials are referenced by both
:"halves" of the Darwin BSD kernel code, and are insufficiently protected
:in the current implementation (they have a write lock, but no read lock,
:so it looks like it should be possible to get stale references with
:pointers accessed in a read form under two different locks). Similarly,
:there's the potential for serious problems at the surprisingly frequently
:occurring boundaries between the network subsystem and remainder of the
:kernel: file descriptor related code, fifos, BPF, et al. By making use of
:two large subsystem locks, they do simplify locking inside the subsystem,
:but it's based on a web of implicit assumptions and boundary
:synchronization that carries most of the risks of explicit locking.
Yes. I'm not worried about BPF, and ucred is easy since it is
already 95% of the way there, though messing with ucred's ref count
will require a mutex or an atomic bus-locked instruction even in
DragonFly! The route table is our big issue. TCP caches routes so we
can still BGL the route table and achieve 85% of the scalable
performance so I am not going to worry about the route table initially.
An example with ucred would be to passively queue it to a particular cpu
for action. Let's say instead of using an atomic bus-locked instruction
to manipulate ucred's ref count, we instead send a passive IPI to the
cpu 'owning' the ucred, and that ucred is otherwise read-only. A
passive IPI, which I haven't implemented yet, is simply queueing an
IPI message but not actually generating an interrupt on the target cpu
unless the CPU->CPU software IPI message FIFO is full, so it doesn't
actually waste any cpu cycles and multiple operations can be executed
in-batch by the target. Passive IPIs can be used for things
that do not require instantaneous action and both bumping and releasing
ref counts can take advantage of it. I'm not saying that is how
we will deal with ucred, but it is a definite option.
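
Since the passive IPI is described above as not yet implemented, the
following is only a guess at its shape, written as a single-threaded C
simulation. All names (ucred_sim, ipiq_sim, passive_ipi_send,
ipiq_process) are hypothetical. The "interrupt" is modeled as a counter
plus a direct call into the owner's batch processing, and the sketch
does not address how a real kernel would keep a deferred decrement from
racing with the last reference going away.

```c
#include <assert.h>

#define IPIQ_LEN 4   /* hypothetical FIFO depth */

struct ucred_sim { int refs; };

struct ipi_msg {
    struct ucred_sim *cr;
    int               delta;   /* +1 to bump, -1 to release */
};

struct ipiq_sim {
    struct ipi_msg msgs[IPIQ_LEN];
    int            count;
    int            interrupts;  /* times the owner had to be interrupted */
};

/* Owning cpu: apply every pending ref count change in one batch. */
int ipiq_process(struct ipiq_sim *q)
{
    int n = q->count;

    for (int i = 0; i < n; i++)
        q->msgs[i].cr->refs += q->msgs[i].delta;
    q->count = 0;
    return n;
}

/* Any other cpu: queue the change; interrupt the owner only when full. */
void passive_ipi_send(struct ipiq_sim *q, struct ucred_sim *cr, int delta)
{
    if (q->count == IPIQ_LEN) {
        q->interrupts++;        /* stands in for a real IPI */
        ipiq_process(q);
    }
    q->msgs[q->count].cr = cr;
    q->msgs[q->count].delta = delta;
    q->count++;
}
```

With a FIFO depth of four, the first four ref count changes cost the
owner nothing; the fifth forces one interrupt that applies the whole
batch at once.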
:It's also worth noting that there have been some serious bugs associated
:with a lack of explicit synchronization in the non-concurrent kernel model
:used in RELENG_4 (and a host of other early UNIX systems relying on a
:single kernel lock). These have to do with unexpected blocking deep in a
:function call stack, where it's not anticipated by a developer writing
:source code higher in the stack, resulting in race conditions. In the
I've encountered this with softupdates, so I know what you mean.
softupdates (at least in 4.x) is extremely sensitive to blocking in
places where it doesn't expect blocking to happen. My free() code was
occasionally (and accidentally) blocking in an interrupt thread waiting
on kernel_map (I've already removed kmem_map from DragonFly), and this
was enough to cause softupdates to panic in its IO completion rundown
once in a blue moon due to assumptions on its lock 'lk'.
Synchronization is a bigger problem in 5.x than it is in DragonFly because
in DragonFly most of the work is shoved over to the cpu that 'owns' the
data structure via an async IPI. e.g. when you want to schedule thread X
on cpu 1 and thread X is owned by cpu 2, cpu 1 will send an asynch
IPI to cpu 2 and cpu 2 will actually do the scheduling. If the cpuid
changes during the message transit cpu 2 will simply chase the owning cpu,
forwarding it along. It doesn't matter if the cpuid is out of synch,
in fact! You don't even need a memory barrier. Same goes for the slab
allocator... DragonFly does not mess with the slab allocated by another
cpu, it forwards the free() request to the other cpu instead.
For a protocol, a protocol thread will own a PCB, so the PCB will be
'owned' by the cpu the protocol thread is on. Any manipulation of the
PCB must occur on that cpu or otherwise be very carefully managed
(e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
Our intention is to encapsulate most operations as messages to the
protocol thread owning the PCB.
:past, there have been a number of exploitable security vulnerabilities due
:to races opened up in low memory conditions, during paging, etc. One
:solution I was exploring was using the compiler to help track the
:potential for functions to block, similar to the const qualifier, combined
:with blocking/non-blocking assertions evaluated at compile-time. However,
:some of our current APIs (M_NOWAIT, M_WAITOK, et al) make that approach
:somewhat difficult to apply, and would have to be revised to use a
:compiler solution. These potential weaknesses very much exist in an
:explicit model, but with explicit locking, we have a clearer notion of how
:to express assertions.
DragonFly is using its LWKT messaging API to abstract blocking versus
non-blocking. In particular, if a client sends a message using an
asynch interface it isn't supposed to block, but can return EASYNC if it
wound up queueing the message due to not being able to execute it
synchronously without blocking. If a client sends a message using a
synchronous messaging interface then the client is telling the
messaging subsystem that it is ok to block.
This combined with the fact that we are using critical sections and
per-cpu globaldata caches that do not require mutexes to access allows
code to easily determine whether something might or might not block,
and the message structure is a convenient placemark to queue and
return EASYNC deep in the kernel if something would otherwise block
when it isn't supposed to.
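
The async/sync contract described above might look roughly like this
from the caller's side. This is a toy model, not the LWKT API: the
names (msg_sim, port_sim, msg_send_async, msg_send_sync, port_run) are
invented, EASYNC_SIM is a stand-in value, and "would block" is just a
flag the caller sets so the two outcomes can be shown.

```c
#include <assert.h>
#include <stddef.h>

#define EASYNC_SIM 1   /* hypothetical stand-in for EASYNC */

struct msg_sim {
    int             would_block;  /* set by the caller of this sketch */
    int             done;
    struct msg_sim *next;
};

struct port_sim {
    struct msg_sim *head;
    struct msg_sim *tail;
};

/* Never blocks: completes the message now or queues it for later. */
int msg_send_async(struct port_sim *p, struct msg_sim *m)
{
    m->done = 0;
    m->next = NULL;
    if (!m->would_block) {
        m->done = 1;              /* executed in the caller's context */
        return 0;
    }
    if (p->tail != NULL)
        p->tail->next = m;
    else
        p->head = m;
    p->tail = m;
    return EASYNC_SIM;            /* queued; a reply comes later */
}

/* The port's own thread: allowed to block while completing messages. */
int port_run(struct port_sim *p)
{
    int n = 0;

    while (p->head != NULL) {
        struct msg_sim *m = p->head;

        p->head = m->next;
        m->next = NULL;
        m->done = 1;
        n++;
    }
    p->tail = NULL;
    return n;
}

/* Caller has said blocking is fine, so wait for completion. */
int msg_send_sync(struct port_sim *p, struct msg_sim *m)
{
    if (msg_send_async(p, m) == EASYNC_SIM)
        port_run(p);   /* stands in for sleeping until the reply arrives */
    return 0;
}
```

The message structure itself is what gets parked on the queue, which is
why it works as a placemark when an operation cannot finish inline.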
We also have the asynch IPI mechanism and a few other mechanisms at
our disposal and these cover a surprisingly large number of situations
in the system. 90% of the 'not sure if we might block' problem
is related to scheduling or memory allocation and neither of those
subsystems needs to use extraneous mutexes, so managing the blocking
conditions is actually quite easy.
:In -CURRENT, we make use of thread-based serialization in a number of
:places to avoid explicit synchronization costs (such as in GEOM for
:processing work queues), and we should make more use of this practice.
:I'm particularly interested in the use of interface interrupt threads
:performing direct dispatch as a means to maintain interface ordering of
:packets coming in network interfaces while allowing parallelism in network
:processing (you'll find this in use in Sam's netperf branch currently).
:Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
:robert at fledge.watson.org Network Associates Laboratories
I definitely think that -current should explore a greater role for
threading subsystems. Remember that many operations can be done
asynchronously and thus do not actually require synchronous context
switches or blocking. A GEOM strategy routine is a good example, since
it must perform I/O and I/O *ALWAYS* blocks or takes an interrupt
at some point. However, you need to be careful because not all
operations truly need to be run in a threaded subsystem's thread
context. This is why DragonFly's LWKT messaging subsystem uses the
Amiga's BeginIo abstraction for dispatching a message, which allows
the target port to execute messages synchronously in the context of
the caller if it happens to be possible to do so without blocking.
The advantage of this is that we can start out by always queueing the
message (thereby guaranteeing that queue mode operation will always
be acceptable), and then later on we can optimize particular messages
(such as read()'s that are able to lock and access the VM object's
page cache without blocking, in order to avoid switching to a
filesystem thread unnecessarily).
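
One way to picture the BeginIo-style dispatch is a per-port function
pointer that starts out as "always queue" and can later be swapped for
a version that tries to run the message in the caller's context. The
sketch below is an assumption-laden illustration: bport, bmsg,
beginio_queue, beginio_inline, and the page_cached/page_missing hooks
are all hypothetical names, not DragonFly's.

```c
#include <assert.h>
#include <stddef.h>

#define BQUEUED 1   /* hypothetical "message was queued" return value */
#define BQLEN   8

struct bport;
struct bmsg;

typedef int (*beginio_fn)(struct bport *, struct bmsg *);

struct bmsg {
    int (*try_run)(struct bmsg *);  /* 1 if it completed without blocking */
    int done;
    int ran_inline;
};

struct bport {
    beginio_fn   beginio;           /* how this port starts a message */
    struct bmsg *queue[BQLEN];
    int          qlen;
};

/* Safe default: always hand the message to the port's thread. */
int beginio_queue(struct bport *p, struct bmsg *m)
{
    if (p->qlen == BQLEN)
        return -1;
    p->queue[p->qlen++] = m;
    return BQUEUED;
}

/* Optimization: run in the caller's context when that cannot block. */
int beginio_inline(struct bport *p, struct bmsg *m)
{
    if (m->try_run != NULL && m->try_run(m)) {
        m->done = 1;
        m->ran_inline = 1;
        return 0;
    }
    return beginio_queue(p, m);
}

/* Callers never care which strategy the port is using. */
int port_dispatch(struct bport *p, struct bmsg *m)
{
    m->done = 0;
    m->ran_inline = 0;
    return p->beginio(p, m);
}

/* Example hooks: a read() that hits or misses the page cache. */
int page_cached(struct bmsg *m)  { (void)m; return 1; }
int page_missing(struct bmsg *m) { (void)m; return 0; }
```

Since queueing is always a legal outcome, switching a port from
beginio_queue to beginio_inline changes performance but not the
contract its callers rely on.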
I'm sure we will hit issues but so far it has been smooth sailing.
<dillon at backplane.com>