Asynchronous syscalls for massive concurrency

Wed Mar 19 21:05:30 UTC 2008

Hello,

Reading the recent thread on -current regarding dropping KSE and the
evolution of application design with respect to concurrency, prompted
me to want to ask what people think regarding the desirability and/or
feasability of making FreeBSD support an asynchronous interface to
*all* system calls. In particular, I would love to hear from
experienced kernel developers whether this is a total pipe dream in
terms of practically implementing it semi-efficiently with a
reasonable amount of work, in a real-world operating system like
FreeBSD.

It is something that I have secretly coveting for for a long time
(generally, not for FreeBSD specifically). Let me explain my
motivation. Some will probably disagree, others will think I am
pointing out the obvious, but I still want to summarize it here in the
hopes of triggering a fruitful and/or interesting discussion. If this
has been discussed, feel free to point me in the right direction.

Apologies if this reads like a rant (will se if anyone bothers to read
it to begin with ;).

State machines vs. concurrency in application design:

It is my belief that, in general, from a design perspective there is a
desire to model a fundamentally concurrent process in the form of
concurrency, rather than a state machine. It is my belief that,
ignoring efficiency, concurrency tends to lead to cleaner, more
modularized and more easily maintainable code.

"Concurrency" above does not necessarily refer to operating system
threads using the pthread model. I will mention implementation issues
later on. Right now, I am arguing strictly from the design perspective
of how I want to structure my code.

In the simplest cases, a state machine will be pretty small and easy
to understand (let's say the state machine parsing simple line-based
messages in an IRC server). As the problem grows more complex, you
will invariable wish to modularize the code; introduce abstractions;
moving details away from the core loop.

This can be done to any extent; you might generalize it a bit and be
fairly satisfied. Until you want to introduce even more features and
you feel the need to further generalize. You end up reaching a point
where the framework for writing a modularized state machine is
approaching the featurefullness of actual concurrency. If your
language supports continuations, you may very well simply implement
the state machine in terms of these. In this case, you pretty much
have userland co-operative threading. That is one end of the extreme;
the other end is the simple core state machine loop without
modularizations. In between, you have various levels of generality in
explicit state management.

In this way, I view "true" concurrency modelling as essentially being
the perfect state machine, completely hidden from application
code. From this perspective, manually writing explicit state
management is essentially a less powerful and less expressive way of
achieving the same goal, that I would only resort to if "true"
concurrency is not possible or practical.

I will readily admit that I have not written a lot of state machine
style code, exactly because of my view expressed above. I have however
written, and write on a daily basis, a pretty considerable amount of
multi-threaded (or otherwise concurrent) code. Some of it could
reasonable be re-done in the form of a state machine, while in other
cases the complexity would be absolutely tremendous without using some
form of state management framework that at least approaches true
concurrency (and even then complexity would be considerably higher
than now).

This is my reason for fundamentally preferring a concurrent approach,
and would love for it to be supported by the language/environment
being used, to an extent such that there is no need to fall back to
explicit state management.

Enabling massive concurrency by asynchronous syscalls:

Support for asynchronous syscalls, I believe, would help *a lot* in
realizing practical massive concurrency in a variety of environments,
from C/C++ to more higher level languages.

  C/C++ and similar w/ pthread concurrency model:

    One can already come *fairly* far by writing code in a stack
    conscious manner. Most of the C/C++ I do is written with this in
    mind, and having tens of thousands of threads or more is not a
    problem at all, because the stack size can be very very
    conservative. This is not to say that it is necessarily the most
    efficient approach (context switching overhead, cache localitly,
    etc), but it does get you pretty far - particularly when the
    application is extremely concurrent in relation to the actual
    amount of work done in total (that is, when your problem is not
    CPU efficiency, but the ability to e.g. handle 30 000 clients,
    most of which are mostly idle).

    That said, there are still issues. Regardless of whether explicit
    memory management is used, or garbage collection, there is a
    desire to reduce contention. Many mallocs have thread-local pools
    (including jemalloc). But of course, if you want to be running
    with a 32 kbyte stack size, suddenly you do not have a lot of room
    for thread-local resources of that type.

    Having asynchronous syscalls, in combination with sufficient
    support from the runtime/threading libraries, would allow for
    these resources to scale with the number of *cores*, rather than
    the number of threads. Even if the nature of the C language more
    or less requires, as far as I can tell, that you retain a
    per-thread native stack, it would be a huge plus in several
    situations to be able to scale based on memory availability
    relative to stack size, without worrying about secondary affects
    like higher contention in malloc.

  Higher level languages with native threads (e.g. Python,
  Ruby 1.9, some Common Lisps):

    A similar situation exists here as with C/C++.

  Higher level languages without native threads (Ruby 1.8, erlang,
  chicken, etc):

    This is my primary motivation. Several languages attempt to get
    away with using non-native threads, and end up being pretty
    scalable in terms of concurrency - but at a cost. Only certain
    operations are possible to do in a non-blocking fashion
    (basically, I/O).

    In the case of e.g. Ruby and Chicken, the problem is simply lived
    with, and other syscalls block the interpreter, causing the
    language to be much less usable than it otherwise would have been,
    in a variety of "production" situations.

    In the case of erlang, this problem is dealt with by requiring
    blocking operations to be performed in a separate process, with
    pretty expensive communication going on between those processes
    and the core virtual machine. The fact that you can't just
    randomly do your syscall, or blocking library call, right off the
    bat means the threshold, in terms of effort, in interfacing with
    native non-erlang code is pretty high. (Of course in erlang a
    major purpose is to have this separation on purpose, to ensure
    that the core virtual machine is not tained by broken native
    code. I am ignoring that here.) 

    The reason why your call blocks the entire {Ruby,Chicken,...}
    interpreter is not that somebody *wants* this to happen, but
    because of the effort required to fix this problem - especially if
    you are not willing to take the massive performance hit of
    requireing native threads. Ruby 1.9 will have native threading,
    but at a certain cost in terms of concurrency overhead, but also
    in terms of language power (continuations are no longer feasable
    to support).

Are asynchronous system calls just a bad idea to begin with? Am I
failing to identify major problems with this approach?[1] Is it just not
feasable at the kernel implementation level? How much of this overlaps
with KSE?

Would it be correct to say that the primary complexity in implementing
this in the kernel, stems from exactly the problem I claim with
application development - a need for additional reliance on explicit
state management, in the blocking cases where, right now, a lot is
kept implicit in the process' kernel stack?

If so I guess one could look at it as moving that problem from the
applications, to the kernel (and/or the runtime library).

[1] One problem I can think of is that one will loose concurrency in
memory access, so that the impact of high-latency paging (to/from
disk) will likely be a lot higher, since your other threads (in the
same native thread/process) will not be able to execute during said
paging. A single page-in (and to a lesser extent page-out) effectively
blocks any number of "threads".

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20080319/38370f92/attachment.pgp