Comments on the KSE option
Matthew Dillon
dillon at apollo.backplane.com
Sun Oct 29 18:19:04 UTC 2006
:All this debate about the merits of process scope threads and fair
:scheduling is great. But tell me, who was working on making this stuff
:work quickly and reliably (i.e. work well)? No one! I don't care
:what AIX or Solaris or what else may or may not have done, who was
:making this work well for FreeBSD? Having a slow thread subsystem is
:a serious detriment, no matter how nice and flexible it looks on paper.
:
:Scott
I hope Julian won't think badly of me for saying this, but I don't
think M:N support in the kernel is a good idea. M:N is the way to go
in my view, but the 'M' has to be implemented in userland. And,
of course, M:N does *NOT* preclude 1:1. The 'N' is just a number picked
out of the ether, after all, so barring userland support you wind up
with 1:1 in the kernel.
There are a couple of people working on LWPs in DragonFly (that is,
direct 1:1 kernel support that includes POSIX signal sharing). It
is my intention to take this mechanism, once it is working, and
transform it into a M:N implementation where the LWPs represent the 'N'
and the userland thread library deals with the 'M'.
If one really thinks about it, why is a userland implementation slower
than a KSE implementation, and can it be made more efficient? This
is what I have come up with:
* Extra kevent() calls to register events, extra kevent() calls to poll
for new events. When the userland thread library issues a non-blocking
I/O it has to call kevent() to add the descriptor when EWOULDBLOCK
is returned. When the userland thread scheduler switches threads
it needs to poll for new events as well.
But this can be automated. There is no reason why the kernel couldn't
automatically add a descriptor returning EWOULDBLOCK to a kqueue.
Also, there is no reason why the kernel couldn't write to a user
supplied memory location to notify the userland scheduler that a
new kevent is pending. (And note that not using EV_ONESHOT in
current userland thread libraries is NOT an optimal solution to
this problem, for reasons that should be obvious if you think about
it for a few seconds).
* Signal mask handling. The userland thread scheduler needs to block
signals during certain critical operations and needs to be able to
adjust the signal mask when switching threads (depending on the scope).
It must also poll for blocked signals.
But this doesn't have to be done with system calls, at least not in
the critical path. There is no reason why userland can't register
a signal mask pointer and pending signal set with the kernel, where
they both reside in user memory. The kernel then only needs to do a
copyin or two when processing a signal. Signals occur so rarely that
it is extraordinarily difficult to justify putting all that overhead in the
userland thread scheduler's critical path. It makes sense to shift
the overhead to the actual signal delivery operation.
Userland can then adjust the signal mask simply by changing a pointer,
and poll for blocked signals by testing a single variable in memory.
* IPC between threads. Something similar to IPI messaging is needed,
where the data is passed solely via shared memory. The only system
call involved would be to queue an upcall to the target (rforked)
process. This is more a DragonFly-like abstraction though. We are
big on cpu localization. In an M:N environment, the cpu localization
is abstracted as the 'N'.
* TLS segment switching. Short of trying to implement a caching scheme
in the segment descriptor array this probably needs to remain a
system call. But it isn't very expensive. ~350ns or so on my
DragonFly test box. Frankly, the kernel can't switch threads
all that quickly either.
It takes the kernel at least 1 uS to switch threads, whereas a
userland thread switch (including FP), plus the TLS call, winds up
being around 1.2 uS. It really isn't that big a difference.
* Blocked FILESYSTEM disk I/O. From a performance standpoint blocked
disk I/O is the biggest issue for a M:N design over a 1:1 design.
In fact, I think ultimately this is *THE* only issue of any
significance.
It seems to me that the kernel is long, LONG overdue for getting
filesystem support for O_NONBLOCK. I am NOT talking about AIO
here, I am talking about making read() or write() to a file in a
filesystem work efficiently in a threaded environment.
Traditionally the kernel blocks unconditionally in kernel space for
such I/O and does read-ahead in 128KB blocks. O_NONBLOCK is ignored.
What we want to do is make it work with O_NONBLOCK (or perhaps some
new flag) in an efficient manner. This implies that, with special
system calls and/or flags, file I/O should be able to return
EWOULDBLOCK *AND* should *ALSO* operate somewhat like a device when
it does so,
with the knowledge that the user program tried to issue this large
read kept intact in the kernel so the kernel can do a dependable
read-ahead of some of the data (more than 128KB... at least 512KB
in my view for things to be efficient), and then generate an event
for that descriptor just like a normal device or pipe or socket would.
What I am describing here is NOT AIO. IMHO AIO as a concept is a
complete failure.
-Matt
Matthew Dillon
<dillon at backplane.com>