Comments on the KSE option
Matthew Dillon
dillon at apollo.backplane.com
Sun Oct 29 18:19:04 UTC 2006
:All this debate about the merits of process scope threads and fair
:scheduling is great. But tell me, who was working on making this stuff
:work quickly and reliably (i.e. work well)? No one! I don't care
:what AIX or Solaris or what else may or may not have done, who was
:making this work well for FreeBSD? Having a slow thread subsystem is
:a serious detriment, no matter how nice and flexible it looks on paper.
:
:Scott
I hope Julian won't think badly of me for saying this, but I don't
think M:N support in the kernel is a good idea. M:N is the way to go
in my view, but the 'M' has to be implemented in userland. And,
of course, M:N does *NOT* preclude 1:1. The 'N' is just a number picked
out of the ether, after all, so barring userland support you wind up
with 1:1 in the kernel.
There are a couple of people working on LWPs in DragonFly (that is,
direct 1:1 kernel support that includes POSIX signal sharing). It
is my intention to take this mechanism, once it is working, and
transform it into a M:N implementation where the LWPs represent the 'N'
and the userland thread library deals with the 'M'.
If one really thinks about it, why is a userland implementation slower
than a KSE implementation, and can it be made more efficient? This
is what I have come up with:
* Extra kevent() calls to register events, extra kevent() calls to poll
for new events. When the userland thread library issues a non-blocking
I/O it has to call kevent() to add the descriptor when EWOULDBLOCK
is returned. When the userland thread scheduler switches threads
it needs to poll for new events as well.
But this can be automated. There is no reason why the kernel couldn't
automatically add a descriptor returning EWOULDBLOCK to a kqueue.
Also, there is no reason why the kernel couldn't write to a user
supplied memory location to notify the userland scheduler that a
new kevent is pending. (And note that not using EV_ONESHOT in
current userland thread libraries is NOT an optimal solution to
this problem, for reasons that should be obvious if you think about
it for a few seconds).
* Signal mask handling. The userland thread scheduler needs to block
signals during certain critical operations and needs to be able to
adjust the signal mask when switching threads (depending on the scope).
It must also poll for blocked signals.
But this doesn't have to be done with system calls, at least not in
the critical path. There is no reason why userland can't register
a signal mask pointer and pending signal set with the kernel, where
they both reside in user memory. The kernel then only needs to do a
copyin or two when processing a signal. Signals occur so rarely that
it is extraordinarily difficult to justify putting all that overhead in the
userland thread scheduler's critical path. It makes sense to shift
the overhead to the actual signal delivery operation.
Userland can then adjust the signal mask simply by changing a pointer,
and poll for blocked signals by testing a single variable in memory.
* IPC between threads. Something similar to IPI messaging is needed,
where the data is passed solely via shared memory. The only system
call involved would be to queue an upcall to the target (rforked)
process. This is more a DragonFly-like abstraction though. We are
big on cpu localization. In an M:N environment, the cpu localization
is abstracted as the 'N'.
* TLS segment switching. Short of trying to implement a caching scheme
in the segment descriptor array this probably needs to remain a
system call. But it isn't very expensive. ~350ns or so on my
DragonFly test box. Frankly, the kernel can't switch threads
all that quickly either.
It takes the kernel at least 1 uS to switch threads, whereas a
userland thread switch (including FP), plus the TLS call, winds up
being around 1.2 uS. It really isn't that big a difference.
* Blocked FILESYSTEM disk I/O. From a performance standpoint blocked
disk I/O is the biggest issue for a M:N design over a 1:1 design.
In fact, I think ultimately this is *THE* only issue of any
significance.
It seems to me that the kernel is long, LONG overdue for getting
filesystem support for O_NONBLOCK. I am NOT talking about AIO
here, I am talking about making read() or write() to a file in a
filesystem work efficiently in a threaded environment.
Traditionally the kernel blocks unconditionally in kernel space for
such I/O and does read-ahead in 128KB blocks. O_NONBLOCK is ignored.
What we want to do is make it work with O_NONBLOCK (or perhaps some
new flag) in an efficient manner. This implies that, with special
system calls and/or flags, file I/O should be able to return
EWOULDBLOCK *AND* should *ALSO* operate somewhat like a device when
it does so,
with the knowledge that the user program tried to issue this large
read kept intact in the kernel so the kernel can do a dependable
read-ahead of some of the data (more than 128KB... at least 512KB
in my view for things to be efficient), and then generate an event
for that descriptor just like a normal device or pipe or socket would.
What I am describing here is NOT AIO. IMHO AIO as a concept is a
complete failure.
-Matt
Matthew Dillon
<dillon at backplane.com>