Realtime thread priorities

Thu Dec 16 14:41:11 UTC 2010

On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
> John Baldwin wrote:
> > On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
> >> John Baldwin wrote:
> >>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
> >>>> John Baldwin wrote:
> >>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
> >>>>>> John Baldwin wrote:
> >>>>>>> The current layout breaks up the global thread priority space (0 - 255)
> >>>>>>> into a couple of bands:
> >>>>>>>
> >>>>>>>   0 -  63 : interrupt threads
> >>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
> >>>>>>> 128 - 159 : real-time user threads (rtprio)
> >>>>>>> 160 - 223 : time-sharing user threads
> >>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
> >>>>>>>
> >>>>>>> If we decide to change the behavior I see two possible fixes:
> >>>>>>>
> >>>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
> >>>>>>> priority range
> >>>>>> Would not this cause a priority inversion when an RT process
> >>>>>> enters the kernel mode?
> >>>>> How so?  Note that timesharing threads are not "bumped" to a kernel sleep 
> >>>>> priority when they enter the kernel either.  The kernel sleep priorities are 
> >>>>> purely a way for certain sleep channels to cause a thread to be treated as 
> >>>>> interactive and give it a priority boost to favor interactive threads.  
> >>>>> Threads in the kernel do not automatically have higher priority than threads 
> >>>>> not in the kernel.  Keep in mind that all stopped threads (threads not 
> >>>>> executing) are always in the kernel when they stop.
> >>>> I have a requirement that a thread running in the kernel have higher
> >>>> priority than a thread running userland code.  Because our kernel
> >>>> mutex is not sleepable, unlike the one in Solaris, I have to use
> >>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
> >>>> to read and write user address space.  This is what umtxq_busy() does,
> >>>> but it does not prevent a userland thread from preempting a thread
> >>>> that has locked the chain.  If a realtime thread preempts a thread
> >>>> that has locked the chain, it may lock up whole processes using pthread.
> >>>> I think our realtime scheduling is not very useful; it is too easy
> >>>> to lock up the system.
> >>> Users are not forced to use rtprio.  They choose to do so, and they have to
> >>> be root to enable it (either directly or by extending root privileges via
> >>> sudo or some such).  Just because you don't have a use case for it doesn't
> >>> mean that other people do not.  Right now there is no way possible to say
> >>> that a given userland process is more important than 'sshd' (or any other
> >>> daemon) blocked in poll/select/kevent waiting for a packet.  However, there
> >>> are use cases where other long-running userland processes are in fact far
> >>> more important than sshd (or similar processes such as getty, etc.).
> >>>
> >> You still don't answer me about how to avoid a time-sharing thread
> >> holding a critical kernel resource which preempted by a user RT thread,
> >> and later the RT thread requires the resource, but the time-sharing
> >> thread has no chance to run because another RT thread is dominating
> >> the CPU because it is doing CPU bound work, result is deadlock, even if
> >> you know you trust your RT process, there are many code which were
> >> written by you, i.e the libc and any other libraries using threading
> >> are completely not ready for RT use.
> >> However, letting a thread in the kernel have higher priority than a thread
> >> running userland code will fix such a deadlock in the kernel.
> > 
> > Put another way, the time-sharing thread that I don't care about (sshd, or
> > some other monitoring daemon, etc.) is stealing a resource I care about
> > (time, in the form of CPU cycles) from my RT process that is critical to
> > getting my work done.
> > 
> > Beyond that a few more points:
> > 
> > - You are ignoring "tools, not policy".  You don't know what is in my binary
> >   (and I can't really tell you).  Assume for a minute that I'm not completely
> >   dumb and can write userland code that is safe to run at this high of a
> >   priority level.  You already trust me to write code in the kernel that runs
> >   at even higher priority now. :)
> > - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
> >   Users have to intentionally decide to enable this, and there are users who
> >   do _need_ this functionality.
> > - You have also missed that this has always been true for idprio processes
> >   (and is in fact why we restrict idprio to root), so this is not "new".
> > - Finally, you also are missing that this can already happen _now_ for plain
> >   old time sharing processes if the thread holding the resource doesn't ever
> >   do a sleep that raises the priority.
> > 
> > For example, if a time-sharing thread with some typical priority >=
> > PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
> > that file (if it is unlocked) and hold that lock while its priority is >=
> > PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
> > up sshd for a new SSH connection, the interrupt thread will preempt the
> > thread holding the vnode lock, and sshd will be executed instead of the
> > thread holding the vnode lock when the ithread finishes.  If sshd needs the
> > vnode lock that the original thread holds, then sshd will block until the
> > original thread is rescheduled due to the random fates of time and releases
> > the vnode lock.
> > 
> > In summary, the kernel sleep priorities do _not_ serve to prevent all
> > priority inversions; what they do accomplish is giving preferential treatment
> > to idle, "interactive" threads.
> > 
> > A bit more information on my use case btw:
> > 
> > My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
> > CPU from the global cpuset and ensure no interrupts are routed to that CPU).
> > The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
> > VM object during a page fault), then I want the RT thread to lend its RT
> > priority to the thread that holds the lock over on another CPU so that the lock
> > can be released as quickly as possible.  This use case is perfectly safe (the
> > RT thread is not preempting other threads, instead other threads are partitioned
> > off into a separate set of available CPUs).  What I need is to ensure that the
> > syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
> > away when it holds a lock that I need.
> > 
> What I meant is that whenever a thread is in kernel mode, it always has
> higher priority than a thread running user code, and all threads in kernel
> mode may have the same priority except interrupt threads, which
> have higher priority.  But this should be carefully designed to use
> mutexes and spin locks between interrupt threads and other threads
> (a mutex uses a turnstile to propagate priority, a spin lock disables
> interrupts); otherwise there is still priority inversion in the kernel, i.e.
> rwlock, sx lock.

Except that this isn't really true.  Really, if a thread is asleep in
select() or poll() or kevent(), what critical resource is it holding?  I had
the same view originally when the current set of priorities was set up.
However, I've had to change it since I now have a real-world use case for
rtprio.

First, I think this is the easy part of the argument:  Can you agree that if
an RT process is in the kernel, it should have priority over a TS process in
the kernel?  Thus, if an RT process blocks in the kernel, it would need to
lend enough of a priority to the lock holder to preempt any TS process in the
kernel, yes?  If so, that argues for RT processes in the kernel having a
higher priority than all the other kernel sleep priorities.
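
To make the arithmetic concrete, here is a minimal sketch in C.  It is not
kernel code: the BAND_* names and both helper functions are made up for
illustration, and the only numbers taken from this thread are the band
boundaries quoted above (lower number means higher priority).

```c
/*
 * Illustrative only: BAND_* names and both helpers are hypothetical,
 * not FreeBSD identifiers.  Lower number == higher priority.
 */
enum {
	BAND_KERN_MIN = 64,	/* kernel sleep priorities: 64 - 127 */
	BAND_KERN_MAX = 127,
	BAND_RT_MIN = 128,	/* real-time user threads: 128 - 159 */
	BAND_RT_MAX = 159
};

/*
 * Priority lending: the lock holder runs at the better (numerically
 * lower) of its own priority and the blocked waiter's priority.
 */
static int
lend_priority(int holder_pri, int waiter_pri)
{
	return (waiter_pri < holder_pri ? waiter_pri : holder_pri);
}

/* Nonzero when a thread at priority 'a' would run ahead of one at 'b'. */
static int
would_preempt(int a, int b)
{
	return (a < b);
}
```

Assume a lock holder at 200 (time-sharing) and a TS thread sitting at some
kernel sleep priority, say 88 (both values are assumptions).  With the
current layout the best an RT waiter can lend is 128, so
would_preempt(lend_priority(200, BAND_RT_MIN), 88) is false.  If the RT
range sat numerically below 64 (48 here is purely hypothetical; nothing
above says where the range would land), would_preempt(lend_priority(200, 48),
88) is true.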

The second part is harder, and that is what happens when an RT process is in
userland.  First, some food for thought.  Do you realize that currently, the
syncer and pagedaemon threads run at PVM?  This is intentional so that these
processes run in the "background" even though they are in the kernel.
Specifically, when sshd does wake up from a sleep at PSOCK or the like, the
kernel doesn't just let it run in the kernel, it effectively lets it keep
that PSOCK priority in userland until the next context switch due to an
interrupt or the quantum expiring.  This means that when you ssh into a box,
your interactive typing ends up preempting syncer and pagedaemon.  And
this is a good thing, because syncer and pagedaemon are _background_
processes.  Preempting them only for the kernel portion of sshd (as the
change to userret in both your proposal and my original #2 would do) would
not really favor interactive processes because the user relies on the
userland portion of an interactive process to run, too (userland is the part
that echoes back the characters as they are typed).  So even now, with TS
threads, we have TS userland code that is _more important_ than code in the
kernel.  Another example is the idlezero kernel process.  This is kernel
code, but is easily far less important than pretty much all userland code.
Kernel code is _not_ always more important than userland code.  It often is,
but it sometimes isn't.  If you can accept that, then it is no longer strange
to consider that even the userland code in an RT process is more important
than kernel code in a TS process.
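
As a toy model of that point (a sketch only, with a made-up struct and
function, not the real scheduler): if the next thread is chosen purely by
its global priority number, whether it happens to be executing kernel or
userland code never enters into the decision.

```c
#include <stddef.h>

/* Hypothetical run-queue entry, for illustration only. */
struct toy_thread {
	const char *name;
	int pri;	/* global priority; lower number runs first */
	int in_kernel;	/* recorded, but deliberately never consulted */
};

/*
 * Return the runnable thread with the best (numerically lowest)
 * priority, or NULL if the array is empty.  Ties go to the earlier
 * entry.
 */
static const struct toy_thread *
pick_next(const struct toy_thread *run, size_t n)
{
	const struct toy_thread *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best == NULL || run[i].pri < best->pri)
			best = &run[i];
	}
	return (best);
}
```

Given a kernel thread in the idle band (say 255, an assumed value for
something like idlezero) and a userland TS thread at 180 (also assumed),
pick_next() returns the userland thread even though the other one is
kernel code.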

In our case we do chew up a lot of CPU in userland for our RT processes, but
we handle this case by using dedicated CPUs.  Our RT processes really are the
most important processes on the box.

-- 
John Baldwin