Realtime thread priorities

John Baldwin jhb at
Wed Dec 15 14:38:47 UTC 2010

On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
> John Baldwin wrote:
> > On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
> >> John Baldwin wrote:
> >>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
> >>>> John Baldwin wrote:
> >>>>> The current layout breaks up the global thread priority space (0 - 255)
> >>>>> into a couple of bands:
> >>>>>
> >>>>>   0 -  63 : interrupt threads
> >>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
> >>>>> 128 - 159 : real-time user threads (rtprio)
> >>>>> 160 - 223 : time-sharing user threads
> >>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
> >>>>>
> >>>>> If we decide to change the behavior I see two possible fixes:
> >>>>>
> >>>>> 1) (easy) just move the real-time priority range above the kernel sleep
> >>>>> priority range
> >>>> Would not this cause a priority inversion when an RT process
> >>>> enters the kernel mode?
> >>> How so?  Note that timesharing threads are not "bumped" to a kernel sleep 
> >>> priority when they enter the kernel either.  The kernel sleep priorities are 
> >>> purely a way for certain sleep channels to cause a thread to be treated as 
> >>> interactive and give it a priority boost to favor interactive threads.  
> >>> Threads in the kernel do not automatically have higher priority than threads 
> >>> not in the kernel.  Keep in mind that all stopped threads (threads not 
> >>> executing) are always in the kernel when they stop.
> >> I have a requirement that a thread running in the kernel have higher
> >> priority than a thread running userland code.  Our kernel mutex is not
> >> sleepable, unlike the one in Solaris, so I have to use semaphore-like
> >> code in kern_umtx.c to lock a chain, which allows me to read and write
> >> the user address space; this is what umtxq_busy() does.  But it does
> >> not prevent a userland thread from preempting a thread that has locked
> >> the chain.  If a realtime thread preempts a thread that has locked the
> >> chain, it may lock up whole processes using pthreads.
> >> I think our realtime scheduling is not very useful; it is too easy
> >> to lock up the system.
> > 
> > Users are not forced to use rtprio.  They choose to do so, and they have to
> > be root to enable it (either directly or by extending root privileges via
> > sudo or some such).  Just because you don't have a use case for it doesn't
> > mean that other people do not.  Right now there is no way possible to say
> > that a given userland process is more important than 'sshd' (or any other
> > daemon) blocked in poll/select/kevent waiting for a packet.  However, there
> > are use cases where other long-running userland processes are in fact far
> > more important than sshd (or similar processes such as getty, etc.).
> > 
> You still have not answered how to avoid this case: a time-sharing thread
> holding a critical kernel resource is preempted by a user RT thread, and
> later the RT thread requires that resource, but the time-sharing thread
> has no chance to run because another RT thread is dominating the CPU with
> CPU-bound work.  The result is deadlock.  Even if you trust your RT
> process, there is a lot of code that was not written by you; i.e., libc
> and any other libraries using threading are completely not ready for RT
> use.
> However, letting a thread in the kernel have higher priority than a thread
> running userland code will fix such a deadlock in the kernel.

Put another way, the time-sharing thread that I don't care about (sshd, or
some other monitoring daemon, etc.) is stealing a resource I care about
(time, in the form of CPU cycles) from my RT process that is critical to
getting my work done.

Beyond that a few more points:

- You are ignoring "tools, not policy".  You don't know what is in my binary
  (and I can't really tell you).  Assume for a minute that I'm not completely
  dumb and can write userland code that is safe to run at this high of a
  priority level.  You already trust me to write code in the kernel that runs
  at even higher priority now. :)
- You repeatedly keep missing (ignoring?) the fact that this is _optional_.
  Users have to intentionally decide to enable this, and there are users who
  do _need_ this functionality.
- You have also missed that this has always been true for idprio processes
  (and is in fact why we restrict idprio to root), so this is not "new".
- Finally, you are also missing that this can already happen _now_ for plain
  old time-sharing processes if the thread holding the resource doesn't ever
  do a sleep that raises the priority.

For example, if a time-sharing thread with some typical priority >=
PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
that file (if it is unlocked) and hold that lock while its priority is >=
PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
up sshd for a new SSH connection, the interrupt thread will preempt the
thread holding the vnode lock, and sshd will be executed instead of the
thread holding the vnode lock when the ithread finishes.  If sshd needs the
vnode lock that the original thread holds, then sshd will block until the
original thread is rescheduled due to the random fates of time and releases
the vnode lock.
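
The scenario above can be sketched with a toy "pick the most important
runnable thread" loop.  This is only an illustration, not scheduler code: the
sim_* names are made up, and the priority values used with it (180 for the
writer, 100 for sshd, 170 for some other time-sharing thread) are arbitrary
picks from the bands quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; none of these names exist in the kernel. */
struct sim_thread {
	const char *name;
	int pri;	/* lower value = more important */
	int runnable;	/* 0 while blocked (asleep or waiting on a lock) */
};

/*
 * Return the most important runnable thread, or NULL if every thread is
 * blocked.  A lock holder is only chosen when nothing more important is
 * runnable, regardless of what it holds.
 */
struct sim_thread *
sim_pick(struct sim_thread *tds, size_t n)
{
	struct sim_thread *best = NULL;

	assert(tds != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!tds[i].runnable)
			continue;
		if (best == NULL || tds[i].pri < best->pri)
			best = &tds[i];
	}
	return (best);
}
```

With the writer runnable at 180 and sshd asleep, sim_pick() returns the
writer.  Once sshd wakes at 100 it is picked instead, and when it blocks on
the vnode lock, any runnable thread at 170 is still picked ahead of the writer
that holds the lock.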

In summary, the kernel sleep priorities do _not_ serve to prevent all
priority inversions; what they do accomplish is giving preferential treatment
to idle, "interactive" threads.
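
To make the numbers concrete, here is a small sketch of the band layout
quoted at the top of this thread.  The boundaries are copied from that table;
the BAND_* names, the struct, and both functions are illustrative only and are
not the kernel's own identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names only; the ranges come from the quoted table. */
enum pri_band {
	BAND_ITHREAD,
	BAND_KERN_SLEEP,
	BAND_REALTIME,
	BAND_TIMESHARE,
	BAND_IDLE
};

struct band_range {
	int min;
	int max;
	enum pri_band band;
	const char *desc;
};

static const struct band_range bands[] = {
	{   0,  63, BAND_ITHREAD,    "interrupt threads" },
	{  64, 127, BAND_KERN_SLEEP, "kernel sleep priorities" },
	{ 128, 159, BAND_REALTIME,   "real-time user threads (rtprio)" },
	{ 160, 223, BAND_TIMESHARE,  "time-sharing user threads" },
	{ 224, 255, BAND_IDLE,       "idle threads" },
};

/* Map a global priority (0 - 255) to the band that contains it. */
const struct band_range *
band_lookup(int pri)
{
	assert(pri >= 0 && pri <= 255);
	for (size_t i = 0; i < sizeof(bands) / sizeof(bands[0]); i++) {
		if (pri >= bands[i].min && pri <= bands[i].max)
			return (&bands[i]);
	}
	return (NULL);	/* not reached for a valid priority */
}

/* Nonzero if priority a is preferred over b (lower value wins). */
int
pri_preferred(int a, int b)
{
	return (a < b);
}
```

With this layout, any thread woken at a kernel sleep priority (64 - 127) is
preferred over every rtprio thread (128 - 159), which is the behavior being
discussed.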

A bit more information on my use case btw:

My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
CPU from the global cpuset and ensure no interrupts are routed to that CPU).
The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
VM object during a page fault), then I want the RT thread to lend its RT
priority to the thread that holds the lock over on another CPU so that the lock
can be released as quickly as possible.  This use case is perfectly safe (the
RT thread is not preempting other threads, instead other threads are partitioned
off into a separate set of available CPUs).  What I need is to ensure that the
syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
away when it holds a lock that I need.
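
The lending I am describing can be modeled roughly as below.  This is a toy
sketch, not the kernel's actual priority propagation code: the toy_* types and
functions are hypothetical, and it assumes the holder owns only one lock at a
time so that its priority can simply be reset on release.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; these are not kernel structures. */
struct toy_thread {
	const char *name;
	int base_pri;	/* priority the thread normally runs at */
	int cur_pri;	/* effective priority, possibly lent by a waiter */
};

struct toy_lock {
	struct toy_thread *owner;	/* NULL when unlocked */
};

/*
 * Try to take the lock.  Returns 1 on success.  On contention, lend the
 * waiter's priority to the owner if the waiter is more important (lower
 * value), then return 0 so the caller can block.
 */
int
toy_lock_acquire(struct toy_lock *lk, struct toy_thread *td)
{
	assert(lk != NULL && td != NULL);
	if (lk->owner == NULL) {
		lk->owner = td;
		return (1);
	}
	if (td->cur_pri < lk->owner->cur_pri)
		lk->owner->cur_pri = td->cur_pri;
	return (0);
}

/* Release the lock and drop any lent priority (single-lock assumption). */
void
toy_lock_release(struct toy_lock *lk, struct toy_thread *td)
{
	assert(lk != NULL && lk->owner == td);
	td->cur_pri = td->base_pri;
	lk->owner = NULL;
}
```

For example, if a holder at 180 owns the lock and an RT thread at 130 fails to
acquire it, the holder runs at 130 until it releases the lock and then drops
back to 180.  Note that a holder already running at a more important priority
than the waiter gains nothing from the loan.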

John Baldwin

