Realtime thread priorities

David Xu davidxu at freebsd.org
Fri Dec 17 01:59:06 UTC 2010


John Baldwin wrote:
> On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
>> John Baldwin wrote:
>>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>>>> John Baldwin wrote:
>>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>>>> John Baldwin wrote:
>>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>>>> John Baldwin wrote:
>>>>>>>>> The current layout breaks up the global thread priority space (0 - 255) into a
>>>>>>>>> couple of bands:
>>>>>>>>>
>>>>>>>>>   0 -  63 : interrupt threads
>>>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>>>
>>>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>>>
>>>>>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
>>>>>>>>> priority range
>>>>>>>> Would not this cause a priority inversion when an RT process
>>>>>>>> enters the kernel mode?
>>>>>>> How so?  Note that timesharing threads are not "bumped" to a kernel sleep 
>>>>>>> priority when they enter the kernel either.  The kernel sleep priorities are 
>>>>>>> purely a way for certain sleep channels to cause a thread to be treated as 
>>>>>>> interactive and give it a priority boost to favor interactive threads.  
>>>>>>> Threads in the kernel do not automatically have higher priority than threads 
>>>>>>> not in the kernel.  Keep in mind that all stopped threads (threads not 
>>>>>>> executing) are always in the kernel when they stop.
>>>>>> I have a requirement to make a thread running in the kernel have
>>>>>> higher priority than a thread running userland code, because our kernel
>>>>>> mutex is not sleepable, unlike what Solaris did. I have to use
>>>>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
>>>>>> to read and write user address space; this is what umtxq_busy() does.
>>>>>> But that does not prevent a userland thread from preempting the thread
>>>>>> that locked the chain; if a realtime thread preempts a thread that
>>>>>> locked the chain, it may lock up every process using pthread.
>>>>>> I think our realtime scheduling is not very useful; it is too easy
>>>>>> to lock up the system.
>>>>> Users are not forced to use rtprio.  They choose to do so, and they have to
>>>>> be root to enable it (either directly or by extending root privileges via
>>>>> sudo or some such).  Just because you don't have a use case for it doesn't
>>>>> mean that other people do not.  Right now there is no way possible to say
>>>>> that a given userland process is more important than 'sshd' (or any other
>>>>> daemon) blocked in poll/select/kevent waiting for a packet.  However, there
>>>>> are use cases where other long-running userland processes are in fact far
>>>>> more important than sshd (or similar processes such as getty, etc.).
>>>>>
>>>> You still haven't answered me about how to avoid the case where a
>>>> time-sharing thread holding a critical kernel resource is preempted by a
>>>> user RT thread, and later an RT thread requires the resource, but the
>>>> time-sharing thread gets no chance to run because another RT thread is
>>>> dominating the CPU with CPU-bound work; the result is deadlock. Even if
>>>> you trust your own RT process, there is a lot of code that was not
>>>> written by you, e.g. libc and any other libraries using threading,
>>>> which is completely not ready for RT use.
>>>> However, letting a thread in the kernel have higher priority than a
>>>> thread running userland code would fix such a deadlock in the kernel.
>>> Put another way, the time-sharing thread that I don't care about (sshd, or
>>> some other monitoring daemon, etc.) is stealing a resource I care about
>>> (time, in the form of CPU cycles) from my RT process that is critical to
>>> getting my work done.
>>>
>>> Beyond that a few more points:
>>>
>>> - You are ignoring "tools, not policy".  You don't know what is in my binary
>>>   (and I can't really tell you).  Assume for a minute that I'm not completely
>>>   dumb and can write userland code that is safe to run at this high of a
>>>   priority level.  You already trust me to write code in the kernel that runs
>>>   at even higher priority now. :)
>>> - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
>>>   Users have to intentionally decide to enable this, and there are users who
>>>   do _need_ this functionality.
>>> - You have also missed that this has always been true for idprio processes
>>>   (and is in fact why we restrict idprio to root), so this is not "new".
>>> - Finally, you also are missing that this can already happen _now_ for plain
>>>   old time sharing processes if the thread holding the resource doesn't ever
>>>   do a sleep that raises the priority.
>>>
>>> For example, if a time-sharing thread with some typical priority >=
>>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
>>> that file (if it is unlocked) and hold that lock while its priority is >=
>>> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
>>> up sshd for a new SSH connection, the interrupt thread will preempt the
>>> thread holding the vnode lock, and sshd will be executed instead of the
>>> thread holding the vnode lock when the ithread finishes.  If sshd needs the
>>> vnode lock that the original thread holds, then sshd will block until the
>>> original thread is rescheduled due to the random fates of time and releases
>>> the vnode lock.
>>>
>>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>>> priority inversions; what they do accomplish is giving preferential treatment
>>> to idle, "interactive" threads.
>>>
>>> A bit more information on my use case btw:
>>>
>>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
>>> CPU from the global cpuset and ensure no interrupts are routed to that CPU).
>>> The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
>>> VM object during a page fault), then I want the RT thread to lend its RT
>>> priority to the thread that holds the lock over on another CPU so that the lock
>>> can be released as quickly as possible.  This use case is perfectly safe (the
>>> RT thread is not preempting other threads, instead other threads are partitioned
>>> off into a separate set of available CPUs).  What I need is to ensure that the
>>> syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
>>> away when it holds a lock that I need.
>>>
>> What I meant is that whenever a thread is in kernel mode, it always has
>> higher priority than a thread running user code, and all threads in kernel
>> mode may have the same priority, except interrupt threads, which
>> have higher priority. But this has to be carefully designed, using
>> mutexes and spinlocks between interrupt threads and other threads
>> (a mutex uses turnstiles to propagate priority; a spinlock disables
>> interrupts); otherwise there is still priority inversion in the kernel,
>> e.g. with rwlock and sx locks.
> 
> Except that this isn't really true.  Really, if a thread is asleep in
> select() or poll() or kevent(), what critical resource is it holding?  I had
> the same view originally when the current set of priorities was set up.
> However, I've had to change it since I now have a real-world use case for
> rtprio.
> 
> First, I think this is the easy part of the argument:  Can you agree that if
> a RT process is in the kernel, it should have priority over a TS process in
> the kernel?  Thus, if a RT process blocks in the kernel, it would need to
> lend enough of a priority to the lock holder to preempt any TS process in the
> kernel, yes?  If so, that argues for RT processes in the kernel having a
> higher priority than all the other kernel sleep priorities.
> 

Yes, RT processes should preempt any TS process, but how can you lend
priority through lockmgr, sx locks, and all the locking built on msleep()
and wakeup()? That is why I am trying to fix this: those primitives have
priority inversion. To fix the problem, a semantic like the POSIX
priority-protect mutex is needed: when a lock is acquired, the thread
raises its priority to a ceiling high enough to protect against priority
inversion, and when a thread tries to lock a lock with a lower priority
ceiling, it should abort. Does that mean a lock order reversal? The kernel
may panic for correctness.
The consequences of priority inversion depend on the application; they may
be dangerous or trivial, but either way the behavior is not correct.

> The second part is harder, and that is what happens when a RT process is in
> userland.  First, some food for thought.  Do you realize that currently, the
> syncer and pagedaemon threads run at PVM?  This is intentional so that these
> processes run in the "background" even though they are in the kernel.
> Specifically, when sshd does wakeup from a sleep at PSOCK or the like, the
> kernel doesn't just let it run in the kernel, it effectively lets it keep
> that PSOCK priority in userland until the next context switch due to an
> interrupt or the quantum expiring.  This means that when you ssh into a box,
> your interactive typing ends up preempting syncer and pagedaemon.  And
> this is a good thing, because syncer and pagedaemon are _background_
> processes.  Preempting them only for the kernel portion of sshd (as the
> change to userret in both your proposal and my original #2 would do) would
> not really favor interactive processes because the user relies on the
> userland portion of an interactive process to run, too (userland is the part
> that echos back the characters as they are typed).  So even now, with TS
> threads, we have TS userland code that is _more important_ than code in the
> kernel.  Another example is the idlezero kernel process.  This is kernel
> code, but is easily far less important than pretty much all userland code.
> Kernel code is _not_ always more important than userland code.  It often is,
> but it sometimes isn't.  If you can accept that, then it is no longer strange
> to consider that even the userland code in a RT process is more important
> than kernel code in a TS process.
> 

I don't think the intention was for a TS thread to keep its high priority
over an important kernel thread. I guess the original idea was to
eliminate extra context switches among TS threads: the TS priority
algorithm may have some errors, and carrying the sleep priority into
userland keeps those extra context switches away. For example, the current
code still sets PPQ to 4 rather than 1, which makes priorities even
fuzzier. Consider a TS priority algorithm like the one in the current
kernel: assume two CPU-bound threads A and B both start at the same
priority, 160. After a small granularity of N clock ticks, thread A drops
its priority to 161 and finds that thread B has the higher priority, 160,
so B preempts A and a context switch happens. 2 * N ticks later, thread B
has dropped its priority to 162 and finds that thread A now has the higher
priority, 161, so it switches context and lets thread A run. N is far
smaller than the scheduler's quantum. This can be called an error, because
threads are not being scheduled based on the quantum.

However, the existing algorithm is incorrect for RT scheduling. RT
scheduling is strictly based on static priority, and the result of the
existing behavior (PPQ = 4, and no preemption at the user boundary) is
priority inversion. Because RT scheduling uses static priorities, the
priority inversion lasts forever; it is unlike the TS algorithm, which
eventually lowers a CPU pig to a low priority, so that the inversion is
at least temporarily killed.

> In our case we do chew up a lot of CPU in userland for our RT processes, but
> we handle this case by using dedicated CPUs.  Our RT processes really are the
> most important processes on the box.
> 


More information about the freebsd-arch mailing list