Realtime thread priorities

Thu Dec 16 04:16:53 UTC 2010

John Baldwin wrote:
> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>> John Baldwin wrote:
>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>> John Baldwin wrote:
>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>> John Baldwin wrote:
>>>>>>> The current layout breaks up the global thread priority space (0 - 255)
>>>>>>> into a couple of bands:
>>>>>>>
>>>>>>>   0 -  63 : interrupt threads
>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>
>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>
>>>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
>>>>>>> priority range
>>>>>> Would not this cause a priority inversion when an RT process
>>>>>> enters the kernel mode?
>>>>> How so?  Note that timesharing threads are not "bumped" to a kernel sleep 
>>>>> priority when they enter the kernel either.  The kernel sleep priorities are 
>>>>> purely a way for certain sleep channels to cause a thread to be treated as 
>>>>> interactive and give it a priority boost to favor interactive threads.  
>>>>> Threads in the kernel do not automatically have higher priority than threads 
>>>>> not in the kernel.  Keep in mind that all stopped threads (threads not 
>>>>> executing) are always in the kernel when they stop.
>>>> I have a requirement that a thread running in the kernel has higher
>>>> priority than a thread running userland code. Because our kernel
>>>> mutexes are not sleepable, unlike those in Solaris, I have to use
>>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
>>>> to read and write the user address space; this is what umtxq_busy() does.
>>>> But it does not prevent a userland thread from preempting a thread
>>>> that has locked the chain. If a realtime thread preempts a thread
>>>> that has locked the chain, it may lock up all processes using pthreads.
>>>> I think our realtime scheduling is not very useful; it is too easy
>>>> to lock up the system.
>>> Users are not forced to use rtprio.  They choose to do so, and they have to
>>> be root to enable it (either directly or by extending root privileges via
>>> sudo or some such).  Just because you don't have a use case for it doesn't
>>> mean that other people do not.  Right now there is no way possible to say
>>> that a given userland process is more important than 'sshd' (or any other
>>> daemon) blocked in poll/select/kevent waiting for a packet.  However, there
>>> are use cases where other long-running userland processes are in fact far
>>> more important than sshd (or similar processes such as getty, etc.).
>>>
>> You still haven't answered how to avoid the case where a time-sharing
>> thread holding a critical kernel resource is preempted by a user RT thread,
>> and later the RT thread needs that resource, but the time-sharing
>> thread has no chance to run because another RT thread is dominating
>> the CPU with CPU-bound work. The result is deadlock. Even if
>> you trust your own RT process, there is a lot of code that was [not]
>> written by you, e.g. libc and any other libraries using threading,
>> which is not at all ready for RT use.
>> However, letting a thread in the kernel have higher priority than a thread
>> running userland code will fix such a deadlock in the kernel.
> 
> Put another way, the time-sharing thread that I don't care about (sshd, or
> some other monitoring daemon, etc.) is stealing a resource I care about
> (time, in the form of CPU cycles) from my RT process that is critical to
> getting my work done.
> 
> Beyond that a few more points:
> 
> - You are ignoring "tools, not policy".  You don't know what is in my binary
>   (and I can't really tell you).  Assume for a minute that I'm not completely
>   dumb and can write userland code that is safe to run at this high of a
>   priority level.  You already trust me to write code in the kernel that runs
>   at even higher priority now. :)
> - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
>   Users have to intentionally decide to enable this, and there are users who
>   do _need_ this functionality.
> - You have also missed that this has always been true for idprio processes
>   (and is in fact why we restrict idprio to root), so this is not "new".
> - Finally, you also are missing that this can already happen _now_ for plain
>   old time sharing processes if the thread holding the resource doesn't ever
>   do a sleep that raises the priority.
> 
> For example, if a time-sharing thread with some typical priority >=
> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
> that file (if it is unlocked) and hold that lock while its priority is >=
> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
> up sshd for a new SSH connection, the interrupt thread will preempt the
> thread holding the vnode lock, and sshd will be executed instead of the
> thread holding the vnode lock when the ithread finishes.  If sshd needs the
> vnode lock that the original thread holds, then sshd will block until the
> original thread is rescheduled due to the random fates of time and releases
> the vnode lock.
> 
> In summary, the kernel sleep priorities do _not_ serve to prevent all
> priority inversions; what they do accomplish is giving preferential treatment
> to idle, "interactive" threads.
> 
> A bit more information on my use case btw:
> 
> My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
> CPU from the global cpuset and ensure no interrupts are routed to that CPU).
> The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
> VM object during a page fault), then I want the RT thread to lend its RT
> priority to the thread that holds the lock over on another CPU so that the lock
> can be released as quickly as possible.  This use case is perfectly safe (the
> RT thread is not preempting other threads, instead other threads are partitioned
> off into a separate set of available CPUs).  What I need is to ensure that the
> syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
> away when it holds a lock that I need.
> 
What I meant is that whenever a thread is in kernel mode, it always has
higher priority than a thread running user code. All threads in kernel
mode may have the same priority, except interrupt threads, which
have higher priority. This has to be carefully designed to use
mutexes and spin locks between interrupt threads and other threads:
a mutex uses a turnstile to propagate priority, and a spin lock disables
interrupts. Otherwise there is still priority inversion in the kernel, e.g.
with rwlocks and sx locks.
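
The priority propagation mentioned above (a blocked thread lending its
priority to the lock owner) can be sketched roughly as follows. This is only
a toy model, not the turnstile implementation: the pl_* names are made up
for illustration, there is a single lock, and all lent priority is dropped
on unlock. Lower numbers mean higher priority, as in the band layout quoted
earlier.

```c
#include <stddef.h>

/* Hypothetical toy types; none of these names exist in the kernel. */
struct pl_thread {
    int base_pri;   /* priority the thread normally runs at */
    int cur_pri;    /* current priority, possibly lent by a waiter */
};

struct pl_lock {
    struct pl_thread *owner;    /* NULL when the lock is free */
};

/*
 * Called when `waiter` is about to block on `lk`.  If the waiter has a
 * higher priority (lower number) than the owner, lend it to the owner so
 * the owner can run and release the lock sooner.
 */
static void
pl_block_on(struct pl_lock *lk, const struct pl_thread *waiter)
{
    if (lk->owner != NULL && waiter->cur_pri < lk->owner->cur_pri)
        lk->owner->cur_pri = waiter->cur_pri;
}

/* Release the lock and drop any lent priority (single-lock assumption). */
static void
pl_unlock(struct pl_lock *lk)
{
    if (lk->owner != NULL) {
        lk->owner->cur_pri = lk->owner->base_pri;
        lk->owner = NULL;
    }
}
```

With a time-sharing holder at 200 and an RT waiter at 130, pl_block_on()
raises the holder to 130 until pl_unlock() puts it back at 200.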

I really don't care whether idprio threads are preempted at the user boundary
or not, though I think they really should be. Any thread returning to userland
should check whether there is a higher-priority RT thread in the run queue;
if there is, it should always switch context. For other cases,
e.g. TS (time-sharing) vs. TS in the run queue, keeping the current behavior
may still be a good idea for better performance.

To clarify my idea, here is some sample code.

in trap.c:
    td->td_pflags |= TDP_KERNELMODE;
in sched_switch():
    if (td->td_pflags & TDP_KERNELMODE)
        sched_prio(td, PRI_KERNEL);

PRI_KERNEL will always be higher than any RT, TS, or idle priority,
but lower than the interrupt thread priorities.
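
As a userland illustration of where such a kernel-mode priority would sit,
here is a small sketch using the band numbers from the layout quoted earlier.
The names and the value 64 are assumptions picked for the example; the
proposal only requires the level to be below the interrupt threads and above
all RT, TS, and idle priorities.

```c
#include <stdbool.h>

/* Band edges from the quoted layout (lower number = higher priority). */
enum {
    BAND_ITHD_MAX     = 63,     /*   0 -  63: interrupt threads */
    BAND_REALTIME_MIN = 128,    /* 128 - 159: real-time user threads */
    BAND_IDLE_MAX     = 255,    /* 224 - 255: idle threads */
    /* Assumed value for the proposed kernel-mode level. */
    SKETCH_PRI_KERNEL = 64
};

/* Hypothetical stand-in for the relevant parts of a thread. */
struct km_thread {
    int  user_pri;      /* priority while running user code */
    bool in_kernel;     /* models the proposed TDP_KERNELMODE flag */
};

/*
 * Priority the scheduler would use under the proposal: a thread in kernel
 * mode is raised to SKETCH_PRI_KERNEL, unless it is already above that
 * (e.g. an interrupt thread), in which case it is left alone.
 */
static int
km_effective_pri(const struct km_thread *td)
{
    if (td->in_kernel && td->user_pri > SKETCH_PRI_KERNEL)
        return SKETCH_PRI_KERNEL;
    return td->user_pri;
}
```

A time-sharing thread at 200 that is in the kernel gets 64 here, which beats
an RT thread at 130 running user code; once it leaves the kernel it is back
at 200.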

in userret():
    td->td_pflags &= ~TDP_KERNELMODE;
    restore the thread's priority to its current user priority
    check for rescheduling:
      TS vs TS: may ignore some flags
      TS vs RT in run queue: switch context
      RT vs RT in run queue: compare priorities and switch context
      ...
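
The rescheduling check at the user boundary could look something like the
sketch below. It is a standalone model, not kernel code: the ur_* names are
invented for the example, and "keep current behavior" for the TS vs TS case
is modeled simply as "do not force a switch", which is an assumption.

```c
#include <stdbool.h>

/* Hypothetical scheduling classes for the sketch. */
enum ur_class { UR_CLASS_RT, UR_CLASS_TS, UR_CLASS_IDLE };

struct ur_thread {
    enum ur_class cls;
    int pri;            /* lower number = higher priority */
};

/*
 * Decide whether `cur`, which is about to return to userland, must give up
 * the CPU to `queued`, the best candidate waiting in the run queue.
 */
static bool
ur_should_switch(const struct ur_thread *cur, const struct ur_thread *queued)
{
    if (queued->cls == UR_CLASS_RT) {
        /* RT vs RT: compare priorities. */
        if (cur->cls == UR_CLASS_RT)
            return queued->pri < cur->pri;
        /* TS (or idle) vs RT in the run queue: always switch. */
        return true;
    }
    /* TS vs TS and similar: no forced switch in this model. */
    return false;
}
```

For example, a TS thread at 180 yields to a queued RT thread at 150, while an
RT thread at 130 does not yield to that same RT thread at 150.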

Now the kernel itself is safe for running RT-priority threads, unlike the
current code, which will deadlock.