Realtime thread priorities
David Xu
davidxu at freebsd.org
Fri Dec 17 06:20:46 UTC 2010
Julian Elischer wrote:
> On 12/16/10 6:40 AM, John Baldwin wrote:
>> On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
>>> John Baldwin wrote:
>>>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>>>>> John Baldwin wrote:
>>>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>>>>> John Baldwin wrote:
>>>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>>>>> John Baldwin wrote:
>>>>>>>>>> The current layout breaks up the global thread priority space
>>>>>>>>>> (0 - 255) into a couple of bands:
>>>>>>>>>>
>>>>>>>>>> 0 - 63 : interrupt threads
>>>>>>>>>> 64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>>>>
>>>>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>>>>
>>>>>>>>>> 1) (easy) just move the real-time priority range above the
>>>>>>>>>> kernel sleep
>>>>>>>>>> priority range
>>>>>>>>> Would not this cause a priority inversion when an RT process
>>>>>>>>> enters the kernel mode?
>>>>>>>> How so? Note that timesharing threads are not "bumped" to a
>>>>>>>> kernel sleep
>>>>>>>> priority when they enter the kernel either. The kernel sleep
>>>>>>>> priorities are
>>>>>>>> purely a way for certain sleep channels to cause a thread to be
>>>>>>>> treated as
>>>>>>>> interactive and give it a priority boost to favor interactive
>>>>>>>> threads.
>>>>>>>> Threads in the kernel do not automatically have higher priority
>>>>>>>> than threads
>>>>>>>> not in the kernel. Keep in mind that all stopped threads
>>>>>>>> (threads not
>>>>>>>> executing) are always in the kernel when they stop.
>>>>>>> I have a requirement that a thread running in the kernel have
>>>>>>> higher priority than a thread running userland code. Because our
>>>>>>> kernel mutex is not sleepable (unlike on Solaris), I have to use
>>>>>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
>>>>>>> to read and write user address space; this is what umtxq_busy()
>>>>>>> does. But that does not prevent a userland thread from preempting a
>>>>>>> thread that has locked the chain: if a realtime thread preempts a
>>>>>>> thread holding the chain lock, it may lock up every process using
>>>>>>> pthreads.
>>>>>>> I think our realtime scheduling is not very useful; it is too easy
>>>>>>> to lock up the system.
>>>>>> Users are not forced to use rtprio. They choose to do so, and
>>>>>> they have to
>>>>>> be root to enable it (either directly or by extending root
>>>>>> privileges via
>>>>>> sudo or some such). Just because you don't have a use case for it
>>>>>> doesn't
>>>>>> mean that other people do not. Right now there is no possible way
>>>>>> to say
>>>>>> that a given userland process is more important than 'sshd' (or
>>>>>> any other
>>>>>> daemon) blocked in poll/select/kevent waiting for a packet.
>>>>>> However, there
>>>>>> are use cases where other long-running userland processes are in
>>>>>> fact far
>>>>>> more important than sshd (or similar processes such as getty, etc.).
>>>>>>
>>>>> You still haven't answered how to avoid this: a time-sharing thread
>>>>> holding a critical kernel resource is preempted by a user RT thread,
>>>>> and later the RT thread requires the resource, but the time-sharing
>>>>> thread never gets a chance to run because another RT thread is
>>>>> dominating the CPU with CPU-bound work. The result is deadlock. Even
>>>>> if you trust your own RT process, there is a lot of code that was not
>>>>> written by you, e.g. libc and any other libraries using threading,
>>>>> that is completely not ready for RT use.
>>>>> However, letting a thread in the kernel have higher priority than a
>>>>> thread running userland code would fix such a deadlock in the kernel.
>>>> Put another way, the time-sharing thread that I don't care about
>>>> (sshd, or
>>>> some other monitoring daemon, etc.) is stealing a resource I care about
>>>> (time, in the form of CPU cycles) from my RT process that is
>>>> critical to
>>>> getting my work done.
>>>>
>>>> Beyond that a few more points:
>>>>
>>>> - You are ignoring "tools, not policy". You don't know what is in
>>>> my binary
>>>> (and I can't really tell you). Assume for a minute that I'm not
>>>> completely
>>>> dumb and can write userland code that is safe to run at this high
>>>> of a
>>>> priority level. You already trust me to write code in the kernel
>>>> that runs
>>>> at even higher priority now. :)
>>>> - You repeatedly keep missing (ignoring?) the fact that this is
>>>> _optional_.
>>>> Users have to intentionally decide to enable this, and there are
>>>> users who
>>>> do _need_ this functionality.
>>>> - You have also missed that this has always been true for idprio
>>>> processes
>>>> (and is in fact why we restrict idprio to root), so this is not
>>>> "new".
>>>> - Finally, you also are missing that this can already happen _now_
>>>> for plain
>>>> old time sharing processes if the thread holding the resource
>>>> doesn't ever
>>>> do a sleep that raises the priority.
>>>>
>>>> For example, if a time-sharing thread with some typical priority >=
>>>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode
>>>> lock for that file (if it is unlocked) and hold that lock while its
>>>> priority is >= PRI_MIN_TIMESHARE. If an interrupt arrives for a network packet
>>>> that wakes
>>>> up sshd for a new SSH connection, the interrupt thread will preempt the
>>>> thread holding the vnode lock, and sshd will be executed instead of the
>>>> thread holding the vnode lock when the ithread finishes. If sshd
>>>> needs the
>>>> vnode lock that the original thread holds, then sshd will block
>>>> until the
>>>> original thread is rescheduled due to the random fates of time and
>>>> releases
>>>> the vnode lock.
>>>>
>>>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>>>> priority inversions, what they do accomplish is giving preferential
>>>> treatment
>>>> to idle, "interactive" threads.
>>>>
>>>> A bit more information on my use case btw:
>>>>
>>>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we
>>>> remove the
>>>> CPU from the global cpuset and ensure no interrupts are routed to
>>>> that CPU).
>>>> The problem I have is that if my RT process blocks on a lock (e.g. a
>>>> lock on a
>>>> VM object during a page fault), then I want the RT thread to lend
>>>> its RT
>>>> priority to the thread that holds the lock over on another CPU so
>>>> that the lock
>>>> can be released as quickly as possible. This use case is perfectly
>>>> safe (the
>>>> RT thread is not preempting other threads, instead other threads are
>>>> partitioned
>>>> off into a separate set of available CPUs). What I need is to
>>>> ensure that the
>>>> syncer or pagedaemon or whoever holds the lock I need gets a chance
>>>> to run right
>>>> away when it holds a lock that I need.
>>>>
>>> What I meant is that whenever a thread is in kernel mode, it always
>>> has higher priority than a thread running user code, and all threads
>>> in kernel mode may have the same priority, except interrupt threads,
>>> which have higher priority. But this requires carefully designed use
>>> of mutexes and spinlocks between interrupt threads and other threads:
>>> a mutex uses a turnstile to propagate priority, and a spinlock
>>> disables interrupts. Otherwise there is still priority inversion in
>>> the kernel, e.g. with rwlock and sx locks.
>> Except that this isn't really true. Really, if a thread is asleep in
>> select() or poll() or kevent(), what critical resource is it holding? I had
>> the same view originally when the current set of priorities was set up.
>> However, I've had to change it since I now have a real-world use case for
>> rtprio.
>>
>> First, I think this is the easy part of the argument: Can you agree
>> that if
>> a RT process is in the kernel, it should have priority over a TS
>> process in
>> the kernel? Thus, if a RT process blocks in the kernel, it would need to
>> lend enough of a priority to the lock holder to preempt any TS process
>> in the
>> kernel, yes? If so, that argues for RT processes in the kernel having a
>> higher priority than all the other kernel sleep priorities.
>>
>> The second part is harder, and that is what happens when a RT process
>> is in
>> userland. First, some food for thought. Do you realize that
>> currently, the
>> syncer and pagedaemon threads run at PVM? This is intentional so that
>> these
>> processes run in the "background" even though they are in the kernel.
>> Specifically, when sshd wakes up from a sleep at PSOCK or the like, the
>> kernel doesn't just let it run in the kernel; it effectively lets it keep
>> that PSOCK priority in userland until the next context switch due to an
>> interrupt or the quantum expiring. This means that when you ssh into a box,
>> your interactive typing ends up preempting syncer and pagedaemon. And
>> this is a good thing, because syncer and pagedaemon are _background_
>> processes. Preempting them only for the kernel portion of sshd (as the
>> change to userret in both your proposal and my original #2 would do)
>> would
>> not really favor interactive processes because the user relies on the
>> userland portion of an interactive process to run, too (userland is
>> the part
>> that echos back the characters as they are typed). So even now, with TS
>> threads, we have TS userland code that is _more important_ than code
>> in the
>> kernel. Another example is the idlezero kernel process. This is kernel
>> code, but is easily far less important than pretty much all userland
>> code.
>> Kernel code is _not_ always more important than userland code. It
>> often is,
>> but it sometimes isn't. If you can accept that, then it is no longer
>> strange
>> to consider that even the userland code in a RT process is more important
>> than kernel code in a TS process.
>>
>> In our case we do chew up a lot of CPU in userland for our RT
>> processes, but
>> we handle this case by using dedicated CPUs. Our RT processes really
>> are the
>> most important processes on the box.
>>
>
> I have to agree with John on this one..
> The real-time property for threads is a dangerous tool which we allow a
> system "Administrator" (i.e. someone with root) to use.
> It is perfectly understood that doing the WRONG thing will negatively
> impact the system (maybe even make it unworkable). However, the decision
> to set a process to realtime mode means that the Administrator has
> decided that that process/thread is more important than everything else
> in the system. One could argue about whether this applies to interrupts,
> but in the modern day of even cell phones having multiple processors, it
> gets harder and harder to make the case that userland code should not be
> able to pre-empt or block kernel code.
>
> I think this philosophy has always been true.. As Terry Lambert used to
> say at the beginning of the project: Unix's job is to deliver the bullet
> to wherever the user wants to put it, including the user's foot. When you
> are the administrator you get to have a pretty big foot.
>
> In addition, many of FreeBSD's 'Users' are in fact producers of 'product'
> boxes. They know EXACTLY what is running on the system, and where, and
> want the ability to label a process in the way that John shows. For them
> the primary purpose of the box is to do task X, and doing task X comes
> before all other tasks, possibly even unrelated interrupts.
>
> Julian
>
The main problem is correctness, not whether root can use it or not;
I know it is his machine, and he can do whatever he wants with it. :-)
I have to repeat:
The question is: can the kernel correctly schedule RT threads? No.
The fact is that many lock implementations are not RT-safe: lockmgr, sx
locks, rwlocks, and the other locks based on msleep/wakeup, which neither
propagate priority nor otherwise protect it, all suffer priority
inversion.
Also, PPQ = 4 is incorrect for RT scheduling; it is another kind of
priority inversion.
So what can we do here? Where mutexes and spin locks cannot be used, the
code should either raise the thread's priority to a high enough level,
or all threads should have equal priority in the kernel.
If future changes cannot fix the above problems, those changes are
nonsense.