kern/129164: Wrong priority value for normal processes

Bruce Evans brde at
Fri Nov 28 08:01:49 PST 2008

On Fri, 28 Nov 2008, Unga wrote:

> --- On Fri, 11/28/08, Bruce Evans <brde at> wrote:
>> The user confusion is that the garbage returned by rtprio(RTP_LOOKUP,
>> ...) for PRI_TIMESHARE processes is interpreted as a realtime priority.
>> The garbage was originally 0, but is now supposed to be (for no good
>> reason) the current base user priority.  In either case, it has very
>> little to do with the realtime priority, so I call it garbage.  Its
>> upper limit has always been out of bounds for a realtime priority, and
>> the above code makes it go negative and thus its lower limit is out
>> of bounds for a realtime priority too.  Since realtime priorities are
>> unsigned, going below the lower limit just gives more obvious garbage
>> by misrepresenting a negative value in a u_short.

> The rtprio(2) is implemented in /usr/src/sys/kern/kern_resource.c as rtprio_thread().

Actually, it is implemented as rtprio().  rtprio_thread() is a newer
undocumented syscall.  I didn't know that it existed.  Its existence
enlarges the bug that rtprio(2) was not deprecated in 2001 when the
kernel was changed to not really use rtprio's data structures.  (The
kernel was changed to use a linear priority space instead of separate
priority spaces and runqueues for normal/realtime/idletime.  This would
be simpler for the primary user API too.)

> By looking at the rtprio(2) implementation, it is clear the author of the rtprio(2) intended to set Realtime, Normal and Idletime priorities and read the original priority value (ie. what value was set) of Realtime, Normal and Idletime processes. The rtprio(2) is not intended to be limited only to the Realtime and Idletime classes.

No, it is clear that the author of the rtprio() implementation never
intended to support the Normal class (or the Interrupt class, which
didn't exist when the author wrote rtprio()).  It has no support for
setting the Normal class (even to switch back to Normal after switching
to Real or Idle), and only accidental and broken support for reading
Normal priorities.  As I wrote before, the original author just returned
the Normal process's rtprio struct which was always the default value
{ type = RTP_PRIO_NORMAL, prio = 0 }.  0 means "not applicable" here.
As I wrote before, this was accidentally changed, first to return the
transient current priority, then to return the less transient user
priority, then to return the even less transient base user priority
(in all cases, the priorities are from the linearized kernel priority
space but mangled by subtracting an undocumented bias).
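
To make the mangling concrete, here is a minimal sketch of what happens when a
biased priority is stored in an unsigned 16-bit field such as the u_short prio
member.  The function name mangle_prio() and the sample values 128 and 160 are
assumptions chosen only for illustration; they are not taken from the kernel
source.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's translation: subtract a class
 * bias from a linear kernel priority and store the result in an unsigned
 * 16-bit field.
 */
static uint16_t
mangle_prio(int base_user_pri, int class_bias)
{
	int diff = base_user_pri - class_bias;	/* may be negative */

	return ((uint16_t)diff);		/* wraps modulo 65536 */
}
```

With these illustrative inputs, mangle_prio(128, 160) computes -32, which a
caller reading the unsigned field sees as 65504, far outside any plausible
realtime priority range.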

> The rtprio(2) sets the priority class (Realtime, Normal or Idletime) in td_pri_class and sets the priority value in td_base_user_pri of the "thread" structure defined in /usr/include/sys/proc.h. When rtprio(2) reads priority, it reads both the class and the value from td_pri_class and td_base_user_pri, respectively.

It has to set the priority class to one of the above or Intr so that clients
can know if the priority value is a valid realtime priority.  It is only
valid for classes Realtime and Idletime.  Otherwise, it contains a garbage
undocumented value that happens to be the process's base priority mangled
by subtracting an undocumented bias.  This priority has very little to do
with the realtime priority!!
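
A client can protect itself by checking the class before trusting the value.
The sketch below assumes a simplified (type, prio) pair; the enum, the struct
and the helper valid_rt_prio() are hypothetical names for illustration and are
not the definitions from <sys/rtprio.h>.

```c
#include <stdbool.h>

/* Hypothetical mirror of the (type, prio) pair that a lookup returns. */
enum prio_class { CLASS_REALTIME, CLASS_NORMAL, CLASS_IDLE, CLASS_INTR };

struct prio_result {
	enum prio_class	type;
	unsigned short	prio;
};

/*
 * Return true and store the priority only when the class makes the
 * value meaningful (Realtime or Idletime).  For any other class the
 * value is ignored and *out is left untouched.
 */
static bool
valid_rt_prio(const struct prio_result *r, unsigned short *out)
{
	if (r->type != CLASS_REALTIME && r->type != CLASS_IDLE)
		return (false);
	*out = r->prio;
	return (true);
}
```

Real code would apply the same test to the type member of the structure
filled in by rtprio(RTP_LOOKUP, ...) before looking at prio.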

> That is, rtprio(2) expects that the original class and value do not change during scheduling. This expectation is now broken.

The original author didn't write this part.  He just had a kernel rtprio
struct which contained the type (class) and prio.  The kernel and the
rtprio(2) API used this fairly directly.  When the kernel priority space
was linearized, this struct went away, but rtprio(2) remained and a
translation was required to complicate the priority space for use by
rtprio(2).  The translation uses this expectation as a hack.  It was
broken for all priority classes from 2001 to 2006 (since the values changed
for all classes).  This was eventually fixed by davidxu for Realtime and
Idletime classes.  For the Intr class, the same fix makes the value
non-changing but still garbage and still different from the original
garbage (0).  For the Normal class, the value is still volatile.

> David Xu pointed out one way how the td_base_user_pri get changed by the sched_ule.

I pointed out other ways.  The usual way is that td_base_user_pri
tracks td_user_pri after some delay.  The latter changes on almost
every statclock tick if the thread is running, so the former is far
from constant.  Userland might want to know a transient user priority,
but rtprio(2) is a wrong interface for this.  To begin with, there are
several user priorities, but rtprio(2) can only return 1 and requires
a type pun to do this.

Another broken API here is struct priority and its use in struct
kinfo_proc.  The kernel originally (after priority linearization)
used struct priority directly.  Now all of the kernel priority fields
have been renamed and/or moved and half have subtly different semantics.
struct priority is missing a place to return td_base_user_pri.
struct kinfo_proc should use scalar fields whenever possible, including
for priorities like it used to, since structs in it are harder to

> In my understanding, the "thread" structure should carry the original priority class and value without change for any system call to be referenced at any time.

That is what the original implementation did.  For the Normal class, the
original priority class and value were that of the first process (inherited
on fork()) since rtprio(2) never touched them.

> The original priority value and the running priority value are two different things. The original priority value should be static, meaning it normally should not change, while the running priority value can vary from 255 (most idle) to 0 (highest priority).

If it is static (like it originally was) then it is useless (like it still is).
You don't want it to be set at fork() time and constant after that since
the running priority value at fork() time is inherited (in adjusted form)
from the parent so it is only relevant to the process's scheduling for a
short time.

> What is most important is to know the original priority class and value. This is useful when a user wants to organize various processes into different priority classes: some processes in the Realtime class, some at Normal priority, and some in the Idletime category. I'm one such user. One example is running JACK in Realtime, Firefox in Normal and Bittorrent in Idletime. Once processes are assigned to priority classes, one needs to check that they are in the intended categories. That's why one needs to inspect the original priority class and value. By the time you check, depending on the load, the Firefox browser may even be running in Realtime for it to get executed. The next moment Firefox comes back to Normal.

If it is Normal, then it doesn't have a meaningful realtime priority.
It has an ordinary priority, but that is fundamentally volatile for
not-nearly-idle processes^Wthreads (*) and not very interesting for
almost-idle threads -- for almost-idle threads the user priority is
almost always PUSER + 20 (20 = nice_bias) + nice_value for SCHED_4BSD,
and apparently almost always -32 for SCHED_ULE.  These details seem
to be undocumented.  So the only useful thing that could be encoded
in the priority is the nice value, but getpriority() is the right
interface for that and such encodings lead to enormous complications
-- see top/machine.c.
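
For reference, reading the nice value through getpriority() looks roughly like
the sketch below.  The wrapper name current_nice() is an assumption for
illustration; the only real interface used is getpriority(2) itself.

```c
#include <errno.h>
#include <sys/resource.h>

/*
 * Fetch the nice value of the calling process.  getpriority() can
 * legitimately return -1, so errno must be cleared first and checked
 * afterwards.  Returns 0 on success and -1 on failure.
 */
static int
current_nice(int *nice_out)
{
	int n;

	errno = 0;
	n = getpriority(PRIO_PROCESS, 0);
	if (n == -1 && errno != 0)
		return (-1);
	*nice_out = n;
	return (0);
}
```

Unlike the value returned for a Normal process by rtprio(2), this one is
documented and does not change from one statclock tick to the next.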

(*) Another API problem here is that rtprio(2) only deals with processes,
except the usual case of pid = 0 gives curthread.  This works for
Realtime and Idletime processes since all the threads must have the
same type and prio, at least if they were set by rtprio(2) and not by
rtprio_thread(2).  But for Normal processes, kernel priorities are
per-thread and there is no way to return more than 1.  The actual
return value is from the highest priority thread.  Since the man
page doesn't mention threads at all, this is undocumented.
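
The loss of information can be sketched as follows, assuming (as in the
kernel's linear priority space) that a numerically lower value means a higher
priority.  The function highest_thread_pri() is a hypothetical illustration,
not the kernel's code.

```c
#include <stddef.h>

/*
 * Collapse per-thread priorities to a single value by picking the
 * highest priority, i.e. the numerically lowest one.  Returns -1 for
 * an empty set.
 */
static int
highest_thread_pri(const int *pri, size_t n)
{
	size_t i;
	int best;

	if (n == 0)
		return (-1);
	best = pri[0];
	for (i = 1; i < n; i++) {
		if (pri[i] < best)
			best = pri[i];
	}
	return (best);
}
```

A process whose threads have priorities 180, 165 and 200 would report only
165 under this scheme; the other two values are invisible to the caller.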

> So the implementors of sched_ule should clarify, if td_base_user_pri is now dynamic, which field of the "thread" structure now carries the original priority value. Or, if they made a mistake by overlooking the expectation of rtprio(2), it's best if the implementors of sched_ule could fix it. Of course, anybody else who understands sched_ule could look into it.

SCHED_ULE just gives a different undocumented garbage realtime priority for
Normal threads.

