Getting rid of the static msleep priority boost

Jeff Roberson jroberson at chesapeake.net
Fri Mar 7 22:41:54 UTC 2008


On Fri, 7 Mar 2008, John Baldwin wrote:

> On Friday 07 March 2008 08:42:37 am John Baldwin wrote:
>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote:
>>> Hello,
>>>
>>> I've been studying some problems with recent scheduler improvements that
>>> help a lot on some workloads and hurt on others.  I've tracked the
>>> problem down to static priority boosts handed out by
>>> msleep/cv_broadcastpri.  The basic problem is that a user thread will be
>>> woken with a kernel priority thus allowing it to preempt a thread running
>>> on any processor with a lesser priority.  The lesser priority thread may
>>> in fact hold some resource that the higher priority thread requires.
>>> Thus we context switch several times and perhaps go through priority
>>> propagation as well.
>>>
>>> I have verified that disabling these static priority boosts entirely
>>> fixes the performance problem I've run into on at least one workload.
>>> There are probably others that it helps and hopefully we can discover
>>> that.
>>>
>>> I'd like to know if anyone has a strong preference to keep this feature.
>>> It is likely that it helps in some interactive situations.  I'm not sure
>>> how much however.  I propose that we make a sysctl that disables it and
>>> turn it off by default.  If we see complaints on current@ we can suggest
>>> that they toggle the sysctl to see if it alleviates problems.
>>>
>>> Based on feedback from that experiment and some testing we can then
>>> choose a few options:
>>>
>>> 1)  Disable the static boosts entirely.  Leave kernel priorities for
>>> kernel threads and priority propagation.  Most other kernels do this.
>>> Would make my life in ULE much easier as well.
>>>
>>> 2)  Leave the support for static boosts but remove it from all but a few
>>> key locations.  Leaving it in the api would give some flexibility but
>>> might confuse developers.
>>>
>>> 3)  Leave things as they are.  undesirable.
>>>
>>> I'm leaning towards #2 based on the information I have presently.  This
>>> is almost a significant change to historic BSD behavior so we might want
>>> to tread lightly.
>>
>> One thing to note is that we actually depend on the priority boost (evilly)
>> to pick processes to swap out.  (I think we check for <= PSOCK and don't
>> swap those out).  One thing that I've wanted to happen for a while is that
>> the sleep priority for msleep() just be a parameter available to the
>> scheduler that the scheduler can use to calculate the real internal
>> priority rather than just being a set.  That is, I imagine having:
>>
>> void	sched_set_sleep_prio(struct thread *td, u_char pri);
>> u_char	sched_get_sleep_prio(struct thread *td);
>>
>> (The swap check would use the get call).  The 4BSD scheduler's
>> implementation of sched_set_sleep_prio would look like this:
>>
>> void
>> sched_set_sleep_prio(struct thread *td, u_char pri)
>> {
>>
>> 	td->td_sched->sleep_pri = pri;
>> 	sched_prio(td, pri);
>> }
>>
>> void
>> sched_userret(..)
>> {
>>
>> 	...
>> 	td->td_sched->sleep_pri = 0;	/* not in the kernel anymore */
>> }
>>
>> but other schedulers may just save it and recalculate the priority where
>> the priority calculation just considers the sleep priority as one among
>> many factors.  If nothing else, this allows it to be a scheduler decision
>> to ignore it (so 4BSD could continue to do what it does now, but ULE may
>> ignore it, or ignore certain levels, etc.)
>
> One thing to clarify: I'm not opposed to replacing the PSOCK check with
> something more suitable in the swap code, (in fact, that would be desirable),
> but it might take a good bit of work to do that and is probably easier to
> work on that as a separate change.  I also think there can be some merit in
> having code paths hint to the scheduler the relative interactivity/priority
> of a sleep.

Couple of notes..

The priority argument to sleep is a reasonable way for the code to hint at 
the relative priority/interactivity.  So that argues for leaving these 
arguments in place and making them more advisory.  I don't think we have 
to change the api to take advantage of that.

I'll look more closely for places like the swap that care about the 
absolute priority of a process and see what I can come up with.  Thanks 
for raising that concern.

I'd like to avoid apis that require the sched lock in seperate steps like 
msleep does now to elevate the priority.  So far all sched* apis require 
the thread lock on enter and I'd hate to deviate from that norm.  But 
another option may be just to make a globally visible td_sleep_pri that 
doesn't require the lock for write but does for read.  The other option is 
to bubble the argument down through the sleepq code and into sched_sleep() 
and sched_wakeup().  I like that the best but it's the most api churn.

Thanks,
Jeff

>
> -- 
> John Baldwin
>


More information about the freebsd-arch mailing list