Getting rid of the static msleep priority boost

Sat Mar 8 09:46:01 UTC 2008

On Fri, 7 Mar 2008, Jeff Roberson wrote:

> On Fri, 7 Mar 2008, John Baldwin wrote:
>
>> On Friday 07 March 2008 08:42:37 am John Baldwin wrote:
>>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote:
>>>> Hello,
>>>> 
>>>> I've been studying some problems with recent scheduler improvements that
>>>> help a lot on some workloads and hurt on others.  I've tracked the
>>>> problem down to static priority boosts handed out by
>>>> msleep/cv_broadcastpri.  The basic problem is that a user thread will be
>>>> woken with a kernel priority thus allowing it to preempt a thread running
>>>> on any processor with a lesser priority.  The lesser priority thread may
>>>> in fact hold some resource that the higher priority thread requires.
>>>> Thus we context switch several times and perhaps go through priority
>>>> propagation as well.
>>>> 
>>>> I have verified that disabling these static priority boosts entirely
>>>> fixes the performance problem I've run into on at least one workload.
>>>> There are probably others that it helps and hopefully we can discover
>>>> that.
>>>> 
>>>> I'd like to know if anyone has a strong preference to keep this feature.
>>>> It is likely that it helps in some interactive situations.  I'm not sure
>>>> how much however.  I propose that we make a sysctl that disables it and
>>>> turn it off by default.  If we see complaints on current@ we can suggest
>>>> that they toggle the sysctl to see if it alleviates problems.
>>>> 
>>>> Based on feedback from that experiment and some testing we can then
>>>> choose a few options:
>>>> 
>>>> 1)  Disable the static boosts entirely.  Leave kernel priorities for
>>>> kernel threads and priority propagation.  Most other kernels do this.
>>>> Would make my life in ULE much easier as well.
>>>> 
>>>> 2)  Leave the support for static boosts but remove it from all but a few
>>>> key locations.  Leaving it in the api would give some flexibility but
>>>> might confuse developers.
>>>> 
>>>> 3)  Leave things as they are.  undesirable.
>>>> 
>>>> I'm leaning towards #2 based on the information I have presently.  This
>>>> is almost a significant change to historic BSD behavior so we might want
>>>> to tread lightly.
>>> 
>>> One thing to note is that we actually depend on the priority boost 
>>> (evilly)
>>> to pick processes to swap out.  (I think we check for <= PSOCK and don't
>>> swap those out).  One thing that I've wanted to happen for a while is that
>>> the sleep priority for msleep() just be a parameter available to the
>>> scheduler that the scheduler can use to calculate the real internal
>>> priority rather than just being a set.  That is, I imagine having:
>>> 
>>> void	sched_set_sleep_prio(struct thread *td, u_char pri);
>>> u_char	sched_get_sleep_prio(struct thread *td);
>>> 
>>> (The swap check would use the get call).  The 4BSD scheduler's
>>> implementation of sched_set_sleep_prio would look like this:
>>> 
>>> void
>>> sched_set_sleep_prio(struct thread *td, u_char pri)
>>> {
>>>
>>> 	td->td_sched->sleep_pri = pri;
>>> 	sched_prio(td, pri);
>>> }
>>> 
>>> void
>>> sched_userret(..)
>>> {
>>>
>>> 	...
>>> 	td->td_sched->sleep_pri = 0;	/* not in the kernel anymore */
>>> }
>>> 
>>> but other schedulers may just save it and recalculate the priority where
>>> the priority calculation just considers the sleep priority as one among
>>> many factors.  If nothing else, this allows it to be a scheduler decision
>>> to ignore it (so 4BSD could continue to do what it does now, but ULE may
>>> ignore it, or ignore certain levels, etc.)
>> 
>> One thing to clarify: I'm not opposed to replacing the PSOCK check with
>> something more suitable in the swap code, (in fact, that would be 
>> desirable),
>> but it might take a good bit of work to do that and is probably easier to
>> work on that as a separate change.  I also think there can be some merit in
>> having code paths hint to the scheduler the relative interactivity/priority
>> of a sleep.
>
> Couple of notes..
>
> The priority argument to sleep is a reasonable way for the code to hint at 
> the relative priority/interactivity.  So that argues for leaving these 
> arguments in place and making them more advisory.  I don't think we have to 
> change the api to take advantage of that.
>
> I'll look more closely for places like the swap that care about the absolute 
> priority of a process and see what I can come up with.  Thanks for raising 
> that concern.
>
> I'd like to avoid apis that require the sched lock in seperate steps like 
> msleep does now to elevate the priority.  So far all sched* apis require the 
> thread lock on enter and I'd hate to deviate from that norm.  But another 
> option may be just to make a globally visible td_sleep_pri that doesn't 
> require the lock for write but does for read.  The other option is to bubble 
> the argument down through the sleepq code and into sched_sleep() and 
> sched_wakeup().  I like that the best but it's the most api churn.

http://people.freebsd.org/~jeff/sleeppri.diff

What do you think of this?  I added another parameter to sleepq_add() and 
sched_sleep().  So the scheduler is responsible for adjusting the 
priority.  We could do the same thing for wakeup time adjustments like 
sleepq_broadcastpri() but we'd have to pass it through setrunnable() as 
well.

I'd like to normalize the other pri arguments in sleepq to use the same 0 
is not set vs -1 that msleep did.  I realize that 0 is a valid priority 
but for practical purposes this makes things consistent and does not 
really restrict the api.

>
> Thanks,
> Jeff
>
>> 
>> -- 
>> John Baldwin
>> 
> _______________________________________________
> freebsd-arch at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe at freebsd.org"
>