Getting rid of the static msleep priority boost

John Baldwin jhb at freebsd.org
Mon Mar 10 17:50:47 UTC 2008


On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote:
> On Fri, 7 Mar 2008, Jeff Roberson wrote:
> 
> > On Fri, 7 Mar 2008, John Baldwin wrote:
> >
> >> On Friday 07 March 2008 08:42:37 am John Baldwin wrote:
> >>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote:
> >>>> Hello,
> >>>> 
> >>>> I've been studying some problems with recent scheduler improvements 
that
> >>>> help a lot on some workloads and hurt on others.  I've tracked the
> >>>> problem down to static priority boosts handed out by
> >>>> msleep/cv_broadcastpri.  The basic problem is that a user thread will 
be
> >>>> woken with a kernel priority thus allowing it to preempt a thread 
running
> >>>> on any processor with a lesser priority.  The lesser priority thread 
may
> >>>> in fact hold some resource that the higher priority thread requires.
> >>>> Thus we context switch several times and perhaps go through priority
> >>>> propagation as well.
> >>>> 
> >>>> I have verified that disabling these static priority boosts entirely
> >>>> fixes the performance problem I've run into on at least one workload.
> >>>> There are probably others that it helps and hopefully we can discover
> >>>> that.
> >>>> 
> >>>> I'd like to know if anyone has a strong preference to keep this 
feature.
> >>>> It is likely that it helps in some interactive situations.  I'm not 
sure
> >>>> how much however.  I propose that we make a sysctl that disables it and
> >>>> turn it off by default.  If we see complaints on current@ we can 
suggest
> >>>> that they toggle the sysctl to see if it alleviates problems.
> >>>> 
> >>>> Based on feedback from that experiment and some testing we can then
> >>>> choose a few options:
> >>>> 
> >>>> 1)  Disable the static boosts entirely.  Leave kernel priorities for
> >>>> kernel threads and priority propagation.  Most other kernels do this.
> >>>> Would make my life in ULE much easier as well.
> >>>> 
> >>>> 2)  Leave the support for static boosts but remove it from all but a 
few
> >>>> key locations.  Leaving it in the api would give some flexibility but
> >>>> might confuse developers.
> >>>> 
> >>>> 3)  Leave things as they are.  undesirable.
> >>>> 
> >>>> I'm leaning towards #2 based on the information I have presently.  This
> >>>> is almost a significant change to historic BSD behavior so we might 
want
> >>>> to tread lightly.
> >>> 
> >>> One thing to note is that we actually depend on the priority boost 
> >>> (evilly)
> >>> to pick processes to swap out.  (I think we check for <= PSOCK and don't
> >>> swap those out).  One thing that I've wanted to happen for a while is 
that
> >>> the sleep priority for msleep() just be a parameter available to the
> >>> scheduler that the scheduler can use to calculate the real internal
> >>> priority rather than just being a set.  That is, I imagine having:
> >>> 
> >>> void	sched_set_sleep_prio(struct thread *td, u_char pri);
> >>> u_char	sched_get_sleep_prio(struct thread *td);
> >>> 
> >>> (The swap check would use the get call).  The 4BSD scheduler's
> >>> implementation of sched_set_sleep_prio would look like this:
> >>> 
> >>> void
> >>> sched_set_sleep_prio(struct thread *td, u_char pri)
> >>> {
> >>>
> >>> 	td->td_sched->sleep_pri = pri;
> >>> 	sched_prio(td, pri);
> >>> }
> >>> 
> >>> void
> >>> sched_userret(..)
> >>> {
> >>>
> >>> 	...
> >>> 	td->td_sched->sleep_pri = 0;	/* not in the kernel anymore */
> >>> }
> >>> 
> >>> but other schedulers may just save it and recalculate the priority where
> >>> the priority calculation just considers the sleep priority as one among
> >>> many factors.  If nothing else, this allows it to be a scheduler 
decision
> >>> to ignore it (so 4BSD could continue to do what it does now, but ULE may
> >>> ignore it, or ignore certain levels, etc.)
> >> 
> >> One thing to clarify: I'm not opposed to replacing the PSOCK check with
> >> something more suitable in the swap code, (in fact, that would be 
> >> desirable),
> >> but it might take a good bit of work to do that and is probably easier to
> >> work on that as a separate change.  I also think there can be some merit 
in
> >> having code paths hint to the scheduler the relative 
interactivity/priority
> >> of a sleep.
> >
> > Couple of notes..
> >
> > The priority argument to sleep is a reasonable way for the code to hint at 
> > the relative priority/interactivity.  So that argues for leaving these 
> > arguments in place and making them more advisory.  I don't think we have 
to 
> > change the api to take advantage of that.
> >
> > I'll look more closely for places like the swap that care about the 
absolute 
> > priority of a process and see what I can come up with.  Thanks for raising 
> > that concern.
> >
> > I'd like to avoid apis that require the sched lock in seperate steps like 
> > msleep does now to elevate the priority.  So far all sched* apis require 
the 
> > thread lock on enter and I'd hate to deviate from that norm.  But another 
> > option may be just to make a globally visible td_sleep_pri that doesn't 
> > require the lock for write but does for read.  The other option is to 
bubble 
> > the argument down through the sleepq code and into sched_sleep() and 
> > sched_wakeup().  I like that the best but it's the most api churn.
> 
> http://people.freebsd.org/~jeff/sleeppri.diff
> 
> What do you think of this?  I added another parameter to sleepq_add() and 
> sched_sleep().  So the scheduler is responsible for adjusting the 
> priority.  We could do the same thing for wakeup time adjustments like 
> sleepq_broadcastpri() but we'd have to pass it through setrunnable() as 
> well.

The cv_broadcastpri() thing is a hack and I wish there was a better way to do 
it.  I.e., I don't like having wakeup setting the priority at all.  I think 
it's a good idea to pass this to sched_sleep(), but I'd rather leave 
sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines 
instead so you don't get a bump unless you actually sleep.  I think it's 
probably a bug that we bump the prio on threads that may not sleep now.

> I'd like to normalize the other pri arguments in sleepq to use the same 0 
> is not set vs -1 that msleep did.  I realize that 0 is a valid priority 
> but for practical purposes this makes things consistent and does not 
> really restrict the api.

Sounds fine to me.  I think we should even formally make 0 an invalid priority 
(via a comment or something).

-- 
John Baldwin


More information about the freebsd-arch mailing list