sched_userret priority adjustment patch for sched_4bsd

Julian Elischer julian at elischer.org
Mon Sep 27 15:10:43 PDT 2004



Stephan Uphoff wrote:

>On Mon, 2004-09-27 at 14:43, John Baldwin wrote:
>  
>
>>>>@@ -272,7 +272,7 @@
>>>> {
>>>>
>>>>        mtx_assert(&sched_lock, MA_OWNED);
>>>>-       if (td->td_priority < curthread->td_priority)
>>>>+       if (td->td_priority < curthread->td_ksegrp->kg_user_pri)
>>>>                curthread->td_flags |= TDF_NEEDRESCHED;
>>>> }
>>>>        
>>>>

in sched_userret() we do:
        kg = td->td_ksegrp;
        if (td->td_priority != kg->kg_user_pri) {
                mtx_lock_spin(&sched_lock);
                td->td_priority = kg->kg_user_pri;
                mtx_unlock_spin(&sched_lock);
        }

but we don't actually take any action in the case where the thread is 
heading out to userland with
a priority of less importance than a waiting thread. That happens in 
AST() where we also set it down
but only in the  case of TDF_NEEDRESCHED being set.

it would make more sense to ALWAYS to the TDF_NEEDRESCHED clause, in 
userret()
based on the user priority... i.e. the priority would be reduced going 
to userland.
Unfortunatly this would stop one of the reasons to for priorityu 
raisning in BSD.

The priority of a thread that waits for IO is raised not only to make it 
start again in the kernel
as an interactive thread, but also so that it can run into userland too 
and get some priority
for actually USING the new data/input..

it maybe that we should consider the priority as a number of different 
components that
should not be added together until needed. The "interractive IO priority 
boost" that comes from
having doen an msleep(). should wear off very quickly.. maybe we knock 
it down again at the
first or second clock tick  but to do that we nned to track that 
"interractive boost" separatly
from the general priority.


>>>>Index: sched_ule.c
>>>>===================================================================
>>>>RCS file: /usr/cvs/src/sys/kern/sched_ule.c,v
>>>>retrieving revision 1.129
>>>>diff -u -r1.129 sched_ule.c
>>>>--- sched_ule.c 11 Sep 2004 10:07:22 -0000      1.129
>>>>+++ sched_ule.c 27 Sep 2004 14:13:01 -0000
>>>>@@ -723,7 +723,7 @@
>>>>         */
>>>>        pcpu = pcpu_find(cpu);
>>>>        td = pcpu->pc_curthread;
>>>>-       if (ke->ke_thread->td_priority < td->td_priority ||
>>>>+       if (ke->ke_thread->td_priority < td->td_ksegrp->kg_user_pri ||
>>>>            td == pcpu->pc_idlethread) {
>>>>                td->td_flags |= TDF_NEEDRESCHED;
>>>>                ipi_selected(1 << cpu, IPI_AST);
>>>>
>>>>An even better fix might be to fix td_base_pri by having it be set on
>>>>kernel entry similar to how 4.x sets curpriority.  The above fix should
>>>>be sufficient for now, however.
>>>>        
>>>>
>>>I don't think that this is enough since TDF_NEEDRESCHED is thread
>>>specific and not cpu specific.
>>>      
>>>
>>Hmm, it is CPU specific in 4.x.  It could be changed back to being a per-cpu 
>>flag easily.
>>    
>>
>
>I guess we disagree here.
>There are just to many resource dependencies in the kernel that can lead
>to priority inversion (vnode locks, disk buffer ownership, etc).
>It would be nice to delay the priority boost until a thread acquires
>such a resource (or even trace resource dependencies and implement
>priority inheritance) ... but this would be a huge task.
>Boosting the priority on kernel entry is easy and less error prone. I
>guess we had this discussion last week and we just disagree on the
>issue. 
>
>  
>
>>I think the sleep functions could then 
>>leave td_base_pri alone.  (I think setting it there is wrong because 
>>td_base_pri is not quite the same as curpriority in 4.x.)  What td_base_pri 
>>is really supposed to provide, btw, is the priority that the thread should go 
>>back to once it has unlocked a mutex and had its priorty boosted while it 
>>held the mutex.  Arguably it should just be using kg_user_pri for this, but 
>>then you loose priority "boosts" from tsleep(), which is why td_base_pri is 
>>set in msleep().  I guess what should happen is something more like this:
>>
>>kernel_entry()
>>{
>>	KASSERT(td->td_priority == td->td_ksegrp->kg_user_pri);
>>	td->td_base_pri = td->td_priority;
>>}
>>
>>msleep()
>>{
>>	sched_prio(...);
>>	td_base_pri = td->td_priority;
>>}
>>
>>The TDF_NEEDRESCHED checks should be using kg_user_pri as in my previous 
>>e-mail.  Also, in sched_prio(), if our priority is ever raised (numerically, 
>>logically less important), we should set TDF_NEEDRESCHED since we may need to 
>>switch (4.x does this in maybe_needresched()).  Then, TDF_NEEDRESCHED could 
>>become a per-cpu flag and have it not be cleared in mi_switch() but be 
>>cleared only in userret().  Hmm, I think all of the TDF_NEEDRESCHED handling 
>>actually beings in sched_userret() btw.
>>    
>>
>
>Wouldn't this lead to unnecessary round-robin switches between threads
>with the same priority on sched_4bsd?
>

maybe..
I added code to be "kind" to preempted threads (by puting them back on 
the head of their queue)
but overall the 4bsd scheduler doesn't translate very well into a 
multithreaded and multi processer world..
if you have multiple threads in a process, how much of the interractive 
boost do you assign to the thread and how
much to the process, which may be doing work derived from the IO but in 
another thread?
It gets even more difficult when you realise that user threads can 
switch between kernel threads without the
kernel being aware.  I have ocnsiderred of the kernel SHUOLD be aware of 
user theads. By which I mean that while we might
have only a few FUll kernel threads (with stacks etc) it might be a 
worthwhile thing to keep a small structure in teh kernel
to correspond with each user thread, that can hold a coup,e of basic 
parameters.  Sort of a hybrid between the current
"Only the UTS knows about all threads" and "full kernel knowledge of all 
threads"..  The kernel knows abot them and their history,
but doesn't need to supply full running resources for them, just a small 
(probably maybe only 8 ints worth may be enough)
amount of info that can be looked up on kernel entry using the mailbox.


>
>	Stephan
>
>_______________________________________________
>freebsd-arch at freebsd.org mailing list
>http://lists.freebsd.org/mailman/listinfo/freebsd-arch
>To unsubscribe, send any mail to "freebsd-arch-unsubscribe at freebsd.org"
>  
>



More information about the freebsd-arch mailing list