What is the PREEMPTION option good for?

Bruce Evans bde at zeta.org.au
Fri Dec 1 16:07:37 PST 2006


On Fri, 1 Dec 2006, Ivan Voras wrote:

> Robert Watson wrote:
>
>> They're independent twiddles, and can be frobbed separately.  If you can
>> easily measure performance in the different configurations, seeing a
>> table of permutations and results would be very nice to see what happens
>> :-).
>
> Ok, this is what I found:
>
> - ipiwakeup doesn't produce differences as calculated by ministat
> - turning off preemption produces visible differences, which are
> calculated by ministat to be up to 10%.

10% is surprisingly high.

I found another setup where PREEMPTION (should) help -- nfs servers.
For building kernels, PREEMPTION on the client is just a tiny
pessimization, but network latency is a problem for nfs, and not having
PREEMPTION configured makes it worse.  PREEMPTION is needed even to
give correct scheduling of interrupt threads, and that seems to be all
that it gives, at least in the !KSE case, though the main comment about
it says otherwise.  From kern_switch.c:

% int
% maybe_preempt(struct thread *td)
% {
% ...
% 	 *  [... conditions for preempting]
% 	 *  - If the new thread's priority is not a realtime priority and
   	                                      ^^^^^^^^^^^^^^^^^^^^^^^
% 	 *    the current thread's priority is not an idle priority and
% 	 *    FULL_PREEMPTION is disabled.
% ...
% #ifndef FULL_PREEMPTION
% 	if (pri > PRI_MAX_ITHD && cpri < PRI_MIN_IDLE)
% 	    ^^^^^^^^^^^^^^^^^^
% 		return (0);
% #endif

The condition in the code is very far from testing for a realtime
priority.  "Realtime priority" is a technical term meaning "the priority
of a user thread whose scheduling class is PRI_REALTIME", and there is a
classification macro, PRI_IS_REALTIME(), for such priorities.  Of course,
"realtime priority" in the comment doesn't mean that -- it means something
more informal, which I would expect to include all kernel threads and all
realtime-priority user threads.  But the condition in the code is just
"not an interrupt thread".
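
To make the mismatch concrete, here is a sketch (not a patch, and not
the code that is actually there) of a test matching the comment's
literal wording.  It assumes the new thread's scheduling class can be
read as td->td_pri_class; in the KSE case the class actually lives in
the ksegrp as kg_pri_class:

#ifndef FULL_PREEMPTION
	/*
	 * Sketch only: refuse to preempt when the new thread is not in
	 * the realtime class and the current thread is not idle-class.
	 * The real code instead tests "pri > PRI_MAX_ITHD", i.e., it
	 * bails for everything that is not an interrupt thread.
	 */
	if (!PRI_IS_REALTIME(td->td_pri_class) && cpri < PRI_MIN_IDLE)
		return (0);
#endif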

I don't understand maybe_preempt_in_ksegrp() and have KSE unconfigured.

FULL_PREEMPTION is apparently needed to get kernel threads preempted by
anything other than interrupt threads.  It is not the default, presumably
because it pessimizes more cases than plain PREEMPTION does.
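
For reference, both knobs are ordinary kernel config options (spellings
as in sys/conf/options; check NOTES in your tree before trusting these):

options 	PREEMPTION		# preempt to (at least) interrupt threads
options 	FULL_PREEMPTION		# preempt to anything with higher priority;
					# only meaningful together with PREEMPTION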

Anyway, with kernels already optimized by about 30% for nfs (mainly on
the client side), my ~5.2 UP kernel (with working preemption to interrupt
threads, unlike stock 5.2), used as the server, beats a -current UP kernel
(without PREEMPTION) by about 3% in real time and 30% in dead time for
building kernels, with a -current SMP kernel (without PREEMPTION) as
the client.  The difference is entirely due to dead time somewhere in
nfs.  Unfortunately, turning on PREEMPTION and IPI_PREEMPTION didn't
recover all the lost performance.  This is despite the ~current kernel
having slightly lower latency for flood pings, and similar optimizations
for nfs that reduce the RPC count by a factor of 4 and the ping latency
by a factor of 2.
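
For anyone wanting to reproduce the RPC-count part of this, one way --
a sketch, using stock nfsstat(1) and whatever load is being compared --
is to zero the counters on the client, run the build, and read them back:

	nfsstat -z			# zero the nfs statistics (needs root)
	time make buildkernel		# or whatever the benchmark load is
	nfsstat -c			# client-side RPC counts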

In previously clipped context, Robert Watson wrote:
> There's a known performance regression with PREEMPTION and loopback network 
> traffic on UP or UP-like systems due to a poor series of context switches 
> occurring in the network stack.  If your benchmark involves the above web load 
> over the loopback, that could be the source of what you're seeing.  If it's 
> not loopback traffic, then that's not the source of the problem.

I see only a slight additional loss of performance since ~5.2 for loopback.
Approximate latencies for flood pings:

                RELENG_3   RELENG_4   current-2006/04/16   -current
Celeron 366:    14uS       19uS       48uS                 -
AthlonXP 2223:  -          2uS        4-5uS                5-6uS
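
These are the round-trip times reported by flood ping; the invocation is
the obvious one (packet count and target are only illustrative, and flood
ping needs root):

	ping -f -q -c 100000 server	# read the round-trip min/avg/max line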

> You might try fiddling with kern.sched.ipiwakeup.enabled and see what the 
> effect is, btw -- this controls whether or not the scheduler wakes up another 
> idle CPU to run a thread when waking up that thread, rather than queuing it to 
> run which may occur on the other CPU at the next clock tick.

kern.sched.ipiwakeup.enabled seems to be enabled by default.  Does it
work without IPI_PREEMPTION?  Is the rescheduling of even interrupt
threads really delayed until the next clock tick?  I guess it is --
scheduling delays are normally good for efficiency.  I use HZ = 100,
which might delay scheduling more than the default, but I think you mean
scheduling clock ticks, and stathz is normally only 128 Hz.  Scheduling
also occurs on other (non-fast) interrupts.  Maybe the fast interrupt
handlers in some network drivers work better mainly because they do more
forceful scheduling (of the task queue thread) than now happens for
normal interrupt handlers.
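
For completeness, the knobs and clock rates discussed here can all be
inspected from userland, e.g. (sysctl names as on this -current; the
ipiwakeup subtree may not exist with every scheduler):

	sysctl kern.sched.ipiwakeup	# the whole ipiwakeup subtree
	sysctl kern.clockrate		# shows hz and stathz (and profhz)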

Bruce

