Updated rusage patch

Bruce Evans brde at optusnet.com.au
Sun Jun 17 06:37:24 UTC 2007


On Wed, 6 Jun 2007, Jeff Roberson wrote:

> I'd like to make a list of the remaining problems with rusage and potential 
> fixes.  Then we can decide which ones myself and attilio will resolve 
> immediately to clean up some of the effect of the sched lock changes.

I haven't verified which of these fixes is necessary and/or has been done
yet.  The list is a bit incomplete.

Three more minor problems turned up (one possibly caused by applying one
of these fixes).

(1)
Results of some makeworld runs after the thread lock changes, all with
fixes for pagezero (best previous result 827 seconds; results without
touching pagezero were ~845 seconds without PREEMPTION and ~837 seconds
with PREEMPTION).  Only the differences between the following results are
interesting.

% Sat Jun  9 03:28:33 UTC 2007:
% 831.61 real      1308.57 user       184.80 sys
%    1320199  voluntary context switches
%    1533639  involuntary context switches
% pgzero time 7 seconds

Base result.

% Wed Jun 13 14:52:15 UTC 2007:
% 833.97 real      1291.71 user       201.64 sys
%    1329247  voluntary context switches
%    1518959  involuntary context switches
% pgzero time 7 seconds

Some change between June 9 and June 13 made a big difference to the user+sys
decomposition.  I think the June 9 result is more correct.

% Wed Jun 13 14:52:15 UTC 2007:
% Same kernel as previous with HZ = 1000 (HZ = 100 except as noted); stathz = 100
% 836.24 real      1310.22 user       191.04 sys
%    1323793  voluntary context switches
%    1559229  involuntary context switches
% pgzero time 7 seconds

The accuracy of the decomposition depends mainly on stathz (the
decomposition is based on statclock tick counts, and there is a
significant bias towards system time when the tick counts are all 0
-- see calcru1() -- which is reduced by increasing stathz).  I forgot
that stathz != HZ and tried the HZ = 1000 pessimization to fix it.
This somehow gave the old decomposition.

(2)
From reading the code, in sched_throw() (from sched_4bsd.c; the version in
sched_ule.c is identical; the duplication is itself another bug):

% /*
%  * A CPU is entering for the first time or a thread is exiting.
%  */
% void
% sched_throw(struct thread *td)
% {
% 	/*
% 	 * Correct spinlock nesting.  The idle thread context that we are
% 	 * borrowing was created so that it would start out with a single
% 	 * spin lock (sched_lock) held in fork_trampoline().  Since we've
% 	 * explicitly acquired locks in this function, the nesting count
% 	 * is now 2 rather than 1.  Since we are nested, calling
% 	 * spinlock_exit() will simply adjust the counts without allowing
% 	 * spin lock using code to interrupt us.
% 	 */
% 	if (td == NULL) {
% 		mtx_lock_spin(&sched_lock);
% 		spinlock_exit();
% 	} else {
% 		MPASS(td->td_lock == &sched_lock);
% 	}

The comment doesn't match the code (it only applies to the td == NULL case).

% 	mtx_assert(&sched_lock, MA_OWNED);
% 	KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
% 	PCPU_SET(switchtime, cpu_ticks());
% 	PCPU_SET(switchticks, ticks);
% 	cpu_throw(td, choosethread());	/* doesn't return */
% }

Setting switchtime, etc., here loses the delta between the current
time and switchtime.  The old code set switchtime only when a CPU was
entering for the first time.  switchtime is normally not actually
a switch time, but is set by thread_exit() just before calling here.
Not much time should be lost from this, but lots seems to be lost in
practice.
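The loss can be illustrated with a toy model of the bookkeeping;
mi_switch_model() and sched_throw_model() are made-up names standing in
for the two paths described above:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t switchtime;	/* per-CPU timestamp of the last switch */

/* Normal switch path: charge the delta to the outgoing thread, then
 * reset the timestamp. */
static void
mi_switch_model(uint64_t now, uint64_t *runtime)
{
	*runtime += now - switchtime;
	switchtime = now;
}

/* sched_throw() path as quoted: the timestamp is overwritten without
 * charging anybody, so the interval since the last switch vanishes. */
static void
sched_throw_model(uint64_t now)
{
	switchtime = now;
}
```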
According to a benchmark that does 100000 fork/wait/exits:

         2.99 real         0.13 user         2.78 sys

About 3% of the time is not accounted for.  Interrupt and kernel thread
time can only account for < 1%.
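The benchmark source isn't shown; a minimal sketch of such a loop
(assuming the obvious fork/wait/exit shape) would be:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately, then reap it, n times.
 * Returns the number of completed iterations, or -1 on error.
 */
static int
fork_exit_loop(int n)
{
	pid_t pid;
	int i, status;

	for (i = 0; i < n; i++) {
		pid = fork();
		if (pid == -1)
			return (-1);
		if (pid == 0)
			_exit(0);
		if (waitpid(pid, &status, 0) != pid ||
		    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			return (-1);
	}
	return (i);
}
```

Running this under time(1) with n = 100000 and comparing real against
user + sys exposes the unaccounted-for time.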

The old code didn't get this right either, despite my attempts to
minimize the unaccounted-for time.  Fixing it should be easier now.
Of course, the time spent exiting cannot _all_ be accounted to the
exiting thread.  I want as much of it as possible to go there and the
rest to the next thread (which might be the idle thread in general, so
the time would be almost invisible; but for the fork-wait-exit
benchmark the fork-wait thread should always be switched to next, to
complete its wait()).

(3)
Bugs found while grepping near cpu_throw:
- kern_thread.c has cpu_throw() hard-coded in 4 comments and one string,
   but now only calls sched_throw().
- sched_throw() is not declared as non-returning in sys/sched.h.
- kern_thread.c has a bogus panic and a NOTREACHED comment after a call
   to sched_throw(), which doesn't return.
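For the second bullet, the effect of a non-returning declaration can be
sketched in portable C.  FreeBSD spells the annotation __dead2 (defined
in sys/cdefs.h as a wrapper for the GCC noreturn attribute); the exact
declaration to add to sys/sched.h is an assumption here, and
throw_model() below is a made-up stand-in for sched_throw():

```c
#include <stdlib.h>

/*
 * Declaring a function as non-returning tells the compiler that code
 * after a call to it is unreachable, so the caller needs no bogus
 * panic()/NOTREACHED tail.
 */
__attribute__((__noreturn__)) static void
throw_model(int code)
{
	exit(code);	/* never returns to the caller */
}

static int
caller(void)
{
	throw_model(0);
	/*
	 * Without the noreturn annotation, the compiler would warn
	 * that this function can fall off the end without returning
	 * a value; with it, nothing after the call is needed.
	 */
}
```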

Bruce


More information about the freebsd-arch mailing list