PID Allocator Performance Results (was: Re: [UPDATE] new pid alloc...)
David Schultz
das at FreeBSD.ORG
Sun Feb 8 01:45:48 PST 2004
On Sun, Feb 08, 2004, Poul-Henning Kamp wrote:
> In message <20040208080630.GA14364 at VARK.homeunix.com>, David Schultz writes:
> >I spent some time today benchmarking the various proposed pid
> >allocators. The full results, along with pretty pictures and a
> >more complete analysis, are at:
> >
> > http://people.freebsd.org/~das/pbench/pbench.html
>
> You _do_ realize that the difference between "tjr" and "net" in the
> bottom plot is not statistically significant ?
>
> Stratification is visibly present from approx 1500 pids and up, and
> ends up being responsible for 1/3rd of the difference by the time
> you get to 5000 pids.
>
> (The tell-tale sign here is that the two data sets both fall on two
> mostly straight lines in a random looking pattern, with practically
> no measurements hitting the interval between the two lines.)
>
> If we assume the stratification has linearity with number of pids,
> which I think looks reasonable, and we read the right hand edge as
> half a second and the left hand edge as zero, we find:
>
> (.5 - 0) [second]
> -------------------------------------- = 10 [nsec] / [iteration*pid]
> 10000 [iterations] * (5000 - 0) [pids]
>
> 10nsec per operation is getting you into the territory of effective
> TSC-timecounter resolution, RAM access time, cache miss delays
> and all sorts of other hardware effects.
To avoid jitter and timestamping overhead, I read the time only at
the start and end of the entire sequence of 10000 operations.
I obtained the sample variance by running the entire test three
times, i.e.
for (pass = 0; pass < 3; pass++) {
	for (nprocs = 100; nprocs < 5000; nprocs++) {
		set up the test, fork the sleepers;
		take starting timestamp;
		for (iter = 0; iter < 10000; iter++)
			run test;
		take ending timestamp;
	}
}
Nevertheless, you're definitely right about the stratification.
I'm not sure how to explain it. My best theory is that there's
some confounding factor with a constant overhead, such as the
pageout daemon waking up. If you look at the original samples,
there are always two that are normal and one outlier. More
samples would probably correct for this; if I were running the
benchmark again, I would do more iterations of the outer loop in
the pseudocode above and fewer of the inner loop.
> So all in all, I would say that you have proven that "tjr" and "net"
> are better than "old", but not that there is any statistically
> significant performance difference between them.
Yes, I realize that. I took 10 more samples of 10000 forks each
with 5000 sleeping processes in the background and got the
following:
tjr:
1.130558492
1.125901197
1.144079485
1.118981882
1.131435699
1.123052511
1.133321135
1.121301171
1.133015788
1.124377539
net:
1.116848091
1.119333603
1.117941526
1.121989527 (got an outlier (2.547301682) here, so I reran this test)
1.118023912
1.110658198
1.126021045
1.106436712
1.116406694
1.100889638
These data show a difference at the 95% confidence level, namely,
that the NetBSD algorithm is about 1% faster on a system with 5000
processes (and only 0.1% faster if you're looking at the total
overhead of fork() rather than vfork().) I think that pretty much
rules out performance as the deciding factor between the two.