A mini-course in benchmark number crunching.

Tue Sep 28 13:08:11 PDT 2004

Hi there...

Since we'll be entering silly season benchmark-wise in a few weeks when
5.3 goes golden, I'll share an interesting benchmark I ran here today.

The situation:  I have a kernel branch in perforce which I would like to
compare to "straight -current" on buildworld performance.

The target system has three disks in addition to the system disk, so
I made a tar copy of a newly checked out src tree and put it on
one of the disks.

I then set up the computer to boot single-user, and created the following
script which would be run from the console in single user mode:

	#!/bin/sh

	# Always bail on errors.
	set -e

	# Avoid having cur-dir on one of the traffic disks.
	cd /

	# In single user root is mounted r/o so remount it.
	mount -o rw -u /

	# Unmount and mount the filesystem with the tar file.
	# (the unmount is in case we re-run the test)
	umount /hex > /dev/null 2>&1 || true
	mount /hex

	# Always get at least three samples.
	# Three or more samples allow us to calculate a standard deviation.
	for i in 1 2 3
	do
		# In case of rerun:  unmount the two filesystems.
		umount /usr/src > /dev/null 2>&1 || true
		umount /usr/obj > /dev/null 2>&1 || true

		# Create filesystems from scratch.
		# This improves repeatability.
		newfs -O 2 -U /dev/ad4 > /dev/null 2>&1
		newfs -O 2 -U /dev/ad6 > /dev/null 2>&1

		# Mount filesystems.
		mount /usr/src
		mount /usr/obj

		# Extract source tree.
		( cd /usr && tar xf /hex/src.tar )

		# Run test.
		# Note that stdout/stderr is not stored on disk, we are
		# only interested in the last two lines anyway: one to tell
		# us that the result was OK and one with the times.
		(
			cd /usr/src
			/usr/bin/time make -j 12 buildworld 2>&1 | tail -2
		) 
	done

So, I built the two kernels from the same kernel config file and ran
the test, and got these numbers:

Plain -current:
     1476.48 real      1972.63 user       798.28 sys
     1475.75 real      1965.80 user       814.99 sys
     1482.53 real      1969.07 user       814.13 sys

buf_work branch:
     1472.52 real      1965.67 user       792.49 sys
     1469.86 real      1960.00 user       803.77 sys
     1480.43 real      1958.09 user       814.67 sys

Running src/tools/tools/ministat on the numbers in turn tells us that
there is no statistically significant difference between the two datasets.

Real time:

x _current
+ _buf_work
+--------------------------------------------------------------------------+
|      +             +                x   x                    +          x|
||___________________M________A_|_________M________A_______|___________|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3       1475.75       1482.53       1476.48     1478.2533     3.7216439
+   3       1469.86       1480.43       1472.52       1474.27     5.4980087
No difference proven at 95.0% confidence

User time:

x _current
+ _buf_work
+--------------------------------------------------------------------------+
|    +        +                          *               x                x|
||____________M_____A__________________| |_______________A________________||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3        1965.8       1972.63       1969.07     1969.1667      3.416026
+   3       1958.09       1965.67          1960     1961.2533     3.9423639

System time:

x _current
+ _buf_work
+--------------------------------------------------------------------------+
|+               x               +                            x+x          |
||___________________|__________AM______________A_____________M|__________||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3        798.28        814.99        814.13     809.13333     9.4090931
+   3        792.49        814.67        803.77     803.64333     11.090543
No difference proven at 95.0% confidence
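
For anyone who wants to check the Avg and Stddev columns by hand, here
is a minimal sketch in sh and awk.  colstats is a hypothetical helper,
not part of ministat: it only recomputes N, the mean and the sample
standard deviation (dividing by N-1) for one column of time(1) output.
It does not attempt ministat's significance test.

```sh
#!/bin/sh
# Sketch: recompute N, Avg and Stddev for one column of time(1) output.
# colstats is a hypothetical helper, not part of ministat.
colstats() {
	# $1 = field number: 1 = real, 3 = user, 5 = sys.
	awk -v c="$1" '
	BEGIN { col = c + 0 }
	NF >= col { n++; x[n] = $col; sum += $col }
	END {
		if (n < 2) { print "need at least two samples"; exit 1 }
		avg = sum / n
		for (i = 1; i <= n; i++)
			ss += (x[i] - avg) * (x[i] - avg)
		# Sample standard deviation: divide by n - 1.
		printf "%d %.4f %.4f\n", n, avg, sqrt(ss / (n - 1))
	}'
}

# Real time for plain -current, taken from the numbers above.
# Prints: 3 1478.2533 3.7216
colstats 1 <<'EOF'
     1476.48 real      1972.63 user       798.28 sys
     1475.75 real      1965.80 user       814.99 sys
     1482.53 real      1969.07 user       814.13 sys
EOF
```

The result agrees with the "x" line of the real time table to the
digits shown.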

I have, in so many words, proven nothing with my test.

The direct statistical approach assumes that the three runs for
each kernel were run under identical circumstances, but this is not
the case here:  The first is run right after a reboot, the other
two sequentially after that.  This means that a large number of
tools may be cached in RAM for the second and third run.  This
should not lead us to believe that the second and third run are in
identical circumstances: the third run may, for instance, have filled
RAM to the extent where things need to be thrown out again.

Let us look at the real time for the two kernels on a run by run basis:

	current		buf_work	difference

	1476.48		1472.52		-3.96
	1475.75		1469.86		-5.89
	1482.53		1480.43		-2.10

Hmm, seen this way, buf_work is consistently faster than current
by a fraction of a percent.  The same situation holds for the user
time.  The first two runs of system time show the same pattern but
in the last run buf_work is half a second slower than current.

Eight out of nine doesn't sound bad, and the probability of buf_work
being a tad better than current is probably very high, but we do
not have an actual statistically proven difference: we have no standard
deviation for the difference.
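
The run by run comparison can be reproduced with a few lines of sh and
awk.  This is only a sketch: rundiff and the temporary file names are
hypothetical, and it assumes both files hold the time(1) lines in the
same run order.  It tabulates the differences and counts the wins.
That is bookkeeping, not a statistical test.

```sh
#!/bin/sh
# Sketch: per-run differences (candidate minus baseline) for the
# real, user and sys columns.  rundiff is a hypothetical helper.
rundiff() {
	# $1 = baseline file, $2 = candidate file, time(1) lines in run order.
	paste "$1" "$2" | awk '
	{
		# Fields 1,3,5 are the baseline, 7,9,11 the candidate.
		for (c = 1; c <= 5; c += 2) {
			d = $(c + 6) - $c
			if (d < 0) faster++
			total++
			printf "%s%.2f", (c > 1 ? "\t" : ""), d
		}
		printf "\n"
	}
	END {
		printf "candidate faster in %d of %d comparisons\n", faster, total
	}'
}

cur=$(mktemp /tmp/rundiff.XXXXXX)
new=$(mktemp /tmp/rundiff.XXXXXX)
cat > "$cur" <<'EOF'
     1476.48 real      1972.63 user       798.28 sys
     1475.75 real      1965.80 user       814.99 sys
     1482.53 real      1969.07 user       814.13 sys
EOF
cat > "$new" <<'EOF'
     1472.52 real      1965.67 user       792.49 sys
     1469.86 real      1960.00 user       803.77 sys
     1480.43 real      1958.09 user       814.67 sys
EOF
rundiff "$cur" "$new"
rm -f "$cur" "$new"
```

With the numbers above, the first column comes out as -3.96, -5.89 and
-2.10, and the last line reads "candidate faster in 8 of 9 comparisons".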

In a situation like this there are two ways one can proceed in order
to get that statistical proof.  The first is to run more iterations
per boot: increase the three to four, five or however many are
necessary to get a better standard deviation so that the direct
statistical approach works.  Often it helps to throw the first
iteration out since it is often atypical (loading a copy of make(1)
etc into RAM).  But even with 20 iterations, it may not be possible
to get the standard deviation narrow enough.  For instance a cyclic
phenomenon relating to RAM/VM contents could spread the points.

Running only one iteration per boot, and doing multiple runs, would
be a mistake though, because that would only measure the performance
right after boot, and that can differ markedly from the real world
experience.

The correct method is to do multiple runs (at least three) with
three iterations per boot, and then examine the difference for
each iteration separately.  Three runs allow a standard deviation
to be calculated and "ministat" will do all the hard math for you.
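
One way to organise the numbers for that, sketched in sh and awk.  The
assumptions are mine: the time(1) lines from all boots of one kernel
have been collected, in the order they were produced, in one file per
kernel (current.times and buf_work.times are made-up names), and
split_by_iter is a hypothetical helper.  It writes the real times of
iteration 1, 2 and 3 to separate files, which can then be handed to
ministat pairwise.

```sh
#!/bin/sh
# Sketch: group "real" times by iteration number across several boots,
# so that each iteration can be compared on its own.
# split_by_iter and all file names here are hypothetical.
split_by_iter() {
	# $1 = iterations per boot, $2 = output file prefix.
	# stdin: time(1) lines in the order produced
	# (boot 1 iteration 1, boot 1 iteration 2, ...).
	awk -v n="$1" -v p="$2" '
	NF >= 2 && $2 == "real" {
		k = (count % n) + 1
		print $1 > (p "." k)
		count++
	}'
}

# Intended use (commented out, the input files are assumptions):
#
#	split_by_iter 3 current  < current.times
#	split_by_iter 3 buf_work < buf_work.times
#	for k in 1 2 3
#	do
#		ministat current.$k buf_work.$k
#	done
```

Each ministat run then compares like with like: first iterations
against first iterations, and so on.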

I'm not a very good teacher, but I hope this example can inspire some
less lame benchmarking when people start to compare 5.3-R to other
versions, operating systems etc.

If nothing else, please just remember the first rule of statistics:

	"You can't prove anything without a standard deviation".

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.