Some initial postmark numbers from a dual-PIII+ATA, 4.x and 6.x

Robert Watson rwatson at FreeBSD.org
Sun Feb 6 06:44:42 PST 2005


On Sun, 6 Feb 2005, Jeremie Le Hen wrote:

> Hi Robert,
> 
> > This would seem to place it closer to 4.x than 5.x -- possibly a property
> > of a lack of preemption.  Again, the differences here are so small it's a
> > bit difficult to reason using them.
> 
> Thanks for the result.  I'm quite doubtful now: I thought it was an
> established fact that RELENG_5 performs worse than RELENG_4 for the
> moment, partly due to a lack of micro-optimizations.  There have indeed
> been numerous reports of weak performance on 5.x.  Seeing your results,
> it appears that RELENG_4, RELENG_5 and CURRENT are in fact very close.
> What should we think, then?

You should think that benchmark results are a property of several factors: 

- Work load
- Software baseline
- Hardware configuration
- Software configuration
- Experimental method
- Effectiveness of documentation

Let's evaluate each:

- The workload was postmark in a relatively stock configuration.  I
  selected a smaller number of transactions than some other reporters,
  because my hardware is quite a bit slower and I wanted to get coverage
  of a number of versions.  I selected a 90-ish second run.  The postmark
  benchmark is basically about effective caching, efficient I/O
  processing, and how the file system manages meta-data (a sketch of the
  character of this workload follows this list).

- Software baseline: I chose to run with 4.x, 5.x, and 6.x kernels, all
  configured for "production" use; i.e., no debugging features enabled.
  I also used a statically compiled 4.x postmark binary for all tests on
  all versions, to try to avoid the effects of compiler changes, etc.  I
  was primarily interested in evaluating the performance of the kernel as
  a variable.

- Hardware configuration: I'm using somewhat dated PIII MP hardware with a
  relatively weak I/O path.  It was the hardware on hand, and easily
  preempted for this purpose.  The hardware has a fairly high CPU:I/O
  performance ratio, meaning that with many interesting workloads, the
  work will be I/O-bound, not CPU-bound.  It becomes a question of
  feeding the CPUs and making effective use of the available I/O path.

- Software configuration: I network booted the kernel, and used one of two
  user spaces on disk -- a 4.x world and a 6.x world.  However, I used a
  single shared UFS1 partition for the postmark target.  My hope was that
  static linking would eliminate issues involving library changes, and
  that using the same file system partition would help reduce disk
  location effects (note that disk performance varies substantially based
  on the location of data on the platter -- if you lay out a disk into
  several partitions, they will have quite different performance
  properties, often differing by more than the effect you're trying to
  measure).  However, as a result I used UFS1 for both tests, which is not
  the default install configuration for FreeBSD 5.x and 6.x.

- Experimental method: I attempted to control additional variables as
  much as possible.  However, I used a small number of runs per
  configuration: two.  I selected that number to reveal whether there
  were caching effects in play between multiple runs without reboots.
  The numbers suggest slight caching effects, but not huge ones.  The
  sample wasn't large enough to give a sampling distribution that could
  be analyzed -- on the other hand, the runs were relatively long and
  report mean results, meaning that we benefited from a sampling effect
  and a smoothing effect by virtue of the experiment design.  To run this
  experiment properly, you'd want to distinguish the caching/non-caching
  cases better, control the time between runs better, and take larger
  samples (see the driver sketch after this list).  In order to try to
  explain the results I got, I waved my hands at CPU cost, and will go
  into that some more below.  I did not measure the CPU load during the
  experiment in a rigorous or reproducible way.

- Effectiveness of documentation: my experiment was documented, although
  not in great detail.  I neglected to document the version of postmark
  (1.5c), the partition layout details, and the complete configuration
  details.  I've included more here.
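
To make the workload concrete, here is a rough C sketch of the character
of the postmark transaction phase.  It is not postmark itself, just an
illustration of the create/append/read/unlink mix on small files that
makes the benchmark a meta-data and caching exercise rather than a bulk
I/O one:

#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFILES  500             /* size of the small-file pool */
#define NTRANS  20000           /* number of meta-data "transactions" */

int
main(void)
{
        char path[64], buf[512];
        int fd, i;

        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < NTRANS; i++) {
                snprintf(path, sizeof(path), "pm.%d", i % NFILES);
                /* Create if needed, append a small block, read it back. */
                if ((fd = open(path, O_RDWR | O_CREAT | O_APPEND, 0644)) < 0)
                        err(1, "open %s", path);
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                        err(1, "write %s", path);
                if (lseek(fd, 0, SEEK_SET) == -1)
                        err(1, "lseek %s", path);
                if (read(fd, buf, sizeof(buf)) == -1)
                        err(1, "read %s", path);
                close(fd);
                /* Mix deletes in with the creates, appends, and reads. */
                if (i % 4 == 0)
                        (void)unlink(path);
        }
        return (0);
}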
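
As for the more rigorous driver mentioned under "experimental method"
above, something along the following lines would produce samples that
could actually be analyzed.  This is illustrative only; the "postmark
pm.cfg" invocation is a placeholder for whatever the real run is, and a
proper experiment would also unmount/remount or reboot between runs to
control caching:

/* Compile with: cc -o runbench runbench.c -lm */
#include <sys/time.h>
#include <sys/wait.h>

#include <err.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NRUNS   10

int
main(void)
{
        struct timeval start, end;
        double secs[NRUNS], sum, mean, var;
        int i, status;

        sum = var = 0.0;
        for (i = 0; i < NRUNS; i++) {
                gettimeofday(&start, NULL);
                /* Placeholder: replace with the real benchmark command. */
                status = system("postmark pm.cfg > /dev/null");
                if (status == -1 || WEXITSTATUS(status) != 0)
                        errx(1, "benchmark run %d failed", i);
                gettimeofday(&end, NULL);
                secs[i] = (end.tv_sec - start.tv_sec) +
                    (end.tv_usec - start.tv_usec) / 1e6;
                sum += secs[i];
                /* Unmount/remount or reboot here to control caching. */
        }
        mean = sum / NRUNS;
        for (i = 0; i < NRUNS; i++)
                var += (secs[i] - mean) * (secs[i] - mean);
        printf("mean %.2f s, stddev %.2f s over %d runs\n",
            mean, sqrt(var / (NRUNS - 1)), NRUNS);
        return (0);
}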

In my original results post, I demonstrated that, subject to the
conditions of the tests (documented above and previously), FreeBSD 5.x/6.x
performance was in line with 4.x performance, or perhaps marginally
faster.  This surprised me also: I expected to see a 5%-10% performance
drop on UP based on increased overhead, and hoped for a moderate
measurable SMP performance gain relative to 4.x.  On getting the results I
did, I reran a couple of sample cases -- specifically, 4.x and 6.x kernels
on SMP with some informal measurement of system time.  I concluded that
the systems were basically idle throughout the tests, which was a likely
result of the I/O path being the performance bottleneck.  It's likely that
the slight performance improvement between 4.x and 6.x relates to
preemption and the ability to turn around I/O's in the ATA driver faster,
or maybe some minor pipelining effect in GEOM or such.  It would be
interesting to know what it is that makes 6.x faster, but it may be hard
to find out given the amount of change in the system.

I also informally concluded that 6.x was seeing a higher percentage system
time than 4.x.  This result needs to be investigated properly in an
experiment of its own, since it was based on informal watching of %system
in systat, combined with a subjective observation that the numbers
appeared bigger.  An experiment involving the use of time(1) would be a
good place to start.  What's interesting about this informal observation
(not a formal experimental conclusion!) is that it might explain the
differing postmark result from some of the other reporters.  The system I
tested on has decent CPU oomph, but relatively slow ATA drive technology
-- not a RAID, not UDMA100, etc.  So if a bit more CPU was
burned to get slightly more efficient use of the I/O channel, then that
was immediately visible as a positive factor.  On systems with much
stronger I/O capabilities, perhaps to the point of being CPU-bound, that
can hurt rather than help, as there are fewer resources available to
support the critical path.
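
A minimal sketch of such a wrapper, which does essentially what time(1)
already does by collecting the child's rusage via wait4(2), might look
like this; the command wrapped is whatever benchmark run you want to
measure:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        struct timeval t0, t1;
        struct rusage ru;
        pid_t pid;
        int status;

        if (argc < 2)
                errx(1, "usage: cputime command [args ...]");
        gettimeofday(&t0, NULL);
        if ((pid = fork()) == -1)
                err(1, "fork");
        if (pid == 0) {
                execvp(argv[1], argv + 1);
                err(1, "execvp %s", argv[1]);
        }
        /* wait4() returns the child's resource usage with its status. */
        if (wait4(pid, &status, 0, &ru) == -1)
                err(1, "wait4");
        gettimeofday(&t1, NULL);
        printf("real %.2f s, user %.2f s, sys %.2f s\n",
            (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6,
            ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
            ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
        return (0);
}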

Another point that may have helped my configuration is that it ran on a
PIII, where the relative costs of synchronization primitives are much
lower.  A few months ago, I ran a set of micro-benchmarks that
illustrated that on the P4 architecture, synchronization primitives are
many times more expensive, relative to regular operations, than on
previous architectures.  It could be that the instruction blend came out
"net worse" in the 5.x/6.x systems on P4-based hardware.

Another point in favor of the configuration I was running is that the ATA
driver is MPSAFE.  This means its interrupt handler is able to preempt
most running code, and that it can execute effectively in parallel with
other parts of the kernel (including the file system).  Several of
the reported results were on the twe storage adapter, which does not have
that property.  Last night, Scott Long mailed me patches to fix dumping on
twe, and also make it MPSAFE.  I hope to run some stability testing on
that, and then hopefully we can get those patches into the hands of people
doing performance testing with twe and see if they help.  FWIW, similar
changes on amr and ips have resulted in substantial I/O improvements,
primarily by increasing transactions-per-second throughput through
reduced latency in processing I/O transactions.  It's easy to
imagine this having a direct effect on a benchmark that is really a
measure of meta-data transaction throughput.

Finally, my slightly hazy recollection of earlier posts was that postmark
generally illustrated somewhat consistent performance between FreeBSD
revisions (excepting NFS async breakage), but that Linux seemed to tromp
all over us on meta-data operations.  There was some hypothesizing by Matt
and Poul-Henning that this was a result of having what Poul-Henning refers
to as a "Lemming Syncer" -- i.e., a design issue in the way we stream data
to disk.

Robert N M Watson



