HZ=100: not necessarily better?

Sun Jun 18 00:21:45 UTC 2006

On Sat, 17 Jun 2006, Danial Thom wrote:

> At some point you're going to have to figure out that there's a reason that 
> every time anyone other than you tests FreeBSD it completely pigs out. 
> Squeezing out some extra bytes in netperf isn't "performance". Performance is 
> everything that a system can do. If you're eating 10% more cpu to get a few 
> more bytes in netperf, you haven't increased the performance of the system.

This test wasn't netperf, it was a 32-process web server and a 32-process 
client, doing sendfile on UFS-backed data files.  It was definitely a potted 
benchmark, in that it omits some of the behaviors of web servers (dynamic 
content, significantly variable data set, etc), but is intended to be more 
than a simple micro-benchmark involving two sockets and packet blasting. 
Specifically, it was intended to validate whether or not there were 
immediately observable changes in TCP behavior based on adjusting HZ under 
load.  The answer was a qualified yes: there was a small but noticeable 
negative effect on high-load web serving in the test environment from reducing 
HZ, likely due to reduced timer accuracy.  In other words: simply frobbing HZ 
isn't a strategy that necessarily results in a performance improvement.
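
To illustrate what "reduced timer accuracy" means here, the following is a 
rough sketch of how a tick-driven timeout gets coarser as HZ drops.  It is a 
simplified model, not the kernel's actual callout code: it assumes a timeout 
is simply rounded up to a whole number of ticks, and the function names and 
the HZ values 100 and 1000 are chosen purely for illustration.

```python
import math


def tick_ms(hz):
    """Length of one clock tick in milliseconds for a given HZ."""
    return 1000.0 / hz


def rounded_timeout_ms(requested_ms, hz):
    """Round a requested timeout up to a whole number of ticks.

    Assumption: a timeout can only fire on a tick boundary and always
    waits at least one tick.  Real kernels may differ in the details.
    """
    ticks = max(1, math.ceil(requested_ms * hz / 1000))
    return ticks * 1000.0 / hz


if __name__ == "__main__":
    for hz in (100, 1000):
        print("HZ=%d: tick=%.1f ms, 15 ms timeout fires after %.1f ms"
              % (hz, tick_ms(hz), rounded_timeout_ms(15, hz)))
```

Under this model a 15 ms timeout is stretched to 20 ms at HZ=100 but stays at 
15 ms at HZ=1000, which is the kind of difference that could plausibly affect 
timer-driven TCP behavior under load.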

> You need to do things like run 2 benchmarks at once. What happens to the 
> "performance" of one benchmark when you increase the "performance" of the 
> other? Run a database benchmark while you're running a network benchmark, or 
> while you're passing a controlled stream of traffic through the box.

The point of this exercise was to demonstrate the complexity of the issue of 
adjusting HZ, and to suggest that simply changing the value in the absence 
of further evidence could have negative effects, and that we might want to 
investigate a more mature middle ground, such as a modified timer 
architecture.  I'm sorry if that conclusion wasn't clear from my e-mail.

> I'd also love to see the results of the exact same test with only 1 cpu 
> enabled, to see how well you scale generally. I'm astounded that no-one ever 
> seems to post 1 vs 2 cpu performance, which is the entire point of SMP.

Single CPU results were included in my e-mail.  There are actually a couple of 
other variations of interest you want to measure in more general benchmarking 
exercises:

- Kernel compiled without any SMP support.  Specifically, without lock
   prefixes on atomic instructions.

- Kernel compiled with SMP support, but with use of additional CPUs disabled.

- Kernel compiled with SMP support, and with varying numbers of CPUs enabled.

The first two cases are important, because they help identify the difference 
between the general overhead of compiling in locked instructions (and related 
issues), and the overheads associated with contention, caches, inter-CPU IPI 
traffic, scheduling, etc.  By failing to compare the first two cases, it might 
be easy to conclude that a performance difference is due to the additional cost 
of atomic instructions, whereas in reality it may be the result of a poor 
scheduling decision, or of data unnecessarily cache missing in both CPUs rather 
than one because processing of the data is split poorly over available CPUs.
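
As a sketch of how results from the three configurations might be compared, 
the helper below separates the cost of merely compiling in SMP support from 
the effect of actually enabling more CPUs.  The function name, the metric 
names, and the idea of reducing each run to a single throughput figure are 
assumptions made for illustration; any numbers fed to it should come from 
real measurements.

```python
def compare_configs(up, smp_1cpu, smp_ncpu, ncpus):
    """Compare throughput (higher is better) across three kernel configs.

    up        -- kernel compiled without SMP support
    smp_1cpu  -- SMP kernel with additional CPUs disabled
    smp_ncpu  -- SMP kernel with `ncpus` CPUs enabled
    """
    return {
        # Fraction of throughput lost just by compiling in SMP support
        # (locked instructions and related costs), with one CPU in use.
        "smp_overhead": (up - smp_1cpu) / up,
        # Gain from enabling the extra CPUs on the same SMP kernel; this
        # reflects contention, caching, IPIs, and scheduling.
        "speedup": smp_ncpu / smp_1cpu,
        # Fraction of ideal linear scaling actually achieved.
        "efficiency": smp_ncpu / (smp_1cpu * ncpus),
    }


if __name__ == "__main__":
    # Placeholder inputs only, not measured results.
    print(compare_configs(up=100.0, smp_1cpu=95.0, smp_ncpu=152.0, ncpus=2))
```

Keeping the first figure separate from the other two is the point: a poor 
multi-CPU result can then be attributed to scaling effects rather than to the 
cost of atomic instructions.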

Robert N M Watson
Computer Laboratory
University of Cambridge