HZ=100: not necessarily better?

Scott Long scottl at samsco.org
Sun Jun 18 00:17:37 UTC 2006


Danial Thom wrote:

> 
> --- Robert Watson <rwatson at FreeBSD.org> wrote:
> 
> 
>>Scott asked me if I could take a look at the impact of changing HZ
>>for some simple TCP performance tests.  I ran the first couple, and
>>got some results that were surprising, so I thought I'd post about
>>them and ask people who are interested if they could do some
>>investigation also.  The short of it is that we had speculated that
>>the increased CPU overhead of a higher HZ would be significant when
>>it came to performance measurement, but in fact, I measure improved
>>performance under high HTTP load with a higher HZ.  This was, of
>>course, the reason we first looked at increasing HZ: improving timer
>>granularity helps improve the performance of network protocols, such
>>as TCP.  Recent popular opinion has swung in the opposite direction,
>>that higher HZ overhead outweighs this benefit, and I think we should
>>be cautious and do a lot more investigating before assuming that is
>>true.
>>
>>Simple performance results below.  Two boxes on a gig-e network with
>>if_em ethernet cards, one running a simple web server hosting 100 byte
>>pages, and the other downloading them in parallel (netrate/http and
>>netrate/httpd).  The performance difference is marginal, but at least
>>in the SMP case, likely more than a measurement error or cache
>>alignment fluke.  Results are transactions/second sustained over a 30
>>second test -- bigger is better; box is a dual xeon p4 with HTT;
>>'vendor.*' are the default 7-CURRENT HZ setting (1000) and 'hz.*' are
>>the HZ=100 versions of the same kernels.  Regardless, there wasn't an
>>obvious performance improvement by reducing HZ from 1000 to 100.
>>Results may vary, use only as directed.
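
A minimal sketch of how to flip that setting, for anyone repeating the
test, assuming a stock kernel config; the loader tunable should be
equivalent to rebuilding with the option:

  # kernel configuration file, then rebuild and install the kernel:
  options         HZ=100

  # or as a loader tunable in /boot/loader.conf (no rebuild needed):
  kern.hz="100"
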
>>
>>What we might want to explore is using a programmable timer to set up
>>high precision timeouts, such as TCP timers, while keeping base
>>statistics profiling and context switching at 100hz.  I think phk has
>>previously proposed doing this with the HPET timer.
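
Whether a given box even has an HPET is easy to check from userland; a
rough sketch (this only shows the timecounter side -- actually driving
timeouts from it is the part that would need new code):

  # list the available timecounters and their quality ratings
  sysctl kern.timecounter.choice

  # show, and optionally switch, the one currently in use
  sysctl kern.timecounter.hardware
  sysctl kern.timecounter.hardware=HPET   # only if HPET appears in the list above
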
>>
>>I'll run some more diverse tests today, such as raw bandwidth tests,
>>pps on UDP, and so on, and see where things sit.  The reduced overhead
>>should be measurable in cases where the test is CPU-bound and there's
>>no clear benefit to more accurate timing, such as with TCP, but it
>>would be good to confirm that.
>>
>>Robert N M Watson
>>Computer Laboratory
>>University of Cambridge
>>
>>
>>peppercorn:~/tmp/netperf/hz> ministat *SMP
>>x hz.SMP
>>+ vendor.SMP
>>+--------------------------------------------------------------------------+
>>|xx x xx   x       xx  x     +              +  +  +   +    ++ +         ++|
>>|  |_______A________|                    |_____________A___M________|     |
>>+--------------------------------------------------------------------------+
>>    N           Min           Max       Median           Avg        Stddev
>>x  10         13715         13793         13750      13751.1     29.319883
>>+  10         13813         13970         13921      13906.5     47.551726
>>Difference at 95.0% confidence
>>        155.4 +/- 37.1159
>>        1.13009% +/- 0.269913%
>>        (Student's t, pooled s = 39.502)
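
The interval ministat prints is easy to sanity-check by hand, for
anyone curious where it comes from: with N=10 per set, 18 degrees of
freedom, and t(0.975, 18) of roughly 2.101,

  difference of means:  13906.5 - 13751.1           =  155.4
  standard error:       39.502 * sqrt(1/10 + 1/10)  ~=  17.67
  95% interval:         2.101 * 17.67               ~=  37.1   (ministat: 37.1159)
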
>>
>>peppercorn:~/tmp/netperf/hz> ministat *UP
>>x hz.UP
>>+ vendor.UP
>>+--------------------------------------------------------------------------+
>>|x           x xx   x      xx+   ++x+   ++  * +   +                      +|
>>|        |_________M_A_______|___|______M_A____________|                 |
>>+--------------------------------------------------------------------------+
>>    N           Min           Max       Median           Avg        Stddev
>>x  10         14067         14178         14116      14121.2     31.279386
>>+  10         14141         14257         14170      14175.9     33.248058
>>Difference at 95.0% confidence
>>        54.7 +/- 30.329
>>        0.387361% +/- 0.214776%
>>        (Student's t, pooled s = 32.2787)
>>
> 
> And what was the cost in CPU load to get the extra couple of bytes of
> throughput?
> 
> Machines have to do other things too. That is the entire point of SMP
> processing. Of course, increasing the granularity of your clocks will
> cause you to process events that are clock-reliant more quickly, so
> you might see more "throughput", but there is a cost. Weighing (and
> measuring) those costs is more important than what a single benchmark
> does.
> 
> At some point you're going to have to figure out that there's a reason
> that every time anyone other than you tests FreeBSD it completely pigs
> out. Squeezing out some extra bytes in netperf isn't "performance".
> Performance is everything that a system can do. If you're eating 10%
> more CPU to get a few more bytes in netperf, you haven't increased the
> performance of the system.
> 
> You need to do things like run two benchmarks at once. What happens to
> the "performance" of one benchmark when you increase the "performance"
> of the other? Run a database benchmark while you're running a network
> benchmark, or while you're passing a controlled stream of traffic
> through the box.
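
For what it's worth, a rough sketch of such a combined run; the command
names below are placeholders, assuming the netrate HTTP client from the
test above and whatever database benchmark happens to be at hand:

  # sample CPU and interrupt load for the duration of the run
  vmstat 1 60 > vmstat.out &

  # generate network and database load at the same time
  ./http-client &      # placeholder: the netrate/http load generator
  ./db-benchmark &     # placeholder: a MySQL or similar workload
  wait
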
> 
> I just finished a couple of simple tests and find that 6.1 has not
> improved at all since 5.3 in basic interrupt processing and context
> switching performance (which is the basic building block for all
> system performance). Bridging 140K pps (a full 100Mb/s load) uses 33%
> of the CPU(s) in FreeBSD 6.1, and 17% in DragonFly 1.5.3, on a
> dual-core 1.8GHz Opteron system. (I finally got vmstat to work
> properly after getting rid of your stupid 2-second timeout in the MAC
> learning table.) I'll be doing some MySQL benchmarks next week while
> passing a controlled stream through the system. But since I know that
> the controlled stream eats up twice as much CPU on FreeBSD, I already
> know much of the answer, since FreeBSD will have much less CPU left
> over to work with.
> 
> It's unfortunate that you seem to be tuning for one thing while
> completely unaware of all of the other things you're breaking in the
> process. The Linux camp understands that in order to scale well they
> have to sacrifice some network performance. Sadly, they've gone too
> far and now the OS is no longer suitable as a high-end network
> appliance. I'm not sure what Matt understands, because he never
> answers any questions, but his results are so far quite impressive.
> One thing for certain is that it's not all about how many packets you
> can hammer out your socket interface (nor has it ever been). It's
> about improving the efficiency of the system on an overall basis.
> That's what SMP processing is all about, and you're never going to get
> where you want to be using netperf as your guide.
> 
> I'd also love to see the results of the exact same test with only one
> CPU enabled, to see how well you scale generally. I'm astounded that
> no one ever seems to post one- vs. two-CPU performance, which is the
> entire point of SMP.
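
That comparison is cheap to run, for what it's worth: boot the same box
once with the SMP kernel and once with a kernel built without "options
SMP" (which is presumably what the *.UP numbers above already are), and
confirm what the kernel actually sees with:

  sysctl hw.ncpu
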
> 
> 
> DT
> 
> 

You have some valid points, but they get lost in your overly abrasive
tone.  Several of us have watched your behaviour on the DFly lists, and
I dearly hope that it doesn't spill over onto our lists.  It would be a
shame to lose your insight and input.

Scott


