What version of FBSD does Yahoo run?
TM4525 at aol.com
Thu Oct 7 14:35:28 PDT 2004
In a message dated 10/7/04 4:06:34 PM Eastern Daylight Time, drosih at rpi.edu
writes:
Here's one benchmark, showing UDP packet/second generation
rate from userland on a dual xeon machine under various
target loads:
Desired    Optimal    5.x-UP     5.x-SMP    4.x-UP     4.x-SMP
 50000      50000      50000      50000      50000      50000
 75000      75000      75001      75001      75001      75001
100000     100000     100000     100000     100000     100000
125000     125000     125000     125000     125000     125000
150000     150000     150015     150014     150015     150015
175000     175000     175008     175008     175008     169097
200000     200000     200000     179621     181445     169451
225000     225000     225022     179729     181367     169831
250000     250000     242742     179979     181138     169212
275000     275000     242102     180171     181134     169283
300000     300000     242213     179157     181098     169355
That does show results for both single-processor (5.x-UP, 4.x-UP)
and multiprocessor (5.x-SMP, 4.x-SMP) benchmarks. It may be
that he ignored the table as soon as he read "dual Xeon".
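The quoted numbers read like the output of a paced UDP sender that tries to hit each "desired" rate and records what it actually achieved. A minimal sketch of such a generator (hypothetical; not the actual tool used in the thread, and the destination port and payload size are arbitrary):

```python
import socket
import time

def paced_udp_send(target_pps, duration, host="127.0.0.1", port=9999, size=18):
    """Busy-poll pacer: try to emit target_pps UDP packets/second for
    `duration` seconds, and report the packet rate actually achieved."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * size
    interval = 1.0 / target_pps
    sent = 0
    start = time.perf_counter()
    next_send = start
    while True:
        now = time.perf_counter()
        if now - start >= duration:
            break
        if now >= next_send:
            sock.sendto(payload, (host, port))
            sent += 1
            next_send += interval   # pace against the target rate
    elapsed = time.perf_counter() - start
    sock.close()
    return sent / elapsed
```

Comparing the returned rate against the target for increasing targets would produce a table shaped like the one above: achieved tracks desired until the system can no longer keep up.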
--------------------------------------------
I haven't seen this before. If I had, I would immediately ask:
- What is the control here? What does your "benchmark" test?
- Is this on a gigabit link? What are the packet sizes? Was network
availability a factor in limiting the test results?
- What does "target load" mean? Does it mean don't try to send
more than that rate? If so, what does reaching it show? If you
don't measure the utilization it takes to saturate your "target",
I don't see the point of having it.
- It seems that the only thing you could learn from this test is
the maximum pps you could achieve unidirectionally out of a
system. Why is that useful, since it's hardly ever the requirement
unless you're building a traffic generator?
- A relatively slow machine (a 1.7GHz Celeron with a 32-bit/33MHz
fxp NIC running 4.9) pushes over 250 Kpps, so why is your machine,
with seemingly superior hardware, so slow?
- The test seems backwards. What you are doing in this test is
not something that any device does. If you want to measure user-space
performance, it has to include receive and transmit response, not
just transmit. Perhaps it indirectly shows process-switching performance,
but it doesn't tell you much about network performance, since transmit
is far less demanding than receive in terms of processing requirements.
When you transmit you know exactly what you have; when you receive
you have to do a lot of checking and testing to see what needs to
be done.
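The transmit/receive asymmetry described above can be illustrated with a toy UDP echo pair (a hypothetical sketch, not the benchmark from the thread): the transmit side just fires a payload, while the receive side has to validate each datagram before it can act on it.

```python
import socket
import threading

def start_echo_server(expected_len):
    """Bind a UDP echo server on an ephemeral loopback port.
    The receive path checks each datagram (length, payload marker)
    before replying -- work the transmit path never has to do.
    Returns (port, stop_event, thread)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("127.0.0.1", 0))
    port = srv.getsockname()[1]
    stop = threading.Event()

    def loop():
        srv.settimeout(0.2)
        while not stop.is_set():
            try:
                data, peer = srv.recvfrom(2048)
            except socket.timeout:
                continue
            if len(data) != expected_len:   # length sanity check
                continue
            if data[:1] != b"\x42":         # payload marker check
                continue
            srv.sendto(data, peer)          # only valid packets get echoed
        srv.close()

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return port, stop, t

def echo_once(port, payload, timeout=1.0):
    """Transmit side: just send and wait -- no per-packet validation."""
    c = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    c.settimeout(timeout)
    c.sendto(payload, ("127.0.0.1", port))
    data, _ = c.recvfrom(2048)
    c.close()
    return data
```

Even in this trivial example the receive loop carries all the conditional logic; a real stack's receive path (demultiplexing, checksums, reassembly) is heavier still.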
When I test network performance, I want to isolate kernel
performance if possible. If you're evaluating the system for use as
a network device (such as a router, a bridge, or a firewall), you
have to eliminate userland from the formula. The interaction between
user space and the kernel is a key factor in your "benchmark" that is absent
in a pure network device, so it's not useful in testing pure stack
performance.
Also, there is a significant problem with "maximum packets/second" tests.
As you reach high levels of saturation, you often get abnormal processing
requirements that skew the results. For example, as bus saturation climbs,
the processing requirements change: I/Os take longer waiting for access
to the bus, transmit queues may fill, and so on. Testing under such
unusual conditions may exercise abnormal recovery code that would never
run on a machine under "normal" loads.
A better way to test is to measure utilization under realistically normal
conditions. Machines can get very inefficient if their recovery code is poor,
but that may not matter, since no one realistically runs a machine at 98%
utilization.
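The "utilization under normal load" approach could be sketched as follows: offer a fixed, moderate packet rate and report how much CPU it costs, rather than pushing until something breaks. (A hypothetical illustration; the destination, payload, and rates are placeholders, and process CPU time stands in for whole-system utilization.)

```python
import socket
import time

def utilization_at_load(pps, duration, host="127.0.0.1", port=9999):
    """Send UDP packets at a fixed moderate rate and return
    (achieved_pps, cpu_utilization), where utilization is
    process CPU time divided by wall-clock time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * 64
    interval = 1.0 / pps
    sent = 0
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    next_send = wall0
    while True:
        now = time.perf_counter()
        if now - wall0 >= duration:
            break
        if now < next_send:
            time.sleep(next_send - now)   # idle between packets, no spinning
        sock.sendto(payload, (host, port))
        sent += 1
        next_send += interval
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    sock.close()
    return sent / wall, cpu / wall
```

A low utilization figure at the offered load leaves headroom visible; comparing utilization across systems at the same moderate rate avoids the recovery-code artifacts that saturation tests trip over.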
Assuming that your benchmark does test something, your "results"
seem to show that a uniprocessor machine is substantially
more efficient than an SMP box. It also seems that the gap between
UP and SMP performance has widened in 5.x. Wasn't one of the goals
of 5.x to substantially improve SMP performance? This seems to show
the opposite.
TM