bhyve win-guest benchmark comparing
Harry Schmalzbauer
freebsd at omnilan.de
Mon Oct 22 11:26:07 UTC 2018
Hello,
I started using bhyve for some of my local setups about one or two years
ago. I'm using rc.local along with some homebrew start_if.NG scripts to
connect tap(4) and ng_bridge(4) with a single vlan(4) uplink child, so I
know bhyve well enough to be aware that it isn't comparable with ESXi as
a "product" in many ways, and I'm completely fine with the extra work
required to use bhyve!
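As a sketch, that tap(4)/ng_bridge(4) wiring looks roughly like the
following (interface and node names are placeholders, not my actual
start_if.NG script):

```shell
# Rough sketch only; vlan100/tap0/vmbridge are placeholder names.
kldload ng_ether ng_bridge 2>/dev/null

ifconfig tap0 create up

# Hang a bridge node off the vlan uplink's lower (wire) hook
ngctl mkpeer vlan100: bridge lower link0
ngctl name vlan100:lower vmbridge

# Reconnect the uplink's upper (protocol) hook so host traffic still flows
ngctl connect vlan100: vmbridge: upper link1

# Attach the guest's tap interface on the next free bridge link
ngctl connect tap0: vmbridge: lower link2
```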
But I've always felt that there are significant performance penalties,
which haven't been a big issue for my own guests (tinkering and WSUS
Windows). Since I wanted to evaluate replacing ESXi instances elsewhere,
I decided to run some hopefully meaningful benchmark tests.
Unfortunately, the performance penalty is much too high. I'd like to
share my measurements here.
Host-Config:
database-------------------------------------
| |
da0 da1 <- Windows Server 2012R2, SQLExpress2017
/ \ |
| r0 | |
S S S
S S S
D D D
mps0/vmhba0
|
| ,- bhyve-ssd (ufs 12.0-beta1)
| ahci0/vmhba32 ---:
| | `- esxi-ssd (6.7)
| |
32GXeonE34x3.6G(hyperthreading enabled)
So the guest boots from its own physical disk (a single SSD via mps).
Guest-Config:
When the host was running FreeBSD, the relevant bhyve disk setup reads
"-s 3,ahci,hd:/dev/da1,hd:/dev/da0"
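For context, a complete invocation along those lines could look as
follows; apart from the quoted disk line, the slot numbers, flags and
bootrom path are illustrative assumptions, not my exact command:

```shell
# Illustrative sketch; only the "-s 3,ahci,..." disk line is the real setup.
bhyve -c 2 -m 4G -H -w \
  -s 0,hostbridge \
  -s 3,ahci,hd:/dev/da1,hd:/dev/da0 \
  -s 4,virtio-net,tap0 \
  -s 31,lpc \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  winguest

# The virtio-blk variant mentioned below would swap the disk line for e.g.:
#   -s 3,virtio-blk,/dev/da1 -s 5,virtio-blk,/dev/da0
```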
Likewise, when the host was running ESXi, the corresponding
disks/vml.... were attached to the ESXi "SATA Controller" (via RDM).
So in both cases the guest OS's (Win2k12R2) built-in generic AHCI driver
was in use, for both the OS system disk and the DB/benchmark disk.
Both hypervisors assign 2 CPU cores (in one package) and 4GB RAM.
The guest operating system of choice is Windows Server 2012R2. As a
real-world application I chose MS SQL Server Express 2017, simply
because I was looking for an "industry" benchmark tool and found a trial
version that was easy to set up and brings test data along with several
workload templates.
After OS setup was done, all (G)UI actions were performed through an RDP
session in both cases.
Test-Runs:
Each hypervisor had only the one benchmark guest running; no other
tasks/guests were running besides the system's native standard processes.
Since the time between powering up the guest and finishing logon
differed notably between the two hosts (~5s vs. ~20s), I did a quick
synthetic IO test beforehand.
I'm using IOmeter, since heise.de published a great test pattern called
IOmix -- about 18 years ago, I guess. This access pattern has always
reflected system performance for interactive, non-calculation-centric
application use remarkably well, and it's still my favourite, even
though throughput and latency have changed by some orders of magnitude
over the last decade (I had also defined something for fio which mimics
IOmix and shows reasonably comparable results, but I still prefer
IOmeter for homogeneous IO benchmarking).
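As a rough illustration of such a fio definition (this is not my actual
IOmix imitation; the device path and the read/write and block-size mix
are made-up placeholders), a mixed random workload can be expressed like
this:

```shell
# Hypothetical mixed random read/write pattern with varied block sizes;
# the percentages are placeholders, not the real IOmix distribution.
fio --name=iomix-like --filename=/dev/da1 --direct=1 \
    --ioengine=posixaio --rw=randrw --rwmixread=70 \
    --bssplit=512/10:4k/40:16k/25:64k/20:256k/5 \
    --iodepth=4 --time_based --runtime=60
```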
The results differ by about a factor of 7 :-(
~3800 iops & 69 MB/s (guest CPU usage: 42% IOmeter + 12% irq)
vs.
~29000 iops & 530 MB/s (guest CPU usage: 11% IOmeter + 19% irq)
[With a debug kernel and debug malloc, the numbers are 3000 iops & 56 MB/s;
virtio-blk instead of ahci,hd: yields 5660 iops & 104 MB/s with a
non-debug kernel -- much better, but with even higher CPU load and still
a factor of 4 slower.]
What I don't understand is why the IOmeter process differs that much in
CPU utilization. It's the same binary on the same (guest) OS with the
same OS driver and the same underlying hardware -- "just" the AHCI
emulation and the vmm differ...
Unfortunately, the picture for virtio-net vs. vmxnet3 is similarly sad.
Copying a single 5 GB file from a CIFS share to the DB SSD results in
100% guest CPU usage, of which 40% is irqs, and the throughput maxes out
at ~40 MB/s.
Copying the same file from the same source with the same guest on the
same host, but with the host booted into ESXi, takes 20% guest CPU usage
while transferring 111 MB/s -- the GbE uplink limit.
These synthetic benchmarks explain the perceptible difference between
the two hypervisors when using a guest very well, though fortunately the
gap usually isn't that large in practice. So I continued with the
database test I had originally aimed for.
Disclaimer: I'm no database expert, and this isn't about achieving
maximum performance from a DB workload. It's just about generating
reproducible CPU-bound load together with IO load to illustrate overall
performance _differences_.
So I combined two "industry standard" benchmarks from "Benchmark
Factory" and scaled them (TPC-C by 75 and TPC-H by 3) to generate a
database 10 GB in size.
Interestingly, the difference is not nearly as big as the previous
results suggested.
There's clearly a difference, but the worst case isn't even a factor of 2.
I did two consecutive runs for each hypervisor. Run4 and Run5 were on
bhyve, Run6 and Run7 on ESXi.
Please see the graph here:
http://www.schmalzbauer.de/downloads/sqlbench_bhyve-esxi.png
Even more interestingly, the disk load "graphs" looked very similar. I
don't really have a graph for the bhyve run, but during it I saw
200-500 MB/s transfer bandwidth, which is exactly what I see in the ESXi
graph.
So the bhyve setup is able to deliver consistently high performance in
that case!
But there's a discrepancy which I don't understand: almost any other
application suffers from disk IO constraints on bhyve.
Of course, block size is the most important parameter here, but MSSQL
doesn't use big block sizes as far as I know (formerly 8k, and since
2010 or so, 64k).
This result perfectly matches my observation with my local WSUS guest,
which is also a database load; I never found performance to be an issue
there.
I have another picture comparing pure synthetic benchmarks, showing only
small "FPU/ALU" differences (memory bandwidth measured with Intel's mlc
was exactly the same) but huge disk IO differences, even though I used
virtio-blk instead of ahci,hd: for bhyve (so the HDD selection shows
"Red Hat VirtIO"):
http://www.schmalzbauer.de/downloads/sbmk_bhyve-esxi.png
Question:
Are these performance issues (emulation-related only, I guess) well
known? I mean, does somebody know what needs to be done in which area in
order to catch up with the other results, so that it's just a matter of
time/resources?
Or are these results surprising, so that extensive analysis must be done
before anybody can tell how to fix the IO limitations?
Is the root cause of the problematically low virtio-net throughput
perhaps the same as for the disk IO limits? Both really hurt in my use
case, and the host isn't idling in proportion; it even shows higher load
with lower results. So even if the lower user-experience performance
were considered tolerable, the guest/host density would only be half.
Thanks,
-harry
More information about the freebsd-virtualization
mailing list