bhyve win-guest benchmark comparing
Harry Schmalzbauer
freebsd at omnilan.de
Mon Oct 22 11:26:07 UTC 2018
Hello,
I started using bhyve for some of my local setups about one or two years
ago. I'm using rc.local along with some homebrew start_if.NG scripts to
connect tap(4) and ng_bridge(4) with a single vlan(4) uplink child, so I
know bhyve well enough to be aware that it isn't comparable with ESXi as
a "product" in many ways, and I'm completely fine with the extra work
required to use bhyve!
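As a sketch, that tap(4)/ng_bridge(4) wiring looks roughly like the
following (interface and node names are placeholders, not my actual
start_if.NG script):

```shell
# Rough sketch only; vlan100/tap0/vmbridge are placeholder names.
kldload ng_ether ng_bridge 2>/dev/null

ifconfig tap0 create up

# Hang a bridge node off the vlan uplink's lower (wire) hook
ngctl mkpeer vlan100: bridge lower link0
ngctl name vlan100:lower vmbridge

# Reconnect the uplink's upper (protocol) hook so host traffic still flows
ngctl connect vlan100: vmbridge: upper link1

# Attach the guest's tap interface on the next free bridge link
ngctl connect tap0: vmbridge: lower link2
```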
But I've always felt that there are significant performance penalties,
which haven't been a big issue for my own guests (tinkering and WSUS
Windows). Since I wanted to evaluate replacing ESXi instances elsewhere,
I decided to run some hopefully meaningful benchmark tests.
Unfortunately, the performance penalty is much too high. I'd like to
share my measurements here.
Host-Config:
database-------------------------------------
| |
da0 da1 <- Windows Server 2012R2, SQLExpress2017
/ \ |
| r0 | |
S S S
S S S
D D D
mps0/vmhba0
|
| ,- bhyve-ssd (ufs 12.0-beta1)
| ahci0/vmhba32 ---:
| | `- esxi-ssd (6.7)
| |
32GXeonE34x3.6G(hyperthreading enabled)
So the guest boots from its own physical disk (a single SSD via mps).
Guest-Config:
When the host was running FreeBSD, the relevant bhyve disk setup reads
"-s 3,ahci,hd:/dev/da1,hd:/dev/da0"
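For context, a complete invocation along those lines could look as
follows; apart from the quoted disk line, the slot numbers, flags and
bootrom path are illustrative assumptions, not my exact command:

```shell
# Illustrative sketch; only the "-s 3,ahci,..." disk line is the real setup.
bhyve -c 2 -m 4G -H -w \
  -s 0,hostbridge \
  -s 3,ahci,hd:/dev/da1,hd:/dev/da0 \
  -s 4,virtio-net,tap0 \
  -s 31,lpc \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  winguest

# The virtio-blk variant mentioned below would swap the disk line for e.g.:
#   -s 3,virtio-blk,/dev/da1 -s 5,virtio-blk,/dev/da0
```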
Likewise, when the host was running ESXi, the corresponding
disks/vml.... were attached to the ESXi "SATA Controller" (via RDM).
So in both cases the guest OS's (Win2k12R2) built-in generic AHCI driver
was in use, for both the OS system disk and the DB/benchmark disk.
Both hypervisors assign 2 CPU cores (in one package) and 4GB RAM.
The guest operating system of choice is Windows Server 2012R2. As a
real-world application I chose MS SQL Server Express 2017, simply
because I was looking for an "industry" benchmark tool and found a trial
version that was easy to set up and brings test data along with several
workload templates.
After OS setup was done, all (G)UI actions were performed through an RDP
session in both cases.
Test-Runs:
Each hypervisor had only the one benchmark guest running; no other
tasks/guests were running besides the system's native standard processes.
Since the time between powering up the guest and finishing logon
differed notably between the two hosts (~5s vs. ~20s), I did a quick
synthetic IO test beforehand.
I'm using IOmeter, since heise.de published a great test pattern called
IOmix -- about 18 years ago, I guess. This access pattern has always
reflected system performance for interactive, non-calculation-centric
application use remarkably well, and it's still my favourite, even
though throughput and latency have changed by some orders of magnitude
over the last decade (I had also defined something for fio which mimics
IOmix and shows reasonably comparable results, but I still prefer
IOmeter for homogeneous IO benchmarking).
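As a rough illustration of such a fio definition (this is not my actual
IOmix imitation; the device path and the read/write and block-size mix
are made-up placeholders), a mixed random workload can be expressed like
this:

```shell
# Hypothetical mixed random read/write pattern with varied block sizes;
# the percentages are placeholders, not the real IOmix distribution.
fio --name=iomix-like --filename=/dev/da1 --direct=1 \
    --ioengine=posixaio --rw=randrw --rwmixread=70 \
    --bssplit=512/10:4k/40:16k/25:64k/20:256k/5 \
    --iodepth=4 --time_based --runtime=60
```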
The results differ by about a factor of 7 :-(
~3800 iops & 69 MB/s (guest CPU usage: 42% IOmeter + 12% irq)
vs.
~29000 iops & 530 MB/s (guest CPU usage: 11% IOmeter + 19% irq)
[With a debug kernel and debug malloc, the numbers are 3000 iops & 56 MB/s;
virtio-blk instead of ahci,hd: yields 5660 iops & 104 MB/s with a
non-debug kernel -- much better, but with even higher CPU load and still
a factor of 4 slower.]
What I don't understand is why the IOmeter process differs that much in
CPU utilization. It's the same binary on the same (guest) OS with the
same OS driver and the same underlying hardware -- "just" the AHCI
emulation and the vmm differ...
Unfortunately, the picture for virtio-net vs. vmxnet3 is similarly sad.
Copying a single 5 GB file from a CIFS share to the DB SSD results in
100% guest CPU usage, of which 40% is irqs, and the throughput maxes out
at ~40 MB/s.
Copying the same file from the same source with the same guest on the
same host, but with the host booted into ESXi, takes 20% guest CPU usage
while transferring 111 MB/s -- the GbE uplink limit.
These synthetic benchmarks explain the perceptible difference between
the two hypervisors when using a guest very well, though fortunately the
gap usually isn't that large in practice. So I continued with the
database test I had originally aimed for.
Disclaimer: I'm no database expert, and this isn't about achieving
maximum performance from a DB workload. It's just about generating
reproducible CPU-bound load together with IO load to illustrate overall
performance _differences_.
So I combined two "industry standard" benchmarks from "Benchmark
Factory" and scaled them (TPC-C by 75 and TPC-H by 3) to generate a
database 10 GB in size.
Interestingly, the difference is not nearly as big as the previous
results suggested.
There's clearly a difference, but the worst case isn't even a factor of 2.
I did two consecutive runs for each hypervisor. Run4 and Run5 were on
bhyve, Run6 and Run7 on ESXi.
Please see the graph here:
http://www.schmalzbauer.de/downloads/sqlbench_bhyve-esxi.png
Even more interestingly, the disk load "graphs" looked very similar. I
don't really have a graph for the bhyve run, but during it I saw
200-500 MB/s transfer bandwidth, which is exactly what I see in the ESXi
graph.
So the bhyve setup is able to deliver consistently high performance in
that case!
But there's a discrepancy which I don't understand: almost any other
application suffers from disk IO constraints on bhyve.
Of course, block size is the most important parameter here, but MSSQL
doesn't use big block sizes as far as I know (formerly 8k, and since
2010 or so, 64k).
This result perfectly matches my observation with my local WSUS guest,
which is also a database load; I never found performance to be an issue
there.
I have another picture comparing pure synthetic benchmarks, showing only
small "FPU/ALU" differences (memory bandwidth measured with Intel's mlc
was exactly the same) but huge disk IO differences, even though I used
virtio-blk instead of ahci,hd: for bhyve (so the HDD selection shows
"Red Hat VirtIO"):
http://www.schmalzbauer.de/downloads/sbmk_bhyve-esxi.png
Question:
Are these performance issues (emulation-related only, I guess) well
known? I mean, does somebody know what needs to be done in which area in
order to catch up with the other results, so that it's just a matter of
time/resources?
Or are these results surprising, so that extensive analysis must be done
before anybody can tell how to fix the IO limitations?
Is the root cause of the problematically low virtio-net throughput
perhaps the same as for the disk IO limits? Both really hurt in my use
case, and the host isn't idling in proportion; it even shows higher load
with lower results. So even if the lower user-experience performance
were considered tolerable, the guest/host density would only be half.
Thanks,
-harry
More information about the freebsd-virtualization
mailing list