bhyve disk performance issue

From: Matthew Grooms <mgrooms_at_shrew.net>
Date: Fri, 16 Feb 2024 17:19:40 UTC

Hi All,

I'm in the middle of a project that involves building out a handful of
servers to host virtual Linux instances. Part of that includes testing
bhyve to see how it performs. The intent is to compare host storage
options such as raw vs zvol block devices and ufs vs zfs disk images,
using hardware RAID vs zfs-managed disks. It also involves testing
different guest options such as nvme vs virtio block storage.
Unfortunately I hit a roadblock due to a performance issue that I
can't explain, so I'd like to bounce it off the list. Here are the
hardware specs for the systems ...

Intel Xeon 6338 CPU ( 32c/64t )
256G 2400 ECC RAM
16x 4TB Samsung SATA3 SSDs
Avago 9361-16i ( mrsas - HW RAID10 )
Avago 9305-16i ( mpr - zpool RAID10 )

I started by performing some bonnie++ benchmarks on the host system 
running AlmaLinux 9.3 and FreeBSD 14 to get a baseline using HW RAID10. 
The disk controllers are PCIe v3 x8, which should be adequate
considering the 6Gbit/s SATA disk interfaces ...
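
For anyone who wants to reproduce the baseline, a plain bonnie++ run
against a directory on the array should approximate the numbers below
( the mount point is just an example; bonnie++ defaults to a working
set of twice RAM, which lines up with the ~500G test sizes shown ):

  # run as root against the array mount point
  bonnie++ -d /mnt/bench -u root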

RHEL9 + EXT4
----------------------------------------------------------------------------------------------------
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
localhost.loca 502G 2224k  99  2.4g  96  967m  33 3929k  93  1.6g  33 +++++ +++
Latency              4403us   30844us   69444us   27015us   22675us    8754us
Version  2.00       ------Sequential Create------ --------Random Create--------
localhost.localdoma -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               118us     108us     829us     101us       6us     393us

FreeBSD14 + UFS
----------------------------------------------------------------------------------------------------
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
test.shrew. 523440M  759k  99  2.0g  99  1.1g  61 1945k  99  1.3g  42 264.8  99
Latency             11106us   31930us     423ms    4824us     321ms   12881us
Version  1.98       ------Sequential Create------ --------Random Create--------
test.shrew.lab      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 13095.927203  20 +++++ +++ 25358.227072  24 13573.129095  19 +++++ +++ 25354.222712  23
Latency              4382us      13us      99us    3125us       5us      67us

Good enough. The next thing I tried was running the same benchmark in
a RHEL9 guest to test the different storage config options, but that's
when I started having difficulty getting consistent, repeatable
results. At first the results appeared somewhat random, but after a
few days of trial and error I started to identify a pattern. The guest
would sometimes perform well for a while, usually after a restart, and
then hit a sharp drop-off in performance over time. For example:

Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
linux-blk    63640M  694k  99  1.6g  99  737m  76  985k  99  1.3g  69 +++++ +++
Latency             11579us     535us   11889us    8597us   21819us    8238us
Version  2.00       ------Sequential Create------ --------Random Create--------
linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency              7620us     126us    1648us     151us      15us     633us

--------------------------------- speed drop ---------------------------------

Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
linux-blk    63640M  676k  99  451m  99  314m  93  951k  99  402m  99 15167 530
Latency             11902us    8959us   24711us   10185us   20884us    5831us
Version  2.00       ------Sequential Create------ --------Random Create--------
linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16     0  96 +++++ +++ +++++ +++     0  96 +++++ +++     0  75
Latency               343us     165us    1636us     113us      55us    1836us

The above test ran 6 times over roughly 20 minutes, producing the
higher speed results, before slowing to the lower speed result. The
time to complete the benchmark also increased from about 2.5 minutes
to about 8 minutes. To ensure I didn't miss something in my baseline,
I repeated the benchmark on the host system in a loop for about an
hour, but the output was consistent with my original testing. To
ensure performance didn't bounce back after it slowed, I repeated the
benchmark in a loop on the guest for about 4 hours, but the output was
also consistent. I then tried switching between a block device, an img
on ufs, an img on a zfs dataset and zvols, as well as switching
between virtio block and nvme in the guest. All of these options
appeared to suffer from the same problem, albeit with slightly
different performance numbers. I also tried swapping out the storage
controller and running some benchmarks using a zpool over the disk
array to see if that was any better. Same issue. I also tried pinning
the guest CPUs to specific host cores ( -p x:y ). No improvement.
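
For reference, switching between the storage variants amounts to
changing the guest's disk device string, roughly along these lines
( the slot number, zvol name and image paths are examples, not the
exact ones used ):

  # virtio-blk vs nvme emulation, backed by a zvol or a disk image
  -s 4,virtio-blk,/dev/zvol/zroot/linux-blk
  -s 4,virtio-blk,/vm/linux-blk.img
  -s 4,nvme,/dev/zvol/zroot/linux-blk

  # CPU pinning variant ( -p vcpu:hostcpu, one entry per vcpu )
  -p 0:8 -p 1:9 -p 2:10 -p 3:11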

Here is a list of a few other things I'd like to try:

1) Wiring guest memory ( unlikely as it's 32G of 256G; see the example
   after this list )
2) Downgrading the host to 13.2-RELEASE
3) Testing guest OSs other than RHEL8 & RHEL9
4) Testing a different model of RAID/SAS controller
5) Testing xen vs bhyve disk performance
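
For item 1, wiring the guest memory should just be a matter of adding
-S to the bhyve invocation so the 32G can't be paged out, e.g. ( other
flags elided ):

  bhyve -S -c 8 -m 32G ... <vmname>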

At this point I thought it prudent to post here for some help. Does 
anyone have an idea of what might cause this issue? Does anyone have 
experience testing bhyve with an SSD disk array of this size or larger? 
I'm happy to provide more data points on request.

Thanks,

-Matthew