Re: bhyve disk performance issue

From: Matthew Grooms <mgrooms_at_shrew.net>
Date: Wed, 28 Feb 2024 20:03:03 UTC
On 2/28/24 13:31, Vitaliy Gusev wrote:
> Hi,  Matthew.
>
Hi Vitaliy,

Thanks for the pointers.

> I still do not know what command line was used for bhyve. I  couldn't 
> find it through the thread, sorry. And I couldn't find virtual disk 
> size that you used.
>
Sorry about that. I'll try to get you the exact command line invocation 
used to launch the guest process once I have test hardware again.

>
> Could you, please, simplify bonnie++ output, it is hard to decode due 
> to alignment and use exact numbers for:
>
> READ seq  - I see you had 1.6GB/s for the good time and ~500MB/s for 
> the worst.
> WRITE seq  - ...
>
I summarized the output for you. Here it is again:

Fast: ~ 1.6g/s seq write and 1.3g/s seq read
Slow: ~ 451m/s seq write and 402m/s seq read
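
For a sense of scale, here is a minimal sketch that turns those four figures into percentages. It assumes the k/m/g suffixes printed by bonnie++ are powers of 1024; the helper names to_kib and pct_of are made up for this illustration and are not part of bonnie++.

```shell
#!/bin/sh
# Convert a bonnie++-style rate such as "451m" or "1.6g" to KiB/s.
# Assumption: k/m/g are powers of 1024; a bare number is taken as KiB/s.
to_kib() {
    echo "$1" | awk '{
        n = $0 + 0                               # leading numeric part
        u = tolower(substr($0, length($0), 1))   # trailing unit letter
        if (u == "g") n *= 1024 * 1024
        else if (u == "m") n *= 1024
        printf "%.0f\n", n
    }'
}

# Print the slow rate as a rounded percentage of the fast rate.
pct_of() {
    awk -v s="$(to_kib "$1")" -v f="$(to_kib "$2")" \
        'BEGIN { printf "%.0f\n", 100 * s / f }'
}

echo "seq write: $(pct_of 451m 1.6g)% of the fast rate"
echo "seq read:  $(pct_of 402m 1.3g)% of the fast rate"
```

Under that assumption the slow runs come out at roughly 28% (write) and 30% (read) of the fast runs.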

> If you have slow results both for the read and write operations, you 
> probably should perform testing _only_ for READs and do not do 
> anything until READs are fine.
>
> Again, if you have slow performance for Ext4 Filesystem in guest VM 
> placed on the passed disk image, you should try to test on the raw 
> disk image, i.e. without Ext4, because it could be related.
>
> If you run the test inside the VM on a filesystem, you may have to deal 
> with filesystem bottlenecks, bugs, fragmentation etc. Do you want to fix 
> them all? I don’t think so.
>
> For example, if you pass a 40G disk image and create an Ext4 filesystem, 
> and during testing the filesystem becomes more than 80% full, I/O 
> performance could degrade.
>
> You probably should eliminate that guest filesystem behaviour when you 
> meet IO performance slowdown.
>
> Also, please look at the TRIM operations when you perform WRITE 
> testing. It could be also related to the slow write I/O.
>
The virtual disks were provisioned with either a 128G disk image or a 
1TB raw partition, so I don't think space was an issue.

Trim is definitely not an issue. I'm using a tiny fraction of the 32TB 
array and have tried both heavily under-provisioned HW RAID10 and SW 
RAID10 using GEOM. The latter was tested after sending full trim resets 
to all drives individually.

I will try to incorporate the rest of your feedback into my next round 
of testing. If I can find a benchmark tool that works with a raw block 
device, that would be ideal.
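
As a starting point, a plain dd read against the raw device may be enough. This is a minimal sketch, assuming GNU dd inside the Linux guest and root privileges for the cache drop. The function names, the DD_IFLAG variable and the /dev/vdb device name in the usage note are all placeholders for illustration.

```shell
#!/bin/sh
# Flush dirty pages and drop the guest page cache when permitted (needs root).
drop_caches() {
    sync
    if [ -w /proc/sys/vm/drop_caches ]; then
        { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
}

# raw_seq_read TARGET [COUNT_MIB]
# Sequentially reads COUNT_MIB mebibytes (default 1024) from TARGET in
# 1 MiB blocks and prints dd's summary line. Nothing is written to TARGET.
# DD_IFLAG defaults to "direct" (O_DIRECT reads with GNU dd); set it to
# an empty string to read through the page cache instead.
raw_seq_read() {
    target=$1
    count=${2:-1024}
    flag=${DD_IFLAG-direct}
    drop_caches
    dd if="$target" of=/dev/null bs=1M count="$count" \
        ${flag:+iflag=$flag} 2>&1 | tail -n 1
}
```

Usage would be something like `raw_seq_read /dev/vdb 8192` from the same cron job. Because it only reads, it says nothing about the write path.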

Thanks,

-Matthew


> ——
> Vitaliy
>
>> On 28 Feb 2024, at 21:29, Matthew Grooms <mgrooms@shrew.net> wrote:
>>
>> On 2/27/24 04:21, Vitaliy Gusev wrote:
>>> Hi,
>>>
>>>
>>>> On 23 Feb 2024, at 18:37, Matthew Grooms <mgrooms@shrew.net> wrote:
>>>>
>>>>> ...
>>>> The problem occurs when an image file is used on either ZFS or UFS. 
>>>> The problem also occurs when the virtual disk is backed by a raw 
>>>> disk partition or a ZVOL. This issue isn't related to a specific 
>>>> underlying filesystem.
>>>>
>>>
>>> Do I understand correctly that you ran the tests inside the guest VM 
>>> on an ext4 filesystem? If so, you should be aware of the additional 
>>> overhead compared to running the tests on the host.
>>>
>> Hi Vitaliy,
>>
>> I appreciate you providing the feedback and suggestions. I spent over 
>> a week trying as many combinations of host and guest options as 
>> possible to narrow this issue down to a specific host storage or a 
>> guest device model option. Unfortunately the problem occurred with 
>> every combination I tested while running Linux as the guest. Note, I 
>> only tested RHEL8 & RHEL9 compatible distributions ( Alma & Rocky ). 
>> The problem did not occur when I ran FreeBSD as the guest. The 
>> problem did not occur when I ran KVM in the host and Linux as the guest.
>>
>>> I would suggest running fio (or even dd) on a raw disk device inside 
>>> the VM, i.e. without a filesystem at all. Just do not forget to run 
>>> “echo 3 > /proc/sys/vm/drop_caches” in the Linux guest VM before you 
>>> run the tests.
>>
>> The two servers I was using to test with are no longer available. 
>> However, I'll have two more identical servers arriving in the next 
>> week or so. I'll try to run additional tests and report back here. I 
>> used bonnie++ as that was easily installed from the package repos on 
>> all the systems I tested.
>>
>>>
>>> Could you also give more information about:
>>>
>>>  1. What results did you get (decode bonnie++ output)?
>>
>> If you look back at this email thread, there are many examples of 
>> running bonnie++ on the guest. I first ran the tests on the host 
>> system using Linux + ext4 and FreeBSD 14 + UFS & ZFS to get a 
>> baseline of performance. Then I ran bonnie++ tests using bhyve as the 
>> hypervisor and Linux & FreeBSD as the guest. The combination of host 
>> and guest storage options included ...
>>
>> 1) block device + virtio blk
>> 2) block device + nvme
>> 3) UFS disk image + virtio blk
>> 4) UFS disk image + nvme
>> 5) ZFS disk image + virtio blk
>> 6) ZFS disk image + nvme
>> 7) ZVOL + virtio blk
>> 8) ZVOL + nvme
>>
>> In every instance, I observed the Linux guest disk IO often perform 
>> very well for some time after the guest was first booted. Then the 
>> performance of the guest would drop to a fraction of the original 
>> performance. The benchmark test was run every 5 or 10 minutes in a 
>> cron job. Sometimes the guest would perform well for up to an hour 
>> before performance would drop off. Most of the time it would only 
>> perform well for a few cycles ( 10 - 30 mins ) before performance 
>> would drop off. The only way to restore the performance was to reboot 
>> the guest. Once I determined that the problem was not specific to a 
>> particular host or guest storage option, I switched my testing to 
>> only use a block device as backing storage on the host to avoid 
>> hitting any system disk caches.
>>
>> Here is the test script I used in the cron job ...
>>
>> #!/bin/sh
>> FNAME='output.txt'
>>
>> echo ================================================================================ >> $FNAME
>> echo Begin @ `/usr/bin/date` >> $FNAME
>> echo >> $FNAME
>> /usr/sbin/bonnie++ 2>&1 | /usr/bin/grep -v 'done\|,' >> $FNAME
>> echo >> $FNAME
>> echo End @ `/usr/bin/date` >> $FNAME
>>
>> As you can see, I'm calling bonnie++ with the system defaults. That 
>> uses a data set size that's 2x the guest RAM in an attempt to 
>> minimize the effect of filesystem cache on results. Here is an 
>> example of the output that bonnie++ produces ...
>>
>> Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
>>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Name:Size etc        /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP  /sec %CP
>> linux-blk    63640M  694k  99  1.6g  99  737m  76 985k  99  1.3g  69 +++++ +++
>> Latency             11579us     535us   11889us 8597us   21819us    8238us
>> Version  2.00       ------Sequential Create------ --------Random Create--------
>> linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>               files  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP  /sec %CP
>>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
>> Latency              7620us     126us 1648us     151us      15us     633us
>>
>> --------------------------------- speed drop ---------------------------------
>>
>> Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
>>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Name:Size etc        /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP  /sec %CP
>> linux-blk    63640M  676k  99  451m  99  314m  93 951k  99  402m  99 15167 530
>> Latency             11902us    8959us   24711us 10185us   20884us    5831us
>> Version  2.00       ------Sequential Create------ --------Random Create--------
>> linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>               files  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP  /sec %CP
>>                  16     0  96 +++++ +++ +++++ +++     0  96 +++++ +++     0  75
>> Latency               343us     165us 1636us     113us      55us    1836us
>>
>> In the example above, the benchmark test repeated about 20 times with 
>> results that were similar to the performance shown above the dotted 
>> line ( ~ 1.6g/s seq write and 1.3g/s seq read ). After that, the 
>> performance dropped to what's shown below the dotted line which is 
>> less than 1/3 the original speed ( ~ 451m/s seq write and 402m/s seq 
>> read ).
>>
>>>  2. What results were you expecting?
>>>
>> What I expect is that, when I perform the same test with the same 
>> parameters, the results would stay more or less consistent over 
>> time. This is true when KVM is used as the hypervisor on the same 
>> hardware and guest options. That said, I'm not worried about bhyve 
>> being consistently slower than KVM or a FreeBSD guest being 
>> consistently slower than a Linux guest. I'm concerned that the 
>> performance drop over time is indicative of an issue with how bhyve 
>> interacts with non-FreeBSD guests.
>>
>>>  3. VM configuration, virtio-blk disk size, etc.
>>>  4. Full command for tests (including size of test-set), bhyve, etc.
>>
>> I believe this was answered above. Please let me know if you have 
>> additional questions.
>>
>>>
>>>  5. Did you pass the virtio-blk disk with a 512-byte or 4K sector 
>>> size? If 512, you should probably try 4K.
>>>
>> The testing performed was not exclusively with virtio-blk.
>>
>>>  6. Linux has several read-ahead options for IO schedule, and it 
>>> could be related too.
>>>
>> I suppose it's possible that bhyve could be somehow causing the disk 
>> scheduler in the Linux guest to act differently. I'll see if I can 
>> figure out how to disable that in future tests.
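
A minimal sketch for checking what the guest is currently using, assuming the standard Linux sysfs block queue attributes (`scheduler` and `read_ahead_kb`). The function name, the SYSFS_ROOT override and the vdb device name are placeholders for illustration only.

```shell
#!/bin/sh
# show_queue_settings DEV
# Prints the I/O scheduler list (the active one is shown in brackets) and
# the read-ahead size in KiB for block device DEV, e.g. "vdb".
# SYSFS_ROOT is a hypothetical override that lets the function be
# exercised against a fake directory tree; it defaults to /sys.
show_queue_settings() {
    q="${SYSFS_ROOT:-/sys}/block/$1/queue"
    [ -d "$q" ] || { echo "no such queue directory: $q" >&2; return 1; }
    echo "scheduler:     $(cat "$q/scheduler")"
    echo "read_ahead_kb: $(cat "$q/read_ahead_kb")"
}

# To change the settings for a test run (as root, device name assumed):
#   echo none > /sys/block/vdb/queue/scheduler
#   echo 0    > /sys/block/vdb/queue/read_ahead_kb
```

Logging these values next to each bonnie++ run would show whether they change when the slowdown begins.
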
>>
>>> Additionally, could you also try the “sync=disabled” volume/zvol 
>>> option? Of course it is only for write testing.
>>
>> The testing performed was not exclusively with zvols.
>>
>> Once I have more hardware available, I'll try to report back with 
>> more testing. It may be interesting to also see how a Windows guest 
>> performs compared to Linux & FreeBSD. I suspect that this issue may 
>> only be triggered when a fast disk array is in use on the host. My 
>> tests use a 16x SSD RAID 10 array. It's also quite possible that the 
>> disk IO slowdown is only a symptom of another issue that's triggered 
>> by the disk IO test ( please see end of my last post related to 
>> scheduler priority observations ). All I can say for sure is that ...
>>
>> 1) There is a problem and it's reproducible across multiple hosts
>> 2) It affects RHEL8 & RHEL9 guests but not FreeBSD guests
>> 3) It is not specific to any host or guest storage option
>>
>> Thanks,
>>
>> -Matthew
>>
>