Date: Thu, 04 Nov 2021 20:34:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259651

            Bug ID: 259651
           Summary: bhyve process uses all memory/swap
           Product: Base System
           Version: 12.2-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: bhyve
          Assignee: virtualization@FreeBSD.org
          Reporter: reg@FreeBSD.org

I've had a Windows Server running in bhyve on FreeNAS for a few years now. It uses DFS-R to sync a few Windows file systems to my remote backup location. The VM has several zvol-backed AHCI devices and a virtio network adapter. It has been running (mostly) stably for a long time with adequate performance (as in, it can mostly saturate the 1GB link it's on and can get disk speeds in the VM which are as fast as I expect from the low-power backing store).

Recently I made a few changes to the machine and the host, some of which are hard to reverse, and the VM has started to consume all available RAM, then all the swap, and eventually it gets killed by the OOM handler... A few crashes corrupted the DFS-R databases, and so now the machine wants to do a huge amount of IO (both network and disk) to resync (but that's my problem).

There are other reports online of RAM exhaustion from bhyve, but I couldn't find an open bug, so I'm filing one.

My problem seemed to start on updating to TrueNAS-12.0-U5.1, but I also did some other reconfiguration around this time, and judging from the other reports, this might be a long-standing issue.

The other change I made was to mess with the CPU/RAM allocation to this VM, and I accidentally misread the number of cores as the total number of cores, not the per-CPU cores, so I allocated way more cores than my CPU has threads (2xCPUs, 2xcores, 2xthreads, 8GB RAM)... Needless to say, the VM quickly swamped the host. However, this also caused the memory use to grow.
I've now scaled the CPUs back to (1xCPU, 1xcore, 2xthreads, 6GB RAM) and the memory use is now staying stable - although it's currently rebuilding some DFS-R database so it's not maxing out the VM CPUs.

The behavior I observe is that the memory use stays stable as long as the host CPU use is reasonable. As soon as the host starts to max out its real cores (it's a 2xcore, 2xthread CPU) and the bhyve VM is doing a lot of IO, the memory use grows rapidly. When the bhyve process is stopped (by shutting down the VM, if you can get in quick enough), it takes a very long time to exit and sits in a 'tx->tx' state. It looks like it's trying to flush buffers, although the zpool seems to show only reads while the process is exiting.

My guess as to the bug is that bhyve has a huge amount of outstanding IO, but I'm not sure how to monitor that. When the host CPU is really busy these IO buffers are not being freed properly, and are eventually leaking.

Around the same time as making these changes, I also turned on dedup on one of the zvols (the backups on that disk are rewritten every day, even though they're the same, so I was getting a lot of snapshot growth). I've turned that off, but it didn't seem to change the behavior. I also added the ZIL and L2ARC devices to the pool around this time. I've not tried removing them.

The host and the VM have been set up for a long time and working, so I'm going to ignore suggestions to get a bigger box or tune my zarc values... But I'm happy to debug it - I've been able to reproduce this relatively reliably with different CPU settings, although it does rely on Windows cooperating. I can't mess with it too much since I do need to keep the other backups running directly via TrueNAS to the other pools ;-).

TrueNAS Server:
ThinkServer TS140, Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz, 20GB RAM.
zpool: 2 striped mirrored 3TB TOSHIBA HDWD130, with mirrored 12GB ZIL and 19GB L2ARC on SATA SSDs.
TrueNAS-12.0-U6 (FreeBSD 12.2-RELEASE-p10), 25GB of swap.

Windows Server VM:
2xCPU, 1xcore, 1xthread, 6GB RAM (original, see other comments).
4xAHCI zvol with 64K cluster, one of which had dedup on for a period as an experiment, 512B blocks. (VM BSODs immediately if I try using virtio-blk).
1xVirtIO NIC (em0), with 0.1.208 virtio-win drivers.
Windows Server 2019, fully patched.
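
On the question above of how to monitor this: the following is only a minimal sketch, not something taken from the report. It does not measure outstanding IO directly. It samples the resident and virtual size of one process with `ps -o rss= -o vsz= -p PID` (both reported in KiB) and prints the average RSS growth rate, so growth can be lined up against periods of high host CPU and VM IO. The function names, the 10 second interval, and the idea of passing the bhyve PID on the command line are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def parse_ps_line(line):
    """Parse one line of `ps -o rss= -o vsz=` output into (rss_kib, vsz_kib)."""
    rss, vsz = line.split()
    return int(rss), int(vsz)


def growth_kib_per_s(samples):
    """Average RSS growth between the first and last (timestamp, rss_kib) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) / (t1 - t0)


def sample(pid):
    """Run ps once for the given PID; raises CalledProcessError if it has exited."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-o", "vsz=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_line(out)


if __name__ == "__main__":
    pid = int(sys.argv[1])   # hypothetical usage: PID of the bhyve process
    interval = 10            # seconds between samples (arbitrary choice)
    samples = []
    while True:
        try:
            rss, vsz = sample(pid)
        except subprocess.CalledProcessError:
            print("process has exited")
            break
        samples.append((time.time(), rss))
        rate = growth_kib_per_s(samples)
        print(f"rss={rss} KiB vsz={vsz} KiB avg_growth={rate:.1f} KiB/s")
        time.sleep(interval)
```

Run as, for example, `python3 watch_rss.py 1234` (script name and PID are placeholders). A steadily positive growth figure while the host cores are saturated would match the behavior described above; it would not by itself show where the memory is going.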