[Bug 266000] noticeably higher i/o and cpu usage in 13.1 zfs on root (virtualized)

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 23 Aug 2022 09:17:22 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266000

            Bug ID: 266000
           Summary: noticeably higher i/o and cpu usage in 13.1 zfs on root
                    (virtualized)
           Product: Base System
           Version: 13.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: olle@dalnix.se

I've noticed, on a couple of Digital Ocean VMs, that after upgrading to
13.1-RELEASE-p1 the hypervisor graphs show much higher I/O and CPU usage than
normal.

So far I've upgraded a couple of boxes. All are out-of-the-box ZFS-on-root
installs.


prison03:

It's a webserver with about a dozen jails for websites, one jail for the db,
and one for a proxy server. The jail roots and "web" data are on separate block
storage volumes. The storage volumes are also using ZFS.

First I noticed it was a bit sluggish when doing simple tasks such as
running find.

Second, I noticed the hypervisor graphs showed much higher CPU usage than
normal. I talked to DO support a bit, tried moving it to another hypervisor
etc. It didn't help. Then I rebooted into the "old" kernel, and CPU and I/O
went down (although I couldn't test this properly, since it's a production box
and pf wouldn't work when booted into the old kernel). But running find etc.
was as snappy as before.

I have some annotated screenshots of the hypervisor graphs here:

https://nextcloud.dalnix.se/index.php/s/8C9yrQqgGbSoQ37

After downgrading, it's snappy again and the graphs are back to normal.



prison04:

Same here, much higher I/O after the upgrade. Graphs:

https://nextcloud.dalnix.se/index.php/s/r2A8JXcRJF97rZW

It only hosts one website and is normally not doing much.



prison08:

Sluggish. The graphs are way off for what it actually does. It's an old web
server with no traffic. The only things running are the normal system services
and the offloading of some ZFS snapshots.

Since this is a box scheduled for destruction, I never noticed the high CPU
usage, so I have no before and after graphs.

https://nextcloud.dalnix.se/index.php/s/dRiF94ED2oYDCmt

last pid: 14617;  load averages:  3.24,  2.96,  2.85    up 30+21:30:47  09:02:23

Very weird: it doesn't do anything, yet it's still busy =D.



*******01db03:

This is a dedicated database server.

With this one, the I/O and CPU load got so bad that the server became unusable
once people actually started to use it. It's a DB server for a GIS type app.
Normally it doesn't have *that* much load. But the (I'm guessing) I/O wait
caused the DB server to stop responding.

I did some troubleshooting on the DB level, and "fixed" the thing that was
causing it. I looked at slow queries, and one of them took longer and longer,
to the point of no return. So I added an index to a table, and that "fixed" it.
But obviously this application error wasn't a problem before 13.1.

Graph screenshots available here:

https://nextcloud.dalnix.se/index.php/s/9pNsyaJa62wRaiW



I have a bunch of physical hardware servers as well, but they do not appear to
have any issues.

I have also upgraded two (physical hardware) storage servers. They all seem
fine. No increased load.


Is this an OpenZFS 2.1.4 + KVM (I *think* DO uses KVM) bug?
