Desperate with 870 QVO and ZFS

From: <egoitz_at_ramattack.net>
Date: Wed, 06 Apr 2022 11:18:49 UTC
Good morning,

I am writing this post in the hope that someone can help me.

I am running some mail servers with FreeBSD and ZFS. They use Samsung
870 QVO disks (not EVO or any other Samsung SSD model) as storage. They
can easily have from 1500 to 2000 concurrent connections. The machines
have 128GB of RAM and the CPU is almost completely idle. Disk IO is
normally at 30 or 40% at most.

The problem I'm facing is that they can be running just fine and then
suddenly, at some peak hour, the IO goes to 60 or 70% and the machine
becomes extremely slow. ZFS is all at defaults, except the sync property,
which is set to disabled. Apart from that, the ARC is limited to 64GB,
but even this is extremely odd: the used ARC is only near 20GB. I have
seen that the metadata cache in the ARC is very near the limit that
FreeBSD sets automatically depending on the ARC size you configure. It
seems that almost all of the ARC is used by the metadata cache. I have
seen this effect on all my mail servers with this hardware and software
config.
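
In case it helps anyone compare numbers, the ARC figures above can be
checked with something like the following Python sketch. It is only an
illustration: it assumes the kstat.zfs.misc.arcstats.* sysctl names shown
in the code, which exist on FreeBSD releases that still expose an ARC
metadata limit and may be missing on newer OpenZFS versions. The function
names are made up for this example.

```python
import subprocess

# Assumed sysctl names (not confirmed for every FreeBSD/OpenZFS version).
PREFIX = "kstat.zfs.misc.arcstats."
ARC_KEYS = tuple(
    PREFIX + key for key in ("size", "c_max", "arc_meta_used", "arc_meta_limit")
)


def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def arc_summary(stats):
    """Return how full the ARC and its metadata portion are, as fractions."""
    size = stats[PREFIX + "size"]
    c_max = stats[PREFIX + "c_max"]
    meta_used = stats[PREFIX + "arc_meta_used"]
    meta_limit = stats[PREFIX + "arc_meta_limit"]
    return {
        "arc_fill": size / c_max,                # used ARC vs. configured maximum
        "meta_fill": meta_used / meta_limit,     # metadata vs. its own limit
        "meta_share_of_arc": meta_used / size,   # how much of the ARC is metadata
    }


def read_arc_stats():
    """Run sysctl for the assumed keys and parse its output (FreeBSD only)."""
    result = subprocess.run(
        ["sysctl", *ARC_KEYS], capture_output=True, text=True, check=True
    )
    return parse_sysctl(result.stdout)
```

On one of the servers, arc_summary(read_arc_stats()) would return the
three ratios. A meta_fill close to 1.0 together with a low arc_fill would
match what I describe: the metadata limit is reached long before the ARC
itself fills up.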

I attach a zfs-stats output, but it was taken now, when the servers are
not as loaded as described. Let me explain. I run a couple of Cyrus
instances on these servers, one as master and one as slave on each
server. The situation described above happens when both Cyrus instances
become master, that is, when two Cyrus instances are giving service on
the same machine. To avoid issues, we have now balanced them so that each
server has one master and one slave. A slave instance has almost no IO
and only a single connection for replication. So the zfs-stats output is
from a moment when each server carries, let's say, half of the load,
because it has one master and one slave instance.

As said before, when I place two masters on the same server, it may work
fine all day, but then at 11:00 am (for example) the IO goes to 60% (it
doesn't increase beyond that) and it seems as if the IO were not able to
be served above a certain limit, a concrete IO limit (I'd say 60%).

I don't really know whether the QVO technology could be to blame
here, because they are said to be disks for desktop computers. On the
other hand, I get nice performance when, for instance, copying
mailboxes five at a time: I can flood a gigabit interface when copying
mailboxes between servers five at a time, so they do seem to perform.

Could anyone please shed some light on this issue? I don't really
know what to think.

Best regards,