Re: Desperate with 870 QVO and ZFS

From: <egoitz_at_ramattack.net>
Date: Wed, 06 Apr 2022 11:28:08 UTC
The strangest thing is this: when the machine boots, the ARC grows to
about 40GB used (for instance), but later it shrinks to 20GB (and this
is not an example, it is the exact figure) on all my servers. It is as
if the ARC metadata, which is more or less 17GB, were limiting the
whole ARC.
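
The observation above can be put into numbers with a minimal sketch like the following. It assumes the counters are exposed as `kstat.zfs.misc.arcstats.size`, `arc_meta_used` and `arc_meta_limit` (names that depend on the FreeBSD/ZFS version in use), and the sample values are hypothetical, loosely based on the 20GB and 17GB figures mentioned; the 18 GiB limit is invented for illustration.

```python
GIB = 1024 ** 3
PREFIX = "kstat.zfs.misc.arcstats."  # assumed sysctl prefix, version dependent


def parse_sysctl(text):
    """Parse 'name: value' lines (the format sysctl prints) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def meta_report(stats):
    """Return how much of the ARC is metadata and how full the metadata limit is."""
    size = stats[PREFIX + "size"]
    meta_used = stats[PREFIX + "arc_meta_used"]
    meta_limit = stats[PREFIX + "arc_meta_limit"]
    return {
        "meta_share_of_arc": meta_used / size,
        "meta_limit_used": meta_used / meta_limit,
    }


# Hypothetical sample, NOT real output from the servers in this thread.
SAMPLE = (
    f"{PREFIX}size: {20 * GIB}\n"
    f"{PREFIX}arc_meta_used: {17 * GIB}\n"
    f"{PREFIX}arc_meta_limit: {18 * GIB}\n"
)

if __name__ == "__main__":
    for key, value in meta_report(parse_sysctl(SAMPLE)).items():
        print(f"{key}: {value:.0%}")
```

On a live system the text passed to `parse_sysctl` could come from `sysctl kstat.zfs.misc.arcstats`, provided those counters exist on that release.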

With the traffic these machines handle, I would expect the ARC to be
larger than it is. The ARC is limited to 64GB in loader.conf (half the
RAM these machines have).
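
To relate the observed ARC size to the configured limit, a small sketch along these lines may help. It assumes the limit is set through the `vfs.zfs.arc_max` tunable with a value written as `"64G"`; the actual loader.conf line is not shown in this thread, so that spelling is an assumption, and `parse_size` and `arc_fill` are hypothetical helper names.

```python
# Binary multipliers for size suffixes such as "64G" (assumed notation).
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
GIB = 1024 ** 3


def parse_size(text):
    """Convert a value such as '"64G"' or '68719476736' to bytes."""
    text = text.strip().strip('"').upper()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def arc_fill(arc_size_bytes, arc_max_setting):
    """Fraction of the configured ARC limit that is actually in use."""
    return arc_size_bytes / parse_size(arc_max_setting)


if __name__ == "__main__":
    limit = parse_size('"64G"')          # assumed loader.conf value
    ram = 128 * GIB                      # RAM size stated in the post
    print("limit is half of RAM:", limit * 2 == ram)
    print(f"ARC fill at 20 GiB used: {arc_fill(20 * GIB, '64G'):.0%}")
```

With the figures from this thread, the ARC would be sitting at roughly a third of its configured limit.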

On 2022-04-06 13:18, egoitz@ramattack.net wrote:

> WARNING: This email was sent from outside the organization. Do not click on links or open attachments unless you recognize the sender and know the content is safe.
> 
> Good morning,
> 
> I am writing in the hope that someone can help me.
> 
> I am running some mail servers with FreeBSD and ZFS. They use Samsung 870 QVO disks (not EVO or other Samsung SSDs) as storage. They can easily have from 1500 to 2000 concurrent connections. The machines have 128GB of RAM and the CPU is almost completely idle. The disk IO is normally at 30 or 40% at most.
> 
> The problem I'm facing is that they can be running just fine and then suddenly, at some peak hour, the IO goes to 60 or 70% and the machine becomes extremely slow. ZFS is all defaults, except the sync property, which is set to disabled. Apart from that, the ARC is limited to 64GB. But even this is extremely odd: the used ARC is near 20GB. I have seen that the metadata cache in the ARC is very near the limit that FreeBSD automatically sets depending on the ARC size you configure. It seems that almost all of the ARC is used by the metadata cache. I have seen this effect on all my mail servers with this hardware and software configuration.
> 
> I am attaching zfs-stats output, but it is from now, when the servers are not as loaded as described. Let me explain. I run a couple of Cyrus instances on each of these servers, one as master and one as slave. The situation described above happens when both Cyrus instances become master, that is, when two Cyrus instances are serving users on the same machine. To avoid issues, we have now balanced things so that each server has one master and one slave. A slave instance has almost no IO and only a single connection for replication. So the zfs-stats output reflects roughly half the load on each server, because each has one master and one slave instance.
> 
> As said before, when I place two masters on the same server, it may work all day, but then at 11:00 am (for example) the IO goes to 60% (it doesn't increase further) and it seems as if the IO could not be served beyond a certain limit, a concrete IO limit (I'd say 60%).
> 
> I don't really know whether the QVO technology could be to blame here, because they are said to be disks for desktop computers. On the other hand, I have got good performance when, for instance, copying mailboxes five at a time: I can saturate a gigabit interface when copying mailboxes between servers five at a time, so they seem to perform.
> 
> Could anyone please shed some light on this issue? I don't really know what to think.
> 
> Best regards, 