Re: ZFS pool balance and performance

From: Frank Leonhardt <freebsd-doc_at_fjl.co.uk>
Date: Mon, 25 Aug 2025 14:09:49 UTC
On 25/08/2025 14:21, Chris Ross wrote:
<snip>
> Okay.  A program to flip bytes and write blocks would be easy enough, but as
> you note, even if it worked it would have downsides.  But right now, the fact
> that it isn't working is more my concern (below)
>
>> 4) If the imbalance is caused by ZFS choosing to migrate new CoW data to a particular dataset (or away from another) then this will only encourage it to continue. ZFS is designed to balance writes between vdevs on a zpool for efficiency, so if you start with two balanced vdevs they should remain balanced. If you added a vdev and continued writing to the zpool, it would tend towards balance over time. It favours vdevs with the most space for a write if all other things are equal, so normal usage should drift data away from the existing vdevs and on to the new. Key being "all other things are equal".
>>
>> So I think you need to find out why ZFS has decided your vdevs are more efficient unbalanced (whether it's right or not). More writes are just going to make matters worse. If it's changed its mind, normal use will balance it over time.
> Yes.  This is my thought too.  There are many gigabytes added/written to this
> pool on a daily basis.  Time Machine backups may or may not be creating new files
> rather than writing to existing files, but they are writes to be sure.  The data
> archival I put on by hand regularly is always new files, MB or GB at a time, but
> I would expect that to spread evenly, and it seems not to be.
>
> Any idea what to look for as to why ZFS may be preferring one of the vdevs,
> which to me seem equivalent?

Well one thing I've noticed over the years is that ZFS is not good AT 
ALL about reporting drives that are on the way out. You only get to know 
about it when the fire has spread to the whole enclosure.

Following an incident I got really interested in this matter and did 
some research - there are traces of it in freebsd-questions but I don't 
seem to have written it up in my blog. This might be helpful though:

https://blog.frankleonhardt.com/2025/freebsd-zfs-raidz-failed-disk-replacement/

(One always hopes one's blog posts are going to be helpful; whether they 
are or not is a different matter).

Long story short, don't trust ZFS saying a drive or VDEV is okay; that 
just means it hasn't failed completely. SAS drives, in particular, never 
want to admit failure, so they can retry for ages and eventually succeed 
even though you, as their owner, would rather know about it.

So, have a look at the drives using smartmontools (which does work on 
SAS drives now - the blog post is wrong on that point).
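
For example (the device names here are just placeholders - substitute 
whatever your system actually shows):

    # list the disks the kernel can see
    camcontrol devlist

    # full SMART output for one drive; on SAS drives the useful bits are
    # typically the grown defect list and the error counter logs
    smartctl -a /dev/da0

    # quick pass/fail health verdict only
    smartctl -H /dev/da0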

Try "zpool iostat -v 5" - the -v gives you stats for each vdev and 
drive, not just the whole pool. All the drives in the pool should have 
similar stats, although not identical, as drives in the real world aren't. 
If there's an outlier, that might be your problem.
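
Something like this, as a rough sketch (the pool name "tank" is just an 
example):

    # per-vdev and per-drive stats, refreshed every 5 seconds; compare
    # the read/write operations and bandwidth columns across drives
    zpool iostat -v tank 5

    # on OpenZFS the -l flag adds average latency columns, which makes
    # a struggling drive stand out even more clearly
    zpool iostat -vl tank 5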

Good old "diskinfo -t" doing a speed benchmark test on all your drives 
is worth a try. Like any benchmark, it's not accurate (especially now 
that drives lie about their geometry), but all the drives of the same 
type in a zpool should produce similar numbers. Obviously wait until the 
pool is quiet, but you can do it while it's online.
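
Roughly along these lines, with your own device names substituted:

    # sequential transfer benchmark on each drive in turn; run it when
    # the pool is otherwise idle so the numbers are comparable
    for d in da0 da1 da2 da3; do
        echo "=== $d ==="
        diskinfo -t /dev/$d
    done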

It'd be nice, of course, if ZFS actually said "I'm a bit concerned about 
drive X as it's slowing down", and so on, but it doesn't.

How do I know all this? Earlier this year I bought a large number of used 
2TB SAS drives on eBay, expecting a lot of them to be a bit flaky so I 
could test ZFS failure modes. I was not disappointed!

Regards, Frank.