ZFS/NFS hiccups and some tools to monitor stuff...

Sun Mar 29 19:16:23 UTC 2020

> I thought that snapshot deletion was single threaded within a zpool, since TXGs are zpool-wide, not per dataset. So you may not be able to destroy snapshots in parallel.

Yeah, I thought so too, but decided to try it anyway. It sometimes goes faster, and since I have now decoupled “read snapshots to delete” and “do the deletion” into separate threads, it doesn’t have to read all the snapshots first and then delete them all, but can interleave the two jobs.

Basically what my code now does is:

	for all datasets (recursively):
		collect snapshots to delete
		if more than a (configurable) number of snapshots are queued:
			start a deletion worker (up to a configurable number of workers)

So it can continue gathering snapshots to delete while a batch is being deleted, and it doesn’t have to wait for all snapshots in all datasets to be read before it starts deleting. So if it (for some reason) is slow, then at least it will have deleted _some_ snapshots by the time we terminate the “clean” command.
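A minimal sketch of that loop, in Python and not taken from the patch. The names `collect_expired`, `clean`, `destroy`, `batch_limit` and `max_workers` are made up for illustration, and unlike the description above it starts a fixed pool of workers up front instead of starting them on demand:

```python
import queue
import threading


def collect_expired(dataset, snapshots_by_dataset):
    """Hypothetical stand-in for reading one dataset's expired snapshots."""
    return snapshots_by_dataset.get(dataset, [])


def clean(datasets, snapshots_by_dataset, destroy, batch_limit=500, max_workers=10):
    """Gather expired snapshots and hand full batches to deletion workers."""
    batches = queue.Queue()

    def worker():
        while True:
            batch = batches.get()
            if batch is None:       # sentinel: nothing more to delete
                break
            destroy(batch)          # one destroy call per batch

    workers = [threading.Thread(target=worker) for _ in range(max_workers)]
    for w in workers:
        w.start()

    pending = []
    for dataset in datasets:
        pending.extend(collect_expired(dataset, snapshots_by_dataset))
        # Deletion starts as soon as a full batch is queued,
        # while this loop keeps reading the remaining datasets.
        while len(pending) >= batch_limit:
            batches.put(pending[:batch_limit])
            pending = pending[batch_limit:]

    if pending:                     # leftover partial batch
        batches.put(pending)
    for _ in workers:               # one sentinel per worker
        batches.put(None)
    for w in workers:
        w.join()
```

If the run is interrupted halfway, the batches already handed to workers have been destroyed, which is the point of interleaving reading and deleting.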

I did some tests on the speed with different numbers of “worker” threads and I actually did see some speed improvements (the time was cut in half in some cases). But it varies a lot, I guess: if all metadata is in the ARC then it normally is pretty quick anyway.

I’ve been thinking of also adding separate read workers, so that if one dataset takes a long time to read its snapshots the others could continue, but that is a bit harder to code in a good way :-)
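One possible shape for such read workers, again only a Python sketch and not something the patch does. `read_all`, `list_expired` and `max_readers` are hypothetical names; results are handed back in completion order, so a slow dataset does not hold up the fast ones:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_all(datasets, list_expired, max_readers=4):
    """Yield (dataset, expired_snapshots) pairs as each reader finishes."""
    with ThreadPoolExecutor(max_workers=max_readers) as pool:
        # Map each future back to the dataset it is reading.
        futures = {pool.submit(list_expired, ds): ds for ds in datasets}
        for fut in as_completed(futures):
            yield futures[fut], fut.result()
```

The pairs it yields could feed the same deletion queue as before, at the cost of the snapshots no longer arriving in dataset order.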

What we do now is (simplified):

	# Create hourly snapshots that expire in 2 days:
	zfs snap -r -E "se.liu.it:expires" -e 2d "DATA/staff@${DATETIME}"

	# Clean expired snapshots (10 workers, at least 500 snapshots per delete)
	zfs clean -r -E "se.liu.it:expires" -P10 -L500 -e DATA/staff
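To illustrate the expiry idea behind the two commands, here is a small Python sketch. It assumes the lifetime is a number plus a one-letter unit and that the expiry time is kept as seconds since the epoch; both are assumptions for illustration, not a description of how the patch stores the property, and all function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the real option may accept a different set.
UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_lifetime(text):
    """Turn a string such as '2d' into a timedelta."""
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{UNITS[unit]: value})


def expiry_stamp(created, lifetime):
    """Expiry time as whole seconds since the epoch."""
    return int((created + parse_lifetime(lifetime)).timestamp())


def is_expired(stamp, now):
    """True once 'now' has reached the stored expiry time."""
    return now.timestamp() >= stamp


created = datetime(2020, 3, 29, 19, 16, 23, tzinfo=timezone.utc)
stamp = expiry_stamp(created, "2d")
```

With the expiry stored on the snapshot itself, the clean step only needs to compare each stamp with the current time and does not have to know the schedule that created the snapshot.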

I have my patch available on GitHub ( https://github.com/ptrrkssn/freebsd-stuff ) if it is of interest.

(At first I modified the “zfs destroy” command, but I always feel nervous about using that one, since a slip of the finger could have
catastrophic consequences, so I decided to create a separate command that only works on snapshots and nothing else.)

> I expect zpool/zfs commands to be very slow when large zfs operations are in flight. The fact that you are not seeing the issue locally means the issue is not directly with the zpool/dataset but somehow with the interaction between NFS Client <-> NFS Server <-> ZFS dataset … NFS does not have to be sync, but can you force the NFS client to always use sync writes? That might better leverage the SLOG. Since our use case is VMs and VirtualBox does sync writes, we get the benefit of the SLOG.
> 
>>> If this is during typical activity, you are already using 13% of your capacity. I also don’t like the 80ms per operation times.
>> 
>> The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.
> 
> That should be Ok, but I have heard some reports of issues with the HP Smart 4xx series controllers with FreeBSD. Why are you seeing higher disk latency with SAS than we are with SATA? I assume you checked logs for device communication errors and retries?

Yeah, no errors. The HP H241 HBAs are not as well supported as the SAS3008 ones, but they work OK, at least if you force them into “HBA” mode (changeable from the BIOS). Until we did that they had their problems, yes. There were also some firmware issues on certain releases.

Anyway, we are going to expand the RAM in the servers from 256GB to 512GB (or 768GB). A test I did on our test server seems to indicate that the metadata set fits much better with more RAM, so everything is much faster.

Now I’d also like to see persistent L2ARC support (it would be great to have the metadata cached on faster SSDs and have it survive a reboot), but that won’t happen until the switch to OpenZFS (FreeBSD 13, hopefully), so…

- Peter