ZFS/NFS hickups and some tools to monitor stuff...

Peter Eriksson pen at lysator.liu.se
Sun Mar 29 18:23:41 UTC 2020

>> Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.
> Wow. By “filesystems” I assume you mean ZFS datasets and not zpools?
Yes, ZFS datasets. Our AD right now contains ~130000 users (not all are active, thankfully - typically around 3000-5000 are online at the same time, 90% via SMB), spread out over a number of servers.

> Why so many? In the early days of ZFS, due to the lack of per-user quotas, there were sites with one ZFS dataset per user (home directory) so that quotas could be enforced. This raised serious performance issues, mostly at boot time due to the need to mount so many different datasets. Per-
> But, this may be the underlying source of the performance issues.

Yeah. I’m aware of the user quotas which is nice. However they only solve part of the problem(s):

1. With the GDPR laws and the “right to be forgotten” we need to be able to delete a user’s files when they leave their employment here and/or when students stop studying. Together with snapshots this would be a much harder operation if we just had one big dataset for all users: we couldn’t delete just that specific user’s dataset (and its snapshots) - basically we would have to delete the snapshots for *all* users…

2. Users that write a lot of data every day - this uses up a lot of “quota” but not “refquota”, and the “userquota” only counts what “refquota” counts (referenced data)… And then even if we find that user and make them stop writing, their old data will live on in the snapshots…
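Point 1 above is one of the big wins of per-user datasets: a deletion request becomes a single recursive destroy. A minimal sketch, assuming a hypothetical tank/home/<user> layout - the commands are echoed rather than executed here, since they are destructive:

```shell
# Hypothetical layout: one dataset per user under tank/home.
# "zfs destroy -r" removes the dataset together with all of its
# snapshots; "-n" (dry run) plus "-v" shows what would go first.
user="jdoe"
ds="tank/home/$user"
echo "zfs destroy -rnv $ds"   # dry run: list dataset + snapshots
echo "zfs destroy -rv $ds"    # the real, recursive delete
```

With one big shared dataset there is no equivalent single command: the user's blocks are pinned by snapshots that also cover everyone else's data.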

Right now we are trying out a scheme where we give each user dataset a “userquota” of X, a “refquota” of X+1G and a “quota” of 3*X. This will hopefully lessen the “slow server” problem when a user fills up their refquota, since they’ll run into their userquota before the refquota…
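As a sketch of that scheme, assuming X = 25G and hypothetical user/dataset names (commands echoed, not run):

```shell
# userquota = X, refquota = X + 1G, quota = 3 * X (sizes in GiB).
# The userquota trips first, 1G before the dataset's refquota, while
# the quota leaves 2*X of headroom for snapshot-held blocks.
X=25
user="jdoe"
ds="tank/home/$user"
echo "zfs set userquota@${user}=${X}G $ds"
echo "zfs set refquota=$((X + 1))G $ds"
echo "zfs set quota=$((X * 3))G $ds"
```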

I’ve also modified the “zfs snap” command to avoid taking snapshots on near-full datasets. We’ll see if this makes things better.
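The guard itself is simple in principle. A shell sketch of the same idea (the post patched the zfs command itself; the 90% threshold and all names/values here are hypothetical):

```shell
# Skip snapshotting a dataset that is already nearly full.
# "capacity" would normally be computed from the dataset's used and
# refquota values; here it is a fixed example number.
capacity=95      # percent used (hypothetical example value)
threshold=90     # skip snapshots at or above this fill level
ds="tank/home/jdoe"
if [ "$capacity" -ge "$threshold" ]; then
    echo "skipping snapshot of $ds (${capacity}% full)"
else
    echo "zfs snapshot $ds@hourly"
fi
```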

(I’ve also added a “zfs clean” command to parallelise the snapshot deletion a bit and make it possible to be smart about which snapshots to remove. We used to do this via a Python script, but that is not nearly as efficient as doing it directly in the “zfs” command. We normally set a user property like “se.liu.it:expires” to the date when a snapshot should expire, and then “zfs clean” can look for that property and delete just the snapshots that have expired. The idea is to keep hourly snapshots for 2 days, daily snapshots for 2 weeks and weekly snapshots for 3 months (or so - we’ve been testing different times here); anything older is on the backup server. Users can easily recover their own files via the Windows “Previous Versions” feature, or via the “.zfs/snapshot” directory for Unix users.)
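The expiry scheme sketches easily in shell. The property name is from the post; the dates and snapshot names are made up, and the “zfs set”/“zfs destroy” commands are echoed rather than run. ISO dates (YYYY-MM-DD) compare correctly as plain strings:

```shell
# Tag a snapshot with its expiry date, then reap expired ones.
snap="tank/home/jdoe@hourly-2020-03-28"
echo "zfs set se.liu.it:expires=2020-03-30 $snap"

today="2020-04-05"      # fixed example date; normally $(date +%F)
expires="2020-03-30"    # normally read back with zfs get -H -o value
if [ "$expires" \< "$today" ]; then
    echo "zfs destroy $snap"   # expired: delete it
fi
```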

>>> Maybe I’m asking the obvious, but what is performance like natively on the server for these operations?
>> Normal response times:
>> This is from a small Intel NUC running OmniOS so NFS 4.0:
>> $ ./pfst -v /mnt/filur01
>> [pfst, version 1.7 - Peter Eriksson <pen at lysator.liu.se>]
>> 2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
> You misunderstood my question. When you are seeing the performance issue via NFS, do you also see a performance issue directly on the NFS server?

The tests I ran did not indicate the same performance issues directly on the server (I ran the same test program locally). Well, except that the “zfs” commands were slow.

> If the SMB clients are all (or mostly) Windows 8 or newer, the Microsoft CIFS/SMB client stack has lots of caching to make poor server performance feel good. That caching by the client may be masking comparable performance issues via SMB. Testing directly on the server will remove the network file share layer from the discussion, or focus the discussion there.
>> Mkdir & rmdir takes about the same amount of time here. (0.6 - 1ms).
> Do reads/writes from/to existing files show the same degradation? Especially reads?

Didn’t test that at the time, unfortunately. And right now things are running pretty OK…

>>> What does the disk %busy look like on the disks that make up the vdev’s? (iostat -x)
>> Don’t have those numbers (when we were seeing problems) unfortunately but if I remember correctly fairly busy during the resilver (not surprising).
>> Current status (right now):
>> # iostat -x 10 |egrep -v pass
>>                       extended device statistics  
>> device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b  
>> nvd0           0       0      0.0      0.0     0     0     0     0    0   0 
>> da0            3      55     31.1   1129.4    10     1    87     3    0  13 
>> da1            4      53     31.5   1109.1    10     1    86     3    0  13 
>> da2            5      51     41.9   1082.4     9     1    87     3    0  14
> If this is during typical activity, you are already using 13% of your capacity. I also don’t like the 80ms per operation times.

The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own H241 Smart HBAs and are seeing similar latencies there.

https://www.storagereview.com/review/hgst-ultrastar-he10-10tb-enterprise-hard-drive-review

> What does zpool list show (fragmentation)?

22-27% fragmentation at 50-53% capacity (108T size) on the 3 biggest servers.

>> At least for the resilver problem.
> There are tunings you can apply to make the resilver even more background than it usually is. I don’t have them off the top of my head.

Yeah, I tried those. Didn’t make much difference though…

> I have managed ZFS servers with hundreds of thousands of snapshots with no performance penalty, except for snapshot management functions (“zfs list -t snapshot”, which would take many, many minutes to complete), so just the presence of snapshots should not hurt. Be aware that destroying snapshots in the order in which they were _created_ is much faster. In other words, always destroy the oldest snapshot first and work your way forward.

Yeah, I know. That’s what we are doing.
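Oldest-first destruction falls out naturally from a creation-sorted listing. A sketch with a simulated snapshot list (real usage would pipe from “zfs list -H -t snapshot -o name -s creation”, which sorts ascending by creation time; names here are hypothetical, and the destroys are echoed, not run):

```shell
# Simulated, creation-sorted snapshot list (oldest first), as
# "zfs list -H -t snapshot -o name -s creation <dataset>" would
# produce it; destroy in exactly that order.
printf '%s\n' \
    'tank/home/jdoe@weekly-2020-01-05' \
    'tank/home/jdoe@daily-2020-03-20' \
    'tank/home/jdoe@hourly-2020-03-28' |
while read -r snap; do
    echo "zfs destroy $snap"   # oldest snapshot goes first
done
```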

- Peter

More information about the freebsd-fs mailing list