ZFS/NFS hiccups and some tools to monitor stuff...

Peter Eriksson pen at lysator.liu.se
Sat Mar 28 12:23:23 UTC 2020



> On 28 Mar 2020, at 04:07, PK1048.COM <info at pk1048.com> wrote:
>> 
>> For the last couple of weeks I’ve been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers. Even though our SMB (Windows) users didn’t see the same slowdown (or at least didn’t complain about it - thankfully, since there are about 500 of them per server vs 50 NFS users), the root cause is probably ZFS-related.
> 
> What is the use case for the NFS users: Unix home directories? VM image files? etc.

Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.
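
(For anyone who wants the same counts on their own box, zfs list gives them directly - though enumerating this many snapshots takes a while:

# zfs list -H -o name -t filesystem | wc -l
# zfs list -H -o name -t snapshot | wc -l
)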



>> 3. Rsync cloning of data into the servers. Response times of up to 15 minutes were seen… Yes, 15 minutes to do a mkdir(“test-dir”). Possibly in conjunction with #1 above….
> 
> There is something very wrong with your setup with that kind of NFS response time.
> 
> Maybe I’m asking the obvious, but what is performance like natively on the server for these operations?

Normal response times:

This is from a small Intel NUC running OmniOS, so NFS 4.0:

$ ./pfst -v /mnt/filur01
[pfst, version 1.7 - Peter Eriksson <pen at lysator.liu.se>]
2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
2020-03-28 12:19:10 [ 413 µs]: /mnt/filur01: rmdir("t-omnibus-821-1")
2020-03-28 12:19:11 [1375 µs]: /mnt/filur01: mkdir("t-omnibus-821-2")
2020-03-28 12:19:11 [ 438 µs]: /mnt/filur01: rmdir("t-omnibus-821-2")
2020-03-28 12:19:12 [1329 µs]: /mnt/filur01: mkdir("t-omnibus-821-3")
2020-03-28 12:19:12 [ 428 µs]: /mnt/filur01: rmdir("t-omnibus-821-3")
2020-03-28 12:19:13 [1253 µs]: /mnt/filur01: mkdir("t-omnibus-821-4")
2020-03-28 12:19:13 [ 395 µs]: /mnt/filur01: rmdir("t-omnibus-821-4")

I.e., an mkdir() takes around 1-2 ms and an rmdir() around 0.4 ms.


The same test from a CentOS 7 client (different hardware, a Dell workstation, NFS 4.1):

$ ./pfst -v /mnt/filur01
[pfst, version 1.6 - Peter Eriksson <pen at lysator.liu.se>]
2020-03-28 12:21:15 [ 633 µs]: /mnt/filur01: mkdir("t-electra-965-1")
2020-03-28 12:21:15 [ 898 µs]: /mnt/filur01: rmdir("t-electra-965-1")
2020-03-28 12:21:16 [1019 µs]: /mnt/filur01: mkdir("t-electra-965-2")
2020-03-28 12:21:16 [ 709 µs]: /mnt/filur01: rmdir("t-electra-965-2")
2020-03-28 12:21:17 [ 955 µs]: /mnt/filur01: mkdir("t-electra-965-3")
2020-03-28 12:21:17 [ 668 µs]: /mnt/filur01: rmdir("t-electra-965-3")

mkdir() and rmdir() take about the same amount of time here (0.6-1 ms).

(The above only tests operations on already-mounted filesystems. Mounting filesystems occasionally has its own, partly different, set of problems.)



> What does the disk %busy look like on the disks that make up the vdev’s? (iostat -x)

Unfortunately I don’t have those numbers from when we were seeing the problems, but if I remember correctly the disks were fairly busy during the resilver (not surprising).
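
Next time a resilver runs I’ll probably just capture them with something crude like this (hypothetical log path; each pass prints the since-boot averages followed by a 10-second sample):

# while :; do date; iostat -x -w 10 -c 2 | egrep -v pass; done >> /var/log/iostat-resilver.log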

Current status (right now):

# iostat -x 10 |egrep -v pass
                        extended device statistics  
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b  
nvd0           0       0      0.0      0.0     0     0     0     0    0   0 
da0            3      55     31.1   1129.4    10     1    87     3    0  13 
da1            4      53     31.5   1109.1    10     1    86     3    0  13 
da2            5      51     41.9   1082.4     9     1    87     3    0  14 
da3            3      55     31.1   1129.7    10     1    85     2    0  13 
da4            4      53     31.3   1108.7    10     1    86     3    0  13 
da5            5      52     41.7   1081.3     9     1    87     3    0  14 
da6            3      55     27.6   1103.9    10     1    94     2    0  13 
da7            4      54     34.2   1064.7    10     1    92     3    0  14 
da8            5      55     39.6   1088.2    10     1    94     3    0  15 
da9            3      55     27.7   1103.4    10     1    92     2    0  13 
da10           4      53     34.2   1065.0    10     1    92     3    0  14 
da11           5      55     39.5   1089.0    10     1    95     3    0  15 
da12           1      23      4.7    553.5     0     0     0     0    0   0 
da13           1      23      4.7    553.5     0     0     0     0    0   0 
da14           0      23      1.1    820.6     0     0     0     0    0   1 
da15           0      23      1.0    820.7     0     0     0     0    0   1 
da16           0       0      0.0      0.0     1     0     0     1    0   0 
da17           0       0      0.0      0.0     1     0     0     0    0   0 



>> Previously #1 and #2 haven’t caused that many problems, while #3 definitely has. Something has changed in the last half year or so, but so far I haven’t been able to figure out what.
> 
> The degradation over time makes me think fragmentation from lots of small writes, or an overly full zpool, but that should affect SMB as well as NFS.

Yeah, the zpool is around 50% full so no.

(One of the servers right now:)

# zpool iostat -v DATA
                                     capacity     operations    bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
DATA                              57.7T  50.3T     46    556   403K  8.48M
  raidz2                          29.0T  25.0T     23    253   204K  3.49M
    diskid/DISK-7PK8DWXC              -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK5X6XC              -      -      4     50  31.5K  1.08M
    diskid/DISK-7PK4W4BC              -      -      5     49  41.9K  1.06M
    diskid/DISK-7PK204LG              -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK2GDHG              -      -      4     50  31.3K  1.08M
    diskid/DISK-7PK7850C              -      -      5     49  41.7K  1.06M
  raidz2                          28.8T  25.2T     23    256   199K  3.39M
    diskid/DISK-7PK62HHC              -      -      3     53  27.6K  1.08M
    diskid/DISK-7PK6SG3C              -      -      4     52  34.2K  1.04M
    diskid/DISK-7PK8DRHC              -      -      5     53  39.6K  1.06M
    diskid/DISK-7PK85ADG              -      -      3     53  27.7K  1.08M
    diskid/DISK-2TK7PBYD              -      -      4     51  34.2K  1.04M
    diskid/DISK-7PK6WY9C              -      -      5     53  39.5K  1.06M
logs                                  -      -      -      -      -      -
  diskid/DISK-BTHV7146043U400NGN   289M   372G      0     23      0   821K
  diskid/DISK-BTHV71460441400NGN   274M   372G      0     23      0   821K
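
Pool fill level and free-space fragmentation, for the record (zpool list can print both as properties):

# zpool list -o name,size,alloc,free,capacity,fragmentation DATA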



> What type of SSD’s are you using for the SLOG (every zpool has a built-in ZIL, you have a SLOG when you add separate devices for the Intent Log for sync operations)? I had tried a run of the mill consumer SSD for the SLOG and NFS sync performance went DOWN! For SLOG devices you really must have real enterprise grade server class SSDs designed for very heavy random WRITE load. Today you can get enterprise SSDs designed for READ, WRITE, or a combination.

Dell-branded enterprise SSDs such as the Intel DC S3520 series and Intel DC S3700 series (SATA).
We’ve also tested some Intel SSD 750s (NVMe on PCIe 3.0 cards).

We’re going to test some write-optimised pure SAS SSDs too (as soon as we can get delivery of them). However, I don’t really suspect the SSDs anymore. (We’ve had other issues with the S3520 series though, so we might replace them anyway - they have a tendency to play “possum” every now and then: they “die” and go offline, but if we just leave them sitting there they reappear again automagically after a couple of weeks… We’ve seen the same thing happen with the Intel P3520 (the PCIe variant), except that one didn’t just drop off the bus - it took down the server too by hanging the PCIe bus. Crappy buggers.)
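
(When one of them plays possum, about all we get to look at is its own error log - something like the following, assuming smartmontools from ports is installed and the HBA passes SMART through; da14/da15 look like the SLOG devices here judging by the write rates:

# smartctl -a /dev/da14
)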



> Depending on your NFS use case, you may or may not be seeing sync operations. If you are seeing async operations then a ZIL/SLOG will not help as they are only used for sync operations. For async operations multiple writes are combined in RAM and committed to disk at least every 5 seconds. For sync operations the writes are immediately written to the ZIL/SLOG, so a SLOG _can_ be of benefit if it is fast enough for random write operations. Since a sync operation must be committed to non-volatile storage before the write call can return, and having a single SLOG device can lead to loss of data, it is prudent to mirror the SLOG devices. Having a mirrored pair does not slow down operations as both devices are written to in parallel. The SLOG only exists to meet the sync requirement. If the system does not crash, the SLOG is (almost) never actually read. The exception, and the reason I say “almost”, is that if you are writing so much data over long periods of time that the zpool disks cannot catch up between write requests, then you may have to read some TXG (transaction groups) from the SLOG to get them into the zpool.
> 
>> We’ve tried it with and without L2ARC,
> 
> ARC and L2ARC will only effect read operations, and you can see the effect with zfs-stats -E.

Well, the L2ARC uses up some of the ARC for its headers, so if the ARC is too tight to begin with it can make reads go even slower.
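
That overhead is directly visible in the ARC stats, something like:

# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.l2_hdr_size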

The ARC is actually my main suspect right now, at least for the resilver problem. While the resilver was running it seems to have read a lot of data into the ARC, causing older data (and metadata) to be evicted from the cache. That then became a problem due to three issues:

1. The problematic server had “compressed ARC” disabled due to a previous kernel-crashing problem (which happened less often with it disabled).
2. ZFS prioritises writes over reads.
3. The metadata (directory information, for example) wasn’t in the ARC anymore, so it had to be read from disk again and again.

Add to this the fact that:

- We take hourly snapshots of all filesystems, which needs to access metadata for every filesystem.
- We clean out old snapshots nightly, which needs to access metadata for every filesystem (and all of its old snapshots).

On top of that we had a problem with the cleaning script, so it was running behind and more and more snapshots were piling up on a couple of the file servers - instead of ~100 snapshots per filesystem there were around 400, including on the server where we were resilvering.
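
A per-filesystem snapshot count makes that kind of backlog easy to spot (slow with this many snapshots, but it works):

# zfs list -H -t snapshot -o name | awk -F@ '{ n[$1]++ } END { for (f in n) print n[f], f }' | sort -rn | head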

The 500-600 Samba “smbd” processes each use a fair amount of RAM (100-200 MB), i.e. roughly 60-120 GB in total, so they compete with the ARC too.

Many NFS operations access filesystem metadata - mount operations and filename lookups, for example. So if any of that had been evicted from the ARC it had to be read from disk again, and with ZFS busy writing resilver data that could take a very long time.
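
The metadata pressure is at least visible in the ARC counters, roughly like this:

# sysctl vfs.zfs.arc_meta_limit kstat.zfs.misc.arcstats.arc_meta_used
# sysctl kstat.zfs.misc.arcstats.demand_metadata_hits kstat.zfs.misc.arcstats.demand_metadata_misses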

And when a filesystem is near its quota, that too causes the ZFS transaction size to go down - inflating write IOPS and starving the read IOPS again…
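
Finding the near-quota filesystems is easy enough (90% here is an arbitrary threshold; -p makes quota/used plain byte counts):

# zfs list -rHp -o name,quota,used -t filesystem DATA/home | \
    awk '$2 > 0 && $3 / $2 > 0.9 { printf "%s %.0f%%\n", $1, 100 * $3 / $2 }'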

We are probably going to invest in more RAM for the servers (going from 256GB to 512GB or 768GB) to allow for way more ARC.



> You should not need to make any system tunings to make this work.

Well, I had to tune down vfs.zfs.arc_max from the default (90% of RAM) to 128GB, since the Samba smbd processes also need RAM to run and FreeBSD's ARC auto-tuning doesn't seem to work that well. To mitigate that problem I'm currently using these settings:

vfs.zfs.arc_max = 96GB (tuned down from 128GB since the ARC overshot the 128GB limit and random processes started getting killed)
vfs.zfs.arc_meta_limit = 50% of the ARC (the default is 25%)
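
In concrete form that's just two lines in /etc/sysctl.conf (values in bytes; as far as I can tell both are runtime-tunable):

# 96 GiB
vfs.zfs.arc_max=103079215104
# 50% of that for metadata
vfs.zfs.arc_meta_limit=51539607552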

The ARC overshooting (especially when resilvering) was probably due to us also tuning kern.maxvnodes - it seems that if you touch that number while the system is already running, the ARC goes berserk. So I can probably increase the ARC to 128GB again; it looks much more stable now that we've removed the kern.maxvnodes setting completely.

It would have been nice to have a setting for a minimum amount of metadata in the ARC, so that it would be better protected from being evicted by normal data…
 
We’ve now installed 512GB instead of 256GB in a test server and will see what settings might work well there...


Right now (with no resilver running, not many near-quota filesystems, and many of the extraneous snapshots deleted) things are looking much better:

$ ./pfst -t200ms /mnt/filur0*
…
2020-03-26 21:16:40 [1678 ms]: /mnt/filur01: mkdir("t-omnibus-782-190284") [Time limit exceeded]
2020-03-26 21:16:47 [1956 ms]: /mnt/filur01: mkdir("t-omnibus-782-190289") [Time limit exceeded]
2020-03-26 21:16:53 [1439 ms]: /mnt/filur01: mkdir("t-omnibus-782-190293") [Time limit exceeded]
2020-03-26 21:16:59 [1710 ms]: /mnt/filur01: mkdir("t-omnibus-782-190297") [Time limit exceeded]
2020-03-26 21:17:05 [2044 ms]: /mnt/filur01: mkdir("t-omnibus-782-190301") [Time limit exceeded]
2020-03-26 21:17:11 [1955 ms]: /mnt/filur01: mkdir("t-omnibus-782-190305") [Time limit exceeded]
2020-03-26 21:17:16 [1515 ms]: /mnt/filur01: mkdir("t-omnibus-782-190309") [Time limit exceeded]
...
2020-03-28 06:54:54 [ 370 ms]: /mnt/filur06: mkdir("t-omnibus-783-311285") [Time limit exceeded]
2020-03-28 07:00:01 [ 447 ms]: /mnt/filur01: mkdir("t-omnibus-782-311339") [Time limit exceeded]
2020-03-28 07:00:01 [ 312 ms]: /mnt/filur06: mkdir("t-omnibus-783-311591") [Time limit exceeded]
2020-03-28 07:00:02 [ 291 ms]: /mnt/filur06: mkdir("t-omnibus-783-311592") [Time limit exceeded]
2020-03-28 07:00:05 [ 378 ms]: /mnt/filur06: mkdir("t-omnibus-783-311594") [Time limit exceeded]
2020-03-28 10:35:13 [1876 ms]: /mnt/filur01: mkdir("t-omnibus-782-324215") [Time limit exceeded]

(This only prints operations taking more than 200ms)

The last ~2s one is probably due to a new filesystem being created. That triggers the NFS mount daemon to update the kernel’s list of exports (while holding an NFS kernel lock), which takes some time (although it’s much faster now with the “incremental update” fix in mountd).
 
The slow ones around 21:16 are due to me doing a number of “zfs rename DATA/home/$USER DATA/archive/home/$USER” operations. This too causes a lot of mountd activity (removing NFS exports from the kernel), again causing an NFS kernel lock to be held...
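
(A rough measure of how much mountd has to chew through on each such update is simply the size of the exports file ZFS maintains for it - assuming the default location:

# wc -l /etc/zfs/exports
)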


- Peter



