ZFS and SSD, trim caused stalling

Borja Marcos borjam at sarenet.es
Thu May 5 08:24:37 UTC 2016


Hello,

While doing some tests with Intel P3500 NVMe drives I have found a serious performance problem caused by the TRIM operation.
Maybe it’s better not to use TRIM on these SSDs, I am not sure, but in any case this reveals a problem that can also happen
with other SSDs. I have seen comparable, although less severe, behavior with at least one other model: a 128 GB OCZ Vertex4
showed some stalling, even though that particular SSD trims at around 2 GB/s while sustaining a write throughput of 200 MB/s
until it reaches 50% capacity, falling to around 100 MB/s after that.

I know this is a worst-case benchmark, but operations like the deletion of a large snapshot or a dataset could
trigger similar problems.

In order to do a rough check of the I/O performance of this system, I created a raidz2 pool with 10 NVMe drives. After
creating it, I used Bonnie++. As a single Bonnie instance is unable to generate enough I/O activity, I ran
four in parallel.

Doing a couple of tests, I noticed that the second time I launched four Bonnies the writing activity was completely
stalled. Repeating a single test I noticed this (file OneBonnie.png):

The Bonnies were writing for 30 minutes, the read/write test took around 50 minutes, and the reading test took
roughly 10 minutes. But after the Bonnie processes finished, the deletion of the files caused around
30 minutes of heavy TRIM activity.

Running two tests, one after the other, showed something far more serious. The second group of four Bonnies
was stalled for around 15 minutes while there was heavy TRIM I/O activity. And according to the service times
reported by devstat, the stall didn’t happen in the disk I/O subsystem. Looking at the activity between 8:30 and
8:45 it can be seen that the service time reported for the write operations is 0, which means that the writes
aren’t actually reaching the disks. (files TwoBonniesTput.png and TwoBonniesTimes.png)
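For anyone wanting to reproduce the measurement, something along these lines shows where the service times come from. It is
just a minimal libdevstat sketch (not the exact tool I used; the one-second sampling interval is arbitrary) that samples all
devices twice and prints the write service time of each one:

/*
 * Minimal sketch: sample every device twice with libdevstat and print the
 * write service time (ms per write transaction) for each one. Compile with
 * "cc -o wsvc wsvc.c -ldevstat". Error handling is mostly omitted.
 */
#include <sys/types.h>
#include <devstat.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	struct statinfo prev, cur;
	int i;

	prev.dinfo = calloc(1, sizeof(struct devinfo));
	cur.dinfo = calloc(1, sizeof(struct devinfo));

	if (devstat_checkversion(NULL) != 0)
		return (1);

	devstat_getdevs(NULL, &prev);		/* first snapshot */
	sleep(1);				/* arbitrary sampling interval */
	devstat_getdevs(NULL, &cur);		/* second snapshot */

	for (i = 0; i < cur.dinfo->numdevs; i++) {
		long double etime = cur.snap_time - prev.snap_time;
		long double ms_write = 0.0;

		devstat_compute_statistics(&cur.dinfo->devices[i],
		    &prev.dinfo->devices[i], etime,
		    DSM_MS_PER_TRANSACTION_WRITE, &ms_write,
		    DSM_NONE);
		printf("%s%u: %.2Lf ms per write\n",
		    cur.dinfo->devices[i].device_name,
		    cur.dinfo->devices[i].unit_number, ms_write);
	}
	return (0);
}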

ZFS itself is starving the whole vdev. Even trivial operations such as “ls” were a problem; overall system
performance was awful.

Apart from disabling TRIM altogether, there would be two possible solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the block-freeing work must be throttled, which
can cause its own issues.

2) Skipping the TRIMs sometimes. Depending on the particular workload and SSD model, TRIM can be almost mandatory
or just a “nice to have” feature. In a case like this one, where deleting large files (four 512 GB files) has had a very
serious impact, TRIM has done more harm than good.

The selective TRIM skipping could be based just on the number of TRIM requests pending on the vdev queues (past some
threshold the TRIM requests would be discarded), or maybe the ZFS block-freeing routines could make a similar decision. I’m not
sure where it would be better to implement this. A rough sketch of the first idea follows.
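Purely as an illustration (none of this is existing ZFS code, and every identifier is made up), the decision could be as simple
as this, counting the discarded requests so that their impact can be checked later:

/*
 * Hypothetical sketch, not actual ZFS code: decide whether to discard a
 * TRIM request depending on how many of them are already pending on the
 * vdev queue. Every identifier here (trim_should_skip, zfs_trim_max_pending,
 * the counters) is an invented name used only for illustration.
 */
#include <stdint.h>
#include <stdbool.h>

/* Tunable: how many TRIMs may queue up on a vdev before we start dropping. */
uint64_t zfs_trim_max_pending = 64;

/* Counters that could be exported through sysctl (see further below). */
uint64_t zfs_trim_skipped_ops;
uint64_t zfs_trim_skipped_bytes;

bool
trim_should_skip(uint64_t vq_pending_trims, uint64_t request_bytes)
{
	if (zfs_trim_max_pending != 0 &&
	    vq_pending_trims >= zfs_trim_max_pending) {
		/* The queue is already saturated with TRIMs: drop this one. */
		zfs_trim_skipped_ops++;
		zfs_trim_skipped_bytes += request_bytes;
		return (true);
	}
	return (false);
}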

A couple of sysctl variables could keep a counter of discarded TRIM operations and of total “not trimmed” bytes, making it
possible to know the impact of this measure. The mechanism could be based on some static threshold configured via a sysctl
variable or, even better, ZFS could make the decision based on the queue depth: in case write or read requests got an
unacceptable service time, the system would invalidate the TRIM requests.
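And again just as an illustration (these sysctl nodes don’t exist, the names are invented), exporting the threshold and the
counters from the previous fragment would be trivial on FreeBSD:

/*
 * Hypothetical sketch: expose the made-up threshold and counters from the
 * previous fragment under the existing vfs.zfs.trim node. None of these
 * leaf OIDs exists today; the point is only that the accounting is cheap.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

extern uint64_t zfs_trim_max_pending;	/* defined in the previous fragment */
extern uint64_t zfs_trim_skipped_ops;
extern uint64_t zfs_trim_skipped_bytes;

SYSCTL_DECL(_vfs_zfs_trim);

SYSCTL_UQUAD(_vfs_zfs_trim, OID_AUTO, max_pending, CTLFLAG_RWTUN,
    &zfs_trim_max_pending, 0,
    "Discard TRIMs once this many are pending on a vdev (0 = never discard)");
SYSCTL_UQUAD(_vfs_zfs_trim, OID_AUTO, skipped_ops, CTLFLAG_RD,
    &zfs_trim_skipped_ops, 0,
    "Number of TRIM requests discarded under load");
SYSCTL_UQUAD(_vfs_zfs_trim, OID_AUTO, skipped_bytes, CTLFLAG_RD,
    &zfs_trim_skipped_bytes, 0,
    "Bytes left untrimmed because of discarded TRIM requests");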

What do you think? In some cases it’s clear that TRIM can do more harm than good. I think that this measure could give us the best
of both worlds: TRIMming when possible, during “normal” I/O activity, and avoiding the trouble it causes during exceptional
activity (deletion of very large files, large numbers of files, or large snapshots/datasets).

Borja.





