[Bug 209571] ZFS and NVMe performing poorly. TRIM requests stall I/O activity

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Tue May 17 07:37:23 UTC 2016


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209571

            Bug ID: 209571
           Summary: ZFS and NVMe performing poorly. TRIM requests stall
                    I/O activity
           Product: Base System
           Version: 10.3-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: borjam at sarenet.es

Created attachment 170388
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170388&action=edit
throughput graphs for two bonnie++ runs

On a test system with 10 Intel P3500 NVMe drives I have found that TRIM
activity can cause a severe I/O stall. After running several bonnie++
tests, the ZFS file system was almost unusable for 15 minutes (yes,
FIFTEEN!).



HOW TO REPRODUCE:

- Create a ZFS pool, in this case a raidz2 pool with the 10 NVMe drives.

- Create a dataset without compression (we want to test actual I/O
performance).

- Run bonnie++. Since a single bonnie++ process quickly saturates one CPU
core and therefore cannot generate enough bandwidth for this setup, I run
four bonnie++ processes concurrently. In order to demonstrate this issue,
each bonnie++ performs two runs (a full reproduction sketch is included
below). So,
( bonnie++ -s 512g -x 2 -f ) & # four times.
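
Roughly, the steps above translate into something like the following. The
device names (nvd0-nvd9), the pool and dataset names and the working
directory are illustrative, not the exact ones used on the test system:

    # Illustrative only: device, pool and dataset names are assumptions.
    zpool create tank raidz2 nvd0 nvd1 nvd2 nvd3 nvd4 nvd5 nvd6 nvd7 nvd8 nvd9
    zfs create -o compression=off -o mountpoint=/bench tank/bench
    cd /bench
    # Four concurrent bonnie++ processes, two runs each (-x 2), skipping the
    # per-character tests (-f). Add "-u root" if running as root.
    for i in 1 2 3 4; do
        ( bonnie++ -d /bench -s 512g -x 2 -f ) &
    done
    wait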


Graphs are included. They were made with devilator (an Orca-compatible data
collector) pulling data from devstat(9). The graphs show just one disk out of
the 10 (the other 9 are identical, as expected).
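
For anyone trying to reproduce this without devilator, the same devstat(9)
counters can be watched interactively, for example:

    # Both tools read devstat(9), like devilator does.
    gstat -f '^nvd'        # per-device ops/s, MB/s and ms per request
    iostat -x -w 1 nvd0    # extended statistics for one drive, 1 s interval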

The first run of the four bonnie++ processes completes without flaws. On graph
1 (TwoBonniesTput) the first bonnie++ run spans from the start of the graph to
around 08:30 (the green line is the "Intelligent reading" phase), with the
second bonnie++ run starting right after it.

Bonnie++ does several tests, beginning with a write test (blue line showing
around 230 MB/s, from the start to 07:40), followed by a read/write test (from
07:40 to 08:15 on the graphs) showing read/write/delete activity, and finally
a read test (green line showing 250 MB/s from roughly 08:15 to 08:30). After
bonnie++ ends, the files it created are deleted. In this particular test, the
four concurrent bonnie++ processes created four files of 512 GB each, a total
of 2 TB.

After the first run, the disks show TRIM activity going on at a rate of around
200 MB/s. That seems quite slow: a test I did at home on an OCZ Vertex4 SSD
(albeit a single drive, not a pool) gave a peak of 2 GB/s. But I understand
that the ada driver coalesces TRIM requests, while the nvd driver doesn't.
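
If it is useful, the aggregate TRIM counters can also be sampled while the
test runs; on 10.x they should be exposed as sysctls under
kstat.zfs.misc.zio_trim (I am quoting the names from memory, so please
double-check them):

    # Sample the ZFS TRIM counters once per second (names assumed from 10.x).
    while :; do
        sysctl kstat.zfs.misc.zio_trim.bytes kstat.zfs.misc.zio_trim.success
        sleep 1
    done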

The trouble is: the second bonnie++ run starts right after the first one, and
THERE IS ALMOST NO WRITE ACTIVITY FOR 15 MINUTES. Write activity is simply
frozen, and it doesn't pick up until about 08:45, stalling again, although for
a shorter time, around 08:50.

On exhibit 2, "TwoBonniesTimes", it can be seen that the write latency during
the stall is zero, which means (unless I am wrong) that no write commands are
actually reaching the disks.

During the stalls the ZFS system was unresponsive. Even simple commands such
as "zfs list" were painfully slow, sometimes taking minutes to complete.



EXPECTED BEHAVIOR:

I understand that heavy TRIM activity must have an impact, but in this case it
is causing complete starvation of the rest of the ZFS I/O activity, which is
clearly wrong. This behavior could cause a severe problem, for example, when
destroying a large snapshot. In this case, the system is deleting 2 TB of
data.




ATTEMPTS TO MITIGATE IT:

The first thing I tried was to reduce the priority of the TRIM operations in
the I/O scheduler:
    vfs.zfs.vdev.trim_max_pending=100
    vfs.zfs.vdev.trim_max_active=1
    vfs.zfs.vdev.async_write_min_active=8

with no visible effect.
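
These were applied at runtime with sysctl(8) (assuming they are
runtime-writable on 10.3; otherwise they would go into /boot/loader.conf):

    sysctl vfs.zfs.vdev.trim_max_pending=100
    sysctl vfs.zfs.vdev.trim_max_active=1
    sysctl vfs.zfs.vdev.async_write_min_active=8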

After reading the article describing the ZFS I/O scheduler, I suspected that
the TRIM activity might be activating the write throttle. So I just disabled
it:

    vfs.zfs.delay_scale=0

But it didn't help either. The writing processes still got stuck, but on
dp->dp_s rather than dmu_tx_delay.
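
In case it helps whoever looks at this, the wait channels can be checked while
the stall is ongoing with something along these lines:

    # Show where the blocked writers are sleeping (MWCHAN) and their
    # kernel stacks.
    ps -axl | grep bonnie
    procstat -kk $(pgrep bonnie)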



There are two problems here. It seems that the nvd driver doesn't coalesce
TRIM requests, while ZFS, on its side, is dumping a lot of TRIM requests
assuming that the lower layer will coalesce them.

I don't think it's a good idea for ZFS to make such an assumption blindly. On
the other hand, I think that there should be some throttling mechanism applied
to TRIM requests.
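
As a blunt workaround (not a fix), TRIM can be disabled for ZFS entirely,
which should at least confirm that the stalls come from the TRIM requests;
again, the tunable name is from memory, so please verify it:

    # /boot/loader.conf -- disables ZFS TRIM completely (workaround, not a fix).
    vfs.zfs.trim.enabled=0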

-- 
You are receiving this mail because:
You are the assignee for the bug.
