TRIM, iSCSI and %busy waves

Thu Apr 5 15:00:18 UTC 2018

On Thu, Apr 5, 2018 at 8:08 AM, Eugene M. Zheganin <eugene at zhegan.in> wrote:

> Hi,
>
> I have a production iSCSI system (on zfs of course) with 15 ssd disks and
> it's often suffering from TRIMs.
>
> Well, I know what TRIM is for, and I know it's a good thing, but sometimes
> (actually often) I'm seeing my disks in gstat are overwhelmed by the TRIM
> waves, this looks like a "wave" of 20K 100%busy delete operations starting
> on first pool disk, then reaching second, then third,... - at the time it
> reaches the 15th disk the first one is freed from TRIM operations, and in
> 20-40 seconds this wave begins again.
>

There are two issues here. First, %busy doesn't necessarily mean what you
think it means. Back in the days of one operation at a time, it might have
been a reasonable indicator that the drive is busy. However, today, with
queueing, a disk that shows 100% busy can often take additional load.
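
To illustrate the point, here is a minimal sketch, assuming %busy is the
fraction of the sample window during which at least one request was
outstanding. The function name and the input format are made up for the
example; this is not how gstat is implemented. It shows that one request at
a time and four overlapping requests both report 100% busy, even though the
average queue depth differs.

```python
def busy_and_depth(ops, window):
    """Return (%busy, average queue depth) for a sample window.

    ops    -- list of (start, end) times in seconds, one per I/O request
    window -- (t0, t1) sample window in seconds
    """
    t0, t1 = window
    events = []
    for start, end in ops:
        # Clip each request to the sample window.
        start, end = max(start, t0), min(end, t1)
        if start < end:
            events.append((start, 1))   # request becomes outstanding
            events.append((end, -1))    # request completes
    events.sort()

    busy_time = 0.0    # time with at least one request outstanding
    depth_time = 0.0   # integral of queue depth over time
    depth = 0
    prev = t0
    for t, delta in events:
        if depth > 0:
            busy_time += t - prev
        depth_time += depth * (t - prev)
        depth += delta
        prev = t

    span = t1 - t0
    return 100.0 * busy_time / span, depth_time / span


one_at_a_time = busy_and_depth([(0.0, 1.0)], (0.0, 1.0))
four_queued = busy_and_depth([(0.0, 1.0)] * 4, (0.0, 1.0))
```

Both calls report the same %busy figure, so the number alone says nothing
about how much deeper the queue could go.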

The second problem is that TRIMs suck for a lot of reasons. FFS (I don't
know about ZFS) sends lots of TRIMs at once when you delete a file. These
TRIMs are UFS block sized, so they need to be combined in the ada/da layer.
The combining in the ada and da drivers isn't optimal: it implements a
'greedy' method where we pack as much as possible into each TRIM, which
makes each TRIM take longer. Plus, TRIMs are non-NCQ commands, so they
force a drain of all the other commands before they can run. And we don't
have any throttling in 11.x (at the moment), so they tend to flood the
device and starve out other traffic when there are a lot of them. Not all
controllers support NCQ TRIM (LSI doesn't at the moment, I don't think).
With NCQ we only queue one at a time, and that helps.
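
As a rough picture of what 'greedy' packing means, here is a sketch that is
not the actual ada/da driver code. It assumes non-overlapping (lba, count)
ranges, joins ranges that touch, and then fills each TRIM command with as
many range entries as fit. The function name is hypothetical, and the
default limits (64 entries per command, 65535 blocks per entry) are
assumptions modelled on a single-block ATA TRIM payload.

```python
def pack_trims(ranges, max_ranges_per_cmd=64, max_blocks_per_range=65535):
    """Greedily pack (lba, count) delete ranges into TRIM commands.

    Returns a list of commands; each command is a list of (lba, count)
    entries. Input ranges are assumed not to overlap.
    """
    # Join ranges that are directly adjacent on disk.
    merged = []
    for lba, count in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == lba:
            merged[-1] = (merged[-1][0], merged[-1][1] + count)
        else:
            merged.append((lba, count))

    # Split anything larger than one range entry can describe.
    entries = []
    for lba, count in merged:
        while count > 0:
            n = min(count, max_blocks_per_range)
            entries.append((lba, n))
            lba += n
            count -= n

    # Greedy step: fill every command to the limit before starting another.
    return [entries[i:i + max_ranges_per_cmd]
            for i in range(0, len(entries), max_ranges_per_cmd)]


commands = pack_trims([(0, 8), (8, 8), (32, 8)])
```

Fewer, fuller commands mean less per-command overhead, but each command
covers more blocks and so holds the device longer, which is the trade-off
described above.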

I'm working on TRIM shaping in -current right now. It's focused on NVMe,
but since I'm doing the bulk of it in cam_iosched.c, it will eventually be
available for ada and da. The notion is to measure how long the TRIMs take,
and only send them at 80% of that rate when there's other traffic in the
queue (so if TRIMs are taking 100ms each, the drive can do at most 10/s, and
we send them no faster than 8/s). While this will allow for better
read/write traffic, it does slow the TRIMs down, which slows down whatever
they may be blocking in the upper layers. I can't speak to ZFS much, but for
UFS that's the freeing of blocks, so things like new block allocation may be
delayed if we're almost out of disk space (the lower layers get no signal
for that, so they have no way to decide whether to prioritize TRIMs).
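
The arithmetic behind that 80% rule can be sketched as follows. This is only
an illustration, not the code in cam_iosched.c: the class and method names
are hypothetical, and the exponential moving average used to track TRIM
latency is my assumption about how the measurement might be smoothed.

```python
class TrimPacer:
    """Pace TRIMs at a fraction of their measured service rate."""

    def __init__(self, fraction=0.8, alpha=0.25):
        self.fraction = fraction   # share of the measured rate we allow
        self.alpha = alpha         # smoothing weight for new samples (assumed)
        self.avg_latency = None    # smoothed TRIM latency in seconds
        self.next_ok = 0.0         # earliest time the next TRIM may be sent

    def record_completion(self, latency_s):
        """Fold a completed TRIM's latency into the running average."""
        if self.avg_latency is None:
            self.avg_latency = latency_s
        else:
            self.avg_latency += self.alpha * (latency_s - self.avg_latency)

    def max_rate(self):
        """Allowed TRIMs per second: fraction / average latency."""
        return self.fraction / self.avg_latency

    def may_send(self, now, other_traffic_queued):
        """Decide whether a TRIM may be issued at time `now` (seconds)."""
        # Only throttle when reads/writes are waiting and we have a sample.
        if not other_traffic_queued or self.avg_latency is None:
            return True
        if now < self.next_ok:
            return False
        self.next_ok = now + 1.0 / self.max_rate()
        return True


pacer = TrimPacer()
pacer.record_completion(0.100)   # TRIMs are taking 100ms
rate = pacer.max_rate()          # 0.8 / 0.1 = 8 TRIMs per second
```

With 100ms TRIMs the pacer spaces them at least 125ms apart while other
traffic is queued, and lets them through unthrottled when the queue holds
nothing else.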

> I'm also having a couple of iSCSI issues that I'm dealing with through a
> bounty, so maybe this is related somehow. Or maybe not. Due to some issues
> in iSCSI stack my system sometimes reboots, and then these "waves" are
> stopped for some time.
>
> So, my question is - can I fine-tune TRIM operations ? So they don't
> consume the whole disk at 100%. I see several sysctl oids, but they aren't
> well-documented.
>

You might be able to set the delete method.

> P.S. This is 11.x, disks are Toshibas, and they are attached via LSI HBA.
>

Which LSI HBA?

Warner