zfs performance issues with iscsi (istgt)
fusionfoto at yahoo.com
Mon Nov 8 08:26:49 UTC 2010
After scratching my head for a few weeks, I've decided to ask for some help.
First, I've got two machines connected by gigabit ethernet. Network performance is not a problem: I am able to substantially saturate the wire when not using iSCSI (e.g. with iperf or ftp). Both systems are 8.1-RELENG, both multi-core with 8G of RAM.
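For reference, the raw-network sanity check looked roughly like this (iperf installed from ports; the server address is a placeholder):

```shell
# On the server: listen for a throughput test (iperf from ports/packages).
iperf -s

# On the client: run a 30-second TCP test against the server
# (192.168.1.10 is a placeholder address).
iperf -c 192.168.1.10 -t 30

# On gigabit this should report somewhere near ~940 Mbit/s; anything much
# lower would point at the network rather than iSCSI/ZFS.
```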
Symptoms: when doing writes (relatively independent of size) from a client to a server via iSCSI, I seem to be hitting a wall at 18-26MB/s. This can be repeated continuously, whether doing a newfs on a 2TB iSCSI volume or a dd from /dev/zero to the iSCSI target. I haven't compared read performance. What originally put me on to this was watching the newfs output *fly* across the screen, then hang for several seconds, then *fly* again, and so on.
This looked like a write-delay problem, so I tweaked the txg write-limit and/or synctime values. That showed some improvement: iostat showed something closer to continuous write throughput on the server, but there was still a stall whether the write limit was 384MB or all the way up to 4GB, which tells me the spindles weren't holding the throughput back. The iostat average was never much beyond 20-26MB/s; peaks were frequently two to three times that, but then it would drop to 1MB/s for a few seconds, which brings us back to that average. CPU and network load were never the limiting factor, nor did the spindles ever get above 20-30% busy.
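For anyone following along, the tunables in question on 8.x look roughly like this (the exact sysctl names vary between ZFS versions, so treat these as examples and check what your kernel actually exposes):

```shell
# List the txg-related tunables present on this kernel (names differ
# across ZFS versions).
sysctl -a | grep vfs.zfs.txg

# Example settings experimented with (values are illustrative):
sysctl vfs.zfs.txg.timeout=5                        # seconds between txg syncs
sysctl vfs.zfs.txg.synctime=2                       # target sync duration
sysctl vfs.zfs.txg.write_limit_override=402653184   # 384MB write limit
```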
So I added two USB keys that write at around 30-40MB/s and mirrored them as a ZIL log device. iostat verifies they are being used, but not continuously; it seems the txg write limit applies to writing to the ZIL as well. I also tried turning the ZIL off and saw no particular performance increase (or decrease). With newfs (which jumps around a lot more than dd) the throughput does not change much at all. Even at 26K-40K pps, interrupt load and such are not problematic, and turning on polling does not change performance appreciably.
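The mirrored USB log device was added with the standard zpool commands (da1/da2 are placeholders for the actual USB keys):

```shell
# Attach a mirrored ZIL (separate log) vdev to the pool; da1/da2 are
# placeholder device names for the two USB keys.
zpool add tank log mirror da1 da2

# Verify the log vdev is present and watch whether it is actually being hit.
zpool status tank
zpool iostat -v tank 1

# For the ZIL-off comparison, this era of ZFS used the vfs.zfs.zil_disable
# tunable (set in /boot/loader.conf and reboot, or via sysctl if writable).
```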
The "server" is a RAIDZ2 of 15 drives @ 2TB each, so sequential *write* throughput (i.e. the dd case) should be pretty fast, but it returns identical numbers. This server does nothing much but istgt. I tried NCQ queue depths from 255 down to 32 with no improvement.
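The NCQ depth was adjusted per drive via camcontrol (da0 is a placeholder; repeat for each spindle):

```shell
# Show the current number of outstanding tags (queue depth) for one drive.
camcontrol tags da0 -v

# Drop the queue depth from 255 to 32 for comparison (da0 is a placeholder;
# repeat for each member of the RAIDZ2).
camcontrol tags da0 -N 32
```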
Even though network performance was not showing a particular limit, I *did* get from 18MB/s to 26MB/s by tweaking the tcp sendbuf* and tcp send* values way beyond reason, even though TCP throughput hadn't been a problem in non-iSCSI operations.
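The send-buffer tweaks were along these lines (values deliberately oversized; kern.ipc.maxsockbuf has to be raised first or the larger TCP buffers can't take effect):

```shell
# Raise the socket-buffer ceiling first, then the TCP send-buffer limits.
# Values are deliberately far beyond what a gigabit LAN should need.
sysctl kern.ipc.maxsockbuf=16777216
sysctl net.inet.tcp.sendbuf_max=16777216
sysctl net.inet.tcp.sendbuf_auto=1       # keep send-buffer autotuning on
sysctl net.inet.tcp.sendbuf_inc=65536    # grow the buffer in larger steps
```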
So whatever I'm doing is not addressing the actual problem. The drives have plenty of available I/O, but instead of using it, or the RAM in the system, or the ZIL, the system sits largely idle, then pegs itself with continuous (but not max-speed) writes while halting the network transfers, and then continues on its way.
Even if it's a threading issue (i.e. we are single-threaded somewhere), there should be some way to make this behave like a normal system, considering how much RAM, SSD, and other resources I'm throwing at this thing. For example, after the buffer starts to empty, additional writes from the client should be accepted, and NCQ should help reorder them so they are processed efficiently.
istgt version 0.3
istgt extra version 20100707
Local benchmarks like dd if=/dev/zero of=/tank/dump bs=1M count=12000 return about 200MB/s (12582912000 bytes transferred in 61.140903 secs, 205801867 bytes/sec) and show continuous (as expected) writes to the spindles. 200MB/s is pretty close to the max I/O speed we can expect given the slot the controller is in and RAID overhead with 7200 RPM drives; at 5900 RPM the number is about 80MB/s.
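For completeness, the local baseline and the monitoring used throughout:

```shell
# Local sequential-write baseline straight to the pool (no iSCSI involved).
dd if=/dev/zero of=/tank/dump bs=1M count=12000

# While a test runs, watch per-device throughput and %busy once a second;
# the bursty fast-then-1MB/s pattern over iSCSI shows up clearly here.
iostat -x -w 1
```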
If this is an istgt problem, is there a way to get reasonable performance out of it?
I know I'm not losing my mind here, so if someone has tackled this particular problem (or its sort), please chime in and let me know what tunable I'm missing. :)
Thanks very much, in advance,
More information about the freebsd-questions mailing list