Terrible ix performance

Thu Jul 4 03:41:07 UTC 2013

On 07/04/13 13:06, Outback Dingo wrote:
> On Wed, Jul 3, 2013 at 10:01 PM, Lawrence Stewart
> <lstewart at freebsd.org> wrote:
> 
>     On 07/04/13 10:18, Kevin Oberman wrote:
>     > On Wed, Jul 3, 2013 at 4:21 PM, Steven Hartland
>     > <killing at multiplay.co.uk> wrote:
[snip]
>     >>
>     >> Out of interest have you tried limiting the number of queues?
>     >>
>     >> If not give it a try see if it helps, add the following to
>     >> /boot/loader.conf:
>     >> hw.ixgbe.num_queues=1
>     >>
>     >> If nothing else will give you another data point.
> 
>     As noted in my first post to this thread, if iperf is able to push a
>     single flow at 8Gbps, then the NIC is unlikely to be the source of the
>     problem and trying to tune it is a waste of time (at least at this
>     stage).
> 
>     iperf tests memory-network-memory transfer speed without any disk
>     involvement, so the fact that it can get 8Gbps and ftp is getting around
>     4Gbps implies that either the iperf TCP tuning is better (only likely to
>     be relevant if the RTT is very large - Outback Dingo you still haven't
>     provided us with the RTT) or the disk subsystem at one or both ends is
>     slowing things down.
> 
>     Outback Dingo: can you please run another iperf test without the -w
>     switch on both client and server to see if your send/receive window
>     autotuning on both ends is working. If all is well, you should see the
>     same results of ~8Gbps.
> 
>     > You might also try SIFTR to analyze the behavior and perhaps even
>     > figure out what the limiting factor might be.
>     >
>     > kldload siftr
>     > See "Run-time Configuration" in the siftr(4) man page for details.
>     >
>     > I'm a little surprised Lawrence didn't already suggest this as he
>     > is one of the authors. (The "Bugs" section is rather long and he
>     > might know that it won't be useful in this case, but it has greatly
>     > helped me look at performance issues.)
> 
>     siftr is useful if you suspect a TCP/netstack tuning issue. Given that
>     iperf gets good results and the OP's tuning settings should be adequate
>     to achieve good performance if the RTT is low (4MB
>     sendbuf_max/recvbuf_max), I suspect the disk subsystem and/or VM is more
>     likely to be the issue i.e. siftr data is probably irrelevant.
> 
>     Outback Dingo: Can you confirm you have appropriate tuning on both sides
>     of the connection? You didn't specify if the loader.conf/sysctl.conf
>     parameters you provided in the reply to Jack are only on one side of the
>     connection or both.
> 
> 
> Yeah, I concur; I'm starting to think the bottleneck is the zpool.
> 
> 
> iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size:  257 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.10.1.178 port 47360 connected with 10.10.1.11 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  9.61 GBytes  8.26 Gbits/sec
> [  3] 10.0-20.0 sec  8.83 GBytes  7.58 Gbits/sec
> [  3]  0.0-20.0 sec  18.4 GBytes  7.92 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size:  257 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.10.1.178 port 37691 connected with 10.10.1.11 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  5.29 GBytes  4.54 Gbits/sec
> [  3] 10.0-20.0 sec  8.06 GBytes  6.93 Gbits/sec
> [  3]  0.0-20.0 sec  13.4 GBytes  5.73 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size:  257 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.10.1.178 port 17560 connected with 10.10.1.11 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  9.48 GBytes  8.14 Gbits/sec
> [  3] 10.0-20.0 sec  8.68 GBytes  7.46 Gbits/sec
> [  3]  0.0-20.0 sec  18.2 GBytes  7.80 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size:  257 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.10.1.178 port 14729 connected with 10.10.1.11 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  7.81 GBytes  6.71 Gbits/sec
> [  3] 10.0-20.0 sec  9.11 GBytes  7.82 Gbits/sec
> [  3]  0.0-20.0 sec  16.9 GBytes  7.27 Gbits/sec

OK. In that case it does seem like your issue is VM/disk related rather
than network/protocol related. Going forward, I suggest that you test
with FTP as you make tweaks, in order to keep things as close to raw TCP
bulk transfer as possible while still including the disks/VM, i.e. don't
use NFS/SSH/CIFS to evaluate the effectiveness of tuning tweaks.

> The current configuration on both boxes is 
> kernel="kernel"
> bootfile="kernel"
> kernel_options=""
> kern.hz="20000"

Why such a high hz setting? I'd suggest lowering to 2000 on both
machines unless you have good reason for it to be so high.

> hw.est.msr_info="0"
> hw.hptrr.attach_generic="0"
> kern.maxfiles="65536"
> kern.maxfilesperproc="50000"
> kern.cam.boot_delay="8000"
> autoboot_delay="5"
> isboot_load="YES"
> zfs_load="YES"
> hw.ixgbe.enable_aim=0
> 
> and
> cat /etc/sysctl.conf 
> # Disable core dump
> kern.coredump=0
> # System tuning
> net.inet.tcp.delayed_ack=0
> # System tuning
> net.inet.tcp.rfc1323=1
> # System tuning
> net.inet.tcp.sendspace=262144
> # System tuning
> net.inet.tcp.recvspace=262144
> # System tuning
> net.inet.tcp.sendbuf_max=4194304
> # System tuning
> net.inet.tcp.sendbuf_inc=262144
> # System tuning
> net.inet.tcp.sendbuf_auto=1
> # System tuning
> net.inet.tcp.recvbuf_max=4194304
> # System tuning
> net.inet.tcp.recvbuf_inc=262144
> # System tuning
> net.inet.tcp.recvbuf_auto=1
> # System tuning
> net.inet.udp.recvspace=65536
> # System tuning
> net.inet.udp.maxdgram=57344
> # System tuning
> net.local.stream.recvspace=65536
> # System tuning
> net.local.stream.sendspace=65536
> # System tuning
> kern.ipc.maxsockbuf=16777216
> # System tuning
> kern.ipc.somaxconn=8192
> # System tuning
> kern.ipc.nmbclusters=262144
> # System tuning
> kern.ipc.nmbjumbop=262144
> # System tuning
> kern.ipc.nmbjumbo9=131072
> # System tuning
> kern.ipc.nmbjumbo16=65536
> # System tuning
> kern.maxfiles=65536
> # System tuning
> kern.maxfilesperproc=50000
> # System tuning
> net.inet.icmp.icmplim=300
> # System tuning
> net.inet.icmp.icmplim_output=1
> # System tuning
> net.inet.tcp.path_mtu_discovery=0
> # System tuning
> hw.intr_storm_threshold=9000

Your network-related tuning looks good to me.
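
As a rough sanity check on that, the ceiling for a single TCP flow is
approximately window size divided by RTT. The following is only a sketch:
max_gbps is a made-up helper name, and the 1 ms and 10 ms RTTs are
assumed values, since the real RTT has not been reported in this thread.
The 4194304 figure is the sendbuf_max/recvbuf_max setting quoted above.

```sh
#!/bin/sh
# Rough ceiling for one TCP flow: throughput <= window / RTT.
# max_gbps is a hypothetical helper; the RTT values are assumed, not measured.
max_gbps() {
    # $1 = socket buffer limit in bytes, $2 = RTT in milliseconds
    awk -v win="$1" -v rtt="$2" \
        'BEGIN { printf "%.2f\n", win * 8 / (rtt / 1000) / 1e9 }'
}

max_gbps 4194304 1     # 4MB buffer limit at an assumed 1 ms RTT
max_gbps 4194304 10    # same limit at an assumed 10 ms RTT
```

With these inputs the 4MB limit allows roughly 33.55 Gbit/s at 1 ms but
only about 3.36 Gbit/s at 10 ms, which is why the RTT matters before
drawing conclusions about the buffer settings.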

> Box A is 
> zpool status
>   pool: testing
>  state: ONLINE
>   scan: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         testing     ONLINE       0     0     0
>           da0.nop   ONLINE       0     0     0
>           da1.nop   ONLINE       0     0     0
>           da2.nop   ONLINE       0     0     0
>           da3.nop   ONLINE       0     0     0
>           da4.nop   ONLINE       0     0     0
>           da5.nop   ONLINE       0     0     0
>           da6.nop   ONLINE       0     0     0
>           da7.nop   ONLINE       0     0     0
>           da8.nop   ONLINE       0     0     0
>           da9.nop   ONLINE       0     0     0
>           da10.nop  ONLINE       0     0     0
>           da11.nop  ONLINE       0     0     0
>           da12.nop  ONLINE       0     0     0
>           da13.nop  ONLINE       0     0     0
>           da14.nop  ONLINE       0     0     0
>           da15.nop  ONLINE       0     0     0
> 
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/150.9M/0K /s] [0 /38.7K/0  iops]
> [eta 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101192: Wed Jul  3 23:01:09 2013
>   write: io=2048.0MB, bw=147916KB/s, iops=36978 , runt= 14178msec
>     clat (usec): min=9 , max=122101 , avg=24.17, stdev=229.23
>      lat (usec): min=10 , max=122101 , avg=24.42, stdev=229.23
>     clat percentiles (usec):
>      |  1.00th=[   11],  5.00th=[   12], 10.00th=[   14], 20.00th=[   21],
>      | 30.00th=[   21], 40.00th=[   22], 50.00th=[   22], 60.00th=[   23],
>      | 70.00th=[   23], 80.00th=[   24], 90.00th=[   29], 95.00th=[   35],
>      | 99.00th=[   99], 99.50th=[  114], 99.90th=[  131], 99.95th=[  137],
>      | 99.99th=[  181]
>     bw (KB/s)  : min=58200, max=223112, per=99.93%, avg=147815.61,
> stdev=31976.97
>     lat (usec) : 10=0.01%, 20=15.49%, 50=82.15%, 100=1.39%, 250=0.96%
>     lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 20=0.01%, 250=0.01%
>   cpu          : usr=11.05%, sys=87.08%, ctx=563, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>   WRITE: io=2048.0MB, aggrb=147915KB/s, minb=147915KB/s,
> maxb=147915KB/s, mint=14178msec, maxt=14178msec
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [100.0% done] [292.9M/0K/0K /s] [74.1K/0 /0  iops]
> [eta 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101304: Wed Jul  3 23:02:08 2013
>   read : io=2048.0MB, bw=327578KB/s, iops=81894 , runt=  6402msec
>     clat (usec): min=4 , max=20418 , avg=10.15, stdev=28.54
>      lat (usec): min=4 , max=20418 , avg=10.27, stdev=28.54
>     clat percentiles (usec):
>      |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    8],
>      | 30.00th=[   10], 40.00th=[   10], 50.00th=[   10], 60.00th=[   11],
>      | 70.00th=[   11], 80.00th=[   11], 90.00th=[   12], 95.00th=[   13],
>      | 99.00th=[   22], 99.50th=[   31], 99.90th=[   77], 99.95th=[   95],
>      | 99.99th=[  145]
>     bw (KB/s)  : min=290024, max=520016, per=100.00%, avg=328490.00,
> stdev=63941.66
>     lat (usec) : 10=28.85%, 20=69.83%, 50=1.19%, 100=0.09%, 250=0.05%
>     lat (msec) : 50=0.01%
>   cpu          : usr=18.08%, sys=81.57%, ctx=144, majf=0, minf=1
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>    READ: io=2048.0MB, aggrb=327577KB/s, minb=327577KB/s,
> maxb=327577KB/s, mint=6402msec, maxt=6402msec
> 
> 
> Box B
> zpool status
>   pool: backup
>  state: ONLINE
>   scan: none requested
> config:
> 
>         NAME          STATE     READ WRITE CKSUM
>         backup        ONLINE       0     0     0
>           mfid0.nop   ONLINE       0     0     0
>           mfid1.nop   ONLINE       0     0     0
>           mfid2.nop   ONLINE       0     0     0
>           mfid3.nop   ONLINE       0     0     0
>           mfid4.nop   ONLINE       0     0     0
>           mfid5.nop   ONLINE       0     0     0
>           mfid6.nop   ONLINE       0     0     0
>           mfid7.nop   ONLINE       0     0     0
>           mfid8.nop   ONLINE       0     0     0
>           mfid9.nop   ONLINE       0     0     0
>           mfid10.nop  ONLINE       0     0     0
>           mfid11.nop  ONLINE       0     0     0
>           mfid12.nop  ONLINE       0     0     0
>           mfid13.nop  ONLINE       0     0     0
>           mfid14.nop  ONLINE       0     0     0
>           mfid15.nop  ONLINE       0     0     0
>           mfid16.nop  ONLINE       0     0     0
>           mfid17.nop  ONLINE       0     0     0
>           mfid18.nop  ONLINE       0     0     0
>           mfid19.nop  ONLINE       0     0     0
>           mfid20.nop  ONLINE       0     0     0
>           mfid21.nop  ONLINE       0     0     0
>           mfid22.nop  ONLINE       0     0     0
>           mfid23.nop  ONLINE       0     0     0
> 
> 
> 
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/1948K/0K /s] [0 /487 /0  iops] [eta
> 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101023: Thu Jul  4 03:03:05 2013
>   write: io=65592KB, bw=1093.2KB/s, iops=273 , runt= 60002msec
>     clat (usec): min=9 , max=157723 , avg=3654.65, stdev=5733.27
>      lat (usec): min=9 , max=157724 , avg=3654.98, stdev=5733.29
>     clat percentiles (usec):
>      |  1.00th=[   12],  5.00th=[   13], 10.00th=[   18], 20.00th=[   23],
>      | 30.00th=[   25], 40.00th=[  740], 50.00th=[  756], 60.00th=[ 4048],
>      | 70.00th=[ 5856], 80.00th=[ 7648], 90.00th=[ 9408], 95.00th=[10304],
>      | 99.00th=[11584], 99.50th=[19072], 99.90th=[96768], 99.95th=[117248],
>      | 99.99th=[140288]
>     bw (KB/s)  : min=  174, max= 2184, per=99.67%, avg=1089.37, stdev=392.80
>     lat (usec) : 10=0.21%, 20=11.34%, 50=25.24%, 100=0.04%, 750=9.51%
>     lat (usec) : 1000=5.17%
>     lat (msec) : 2=0.30%, 4=7.89%, 10=33.89%, 20=5.99%, 50=0.28%
>     lat (msec) : 100=0.05%, 250=0.10%
>   cpu          : usr=0.16%, sys=1.01%, ctx=10488, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=16398/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>   WRITE: io=65592KB, aggrb=1093KB/s, minb=1093KB/s, maxb=1093KB/s,
> mint=60002msec, maxt=60002msec
> 
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [-.-% done] [608.5M/0K/0K /s] [156K/0 /0  iops] [eta
> 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101025: Thu Jul  4 03:04:35 2013
>   read : io=2048.0MB, bw=637045KB/s, iops=159261 , runt=  3292msec
>     clat (usec): min=3 , max=83 , avg= 5.25, stdev= 1.39
>      lat (usec): min=3 , max=83 , avg= 5.32, stdev= 1.39
>     clat percentiles (usec):
>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>      | 30.00th=[    5], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>      | 70.00th=[    5], 80.00th=[    6], 90.00th=[    6], 95.00th=[    6],
>      | 99.00th=[   10], 99.50th=[   14], 99.90th=[   22], 99.95th=[   25],
>      | 99.99th=[   45]
>     bw (KB/s)  : min=621928, max=644736, per=99.72%, avg=635281.33,
> stdev=10139.68
>     lat (usec) : 4=0.05%, 10=98.94%, 20=0.86%, 50=0.14%, 100=0.01%
>   cpu          : usr=14.83%, sys=85.14%, ctx=60, majf=0, minf=1
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>    READ: io=2048.0MB, aggrb=637044KB/s, minb=637044KB/s,
> maxb=637044KB/s, mint=3292msec, maxt=3292msec

So if I interpret the above correctly, Box A manages ~140MB/s random
write and ~300MB/s random read, while Box B manages ~1MB/s random write
and ~630MB/s random read?
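
To put those figures on the same scale as the iperf results, the fio
bandwidth lines can be converted to Gbit/s. This is only a sketch:
kbs_to_gbps is a made-up helper name, and it takes 1 KB = 1024 bytes,
which matches the output above (2048.0MB written in 14178 msec works out
to the reported 147916KB/s only with 1024-based units).

```sh
#!/bin/sh
# Convert fio's KB/s (1 KB = 1024 bytes in this output) to Gbit/s so the
# disk numbers can be compared with the iperf results.
# kbs_to_gbps is a hypothetical helper name.
kbs_to_gbps() {
    # $1 = bandwidth in KB/s as printed on fio's "bw=" line
    awk -v kbs="$1" 'BEGIN { printf "%.2f\n", kbs * 1024 * 8 / 1e9 }'
}

kbs_to_gbps 147916    # Box A random write
kbs_to_gbps 327578    # Box A random read
kbs_to_gbps 1093.2    # Box B random write
kbs_to_gbps 637045    # Box B random read
```

That gives roughly 1.21, 2.68, 0.01 and 5.22 Gbit/s respectively, all
below the ~8Gbps that iperf reaches over the network.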

A few thoughts:

- What's up with Box B's 1MB/s write bandwidth? I'm guessing something
fired up at the same time as your IO test and killed your random write
throughput.

- Random read/write is not really a useful test here as ftp is
effectively a sequential streaming read/write workload. The random
read/write throughput is irrelevant.

- I recall some advice that zpools should not have more than about 8 or
10 disks in them, and that you should instead create multiple zpools if
you have more disks. Perhaps investigate the source of that rumour and,
if it's true, try creating 2 x 8 disk zpools in Box A and 3 x 8 disk
zpools in Box B and see if that changes things at all.
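
For the sequential test mentioned in the second point, one option is to
reuse the fio invocation from earlier in this thread and change only the
access pattern and block size. This is a sketch, not a tested recipe:
seq_fio_cmd is a made-up helper name, the 1m block size is an assumed
value, and the helper only prints the command line so it can be reviewed
before being run in a directory on the pool under test.

```sh
#!/bin/sh
# Sequential variant of the fio runs quoted above: same options, but
# --rw=write / --rw=read and a 1 MB block size instead of 4k random I/O.
# seq_fio_cmd is a hypothetical helper; it prints the command, it does not run it.
seq_fio_cmd() {
    # $1 = "write" or "read"
    echo "fio --direct=1 --rw=$1 --bs=1m --size=2G --numjobs=1 --runtime=60 --group_reporting --name=seq$1"
}

seq_fio_cmd write
seq_fio_cmd read
```

If the 2G test file fits in memory, the read figures may reflect caching
rather than the disks, so a larger --size is probably worth trying as
well.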

Cheers,
Lawrence