FreeBSD ZFS file server with SSD HDD

Wed Oct 11 17:30:10 UTC 2017

On 10/11/17 06:05, Kate Dawson wrote:
> Currently running a FreeBSD NFS server with a zpool comprising
> 12 x 1TB hard disk drives are arranged as pairs of mirrors in a strip set ( RAID 10 )

That should do 6+ Gb/s.

bonnie++ should be able to measure that.  (It's been a while, but as far 
as I recall bonnie++ works on scratch files in a directory you give it 
with -d, not on raw drives.  So, point it at an empty test dataset 
rather than at live data.)

https://www.coker.com.au/bonnie++/
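
For what it's worth, here is a back-of-envelope sketch in Python of how 
an estimate like the 6+ Gb/s above can be reached for a stripe of 
mirrors.  The 125 MB/s per-drive sequential rate and the function name 
are my assumptions for illustration, not measurements of your hardware:

```python
def striped_mirror_write_gbps(pairs, drive_mb_per_s):
    """Rough sequential write ceiling for a stripe of 2-way mirrors.

    Each mirror writes at about the speed of one drive (both halves
    receive the same data), and the stripe adds the mirrors together.
    """
    megabits_per_s = pairs * drive_mb_per_s * 8  # bytes -> bits
    return megabits_per_s / 1000.0               # Mb/s -> Gb/s


# 12 drives as 6 mirrored pairs; 125 MB/s per drive is an assumed figure.
print(striped_mirror_write_gbps(pairs=6, drive_mb_per_s=125))
```

This only bounds large sequential transfers.  Small random writes, 
which is what VM disk images tend to generate, will land far below it.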

> An additional 2x 960GB SSD added. These two SSD are partitioned with a
> small partition being used for a ZIL log, and larger partition arranged for
> L2ARC cache.

Assuming the ZIL is mirrored, that should do 5+ Gb/s.

Assuming the L2ARC is striped, that should do 10+ Gb/s.

I don't know how to test ZIL and L2ARC in isolation, but dbench should 
be able to test what ZFS exposes, both locally and over NFS:

https://dbench.samba.org/

> Additionally the host has 64GB RAM and 16 CPU cores (AMD Opteron 2Ghz)

That should do 20+ Gb/s.

Memtest86+ should be able to measure that:

http://www.memtest.org/

> A dataset from the pool is exported via NFS to a number of Debian
> Gnu/Linux hosts running a xen hypervisor. These run several disk image
> based virtual machines
> 
> In general use, the FreeBSD NFS host sees very little read IO, which is to be expected
> as the RAM cache  and L2ARC are designed to minimise the amount of read load
> on the disks.
> 
> However we're starting to see high load ( mostly IO WAIT ) on the Linux
> virtualisation hosts, and virtual machines - with kernel timeouts
> occurring resulting in crashes and instability.
> 
> I believe this may be due to the limited number of random write IOPS available
> on the zpool NFS export.
> 
> I can get sequential writes and reads to and from the NFS server at
> speeds that approach the maximum the network provides ( currently 1Gb/s
> + Jumbo Frames, and I could increase this by bonding multiple interfaces together. )
> 
> However day to day usage does not show network utilisation anywhere near
> this maximum.
> 
> If I look at the output of `zpool iostat -v tank 1 ` I see that every
> five seconds or so, the number of write operations goes to > 2k
> 
> I think this shows that I'm hitting the limit that the spinning disk
> can provide in this workload.
> 
> As a cost effective way to improve this ( rather than replacing the
> whole chassis ) I was considering replacing the 1TB HDD with 1TB SSD,
> for the improved IOPS.
> 
> I wonder if there were any opinions within the community here, on
> 
> 1. What metrics can I gather to confirm the disk write IO as bottleneck?
> 
> 2. If the proposed solution will have the required effect?  That is, a
> decrease in the IOWAIT on the GNU/Linux virtualization hosts.

I infer your network to be:

- 1 host running FreeBSD (freebsd-version? uname -a?) and an NFS server 
(version?).

- N (how many?) Debian GNU/Linux hosts (/etc/debian_version?  uname 
-a?), each running a Xen hypervisor (version?) and an NFS client.

- The VM's are configured to see their drives as local devices (e.g. the 
VM's are not running NFS clients connected to the FreeBSD NFS server).

- Gigabit switch (make? model?).

- 1 Gigabit connection between switch and each host.

As you have correctly stated, you need visibility on the relevant 
performance metrics to make informed decisions.  In addition to the 
above tools:

- For networking, I'd try netstat:

http://netstat.net/

- For drive I/O, I use nmon on Debian:

https://en.wikipedia.org/wiki/Nmon

- I believe iostat is available on both:

https://en.wikipedia.org/wiki/Iostat

- For CPU's, RAM, and swap, I use top.

https://en.wikipedia.org/wiki/Top_(software)

- You seem to have found at least one ZFS tool.
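
Since IO WAIT on the Linux hosts is the symptom, it may help to log it 
over time rather than eyeball top.  A minimal Python sketch, assuming 
the standard Linux /proc/stat layout (the fifth counter on the "cpu" 
line is iowait); the function names are mine:

```python
import time


def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two 'cpu' lines.

    Each argument is the first line of /proc/stat, e.g.
    'cpu  user nice system idle iowait irq softirq steal ...'.
    """
    b = [int(x) for x in before.split()[1:]]
    a = [int(x) for x in after.split()[1:]]
    deltas = [y - x for x, y in zip(b, a)]
    # Sum user..steal only; guest time is already included in user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line():
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open("/proc/stat") as f:
        return f.readline()


def sample_iowait(interval=5.0):
    """Measure iowait percent over one interval, in seconds."""
    before = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(before, read_cpu_line())
```

Calling sample_iowait() in a loop on each Xen host, and logging the 
result with a timestamp, would let you line up the IO WAIT spikes 
against the write bursts you see in zpool iostat on the server.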

As others have stated, you will want to ensure that all the pieces are 
reasonably in tune -- VM, NFS client, Xen, Debian networking, switch, 
FreeBSD networking, NFS server, ZFS, etc.  I'd start by looking for 
errors and/or warnings in the usual places (dmesg, /var/log, etc.).  I 
typically leave the settings at the installer defaults, unless I have 
some compelling reason to make a change (at least one reader made a 
suggestion).  Be sure to keep good notes if you're going to muck with 
the settings.

As for 'zpool iostat -v tank 1', I suspect ZFS is telling you that it is 
flushing writes to the HDD's every five seconds (the default transaction 
group timeout; see the vfs.zfs.txg.timeout sysctl).  If flushes always 
complete before the next scheduled flush, replacing the HDD's with SSD's 
probably will not help with the VM IO WAIT and kernel timeout problems. 
But, if the flushes are overrunning each other during peak usage, you 
may have found the bottleneck.
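
One way to check is to log the pool-wide write operations column from 
'zpool iostat tank 1' for a while and look at how long each burst lasts. 
A rough Python sketch, assuming you have already collected the 
per-second numbers into a list; the 2000 ops threshold, the 5 second 
interval, and the function names are illustrative assumptions:

```python
def write_bursts(samples, threshold):
    """Return (start_index, length) for each run of samples >= threshold."""
    bursts = []
    start = None
    for i, ops in enumerate(samples):
        if ops >= threshold:
            if start is None:
                start = i          # a new burst begins here
        elif start is not None:
            bursts.append((start, i - start))
            start = None
    if start is not None:          # burst still running at end of log
        bursts.append((start, len(samples) - start))
    return bursts


def flushes_overrun(samples, threshold, interval=5):
    """True if any burst lasts as long as the flush interval (in samples)."""
    return any(length >= interval
               for _, length in write_bursts(samples, threshold))


# Made-up one-second samples of pool write ops, for illustration only.
quiet_pool = [50, 40, 2100, 60, 45, 55, 30, 2300, 2200, 70]
print(write_bursts(quiet_pool, 2000))
print(flushes_overrun(quiet_pool, 2000))
```

With one sample per second, a burst that lasts five or more samples 
means a flush was still running when the next one came due.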

That said, I suspect that the root cause of the VM IO WAIT and kernel 
timeout problems is that the virtual machines need a low latency 
connection to their system drives, temporary file systems, and/or swap 
devices, and they aren't getting it.  I would not bet on NFS to provide 
this, even with SSD's instead of HDD's.  I would bet on local resources. 
I suggest:

1.  Put 2 mirrored SSD's in each Xen server.

2.  Put VM system drives on the local SSD mirror.

3.  Put VM /tmp file systems on the local SSD mirror, or on RAM:

https://en.wikipedia.org/wiki/Tmpfs

4.  Put VM swap devices on the local SSD mirror, or on RAM:

https://en.wikipedia.org/wiki/Zram

5.  Put VM data drives on NFS.

I am unsure if it is better to do the "on RAM" and "on NFS" ideas at the 
Xen level or within each VM.  Performance is one consideration.  Other 
considerations are security and accountability -- e.g. do customers have 
root on the VM's?

To improve NFS performance:

1.  Enlarge the pipe between the NFS server and the switch -- bonding 
(your idea), upgrade to 10 Gb/s, etc.

2.  Enlarge the pipes between the Xen hosts and the switch.

3.  Add NIC's to the NFS server, add switches, and divide up the Xen 
hosts across the switches.

4.  Add NIC's to the NFS server, one per Xen host, and make direct 
connections between the NFS server and each Xen host.
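
To compare those options on paper, here is a crude Python sketch of the 
bandwidth each Xen host could get if all of them hit the NFS server at 
once.  It ignores protocol overhead and uneven load across bonded 
links, and the host count of 4 is a placeholder since I don't know 
your N:

```python
def per_host_gbps(server_gbps, host_link_gbps, hosts):
    """Even share of the server's total bandwidth, capped by each
    host's own link speed."""
    return min(host_link_gbps, server_gbps / hosts)


hosts = 4  # placeholder; substitute the real number of Xen hosts

print(per_host_gbps(1, 1, hosts))   # today: one 1 Gb/s server link
print(per_host_gbps(2, 1, hosts))   # two bonded 1 Gb/s server links
print(per_host_gbps(10, 1, hosts))  # 10 Gb/s server link
print(per_host_gbps(4, 1, hosts))   # one dedicated 1 Gb/s NIC per host
```

Once the server side is wide enough, the 1 Gb/s link on each Xen host 
becomes the cap, which is where idea 2 comes in.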

Please let us know how it goes.  :-)

David