Recurring ZFS performance problems

Adrian Gschwend ml-ktk at netlabs.org
Mon Mar 10 17:12:18 UTC 2014


Hi group,

(There are a lot of pastes in here; see http://pastebin.com/yjQnLryP for
this email in case the mail client mangles the long lines)

On a regular basis I run into some very weird ZFS performance issues on
my server. When it happens, file IO is terribly slow and even a simple ls
can take a long time (in the worst case up to minutes). Everything that
relies on file IO is basically dead in this mode, so even starting top or
other tools is a PITA. This state can last from minutes to hours and goes
back to normal after some seemingly random time. A reboot does not
necessarily fix it; often I'm back in exactly this state right after the
reboot.

I do not see any pattern in my monitoring (munin): when munin starts to
time out because of the problem, none of my system graphs show any peaks
leading up to it.

When I run 'top' in this mode I see many processes in one of these states:

tx->tx
zfs
umtxn (mainly on mysql, which is unkillable in this mode)
uwait
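
For what it's worth, next time it hits I could also dump the kernel stacks
of the stuck processes to see exactly what they are waiting on. A rough
sketch of what I have in mind (the pid is just a placeholder, I have not
captured this during an incident yet):

procstat -kk -a            # kernel stacks of all threads on the system
procstat -kk <mysqld pid>  # or just the unkillable mysqld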


Setup:
* FreeBSD 9.2-RELEASE-p3 in a KVM (SmartOS Solaris host, running ZFS itself)
* I'm using mfsbsd for a ZFS-only system
* No specific ZFS changes (I did play with some, see the last part of this email)
* There are 5 jails running
* I run various Apaches (PHP/SVN/TRAC etc.), MySQL, an LDAP daemon, a JVM
and some SIP servers (Freeswitch)
* Normal load is around 20-30% (2 cores)
* Swap is currently at 1% usage (4 GB available)
* I have 16 GB of memory available; munin still shows around 1-2 GB as
free. The issue may well show up faster with less memory, but I cannot
prove that either (see the sysctl sketch right after this list)
* Currently no dtrace is enabled, so I can't get much further than the
standard tools shipped with BSD
* zpool status does not report any failures
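
Regarding the memory point above, here is roughly how I would check whether
the ARC is being squeezed when the problem hits (just a sketch; I have not
captured these values during an incident yet):

sysctl vfs.zfs.arc_max vfs.zfs.arc_min
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c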

The issues are not new; they first appeared while the system was still
running on real hardware (FreeBSD 8.x) and not within a KVM. Back then I
assumed I had a hardware problem, but the problem reappeared on the
virtualized install. That install was basically a 1:1 zfs send copy of the
old system plus some bootloader hacking, so exactly the same software
levels. I then switched to a fresh install on 9.x and have had the issue
on every single release there as well. I have not tried 10.x yet.

fstat | wc -l:

7068

(it took forever to finish)

gstat gives me:

dT: 1.010s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   10    575      1      0    0.2    574    653   17.8  100.4| vtbd0
    0      0      0      0    0.0      0      0    0.0    0.0|
PART/vtbd0/vtbd0
    0      0      0      0    0.0      0      0    0.0    0.0| vtbd0p1
    0      0      0      0    0.0      0      0    0.0    0.0| vtbd0p2
   10    575      1      0    0.3    574    653   18.0  100.4| vtbd0p3
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/vtbd0/vtbd0
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/vtbd0p1/vtbd0p1
    0      0      0      0    0.0      0      0    0.0    0.0|
LABEL/vtbd0p1/vtbd0p1
    0      0      0      0    0.0      0      0    0.0    0.0|
gptid/e402ecce-89ca-11e2-a867-3264262b9894
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/vtbd0p2/vtbd0p2
    0      0      0      0    0.0      0      0    0.0    0.0|
LABEL/vtbd0p2/vtbd0p2
    0      0      0      0    0.0      0      0    0.0    0.0|
gptid/e4112d88-89ca-11e2-a867-3264262b9894
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/vtbd0p3/vtbd0p3
    0      0      0      0    0.0      0      0    0.0    0.0|
SWAP/swap/gptid/e4112d88-89ca-11e2-a867-3264262b9894
    0      0      0      0    0.0      0      0    0.0    0.0| DEV/cd0/cd0
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/gptid/e402ecce-89ca-11e2-a867-3264262b9894/gptid/e402ecce-89ca
-11e2-a867-3264262b9894
    0      0      0      0    0.0      0      0    0.0    0.0|
DEV/gptid/e4112d88-89ca-11e2-a867-3264262b9894/gptid/e4112d88-89ca
-11e2-a867-3264262b9894
    0      0      0      0    0.0      0      0    0.0    0.0|
ZFS::VDEV/zfs::vdev/vtbd0p3

ms/w changes a lot; the highest I've seen right now was around 70.
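
In case it helps, I can also rerun gstat filtered to just the disk and its
partitions with a fixed interval; something along these lines (sketch):

gstat -a -f vtbd0 -I 1s    # only busy providers matching vtbd0, 1s interval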

zpool iostat 2:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         173G  21.5G      3     56  30.9K   535K
tank         173G  21.5G     62    376  49.0K  1.22M
tank         173G  21.5G     46    340  84.3K   565K
tank         173G  21.5G     45    566  74.6K   800K
tank         173G  21.5G     32    222  92.0K   958K
tank         173G  21.5G     63    392   120K  1.10M
tank         173G  21.5G     16    286  14.2K   338K
tank         173G  21.5G     29    313  24.6K   831K
tank         173G  21.5G      0    289      0   445K
tank         173G  21.5G     27    244  32.6K   293K
tank         173G  21.5G     43    385  42.8K   477K
tank         173G  21.5G     31    329  15.7K   710K
tank         173G  21.5G     65    394  46.8K  1.50M
tank         173G  21.5G     80    320   127K   754K
tank         173G  21.5G     30    425   144K  1.09M
tank         173G  21.5G     13    399  25.9K   379K
tank         173G  21.5G     10    194  5.22K   162K
tank         173G  21.5G     18    311  45.5K  1.02M
tank         173G  21.5G     29    202  58.5K   344K
tank         173G  21.5G     32    375   108K   926K


On the host OS (SmartOS), zpool iostat 2 shows me the following (there is
one other FreeBSD guest, but it does almost no IO):

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zones        602G   230G      5    278  44.0K  2.42M
zones        602G   230G      2    154  19.9K  2.18M
zones        602G   230G      0  1.47K      0  10.3M
zones        602G   230G      0    128      0  1.44M
zones        602G   230G      0    270      0  2.61M
zones        602G   230G      0  1.39K      0  10.2M
zones        602G   230G      0    114  7.96K  2.10M
zones        602G   230G      0    979  7.96K  7.84M

When the guest system is not in this state, the writes are lower and I
don't see intervals where reads drop to 0.
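
If it is useful, I can also capture the host-side numbers broken down per
vdev while the guest is stuck, roughly like this (a sketch, run on the
SmartOS host):

zpool iostat -v zones 2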

I have been googling this topic forever and do find people reporting
similar issues, but no one I've contacted has found a real explanation
for it. Based on various guides I started adapting the basic config:

cat /boot/loader.conf:

vfs.zfs.zfetch.block_cap=64

# this one was horrible, bootup alone was dogslow
#vfs.zfs.write_limit_override=1048576

#vfs.zfs.txg.timeout="5"

# so far good results?
vfs.zfs.prefetch_disable="1"

At first I thought disabling prefetch had solved the issue, but it looks
like I was too optimistic about that one. However, since I disabled
prefetch, ls feels *much* faster when the system is otherwise happy.
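
For reference, the values the system is actually running with can be
confirmed at runtime; roughly like this (a sketch; these sysctls should
exist on 9.2):

sysctl vfs.zfs.prefetch_disable vfs.zfs.txg.timeout vfs.zfs.zfetch.block_cap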

I'm really quite lost on this one, so I would appreciate any hints on how
to debug it.

I'm willing to test whatever it takes to figure out where this issue is.

thanks

Adrian








