> I have a FreeBSD 9 system with ZFS root.  It is actually a VM under Xen on a beefy piece of HW (4 core Sandy Bridge 3ghz Xeon, total HW memory 32GB -- VM has 4vcpus and 6GB RAM).  Mirrored gpart partitions.  I am looking for data integrity more than performance as long as performance is reasonable (which it has more than been the last 3 months).
> The other "servers" (VMs) on the same HW are set up the same way but don't have this problem.  There are 4 other FreeBSD VMs, one running email for a one man company and a few of his friends, as well as some static web pages and stuff for him, one runs a few low use web apps for various customers, and one runs about 30 websites with apache and nginx, mostly just static sites.  None are heavily used.  There is also one VM with Linux running a couple of low use FrontBase databases.
> The troublesome VM has been running fine for over 3 months since I installed it.  Level of use has been pretty much constant.  The server runs 4 jails on it, each dedicated to a different bit of email processing for a small number of users.  One is a secondary DNS.  One runs clamav and spamassassin.  One runs exim for incoming and outgoing mail.  One runs dovecot for imap and pop.  There is no web server or database or anything else running.
> Total number of mail users on the system is approximately 50, plus or minus.  Total mail traffic is very low compared to "real" mail servers.
> Earlier this week things started "freezing up".  Processes become unresponsive, and a freeze might last a few minutes, half an hour, or longer.  It eventually resolves itself and things are good for another 10 minutes or 3 hours until it happens again.  When it happens, lots of processes are listed in "top" in one of these states:
> zfs
> zio->i
> tx->tx
> db->db
> These processes only get listed in these states when there are problems.  What are these states indicative of?

Ok, after much reading of ZFS blog posts, forum postings, email list postings, and trying stuff out, I seem to have gotten stuff back down to normal and reasonable performance.

In case anyone has similar issues in a similar circumstance, here is what I did.  Some of these may have had little or no effect but this is what was changed.

The biggest effect was when I did the following:

vfs.zfs.zfetch.block_cap  from default 256 down to 64

This was like night and day.  The idea to try this came from a post by user "madtrader" in the forum thread at http://forums.sagetv.com/forums/showthread.php?t=43830&page=2 .  He was recording multiple streams of HD video while trying to play back HD video from the same server/ZFS file system.

Also, setting

vfs.zfs.write_limit_override   to something other than the default of "0" (disabled) seems to have had a relatively significant effect.  Before I worked with "block_cap" above, I was focusing on this and had tried everything from 64M to 768M.  I had good results at 512M; 768M was still good but not quite as good as far as I could tell from testing.  So I settled on 576M, which is around where I was getting the best results on my system with my amount of RAM (6GB), then added the block_cap change, and things really are pretty much back to normal.

I turned on vdev caching

vfs.zfs.vdev.cache.size   from 0 to 10M.  I don't know if it helped.

I also lowered 

vfs.zfs.txg.timeout   from 5 to 3.   This seems to have had a slightly noticeable effect.

I also adjusted


The default of 0 (meaning system self set) seemed to result in an actual value of around 75-80% of RAM, which seemed high.   I ended up setting it at 3072M, which for me seems to work well.  Don't know what the overall effect on the problem was though.
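
For reference, the named settings above could be collected in /boot/loader.conf roughly as follows.  This is only a sketch: it assumes all four are accepted as boot-time tunables on a FreeBSD 9 system like this one and that the "M" size suffix is parsed for them.  Which of them can also be changed at runtime with sysctl(8) varies and should be checked on the target system.  The last adjustment is left out because its tunable name is not given above.

```
# /boot/loader.conf fragment (sketch; values are the ones reported above
# for a 6GB VM, not general recommendations)

# Prefetch: max number of blocks prefetched at a time (default 256)
vfs.zfs.zfetch.block_cap="64"

# Fixed per-txg write limit instead of the computed one (default 0 = disabled)
vfs.zfs.write_limit_override="576M"

# Per-vdev read cache size (was 0, i.e. off)
vfs.zfs.vdev.cache.size="10M"

# Maximum seconds between transaction group commits (was 5)
vfs.zfs.txg.timeout="3"
```

After a reboot, "sysctl vfs.zfs.zfetch.block_cap vfs.zfs.write_limit_override vfs.zfs.vdev.cache.size vfs.zfs.txg.timeout" should show the values actually in effect.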


