Strange slowdown when cache devices enabled in ZFS
Freddie Cash
fjwcash at gmail.com
Thu Mar 14 18:13:40 UTC 2013
3 storage systems are running this:
# uname -a
FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M:
Fri Feb 1 09:38:24 PST 2013
root at alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64
1 storage system is running this:
# uname -a
FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M:
Mon Mar 4 10:27:26 PST 2013
root at omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64
The last system has the ZFS "deadman" patch (r247265 from -CURRENT) manually
merged in.
All 4 systems exhibit the same symptoms: if a cache device is enabled in
the pool, the l2arc_feed_thread of zfskern will spin until it takes up 100%
of a CPU core, at which point all I/O to the pool stops. "zpool iostat 1"
and "zpool iostat -v 1" show 0 reads and 0 writes to the pool. "gstat -I
1s -f gpt" shows 0 activity to the pool disks.
If I remove the cache device from the pool, I/O starts up right away
(although it takes several minutes for the remove operation to complete).
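(The remove itself is just the standard command; the GPT label below is only
an example of how the cache partition is labelled on these boxes:)
# zpool remove poolname gpt/cache0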
During the "0 I/O period", any attempt to access the pool "hangs". CTRL+T
shows either spa_namespace_lock or tx->tx_something or other (the one when
trying to write a transaction to disk). And it will stay like that until
the cache device is removed.
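(procstat gives the full kernel stack of whatever is stuck, which is where
those wait channel names come from; the pid is of course just an example:)
# procstat -kk 1234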
Hardware is almost the same in all 4 boxes:
3x storage boxes:
alphadrive:
SuperMicro H8DGi-F motherboard
AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
64 GB of DDR3 ECC SDRAM in one box
32 GB SSD for the OS and cache device (GPT partitioned)
24x 2.0 TB WD and Seagate SATA hard drives (4x 6-drive raidz2 vdevs)
SuperMicro AOC-USAS-8i SATA controller using mpt driver
SuperMicro 4U chassis
betadrive:
SuperMicro H8DGi-F motherboard
AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
48 GB of DDR3 ECC SDRAM in one box
32 GB SSD for the OS and cache device (GPT partitioned)
16x 2.0 TB WD and Seagate SATA hard drives (3x 5-drive raidz2 vdevs +
spare)
SuperMicro AOC-USAS2-8i SATA controller using mps driver
SuperMicro 3U chassis
zuludrive:
SuperMicro H8DGi-F motherboard
AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
32 GB of DDR3 ECC SDRAM in one box
32 GB SSD for the OS and cache device (GPT partitioned)
24x 2.0 TB WD and Seagate SATA hard drives (4x 6-drive raidz2 vdevs)
SuperMicro AOC-USAS2-8i SATA controller using mps driver
SuperMicro 836 chassis
1x storage box:
omegadrive:
SuperMicro H8DG6-F motherboard
2x AMD Opteron 6128 CPU (8 cores at 2.0 GHz; 16 cores total)
128 GB of DDR3 ECC SDRAM in one box
2x 60 GB SSD for the OS (gmirror'd) and log devices (ZFS mirror)
2x 120 GB SSD for cache devices
45x 2.0 TB WD and Seagate SATA hard drives (7x 6-drive raidz2 vdevs + 3
spares)
LSI 9211-8e SAS controllers using mps driver
Onboard LSI 2008 SATA controller using mps driver for OS/log/cache
SuperMicro 4U JBOD chassis
SuperMicro 2U chassis for motherboard/OS
alphadrive, betadrive, and omegadrive all have dedup and lzjb compression
enabled.
zuludrive has lzjb compression enabled (no dedup).
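(Nothing exotic in how those are set, just the usual properties, e.g.:)
# zfs set compression=lzjb poolname
# zfs set dedup=on poolname     (everywhere except zuludrive)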
alpha/beta/zulu do rsync backups every night from various local and remote
Linux and FreeBSD boxes, then zfs send the resulting snapshots to omegadrive
during the day. The "0 I/O" periods occur most often and most quickly on
omegadrive when receiving snapshots, but will eventually occur on all
systems during the rsyncs.
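(The replication itself is nothing fancy; it boils down to roughly the
following, with the pool and snapshot names changed to examples:)
# zfs snapshot -r storage@2013.03.14
# zfs send -R storage@2013.03.14 | ssh omegadrive zfs recv -dF backups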
Things I've tried:
- limiting ARC to only 32 GB on each system
- limiting L2ARC to 30 GB on each system
- enabling the "deadman" patch in case it was I/O requests being lost by
the drives/controllers
- changing primarycache between all and metadata
- increasing arc_meta_limit to just shy of arc_max
- removing cache devices completely
So far, only the last option works. Without L2ARC, the systems are 100%
stable, and can push 200 MB/s of rsync writes and just shy of 500 MB/s of
ZFS recv (the gigabit link is saturated and the writes go out in bursts;
continuous writes usually hover around 50-80 MB/s).
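For reference, the ARC-related tuning from the list above boils down to
roughly the following (byte values from memory, "poolname" is just a
placeholder, and the 30 GB L2ARC limit is simply the size of the cache
partition on the SSD):
/boot/loader.conf:
# cap the ARC at 32 GB
vfs.zfs.arc_max="34359738368"
# arc_meta_limit bumped up to just shy of arc_max (~31 GB)
vfs.zfs.arc_meta_limit="33285996544"
At runtime:
# zfs set primarycache=metadata poolname    (toggled between metadata and all)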
I'm baffled. An L2ARC is supposed to make things faster, especially when
using dedup, since the DDT can be cached there.
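(For what it's worth, the in-core footprint of the DDT on these pools can be
eyeballed with zdb / zpool status; again, poolname is a placeholder:)
# zdb -DD poolname
# zpool status -D poolname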
--
Freddie Cash
fjwcash at gmail.com