kern/177536: zfs livelock (deadlock) with high write-to-disk load

Kevin Day toasty at dragondata.com
Thu Apr 4 16:05:22 UTC 2013


I'm not sure if I'm experiencing the same thing, but I'm chiming in in case this helps someone.

We have a server that's configured as follows:

9.1-RELEASE amd64, dual Opteron CPU, 64GB of memory.

nVidia nForce MCP55 SATA controller
* 1x 240GB SSD used for ZFS L2ARC

mps LSI 9207-8e (LSISAS2308 chip)
* Connected to 4 external enclosures, each with 24 3TB drives for a total of 96 3TB drives, running ZFS in a JBOD configuration

twa 3ware 9650SE-12i
* Connected 1:1 (no expander) to 12 internal 500GB drives, running UFS for / and a secondary UFS filesystem


When there's very heavy write load to the giant ZFS filesystem (more than 2 Gbps of incoming data being written), I eventually reach some kind of deadlock where I can't do anything that touches any of the block devices.

Processes that attempt to access any filesystem (ZFS or UFS) get stuck in the 'ufs', 'getblk', 'vnread', or 'tx->tx' wait channels. A shell is still responsive, and I can run commands as long as they're already cached. Trying to run anything that wasn't cached before the problem started hangs that shell too.
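In case it's useful, here's roughly how I'm spotting the stuck processes; the grep pattern is just what I've been typing, nothing special:

    # the wait channel shows up in the WCHAN column
    ps -axo pid,state,wchan,command | grep -E 'ufs|getblk|vnread|tx->tx'

    # kernel stack of one of the stuck processes
    procstat -kk <pid>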

'gstat' shows that most (all?) of the disk devices have outstanding requests queued, but 0% busy and no activity happening.


This only seems to happen under heavy ZFS writes. Heavy ZFS reads or heavy UFS writes do not trigger it. Slowing down the ZFS writes prevents the problem from occurring.
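For a sense of scale, the >2 Gbps figure comes from watching pool-wide bandwidth with zpool iostat; the pool name below is just a placeholder, not our real one:

    # one-second samples of pool read/write throughput
    zpool iostat bigpool 1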

At first I thought this was a controller hang, but seeing devices on three different controllers all end up stuck with outstanding requests leaves me confused as to how that could even happen. Nothing gets logged to the console when this happens.

Things I've tried already:

1) Removed the SSD entirely
2) Set sync=disabled on the filesystem (zfs set sync=disabled fs)
3) Let the system sit (90 minutes) to see if it would recover
4) Swapped the motherboard/CPUs/memory for an identically configured system
5) Switched from an LSI 9280 (mpt) to an LSI 9207 (mps)
6) Updated firmware on the storage cards, updated the BIOS on the motherboard

Fair disclosure: these Opterons do have the TLB bug (AMD erratum 298), but the BIOS workaround for it is enabled. We've got dozens of systems identical to this one and aren't seeing any weird hangs anywhere else, so I'm assuming this isn't it.

The problem is that this is a production system, which doesn't give me a lot of time for troubleshooting before I'm forced to reboot it. I'm going to try to get procstat to stay in the cache so that next time this happens I can run it. If there's anything else anyone would like me to capture when this happens again, I'm happy to try.
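Roughly, what I have in mind is a capture loop like the one below, kept running ahead of time so the binaries stay resident. The output path, the interval, and the idea of logging to a small md(4)/tmpfs mount (so the log write itself can't block on the stuck disks) are just my assumptions at this point:

    #!/bin/sh
    # snapshot system state every minute; also keeps procstat/ps/iostat cached
    # /mnt/md is assumed to be a memory-backed (md/tmpfs) mount so writing
    # the log doesn't depend on UFS or ZFS making progress
    while :; do
        date
        procstat -kk -a     # kernel stack of every thread
        ps axlww            # full process list, including wait channels
        iostat -x           # per-device queue and busy statistics
        sleep 60
    done >> /mnt/md/hang-capture.log 2>&1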


-- Kevin


