zfs receive stalls whole system

Tue May 17 09:08:21 UTC 2016

On 2016-05-17 10:27, Fabian Keil wrote:
> Rainer Duffner <rainer at ultra-secure.de> wrote:
> 
>> I have two servers that were running FreeBSD 10.1-AMD64 for a long 
>> time, one zfs-sending to the other (via zxfer). Both are NFS-servers 
>> and MySQL-slaves, the sender is actively used as NFS-server, the 
>> recipient is just a warm-standby, in case something serious happens 
>> and we don’t want to wait for a day until the restore is back in 
>> place. The MySQL-Slaves are actively used as read-only servers (at the 
>> application level, Python’s SQL-Alchemy does that, apparently).
>> 
>> They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think 
>> one has 144, the other has 192).
>> While they were running 10.1, they used HP P420 RAID controllers with 
>> 12 individual RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.
>> I use zfsnap to do hourly, daily and weekly snapshots.
> [...]
>> Now, when I do a zxfer, sometimes the whole system stalls while the 
>> data is sent over, especially if the delta is large or if something 
>> else is reading from the disk at the same time (backup agent).
>> 
>> I had this before, on 10.0 I believe (we didn’t have this in 9.1, 
>> IIRC), and it went away in 10.1.
> 
> Do you use geli for swap device(s)?

Yes, I do.
/dev/mirror/swap.eli		none	swap	sw		0	0

Bad idea?

>> It’s very difficult (well, impossible) to debug, because the system 
>> totally hangs and doesn’t accept any keypresses.
> 
> You could try reducing ZFS's deadman timeout to get a panic.
> On systems with local disks I usually use:
> 
> vfs.zfs.deadman_enabled: 1
> vfs.zfs.deadman_checktime_ms: 5000
> vfs.zfs.deadman_synctime_ms: 10000

Too bad I don't have a spare system I could use to test this ;-)
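
For reference, a minimal sketch of how the deadman values quoted above
might be made persistent. It assumes they are set as loader tunables in
/boot/loader.conf; that file and the quoting style are not taken from
the thread. Whether each value can also be changed on a running system
with sysctl(8) depends on the FreeBSD version, so checking with
"sysctl -d vfs.zfs | grep deadman" first seems prudent.

```
# /boot/loader.conf (assumed location; values are the ones quoted above)
# Panic instead of hanging silently when ZFS I/O stalls.
vfs.zfs.deadman_enabled="1"
# Check for stalled I/O every 5 seconds.
vfs.zfs.deadman_checktime_ms="5000"
# Treat an I/O as stalled after 10 seconds.
vfs.zfs.deadman_synctime_ms="10000"
```

These timeouts were suggested for systems with local disks, so they may
be too aggressive for slower or remote storage.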