zfs receive stalls whole system

Tue May 17 10:47:00 UTC 2016

On Tue, 17 May 2016 12:44:50 +0200, Ronald Klop <ronald-lists at klop.ws>  
wrote:

> On Tue, 17 May 2016 01:07:24 +0200, Rainer Duffner  
> <rainer at ultra-secure.de> wrote:
>
>> Hi,
>>
>> I have two servers that were running FreeBSD 10.1-AMD64 for a long
>> time, one zfs-sending to the other (via zxfer). Both are NFS servers
>> and MySQL slaves. The sender is actively used as an NFS server; the
>> recipient is just a warm standby, in case something serious happens and
>> we don’t want to wait a day until the restore is back in place. The
>> MySQL slaves are actively used as read-only servers (at the application
>> level; Python’s SQLAlchemy does that, apparently).
>>
>> They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think  
>> one has 144, the other has 192).
>> While they were running 10.1, they used HP P420 RAID controllers with
>> 12 individual RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.
>> I use zfsnap to do hourly, daily and weekly snapshots.
>>
>> Sending worked well, especially after updating to 10.1
>>
>> Because the storage was over 90% full (and I really hate this
>> RAID0 business we have with the HP RAID controllers), I rebuilt the
>> servers with HP’s OEMed H220/221 controllers (LSI 2308 in disguise),
>> added an external disk shelf hosting 12 additional disks, and
>> upgraded to FreeBSD 10.3.
>> Because we didn’t want to throw out the original disks, but increase  
>> available space a lot, the new disks are double the size of the  
>> original disks (600 vs. 1200 GB SAS).
>> I also created GPT partitions on the disks, labeled them according
>> to the disk’s position in the cages/shelf, and created the pools with the
>> GPT partition names instead of the daX names.
>>
>> Now, when I do a zxfer, sometimes the whole system stalls while the  
>> data is sent over, especially if the delta is large or if something  
>> else is reading from the disk at the same time (backup agent).
>>
>> I had this before, on 10.0, I believe (we didn’t have it in 9.1,
>> IIRC), and it went away in 10.1.
>>
>> It’s very difficult (well, impossible) to debug, because the system  
>> totally hangs and doesn’t accept any keypresses.
>>
>> Would a ZIL help in this case?
>> I always thought that NFS was the only thing that did SYNC writes…
>
> Databases love SYNC writes too. (But that doesn't say anything about the  
> unresponsive system).
> I think there is a statistic somewhere in FreeBSD to analyze the sync vs  
> async writes and decide if a ZIL will help or not. (But that doesn't say  
> anything about the unresponsive system either).
>
> Ronald.

One question: you did not enable dedup(lication), did you?
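
If it helps, here is a minimal sketch for checking this, assuming Python 3 is available on the box. The pool name `tank` and the helper names `parse_dedup` and `datasets_with_dedup` are made up for illustration. It runs `zfs get` in scripted (tab-separated) mode and lists every dataset whose dedup property is set to something other than "off".

```python
import subprocess


def parse_dedup(output):
    """Return dataset names whose dedup value is neither 'off' nor '-'.

    `output` is expected to be the tab-separated text produced by
    `zfs get -H -o name,value dedup ...` (one "name<TAB>value" per line).
    Snapshots report '-' because the property does not apply to them.
    """
    enabled = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t")[:2]
        if value.strip() not in ("off", "-"):
            enabled.append(name)
    return enabled


def datasets_with_dedup(pool):
    """Run `zfs get` recursively on `pool` and parse the result."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-r", "-o", "name,value", "dedup", pool],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_dedup(result.stdout)


# Example (hypothetical pool name, run as a user allowed to call zfs):
#     print(datasets_with_dedup("tank"))
```

An empty list would mean dedup is off everywhere in that pool. This only tells you about the current setting, not about blocks written while dedup was on earlier.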

Ronald.