"zfs receive" lock time

Kevin kevin at your.org
Wed Dec 2 21:11:32 UTC 2009


I have two very fast systems (12-disk 15k RPM RAID array, 16 cores,
etc.). I'm using zfs send/receive to replicate a ZFS volume from
the "master" box to the "slave" box.

Every minute, the master takes a new snapshot, then uses "send -i" to  
send an incremental snapshot to the slave. Normally, no files are  
changed during the minute so the operation is very fast (<1 second,  
and most of that is ssh negotiation time).
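
For context, the cron job on the master is essentially the loop below. This
is a sketch, not my exact script; the dataset name, slave hostname, and
snapshot naming scheme are placeholders:

    #!/bin/sh
    # Runs from cron once a minute on the master.
    DATASET="tank/data"          # placeholder dataset name
    SLAVE="slave.example.org"    # placeholder slave hostname

    # Newest existing snapshot becomes the incremental source.
    PREV=$(zfs list -H -r -t snapshot -o name -s creation "$DATASET" | tail -1)

    # Take the new snapshot, then send only the delta to the slave.
    NEW="$DATASET@$(date -u +%Y%m%d%H%M%S)"
    zfs snapshot "$NEW"
    zfs send -i "$PREV" "$NEW" | ssh "$SLAVE" zfs receive -F "$DATASET"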

If the slave is completely idle, "zfs receive" takes a fraction of a
second. If the slave has been very busy (lots of read activity, no
writes - the slave has everything mounted read-only), "zfs receive"
can suddenly take 30 seconds or more to complete, and it holds the
filesystem locked the whole time. For example, I'd see:

49345 root        1  76    0 13600K  1956K zio->i  9   0:01  1.37% zfs
48910 www         1  46    0 36700K 21932K rrl->r  3   0:24  0.00% lighttpd
48913 www         1  46    0 41820K 26108K rrl->r  2   0:24  0.00% lighttpd
48912 www         1  46    0 37724K 23484K rrl->r  0   0:24  0.00% lighttpd
48911 www         1  46    0 41820K 26460K rrl->r 10   0:23  0.00% lighttpd
48909 www         1  46    0 39772K 24488K rrl->r  5   0:22  0.00% lighttpd
48908 www         1  46    0 36700K 21460K rrl->r 14   0:19  0.00% lighttpd
48907 www         1  45    0 30556K 16216K rrl->r 13   0:14  0.00% lighttpd
48906 www         1  44    0 26460K 11452K rrl->r  6   0:06  0.00% lighttpd

At first I thought it might be cache pressure: when the system was
busy, whatever data is needed to create the new snapshot was getting
pushed out of the cache and had to be re-read. I increased arc_max
and arc_meta_limit to very high values, but it seemed to have no
effect, even when arc_meta_used was far below arc_meta_limit.
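
Concretely, that meant bumping the usual loader tunables, roughly as below;
the numbers are illustrative, not what's actually in my loader.conf:

    # /boot/loader.conf -- values in bytes, chosen here only for illustration
    vfs.zfs.arc_max="12884901888"        # ~12 GB total ARC
    vfs.zfs.arc_meta_limit="4294967296"  # ~4 GB of that for metadata
    # runtime check: sysctl vfs.zfs.arc_meta_used vfs.zfs.arc_meta_limit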

Disabling cache flushes had no impact. Disabling the ZIL cut the time
in half, but it's still too long for this application.
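
By "disabling cache flushes" and "disabling the ZIL" I mean the usual
tunables for this vintage of ZFS, roughly:

    # /boot/loader.conf (debugging knobs only, not production settings)
    vfs.zfs.cache_flush_disable="1"   # don't send cache-flush commands to the disks
    vfs.zfs.zil_disable="1"           # bypass the ZIL entirely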

ktrace of the "zfs receive" process shows:

   1062 zfs      0.000024 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffffa320)
   1062 zfs      0.000081 RET   ioctl 0
   1062 zfs      0.000058 CALL  ioctl(0x3,0xcc285a05 ,0x7fffffffa2f0)
   1062 zfs      0.000037 RET   ioctl 0
   1062 zfs      0.000019 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffffa320)
   1062 zfs      0.000055 RET   ioctl 0
   1062 zfs      0.000031 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffff9f00)
   1062 zfs      0.000053 RET   ioctl 0
   1062 zfs      0.000020 CALL  ioctl(0x3,0xcc285a1c ,0x7fffffffc930)
   1062 zfs      24.837084 RET   ioctl 0
   1062 zfs      0.000028 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffff9f00)
   1062 zfs      0.000074 RET   ioctl 0
   1062 zfs      0.000037 CALL  close(0x6)
   1062 zfs      0.000006 RET   close 0
   1062 zfs      0.000007 CALL  close(0x3)
   1062 zfs      0.000005 RET   close 0

The 24-second call to 0xcc285a1c is ZFS_IOC_RECV, so whatever is going
on is happening in the kernel, not a delay in getting data to the
kernel. "systat" shows the drives are 100% busy during the operation,
so it's obviously doing something. :)
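
In case anyone wants to look at the same thing themselves: the trace above
is ordinary ktrace/kdump output, gathered along these lines (dataset and
file names are placeholders):

    # On the slave, wrap the receive in ktrace, then dump with relative timestamps.
    ktrace -f /tmp/zfsrecv.kt zfs receive -F tank/data < /tmp/incremental.stream
    kdump -R -f /tmp/zfsrecv.kt | grep -E 'ioctl|close'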

Does anyone know what "zfs receive" is doing while it has everything
locked like this, and why a lot of read activity beforehand would so
drastically affect its performance?

-- Kevin


