"zfs receive" lock time

Pawel Jakub Dawidek pjd at FreeBSD.org
Fri Dec 4 20:21:43 UTC 2009


On Wed, Dec 02, 2009 at 02:55:23PM -0600, Kevin wrote:
> 
> I have two very very fast systems (12-disk 15krpm raid array, 16  
> cores, etc). I'm using zfs send/receive to replicate a zfs volume from  
> the "master" box to the "slave" box.
> 
> Every minute, the master takes a new snapshot, then uses "send -i" to  
> send an incremental snapshot to the slave. Normally, no files are  
> changed during the minute so the operation is very fast (<1 second,  
> and most of that is ssh negotiation time).
> 
> If the slave is completely idle, "zfs receive" takes a fraction of a  
> second. If the slave has been very busy (lots of read activity, no  
> writes - the slave has everything mounted read only), suddenly "zfs  
> receive" can take 30 seconds or more to complete, the whole time it  
> has the filesystem locked. For example, I'd see:
> 
> 49345 root        1  76    0 13600K  1956K zio->i  9   0:01  1.37% zfs
> 48910 www         1  46    0 36700K 21932K rrl->r  3   0:24  0.00% lighttpd
> 48913 www         1  46    0 41820K 26108K rrl->r  2   0:24  0.00% lighttpd
> 48912 www         1  46    0 37724K 23484K rrl->r  0   0:24  0.00% lighttpd
> 48911 www         1  46    0 41820K 26460K rrl->r 10   0:23  0.00% lighttpd
> 48909 www         1  46    0 39772K 24488K rrl->r  5   0:22  0.00% lighttpd
> 48908 www         1  46    0 36700K 21460K rrl->r 14   0:19  0.00% lighttpd
> 48907 www         1  45    0 30556K 16216K rrl->r 13   0:14  0.00% lighttpd
> 48906 www         1  44    0 26460K 11452K rrl->r  6   0:06  0.00% lighttpd
> 
> At first, I thought it was possibly cache pressure: when the system
> was busy, whatever data was necessary to create a new snapshot was
> getting pushed out of the cache, so it had to be re-read. I increased
> arc_max and arc_meta_limit to very high values, and it seemed to have
> no effect, even when arc_meta_used was far below arc_meta_limit.
> 
> Disabling cache flushes had no impact. Disabling zil cut the time in  
> half, but it's still too long for this application.
> 
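For reference, the knobs this kind of tuning usually touches on FreeBSD
of that era are sketched below. The tunable names are the usual vfs.zfs
ones, but the values are purely illustrative, and whether arc_meta_limit
is a boot-time or run-time knob depends on the ZFS version in use.

	# /boot/loader.conf -- ARC sizing is normally set at boot time
	vfs.zfs.arc_max="12884901888"         # 12 GB, example value only
	vfs.zfs.arc_meta_limit="4294967296"   # 4 GB, example value only

	# run-time sysctls matching the experiments described above
	slave# sysctl vfs.zfs.cache_flush_disable=1  # "disabling cache flushes"
	slave# sysctl vfs.zfs.zil_disable=1          # "disabling zil"
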
> ktrace on the "zfs receive" shows:
> 
>   1062 zfs      0.000024 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffffa320)
>   1062 zfs      0.000081 RET   ioctl 0
>   1062 zfs      0.000058 CALL  ioctl(0x3,0xcc285a05 ,0x7fffffffa2f0)
>   1062 zfs      0.000037 RET   ioctl 0
>   1062 zfs      0.000019 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffffa320)
>   1062 zfs      0.000055 RET   ioctl 0
>   1062 zfs      0.000031 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffff9f00)
>   1062 zfs      0.000053 RET   ioctl 0
>   1062 zfs      0.000020 CALL  ioctl(0x3,0xcc285a1c ,0x7fffffffc930)
>   1062 zfs      24.837084 RET   ioctl 0
>   1062 zfs      0.000028 CALL  ioctl(0x3,0xcc285a11 ,0x7fffffff9f00)
>   1062 zfs      0.000074 RET   ioctl 0
>   1062 zfs      0.000037 CALL  close(0x6)
>   1062 zfs      0.000006 RET   close 0
>   1062 zfs      0.000007 CALL  close(0x3)
>   1062 zfs      0.000005 RET   close 0
> 
> The 24-second call to 0xcc285a1c is ZFS_IOC_RECV, so whatever is going  
> on is in the kernel, not a delay in getting data to the kernel.  
> "systat" is showing that the drives are 100% busy during the  
> operation, so it's obviously doing something. :)
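A trace like the one above can be captured with something along these
lines (the stream file name is just an example; in practice the receive
would be fed from the ssh pipe instead):

	slave# ktrace -f /tmp/recv.ktr zfs recv pool/fs < /tmp/incr.stream
	slave# kdump -R -f /tmp/recv.ktr | egrep 'ioctl|close'

kdump -R prints the relative timestamps shown above, which is what makes
the 24-second ZFS_IOC_RECV call stand out.
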
> 
> Does anyone know what "zfs receive" is doing while it has everything  
> locked like this, and why a lot of read activity beforehand would  
> drastically affect the performance of doing this?

The read activity is on the dataset on the slave that is being
received, is that right?

There are two operations that can suspend your file system this way:
rollback and receive. The suspension is done by acquiring a write lock
on the given file system, while every other operation acquires a read
lock. So before receive can take the write lock, it has to wait for all
outstanding read operations to finish.
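If you want to confirm that this is what the readers are blocked on,
and the slave runs FreeBSD 8.0 or newer, you can look at the kernel
stack of one of the stalled processes from the top output above while a
slow receive is in progress:

	# the process should be asleep in the ZFS reader/writer ("rrl")
	# lock, which is also what the rrl->r wait channel in top refers to
	slave# procstat -k 48910
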

I'm not sure how your applications use it, but if files are only open
for a short period of time and then closed, you could do something like
this:

	master# curtime=`date "+%Y%m%d%H%M%S"`
	master# zfs snapshot pool/fs@${curtime}
	# ${oldtime} is the timestamp of the previous snapshot
	master# zfs send -i pool/fs@${oldtime} pool/fs@${curtime} | \
		ssh slave zfs recv pool/fs
	# clone the freshly received snapshot and repoint a stable path at it
	# (-h so the symlink itself is replaced instead of being followed)
	slave# zfs clone pool/fs@${curtime} pool/fs_${curtime}
	slave# ln -hfs /pool/fs_${curtime} /pool/usethis

Then point your application at the directory /pool/usethis/ (the clone,
instead of the received file system) and clean up old clones as you
wish. Read activity on the clones shouldn't affect the received file
system.
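For the cleanup, a minimal sketch could look like the following; it
assumes the default mountpoints (i.e. pool/fs_X is mounted on
/pool/fs_X) and destroys every clone except the one the symlink
currently points to:

	slave# keep=`readlink /pool/usethis`
	slave# for fs in `zfs list -H -o name -t filesystem -r pool | \
			grep '^pool/fs_'`; do
			[ "/${fs}" = "${keep}" ] && continue  # still in use
			zfs destroy ${fs}
		done

If your application still holds files open on an old clone, that zfs
destroy will fail because the dataset is busy; retrying on the next run
is usually good enough.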

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!