ZFS on 10-STABLE r281159: programs, accessing ZFS pauses for minutes in state [*kmem arena]

Steven Hartland killing at multiplay.co.uk
Thu Jul 30 15:41:19 UTC 2015



On 30/07/2015 15:41, Paul Kraus wrote:
> On Jul 30, 2015, at 7:49, Steven Hartland <killing at multiplay.co.uk> wrote:
>
>> On 30/07/2015 12:30, Lev Serebryakov wrote:
>>>   Deduplication IS TURNED OFF. atime is turned off. Record size set to 1M as
>>> I have a lot of big files (movies, RAW photo from DSLR, etc). Compression is
>>> turned off.
>> You don't need to do that as record set size is a min not a max, if you don't force it large files will still be stored efficiently.
> Can you point to documentation for that ?
Ignore my previous comment there; I was clearly having a special moment.

recordsize sets the suggested block size, which is effectively the
largest block size for a given file. It's generally not about efficient
storage but about efficient access, so that's what you usually want to
consider, except in extreme cases.
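
If it helps, checking and changing it per dataset looks roughly like
this (tank/media is just a placeholder dataset name, adjust to your
layout):

    # show the current value and where it was inherited/set from
    zfs get recordsize tank/media

    # raise it to 1M for this dataset only; only data written after the
    # change uses the new block size, existing files are unaffected
    zfs set recordsize=1M tank/media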

If you set recordsize to 1MB you get large block support which is 
detailed here:
https://reviews.csiden.org/r/51/
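
Before relying on it, it's worth confirming the pool actually has the
feature; a quick sketch (the pool name tank is again just an example):

    # should report "enabled" or "active"
    zpool get feature@large_blocks tank

    # enable it on pools created before the feature existed; once the
    # feature becomes active, older software can no longer import the pool
    zpool set feature@large_blocks=enabled tank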

Key info from this:

Recommended uses center around improving performance of random reads of
large blocks (>= 128KB):
 - files that are randomly read in large chunks (e.g. video files when
   streaming many concurrent streams such that prefetch can not
   effectively cache data); performance will be improved in this case
   because random 1MB reads from rotating disks have higher bandwidth
   than random 128KB reads.
 - typically, performance of scrub/resilver is improved, especially
   with RAID-Z
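
A rough back-of-the-envelope shows why; the seek time and transfer
rate below are assumed, typical-ish figures for a rotating disk, not
measurements:

    per random read: ~10 ms seek + rotational latency, ~150 MB/s transfer

    random 128KB read: 10 ms + ~0.85 ms transfer ~ 10.9 ms  -> ~12 MB/s
    random 1MB read:   10 ms + ~6.7 ms  transfer ~ 16.7 ms  -> ~60 MB/s

The larger reads spend proportionally less time seeking, which is where
the bandwidth win comes from.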


The tradeoffs to consider when using large blocks include:
 - accessing large blocks tends to increase latency of all operations,
   because even small reads will need to get in line behind large
   reads/writes
 - sub-block writes (i.e. writes to 128KB of a 1MB block) will incur an
   even larger read-modify-write penalty
 - the last, partially-filled block of each file will be larger, wasting
   memory, and if compression is not enabled, disk space (expected waste
   is 1/2 the recordsize per file, assuming random file length)
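
To put a number on that last point, a quick sketch (the file count is
made up purely for illustration): with random file lengths the tail
block wastes on average half a record per file, so

    10,000 files * 0.5 * 1MB   ~  5 GB of slack at recordsize=1M
    10,000 files * 0.5 * 128KB ~  640 MB of slack at the default 128K

which is why enabling compression largely takes the sting out of the
disk-space part of it.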



recordsize is documented in the man page:
https://www.freebsd.org/cgi/man.cgi?query=zfs&apropos=0&sektion=8&manpath=FreeBSD+10.2-stable&arch=default&format=html

> I really hope that the 128KB default is not a minimum record size or a 1KB file will take up 128 KB of FS space.
Setting recordsize sets the suggested block size used, so if you set
1MB then the minimum size a file can occupy is 1MB even if it's only a
512b file.
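
If in doubt, it's easy to check what a small file actually allocates on
a 1M-recordsize dataset, something along these lines (the path is just
an example):

    # write a single 512-byte file onto the dataset
    dd if=/dev/random of=/tank/media/tiny bs=512 count=1

    # first column of ls -ls is blocks actually allocated; compare that
    # with the logical size and with du's view of on-disk usage
    ls -ls /tank/media/tiny
    du -h /tank/media/tiny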
> As far as I know, zfs recordsize has always, since the very beginning of ZFS under Solaris, been the MAX recordsize, but it is also a hint and not a fixed value. ZFS will write any size records (powers of 2) from 512 bytes (4 KB in the case of an ashift=12 pool) up to recordsize. Tuning of recordsize has been frowned upon since the beginning unless you _know_ the size of your writes and they are fixed (like 8 KB database records).
>
> Also note that ZFS will fit the write to the pool in the case of RAIDz<n>, see Matt Ahrens' blog entry here: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
Another nice article on this can be found here:
https://www.joyent.com/blog/bruning-questions-zfs-record-size

         Regards
     Steve

