zfs raidz overhead

Wiktor Niesiobedzki bsd at vink.pl
Wed Feb 22 21:50:05 UTC 2017


I can add that this is not only seen on raidz but also on mirror
pools, for example:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h22m with 0 errors on Thu Feb  9 06:47:07 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/tank1.eli  ONLINE       0     0     0
            gpt/tank2.eli  ONLINE       0     0     0

errors: No known data errors


When I created test zvols:
# zfs create -V10gb -o volblocksize=8k tank/tst-8k
# zfs create -V10gb -o volblocksize=16k tank/tst-16k
# zfs create -V10gb -o volblocksize=32k tank/tst-32k
# zfs create -V10gb -o volblocksize=64k tank/tst-64k
# zfs create -V10gb -o volblocksize=128k tank/tst-128k

# zfs get used tank/tst-8k
NAME         PROPERTY  VALUE  SOURCE
tank/tst-8k  used      10.3G  -
root@kadlubek:~ # zfs get used tank/tst-16k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-16k  used      10.2G  -
root@kadlubek:~ # zfs get used tank/tst-32k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-32k  used      10.1G  -
root@kadlubek:~ # zfs get used tank/tst-64k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-64k  used      10.0G  -
root@kadlubek:~ # zfs get used tank/tst-128k
NAME           PROPERTY  VALUE  SOURCE
tank/tst-128k  used      10.0G  -
root@kadlubek:~ #

So this does not seem to be limited to raidz pools.
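
For what it's worth, the numbers above line up with a back-of-envelope
estimate of metadata overhead. These zvols are freshly created and
thick-provisioned, so their used value is essentially the refreservation,
which as far as I understand is sized as volsize plus worst-case
indirect-block metadata: one 128-byte block pointer per volblocksize-sized
block, stored twice with the default redundant_metadata=all. A rough Python
sketch (the helper name and exact constants are my assumptions, not anything
taken from the ZFS code):

GIB = 1 << 30

def zvol_metadata_estimate(volsize=10 * GIB, volblocksize=8 * 1024,
                           blkptr_size=128, metadata_copies=2):
    """Worst-case indirect-block metadata for a thick zvol (rough estimate)."""
    nblocks = volsize // volblocksize               # number of data blocks
    return nblocks * blkptr_size * metadata_copies  # bytes of metadata

for vbs_k in (8, 16, 32, 64, 128):
    overhead = zvol_metadata_estimate(volblocksize=vbs_k * 1024)
    print("volblocksize=%3dk  ~%.2f GiB metadata on top of 10 GiB"
          % (vbs_k, overhead / GIB))

That gives roughly 0.31G, 0.16G, 0.08G, 0.04G and 0.02G of overhead, which
matches the 10.3G / 10.2G / 10.1G / 10.0G / 10.0G figures above.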

I also noticed that snapshots impact the used value far more than the
usedbysnapshots value would suggest:

# zfs get volsize,used,referenced,compressratio,volblocksize,usedbysnapshots,usedbydataset,usedbychildren tank/dkr-thinpool
NAME               PROPERTY         VALUE      SOURCE
tank/dkr-thinpool  volsize          10G        local
tank/dkr-thinpool  used             12.0G      -
tank/dkr-thinpool  referenced       1.87G      -
tank/dkr-thinpool  compressratio    1.91x      -
tank/dkr-thinpool  volblocksize     64K        -
tank/dkr-thinpool  usedbysnapshots  90.4M      -
tank/dkr-thinpool  usedbydataset    1.87G      -
tank/dkr-thinpool  usedbychildren   0          -


On a 10G volume filled with about 2G of data, with only 90M used by
snapshots, used is 12.0G. When I destroy the snapshots, used drops to 10.0G.
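
My reading of this (which may well be off) is that used on a thick zvol is
the sum of usedbydataset, usedbysnapshots, usedbychildren and
usedbyrefreservation, and that as soon as snapshots exist the refreservation
has to cover a full volsize of fresh writes, because overwritten blocks that
a snapshot still references cannot be freed. A quick sanity check of that
reading against the numbers above:

volsize = 10.0            # G
usedbydataset = 1.87      # G, from the output above
usedbysnapshots = 0.09    # G (90.4M)

# With snapshots: refreservation must cover a full volsize of new writes.
used_with_snapshots = usedbydataset + usedbysnapshots + volsize
# Without snapshots: it only covers volsize minus what is already written.
usedbyrefreservation = volsize - usedbydataset
used_without_snapshots = usedbydataset + usedbyrefreservation

print("with snapshots:    ~%.1fG" % used_with_snapshots)     # ~12.0G
print("without snapshots: ~%.1fG" % used_without_snapshots)  # ~10.0G

If that's right, the extra ~2G is charged to usedbyrefreservation rather
than to usedbysnapshots, which would explain why usedbysnapshots looks so
small.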

Cheers,

Wiktor

2017-02-22 0:31 GMT+01:00 Eric A. Borisch <eborisch at gmail.com>:
> On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz at norma.perm.ru>
> wrote:
>
>
>
> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
> [1]
>
> It's a user story who encountered that under some situations zfs on
> raidz could use up to 200% of the space for a zvol.
>
> I have also seen this. For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 volsize 2,50T local
>  [root@san1:~]# zfs get all gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 type volume -
>  gamestop/reference1 creation Thu Nov 24 9:09 2016 -
>  gamestop/reference1 used 4,38T -
>  gamestop/reference1 available 1,33T -
>  gamestop/reference1 referenced 4,01T -
>  gamestop/reference1 compressratio 1.00x -
>  gamestop/reference1 reservation none default
>  gamestop/reference1 volsize 2,50T local
>  gamestop/reference1 volblocksize 8K -
>  gamestop/reference1 checksum on default
>  gamestop/reference1 compression off default
>  gamestop/reference1 readonly off default
>  gamestop/reference1 copies 1 default
>  gamestop/reference1 refreservation none received
>  gamestop/reference1 primarycache all default
>  gamestop/reference1 secondarycache all default
>  gamestop/reference1 usedbysnapshots 378G -
>  gamestop/reference1 usedbydataset 4,01T -
>  gamestop/reference1 usedbychildren 0 -
>  gamestop/reference1 usedbyrefreservation 0 -
>  gamestop/reference1 logbias latency default
>  gamestop/reference1 dedup off default
>  gamestop/reference1 mlslabel -
>  gamestop/reference1 sync standard default
>  gamestop/reference1 refcompressratio 1.00x -
>  gamestop/reference1 written 4,89G -
>  gamestop/reference1 logicalused 2,72T -
>  gamestop/reference1 logicalreferenced 2,49T -
>  gamestop/reference1 volmode default default
>  gamestop/reference1 snapshot_limit none default
>  gamestop/reference1 snapshot_count none default
>  gamestop/reference1 redundant_metadata all default
>
> [root@san1:~]# zpool status gamestop
>  pool: gamestop
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME          STATE     READ WRITE CKSUM
>  gamestop      ONLINE       0     0     0
>    raidz1-0    ONLINE       0     0     0
>      da6       ONLINE       0     0     0
>      da7       ONLINE       0     0     0
>      da8       ONLINE       0     0     0
>      da9       ONLINE       0     0     0
>      da11      ONLINE       0     0     0
>
>  errors: No known data errors
>
> or, another server (overhead in this case isn't that big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
>  NAME PROPERTY VALUE SOURCE
>  data/reference1 type volume -
>  data/reference1 creation Fri Jan 6 11:23 2017 -
>  data/reference1 used 3.82T -
>  data/reference1 available 13.0T -
>  data/reference1 referenced 3.22T -
>  data/reference1 compressratio 1.00x -
>  data/reference1 reservation none default
>  data/reference1 volsize 2T local
>  data/reference1 volblocksize 8K -
>  data/reference1 checksum on default
>  data/reference1 compression off default
>  data/reference1 readonly off default
>  data/reference1 copies 1 default
>  data/reference1 refreservation none received
>  data/reference1 primarycache all default
>  data/reference1 secondarycache all default
>  data/reference1 usedbysnapshots 612G -
>  data/reference1 usedbydataset 3.22T -
>  data/reference1 usedbychildren 0 -
>  data/reference1 usedbyrefreservation 0 -
>  data/reference1 logbias latency default
>  data/reference1 dedup off default
>  data/reference1 mlslabel -
>  data/reference1 sync standard default
>  data/reference1 refcompressratio 1.00x -
>  data/reference1 written 498K -
>  data/reference1 logicalused 2.37T -
>  data/reference1 logicalreferenced 2.00T -
>  data/reference1 volmode default default
>  data/reference1 snapshot_limit none default
>  data/reference1 snapshot_count none default
>  data/reference1 redundant_metadata all default
>  [root@san01:~]# zpool status gamestop
>  pool: data
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME          STATE     READ WRITE CKSUM
>  data          ONLINE       0     0     0
>    raidz1-0    ONLINE       0     0     0
>      da3       ONLINE       0     0     0
>      da4       ONLINE       0     0     0
>      da5       ONLINE       0     0     0
>      da6       ONLINE       0     0     0
>      da7       ONLINE       0     0     0
>    raidz1-1    ONLINE       0     0     0
>      da8       ONLINE       0     0     0
>      da9       ONLINE       0     0     0
>      da10      ONLINE       0     0     0
>      da11      ONLINE       0     0     0
>      da12      ONLINE       0     0     0
>    raidz1-2    ONLINE       0     0     0
>      da13      ONLINE       0     0     0
>      da14      ONLINE       0     0     0
>      da15      ONLINE       0     0     0
>      da16      ONLINE       0     0     0
>      da17      ONLINE       0     0     0
>
>  errors: No known data errors
>
> So my question is: how do I avoid this? Right now I'm experimenting with
> the volblocksize, making it around 64k. I also suspect that such
> overhead may be a consequence of the various resizing operations, like
> extending the volsize of the volume or adding new disks into the pool,
> because I have a couple of servers with raidz where the initial
> disk/volsize configuration didn't change, and the referenced/volsize
> numbers are pretty close to each other.
>
> Eugene.
>
> Links:
> ------
> [1]
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
>
>
> It comes down to the zpool's sector size (2^ashift) and the volblocksize --
> I'm guessing your old servers are at ashift=9 (512), and the new one is at
> 12 (4096), likely with 4k drives. This is the smallest/atomic size of reads
> & writes to a drive from ZFS.
>
> As described in [1]:
>  * Allocations need to be a multiple of (p+1) sectors, where p is your
> parity level; for raidz1, p==1, and allocations need to be in multiples of
> (1+1)=2 sectors, or 8k (for ashift=12; this is the physical size /
> alignment on drive).
>  * It also needs to have enough parity for failures, so it also depends [2]
> on the number of drives in the pool at larger block/record sizes.
>
> So considering those requirements, and your zvol with volblocksize=8k and
> compression=off, allocations for one logical 8k block are always composed
> physically of two (4k) data sectors, one (p=1) parity sector (4k), and one
> padding sector (4k) to satisfy being a multiple of (p+1)=2, or 16k of
> allocated on-disk space, hence your observed 2x of the data size actually
> being allocated. Each of these sectors will be on a different drive. This is
> different from the sector-level parity in RAID5.
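
A quick sketch of that allocation rule for the 5-disk raidz1 above with
ashift=12 (the function and parameter names are mine, just to reproduce the
arithmetic from [1], not taken from the ZFS code):

import math

def raidz_alloc_bytes(logical_bytes, ashift=12, ndisks=5, nparity=1):
    sector = 1 << ashift
    data = math.ceil(logical_bytes / sector)                  # data sectors
    parity = math.ceil(data / (ndisks - nparity)) * nparity   # parity sectors
    total = data + parity
    total += -total % (nparity + 1)   # pad to a multiple of (nparity + 1)
    return total * sector

for vbs_k in (8, 16, 32, 64, 128):
    alloc = raidz_alloc_bytes(vbs_k * 1024)
    print("volblocksize=%3dk -> %4dk allocated (%.2fx)"
          % (vbs_k, alloc // 1024, alloc / (vbs_k * 1024)))

For 8k this gives 16k allocated (2.00x), matching the overhead Eugene sees;
from 32k upwards it settles at the ideal 1.25x for a 5-disk raidz1.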
>
> As Matthew Ahrens [1] points out: "Note that setting a small recordsize
> with 4KB sector devices results in universally poor space efficiency --
> RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."
>
> Things you can do:
>
>  * Use ashift=9 (and perhaps 512-byte sector drives). The same layout rules
> still apply, but now your 'atomic' size is 512b. You will want to test
> performance.
>  * Use a larger volblocksize, especially if the filesystem on the zvol uses
> a larger block size. If you aren't performance sensitive, use a larger
> volblocksize even if the hosted filesystem doesn't. (But test this out to
> see how performance sensitive you really are! ;) You'll need to use
> something like dd to move data between different block size zvols.
>  * Enable compression if the contents are compressible (some likely will
> be.)
>  * Use a pool created from mirrors instead of raidz if you need
> high-performance small blocks while retaining redundancy.
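
To put a number on the mirror option: at volblocksize=8k with ashift=12, a
2-way mirror costs exactly the same space as this raidz1 layout, which is
Ahrens' point quoted above. A tiny sketch under the same assumptions
(4k sectors, 5-disk raidz1):

sector = 4096                    # 2**ashift with ashift=12
block = 8 * 1024                 # volblocksize=8k
data_sectors = block // sector   # 2 data sectors

raidz1 = (data_sectors + 1 + 1) * sector   # + 1 parity sector + 1 pad sector
mirror = 2 * block                         # two full copies of the block

print("raidz1: %dk  mirror: %dk" % (raidz1 // 1024, mirror // 1024))  # 16k vs 16k

So at that block size a mirror pool gives up nothing in space efficiency and
typically handles small-block I/O better.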
>
> You don't get efficient (better than mirrors) redundancy, performant small
> (as in small multiple of zpool's sector size) block sizes, and zfs's
> flexibility all at once.
>
>  - Eric
>
> [1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
> [2] My spin on Ahrens' spreadsheet: https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing

