ZFS extra space overhead for ashift=12 vs ashift=9 raidz2 pool?

Taylor j.freebsd-zfs at enone.net
Fri Mar 23 16:30:51 UTC 2012


Hello,

I'm bringing up a new ZFS filesystem and have noticed something strange with respect to the overhead from ZFS. When I create a raidz2 pool with 512-byte sectors (ashift=9), I have an overhead of 2.59%, but when I create the zpool using 4k sectors (ashift=12), I have an overhead of 8.06%. This amounts to a difference of 2.79TiB in my particular application, which I'd like to avoid. :)

(Assuming I haven't done anything wrong. :) ) Is the extra overhead for 4k sector (ashift=12) raidz2 pools expected? Is there any way to reduce this?

(In my very limited performance testing, 4K sectors do seem to perform slightly better and more consistently, so I'd like to use them if I can avoid the extra overhead.)

Details below.

Thanks in advance for your time,

-Taylor



I'm running:
FreeBSD host 9.0-RELEASE FreeBSD 9.0-RELEASE #0  amd64

I'm using Hitachi 4TB Deskstar 0S03364 drives, which are 4K sector devices. 

In order to "future-proof" the raidz2 pool against possible variations in replacement drive size, I've created a single partition on each drive, starting at sector 2048 and using 100 MB less than the total available space on the disk.
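Roughly, each drive was partitioned like this (the -b and -s values are in 512-byte sectors; the size corresponds to the listing below and leaves about 100 MB free at the end of the disk):
$ sudo gpart create -s gpt da2
$ sudo gpart add -t freebsd-zfs -b 2048 -s 7813832368 da2
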
$ sudo gpart list da2
Geom name: da2
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 7814037134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da2p1
  Mediasize: 4000682172416 (3.7T)
  Sectorsize: 512
  Stripesize: 0
  Stripeoffset: 1048576
  Mode: r1w1e1
  rawuuid: 71ebbd49-7241-11e1-b2dd-00259055e634
  rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
  label: (null)
  length: 4000682172416
  offset: 1048576
  type: freebsd-zfs
  index: 1
  end: 7813834415
  start: 2048
Consumers:
1. Name: da2
  Mediasize: 4000787030016 (3.7T)
  Sectorsize: 512
  Mode: r1w1e2

Each partition gives me 4000682172416 bytes (or 3.64 TiB). I'm using 16 drives.  I create the zpool with 4K sectors as follows:
$ sudo gnop create -S 4096 /dev/da2p1
$ sudo zpool create zav raidz2 da2p1.nop da3p1 da4p1 da5p1 da6p1 da7p1 da8p1 da9p1 da10p1 da11p1 da12p1 da13p1 da14p1 da15p1 da16p1 da17p1
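(For what it's worth, my understanding is that the gnop provider is only needed at creation time, since ashift is recorded in the vdev labels; the pool can then be exported, the .nop device destroyed, and the pool reimported on the plain partition:)
$ sudo zpool export zav
$ sudo gnop destroy da2p1.nop
$ sudo zpool import zav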

I confirm ashift=12:
$ sudo zdb zav | grep ashift
               ashift: 12
               ashift: 12

"zpool list" approximately matches the expected raw capacity of 16*4000682172416 = 64010914758656 bytes (58.28 TiB). 
$ zpool list zav
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zav     58T  1.34M  58.0T     0%  1.00x  ONLINE  -

For raidz2 with 16 drives (14 data + 2 parity), I'd expect to see 4000682172416*14 = 56009550413824 bytes (50.94 TiB) of usable space. However, I only get:
$ zfs list zav
NAME   USED  AVAIL  REFER  MOUNTPOINT
zav   1.10M  46.8T  354K  /zav

Or using df for greater accuracy:
$ df zav
Filesystem 1K-blocks   Used       Avail Capacity  Mounted on
zav        50288393472  354 50288393117     0%    /zav

A total of 51495314915328 bytes (46.83TiB). (This is for a freshly created zpool before any snapshots, etc. have been performed.)

I measure overhead as (expected - actual) / expected, which for the 4K-sector (ashift=12) raidz2 pool comes to 8.06%.
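
Spelling the arithmetic out (the actual byte count is the df 1K-block total multiplied by 1024):
$ awk 'BEGIN {
    expected = 4000682172416 * 14      # 14 data drives worth of partition space
    actual   = 50288393472 * 1024      # df 1K-blocks total, in bytes
    printf "%.2f%%\n", (expected - actual) / expected * 100
}'
8.06%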

To create a 512-byte-sector (ashift=9) raidz2 pool, I basically just replace "da2p1.nop" with "da2p1" when creating the zpool (the create command is repeated after the listings below). I confirm ashift=9 with zdb. The raw zpool size is the same, as far as I can tell with the limited precision of "zpool list". However, the available space according to "zfs list"/df is 54560512935936 bytes (49.62 TiB), which amounts to an overhead of 2.59%. There are some minor differences in the ALLOC and USED listings, so I repeat them here for the 512-byte-sector raidz2 pool:
$ zpool list zav
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zav     58T   228K  58.0T     0%  1.00x  ONLINE  -
$ zfs list zav
NAME   USED  AVAIL  REFER  MOUNTPOINT
zav    198K  49.6T  73.0K  /zav
$ df zav
Filesystem 1K-blocks   Used       Avail Capacity  Mounted on
zav        53281750914   73 53281750841     0%    /zav
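
For reference, the create command in this case uses just the plain partitions, with no gnop provider:
$ sudo zpool create zav raidz2 da2p1 da3p1 da4p1 da5p1 da6p1 da7p1 da8p1 da9p1 da10p1 da11p1 da12p1 da13p1 da14p1 da15p1 da16p1 da17p1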

I expect some overhead from ZFS, and according to this blog post:
http://www.cuddletech.com/blog/pivot/entry.php?id=1013
(via http://mail.opensolaris.org/pipermail/zfs-discuss/2010-May/041773.html) 
there may be a 1/64 (about 1.56%) reservation baked into ZFS. Interestingly enough, when I create a pool with no raid/mirroring, I get an overhead of 1.93% regardless of ashift=9 or ashift=12, which is quite close to the 1/64 number. I have also tested raidz, which behaves similarly to raidz2, although the overhead is slightly lower in each case: ashift=9 raidz overhead is 2.33%, and ashift=12 raidz overhead is 7.04%.

To save space, I've put the zdb listings for both the ashift=9 and ashift=12 raidz2 pools here:
http://pastebin.com/v2xjZkNw

There are also some differences in the zdb output; for example, "SPA allocated" is higher in the 4K-sector raidz2 pool, which seems interesting, although I don't understand its significance.

