Restructure a ZFS Pool

Paul Kraus paul at kraus-haus.org
Thu Sep 24 14:20:10 UTC 2015


On Sep 24, 2015, at 9:48, Raimund Sacherer <rs at logitravel.com> wrote:

> Yes, I understood that it will only help preventing fragmentation in the future. I also read that performance is great when using async ZFS,

That is an overly general statement. I have seen zpools perform badly with async as well as sync writes when not configured to match the workload.

> would it be safe to use async ZFS if I have Battery Backed Hardware Raid Controller (1024G ram cache)?
> The server is a HP G8 and I have configured all disks as single-disk mirrors (the only way to get a JBOD on this raid controller). In the event of a power outage, everything should be held in the raid controller by the battery and written to disk as soon as power is restored.

Turning off sync behavior violates POSIX compliance and is not a good idea. Also remember that async writes are cached in the ARC… so you need power for the entire server, not just the disk caches, until all activity has ceased _and_ all pending Transaction Groups (TXGs) have been committed to non-volatile storage. TXGs are generally committed every 5 seconds, but under heavy write load a commit may take longer than that.
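
For what it's worth, on FreeBSD the TXG commit interval is exposed as a sysctl, so you can at least see how long async data may sit in RAM before being flushed. A rough sketch, assuming stock FreeBSD defaults:

    # how often TXGs are committed, in seconds (default is 5)
    sysctl vfs.zfs.txg.timeout

Under heavy write load the actual commit can take longer than that interval, as noted above.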

> ... would that be safe environment to switch ZFS to async? 

No one can make that call but you. You know your environment, you know your workload, you know the fallout from lost writes _if_ something goes wrong.

> If I use async, is there still the *need* for a SLOG device, I read that running ZFS async and using the SLOG is comparable, because both let the writes be ordered and those prevent fragmentation? It is not a critical system (e.g. downtime during the day is possible), but if restores need to be done I'd rather have it run as fast as possible. 

If you disable sync writes (please do NOT say “use async”, as that is determined by the application code), then you are disabling the ZIL (ZFS Intent Log), and the SLOG is a device that holds _just_ the ZIL separate from the data vdevs in the zpool. So, yes, disabling sync writes means that even if there is a SLOG it will never be used.
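
If you do decide to go down that road, the knob is the per-dataset `sync` property, not anything application-side. A sketch, with `tank/backup` as a placeholder dataset name:

    # see how sync requests are currently honored (standard | always | disabled)
    zfs get sync tank/backup
    # what the original question amounts to -- not recommended, see above
    zfs set sync=disabled tank/backup
    # return to POSIX-compliant behavior
    zfs set sync=standard tank/backup

With sync=disabled, a SLOG, if present, simply sits idle.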

>> Yes, but unless you can stand losing data in flight (writes that the system
>> says have been committed but have only made it to the SLOG), you really want
>> your SLOG vdev to be a mirror (at least 2 drives).

> Shouldn't this scenario be handled by ZFS (writes to SLOG, power out, power on, SLOG is transferred to data disks?)

Not if the single SLOG device _fails_ … In the case of a power failure, once the system comes back up ZFS will replay the outstanding intent log records from the SLOG and you will not have lost any writes.

> I thought the only dataloss would be writes which are currently in transit TO the SLOG in time of the power outage?

Once again, if the application requests sync writes, the application is not told that the write is complete _until_ it is committed to non-volatile backing storage, in this case the ZIL/SLOG device(s). So from the application’s perspective, no writes are lost, because they were never acknowledged as committed when power failed. This is why the claim that disabling sync behavior and relying on a UPS or battery-backed cache is just as good as a SLOG device is misleading: the application is asking for a sync write and it is being lied to.

> And I read somewhere that with ZFS since V28 (IIRC) if the SLOG dies it turns off the log and you lose the (performance) benefit of the SLOG, but the pools should still be operational?

There are separate versions for zpool and zfs; you are referring to zpool version 28. Log device removal was added in zpool version 19. `zpool upgrade -v` will tell you which versions / features your system supports, and `zfs upgrade -v` will tell you the same thing for zfs versions. FreeBSD 10.1 has zfs version 5 and zpool version 28 plus a number of added features. Feature flags are a way to add features to zpools without completely breaking compatibility.
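
To see exactly what your own system and pools support, something like the following (the pool name `tank` is a placeholder):

    # every zpool version / feature flag this kernel supports
    zpool upgrade -v
    # every zfs (dataset/filesystem) version this kernel supports
    zfs upgrade -v
    # the version a particular pool is actually running
    zpool get version tank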

So, since zpool version 19, you can remove a failed SLOG device, and if the log devices are mirrored you still don’t lose any data. I’m not sure what happens to a running zpool if a single (non-mirrored) SLOG device fails.
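
For reference, attaching and later removing a mirrored SLOG looks roughly like this (device names are placeholders; check `zpool status` for the real vdev names):

    # add a mirrored log vdev to an existing pool
    zpool add tank log mirror ada1 ada2
    # remove the whole log mirror (zpool version 19 or later), using the
    # mirror name shown by zpool status, e.g. mirror-1
    zpool remove tank mirror-1
    # a single, non-mirrored log device is removed by name
    zpool remove tank ada1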

>> In a zpool of this size, especially a RAIDz<N> zpool, you really want a hot
>> spare and a notification mechanism so you can replace a failed drive ASAP.
>> The resilver time (to replace a failed drive) will be limited by the
>> performance of a _single_ drive for _random_ I/O. See this post
>> http://pk1048.com/zfs-resilver-observations/ for one of my resilver
>> operations and the performance of such.

> Thank you for this info, I'll keep it in mind and bookmark your link.

Benchmark your own zpool if you can. Do a `zpool replace` on a device and see how long it takes; that is a reasonable first approximation of how long it will take to replace an actually failed device. I tend to stick with drives no larger than 1 TB to keep resilver times reasonable (for me), and I add more mirror vdevs as I need capacity.
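
As a sketch of the mechanics (again with placeholder device names), a hot spare plus a timed replacement looks like:

    # keep a hot spare attached so a failed disk can be swapped in quickly
    zpool add tank spare ada6
    # replace a disk and note how long the resilver takes
    zpool replace tank ada3 ada6
    # watch progress and the estimated completion time
    zpool status tank

The resilver duration you observe here is your realistic exposure window after a real failure.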

--
Paul Kraus
paul at kraus-haus.org


