ZFS...

Tue Apr 30 00:41:25 UTC 2019

Comments inline..

Michelle Sullivan
http://www.mhix.org/

> On 30 Apr 2019, at 03:06, Alan Somers <asomers at freebsd.org> wrote:
> 
>> On Mon, Apr 29, 2019 at 10:23 AM Michelle Sullivan <michelle at sorbs.net> wrote:
>> 
>> I know I'm not going to be popular for this, but I'll just drop it here
>> anyhow.
>> 
>> http://www.michellesullivan.org/blog/1726
>> 
>> Perhaps one should reconsider either:
>> 
>> 1. Looking at tools that may be able to recover corrupt ZFS metadata, or
>> 2. Defaulting to non ZFS filesystems on install.
>> 
>> --
>> Michelle Sullivan
>> http://www.mhix.org/
> 
> Wow, losing multiple TB sucks for anybody.  I'm sorry for your loss.
> But I want to respond to a few points from the blog post.
> 
> 1) When ZFS says that "the data is always correct and there's no need
> for fsck", they mean metadata as well as data.  The spacemap is
> protected in exactly the same way as all other data and metadata. (to
> be pedantically correct, the labels and uberblocks are protected in a
> different way, but still protected).  The only way to get metadata
> corruption is due to a disk failure (3-disk failure when using RAIDZ2),
> or due to a software bug.  Sadly, those do happen, and they're
> devilishly tricky to track down.  The difference between ZFS and older
> filesystems is that older filesystems experience corruption during
> power loss _by_design_, not merely due to software bugs.  A perfectly
> functioning UFS implementation will experience corruption during power
> loss, and that's why it needs to be fscked.  It's not just
> theoretical, either.  I use UFS on my development VMs, and they
> frequently experience corruption after a panic (which happens all the
> time because I'm working on kernel code).

I know, which is why I have ZVOLs with UFS filesystems in them for the development VMs...  in a perfect world the power would have been all good, the UPSes would not be damaged and the generator would not run out of fuel because of an extended outage...  in fact, if it was a perfect world I wouldn’t have my own mini DC at home.

> 
> 2) Backups are essential with any filesystem, not just ZFS.  After
> all, no amount of RAID will protect you from an accidental "rm -rf /".

You only do it once...  I did it back in 1995... haven’t ever done it again.

> 
> 3) ZFS hotspares can be swapped in automatically, though they aren't by
> default.  It sounds like you already figured out how to assign a spare
> to the pool.  To use it automatically, you must set the "autoreplace"
> pool property and enable zfsd.  The latter can be done with "sysrc
> zfsd_enable="YES"".
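
For reference, the steps in point 3 could be sketched roughly as below. This is a minimal sketch, not a tested procedure: the pool name "tank" is hypothetical, and the DRY_RUN switch and the run/enable_auto_spare helpers are made up for the example so that it only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Sketch only: "tank" is a hypothetical pool name, and DRY_RUN is a
# convenience invented for this example, not a zpool or sysrc feature.

# Run a command, or just print it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the settings from point 3 to the given pool.
enable_auto_spare() {
    pool="$1"
    run zpool set autoreplace=on "$pool"   # the "autoreplace" pool property
    run sysrc zfsd_enable=YES              # enable zfsd at boot
    run service zfsd start                 # start zfsd now
    run zpool get autoreplace "$pool"      # confirm the property took
}

enable_auto_spare "${1:-tank}"
```

Running it with DRY_RUN=0 as root would execute the commands instead of printing them; check the pool name first.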

The system was originally built on 9.0, and got upgraded throughout the years... zfsd was not available back then.  So I get your point, but maybe you didn’t realize this blog was a history of 8+ years?

> 
> 4) It sounds like you're having a lot of power trouble.  Have you
> tried sysutils/apcupsd from ports?

I did... Malta was notorious for it.  Hence 6 kVA UPSes in the bottom of each rack (4 racks), cross-connected with the rack next to it, and a backup generator...  Australia, on the other hand, is a lot more stable (at least where I am)...  2 power issues in 2 years... both within 10 hours... one was a transformer, the other was when some idiot took out a power pole (and I mean actually took it out, it was literally snapped in half... how they got out of the car and did a runner before the police or Ambos got there I’ll never know.)

>  It's fairly handy.  It can talk to
> a wide range of UPSes, and can be configured to do stuff like send you
> an email on power loss, and power down the server if the battery gets
> too low.
> 
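
apcupsd handles the low-battery shutdown itself through its own configuration, so the following is only a rough sketch of reading the battery state by hand. It assumes `apcaccess status` prints "KEY : value" lines such as "BCHARGE  : 87.0 Percent"; the battery_charge and battery_low helpers and the 20% threshold are made up for the example.

```shell
#!/bin/sh
# Sketch only: assumes apcaccess-style "KEY : value" status lines.
# The helper names and the 20% threshold are arbitrary examples.

# Read status text on stdin and print the whole-number battery charge.
battery_charge() {
    awk -F': *' '/^BCHARGE/ { split($2, parts, /[. ]/); print parts[1]; exit }'
}

# Succeed (exit 0) when the charge read from stdin is below the threshold.
battery_low() {
    threshold="${1:-20}"
    charge=$(battery_charge)
    [ -n "$charge" ] && [ "$charge" -lt "$threshold" ]
}

# Example with canned input; on a live system pipe in `apcaccess status`.
if printf 'STATUS   : ONBATT\nBCHARGE  : 15.0 Percent\n' | battery_low 20; then
    echo "battery low"
fi
```

With the canned input above this prints "battery low"; a charge at or above the threshold prints nothing.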

They could help this... all 4 UPSes are toast now.  One caught fire, one no longer detects AC input, and the other two I’m not even trying after the first caught fire... the lot are being replaced on insurance.

It’s a catalog of errors that most wouldn’t normally experience.  However, it does show (to me) that ZFS on everything is a really bad idea... particularly for home users, where the hardware is unknown and you know they will mistreat it... they certainly won’t have ECC RAM in laptops etc... unknown caching facilities etc... it’s a recipe for losing the root drive...

Regards,

Michelle