[Fwd: Re: Large ZFS arrays?]

Dennis Glatting dg at pki2.com
Tue Jun 17 15:47:32 UTC 2014


On Sun, 2014-06-15 at 11:00 -0500, Kevin Day wrote:
> On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg at pki2.com> wrote:
> > 
> > Total. I am looking at three pieces in total:
> > 
> > * Two 1PB storage "blocks" providing load sharing and 
> >  mirroring for failover.
> > 
> > * One 5PB storage block for on-line archives (3-5 years).
> > 
> > The 1PB nodes will be divided into something that makes sense, such
> > as multiple SuperMicro 847 chassis with 3TB disks providing some
> > number of volumes. Division is a function of application: for
> > example, 100TB RAIDz2 volumes for bulk storage versus smaller 8TB
> > volumes for active data such as iSCSI, databases, and home
> > directories.
> > 
> > Thanks.
> 
> 
> We’re currently using multiples of the SuperMicro 847 chassis with 3TB
> and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
> configured as four 11-drive raidz2 groups, plus one hot spare. 
> 
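For reference, here is a minimal sketch of that layout in sh, using
hypothetical device names da0-da44 and a placeholder pool name ("tank");
adjust both to the real enclosure:

  #!/bin/sh
  # Build four 11-disk raidz2 vdevs from da0..da43 and keep da44 as a
  # hot spare.  Device names are hypothetical.
  vdevs=""
  first=0
  for group in 1 2 3 4; do
      disks=""
      for n in $(jot 11 $first); do
          disks="$disks da$n"
      done
      vdevs="$vdevs raidz2 $disks"
      first=$((first + 11))
  done
  zpool create tank $vdevs spare da44
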
> A few notes:
> 
> 1) I’d highly recommend against grouping them together into one giant
> zpool unless you really really have to. We just spent a lot of time
> redoing everything so that each 45 drive array is its own
> zpool/filesystem. You’re otherwise putting all your eggs into one very
> big basket, and if something went wrong you’d lose everything rather
> than just a subset of your data. If you don’t do this, you’ll almost
> definitely have to run with sync=disabled, or the number of sync
> requests hitting every drive will kill write performance.
> 
> 2) You definitely want a JBOD controller instead of a smart RAID
> controller. The LSI 9207 works pretty well, but when you exceed 192
> drives it complains at boot about running out of heap space and makes
> you press a key to continue, after which it works fine. There is a very
> recently released firmware update for the card that seems to fix this,
> but we haven’t completed testing yet. You’ll also want to increase
> hw.mps.max_chains. The driver warns you when you need to, but you need
> to reboot to change this, and you’re probably only going to discover
> this under heavy load.
> 
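On the sync point in (1): sync is a per-dataset property, so it can also
be disabled selectively rather than pool-wide. A minimal example, with
"tank/bulk" as a placeholder dataset name:

  # Disable synchronous write semantics for one dataset only.  This
  # trades the last few seconds of acknowledged writes on power loss
  # for throughput, so use it only where that loss is acceptable.
  zfs set sync=disabled tank/bulk
  # Verify the setting:
  zfs get sync tank/bulk
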

I had discovered the chains problem on some of my systems. Like most of
the people on this list, I have a small data center at home, though the
spouse had the noisy servers "relocated" to the garage. :)
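For anyone else who hits it: hw.mps.max_chains is a loader tunable, so it
goes in /boot/loader.conf and takes effect at the next boot. A sketch,
with the value itself just an example:

  # /boot/loader.conf
  # Raise the number of mps(4) chain frames; pick a value suited to the
  # workload.  Requires a reboot to take effect.
  hw.mps.max_chains="4096"

  # After reboot, the driver's counters can be watched (unit 0 assumed):
  #   sysctl dev.mps.0.chain_free
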


> 3) We’ve played with L2ARC ssd devices, and aren’t seeing much gain.
> It appears that our active data set is so large that it’d need a huge
> SSD to even hit a small percentage of our frequently used files.
> Setting “secondarycache=metadata” does seem to help a bit, but probably
> not worth the hassle for us. This probably will depend entirely on your
> workload though.
> 

I'm curious whether you have tried the TB or near-TB SSDs? I haven't
looked to see whether they are at all reliable, or fast.
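In case it helps anyone, secondarycache is also a per-dataset property,
so it can be limited to the datasets where metadata caching matters. A
sketch with placeholder names:

  # Cache only metadata (not file data) from this dataset in L2ARC.
  zfs set secondarycache=metadata tank/home
  # Rough check of whether the L2ARC is earning its keep:
  sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
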


> 4) “zfs destroy” can be excruciatingly expensive on large datasets.
> http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/ 
> It’s a bit better now, but don’t assume you can “zfs destroy” without
> killing performance for everything else.
> 

Is that still a problem? Both FreeBSD and ZFS-on-Linux had a significant
problem with destroy, but I am under the impression that it is now
backgrounded on FreeBSD (ZoL, however, destroyed the pool that had dedup
data). It has been several months since I deleted TB-sized files, but I
seem to recall that non-dedup destroys are now fine, while dedup will
forever suck.
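For what it's worth, the backgrounded destroy shows up as the
async_destroy pool feature flag, so it is easy to check whether a given
pool has it. A sketch, with "tank" as a placeholder pool name:

  # async_destroy lets "zfs destroy" of large datasets run in the
  # background instead of blocking.
  zpool get feature@async_destroy tank
  # The "freeing" property shows how much space a background destroy
  # still has left to reclaim.
  zpool get freeing tank
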


> If you have specific questions, I’m happy to help, but I think most of
> the advice I can offer is going to be workload specific. If I had to do
> it all over again, I’d probably break things down into many smaller
> servers rather than trying to put so much onto one.
> 

Replication is for on-line failover. HAST may be an option, but I
haven't looked into it.
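Plain snapshot replication with zfs send/receive would be another option
to evaluate alongside HAST. A rough sketch, with hypothetical dataset and
host names:

  # Initial full copy to the standby box.
  zfs snapshot tank/data@rep1
  zfs send tank/data@rep1 | ssh standby zfs receive -F tank/data
  # Later runs only ship the delta between snapshots.
  zfs snapshot tank/data@rep2
  zfs send -i tank/data@rep1 tank/data@rep2 | \
      ssh standby zfs receive -F tank/data
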



-- 
Dennis Glatting <dg at pki2.com>


