Help me select hardware....Some real world data that might help

Paul Tice ptice at aldridge.com
Tue Jan 27 20:59:45 PST 2009


I just bumped up kmem and arc.max, re-enabled the ZIL, and re-enabled mdcomp. Prefetch is disabled.
Less than a minute into a backup run of only 4 machines, I've got a fresh ZFS wedgie. Ouch.
As I understand it, the ZIL is more of a speed boost than an integrity boost, especially since we already have checksum-per-block. I did see spikes of up to 140MB/s on the ZFS pool in the 2 minutes before ZFS wedged.
I think of the ZIL as the equivalent of journaling in more traditional filesystems. I can certainly see where it would help with integrity in certain cases, but a good UPS seems to offer many of the same benefits. ;>
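For reference, the knobs above are loader.conf tunables on this vintage of FreeBSD; the names and values below are a sketch from memory, so check them against your version before copying:

```
# /boot/loader.conf -- illustrative values, not a recommendation
vm.kmem_size="4G"
vm.kmem_size_max="4G"
vfs.zfs.arc_max="3G"
vfs.zfs.prefetch_disable=1   # prefetch off (the suspected wedge trigger)
vfs.zfs.zil_disable=0        # ZIL back on
```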
 
(As always, I'm ready to be better informed)
 
I'm intrigued by the ssh/rsync two-process throughput bump, but it does require ssh as well as rsync.
Alas, many of the boxes being backed up belong to the dark side, and many of them are only managed by us.
For some reason, many 'dark side' owners trust rsync more than ssh, to the point of disallowing ssh installation. Realistically, we could do a lot more with SSH/rsync, since we could start up a Volume Shadow Copy, back it up, then remove the Shadow Copy.
 
Good to know there is a pretty high limit to kmem now. 
 
I'm not 100% sure about gstat. I did no slicing/labeling on the disks; they are purely /dev/adX devices used by ZFS, so would the GEOM level even see them? 'zpool iostat -v' gives the overall plus per-device stats. I'm curious what difference there would be between gstat and 'zpool iostat -v', if any.
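For anyone comparing the two views, the invocations would be roughly these (intervals are arbitrary; flags from memory, so verify against your man pages):

```shell
# GEOM-level view, refreshed every second
gstat -I 1s

# ZFS's own view: pool-wide plus per-vdev/per-disk stats, every 60 seconds
zpool iostat -v 60
```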
 
I suspected you could upsize raidz pools one disk at a time. I do wonder how inner/outer track speed differences would affect throughput after a one-by-one full-array disk replacement. And yes, I might wonder too much about corner cases. ;>
 
I assume that under 7.x, you could have a cron script take care of automating the zpool replace.
Built-in hot sparing is much nicer, but warm spares would be best (IMHO): a powered-down drive isn't spinning towards failure. Of course, this is probably too much to ask of any multi-platform FS, since the method of spinning a drive up (and the access required to do so) varies widely. Sounds like another cron script possibility. :)
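A minimal sketch of what such a cron job might look like; the pool name, spare device, and state strings are all assumptions, and the real 'zpool status' column layout should be checked before trusting this:

```shell
#!/bin/sh
# Hypothetical auto-replace: scan "zpool status" for the first FAULTED or
# UNAVAIL device and swap in a pre-designated spare.
POOL=tank      # assumed pool name
SPARE=ad12     # assumed standby device

# Pull the first failed device name out of a status listing on stdin.
failed_device() {
    awk '$2 == "FAULTED" || $2 == "UNAVAIL" { print $1; exit }'
}

if command -v zpool >/dev/null 2>&1; then
    bad=$( zpool status "${POOL}" | failed_device )
    if [ -n "${bad}" ]; then
        zpool replace "${POOL}" "${bad}" "${SPARE}"
    fi
fi
```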
 
If I can get time on the box, I may try graphing 3 disks -> 17 disks with throughput on a 2nd axis and CPU usage on a 3rd. Drive-wise, 8+1 is easily understandable as a theoretical best: 8 data bits + 1 parity bit spread across that many drives. Assuming you transfer multiple 8-bit bytes per cycle, and your controller/system/busses can keep up, 16+1 would in theory increase throughput, both from more bits read at a time and from half the parity calculations per double-byte written to disk. Of course, this assumes the 'splitting' and parity-generation code can optimize for multiple 8-bit byte transfers.
 
Anyway, yet another overly long ramble must come to a close.
 
Thanks
Paul
 
 
 

________________________________

From: owner-freebsd-current at freebsd.org on behalf of Freddie Cash
Sent: Tue 1/27/2009 1:42 PM
To: freebsd-current at freebsd.org
Subject: Re: Help me select hardware....Some real world data that might help



On January 27, 2009 10:41 am Paul Tice wrote:
> Excuse my rambling, perhaps something in this mess will be useful.
>
> I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB
> drives on a backup system (I plan to add the others in the chassis one by
> one, testing the speed along the way). 8-current AMD64, ZFS, Marvell
> 88sx6081 PCI-X card (8 port SATA) + LSI1068E (8 port SAS/SATA) for the
> main array, and the Intel onboard SATA for the boot drive(s). Data is
> sucked down through 3 gigabit ports, with another available but not yet
> activated. Array drives all live on the LSI right now. Drives are <ATA
> ST3750640AS K>.
>
> ZFS is stable _IF_ you disable the prefetch and ZIL, otherwise the
> classic ZFS wedge rears its ugly head. I haven't had a chance to test
> just one yet, but I'd guess it's the prefetch that's the quick killer.

You probably don't want to disable the ZIL.  That's the journal, and an
important part of the data integrity setup for ZFS.

Prefetch has been shown to cause issues on a lot of systems, and can be a
bottleneck depending on the workload.  But the ZIL should be enabled.

> I've seen references to 8-Current having a kernel memory limit of 8G
> (compared to 2G for pre 8 from what I understand so far) and ZFS ARC

FreeBSD 8.x kmem_max has been bumped to 512 GB.

> Using rsync over several machines with this setup, I'm getting a little
> over 1GB/min to the disks. 'zpool iostat 60' is a wonderful tool.

gstat is even nicer, as it shows you the throughput to the individual
drives, instead of the aggregate that zpool shows.  This works at the GEOM
level.  Quite nice to see how the I/O is balanced (or not) across the drives
in the raidz datasets, and the pool as a whole.

> CPU usage during all this is surprisingly low.  rsync is running with -z,

If you are doing rsync over SSH, don't use -z as part of the rsync command.
Instead, use -C with ssh.  That way, rsync runs in one process and the
compression is done by ssh in another, so it will use two CPUs/cores
instead of just one.  You'll get better throughput, as the rsync process
doesn't have to do the compression and the reading/writing in the same
process.  We got about a 25% boost in throughput by moving the compression
out of rsync, and CPU usage balanced across CPUs instead of hogging just
one.
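Concretely, that means something like the following (host and paths are made up for illustration):

```shell
# Compress in the ssh process (-C) rather than in rsync (-z), so the two
# halves of the work land on different cores.
rsync -a --delete -e "ssh -C" /export/home/ backup@backuphost:/backup/home/
```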

> Random ZFS thoughts:
> You cannot shrink/grow a raidz or raidz2.

You can't add devices to a raidz/raidz2 dataset.  But you can replace the
drives with larger ones, do a resilver, and the extra space will become
available.  Just pull the small drive, insert the large drive, and do a
"zpool replace <poolname> <device> <device>".
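As a sketch, the whole-array upgrade could be scripted roughly like this (pool and device names are assumptions; the key point is waiting for each resilver to finish before touching the next disk):

```shell
# Hypothetical one-at-a-time capacity upgrade of a six-disk raidz2.
for d in da0 da1 da2 da3 da4 da5; do
    # physically swap disk $d for the larger one first, then:
    zpool replace tank ${d}
    # don't move on until the resilver completes
    while zpool status tank | grep -q "in progress"; do
        sleep 300
    done
done
```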

And you can add extra raidz/raidz2 datasets to a pool, and ZFS will stripe
the data across the raidz datasets.  Basically, the pool becomes a RAID 5+0
or RAID 6+0, instead of just a RAID 5/RAID 6.

If you have lots of drives, the recommendation from the Solaris folks is to
use a bunch of raidz datasets of <=9 disks each, instead of one giant raidz
dataset across all the drives.  E.g.:

zpool create pool raidz2 da0  da1  da2  da3  da4  da5
zpool add    pool raidz2 da6  da7  da8  da9  da10 da11
zpool add    pool raidz2 da12 da13 da14 da15 da16 da17

This will give you a single pool comprised of three raidz2 datasets, with
data being striped across the three datasets.

And you can add raidz datasets to the pool as needed.

> You can grow a stripe array,
> I'm don't know if you can shrink it successfully. You cannot promote a
> stripe array to raidz/z2, nor demote in the other direction. You can have
> hot spares, haven't seen a provision for warm/cold spares.

ZFS in FreeBSD 7.x doesn't support hot spares, in that a faulted drive won't
start a rebuild using a spare drive.  You have to manually "zpool replace"
the drive using the spare.

ZFS in FreeBSD 8.x does support auto-rebuild using spare drives (hot spare).
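For completeness, attaching a spare and kicking off the manual replace look roughly like this (pool and device names assumed):

```shell
# designate da18 as a hot spare for the pool
zpool add tank spare da18
# under 7.x, start the rebuild by hand when da3 faults
zpool replace tank da3 da18
```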

> /etc/default/rc.conf already has cron ZFS status/scrub checks, but not
> enabled.

periodic(8) does ZFS checks as part of the daily run.  See
/etc/defaults/periodic.conf.

However, you can whip up a very simple shell script that does the same, and
run it via cron at whatever interval you want.  We use the following, which
runs every 15 minutes:

#!/bin/sh
# Mail an alert if any pool reports a problem.

# "zpool status -x" prints a single healthy line when all is well,
# otherwise the full status of the troubled pool(s).
status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
  echo "Problems with ZFS: ${status}" | mail -s "ZFS Issues on <server>" \
    <mail>
fi

exit 0

--
Freddie
fjwcash at gmail.com
_______________________________________________
freebsd-current at freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"



