Help me select hardware....Some real world data that might help

Tue Jan 27 12:14:12 PST 2009

On January 27, 2009 10:41 am Paul Tice wrote:
> Excuse my rambling, perhaps something in this mess will be useful.
>
> I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB
> drives on a backup system (I plan to add the other in the chassis one by
> one, testing the speed along the way) 8-current AMD64, ZFS, Marvell
> 88sx6081 PCI-X card (8 port SATA) + LSI1068E (8 port SAS/SATA) for the
> main Array, and the Intel onboard SATA for boot drive(s). Data is sucked
> down through 3 gigabit ports, with another available but not yet
> activated. Array drives all live on the LSI right now. Drives are  <ATA
> ST3750640AS K>.
>
> ZFS is stable _IF_ you disable the prefetch and ZIL, otherwise the
> classic ZFS wedge rears it's ugly head. I haven't had a chance to test
> just one yet, but I'd guess it's the prefetch that's the quick killer.

You probably don't want to disable the ZIL.  That's the journal, and an 
important part of the data integrity setup for ZFS.

Prefetch has been shown to cause issues on a lot of systems, and can be a 
bottleneck depending on the workload.  But the ZIL should be enabled.
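
As a rough sketch of that (prefetch off, ZIL left alone): the
vfs.zfs.prefetch_disable loader tunable is the usual knob on FreeBSD.  The
helper name disable_zfs_prefetch below is only an illustration, and the
setting takes effect at the next boot.

```sh
#!/bin/sh
# Hypothetical helper: adds the prefetch tunable to a loader.conf-style
# file unless a line for it is already there.  The ZIL is deliberately
# left at its default (enabled).
disable_zfs_prefetch() {
  conf=${1:-/boot/loader.conf}
  if ! grep -q '^vfs\.zfs\.prefetch_disable=' "$conf" 2>/dev/null; then
    echo 'vfs.zfs.prefetch_disable="1"' >> "$conf"
  fi
}
```

Call it as "disable_zfs_prefetch" for the default file, or pass a path to
try it on a scratch copy first.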

> I've seen references to 8-Current having a kernel memory limit of 8G
> (compared to 2G for pre 8 from what I understand so far) and ZFS ARC

FreeBSD 8.x kmem_max has been bumped to 512 GB.

> Using rsync over several machines with this setup, I'm getting a little
> over  1GB/min to the disks. 'zpool iostat 60' is a wonderful tool.

gstat is even nicer, as it shows you the throughput to the individual 
drives, instead of the aggregate that zpool shows.  It works at the GEOM 
level.  Quite nice to see how the I/O is balanced (or not) across the drives 
in the raidz vdevs, and the pool as a whole.

> CPU usage during all this is suprisingly low.  rsync is running with -z,

If you are doing rsync over SSH, don't use -z as part of the rsync command.  
Instead, pass -C to ssh.  That way, rsync runs in one process and the 
compression is done by ssh in another, so the transfer uses two CPUs/cores 
instead of just one.  You'll get better throughput that way, as the rsync 
process doesn't have to do the compression and the reading/writing in the 
same process.  We got about a 25% boost in throughput by moving the 
compression out of rsync, and CPU usage balanced across CPUs instead of 
just hogging one.
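
A minimal sketch of that invocation, wrapped in a function so it is easy to
reuse.  The function name sync_over_ssh and the paths in the usage line are
made up for illustration; -a is just a common choice of rsync options, not
something specific to our setup.

```sh
#!/bin/sh
# Hypothetical wrapper: copy $1 to $2 with compression done by ssh.
sync_over_ssh() {
  src=$1
  dest=$2
  # No -z here, so rsync itself does not compress.
  # -e "ssh -C" makes the separate ssh process compress the stream.
  rsync -a -e "ssh -C" "$src" "$dest"
}
```

Usage would look like: sync_over_ssh /data/ backup@host:/tank/data/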

> Random ZFS thoughts:
> You cannot shrink/grow a raidz or raidz2.

You can't add devices to a raidz/raidz2 vdev.  But you can replace the 
drives with larger ones, resilvering after each swap, and the extra space 
will become available once every drive in the vdev has been replaced.  Just 
pull the small drive, insert the large drive, and do a "zpool replace 
<poolname> <old-device> <new-device>".

And you can add extra raidz/raidz2 vdevs to a pool, and ZFS will stripe 
the data across the raidz vdevs.  Basically, the pool becomes a RAID 5+0 
or RAID 6+0, instead of just a RAID 5/RAID 6.

If you have lots of drives, the recommendation from the Solaris folks is to 
use a bunch of raidz vdevs of <=9 disks each, instead of one giant raidz 
vdev across all the drives.  For example:

zpool create pool raidz2 da0  da1  da2  da3  da4  da5
zpool add    pool raidz2 da6  da7  da8  da9  da10 da11
zpool add    pool raidz2 da12 da13 da14 da15 da16 da17

This gives you a single pool comprised of three raidz2 vdevs, with data 
being striped across the three vdevs.

And you can add more raidz vdevs to the pool as needed.
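
To get a feel for what such a layout yields, here is a back-of-the-envelope
calculation for three six-disk raidz2 groups.  The 750 GB disk size is an
assumption borrowed from the drives mentioned earlier in the thread, the
function name is made up, and the figure ignores ZFS metadata and
GB-versus-GiB overhead, so treat it as an upper bound.

```sh
#!/bin/sh
# Rough usable capacity: each group loses "parity" disks' worth of space.
# Arguments: number of groups, disks per group, parity disks, disk size (GB).
raidz_usable_gb() {
  vdevs=$1 disks=$2 parity=$3 disk_gb=$4
  echo $(( vdevs * (disks - parity) * disk_gb ))
}

# 3 groups x (6 - 2) data disks x 750 GB
raidz_usable_gb 3 6 2 750    # prints 9000
```

Adding a fourth identical group later would raise the estimate by another
4 x 750 = 3000 GB.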

> You can grow a stripe array,
> I'm don't know if you can shrink it successfully. You cannot promote a
> stripe array to raidz/z2, nor demote in the other direction. You can have
> hot spares, haven't seen a provision for warm/cold spares.

ZFS in FreeBSD 7.x doesn't support hot spares, in that a faulted drive won't 
start a rebuild using a spare drive.  You have to manually "zpool replace" 
the drive using the spare.

ZFS in FreeBSD 8.x does support auto-rebuild using spare drives (hot spare).

> /etc/default/rc.conf already has cron ZFS status/scrub checks, but not
> enabled.

periodic(8) does ZFS checks as part of the daily run.  See 
/etc/defaults/periodic.conf.
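
A minimal sketch of turning that on, assuming your release ships the 
daily_status_zfs_enable knob (check /etc/defaults/periodic.conf for the 
exact name on your system).  Overrides belong in /etc/periodic.conf, not in 
the defaults file:

```sh
# /etc/periodic.conf
# Local override: include ZFS pool status in the daily periodic(8) report.
daily_status_zfs_enable="YES"
```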

However, you can whip up a very simple shell script that does the same, and 
run it via cron at whatever interval you want.  We use the following, which 
runs every 15 minutes:

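#!/bin/sh
# Mail a warning if any pool is unhealthy.  Replace <server> and <mail>
# with your host name and recipient address before using this.

# "zpool status -x" prints "all pools are healthy" when nothing is wrong.
status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
  echo "Problems with ZFS: ${status}" | mail -s "ZFS Issues on <server>" \
    <mail>
fi

exit 0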

-- 
Freddie
fjwcash at gmail.com