ZFS

Freddie Cash fjwcash at gmail.com
Thu Oct 30 15:34:55 PDT 2008


On October 30, 2008 02:55 pm Lorenzo Perone wrote:
> On 22.10.2008, at 17:38, Freddie Cash wrote:
> > Personally, we use it in production for a remote backup box using
> > ZFS and Rsync (64-bit FreeBSD 7-Stable from August, 2x dual-core
> > Opteron 2200s, 8 GB DDR2 RAM, 24x 500 GB SATA disks attached to two
> > 3Ware 9650/9550 controllers as single-disks).  Works beautifully,
> > backing up 80 FreeBSD and Debian Linux servers every night, creating
> > snapshots with each run.

> > Restoring files from an arbitrary day is as simple as navigating to
> > the needed .zfs/snapshot/<snapname>/<path>/ and scping the file to
> > wherever.

> > And full system restores are as simple as "boot livecd, partition/
> > format disks, run rsync".
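
The single-file restore described in the quoted text can be sketched as
follows. This is a minimal illustration only: the helper name, the
snapshot name, the per-server directory layout and the target host are
all hypothetical, since the post does not give them.

```shell
#!/bin/sh
# Build the path to a file inside a ZFS snapshot directory.
# Hypothetical helper; arguments are illustrative, not from the post.
snapshot_path() {
    mountpoint=$1   # mountpoint of the ZFS filesystem, e.g. /storage/backup
    snapname=$2     # snapshot name (naming scheme is an assumption)
    relpath=$3      # path of the file relative to the filesystem root
    printf '%s/.zfs/snapshot/%s/%s\n' "$mountpoint" "$snapname" "$relpath"
}

# Print (rather than run) the scp command for the restore.
src=$(snapshot_path /storage/backup 2008-10-29 server1/etc/rc.conf)
echo "scp $src root@server1:/etc/rc.conf"
```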

> So your system doesn't suffer panics and/or deadlocks, or you just
> cope with them as "collateral damage" (which, admitted, is less of
> a problem with a logging fs)?

Back in August, when we first started the implementation, we deadlocked it 
once or twice a week.  By the time it went live in September, we 
deadlocked it several times a week, but that turned out to be due to the 
CPU/RAM in the production machine not being able to keep up with even 20 
rsync runs (2x Opteron 200, 8 GB DDR1-SDRAM).

We moved the hard drives over to another system with the Opteron 2200s 
(similar to the testing machine) and DDR2-SDRAM and have only deadlocked 
it twice in 6 weeks.

Through all that, we noticed a pattern:  if we had more than 4 rsyncs 
running that were doing straight copies (i.e. new servers added to the 
backup, and this was their first run), then the server would deadlock 
and had to be power-cycled.

But if the rsyncs are doing mostly incremental updates (file compares, 
updating changed files, writing new files), then we can run with all 80 
without issues.

So, we've taken to only adding 1 new server at a time to the backup 
process, and waiting for it to fully sync to the backup server before 
adding the next one.  (We stop the rsync run at 6:50am, so some servers 
can take up to three days for the initial sync to complete, as the remote 
end only has 768 Kbps ADSL upload speeds.)
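
To put the 768 Kbps figure in perspective, here is a rough upper bound
on how much data such a link can move per hour. It assumes 1 Kbps =
1000 bit/s and ignores protocol overhead and rsync compression; the
function name is made up for this sketch.

```shell
#!/bin/sh
# Upper bound on megabytes transferred per hour over an uplink.
# Assumption: 1 Kbps = 1000 bit/s, no overhead, no compression.
mb_per_hour() {
    kbps=$1
    # bits/s -> bytes/s -> bytes/hour -> MB/hour (integer arithmetic)
    echo $(( kbps * 1000 / 8 * 3600 / 1000000 ))
}

mb_per_hour 768
```

At that rate (about 345 MB per hour), a 10 GB initial copy needs
roughly 29 hours of transfer time, so spreading it over several nightly
runs is expected.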

It's only been up for a week now since the last deadlock, but now that 
we've discovered the issue (too many writes from too many rsyncs 
simultaneously), we think it will be a lot longer until the next one.  :)
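
The "one new server at a time" rule could be enforced with a small
guard like the one below. This is only a sketch: MAX_INITIAL and the
function name are assumptions, and how the number of running initial
syncs is counted (lock files, process list, etc.) is left to the
backup scripts themselves.

```shell
#!/bin/sh
# Decide whether another initial (full-copy) rsync may start.
# MAX_INITIAL is an assumed knob: the post reports deadlocks with more
# than 4 concurrent full copies and settles on 1 new server at a time.
MAX_INITIAL=1

can_start_initial() {
    running=$1      # number of initial syncs currently in progress
    [ "$running" -lt "$MAX_INITIAL" ]
}

if can_start_initial 0; then
    echo "start initial sync"
else
    echo "wait for the current initial sync to finish"
fi
```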

We're anxiously awaiting the release of 8.0, with the much expanded 
kmem_max, so we can put 16 GB of RAM in here, give 4 GB to the ARC, and 
give the rest to rsync, which should speed things up and stabilise it 
more.

> If that's the case, would you share the details about what you're using
> on that machine (RELENG_7?, 7_0? HEAD?) and which patches
> /knobs You used? I have a similar setup on a host which
> backs up way fewer machines and locks up every... 3-9 weeks or so.
> That host only has about 2GB ram though.

All the gory details follow.  It's fairly long.  There are no custom or 
extra patches installed.

Hardware:
 Tyan h2000M motherboard (S3992)
 2x Opteron 2200 CPUs (dual-core) @ 2 GHz
 4x 2 GB DDR2-667 ECC SDRAM
 3Ware 9550SXU-ML16 PCI-X RAID controller
 3Ware 9650SE-ML12  PCIe  RAID controller
 12x 400 GB Seagate SATA hard drives
 12x 500 GB WD SATA hard drives
 2x 2 GB CompactFlash cards in CF-to-IDE adapters
 Chenbro 5U case with 4-way redundant PSUs and 24x hot-swappable bays

All 24 hard drives are configured as "SingleDisk" arrays, which makes them 
appear as individual, normal drives to the OS, but allows the RAID 
controller to use the disk write cache and the card write cache.

ad0 and ad1 are the CF cards, and are part of a gmirror (gm0)
da0 through da23 are part of a raidz2 pool called "storage"

/ is on gm0

The following are ZFS filesystems:
/usr
/usr/src			compressed (lzjb)
/usr/obj
/usr/ports			compressed (lzjb)
/usr/ports/distfiles
/usr/local
/home
/tmp
/var
/storage
/storage/backup			compressed (gzip)

swap is an 8 GB zvol

ZFS recordsize is set to 64K on storage, and inherited by the rest.
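
For reference, the properties described above correspond to zfs(8)
commands along these lines. The script only prints the commands. The
dataset names are assumptions (the list above gives mountpoints, not
dataset names), and the compression setting for the source and ports
trees is taken to be lzjb.

```shell
#!/bin/sh
# Print (do not run) zfs commands matching the described properties.
# Dataset names are assumed; only the mountpoints are known from the post.
POOL=storage

show_props() {
    echo "zfs set recordsize=64K $POOL"             # inherited by children
    echo "zfs set compression=lzjb $POOL/usr/src"
    echo "zfs set compression=lzjb $POOL/usr/ports"
    echo "zfs set compression=gzip $POOL/backup"
}

show_props
```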

uname -a:
FreeBSD megadrive.sd73.bc.ca 7.0-STABLE FreeBSD 7.0-STABLE #0: Tue Aug 19 10:39:29 PDT 2008     root@megadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST  amd64

/boot/loader.conf:
# Loader options
autoboot_delay="10"
beastie_disable="NO"
loader_logo="beastie"
module_path="/boot/kernel"

# Kernel modules to load at boot
zfs_load="YES"

# Kernel tunables to set at boot (mostly for ZFS tuning)
# Disable DMA for the CF disks
# Set kmem to 1.5 GB (the current max on amd64)
# Set ZFS Adaptive Replacement Cache (ARC) to about half of kmem (leaving half
# for the OS)
hw.ata.ata_dma=0
kern.hz="100"
vfs.zfs.arc_min="512M"
vfs.zfs.arc_max="768M"
vfs.zfs.prefetch_disable="1"
vfs.zfs.zil_disable="0"
vm.kmem_size="1596M"
vm.kmem_size_max="1596M"

# Devices to disable at boot (mainly ISA/non-PnP devices)
hint.fd.1.disabled="1"
hint.sio.0.disabled="1"
hint.sio.1.disabled="1"
hint.sio.2.disabled="1"
hint.sio.3.disabled="1"
hint.ppc.0.disabled="1"
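
As a quick check of the "about half of kmem" comment in the tunables
above, the ratio of vfs.zfs.arc_max to vm.kmem_size can be computed
directly. The function name is made up for this sketch; the values are
the ones from the loader.conf listing, in MB.

```shell
#!/bin/sh
# Integer percentage of one value relative to another.
pct() {
    part=$1
    whole=$2
    echo $(( part * 100 / whole ))
}

# arc_max (768M) as a percentage of kmem_size (1596M)
pct 768 1596
# arc_min (512M) as a percentage of kmem_size (1596M)
pct 512 1596
```

With these settings the ARC may use between roughly a third and just
under half of the kernel memory map.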

Kernel config is GENERIC minus a bunch of unneeded drivers, using 
SCHED_ULE.

/etc/sysctl.conf:
# General network settings
net.isr.direct=1            # Whether to enable Direct Dispatch for netisr


# IP options
net.inet.ip.forwarding=0          # Whether to enable packet forwarding
net.inet.ip.process_options=0     # Disable processing of IP options
net.inet.ip.random_id=1           # Randomise the IP header ID number
net.inet.ip.redirect=0            # Whether to allow redirect packets
#net.inet.ip.stealth=0            # Whether to appear in traceroute output


# ICMP options
net.inet.icmp.icmplim=200         # Limit ICMP packets to this many/s
net.inet.icmp.drop_redirect=1     # Drop ICMP redirect packets
net.inet.icmp.log_redirect=0      # Don't log ICMP redirect packets


# TCP options
net.inet.tcp.blackhole=1          # Drop packets destined to unused ports
net.inet.tcp.inflight.enable=1    # Enable TCP inflight (bandwidth-delay) limiting
net.inet.tcp.log_in_vain=0        # Don't log the blackholed packets
net.inet.tcp.path_mtu_discovery=1 # Use ICMP type 3 to find the MTU to use
net.inet.tcp.recvspace=131072 # Size in bytes of the receive buffer
net.inet.tcp.sack.enable=1    # Enable Selective ACKs
net.inet.tcp.sendspace=131072 # Size in bytes of the send buffer
net.inet.tcp.syncookies=1     # Enable SYN cookie protection


# UDP options
net.inet.udp.blackhole=1      # Drop packets destined to unused ports
net.inet.udp.checksum=1       # Enable UDP checksums
net.inet.udp.log_in_vain=0    # Don't log the blackholed packets
net.inet.udp.recvspace=65536  # Size in bytes of the receive buffer


# Debug options
debug.minidump=1            # Enable the small kernel core dump
debug.mpsafevfs=1           # Enable threaded VFS subsystem


# Kernel options
kern.coredump=0             # Disable process core dumps
kern.ipc.somaxconn=512      # Expand the socket listen queue
kern.maxvnodes=250000       # Bump up the max number of vnodes


# PCI bus options
hw.pci.enable_msix=1        # Enable Message Signalled Interrupts Extended
hw.pci.enable_msi=1         # Enable Message Signalled Interrupts
hw.pci.enable_io_modes=1    # Enable alternate I/O access modes


# Other options
vfs.usermount=1             # Enable non-root users to mount filesystems

-- 
Freddie Cash
fjwcash at gmail.com


More information about the freebsd-stable mailing list