RFC: Suggesting ZFS "best practices" in FreeBSD

Borja Marcos borjam at sarenet.es
Tue Jan 22 11:13:11 UTC 2013


(Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)



Hi :)

Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems. In particular, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those HOWTOs suggest disabling checksums "because they are useless").

ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers that only complicate matters.

In the time I've been using ZFS on FreeBSD (since the first versions) I have run into several serious problems. I'll try to explain some of them, along with my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.



1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)

ZFS was born on a system with static device naming (Solaris). When you plug in a disk it gets a fixed name; as far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.

For example, imagine that I have 16 disks, da0 to da15. One of them, say da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed, each shifting down by one (da6 becomes da5, and so on). Potential for trouble, at a minimum.

After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive and label it with a readable name. Imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example pool-raidz1-disk1.
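
A minimal sketch of that labelling, assuming two disks da0 and da1 used for a mirrored pool called rpool (the disk names here are purely for illustration):

# one GPT partition per disk, covering the whole disk, carrying a
# human-readable label that is unique within the organization
gpart create -s gpt da0
gpart add -t freebsd-zfs -l rpool-disk1 da0
gpart create -s gpt da1
gpart add -t freebsd-zfs -l rpool-disk2 da1

# the pool is then built on the labels (/dev/gpt/...), never on daN
zpool create rpool mirror gpt/rpool-disk1 gpt/rpool-disk2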

When creating a pool, I use these names instead of the device numbers. The result looks like this:

% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
config:

	NAME                 STATE     READ WRITE CKSUM
	rpool                ONLINE       0     0     0
	  mirror-0           ONLINE       0     0     0
	    gpt/rpool-disk1  ONLINE       0     0     0
	    gpt/rpool-disk2  ONLINE       0     0     0
	logs
	  gpt/zfs-log        ONLINE       0     0     0
	cache
	  gpt/zfs-cache      ONLINE       0     0     0

Using a name that is unique for each disk within your organization is important. That way, you can safely move the disks to a different server, which might already be using ZFS, and still be able to import the pool without name collisions. Of course you could use gptids which, as far as I know, are unique, but they are awkward to work with, and in case of a disk failure it's not easy to determine which physical disk to replace.
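
With readable labels, mapping a label back to the physical device is also straightforward; a quick sketch, using the disk and label names from the example above:

# which partition carries a given label
glabel status | grep rpool-disk1

# or, going the other way, which label a given disk carries
gpart list da0 | grep label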




2- RAID cards.

Simply: avoid them like the plague. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional layer you put between ZFS and the disks will compromise it. I have had bad experiences with "mfi" and "aac" cards.

There are two solutions adopted by RAID card users, and neither of them is good. The first and most obvious one is to create a RAID5 volume, taking advantage of the battery-backed cache (if present). It works, but it gives up some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something as trivial as replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.

The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call this JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache: at the very least you have to deal with a cumbersome and complicated management program, and you often have to reboot the server.

The biggest reason to avoid these stupid cards, anyway, is plain and simple: those cards, at least the ones I have tried bundled by Dell as PERC (insert a random number here) or by Sun, hide the ASC/ASCQ sense codes from the filesystem. Pure crap.

Years ago, fighting this issue, back when ZFS was still rather experimental, I asked for help and Scott Long sent me a simple "don't try this at home" patch, so that the disks become available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but so far so good. And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.


http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679

Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, at a minimum, any corruption of the data on the disks will be detected. I much prefer to have ZFS deal with it, instead of working on a sort of "virtual" disk implemented by the RAID card.
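
As a side note, that end-to-end verification is easy to exercise periodically; a quick sketch, using the pool from the example at the bottom of this message:

# read and verify every block in the pool; problems show up in the
# READ/WRITE/CKSUM counters and in the error summary
zpool scrub pool
zpool status -v pool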

Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID firmware labels the disks, so moving one disk from one machine to another results in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (a reboot being mandatory in order to be able to access the transferred disk at all). Once again: additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.

Summarizing: I would *strongly* recommend avoiding the RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as the H200 seems to work very well. No need to create any JBOD or fancy thing at all; it just exposes the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.
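
Checking that the drives really are exposed directly to the CAM layer is trivial (this is just a sanity check, not specific to any particular adapter):

# the individual disks should show up here as plain daN devices,
# not as volumes synthesized by RAID firmware
camcontrol devlist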

But in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing, by patching the driver. There is one issue, though: on reboot the RAID cards freeze, and I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent", but they don't.

Also, I think the so-called JBOD setup would get in the way of a ZFS health daemon doing things such as automatically replacing failed disks with hot spares. And there won't be a real ASC/ASCQ log message for diagnosis.

(See the bottom of this message for a problem I have just had with a "JBOD" configuration.)




3- Installation, boot, etc.

Here I am not sure. Before zfsboot became available, I used to create a zfs-on-root system by doing, more or less, this:

- Install the base system on a pendrive. After the installation, only /boot will be used from the pendrive, and /boot/loader.conf will take care of loading the ZFS module and telling the kernel where the ZFS root is.

- Create the ZFS pool.

- Create and populate the root hierarchy. I used to create something like:

pool/root
pool/root/var
pool/root/usr
pool/root/tmp

Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically. 

I also set the mountpoint of the "system" tree to legacy and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems if, for example, I boot off a pendrive in order to tinker.
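
A minimal sketch of that layout, using the dataset names above (the fstab entries are what I mean by relying on /etc/fstab, and the recursive snapshot shows the point of keeping everything under pool/root):

# the "system" tree
zfs create pool/root
zfs create pool/root/var
zfs create pool/root/usr
zfs create pool/root/tmp

# keep it out of the automatic mount logic; the children inherit this
zfs set mountpoint=legacy pool/root

# /etc/fstab then mounts the children explicitly (the root dataset itself
# is mounted via vfs.root.mountfrom in /boot/loader.conf):
#   pool/root/var   /var   zfs  rw  0  0
#   pool/root/usr   /usr   zfs  rw  0  0
#   pool/root/tmp   /tmp   zfs  rw  0  0

# and the whole "system" tree can be snapshotted atomically in one go
zfs snapshot -r pool/root@before-tinkering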

For the last system I installed, I tried zfsboot instead of booting off the /boot directory of an FFS partition.
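
For the record, and with the caveat that this is just how I understand the GPT variant of it works, booting straight off the pool boils down to something like this (names follow the labelling example above):

# a small freebsd-boot partition on each disk of the root vdev, plus the
# GPT-aware ZFS boot code (adjust -i to the index of the freebsd-boot
# partition on that disk)
gpart add -t freebsd-boot -s 512k da0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

# tell the pool which dataset to boot from
zpool set bootfs=pool/root pool

# and /boot/loader.conf on that dataset needs at least:
#   zfs_load="YES"
#   vfs.root.mountfrom="zfs:pool/root"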




(*) An example of RAID/JBOD-induced crap, and of the problem of not using static naming, follows.

I am using a Sun server running FreeBSD. It has sixteen 160 GB SAS disks and one of those cards I worship; this particular one is driven by the aac driver.

As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and to use the other 14 disks as a pool with two raidz vdevs of six disks each, leaving two disks as hot spares. Later I removed one of the hot spares and installed an SSD with two partitions to try and make it work as L2ARC and log. As I had gone for the JBOD pain, of course, replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.

  pool: pool
 state: ONLINE
  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
config:

	NAME             STATE     READ WRITE CKSUM
	pool             ONLINE       0     0     0
	  raidz1-0       ONLINE       0     0     0
	    aacd1        ONLINE       0     0     0
	    aacd2        ONLINE       0     0     0
	    aacd3        ONLINE       0     0     0
	    aacd4        ONLINE       0     0     0
	    aacd5        ONLINE       0     0     0
	    aacd6        ONLINE       0     0     0
	  raidz1-1       ONLINE       0     0     0
	    aacd7        ONLINE       0     0     0
	    aacd8        ONLINE       0     0     0
	    aacd9        ONLINE       0     0     0
	    aacd10       ONLINE       0     0     0
	    aacd11       ONLINE       0     0     0
	    aacd12       ONLINE       0     0     0
	logs
	  gpt/zfs-log    ONLINE       0     0     0
	cache
	  gpt/zfs-cache  ONLINE       0     0     0
	spares
	  aacd14         AVAIL   

errors: No known data errors



The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot spare. But something had changed, and the pool remained in this state:

% zpool status
  pool: pool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
config:

	NAME                        STATE     READ WRITE CKSUM
	pool                        DEGRADED     0     0     0
	  raidz1-0                  DEGRADED     0     0     0
	    spare-0                 DEGRADED     0     0     0
	      13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
	      aacd14                ONLINE       0     0     0
	    aacd2                   ONLINE       0     0     0
	    aacd3                   ONLINE       0     0     0
	    aacd4                   ONLINE       0     0     0
	    aacd5                   ONLINE       0     0     0
	    aacd6                   ONLINE       0     0     0
	  raidz1-1                  ONLINE       0     0     0
	    aacd7                   ONLINE       0     0     0
	    aacd8                   ONLINE       0     0     0
	    aacd9                   ONLINE       0     0     0
	    aacd10                  ONLINE       0     0     0
	    aacd11                  ONLINE       0     0     0
	    aacd12                  ONLINE       0     0     0
	logs
	  gpt/zfs-log               ONLINE       0     0     0
	cache
	  gpt/zfs-cache             ONLINE       0     0     0
	spares
	  2388350688826453610       INUSE     was /dev/aacd14

errors: No known data errors
% 


ZFS was somewhat confused by the JBOD volumes, and it was impossible to get out of this situation. A reboot revealed that the card had, apparently, renumbered the volumes. Thanks to the resiliency of ZFS I didn't lose a single bit of data, but the situation felt risky. I was finally able to fix it by replacing the failed disk, rebooting the whole server (of course) and doing a zpool replace. But the card added some confusion, and I still don't know what the actual disk failure was. No trace of a meaningful error message.
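
For reference, on a sane setup (i.e. without the card renumbering volumes behind ZFS's back) a stuck spare like the one above should normally be resolved with something like this; the device names are the ones from the status output, and whether the new disk even shows up without a reboot depends entirely on the card:

# resilver onto the disk that physically replaced the failed one,
# referring to the failed device by the numeric GUID shown by zpool status
zpool replace pool 13277671892912019085 aacd1

# once resilvering completes, return the hot spare to the spares list
# if it does not go back to AVAIL by itself
zpool detach pool aacd14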




Best regards,






Borja.



