RFC: Suggesting ZFS "best practices" in FreeBSD

Sat Jan 26 05:31:52 UTC 2013

On Fri, Jan 25, 2013 at 08:48:51PM -0700, Warren Block wrote:
> On Fri, 25 Jan 2013, Jeremy Chadwick wrote:
> 
> >On Fri, Jan 25, 2013 at 12:58:15PM -0700, Warren Block wrote:
> >>On Thu, 24 Jan 2013, Jeremy Chadwick wrote:
> >>
> >>>>>#1.  Map the physical drive slots to how they show up in FBSD so if a
> >>>>>disk is removed and the machine is rebooted all the disks after that
> >>>>>removed one do not have an 'off by one error'.  i.e. if you have
> >>>>>ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
> >>>>>missing ada8 drive and the next drive (that used to be ada9) is now
> >>>>>called ada8 and so on...
> >>>>
> >>>>How do you do that?  If I'm in that situation, I think I could find the
> >>>>bad drive, or at least the good ones, with diskinfo and the drive serial
> >>>>number.  One suggestion I saw somewhere was to use disk serial numbers
> >>>>for label values.
> >>>
> >>>The term FreeBSD uses for this is called "wiring down" or "wired down",
> >>>and is documented in CAM(4).  It's come up repeatedly over the years but
> >>>for whatever reason people overlook it or can't find it.
> >>
> >>I was aware of it, it just seems like there ought to be a better way
> >>to identify drives than by messing with the hardware configuration.
> >
> >I understand what you mean, but it's actually messing with a software
> >configuration (specifically CAM).
> >
> >It's a one-time change that solves the dilemma; it only has to be
> >adjusted if you change controller brands or models, which is a lot less
> >often than changing disks.
> >
> >>Something more elegant, less tied to changing the hardware
> >>configuration of the host.  Assigning the drive serial number as a
> >>label, for example.
> >
> >Hmm...  all this does is change the nature of the problem, no?  You
> >still have the issue of "having to know some magical number" to
> >determine what path name refers to what physical disk in your system.
> >Can you expand on how this would solve it?
> 
> It's not so much a solution as in the right domain.  The point, as I
> see it, is being able to identify individual disks uniquely.
> Forcing static devices names does that, sort of.  But plug a
> different disk into the same port as an existing one, and that disk
> is now identified as the old one.

Identifying individual disks is a separate subject, as I see it, from
that of what the original concern was.  Quoting that concern:

> >>>>>#1.  Map the physical drive slots to how they show up in FBSD so if a
> >>>>>disk is removed and the machine is rebooted all the disks after that
> >>>>>removed one do not have an 'off by one error'.  i.e. if you have
> >>>>>ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
> >>>>>missing ada8 drive and the next drive (that used to be ada9) is now
> >>>>>called ada8 and so on...

How I interpret that: "when I have a drive bay that's not populated, or
a SATA port that has nothing on it, the /dev/adaX numbering changes or
shifts by N.  That's frustrating!"  He's trying to ensure 1:1 static
device numbering.  The way to do that in CAM is "wiring down".  There
are lots of methods to avoid using the adaX/daX/etc. nomenclature --
labels of course! -- but like I said those just exchange one problem for
another.

I do wish there was some intelligent way in software to accomplish the
"wiring down" method without having to do loader.conf modifications, I
just don't know how software would be able to make those kinds of
decisions.
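
For reference, a minimal Python sketch of what those wired-down entries
look like when generated for a run of controller channels.  It assumes
the ahci(4) channel names are ahcich0, ahcich1, ... and that each channel
carries exactly one disk; check dmesg on the real host.  The function
name wired_down_hints is made up for illustration, and the hint syntax
follows the examples in CAM(4).

```python
def wired_down_hints(channels, disk_driver="ada"):
    """Return hint lines pinning bus N and disk unit N to the Nth channel.

    channels: controller channel names in physical bay order, e.g.
    ["ahcich0", "ahcich1"].  These names are assumptions for the example.
    """
    lines = []
    for unit, channel in enumerate(channels):
        # Pin the CAM bus number to one specific controller channel.
        lines.append(f'hint.scbus.{unit}.at="{channel}"')
        # Pin the disk unit number to that bus.
        lines.append(f'hint.{disk_driver}.{unit}.at="scbus{unit}"')
    return lines


if __name__ == "__main__":
    print("\n".join(wired_down_hints(["ahcich0", "ahcich1", "ahcich2"])))
```

The printed lines would be appended to /boot/device.hints or
/boot/loader.conf.  With them in place, an empty bay leaves a gap in the
numbering instead of shifting every later disk down by one.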

<removed part about ATA_STATIC_ID and recent bge(4) SMBIOS device
numbering thread :D>

> Using a unique identifier already built into those drives helps.
> Serial numbers are unique, built into the drive, and even printed on
> the paper label.  They can be queried through software and take no
> disk space.  If a drive fails electronically to the point it can't
> be queried, that serial number can be identified from a current list
> of all the drive serial numbers in the array--it's the one not
> there.

How does that serial number correlate with anything physical though?
CAM's "wired down" method allows you to correlate a number (device
number) with something physical.

If what you're saying is "we should have something *like* labels, but
instead not a label at all, instead use what's already there (WWN or
serial number or something generated off of serno+modelstring+etc.)"
then yeah, I'm hearing you on FM.  :-)
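
A rough Python sketch of the "generated off of serno+modelstring" idea,
for disks that have no WWN.  It is only one possible scheme, not
something FreeBSD does: the namespace string and the disk_id name are
hypothetical, and the model and serial values in the usage note are
placeholders.

```python
import uuid

# Hypothetical namespace for this example; any fixed UUID works as long
# as every host deriving IDs uses the same one.
DISK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "disks.example.org")


def disk_id(model, serial):
    """Derive a stable, name-based UUID from model string and serial."""
    # Strip padding, since identify strings are often space padded.
    key = f"{model.strip()}|{serial.strip()}"
    return uuid.uuid5(DISK_NAMESPACE, key)
```

Calling disk_id("EXAMPLE MODEL", "SN0001") gives the same UUID on every
host and every boot.  Firmware revision is left out of the key on
purpose, since a firmware update would otherwise change the disk's ID.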

> There are problems, they aren't like LEDs on each drive that could
> flash to identify it.  Some enclosures don't make drive labels easy
> to see. Some of that can be addressed with labels.  Er, sticky
> labels, on the outside of the drive or enclosure.  And serial
> numbers are often inconveniently long.

On my Supermicro SC733T chassis, I ended up using a label maker to print
out labels that read "ada0", "ada1", etc. and placed them next to each
respective hot-swap bay.

Not a lot of people know about /dev/led/ (see ahci(4), search for
"LED").  SGPIO does what you want (see SFF-8485 per Seagate).  It's
wonderful except when someone doesn't play nice -- you know what they
say about standards.....

> >As for a unique number per disk, disks within the past ~5 years (SATA,
> >SAS, and some SCSI) all tend to have this: it's called a WWN:
> >
> >http://en.wikipedia.org/wiki/World_Wide_Name
> >
> >But older ATA disks (and by older I don't mean ancient, I mean even
> >semi-old) may not have this, which means you get to use something else.
> >UUIDs come to mind, but then the question becomes what do you base the
> >generation off of?  Model string + serial number + firmware?
> >
> >There are also complexities depending on HBAs (disk controllers) as
> >well; I've seen references, at least on Solaris, of people having one
> >disk showing up twice across 2 separate controllers (i.e. only 1
> >physical disk in the machine, but showing up as both c8d0 and c9d0, both
> >with the same model string and serial number).  I imagine some RAID
> >controllers would do this (when a drive isn't part of an array; it might
> >show up as both /dev/adaX and /dev/somedriverX).  I know at some point I
> >saw this with FreeBSD too during an OS install, I just can't remember
> >what the names were that I saw.
> 
> Surely that ought to be considered a bug.  Any drive ID system is
> going to be vulnerable to certain

I think we just see differently on the matter.  Here's a sort of
inverted example using famous Intel ICHxxR controllers:

You enable RAID mode in your PC BIOS, knowing FreeBSD has GEOM_RAID
support.  You have 2 disks, and you want to RAID-0 them.  You go into
the option ROM and assign disk 0 and disk 1 to a volume.  Now you boot
FreeBSD to install the OS and are offered a choice of 3 disks to install
it on: raid/r0, ada0, and ada1.

Surprise!

On the opposite side, I've seen HBAs which, to accomplish JBOD capability,
require you to assign each disk as a RAID-0 array.  5 disks, each
RAID-0, resulting in FreeBSD showing 5 separate volumes.  What happens if
you don't assign them as RAID-0 (i.e. leaving them blank or "------")
depends on the controller and driver too (some show no drives on the bus
in this case).

Surprise!  Again!  :-)

Most of this is just the nature of the beast with storage.

> >Linux has by-uuid and by-id (the latter is what you'd like), but there
> >are caveats to that too:
> >
> >https://wiki.archlinux.org/index.php/Persistent_block_device_naming
> >http://www.terabyteunlimited.com/kb/article.php?id=389
> >
> >So at the end of the day I prefer CAM's "wired down" method -- the
> >reason is that by modifying loader.conf I **know for sure** bay/cable X
> >maps to /dev/adaX, and it's a one-time deal until I decide to move from
> >my ICH9 controller to, say, an Areca.
> 
> That illustrates one problem with making the configuration specific
> to host hardware as compared to drive specific.

You might have missed "by-id" on the first link -- it's based off some
kind of number (I can't figure out if it's truly disk serial number or
WWN).

> As far as "best practices", situations vary so much that I don't
> know if any drive ID method can be recommended.  For a FreeBSD ZFS
> document, a useful sample configuration is going to be small enough
> that anything would work.  A survey of the techniques in use at
> various data centers would be interesting.

I'd agree here wholeheartedly.  And I would find such a survey very
interesting too.

My money would be on /dev/adaX or similar being used by the majority.
This certainly stems from a) lack of education about labels (people
often don't know they exist), b) too many choices (UFS labels, GEOM
labels, GPT labels, and I think I'm missing one), and c) confusion over
what label and utility correlates with what.  For (c), example: gpart(8)
talks about label support "for partitions that support them", lists off
partitioning methods it supports, but doesn't tell you which ones
support labels.  Those are **partition** labels, by the way; don't
confuse those with, say, UFS labels.

I'll give you one reason why /dev/adaX or similar conventions win out,
and I'll bring ZFS into the picture:

Say you have a raidz1 pool of ada1/2/3.  ada2 fails (you no longer can
read ANY data off the drive).  You go out to the datacenter with a
replacement disk, yank the old, insert the new, and issue "zpool replace
pool ada2".  You leave.  (And if you use autoreplace=on you don't even
need the last step.)  Done.

Now let's say you use a "labelled" setup (vs. raw disks), so using GEOM
or GPT labels.  You do the exact same thing, but instead of the last
step and leaving you have to go fiddling around with glabel/gpart,
naming things the same, "hoping" you have that documented somewhere.
You end up dumping data from another (working) disk in the pool, saying
"ahh right", do your best to mimic it -- all while hoping you don't make
typos since the datacenter is freezing cold, hoping you get it right;
get it wrong and you might have to start over from the beginning ("oh
god, gpart destroy and...").  And if your machine has no network access
at that time, firing up lynx/w3m to look online is out of the question.

Now pretend for a moment we have something like /dev/wwn/xxx for WWN
support (or something similar for serial numbers -- doesn't matter).
Yank old, insert new, and issue "zpool replace pool... uhh, wait".
You then have to go fiddling around to find the WWN.  Let's say you
find it quickly.  "zpool replace pool /dev/wwn/old /dev/wwn/new" (note
the additional parameter).  You leave.
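
The difference between the two cases can be sketched in a few lines of
Python that only build the command line for "zpool replace pool device
[new_device]".  The helper name zpool_replace_argv is made up, and the
/dev/wwn/ paths are the same pretend namespace as above, not something
that exists.

```python
def zpool_replace_argv(pool, old, new=None):
    """Build the argv for 'zpool replace pool device [new_device]'."""
    argv = ["zpool", "replace", pool, old]
    if new is not None:
        # Only needed when the replacement shows up under a different name.
        argv.append(new)
    return argv


if __name__ == "__main__":
    # Wired-down device name: the new disk reuses the old name.
    print(" ".join(zpool_replace_argv("pool", "ada2")))
    # Per-disk identifier: old and new names differ, so both are needed.
    print(" ".join(zpool_replace_argv("pool", "/dev/wwn/old", "/dev/wwn/new")))
```

With a stable per-slot name the operator needs to know one thing (the
slot); with a per-disk name they need to know two (the dead disk's ID
and the new disk's ID).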

KISS principle wins out for me, as someone who did co-located hosting
for over 17 years.  But I'm certain there are outfits who heavily use
labelling for a lot of reasons (/dev naming consistency across multiple
hardware systems comes to mind; "I want /dev/label/snakes regardless if
I'm using an aac(4) controller or a siis(4) controller!").

Anyway, taken too much time to write this mail, have other things to do
tonight.  :-)

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |