[RFC] [patch] periodic status-zfs: list pools in daily emails

Wed Jun 29 11:19:18 UTC 2011

On Wed, Jun 29, 2011 at 06:37:32AM -0400, Glen Barber wrote:
> Hi Alexander,
> 
> On 6/29/11 4:46 AM, Alexander Leidinger wrote:
> >> I added a default behavior to list the pools on the system, in
> >> addition to checking if the pool is healthy.  I think it might be
> >> useful for others to have this as the default behavior, for example
> >> on systems where dedup is enabled to track the dedup statistics over
> >> time.
> > 
> > I do not think this is a bad idea to be able to see the pools... but
> > IMHO it should be configurable (no strong opinion about "enabled or
> > disabled by default").
> > 
> 
> Agreed.  I can add this in.
> 
> >> The output of the script after my changes follows:
> > 
> > Info to others: this is the default output, there is no special option
> > to track DEDUP.
> > 
> >> Checking status of zfs pools:
> >> NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> >> zroot    456G   147G   309G    32%  1.00x  ONLINE  -
> >> zstore   928G   258G   670G    27%  1.00x  ONLINE  -
> >> all pools are healthy
> >>
> >> Feedback would be appreciated.  A diff is attached.
> > 
> > Did you test it with an unhealthy pool? If yes, what does the result look
> > like?
> > 
> 
> I have not, yet.  I can do this later today by breaking a mirror.
> 
> > For the healthy case we have redundant info (but as the brain is good at
> > pattern matching, I would object to replacing the status with the list
> > output, in case someone suggests this). In the unhealthy case we
> > will surely have more info; my question is whether an empty line
> > between the list and the status would make it more readable or not.
> > 
> 
> I will reply later today with the output of the script with an unhealthy
> pool, and will make listing the pools configurable.  I imagine an empty
> line would certainly make it more readable in either case.  I would be
> reluctant to replace 'status' output with 'list' output for healthy pools,
> mostly to avoid headaches for people parsing their daily email and
> specifically looking for (or missing) 'all pools are healthy.'

At my workplace we use a heavily modified version of Netsaint, with
Nagios-like bits and pieces added.  I happened to write the Perl code used
to monitor our production Solaris systems (2000+ servers) for ZFS pool
status.  It parses "zpool status -x" output, monitoring read, write, and
checksum errors per pool, vdev, and device, in addition to general pool
status.  I tested too many conditions, not to mention had to deal with
parsing pains as a result of ZFS code changes, plus supporting
completely different revisions of Solaris 10 in production.  And before
someone asks: no, I cannot provide the source (employee agreements, LCA,
etc...).  I also had to dig through the ZFS source code to figure out a
bunch of necessary bits, so don't be surprised if you have to as well.

My recommendation: just look for pools which are in any state other than
ONLINE (don't try to be smart with an OR regex looking for all the
combos; it doesn't scale when ZFS changes), and you should also handle
situations where a device is currently undergoing manual or automatic
device replacement (specifically regex '^[\t\s]+replacing\s+DEGRADED'),
which will be important to people who keep spares in pools.  This might
be difficult with just standard BSD sh, but BSD awk should be able to
handle this.
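
A rough sketch of that approach in sh plus awk, not the code from the
patch and not what we run in production.  The function names are made up
for illustration.  It assumes the pool list comes from
"zpool list -H -o name,health" (tab-separated, no header), uses [ \t]
in place of \s since not every awk understands \s, and allows an optional
"-N" suffix on "replacing" on the assumption that some ZFS versions print
the vdev as "replacing-0".

```shell
# Sketch only: function names are hypothetical.

# Reads "name<TAB>health" lines on stdin (the format produced by
# `zpool list -H -o name,health`) and prints every pool whose health is
# anything other than ONLINE.  Returns 1 if at least one such pool was
# seen, 0 otherwise.
unhealthy_pools() {
    awk -F '\t' '
        $2 != "ONLINE" { print $1 ": " $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Reads `zpool status` output on stdin and prints the vdev lines that
# show a device replacement in progress.  The optional "-N" suffix is an
# assumption about newer output formats.
replacing_vdevs() {
    awk '/^[ \t]+replacing(-[0-9]+)?[ \t]+DEGRADED/ { print }'
}
```

Usage would be along the lines of
"zpool list -H -o name,health | unhealthy_pools" and
"zpool status | replacing_vdevs".  Testing for "not ONLINE" rather than
enumerating the bad states means a new state name still gets reported.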

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |