[Bug 259607] sysutils/node_exporter / OpenZFS: need better handling for ZFS statistic numbers

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 02 Nov 2021 06:36:24 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259607

            Bug ID: 259607
           Summary: sysutils/node_exporter / OpenZFS: need better handling
                    for ZFS statistic numbers
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: dor.bsd@xm0.uk
          Reporter: delphij@FreeBSD.org
                CC: dor.bsd@xm0.uk, fs@FreeBSD.org
             Flags: maintainer-feedback?(dor.bsd@xm0.uk)

node_exporter is spamming syslog with:

level=error ts=2021-11-02T06:18:08.285Z caller=stdlib.go:105 msg="error
gathering metrics: 11 error(s) occurred:\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_write_boost\" { untyped:<value:8.388608e+06 > } was
collected before with the same name and label values\n* [from Gatherer #2]
collected metric \"sysctl_vfs_zfs_arc_max\" { untyped:<value:1.2884901888e+10 >
} was collected before with the same name and label values\n* [from Gatherer
#2] collected metric \"sysctl_vfs_zfs_l2arc_feed_min_ms\" { untyped:<value:200
> } was collected before with the same name and label values\n* [from Gatherer
#2] collected metric \"sysctl_vm_uma_tcp_log_bucket_size\" { untyped:<value:30
> } was collected before with the same name and label values\n* [from Gatherer
#2] collected metric \"sysctl_vfs_zfs_l2arc_write_max\" {
untyped:<value:8.388608e+06 > } was collected before with the same name and
label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_feed_again\" { untyped:<value:1 > } was collected before
with the same name and label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_norw\" { untyped:<value:0 > } was collected before with
the same name and label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_noprefetch\" { untyped:<value:1 > } was collected before
with the same name and label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_arc_min\" { untyped:<value:0 > } was collected before with the
same name and label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_feed_secs\" { untyped:<value:1 > } was collected before
with the same name and label values\n* [from Gatherer #2] collected metric
\"sysctl_vfs_zfs_l2arc_headroom\" { untyped:<value:2 > } was collected before
with the same name and label values"

The problem is that some ZFS values are exported under two OIDs, a legacy
name and its replacement, for example:

vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc.feed_min_ms: 200

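But node_exporter translates "." to "_" unconditionally, so both OIDs map to
the same metric name (here sysctl_vfs_zfs_l2arc_feed_min_ms) and the second
one is rejected as a duplicate.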
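
As a minimal illustration of the collision, here is a Python sketch. It is not node_exporter's actual Go code: the function name naive_metric_name is hypothetical, and the "sysctl_" prefix is simply taken from the metric names in the log above.

```python
def naive_metric_name(oid: str) -> str:
    """Turn a sysctl OID into a metric name by mapping every '.' to '_'."""
    # Hypothetical helper; assumes the "sysctl_" prefix seen in the log.
    return "sysctl_" + oid.replace(".", "_")


legacy = naive_metric_name("vfs.zfs.l2arc_feed_min_ms")
current = naive_metric_name("vfs.zfs.l2arc.feed_min_ms")

# Both OIDs collapse to the same name, so the second sample is a duplicate.
print(legacy)
print(current)
print(legacy == current)
```

Any two OIDs that differ only in whether a given separator is "." or "_" will end up with the same name under this translation.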

To get a list of all affected sysctl OIDs, one can use:

sysctl -da | grep -E \("$(sysctl -Na | sed -e s,\\.,_,g | sort | uniq -c |
    sort -n | awk '{ if ($1 > 1) print $2; }' | sed -e s,_,.,g |
    paste -sd \| -)"\):

and on FreeBSD 13.0-RELEASE-p4, I got:

vm.uma.tcp_log_bucket.size: Allocation size
vm.uma.tcp_log.bucket_size: Desired per-cpu cache size
vfs.zfs.arc_max: max arc size (LEGACY)
vfs.zfs.arc_min: min arc size (LEGACY)
vfs.zfs.l2arc_norw: no reads during writes (LEGACY)
vfs.zfs.l2arc_feed_again: turbo warmup (LEGACY)
vfs.zfs.l2arc_noprefetch: don't cache prefetch bufs (LEGACY)
vfs.zfs.l2arc_feed_min_ms: min interval milliseconds (LEGACY)
vfs.zfs.l2arc_feed_secs: interval seconds (LEGACY)
vfs.zfs.l2arc_headroom: number of dev writes (LEGACY)
vfs.zfs.l2arc_write_boost: extra write during warmup (LEGACY)
vfs.zfs.l2arc_write_max: max write size (LEGACY)
vfs.zfs.l2arc.norw: No reads during writes
vfs.zfs.l2arc.feed_again: Turbo L2ARC warmup
vfs.zfs.l2arc.noprefetch: Skip caching prefetched buffers
vfs.zfs.l2arc.feed_min_ms: Min feed interval in milliseconds
vfs.zfs.l2arc.feed_secs: Seconds between L2ARC writing
vfs.zfs.l2arc.headroom: Number of max device writes to precache
vfs.zfs.l2arc.write_boost: Extra write bytes during device warmup
vfs.zfs.l2arc.write_max: Max write bytes per interval
vfs.zfs.arc.max: Max arc size
vfs.zfs.arc.min: Min arc size
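
The check performed by the one-liner can also be sketched in Python, which may be easier to follow: group the OID names by their "."-to-"_" translation and keep every group with more than one member. The function name find_collisions and the sample list are assumptions for illustration; kern.ostype is included only as an example of an OID that does not collide.

```python
from collections import defaultdict


def find_collisions(oids):
    """Group OIDs by their '.'-to-'_' translation; keep groups with >1 OID."""
    groups = defaultdict(list)
    for oid in oids:
        groups[oid.replace(".", "_")].append(oid)
    return {name: sorted(members) for name, members in groups.items()
            if len(members) > 1}


# A few names from the output above, plus one OID that does not collide.
sample = [
    "vm.uma.tcp_log_bucket.size",
    "vm.uma.tcp_log.bucket_size",
    "vfs.zfs.arc_max",
    "vfs.zfs.arc.max",
    "kern.ostype",
]

for name, members in sorted(find_collisions(sample).items()):
    print(name, "<-", ", ".join(members))
```

On a FreeBSD host the full list of names could be read from the output of "sysctl -Na", one OID per line, instead of the hard-coded sample.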

So it's not just ZFS: the vm.uma.tcp_log_bucket.size and
vm.uma.tcp_log.bucket_size pair collides as well.  I *think* the proper fix
should be to change the translation code to first replace all "_" with "__",
then replace "." with "_".
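
A minimal Python sketch of the proposed translation follows. It only illustrates the idea and is not a patch against node_exporter, which is written in Go. The function name escaped_metric_name is hypothetical and the "sysctl_" prefix is assumed from the log above.

```python
def escaped_metric_name(oid: str) -> str:
    """Double every '_' first, then map '.' to '_' (the order matters)."""
    # Hypothetical helper; assumes the "sysctl_" prefix seen in the log.
    return "sysctl_" + oid.replace("_", "__").replace(".", "_")


pairs = [
    ("vfs.zfs.l2arc_feed_min_ms", "vfs.zfs.l2arc.feed_min_ms"),
    ("vm.uma.tcp_log_bucket.size", "vm.uma.tcp_log.bucket_size"),
]

# Each pair now yields two distinct metric names.
for legacy, current in pairs:
    print(escaped_metric_name(legacy), escaped_metric_name(current))
```

This separates all of the pairs listed above. One caveat: the scheme is still ambiguous when an underscore sits directly next to a dot (for instance, the made-up names "a_.b" and "a._b" both become "a___b"), so it would be worth checking that no real OID has that shape.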
