Using sysctl(1) to gather resource consumption data

Sat Sep 13 00:15:08 UTC 2008

At $work, I've been trying to gather information on "interesting
patterns" of resource consumption during moderately long-running (5 - 8
hour) tasks; the hosts in question usually run FreeBSD 6.2, though
there's an occasional 6.x that's more recent, as well as a bit of
7-STABLE.

I wanted to have a low impact on the system being measured (of course),
and I was unwilling to require that a system to be measured have any
software installed on it other than base FreeBSD.  (Yes, that means I
didn't assume Perl, though in practice, in this environment, each host
has it.)

I also wanted the data to be transferred reasonably securely, even if
part of that transit was across facilities I don't control.  (Some of
the machines being measured happen to be on a continent other than the
one I'm on.)

So I cobbled up a Perl script to run on a data-gathering machine (that
one was mine, so I could require that it had any software I wanted on
it); it acts (if you will) as a "shepherd," watching over child
processes, one of which is created for each host to be measured.

A given child process copies over a shell script to the remote machine,
then redirects STDOUT to append to a file on the data-gathering machine,
and exec()s ssh(1), telling it to run the shell script on the remote
machine.
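
A minimal sketch of the "redirect STDOUT to append, then exec" step
described above.  The real child process is part of a Perl script; this
is Python purely for illustration, and the names spawn_appending and
ssh_argv, the /bin/sh invocation, and the paths are all assumptions
rather than anything from the actual scripts.  The step that copies the
shell script to the remote machine is not shown.

```python
import os
import sys


def spawn_appending(argv, log_path):
    """Fork; in the child, point stdout at log_path (append mode), then
    exec argv.  Returns the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            fd = os.open(log_path,
                         os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
            if fd != 1:
                os.dup2(fd, 1)   # stdout now appends to the per-host file
                os.close(fd)
            os.execvp(argv[0], argv)   # does not return on success
        finally:
            os._exit(127)              # only reached if setup or exec failed
    return pid


def ssh_argv(host, remote_script):
    """Hypothetical command line: run the copied script on the remote host."""
    return ["ssh", host, "/bin/sh", remote_script]
```

The parent keeps the returned PID so that it can later kill a child
whose output file has stopped growing.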

The shell script fabricates a string (depending on the arguments with
which it was invoked), then sits in a loop:

* eval the string
* sleep for the amount of time remaining

indefinitely.  (In practice, the usual nominal time between successive
eval()s is 5 minutes.  I have recently been doing some experiments at a
10-second interval.)
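
The "sleep for the amount of time remaining" step is what keeps the
samples on a fixed cadence regardless of how long each eval takes.  The
arithmetic can be sketched as follows; the real script is sh, so this
Python version only illustrates the idea, and the names remaining_sleep
and sample_loop (and the iterations parameter, which exists only so the
loop can be stopped) are assumptions.

```python
import time


def remaining_sleep(started, finished, interval):
    """Seconds left in the current interval once the work is done;
    never negative, even if the work overran the interval."""
    return max(0.0, interval - (finished - started))


def sample_loop(collect, interval, iterations=None,
                clock=time.monotonic, sleep=time.sleep):
    """Call collect() once per interval; loop forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        started = clock()
        collect()
        sleep(remaining_sleep(started, clock(), interval))
        done += 1
```

With a nominal interval of 300 seconds, a collection that takes 2
seconds is followed by a 298-second sleep.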

Periodically, back on the data-gathering machine, a couple of different
things happen:

* The "shepherd" script wakes up.  It first checks the file that lists
  the hosts to watch; if that file's mtime has changed, it's re-read,
  and the list of hosts is modified as appropriate.  It then checks the
  mtime on the file for each per-host process (to see if it's been
  updated "sufficiently recently"); if a given per-host file is "too
  old," the corresponding child process is killed.  Then the script
  runs through the list of hosts that should be checked, creating a
  per-host process for each one for which that's necessary.

  There's a fair amount of detail I'm eliding (such as limited
  exponential backoff for unresponsive hosts).

  In practice, this runs every 2 minutes at the moment.

* There's a cron(8)-initiated make(1) process that runs, reading the
  files created by the per-host processes and writing to a corresponding
  RRD.  (I cobbled up a Perl script to do this.)
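
The two mtime checks the "shepherd" makes can be sketched roughly like
this.  It is an illustration in Python, not the actual Perl: the
function names, the one-host-per-line file format, and the max_age
threshold are all assumptions, and the backoff logic mentioned above is
left out.

```python
import os
import time


def is_stale(path, max_age, now=None):
    """True if path is missing or was last modified more than
    max_age seconds ago."""
    now = time.time() if now is None else now
    try:
        return now - os.stat(path).st_mtime > max_age
    except FileNotFoundError:
        return True


def hosts_if_changed(hosts_path, last_mtime):
    """Re-read the host list only if its mtime differs from last_mtime.
    Returns (hosts, mtime), or (None, last_mtime) when unchanged.
    Assumes one host name per non-blank line."""
    mtime = os.stat(hosts_path).st_mtime
    if mtime == last_mtime:
        return None, last_mtime
    with open(hosts_path) as fh:
        hosts = [line.strip() for line in fh if line.strip()]
    return hosts, mtime
```

A child whose output file is stale would be killed and, on the same
pass, recreated if its host is still in the list.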

While I tried to externalize a fair amount of this -- e.g., the list of
sysctl(1) OIDs to use is read from an external file -- it turns out that
certain types of change are a bit ... painful.  In particular, adding a
new "data source" to the RRD qualifies (as "painful").

I recently modified the scripts involved to allow them to also be used
to gather per-NIC statistics (via invocation of "netstat -nibf inet").

I'm about to implement that change over the weekend, so it occurred to
me that this might be a good time to add some more sysctl(1) OIDs.

So I'm asking for suggestions -- ideally, for OIDs that are fairly
easily parseable.  (I started out limited to OIDs that were presented
as a single numeric value per line, then figured out how to handle
kern.cp_time (which is an ordered quintuple); later I figured out how
to cope with vm.loadavg (which is an ordered triplet ... surrounded by
curly braces).  I don't currently have logic to cope with anything more
complicated than those.)
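
To make "easily parseable" concrete, here is a rough Python sketch that
handles the three shapes mentioned: a single number, a space-separated
tuple such as kern.cp_time, and a brace-wrapped tuple such as
vm.loadavg.  It is not the Perl actually in use; the function name is
hypothetical and the sample values in the usage note are made up.

```python
def parse_sysctl_line(line):
    """Split 'oid: v1 v2 ...' into (oid, [numbers]).
    Returns None for anything that is not purely numeric."""
    oid, sep, rest = line.partition(":")
    if not sep:
        return None
    # Treat the curly braces around vm.loadavg as plain whitespace.
    tokens = rest.replace("{", " ").replace("}", " ").split()
    if not tokens:
        return None
    try:
        values = [float(t) if "." in t else int(t) for t in tokens]
    except ValueError:
        return None
    return oid.strip(), values
```

For example, parse_sysctl_line("vm.loadavg: { 0.12 0.34 0.56 }") yields
("vm.loadavg", [0.12, 0.34, 0.56]), while an OID whose value is a
string or a structure is rejected with None.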

Here's a list of the OIDs I'm currently using:

debug.dir_entry
debug.direct_blk_ptrs
debug.numcache
debug.numcachehv
debug.numneg
debug.to_avg_depth
debug.to_avg_gcalls
debug.to_avg_mpcalls
hw.usermem
kern.cp_time
kern.ipc.max_datalen
kern.ipc.max_hdr
kern.ipc.maxsockbuf
kern.ipc.msgmax
kern.ipc.msgmnb
kern.ipc.msgmni
kern.ipc.msgtql
kern.ipc.nmbclusters
kern.ipc.nmbjumbo16
kern.ipc.nmbjumbo9
kern.ipc.nmbjumbop
kern.ipc.nsfbufs
kern.ipc.nsfbufspeak
kern.ipc.nsfbufsused
kern.ipc.numopensockets
kern.ipc.pipekva
kern.ipc.pipes
kern.kstack_pages
kern.malloc_count
kern.maxfiles
kern.maxusers
kern.nselcoll
kern.openfiles
net.isr.count
net.isr.deferred
net.isr.directed
net.isr.drop
net.isr.queued
vfs.bufdefragcnt
vfs.buffreekvacnt
vfs.bufmallocspace
vfs.bufreusecnt
vfs.bufspace
vfs.cache.dotdothits
vfs.cache.dothits
vfs.cache.numcache
vfs.cache.numcalls
vfs.cache.numchecks
vfs.cache.numfullpathcalls
vfs.cache.numfullpathfail1
vfs.cache.numfullpathfail2
vfs.cache.numfullpathfail4
vfs.cache.numfullpathfound
vfs.cache.nummiss
vfs.cache.nummisszap
vfs.cache.numneg
vfs.cache.numneghits
vfs.cache.numnegzaps
vfs.cache.numposhits
vfs.cache.numposzaps
vfs.dirtybufferflushes
vfs.dirtybufthresh
vfs.flushwithdeps
vfs.freevnodes
vfs.getnewbufcalls
vfs.getnewbufrestarts
vfs.hibufspace
vfs.hidirtybuffers
vfs.hirunningspace
vfs.lobufspace
vfs.lodirtybuffers
vfs.lorunningspace
vfs.maxbufspace
vfs.maxmallocbufspace
vfs.nfs.downdelayinitial
vfs.nfs.downdelayinterval
vfs.nfs.realign_count
vfs.nfs.realign_test
vfs.nfs.reconnects
vfs.nfs4.access_cache_timeout
vfs.numdirtybuffers
vfs.numfreebuffers
vfs.numvnodes
vfs.read_max
vfs.reassignbufcalls
vfs.wantfreevnodes
vfs.write_behind
vm.loadavg
vm.stats.misc.cnt_prezero
vm.stats.misc.zero_page_count
vm.stats.sys.v_intr
vm.stats.sys.v_soft
vm.stats.sys.v_swtch
vm.stats.sys.v_syscall
vm.stats.sys.v_trap
vm.stats.vm.v_active_count
vm.stats.vm.v_cow_faults
vm.stats.vm.v_cow_optim
vm.stats.vm.v_forkpages
vm.stats.vm.v_forks
vm.stats.vm.v_free_count
vm.stats.vm.v_inactive_count
vm.stats.vm.v_intrans
vm.stats.vm.v_kthreads
vm.stats.vm.v_ozfod
vm.stats.vm.v_pdpages
vm.stats.vm.v_pdwakeups
vm.stats.vm.v_pfree
vm.stats.vm.v_reactivated
vm.stats.vm.v_rforks
vm.stats.vm.v_swapin
vm.stats.vm.v_swapout
vm.stats.vm.v_swappgsin
vm.stats.vm.v_swappgsout
vm.stats.vm.v_tfree
vm.stats.vm.v_vforkpages
vm.stats.vm.v_vforks
vm.stats.vm.v_vm_faults
vm.stats.vm.v_vnodein
vm.stats.vm.v_vnodeout
vm.stats.vm.v_vnodepgsin
vm.stats.vm.v_vnodepgsout
vm.stats.vm.v_wire_count
vm.stats.vm.v_zfod
vm.swap_idle_threshold1
vm.swap_idle_threshold2

I admit that I don't know what several of those actually mean: I figured
I'd capture what I can, then try to make sense of it.  It's very easy to
ignore data that I've captured, but don't need; it's a little harder to take
appropriate corrective action if I determine that there was some
information I should have captured, but didn't.  :-}

Still, if something's in there that's just silly, I wouldn't mind knowing
about it.  :-)

Thanks!

Peace,
david
-- 
David H. Wolfskill				david at catwhisker.org
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.