Log analysis server suggestions? [long]

Chuck Swiger cswiger at mac.com
Thu Feb 16 07:30:37 PST 2006


Ashley Moran <ashley.moran at codeweavers.net> wrote:
[ ... ]

I'm not sure who the original poster was, but whoever is interested in this
topic might benefit by reading a thread from the firewall-wizards mailing list:

-------- Original Message --------
Subject: [fw-wiz] parsing logs ultra-fast inline
Date: Wed, 01 Feb 2006 16:03:38 -0500
From: Marcus J. Ranum <mjr at ranum.com>
To: firewall-wizards at honor.icsalabs.com

OK, based on some offline discussion with a few people, about
doing "large amounts" of system log processing inline at high
speeds, I thought I'd post a few code fragments and some ideas
that have worked for me in the past.

First off, if you want to handle truly ginormous amounts of log
data quickly, you need to build a structure wherein you're making
decisions quickly at a broad level, then "drilling down" based on
the results of the decision. This allows you to parallelize infinitely
because all you do is make the first branch in your decision tree
"stripe" across all your analysis engines. So, hypothetically,
let's say we were handling typical UNIX syslogs at a ginormous
volume: we might have one engine (CPU/process or even a
separate box/bus/backplane/CPU/drive array) responsible for
(sendmail | named) and another one responsible for (apache | imapd)
etc. If you put some simple counters in your analysis routines
(hits versus misses) you can "load balance" your first tree-branch
appropriately using a flat percentage. Also, remember, if you
standardize your processing, it doesn't matter where it happens;
it can happen at the edge/source or back in a central location
or any combination of the two. Simply design your analysis as
a 3-stage process that always consists of:
- weeding out and counting instances of uninteresting events
- selecting, parsing sub-fields of, and processing interesting events
- retaining events that fell through the first two steps as "unusual"
The results of these 3 stages are
- a set of event-IDs and counts
- a set of event-IDs and interesting fields and counts
- residual data in raw form
Back-haul the event-IDs, counts, and fields and graph them or stuff
them into a database, and bring the residual data to the attention of a human
being.

I suppose if you needed to you could implement a "log load
balancer" in the form of a box that had N interfaces that
collected a fat stream of log data, ran a simple program
that sorted the stream into 1/N sub-streams and forwarded
them to backend engines for more involved processing. You
could scale your logging architecture to very very large
loads this way. It works for Google and it'd work for you, too.
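
A bare-bones sketch of what that splitter could look like, just to show the
shape of it - it stripes each line on a hash of the source hostname, the
engine count is made up, and flat files stand in for the pipes or sockets
to the real backends:

#include	<stdio.h>
#include	<stdlib.h>
#include	<string.h>

#define	NENGINES	4	/* number of backend analysis engines (made up) */
#define	B_DATEOFF	16	/* the end of the syslog date/time stamp */

/* trivial stable hash over one whitespace-delimited token */
static unsigned
tokhash(const char *s)
{
	unsigned h = 5381;

	while (*s != '\0' && *s != ' ' && *s != '\n')
		h = h * 33 + (unsigned char)*s++;
	return h;
}

int
main(void)
{
	FILE	*engine[NENGINES];
	char	line[BUFSIZ * 8];
	char	name[64];
	const char *key;
	int	i;

	/* flat files standing in for pipes/sockets to the real backends */
	for (i = 0; i < NENGINES; i++) {
		snprintf(name, sizeof(name), "engine-%d.log", i);
		if ((engine[i] = fopen(name, "a")) == NULL) {
			perror(name);
			exit(1);
		}
	}

	/* stripe each line by hashing the hostname, which in a standard
	   syslog line starts right after the date stamp */
	while (fgets(line, sizeof(line), stdin) != NULL) {
		key = strlen(line) > B_DATEOFF ? line + B_DATEOFF : line;
		fputs(line, engine[tokhash(key) % NENGINES]);
	}
	return(0);
}

Striping on the program name instead (sendmail vs. named vs. apache) is the
same trick with a different key.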

The first phase of processing is to stripe across engines if
necessary, then inside each engine you stripe the processing
into functional sub-parsers that deal with a given message
format. The implementation is language-irrelevant though your
language choice will affect performance. Typically you write a
main loop that looks like:
while ( get a message ) {
	if(message is a sendmail message)
		parse sendmail message
	if(message is an imap message)
		parse imap message
	...
}

Once your system has run on a sample dataset you will be able
to determine which messages come most frequently and you can
put that test at the top of the loop. This can result in an enormous
performance boost.
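
In C, that main loop - plus the hit/miss counters mentioned earlier - might
look roughly like the sketch below. The guess that sendmail and imapd are
your two most common sources is purely hypothetical, and the sub-parsers
are stubbed out:

#include	<stdio.h>
#include	<string.h>

#define	B_DATEOFF	16	/* the end of the syslog date/time stamp */

/* hit/miss counters: after a run over sample data, reorder the tests
   below so the most frequently matched program is checked first */
static unsigned long	hits_sendmail, hits_imapd, misses;

int
main(void)
{
	char	in[BUFSIZ * 8];
	char	b_ho[256];	/* hostname */
	char	b_prog[256];	/* program name */

	while (fgets(in, sizeof(in), stdin) != NULL) {
		if (sscanf(in + B_DATEOFF, "%255s %255s", b_ho, b_prog) != 2)
			continue;

		if (strncmp(b_prog, "sendmail", 8) == 0) {
			hits_sendmail++;
			/* parse_sendmail(in); */
		} else if (strncmp(b_prog, "imapd", 5) == 0) {
			hits_imapd++;
			/* parse_imap(in); */
		} else {
			misses++;	/* candidate for a new sub-parser */
		}
	}
	fprintf(stderr, "sendmail=%lu imapd=%lu other=%lu\n",
	    hits_sendmail, hits_imapd, misses);
	return(0);
}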

Each sub-parse routine follows the same structure as the
main loop, performing a sorted series of checks to sub-parse
the fields of the message-specific formats. I.e.:

parse sendmail message( )  {
	if(message is a stat=sent message) {
		pull out recipient;
		pull out sender;
		increment message sent count;
		add message size to sender score;
		done
	}
	if(message is a stat=retry message) {
		ignore; //done
	}
	if(message is a whatever) {
		whatever;
		done
	}
	
	// if we fell through to here we have a new message structure
	// we have never seen before!!
	vector message to interesting folder;
}
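
A hand-rolled C version of that sub-parser might look like the sketch
below. The simplified line format it expects ("qid: to=addr, stat=Sent")
is invented for illustration, so the sscanf() patterns would need adjusting
to whatever your sendmail actually logs:

#include	<stdio.h>
#include	<string.h>

static unsigned long	sent_count;

/* hypothetical sendmail sub-parser in the spirit of the pseudocode above */
static void
parse_sendmail(const char *rest)
{
	char	qid[32];
	char	addr[256];

	/* most common case first: a delivered message */
	if (sscanf(rest, " %31[^:]: to=%255[^,], stat=Sent", qid, addr) == 2) {
		sent_count++;
		/* pull out the sender, add the message size to its score... */
		return;
	}

	/* retries are ignored (a counter could go here) */
	if (strstr(rest, "stat=Deferred") != NULL)
		return;

	/* fell through: a message structure we have never seen before */
	fprintf(stderr, "INTERESTING: %s", rest);
}

int
main(void)
{
	char	in[BUFSIZ * 8];

	/* a real main loop would dispatch on the program name first;
	   here every line is fed to the sendmail sub-parser */
	while (fgets(in, sizeof(in), stdin) != NULL)
		parse_sendmail(in);

	fprintf(stderr, "sent=%lu\n", sent_count);
	return(0);
}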

Variant messages are a massive pain in the butt; you need to decide
whether to deal with variants as separate cases or to make the
sub-parser smarter in order to deal with them. This is one of the
reasons I keep saying that system log analysis is highly site
specific! If your site doesn't get system logs from a DECStation
5000 running ULTRIX 3.1D then you don't need to parse that data.
Indeed, if you build your parse tree around that notion then, if you
suddenly start getting ULTRIX format log records in your data
stream, that'd be - shall we say - interesting and you want to know
about it. I remember when I was looking at some log data at one
site (Hi Abe!) we found one log message that was about 400K long
on a single line. It appeared that a fairly crucial piece of software
decided to spew a chunk of its memory in a log message, for no
apparent reason. This is what I mean by "interesting."

A philosophical digression: Why not use regexps? Other than the
fact that they look like modem line noise (kids these days probably
don't even know what that looks like, I bet... arrgh!) it's hard to look
quickly at a regexp and a string and tell if it'll match. So most of
the log processing approaches that I've seen resort to having a
loop like:
while( get a message) {
	if(regexp #1 matches) {
		sub_parse format #1;
	} else
	if(regexp #2 matches) {
		...
}

What you wind up doing is parsing the string twice or, worse,
having loads of regexps in the main loop, which makes your
parse tree too broad to execute quickly. The worst case of this
is really really ugly because you wind up executing every
regexp in your pattern-set against every line of the input,
which can suck quite badly if your pattern-set is large.

If you have to use regexps for this, use a pre-parse switch, i.e.:

while( get a message) {
	if(message begins with "sendmail") {
		if(regexp #1 matches) {
			sub_parse format #1;
		} else
		...
	}
	...
}

I think that's pretty pointless, though, since by the time you've
done the first match you can just jump straight into the sub-parse
and forget the regexp entirely.

So here's a code-fragment of evil that demonstrates this kind of
technique.
------------------------
#include	<stdio.h>
#include	<string.h>

#define	B_DATEOFF	16      	/* the end of the date/time stamp */

int
main(void)
{
	char	in[BUFSIZ * 8];

	/* hypothetical fields we want to skin out of the messages */
	char	b_ho[256];		/* hostname */
	char	b_prog[256];		/* program name */
	int	b_nstore1;		/* stash a number */
	char	b_cstore1[256];		/* stash a char */

	int	p;			/* offset */
	int	r;

	while(fgets(in,sizeof(in),stdin)!= NULL) {
		/* pluck the program name from the message */
		/* magic here is that p stores the match offset into the string */
		if((r = sscanf(in + B_DATEOFF,"%s %s%n",b_ho,b_prog,&p)) != 2)
			continue;

		/* branch on the program name */
		if(strcmp(b_prog,"/bsd:"))
			continue;

		/* -------------------------------------------------------
		from here on demux on next term(s) with individual matches

		you should match with the case that is most likely to
		occur first, then the second most likely, etc.
		*/

		/* RESIDUAL:  dkcsum: wd0 matched BIOS disk 80 */
		r = sscanf(in + B_DATEOFF + p," dkcsum: %s matched BIOS disk %d",
		    b_cstore1,&b_nstore1);
		if(r == 2) {
			printf("found BIOS disk b_cstore1=%s b_nstore1=%d\n",b_cstore1,b_nstore1);
			continue;
		}


		/* RESIDUAL:  root on wd0a */
		r = sscanf(in + B_DATEOFF + p," root on %s",b_cstore1);
		if(r == 1) {
			printf("found root on b_cstore1=%s\n",b_cstore1);
			continue;
		}
	}
	return(0);
}

You can see how this structure could be nested a bit deeper - in fact,
the more deeply you nest it on the logical tokens, left to right, the faster
it will run, because each line has to pass fewer tests. So in the example
above you could branch on ("root" | "dkcsum") first if there were more than
a few cases.
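
For instance, the two RESIDUAL matches inside the loop above could be
reworked into a fragment like this, where one cheap strncmp() on the first
token decides which sscanf() even gets attempted (illustrative only - it
drops into the while loop of the program above and uses the same variables):

		/* branch once on the first token after the program name,
		   so each line is only tried against the sscanf patterns
		   that could possibly match it */
		if (strncmp(in + B_DATEOFF + p, " dkcsum:", 8) == 0) {
			r = sscanf(in + B_DATEOFF + p,
			    " dkcsum: %255s matched BIOS disk %d",
			    b_cstore1, &b_nstore1);
			if (r == 2)
				printf("found BIOS disk b_cstore1=%s b_nstore1=%d\n",
				    b_cstore1, b_nstore1);
			continue;
		}

		if (strncmp(in + B_DATEOFF + p, " root on", 8) == 0) {
			r = sscanf(in + B_DATEOFF + p, " root on %255s", b_cstore1);
			if (r == 1)
				printf("found root on b_cstore1=%s\n", b_cstore1);
			continue;
		}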

This approach runs faster than hell on even low-end hardware and
can crunch through a lot of logs extremely rapidly. It has the disadvantage
that it's hard-coded, but the fact that it's a hand-constructed hard-coded
parse tree is what makes it so fast. So suppose you were getting
8 squadjillion firewall permit/deny log messages and wanted to do
something with them - filtering in this manner you could pretty
easily rip statistics into memory, and periodically checkpoint
counts to a BSD DB b-tree or even (if you like slow!) an SQL
database. Just don't do the updates in-line; collect statistics in-line
and checkpoint them to the SQL every minute or something
like that. If your logging system crashes the fact that you lost
a minute's worth of your log data is kind of irrelevant because you'll
lose several minutes' worth by the time it's down and back up.
With something like a firewall that's spewing out scads of
permit/deny your parse routine is very small, right? It's about
the size of that little program I posted up above. And if you
get something other than permit/deny - well, that's interesting,
so bounce it out to a human's attention.
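
Here's a sketch of that count-in-memory, checkpoint-periodically idea. The
permit/deny keywords and the one-minute interval are placeholders for
whatever your firewall emits and whatever data loss you can tolerate:

#include	<stdio.h>
#include	<string.h>
#include	<time.h>

#define	CHECKPOINT_SECS	60	/* flush the counters roughly once a minute */

int
main(void)
{
	char	in[BUFSIZ * 8];
	unsigned long	permits = 0, denies = 0;
	time_t	last = time(NULL), now;

	while (fgets(in, sizeof(in), stdin) != NULL) {
		/* made-up keywords; match whatever your firewall actually says */
		if (strstr(in, " permit ") != NULL)
			permits++;
		else if (strstr(in, " deny ") != NULL)
			denies++;
		else
			fputs(in, stderr);	/* neither: bounce to a human */

		now = time(NULL);
		if (now - last >= CHECKPOINT_SECS) {
			/* in real life: write to a BSD DB b-tree or an SQL
			   table here, never per-message */
			printf("%ld permit=%lu deny=%lu\n",
			    (long)now, permits, denies);
			fflush(stdout);
			permits = denies = 0;
			last = now;
		}
	}
	return(0);
}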

The "gotcha" with this approach is that it is data-centric and
the level of work expands linearly with the number of different
types of data that you want to deal with. So if you want to
rapidly parse 400 different kinds of log messages, you're going
to write some code. Maybe a thousand lines, by the time
you're done. That's not scary!!

What _is_ scary is if you wanted to handle _all_ the variant
forms of log messages. My summer project 2 summers ago
("what I did on my summer vacation", by Marcus Ranum)
was going out to San Diego Supercomputer Center and
plowing around in SDSC's syslogs with Abe and a couple
bottles of tequila and a ton of gummi bears. The goal of the
project was to attempt to automatically generate a parse
tree engine that could recognize all the variants of syslog
messages seen. So first I wrote something that attempted
to enumerate the total # of variants seen. After it ran for a
day (much data, but the computer I was using was fast!)
my analysis program concluded there were 50,000+ different
forms of syslog messages in 10 years of syslog data.
Arrrgh. That's a lot. What the hell impels programmers to
keep changing the structure of their log messages? There
were lots of variants of sendmail messages (dang it!) - the
biggest culprit. Anyhow, you might only want a few of them
so it's not like you'd need parse rules for all 50,000+
forms of message. The important part of the process is
the sitting down and seeing what you've got and deciding
what you want to do with it.

It comes back to what I was saying in my last few postings on this
topic:
	- look at your data
	- then decide what you want to collect from it
	- then collect it
	- count the stuff you throw away
	- vector the rest to a human's attention

mjr.

_______________________________________________
firewall-wizards mailing list
firewall-wizards at honor.icsalabs.com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards



-------- Original Message --------
Subject: Re: [fw-wiz] parsing logs ultra-fast inline
Date: Thu, 02 Feb 2006 13:00:14 -0500
From: Chuck Swiger <chuck at codefab.com>
Organization: CodeFab
To: Marcus J. Ranum <mjr at ranum.com>
CC: firewall-wizards at honor.icsalabs.com
References: <6.2.0.14.2.20060201142844.06e4c5a8 at ranum.com>

Marcus J. Ranum wrote:
> OK, based on some offline discussion with a few people, about
> doing "large amounts" of system log processing inline at high
> speeds, I thought I'd post a few code fragments and some ideas
> that have worked for me in the past.

Thank you, there is good stuff lurking here for those who spend the time to
think about your posting carefully.  One of the most important points you've
made is that you need to pay attention and handle the common cases efficiently,
but you should also look for and notice exceptional events which are not expected.

Those should stand out from the routine information even if they only happen
once.  Perhaps especially if the event only happens once, uniqueness can be
quite interesting.  I have one analysis tool to offer which can do histograms:

   http://www.pkix.net/~chuck/histogram.py

This makes a really good tool to point at a large mass of log data and see what
kinds of patterns appear, and how often.  If you've got, say, an httpd logfile in
combined format, try the following:

  histogram.py -F\" -f 2 -n 20 httpd-access.log
  histogram.py -F\" -f 6 -n 20 httpd-access.log
  histogram.py -F\" -f 2 -n 20 -P jpg httpd-access.log
  histogram.py -F\" -f 2 -n 20 -X HTTP/1.0 httpd-access.log
  histogram.py -F\" -f 2 -n 20 -X HTTP/1.1 httpd-access.log

Well, I'm sure you get the idea.  Add the -r flag, and you can see the least
common events instead.
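
If you're curious what the guts of a tool like that look like, a stripped-down
C equivalent of the core idea - split each line on a delimiter, count the
values of one field, print the most common - fits on a page.  This is only a
sketch, not histogram.py itself: it has none of the options above and uses a
slow linear table:

#include	<stdio.h>
#include	<stdlib.h>
#include	<string.h>

#define	MAXKEYS	65536
#define	TOPN	20

struct entry {
	char		*key;
	unsigned long	count;
};
static struct entry	tab[MAXKEYS];
static int		ntab;

/* linear lookup is fine for a sketch; use a real hash for serious volume */
static struct entry *
lookup(const char *key)
{
	int	i;

	for (i = 0; i < ntab; i++)
		if (strcmp(tab[i].key, key) == 0)
			return &tab[i];
	if (ntab >= MAXKEYS)
		return NULL;
	tab[ntab].key = strdup(key);
	tab[ntab].count = 0;
	return &tab[ntab++];
}

static int
bycount(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	return (eb->count > ea->count) - (eb->count < ea->count);
}

int
main(int argc, char **argv)
{
	char	line[BUFSIZ * 8], *p, *field;
	struct entry	*e;
	int	i, want;

	if (argc != 3) {
		fprintf(stderr, "usage: histogram <delim> <field#>\n");
		return(1);
	}
	want = atoi(argv[2]);

	while (fgets(line, sizeof(line), stdin) != NULL) {
		line[strcspn(line, "\n")] = '\0';
		field = NULL;
		/* strtok() collapses runs of delimiters; close enough here */
		for (i = 1, p = strtok(line, argv[1]); p != NULL;
		    i++, p = strtok(NULL, argv[1]))
			if (i == want) {
				field = p;
				break;
			}
		if (field != NULL && (e = lookup(field)) != NULL)
			e->count++;
	}

	qsort(tab, ntab, sizeof(tab[0]), bycount);
	for (i = 0; i < ntab && i < TOPN; i++)
		printf("%8lu %s\n", tab[i].count, tab[i].key);
	return(0);
}

Pointed at a combined-format httpd log,

  ./histogram '"' 2 < httpd-access.log

gives roughly the same "top 20 requests" view as the first example above.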

> [... ]  If you put some simple counters in your analysis routines
> (hits versus misses) you can "load balance" your first tree-branch
> appropriately using a flat percentage. Also, remember, if you
> standardize your processing, it doesn't matter where it happens;
> it can happen at the edge/source or back in a central location
> or any combination of the two. Simply design your analysis as
> a 3-stage process that always consists of:
> - weeding out and counting instances of uninteresting events
> - selecting, parsing sub-fields of, and processing interesting events
> - retaining events that fell through the first two steps as "unusual"

This is a fine explanation of event parsing.

Note that these stages are not necessarily best run in strict sequence: i.e.,
while you want to weed out common uninteresting events quickly, order your
tests by frequency, so that matches for common interesting events come before
the tests that discard uncommon uninteresting events.

[ ... ]	
> 	// if we fell through to here we have a new message structure
> 	// we have never seen before!!
> 	vector message to interesting folder;
> }

...and then you go back a week later and add matching cases for the messages
you've never seen before.  Lather, rinse, repeat, until you're done for the
moment, or until a new version of the software comes out, or someone decides to
make syslogd format the date field differently.

> Variant messages are a massive pain in the butt; you need to decide
> whether to deal with variants as separate cases or to make the
> sub-parser smarter in order to deal with them. This is one of the
> reasons I keep saying that system log analysis is highly site
> specific!

I think I can tell which software has been written by people who have had to
update log analysis tools, and which has not: the former never change logging
output, especially of important messages, between minor program versions.  Other
software seems to fiddle with the output details from line to line and from
version to version.

Add fields at the end or add new log lines instead.  Please!

[ ... ]
> A philosophical digression: Why not use regexps? Other than the
> fact that they look like modem line noise (kids these days probably
> don't even know what that looks like, I bet... arrgh!) it's hard to look
> quickly at a regexp and a string and tell if it'll match. So most of
> the log processing approaches that I've seen resort to having a
> loop like:
> while( get a message) {
> 	if(regexp #1 matches) {
> 		sub_parse format #1;
> 	} else
> 	if(regexp #2 matches) {
> 		...
> }
>
> What you wind up doing is parsing the string twice or, worse,
> having loads of regexps in the main loop, which makes your
> parse tree too broad to execute quickly. The worst case of this
> is really really ugly because you wind up executing every
> regexp in your pattern-set against every line of the input,
> which can suck quite badly if your pattern-set is large.

Absolutely, you should try to use static string matching instead of regex'es for
your most common test cases if at all possible.  You can make life easy for your
own log analysis by having your programs log messages of the form:

   EVENTTYPE: ...message...

...where "...message..." is either a human-readable string, or perhaps a set of
key-value pairs.
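
A consumer of messages in that form can then dispatch on the leading event
type with a single cheap comparison before parsing any fields. For example
(the event names and fields below are invented):

#include	<stdio.h>
#include	<string.h>

int
main(void)
{
	char	in[BUFSIZ * 8];
	unsigned long	logins = 0, fatals = 0;

	while (fgets(in, sizeof(in), stdin) != NULL) {
		/* hypothetical application log lines such as:
		   "LOGIN: user=alice addr=10.0.0.1"
		   "FATAL: out of file descriptors"          */
		if (strncmp(in, "LOGIN: ", 7) == 0) {
			char	user[256];

			if (sscanf(in + 7, "user=%255s", user) == 1)
				logins++;	/* count it, score it, whatever */
			continue;
		}
		if (strncmp(in, "FATAL: ", 7) == 0) {
			fatals++;
			fputs(in, stderr);	/* out-of-band, for a human */
			continue;
		}
		/* anything else is an event type we haven't classified yet */
		fputs(in, stderr);
	}
	fprintf(stderr, "logins=%lu fatals=%lu\n", logins, fatals);
	return(0);
}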

Most of your logging messages should correspond to the stuff the program is
supposed to be doing.  An individual log line should provide enough context to
be understandable on its own, or at least correlatable with related log messages,
even if the logfile is truncated or rotated, or if the log data is processed in a
non-sequential fashion.  Sendmail queue IDs or PIDs in general logfiles are
common examples of data which can be used to make such correlations.
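
As a sketch of that sort of correlation, here's a toy that joins a "from="
line with a later "stat=" line carrying the same queue ID. The line formats
it reads are deliberately simplified stand-ins, not real sendmail output:

#include	<stdio.h>
#include	<string.h>

#define	NSLOTS	4096

struct slot {
	char	qid[32];
	char	sender[256];
};
static struct slot	table[NSLOTS];

static unsigned
slothash(const char *s)
{
	unsigned h = 5381;

	while (*s != '\0')
		h = h * 33 + (unsigned char)*s++;
	return h % NSLOTS;
}

int
main(void)
{
	char	in[BUFSIZ * 8], qid[32], sender[256], stat[64];
	struct slot	*sl;

	while (fgets(in, sizeof(in), stdin) != NULL) {
		/* remember the sender keyed by queue ID */
		if (sscanf(in, " %31[^:]: from=%255s", qid, sender) == 2) {
			sl = &table[slothash(qid)];	/* collisions just overwrite */
			strcpy(sl->qid, qid);		/* sizes bounded by the sscanf widths */
			strcpy(sl->sender, sender);
			continue;
		}
		/* join the final status back to the remembered sender */
		if (sscanf(in, " %31[^:]: stat=%63s", qid, stat) == 2) {
			sl = &table[slothash(qid)];
			if (strcmp(sl->qid, qid) == 0)
				printf("%s %s -> %s\n", qid, sl->sender, stat);
		}
	}
	return(0);
}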

The closer your events are to the significant real things or changes in state
that are happening in your program, the more useful your logging will be
to the people using the software.  (I only repeat the obvious because there
is a vast amount of software which misses this point.)

If you need to add debugging code to figure out what's going on in some weird or
exceptional case that you don't understand, use an event type of "DEBUG",
"INFO", "FATAL" or whatever is appropriate, followed by a human-oriented message.
Once you understand the situation better, classify it, give the message a proper
event type, and add it to your parser.

If you use this consistently, you can easily process the messages your code
generates into discrete events, and you can easily separate out "normal" logging
output from the stuff that a human should review out-of-band.

[ ... ]
> What _is_ scary is if you wanted to handle _all_ the variant
> forms of log messages. My summer project 2 summers ago
> ("what I did on my summer vacation", by Marcus Ranum)
> was going out to San Diego Supercomputer Center and
> plowing around in SDSC's syslogs with Abe and a couple
> bottles of tequila and a ton of gummi bears. The goal of the
> project was to attempt to automatically generate a parse
> tree engine that could recognize all the variants of syslog
> messages seen. So first I wrote something that attempted
> to enumerate the total # of variants seen. After it ran for a
> day (much data, but the computer I was using was fast!)
> my analysis program concluded there were 50,000+ different
> forms of syslog messages in 10 years of syslog data.
> Arrrgh. That's a lot. What the hell impels programmers to
> keep changing the structure of their log messages? There
> were lots of variants of sendmail messages (dang it!) - the
> biggest culprit. Anyhow, you might only want a few of them
> so it's not like you'd need parse rules for all 50,000+
> forms of message. The important part of the process is
> the sitting down and seeing what you've got and deciding
> what you want to do with it.

Agreed, agreed.  But it's never too late to try to become more consistent.
(Starting with not breaking things which work now.)

-- 
-Chuck
_______________________________________________
firewall-wizards mailing list
firewall-wizards at honor.icsalabs.com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards

-- 
-Chuck

PS: And you should review the usage of the code with regard to security.  You
might even induce Marcus to buy you a Guinness Stout.  :-)


