awk programming question

Thu Jan 23 16:30:37 UTC 2014

On Thu, 23 Jan 2014, Paul Schmehl wrote:

> I'm kind of stubborn.  There's lots of different ways to skin a cat, but I 
> like to force myself to use the built-in utilities to do things so I can 
> learn more about them and better understand how they work.
>
> So, I'm trying to parse a file of snort rules, extract two string values and 
> insert a double pipe between them to create a sig-msg.map file
>
> Here's a typical rule:
>
> alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound 
> TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown; 
> sid:2008120; rev:1;)
>
> Here's a typical sig-msg.map file entry:
>
> 9624 || RPC UNIX authentication machinename string overflow attempt UDP
>
> So, from the above rule I would want to create a single line like this:
>
> 2008120 || E3[rb] ET POLICY Outbound TFTP Read Request
>
> There are several ways I can extract one or the other value, and I've figured 
> out how to extract the sid and add the double pipe, but for the life of me I 
> can't figure out how to extract and print out sid || msg.
>
> This prints out the sid and the double pipe:
>
> echo `awk 'match($0,/sid:[0-9]*;/) {print substr($0,RSTART,RLENGTH)" || "}' 
> /tmp/mtc.rules | tr -d ";sid"
>
> It seems I could put the results into a variable rather than printing them 
> out, and then print var1 || var2, but my google foo hasn't found a useful 
> example.
>
> Surely there's a way to do this using awk?  I can use tr for cleanup.  I just 
> need to get close to the right result.
>
> How about it awk experts?  What's the cleanest way to get this done?

Not an awk expert, but you can do arithmetic on the RSTART and RLENGTH 
variables that match() sets to get just the numeric part of the sid:

echo "sid:2008120;" \
   | awk '{ match($0, /sid:[0-9]*;/)
        sid = substr($0, RSTART+4, RLENGTH-5) ; print sid }'
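
To see where the +4 and -5 come from, it may help to print RSTART and 
RLENGTH next to the trimmed substring.  This is only a sketch; the 
function name show_offsets is made up for illustration:

```shell
# Hypothetical helper: show what match() reports for the sid option,
# followed by the substring with "sid:" and ";" trimmed off.
show_offsets() {
    printf '%s\n' "$1" | awk '{
        if (match($0, /sid:[0-9]*;/)) {
            # RSTART  = position of the "s" in "sid:"
            # RLENGTH = length of the whole "sid:...;" match
            # +4 skips "sid:", -5 drops "sid:" and the trailing ";"
            print RSTART, RLENGTH, substr($0, RSTART + 4, RLENGTH - 5)
        }
    }'
}

show_offsets 'depth:2; sid:2008120; rev:1;'
```

The three fields printed are the start position, the match length, and 
the bare sid.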

Closer to what you want:

echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' \
   | awk '{ match($0, /sid:[0-9]*;/)
        sid = substr($0, RSTART+4, RLENGTH-5)
        match($0, /msg:.*;/)
        msg = substr($0, RSTART+4, RLENGTH-5)
        print sid, "||", msg }'

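Note the error that the too-greedy regex creates: ".*" runs to the last 
semicolon on the line, so msg comes out with its quotes and everything 
through the sid still attached.  Note also that standard awk cannot 
capture regex sub-expressions (gawk's three-argument match() can, but 
that is an extension).  awk does not have a way to reduce the 
greediness, at least not that I'm aware of.  You may be able to work 
around that, for example if the message is always the same length, or 
by using a pattern that cannot run past the closing quote, such as 
[^"]*.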
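
One way around the greediness in plain awk might be a negated character 
class, so the match cannot run past the closing quote.  A rough sketch, 
assuming each rule sits on one physical line and the msg text contains 
no embedded double quotes; the function name rules_to_map is made up:

```shell
# Hypothetical helper: read Snort rules on stdin, print "sid || msg" lines.
rules_to_map() {
    awk '{
        sid = ""; msg = ""
        if (match($0, /sid:[0-9]+;/))
            sid = substr($0, RSTART + 4, RLENGTH - 5)   # trim "sid:" and ";"
        # [^"]* stops at the first closing quote, so greediness is harmless.
        if (match($0, /msg:"[^"]*"/))
            msg = substr($0, RSTART + 5, RLENGTH - 6)   # trim msg:" and "
        if (sid != "" && msg != "")
            print sid " || " msg
    }'
}

printf '%s\n' 'alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown; sid:2008120; rev:1;)' \
    | rules_to_map
```

Against a real file that would be something like 
rules_to_map < /tmp/mtc.rules > sig-msg.map, with no tr cleanup needed.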

sed, despite its many weaknesses, can capture subexpressions:

echo "sid:2008120;" | sed -e 's/^.*sid:\([0-9]*\);.*$/\1/'

I don't think sed has a non-greedy modifier either.  Basically, sed and 
awk are frozen in the 1970s, back before it became popular to do 
useful things.  That was one reason Perl came along, and later, Python 
and Ruby.
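
For what it's worth, sed can probably get the whole line in one 
substitution if a negated character class stands in for the missing 
non-greedy modifier.  A sketch only, assuming msg comes before sid on 
the same line (as in the sample rule) and the message has no embedded 
double quotes; rules_to_map_sed is a made-up name:

```shell
# Hypothetical helper: read Snort rules on stdin, print "sid || msg" lines.
rules_to_map_sed() {
    # \1 = text between the quotes after msg:, \2 = digits after sid:
    # -n plus the p flag prints only lines where the substitution happened.
    sed -n 's/^.*msg:"\([^"]*\)".*sid:\([0-9][0-9]*\);.*$/\2 || \1/p'
}

printf '%s\n' 'alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown; sid:2008120; rev:1;)' \
    | rules_to_map_sed
```

Lines without both a msg and a sid are dropped rather than passed 
through unchanged.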

echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' \
   | perl -ne 'if ( /msg:"(.*?)";.*sid:(\d+);/ ) { print "$2 || $1\n" }'

The regex uses the ? to reduce greediness, Perl's "\d" instead of the 
longer [0-9], and the pattern capturing parens, which fill in $1 and $2. 
The "if" statement is not required, but it's bad practice to print the 
contents of pattern capture variables unless the match actually 
succeeded.