sed/awk, instead of Perl

Sun Aug 24 15:27:00 UTC 2008

On Sat, 2008-08-23 at 15:16 -0700, Walt Pawley wrote:
> At 10:01 AM +0100 8/23/08, Matthew Seaman wrote:
> >Walt Pawley wrote:
> >>
> >> At the risk of beating this to death, I just happened to
> >> stumble on a real world example of why one might want to use
> >> Perl for sed-ly stuff.
> >>  ... snip ...
> >> wump$ ls -l Desktop/klog
> >> -rw-r--r--  1 wump  1001  52753322 22 Aug 16:37 Desktop/klog
> >> wump$ time sed "s/ .*//" Desktop/klog > kadr1
> >>
> >> real    0m10.800s
> >> user    0m10.580s
> >> sys     0m0.250s
> >> wump$ time perl -pe 's/ .*//' Desktop/klog > kadr2
> >>
> >> real    0m0.975s
> >> user    0m0.700s
> >> sys     0m0.270s
> >> wump$ cmp kadr1 kadr2
> >> wump$
> >>
> >> Why disparity in execution speed? ...
> >
> >Careful now.  Have you accounted for the effect of the klog file
> >being cached in VM rather than having to be read afresh from disk?
> >It makes a very big difference in how fast it is processed.
> 
> No, I hadn't done any such accounting. So I wrote a little script,
> which you can surmise from the following output:
> 
> wump$ sh -v spdtst
> time perl -pe 's/ .*//' Desktop/klog > /dev/null
> 
> real    0m0.961s
> user    0m0.740s
> sys     0m0.230s
> time sed "s/ .*//" Desktop/klog > /dev/null
> 
> real    0m10.506s
> user    0m10.270s
> sys     0m0.250s
> time awk '{print $1}' Desktop/klog > /dev/null
> 
> real    0m2.333s
> user    0m2.140s
> sys     0m0.180s
> time sed "s/ .*//" Desktop/klog > /dev/null
> 
> real    0m10.489s
> user    0m10.250s
> sys     0m0.230s
> time perl -pe 's/ .*//' Desktop/klog > /dev/null
> 
> real    0m0.799s
> user    0m0.580s
> sys     0m0.220s
> 
I see similar results on all four systems I tried here - an order of
magnitude difference between perl (fastest) and sed, with awk slightly
slower than perl. All of them run perl 5.8.8. I did a handful of manual
runs and took the most consistent-looking results. The source file was a
62MB apache log with 232k records.

Interestingly, an Ubuntu system exhibited a similar difference between
perl and sed, but its awk was slightly faster than perl.
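
For anyone wanting something more repeatable than a handful of manual
runs, here is a rough sketch (not what I actually ran). The names
"stats" and "bench" are made up for illustration, and the timing relies
on bash's time keyword and TIMEFORMAT variable, so it assumes bash rather
than plain sh. It does one untimed warm-up run so the file is likely to
be cached, then times a number of runs and reports the mean and sample
standard deviation.

```shell
# stats: read one number per line on stdin; print count, mean and
# sample standard deviation. (Hypothetical helper, plain awk.)
stats() {
    awk '{ s += $1; q += $1 * $1; n++ }
         END {
             if (n == 0) { print "n=0"; exit }
             m = s / n
             v = (n > 1) ? (q - n * m * m) / (n - 1) : 0
             if (v < 0) v = 0        # guard against rounding error
             printf "n=%d mean=%.3f sd=%.3f\n", n, m, sqrt(v)
         }'
}

# bench RUNS COMMAND [ARGS...]: one untimed warm-up run, then RUNS
# timed runs; prints one elapsed wall-clock time (seconds) per line.
# Assumes bash: "time" as a keyword and TIMEFORMAT are bash features.
bench() {
    runs=$1; shift
    "$@" > /dev/null 2>&1           # warm-up, not timed
    i=0
    while [ "$i" -lt "$runs" ]; do
        ( TIMEFORMAT='%R'; { time "$@" > /dev/null 2>&1; } 2>&1 )
        i=$((i + 1))
    done
}
```

Usage would be along the lines of:

    bench 10 sed 's/ .*//' Desktop/klog | stats
    bench 10 perl -pe 's/ .*//' Desktop/klog | stats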

> >In order to get meaningful data for this sort of test you should
> >do a dummy run or two of each command in fairly quick succession,
> >and then repeat your test runs a number of times and look at the
> >average and standard deviation of the execution times. ...
> 
> Yeah, Hoyle would like that. But for me, I think the results
> are clear enough without all the messing with statistical
> computations. 10 to 1 or better is good enough for me to think
> there's some major difference. That said, it would appear that
> caching can make a difference - which is why I put the Perl
> invocation first ... so it would be running without the benefit
> of caching. But I don't believe I was entirely successful in
> that effort. The very first time I ran this, which was also the
> very first time in a whole day that the klog file had been
> accessed, the first Perl invocation took about 2 seconds of
> real time and still only 0.7 seconds of user time. I don't
> believe caching explains the execution speed disparity.
> 
> It was mentioned that this function is made for awk, so I tried
> that as well. It is also evidently not as quick as Perl at
> doing the job. The time shown above is quite consistent with a
> number of other runs I've tried with awk.
> 
Keep in mind that awk, while producing a comparable result, likely uses
quite a different parsing strategy. While the comparison is interesting
for this particular test-case, different circumstances could produce
very different results.
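
To make the parsing point concrete: awk splits each record into fields
on runs of blanks and ignores leading whitespace, whereas the sed and
perl substitutions cut at the first literal space. A small sketch (the
function names are mine, purely illustrative) shows the outputs agreeing
for a plain log-style line but diverging once a line starts with a space
or uses a tab as separator:

```shell
# Three ways of taking "the first field" from stdin.
first_sed()  { sed 's/ .*//'; }         # delete from first space onward
first_perl() { perl -pe 's/ .*//'; }    # same substitution in perl
first_awk()  { awk '{ print $1 }'; }    # first whitespace-delimited field

# Plain line: both give the same answer.
printf '10.0.0.1 - - GET /\n' | first_sed   # prints: 10.0.0.1
printf '10.0.0.1 - - GET /\n' | first_awk   # prints: 10.0.0.1

# Leading space: sed deletes the whole line, awk skips the blank.
printf ' indented line\n' | first_sed       # prints an empty line
printf ' indented line\n' | first_awk       # prints: indented

# Tab separator: awk splits on it, sed only looks for a space.
printf 'a\tb c\n' | first_awk               # prints: a
```

So the three commands are interchangeable only for input like the log
above, where every record starts with a non-blank field followed by a
space.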

> I suspect a real Perl internals maven could explain this. I
> have some ideas but they're conjecture. Perhaps some effort to
> improve execution efficiency in sed and awk would not be wasted?

My conjecture is this: the regex engine that perl uses most likely has
good optimisation for the "ends with .*" part of the pattern (vs sed).
While the result is certainly interesting and perhaps surprising[1], it
comes from a single, simple pattern, which is far too little to draw
much in the way of conclusions from - except perhaps that extracting the
first field from a data source with many records can possibly be done
more rapidly with perl or awk than with sed.
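
One way to probe that conjecture would be to time a tool that does no
pattern matching at all. As an untested suggestion (I have not timed it
on the log above, so no claim about speed), cut with a space delimiter
gives the same output as the sed expression, including for lines that
contain no space or that start with one. The function names below are
only for illustration:

```shell
# Regex-free candidate: first space-delimited field via cut.
# Lines without any space are passed through unchanged by cut,
# which matches what the sed substitution does with them.
first_cut() { cut -d ' ' -f 1; }
first_sed() { sed 's/ .*//'; }

# same_first_field FILE: succeed if both commands agree on FILE.
same_first_field() {
    [ "$(first_cut < "$1")" = "$(first_sed < "$1")" ]
}
```

Checking that the outputs agree on the real file, as was done with cmp
earlier in the thread, is worth doing before comparing any timings.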

Nevertheless, I've always dismissed perl as being "heavy and slow" on
the strength of anecdotal "evidence", and the results you found are a
pertinent reminder that assumptions like that are never safe to rely on.

Wayne

[1] particularly in light of studies such as this one:
http://swtch.com/~rsc/regexp/regexp1.html