sed/awk, instead of Perl

Sat Aug 23 22:17:33 UTC 2008

At 10:01 AM +0100 8/23/08, Matthew Seaman wrote:
>Walt Pawley wrote:
>>
>> At the risk of beating this to death, I just happened to
>> stumble on a real world example of why one might want to use
>> Perl for sed-ly stuff.
>>  ... snip ...
>> wump$ ls -l Desktop/klog
>> -rw-r--r--  1 wump  1001  52753322 22 Aug 16:37 Desktop/klog
>> wump$ time sed "s/ .*//" Desktop/klog > kadr1
>>
>> real    0m10.800s
>> user    0m10.580s
>> sys     0m0.250s
>> wump$ time perl -pe 's/ .*//' Desktop/klog > kadr2
>>
>> real    0m0.975s
>> user    0m0.700s
>> sys     0m0.270s
>> wump$ cmp kadr1 kadr2
>> wump$
>>
>> Why the disparity in execution speed? ...
>
>Careful now.  Have you accounted for the effect of the klog file
>being cached in VM rather than having to be read afresh from disk?
>It makes a very big difference in how fast it is processed.

No, I hadn't done any such accounting. So I wrote a little script,
whose contents you can surmise from the following output:

wump$ sh -v spdtst
time perl -pe 's/ .*//' Desktop/klog > /dev/null

real    0m0.961s
user    0m0.740s
sys     0m0.230s
time sed "s/ .*//" Desktop/klog > /dev/null

real    0m10.506s
user    0m10.270s
sys     0m0.250s
time awk '{print $1}' Desktop/klog > /dev/null

real    0m2.333s
user    0m2.140s
sys     0m0.180s
time sed "s/ .*//" Desktop/klog > /dev/null

real    0m10.489s
user    0m10.250s
sys     0m0.230s
time perl -pe 's/ .*//' Desktop/klog > /dev/null

real    0m0.799s
user    0m0.580s
sys     0m0.220s

>In order to get meaningful data for this sort of test you should
>do a dummy run or two of each command in fairly quick succession,
>and then repeat your test runs a number of times and look at the
>average and standard deviation of the execution times. ...

Yeah, Hoyle would like that. But for me, I think the results
are clear enough without all the messing with statistical
computations. 10 to 1 or better is good enough for me to think
there's some major difference. That said, it would appear that
caching can make a difference - which is why I put the Perl
invocation first ... so it would be running without the benefit
of caching. But I don't believe I was entirely successful in
that effort. The very first time I ran this, which was also the
very first time in a whole day that the klog file had been
accessed, the first Perl invocation took about 2 seconds of
real time and still only 0.7 seconds of user time. I don't
believe caching explains the execution speed disparity.
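
For anyone who does want the statistics, the procedure suggested
above (a dummy run, then repeated runs with mean and standard
deviation) could be scripted roughly as follows. This is only a
sketch: the function names real_secs, mean_sd and bench are made up
for illustration, and it assumes a shell in which `time -p` works
(the bash keyword or the POSIX time utility).

```shell
# Print the elapsed ("real") seconds for one run of the given command.
# Assumption: `time -p` is available and writes "real N" to stderr.
real_secs() {
    { time -p "$@" > /dev/null; } 2>&1 | awk '$1 == "real" { print $2 }'
}

# Read one number per line on stdin; print count, mean and sample
# standard deviation.
mean_sd() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n < 2) { print "need at least 2 samples"; exit 1 }
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)
             if (var < 0) var = 0     # guard against rounding error
             printf "n=%d mean=%.3f sd=%.3f\n", n, mean, sqrt(var)
         }'
}

# bench RUNS COMMAND [ARGS...]
# One untimed warm-up run so the file is cached, then RUNS timed runs.
bench() {
    runs=$1; shift
    "$@" > /dev/null 2>&1
    i=0
    while [ "$i" -lt "$runs" ]; do
        real_secs "$@"
        i=$((i + 1))
    done | mean_sd
}
```

Something like `bench 5 sed 's/ .*//' Desktop/klog` would then do one
untimed run followed by five timed ones and print the count, mean and
standard deviation of the real times.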

It was mentioned that this job is made for awk, so I tried
that as well. It is evidently not as quick as Perl at doing the
job either, though at a bit over 2 seconds it is far quicker than
sed. The time shown above is quite consistent with a number of
other runs I've tried with awk.
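
One caveat on treating the awk command as a drop-in replacement: awk
splits on runs of blanks, while the sed (and Perl) expression cuts at
the first space, so the two agree only when lines do not begin with a
space. A small check, using made-up sample lines (the actual klog
contents are not shown in this thread) and hypothetical helper names
by_sed and by_awk:

```shell
# Hypothetical sample data, not taken from the real klog file.
sample='10.0.0.1 GET /index.html
10.0.0.2 POST /form'

# Keep the first space-delimited field of each line, two ways.
by_sed() { printf '%s\n' "$1" | sed 's/ .*//'; }
by_awk() { printf '%s\n' "$1" | awk '{ print $1 }'; }

# Ordinary lines: both give the same output.
[ "$(by_sed "$sample")" = "$(by_awk "$sample")" ] && echo "same on sample"

# A leading space: sed deletes the whole line's text, while awk still
# prints the first non-blank field.
odd=' 10.0.0.3 GET /x'
[ "$(by_sed "$odd")" != "$(by_awk "$odd")" ] && echo "differ on leading space"
```

For a file like klog, where cmp already showed the sed and Perl
outputs to be identical, running cmp against the awk output as well
would settle whether this matters.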

I suspect a real Perl internals maven could explain this. I
have some ideas but they're conjecture. Perhaps some effort to
improve execution efficiency in sed and awk would not be wasted?
-- 

Walter M. Pawley <walt at wump.org>
Wump Research & Company
676 River Bend Road, Roseburg, OR 97470
         541-672-8975