Analyzing Log files of very large size

Korolev Sergey serejk at febras.net
Mon Jul 12 06:44:34 UTC 2021


  

Yes, Perl is perhaps the best solution for this job, but only if
you don't have to start from zero. It can take some time to get into the
basics.

About sizes: I once processed a 60 GB file with shell utilities,
through a long chain of pipes, in reasonable time (several hours) on a
10K rpm HDD, not an SSD. Of course, one should be aware of what one is
doing and optimize the data-processing pipeline.
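
As a rough sketch of what such a pipeline can look like (the log path,
the match pattern, and the field numbers below are hypothetical - they
depend entirely on your log format):

  # count matching lines per day in a syslog-style maillog (assumed layout)
  grep ' reject: ' /var/log/maillog.big \
    | awk '{ print $1, $2 }' \
    | sort | uniq -c | sort -rn > rejects_per_day.txt

On large inputs, running sort with LC_ALL=C and pointing it at a fast
temporary directory (-T) makes a noticeable difference.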

About the indexing approach:
once again, I don't know exactly what needs to be extracted from the file,
but if it is, for example, a table containing some aggregated results,
then indexing may be overkill.
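
For an aggregated table, a single awk pass is often enough; for instance
(the field number here is just a guess at the layout):

  # one pass: count lines per value of the fourth field, print the top 20
  awk '{ count[$4]++ } END { for (k in count) print count[k], k }' maillog.big \
    | sort -rn | head -20

No index is built or needed - the file is read once, sequentially.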

To the topic author: maybe if you show a
piece of the file and explain the desired result, the advice can be more
precise.

On Mon, 12 Jul 2021 02:20:58 -0400, Paul Procacci wrote: 

> On Mon, Jul 12, 2021 at 1:44 AM Korolev Sergey wrote:
>
>> I think that the proper tools usually depend highly on the desired result,
>> so my reasoning is quite general. People here advise using Perl and also
>> splitting one large file into manageable pieces - all that is very good, I
>> vote for that. But I don't know Perl at all, so I usually get along with
>> standard shell utilities: grep, tr, awk, sed, etc. I used to parse big
>> maillogs with them successfully.
>
> Most standard shell utilities can certainly get the job done if the file
> sizes are of a size that's manageable. That is most likely the vast majority
> of cases. No question about that.
>
> There's certainly a point, however, when the sizes become so unmanageable
> that their completion will be on your 150th birthday. ;) An exaggeration,
> undoubtedly.
>
> There are obviously options for this, but you'll seldom find the answer in
> any standard install of any userland. Sometimes you can get away with xargs,
> depending on what the data is that you're working with, but that's all that
> comes to mind.
>
> The "promotion" from there, in my mind, is going the perl route (or any
> other interpreted language) capable of threading ... and from there as
> necessary ... C (or another compiled language).
>
> Someone made mention of Elasticsearch, and that's a good option too. All the
> work of indexing the data has already been done for you. You just have to
> not mind paying for it. ;)
>
> Hell, I've used postgresql with its fulltext search for similar things as
> well, and I'd argue that if it's already in your stack, you should at the
> very least try that first. You'd be surprised at how darn well it does.
>
> Goodnight!
>
> ~Paul
 


