Batch file question - average size of file in directory

Kurt Buff kurt.buff at gmail.com
Wed Jan 3 10:55:22 PST 2007


On 1/2/07, Giorgos Keramidas <keramida at ceid.upatras.gr> wrote:
> On 2007-01-02 10:20, Kurt Buff <kurt.buff at gmail.com> wrote:
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1).
>
> The gzip(1) utility, when run with the -cd options, will uncompress the
> compressed files and send the uncompressed data to standard output,
> without actually affecting the on-disk copy of the compressed data.
>
> It is easy then to pipe the uncompressed data to wc(1) to count the
> 'bytes' of the uncompressed data:
>
>         for fname in *.Z *.z *.gz; do
>                 if test -f "${fname}"; then
>                         gzip -cd "${fname}" | wc -c
>                 fi
>         done
>
> This will print the byte-size of the uncompressed output of gzip, for
> all the files which are currently compressed.  Something like the
> following could be its output:
>
>         220381
>         3280920

I put together this one-liner after perusing 'man zcat':

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt

It puts out multiple instances of stuff like this (one header and
totals block for each zcat run that xargs spawns):

compressed  uncompr. ratio uncompressed_name
     1508      3470  57.0% stuff-7f+BIOFX1-qX
     1660      3576  54.0% stuff-bsFK-yGcWyCm
     9113     17065  46.7% stuff-os1MKlKGu8ky
...
...
...
 10214796  17845081  42.7% (totals)
compressed  uncompr. ratio uncompressed_name
     7790     14732  47.2% stuff-Z3UO7-uvMANd
     1806      3705  51.7% stuff-9ADk-DSBFQGQ
     9020     16638  45.8% stuff-Caqfgao-Tc5F
     7508     14361  47.8% stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
grep -v compress | grep -v totals

will do to suppress the extraneous header/footer info.
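
From there, something along these lines (an untested sketch; it assumes
the uncompressed size is always the second column of the zcat -l
listing) should yield the average:

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
grep -v compress | grep -v totals |
awk '{ total += $2 }        # $2 = uncompressed size column
     END { if (NR > 0) printf "files=%d avg=%d bytes\n", NR, total / NR }'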


> This can be piped into awk(1) for further processing, with something
> like this:
>
>         for fname in *.Z *.gz; do
>                 if test -f "$fname"; then
>                         gzip -cd "$fname" | wc -c
>                 fi
>         done | \
>         awk 'BEGIN {
>             min = -1; max = 0; total = 0;
>         }
>         {
>             total += $1;
>             if ($1 > max) {
>                 max = $1;
>             }
>             if (min == -1 || $1 < min) {
>                 min = $1;
>             }
>         }
>         END {
>             if (NR > 0) {
>                 printf "min/avg/max file size = %d/%d/%d\n",
>                     min, total / NR, max;
>             }
>         }'
>
> With the same files as above, the output of this would be:
>
>         min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count, following each line.  Modified lines are marked with
> a pipe character (`|') in their leftmost column below.  The '|'
> characters are *not* part of the script itself.
>
>         for fname in *.Z *.gz; do
>                 if test -f "$fname"; then
>                         gzip -cd "$fname" | wc -c
>                 fi
>         done | \
>         awk 'BEGIN {
>             min = -1; max = 0; total = 0;
> |           printf "%10s %10s %10s %10s\n",
> |               "SIZE", "MIN", "AVERAGE", "MAX";
>         }
>         {
>             total += $1;
>             if ($1 > max) {
>                 max = $1;
>             }
>             if (min == -1 || $1 < min) {
>                 min = $1;
>             }
> |           printf "%10d %10d %10d %10d\n",
> |               $1, min, total/NR, max;
>         }
>         END {
>             if (NR > 0) {
> |               printf "%10s %10d %10d %10d\n",
> |                   "TOTAL", min, total / NR, max;
>             }
>         }'
>
> When run with the same set of two compressed files this will print:
>
>       SIZE        MIN    AVERAGE        MAX
>     220381     220381     220381     220381
>    3280920     220381    1750650    3280920
>      TOTAL     220381    1750650    3280920
>
> Please note though that with a sufficiently large set of files, awk(1)
> may fail to count the total number of bytes correctly, since it stores
> numbers as double-precision floats (exact for integers only up to
> 2^53).  If this is the case, it should be easy to write an equivalent
> Perl or Python script, to take advantage of their big-number support.
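
If the totals here ever did overflow awk, I suppose I could swap
perl(1) into the pipeline for the counting, something like this
(untested, and Math::BigInt may well be overkill for my file sizes):

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
grep -v compress | grep -v totals |
perl -MMath::BigInt -ane '
    BEGIN { $total = Math::BigInt->bzero }
    $total->badd($F[1]);        # field 2 = uncompressed size
    END { printf "files=%d avg=%s\n", $., $total->bdiv($.)->bstr if $. }'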

I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.
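
For example, a first stab at adapting it (untested; this assumes the
uncompressed size is always column 2 of the zcat -l listing, and that
none of the file names happen to contain "compress" or "totals"):

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
awk 'BEGIN { min = -1; max = 0; total = 0; n = 0 }
    /compress|totals/ { next }  # skip the repeated header/totals lines
    {
        n++;
        total += $2;            # $2 is the uncompressed size
        if ($2 > max) max = $2;
        if (min == -1 || $2 < min) min = $2;
    }
    END {
        if (n > 0)
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / n, max;
    }'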

Many thanks for the help!

Kurt

