Batch file question - average size of file in directory

Giorgos Keramidas keramida at ceid.upatras.gr
Wed Jan 3 04:55:52 PST 2007


On 2007-01-02 10:20, Kurt Buff <kurt.buff at gmail.com> wrote:
> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files.
>
> I'd like to find the average uncompressed size of the gzipped files,
> and ignore the uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).

The gzip(1) utility, when run with the -cd options, will uncompress the
compressed files and send the uncompressed data to standard output,
without actually affecting the on-disk copy of the compressed data.

It is then easy to pipe the uncompressed data to wc(1) and count its
bytes:

        for fname in *.Z *.z *.gz; do
                if test -f "${fname}"; then
                        gzip -cd "${fname}" | wc -c
                fi
        done

This will print the byte size of the uncompressed output of gzip for
each of the currently compressed files.  Its output could look
something like this:

          220381
         3280920

This can be piped into awk(1) for further processing, with something
like this:

        for fname in *.Z *.z *.gz; do
                if test -f "$fname"; then
                        gzip -cd "$fname" | wc -c
                fi
        done | \
        awk 'BEGIN {
            min = -1; max = 0; total = 0;
        }
        {
            total += $1;
            if ($1 > max) {
                max = $1;
            }
            if (min == -1 || $1 < min) {
                min = $1;
            }
        }
        END {
            if (NR > 0) {
                printf "min/avg/max file size = %d/%d/%d\n",
                    min, total / NR, max;
            }
        }'

With the same files as above, the output of this would be:

        min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line.  Modified lines are marked with
a pipe character (`|') in their leftmost column below; the '|'
characters are *not* part of the script itself.

        for fname in *.Z *.z *.gz; do
                if test -f "$fname"; then
                        gzip -cd "$fname" | wc -c
                fi
        done | \
        awk 'BEGIN {
            min = -1; max = 0; total = 0;
|           printf "%10s %10s %10s %10s\n",
|               "SIZE", "MIN", "AVERAGE", "MAX";
        }
        {
            total += $1;
            if ($1 > max) {
                max = $1;
            }
            if (min == -1 || $1 < min) {
                min = $1;
            }
|           printf "%10d %10d %10d %10d\n",
|               $1, min, total/NR, max;
        }
        END {
            if (NR > 0) {
|               printf "%10s %10d %10d %10d\n",
|                   "TOTAL", min, total / NR, max;
            }
        }'

When run with the same set of two compressed files, this will print:

              SIZE        MIN    AVERAGE        MAX
            220381     220381     220381     220381
           3280920     220381    1750650    3280920
             TOTAL     220381    1750650    3280920

Please note though that awk(1) stores numbers as double-precision
floating point values, so with a sufficiently large set of files the
total byte count can exceed 2^53 and silently lose precision.  If this
is a concern, it should be easy to write an equivalent Perl or Python
script, to take advantage of their big-number support.
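
Something like the following (untested) Python sketch would do the
same computation; it pipes each file through gzip(1) exactly like the
shell loop above, and Python's integers have arbitrary precision, so
the running total always stays exact:

        #!/usr/bin/env python
        # Sketch: min/avg/max uncompressed size of the compressed
        # files in the current directory, without precision limits.
        import glob
        import os
        import subprocess

        total = count = max_size = 0
        min_size = None
        for fname in glob.glob('*.Z') + glob.glob('*.z') + glob.glob('*.gz'):
            if not os.path.isfile(fname):   # same guard as 'test -f'
                continue
            # Let gzip(1) decompress to a pipe, as in the shell loop,
            # and count the bytes it writes to standard output.
            proc = subprocess.Popen(['gzip', '-cd', fname],
                                    stdout=subprocess.PIPE)
            n = 0
            while True:
                chunk = proc.stdout.read(65536)
                if not chunk:
                    break
                n += len(chunk)
            proc.wait()
            total += n
            count += 1
            max_size = max(max_size, n)
            min_size = n if min_size is None else min(min_size, n)

        if count:
            print("min/avg/max file size = %d/%d/%d"
                  % (min_size, total // count, max_size))

Calling gzip(1) through a pipe, instead of using Python's own gzip
module, keeps the behaviour identical to the shell version and also
handles compress(1)-style .Z files, which the gzip module does not.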


