Batch file question - average size of file in directory
Giorgos Keramidas
keramida at ceid.upatras.gr
Wed Jan 3 04:55:52 PST 2007
On 2007-01-02 10:20, Kurt Buff <kurt.buff at gmail.com> wrote:
> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files.
>
> I'd like to find the average uncompressed size of the gzipped files,
> and ignore the uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).

The gzip(1) utility, when run with the -cd options will uncompress the
compressed files and send the uncompressed data to standard output,
without actually affecting the on-disk copy of the compressed data.

It is easy then to pipe the uncompressed data to wc(1) to count the
'bytes' of the uncompressed data:

    for fname in *.Z *.z *.gz; do
        if test -f "${fname}"; then
            gzip -cd "${fname}" | wc -c
        fi
    done

This will print the byte-size of the uncompressed output of gzip, for
all the files which are currently compressed.  Something like the
following could be its output:

    220381
    3280920

This can be piped into awk(1) for further processing, with something
like this:

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
    }
    END {
        if (NR > 0) {
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / NR, max;
        }
    }'

With the same files as above, the output of this would be:

    min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count, following each line.  Modified lines are marked
with a pipe character (`|') in their leftmost column below.  The '|'
characters are *not* part of the script itself.

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
|       printf "%10s %10s %10s %10s\n",
|           "SIZE", "MIN", "AVERAGE", "MAX";
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
|       printf "%10d %10d %10d %10d\n",
|           $1, min, total/NR, max;
    }
    END {
        if (NR > 0) {
|           printf "%10s %10d %10d %10d\n",
|               "TOTAL", min, total / NR, max;
        }
    }'

When run with the same set of two compressed files this will print:

          SIZE        MIN    AVERAGE        MAX
        220381     220381     220381     220381
       3280920     220381    1750650    3280920
         TOTAL     220381    1750650    3280920
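
Please note, though, that with a sufficiently large set of files awk(1)
may fail to count the total number of bytes correctly.  awk(1) keeps its
numbers as double-precision floating point values, so a running total
larger than 2^53 bytes is no longer exact, and some awk implementations
may truncate much smaller values when they format them with %d.  If this
is the case, it should be easy to write an equivalent Perl or Python
script, to take advantage of their big-number support.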
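
As a rough sketch of what such a Python replacement for the first awk(1)
script might look like (untested on your data; the function name
summarize and the file name avgsize.py used below are my own inventions,
and it assumes the input is one byte count per line, as printed by
wc -c):

```python
import sys


def summarize(lines):
    """Return (min, avg, max) for an iterable of byte counts, one per line.

    Python integers have arbitrary precision, so the running total stays
    exact no matter how large it grows.  Returns None for empty input.
    """
    count = 0
    total = 0
    smallest = None
    largest = None
    for line in lines:
        line = line.strip()      # wc -c may pad its output with spaces
        if not line:
            continue             # skip blank lines
        size = int(line)
        count += 1
        total += size
        if smallest is None or size < smallest:
            smallest = size
        if largest is None or size > largest:
            largest = size
    if count == 0:
        return None
    # Integer division truncates the average, like awk's %d does above.
    return smallest, total // count, largest


def main():
    result = summarize(sys.stdin)
    if result is not None:
        print("min/avg/max file size = %d/%d/%d" % result)


if __name__ == "__main__":
    main()
```

If you save this as avgsize.py, you can keep the for loop exactly as it
is and pipe its output to "python3 avgsize.py" in place of the awk
command.  With the two example sizes above it should print the same
min/avg/max line as the awk version.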