Batch file question - average size of file in directory
Kurt Buff
kurt.buff at gmail.com
Wed Jan 3 10:55:22 PST 2007
On 1/2/07, Giorgos Keramidas <keramida at ceid.upatras.gr> wrote:
> On 2007-01-02 10:20, Kurt Buff <kurt.buff at gmail.com> wrote:
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1).
>
> The gzip(1) utility, when run with the -cd options will uncompress the
> compressed files and send the uncompressed data to standard output,
> without actually affecting the on-disk copy of the compressed data.
>
> It is easy then to pipe the uncompressed data to wc(1) to count the
> 'bytes' of the uncompressed data:
>
> for fname in *.Z *.z *.gz; do
>     if test -f "${fname}"; then
>         gzip -cd "${fname}" | wc -c
>     fi
> done
>
> This will print the byte-size of the uncompressed output of gzip, for
> all the files which are currently compressed. Something like the
> following could be its output:
I put together this one-liner after perusing 'man zcat':
find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt
It puts out multiple instances of stuff like this:
compressed  uncompr.  ratio  uncompressed_name
      1508      3470  57.0%  stuff-7f+BIOFX1-qX
      1660      3576  54.0%  stuff-bsFK-yGcWyCm
      9113     17065  46.7%  stuff-os1MKlKGu8ky
...
...
...
  10214796  17845081  42.7%  (totals)
compressed  uncompr.  ratio  uncompressed_name
      7790     14732  47.2%  stuff-Z3UO7-uvMANd
      1806      3705  51.7%  stuff-9ADk-DSBFQGQ
      9020     16638  45.8%  stuff-Caqfgao-Tc5F
      7508     14361  47.8%  stuff-kVUWa8ua4zxc
I'm thinking that piping the output like so:
find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
grep -v compress | grep -v totals
will suppress the extraneous header/footer lines.
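In case it's useful, here's a self-contained sketch of that pipeline with an awk stage bolted on to compute the average uncompressed size. The temp directory and sample files are invented purely for illustration, I'm using gzip -l (which produces the same listing format as zcat -l), and I'm assuming column 2 of the listing is the uncompressed byte count:

```shell
# Illustrative only: build two .gz files in a temp dir, then filter the
# listing exactly as above and average column 2 (uncompressed bytes).
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a"; gzip "$dir/a"   # 12 bytes uncompressed
head -c 4096 /dev/zero > "$dir/b"; gzip "$dir/b"   # 4096 bytes uncompressed
avg=$(find "$dir" -name "*.gz" -print | xargs gzip -l |
      grep -v compress | grep -v totals |
      awk '{ total += $2 } END { if (NR > 0) printf "%d", total / NR }')
echo "average uncompressed size = $avg"
rm -r "$dir"
```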
> This can be piped into awk(1) for further processing, with something
> like this:
>
> for fname in *.Z *.gz; do
>     if test -f "$fname"; then
>         gzip -cd "$fname" | wc -c
>     fi
> done | \
> awk 'BEGIN {
>     min = -1; max = 0; total = 0;
> }
> {
>     total += $1;
>     if ($1 > max) {
>         max = $1;
>     }
>     if (min == -1 || $1 < min) {
>         min = $1;
>     }
> }
> END {
>     if (NR > 0) {
>         printf "min/avg/max file size = %d/%d/%d\n",
>             min, total / NR, max;
>     }
> }'
>
> With the same files as above, the output of this would be:
>
> min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count after each line. Modified lines are marked with
> a pipe character (`|') in their leftmost column below; the '|'
> characters are *not* part of the script itself.
>
> for fname in *.Z *.gz; do
>     if test -f "$fname"; then
>         gzip -cd "$fname" | wc -c
>     fi
> done | \
> awk 'BEGIN {
>     min = -1; max = 0; total = 0;
> |   printf "%10s %10s %10s %10s\n",
> |       "SIZE", "MIN", "AVERAGE", "MAX";
> }
> {
>     total += $1;
>     if ($1 > max) {
>         max = $1;
>     }
>     if (min == -1 || $1 < min) {
>         min = $1;
>     }
> |   printf "%10d %10d %10d %10d\n",
> |       $1, min, total/NR, max;
> }
> END {
>     if (NR > 0) {
> |       printf "%10s %10d %10d %10d\n",
> |           "TOTAL", min, total / NR, max;
>     }
> }'
>
> When run with the same set of two compressed files this will print:
>
>       SIZE        MIN    AVERAGE        MAX
>     220381     220381     220381     220381
>    3280920     220381    1750650    3280920
>      TOTAL     220381    1750650    3280920
>
> Please note though that with a sufficiently large set of files, awk(1)
> may fail to count the total number of bytes correctly. If this is the
> case, it should be easy to write an equivalent Perl or Python script,
> to take advantage of their big-number support.
I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.
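For what it's worth, here's one way that awk script might be adapted to the listing output — reading the uncompressed size from column 2 instead of piping gzip -cd through wc. Again, the temp directory and sample files are made up for illustration, and gzip -l stands in for zcat -l:

```shell
# Illustrative sketch: min/avg/max of the uncompressed sizes reported
# in column 2 of the listing, with headers and totals filtered out.
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a"; gzip "$dir/a"   # 12 bytes
head -c 4096 /dev/zero > "$dir/b"; gzip "$dir/b"   # 4096 bytes
summary=$(find "$dir" -name "*.gz" -print | xargs gzip -l |
    grep -v compress | grep -v totals |
    awk 'BEGIN { min = -1; max = 0; total = 0 }
         {
             total += $2
             if ($2 > max) max = $2
             if (min == -1 || $2 < min) min = $2
         }
         END {
             if (NR > 0)
                 printf "min/avg/max file size = %d/%d/%d",
                        min, total / NR, max
         }')
echo "$summary"
rm -r "$dir"
```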
Many thanks for the help!
Kurt