Optimize shell

Norberto Meijome freebsd at meijome.net
Mon Feb 6 21:03:00 PST 2006


Olivier Nicole wrote:
> Hello,
> 
> I am setting up a machine to work as a mail back-up. It receives a copy
> of every email for every user. When the disk is almost full, I want to
> delete the oldest messages, up to a total size of 4000000000 bytes.
> 
> Messages are stored in /home/sub_home/user/Maildir/cur in maildir
> format. 
> 
> Message name is of the form 1137993135.86962_0.machine.cs.ait.ac.th
> where the first number is a Unix time stamp.
> 
> I came up with the following shell script to find the messages of all
> users, sort them by date and compute the total size up to 4 GB.
> 
> for i in `/usr/bin/find /home -mindepth 5 -ls | /usr/bin/grep /Maildir/cur/ | /usr/bin/sed -E 's/^ *[0-9]+ +[0-9]+ +[-rwx]+ +[0-9]+ +[^ ]+ +[^ ]+ +([0-9]+) +.*(\/home\/.*\/)([0-9]+)(\..*)$/\3 \1 \2\3\4/' | /usr/bin/sort -n +0 -1 | /usr/bin/awk '{sum+=$2; if (sum < 4000000000) print $3;}'`; do
>     /bin/rm $i
> done
> 
> find /home -mindepth 5 -ls makes a list of all files and directories at
>      a depth of 5 and more, because my directory structure is such that
>      messages are stored at level 6
> 
> grep /Maildir/cur/ because courier-imap tends to put things in other
>      directories it creates when it needs to
> 
> These two commands give me a list of the form:
> 
> 1397490    8 -rw-------    1 on               staff            3124 Jan 27 15:23 /home/java/on/Maildir/cur/1138350182.1413_1.mackine.cs.ait.ac.th
> 
> where 3124 is the size
> 
> The sed command transforms the line into date, size, filename:
> 
> 1137994623 2466 /home/java/on/Maildir/cur/1137994623.87673_0.mail.cs.ait.ac.th
> 
> Then it sorts on the date field and awk is used to sum on the size
> field and print the filename until the total of 4 GB is reached.
> 
> That works OK, but it is damn slow: for 200 users, 7800 messages and
> 302MB it takes something like 3+ minutes... For 25 GB of email it
> should take more than 4 hours, this is too much.
> 
> It seems that the slow part is the sort:
> 
> without sort
> time /usr/bin/find /home -mindepth 5 -ls | /usr/bin/grep /Maildir/cur/ | /usr/bin/sed -E 's/^ *[0-9]+ +[0-9]+ +[-rwx]+ +[0-9]+ +[^ ]+ +[^ ]+ +([0-9]+) +.*(\/home\/.*\/)([0-9]+)(\..*)$/\3 \1 \2\3\4/' |  cat /dev/null
> 0.026u 0.035s 0:07.67 0.6%      51+979k 0+0io 0pf+0w
> 
> with sort
> time /usr/bin/find /home -mindepth 5 -ls | /usr/bin/grep /Maildir/cur/ | /usr/bin/sed -E 's/^ *[0-9]+ +[0-9]+ +[-rwx]+ +[0-9]+ +[^ ]+ +[^ ]+ +([0-9]+) +.*(\/home\/.*\/)([0-9]+)(\..*)$/\3 \1 \2\3\4/' | /usr/bin/sort -n +0 -1 | cat /dev/null
> 0.281u 0.366s 3:44.75 0.2%      39+1042k 0+0io 0pf+0w
> 
> Any idea how to speed things up?

Assuming sort is slow because of the number of items it has to handle,
it may help to reduce the number of items in the list. For example, can
you set a limit such as deleting mail older than x months, or keeping
only the most recent 3 months of mail in the cur folder? In that case
you could just select by timestamp and forget about sorting and
awking.
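
A rough sketch of the select-by-timestamp idea, assuming the file names
always start with the Unix timestamp as described above. The function
name list_old_mail and the 90-day figure are made up for illustration,
and paths containing newlines are not handled:

```shell
#!/bin/sh
# list_old_mail ROOT CUTOFF_EPOCH  (hypothetical helper)
# Print every file under */Maildir/cur/ whose leading filename
# timestamp is older than CUTOFF_EPOCH. No sort is needed.
list_old_mail() {
    root=$1
    cutoff=$2
    find "$root" -type f -path '*/Maildir/cur/*' |
    awk -F/ -v cutoff="$cutoff" '
        {
            # The last path component looks like 1137993135.86962_0.host
            split($NF, parts, ".")
            if (parts[1] ~ /^[0-9]+$/ && parts[1] + 0 < cutoff + 0)
                print
        }'
}

# Example (left commented out on purpose): list mail older than about
# 90 days, check the list by eye, and only then pipe it to rm.
# cutoff=$(( $(date +%s) - 90 * 24 * 60 * 60 ))
# list_old_mail /home "$cutoff"
# list_old_mail /home "$cutoff" | xargs rm
```

This trades the "free exactly 4 GB" rule for a fixed age limit, so it
only fits if a time-based retention policy is acceptable.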

I have also found that sort is much slower than purpose-built sorting
utilities: it is far slower than zmergelog when sorting several GB of
Apache log files. Maybe you can write or use some other tool for this?
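
Before replacing sort, it may be worth confirming that sort really is
the slow stage. "cat /dev/null" never reads its standard input, so a
pipeline ending in it can be cut short by a broken pipe, whereas sort
has to read everything before it writes anything. A sink that consumes
all of its input should give a fairer comparison. The helper name
count_lines below is made up, and "sort -n -k1,1" is the POSIX spelling
of "sort -n +0 -1":

```shell
#!/bin/sh
# count_lines (hypothetical helper): read ALL of stdin and print the
# number of lines, so every upstream stage has to run to completion.
count_lines() {
    wc -l | tr -d ' '
}

# Usage sketch: run each variant under time(1) and compare. Running
# each one twice helps keep filesystem caching from skewing the result.
#   <your find | grep | sed pipeline> | count_lines
#   <your find | grep | sed pipeline> | sort -n -k1,1 | count_lines
```

If both variants take about the same time, the cost is probably in
walking the filesystem rather than in sort.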

Beto

