copying millions of small files and millions of dirs

Frank Leonhardt frank2 at fjl.co.uk
Tue Aug 20 09:19:35 UTC 2013


On 20/08/2013 08:32, krad wrote:
> When I migrated a large mailspool in maildir format from the old NFS server
> to the new one in a previous job, I first generated a list of the top-level
> maildirs. I then generated the rsync commands plus a few other bits and
> pieces for each maildir, to make a single transaction-like function. I then
> pumped all these auto-generated scripts into xjobs and ran them in parallel.
> This vastly sped up the process, as walking the tree sequentially was far
> too slow. This was for about 15 million maildirs in a hashed structure btw,
> so a fair number of files.
>
>
> eg
>
> find /maildir -maxdepth 4 -type d | while read d
> do
>     r=$(($RANDOM*$RANDOM))
>     echo "rsync -a \"$d/\" \"/newpath/$d/\"" > /tmp/scripts/$r
>     echo some other stuff >> /tmp/scripts/$r
> done
>
> ls /tmp/scripts/ | while read f
> do
>     echo sh /tmp/scripts/$f
> done | xjobs -j 20
>

This isn't what I'd have expected, as running operations in parallel on 
mechanical drives would normally result in superfluous head movements 
and thus exacerbate the I/O bottleneck. The system must be optimising 
the requests from 20 parallel jobs better than I thought it would to 
climb far enough out of that hole to get a net benefit. Do you 
remember how any other approaches performed?
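
For what it's worth, the same fan-out could probably be done without the 
intermediate script files at all, e.g. with xargs -P (which FreeBSD's xargs 
supports). Untested sketch only, and it drops the "other bits and pieces" 
step that your scripts handled per maildir:

find /maildir -maxdepth 4 -type d -print0 |
    xargs -0 -I{} -P 20 rsync -a "{}/" "/newpath/{}/"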


