Fast diff command for large files?

Charles Swiger cswiger at mac.com
Fri Nov 4 19:39:25 GMT 2005


On Nov 4, 2005, at 12:29 PM, Kirk Strauser wrote:
>> Multigigabyte?  Find another approach to solving the problem, a  
>> text-base
>> diff is going to require excessive resources and time.  A 64-bit  
>> platform
>> with 2 GB of RAM & 3GB of swap requires ~1000 seconds to diff ~400MB.
>
> There really aren't many options.  For the patient, here's what's  
> happening:
> [ ... ]
> And that's why I need a fast diff.  Even if it takes as long as the  
> database
> bulk loads, we can run it on another server and use 20 seconds of  
> CPU for
> PostgreSQL instead of 45 minutes.  The practical upshot is that the
> database will never get sluggish, even if the other "diff server"  
> is loaded
> to the gills.

OK, but even if only one line out of 1000 changes, you still can't  
make either diff or Colin Percival's bsdiff run on gigabyte sized  
files and have it fit into MAXDSIZE on 32-bit address space.  From  
the latter's website:

"bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)  
bytes of memory, where n is the size of the old file and m is the  
size of the new file.  bspatch requires n+m+O(1) bytes.  bsdiff runs  
in O((n+m) log n) time; on a 200MHz Pentium Pro, building a binary  
patch for a 4MB file takes about 90 seconds. bspatch runs in O(n+m)  
time; on the same machine, applying that patch takes about two seconds."

Some time ago, I wrote a quick test harness for diff here:

    http://www.pkix.net/~chuck/difftest.py

On a 5.4 machine with kern.dfldsiz="1G" set in /boot/loader.conf, you  
can only manage to run diff on files up to about 120 MB in size:

31-pi% ./difftest.py -v
INFO: beginning diff trial run with ratio = 100
filea_size=10485760 (aka 10.000 MB)
time=1.370 filea_size=10MB diff_size=818KB
filea_size=15728640 (aka 15.000 MB)
time=2.305 filea_size=15MB diff_size=1229KB
filea_size=23592960 (aka 22.500 MB)
time=5.443 filea_size=22MB diff_size=1844KB
filea_size=35389440 (aka 33.750 MB)
time=7.195 filea_size=33MB diff_size=2768KB
filea_size=53084160 (aka 50.625 MB)
time=16.771 filea_size=50MB diff_size=4163KB
filea_size=79626240 (aka 75.938 MB)
time=43.525 filea_size=75MB diff_size=6257KB
filea_size=119439360 (aka 113.906 MB)
time=78.346 filea_size=113MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=36.896 filea_size=170MB diff_size=0KB
272.58s real  154.73s user  13.23s system  61%

On a 64-bit SPARC box mentioned above, you can get sizes up to ~400 MB:

[ ... ]
filea_size=119439360 (aka 113.906 MB)
time=140.650 filea_size=115MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
time=424.586 filea_size=172MB diff_size=15MB
filea_size=268738560 (aka 256.289 MB)
time=546.334 filea_size=258MB diff_size=22MB
filea_size=403107840 (aka 384.434 MB)
time=957.059 filea_size=388MB diff_size=33MB
filea_size=604661760 (aka 576.650 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=105.728 filea_size=582MB diff_size=0KB
5610.90s real  3268.63s user  1761.90s system  89%

Roughly, you need about an order of magnitude more RAM or virtual  
memory available then the size of the files you are trying to diff,  
even if the files are very similar.

-- 
-Chuck



More information about the freebsd-questions mailing list