Fast diff command for large files?

Kirk Strauser kirk at strauser.com
Mon Nov 7 15:48:41 GMT 2005


On Sunday 06 November 2005 07:39, Andrew P. wrote:

> Note that the difference must be kept in RAM, so it won't work if there 
> are multi-gig diffs. It will work very fast if the diffs are only 
> 10-100MB, and at close to I/O speed if the diff is under 10MB.  

Thanks, Andrew!  My Python script runs that algorithm in 17 seconds on a 
400MB file with 10% CPU.

For anyone interested, here's my implementation.  Note that Python's 
readline() method always returns a string, even at EOF (at which point you 
get an empty string).  Empty strings evaluate as false, which is why the 
"if not (oldline or newline): break" test exits the loop once both files 
are exhausted.

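    # oldfile and newfile are file objects already opened for reading.
    old_records = []  # lines seen only in the old file so far
    new_records = []  # lines seen only in the new file so far

    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF; stop once both files are exhausted.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue

        # If this old line already turned up in the new file, cancel the
        # pair; otherwise remember it as unmatched.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)

        # Same check in the other direction.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)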
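
For anyone who wants to try the loop without real files, here is a minimal self-contained sketch.  The function name diff_records() and the io.StringIO sample data are assumptions made for illustration, not part of the original script.  The loop body is the same as above.

```python
import io


def diff_records(oldfile, newfile):
    """Return (old_only, new_only) line lists, reading both files in lockstep.

    diff_records is a hypothetical wrapper name used only for this sketch.
    """
    old_records = []
    new_records = []
    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF, and '' is falsy.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue
        # Cancel the old line against a pending new line, or keep it.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)
        # Cancel the new line against a pending old line, or keep it.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)
    return old_records, new_records


# Made-up sample data: line "b" was dropped and line "e" was added.
old = io.StringIO("a\nb\nc\nd\n")
new = io.StringIO("a\nc\nd\ne\n")
removed, added = diff_records(old, new)
print(removed, added)
```

Since list.remove() scans the list linearly, this is only quick while the pending differences stay small, which matches the caveat about keeping the diff in RAM.
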

> Hope this gives you some idea.

It did.  It must've been a long work week, because that all seems so obvious 
in retrospect but was completely opaque at the time.  Thanks again!
-- 
Kirk Strauser

More information about the freebsd-questions mailing list