Fast diff command for large files?
Andrew P.
infofarmer at gmail.com
Sun Nov 6 13:39:13 GMT 2005
On 11/6/05, Kirk Strauser <kirk at strauser.com> wrote:
> On Friday 04 November 2005 02:04 pm, you wrote:
>
> > Does the overall order of lines change every time you dump the tables?
>
> No, although an arbitrary number of lines might get deleted.
>
> > If it does/can, then there's a trivial solution (a few lines in perl, or a
> > hundred lines in C) that'll make the speed roughly similar to that of I/O.
>
> Could you elaborate? That's been bugging me all weekend. I know I should
> know this, but I can't quite put my finger on it.
> --
> Kirk Strauser
while (there are more records) {
    a = read (line from old file)
    b = read (line from new file)
    if (a == b) then next
    if (a in new_records) {
        get a out of new_records    # a matched a line seen earlier in the new file
    } else {
        put a in old_records        # a is unmatched so far
    }
    if (b in old_records) {
        get b out of old_records    # b matched a line seen earlier in the old file
    } else {
        put b in new_records        # b is unmatched so far
    }
}
After that, old_records will contain the records present in the old
file but not in the new one, and new_records will contain the
records present in the new file but not in the old one.
Note that the difference must be kept in RAM, so this won't work
if the diff is multiple gigabytes; but it will run very fast if the
diff is only 10-100 MB, and at close to I/O speed if the diff is
under 10 MB.
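
For reference, here is a minimal Python sketch of that idea; the
file names old.txt and new.txt are placeholders, and it assumes one
record per line with no duplicate records:

    # Sketch of the in-RAM approach: read both dumps in lockstep and keep
    # only the currently unmatched lines in two sets.
    old_only = set()   # lines seen in the old file, unmatched so far
    new_only = set()   # lines seen in the new file, unmatched so far

    with open("old.txt") as old_f, open("new.txt") as new_f:
        while True:
            a = old_f.readline()
            b = new_f.readline()
            if not a and not b:      # both files exhausted
                break
            if a == b:               # identical lines match immediately
                continue
            if a:
                if a in new_only:    # a matches an earlier unmatched new line
                    new_only.remove(a)
                else:
                    old_only.add(a)
            if b:
                if b in old_only:    # b matches an earlier unmatched old line
                    old_only.remove(b)
                else:
                    new_only.add(b)

    # old_only now holds the records deleted since the old dump,
    # new_only the records added in the new one.

Only the unmatched lines live in the two sets, which is why memory
use tracks the size of the diff rather than the size of the dumps.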
If the records can be kept in a known sort order (e.g.
alphabetically), we don't need to keep anything in RAM or do any
membership checks at all. Let's assume ascending order
(1-2-5-7-31-...):
while (there are more records) {
    a = read (line from old file)
    b = read (line from new file)
    while (a <> b) {
        if (a < b) then {
            write a to old_records
            read next a
        }
        if (a > b) then {
            write b to new_records
            read next b
        }
    }
}
Of course, you've got to add some checks to
deal with EOF correctly.
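
As a rough illustration, the same merge with the EOF checks filled
in might look like this in Python (again, one record per line is
assumed and the file names are placeholders):

    # Sketch of the merge approach for sorted dumps: nothing is kept in RAM,
    # the lagging side is written straight into the diff files.
    with open("old.txt") as old_f, open("new.txt") as new_f, \
         open("old_records", "w") as old_out, open("new_records", "w") as new_out:
        a = old_f.readline()
        b = new_f.readline()
        while a and b:
            if a == b:               # record present in both dumps
                a = old_f.readline()
                b = new_f.readline()
            elif a < b:              # record only in the old dump
                old_out.write(a)
                a = old_f.readline()
            else:                    # record only in the new dump
                new_out.write(b)
                b = new_f.readline()
        # whatever remains after the other file hits EOF is part of the diff
        while a:
            old_out.write(a)
            a = old_f.readline()
        while b:
            new_out.write(b)
            b = new_f.readline()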
Hope this gives you some idea.