bin/157244: dump/restore: unknown tape header type -230747966

Sun May 22 13:10:22 UTC 2011

I wrote a program to compare, block by block, another copy of one of
the large files in the dump with the version extracted by restore
after applying my header reordering program.  The program read each
file in blocks of TP_BSIZE bytes, computed the SHA1 hash of each
block, and stored the resulting <hash, offset> pairs in a hash map,
one per file.  It then unioned the key sets of the two hash maps to
obtain a single master list of block hashes, and traversed that list
to construct a map <offset, <offset0, offset1>> giving the
correspondence between the blocks in the two files.  Finally it
printed the contents of that map in increasing order of offset,
showing the differences between the two files.  Offsets are in units
of blocks; each line shows offset0, offset1, and the difference
offset1 - offset0.  Here is the initial part of the result:

Lectures.zip.bad: 52469795 bytes
Lectures.zip.good: 52469795 bytes
11612   11622   10
11613   11623   10
11614   11624   10
11615   11625   10
11616   11626   10
11617   11627   10
11618   11628   10
11619   11629   10
11620   11630   10
11621   11631   10
11622   11632   10
11623   11633   10
11624   11634   10
11625   11635   10
11626   11636   10
11627   11637   10
11628   11638   10
11629   11639   10
11630   11640   10
11631   11641   10
11632   11612   -20
11633   11613   -20
11634   11614   -20
11635   11615   -20
11636   11616   -20
11637   11617   -20
11638   11618   -20
11639   11619   -20
11640   11620   -20
11641   11621   -20
11642   11652   10
11643   11653   10
11644   11654   10
11645   11655   10
11646   11656   10
11647   11657   10
11648   11658   10
11649   11659   10
11650   11660   10
11651   11661   10
11652   11662   10
11653   11663   10
11654   11664   10
11655   11665   10
11656   11666   10
11657   11667   10
11658   11668   10
11659   11669   10
11660   11670   10
11661   11671   10
11662   11642   -20
11663   11643   -20
11664   11644   -20
11665   11645   -20
11666   11646   -20
11667   11647   -20
11668   11648   -20
11669   11649   -20
11670   11650   -20
11671   11651   -20
11672   11682   10
11673   11683   10
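
For anyone who wants to repeat the comparison, here is a minimal
sketch of the approach described before the listing.  It is not the
program I actually ran.  The names block_hashes and
block_correspondence are made up for illustration, TP_BSIZE is taken
to be 1024 (check protocols/dumprestore.h), and the sketch pairs up
only hashes present in both files rather than walking the full union
of the key sets, since only those can yield a correspondence.

```python
import hashlib
from collections import defaultdict

TP_BSIZE = 1024  # assumed dump block size; see protocols/dumprestore.h


def block_hashes(data, bsize=TP_BSIZE):
    """Map SHA1 digest -> list of block offsets where that block occurs."""
    table = defaultdict(list)
    for pos in range(0, len(data), bsize):
        digest = hashlib.sha1(data[pos:pos + bsize]).digest()
        table[digest].append(pos // bsize)
    return table


def block_correspondence(data0, data1, bsize=TP_BSIZE):
    """Return sorted (offset0, offset1, offset1 - offset0) for moved blocks."""
    h0 = block_hashes(data0, bsize)
    h1 = block_hashes(data1, bsize)
    rows = []
    for digest in h0.keys() & h1.keys():
        # Occurrences are paired in order of appearance.  Blocks with
        # identical contents (e.g. runs of zeros) make this ambiguous.
        for off0, off1 in zip(h0[digest], h1[digest]):
            if off0 != off1:
                rows.append((off0, off1, off1 - off0))
    return sorted(rows)
```

Reading both files in binary mode and passing their contents to
block_correspondence yields rows in the same three-column form as
the listing.  Identical files produce no rows.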

The pattern repeats this way for *almost* the entire file: a run of
20 blocks that occur 10 blocks ahead of the corresponding blocks in
the other file, followed by a run of 10 blocks that occur 20 blocks
behind them.  There are occasional differences of 9 and 19, for which
I don't have a ready explanation.  One possibility is that my header
reordering relied on the magic number to identify header blocks, so
a few data blocks may have been misidentified as headers.  At the end
of the files there are a few blocks that do not correspond; these are
probably due to alignment at the end, which caused some of the last
data blocks to be used as the first blocks of the next file in the
dump.
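
The +10/+10/-20 pattern is exactly what results if, within every
group of 30 blocks, the first 10-block chunk ends up after the other
two.  The sketch below only models that rearrangement to show the
arithmetic; it does not prove anything about what dump does.  The
function names and the chunk and group sizes are assumptions read off
the listing, and the model starts its groups at offset 0, whereas the
listing starts at 11612.

```python
def rotated_layout(n_blocks, chunk=10, group=3):
    """layout[i] is the original offset of the block found at position i,
    assuming the first chunk of every group of chunks is written last."""
    span = chunk * group
    layout = []
    for start in range(0, n_blocks, span):
        idx = list(range(start, min(start + span, n_blocks)))
        layout.extend(idx[chunk:] + idx[:chunk])  # chunks A,B,C become B,C,A
    return layout


def displacements(layout):
    """Return (offset0, offset1, offset1 - offset0) for every moved block."""
    return [(pos, orig, orig - pos)
            for pos, orig in enumerate(layout) if pos != orig]


if __name__ == "__main__":
    for row in displacements(rotated_layout(60)):
        print("%d\t%d\t%d" % row)
```

With 60 blocks this prints 20 rows with difference 10, then 10 rows
with difference -20, and then the same again.  A period of three
10-block chunks would be consistent with three writers, one of which
is consistently out of turn, but that is my interpretation.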

To test my suspicion that this is a concurrency issue in dump,
I recompiled dump after setting #define SLAVES 1 in tape.c
(rather than the value 3 it had before).  I was then able to
complete two rounds of "dump 0f - /mail | restore rfN -"
without any errors, whereas with the stock /sbin/dump it fails
very quickly, as indicated in the original PR.

I am not familiar with the locking features, etc. being used in
dump, so I don't know if I will be able to go farther than this
with a reasonable expenditure of time.  However, I strongly
suggest that the "concurrency modifications" in dump be turned
off (perhaps by setting SLAVES to 1 as I did) until somebody
can get to the bottom of this.  If this is happening to me,
then I suspect there are *massive* numbers of bad dumps out there
that people think are actually good.  It will really be a rude
awakening when people try to read them back.  Since the data
blocks don't contain any tape address information,
the original block order cannot be recovered from such a dump.