bin/157244: dump/restore: unknown tape header type -230747966

Sun May 22 03:30:11 UTC 2011

>Number:         157244
>Category:       bin
>Synopsis:       dump/restore: unknown tape header type  -230747966
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun May 22 03:30:10 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Gene Stark
>Release:        8.0-RELEASE
>Organization:
>Environment:
FreeBSD home.starkeffect.com 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #10: Fri Jul 16 12:32:08 EDT 2010     root@home.starkeffect.com:/huge/src/sys/i386/compile/STARKHOME-SMP_8_0  i386

>Description:
I made an 18 GB dump of a (gvinum) filesystem using the command
"dump 0f - /dev/gvinum/A > A.dump".  No problems were reported during
the dump, and the volume fsck'ed clean beforehand.  I newfs'ed the volume and
attempted to restore via "restore rf - < /A.dump", which failed with the
error message: unknown tape header type -230747966.  This was quite irritating,
as I have grown to trust dump/restore over many years and, because of the
size, I had already destroyed the original volume without first trying to
read through the dump file with restore.

I spent substantial time analyzing the dump to try to determine the failure
mode.  It turns out that header blocks actually occur out of order in
the dump file, as shown by comparing the actual offset of each header block
in the dump file with the spcl.c_tapea field of that header.
Once the problem started (during a large file near the beginning of the
dump), the difference between the actual offset (in units of TP_BSIZE)
and the offset claimed in spcl.c_tapea was always -10, 0, or 20.
That is, sometimes the header blocks came earlier than expected, sometimes
they came on time, and sometimes they came later, and only those three
displacements occurred.
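
The comparison described above can be sketched roughly as follows.  This is
only an illustration, not the program actually used: header detection is
left to a caller-supplied function (the name parse_header is hypothetical)
because it depends on the exact layout of struct s_spcl, which is not
reproduced here.  The only constant taken from the dump format is TP_BSIZE
(1024 bytes).

```python
from collections import Counter

TP_BSIZE = 1024  # dump record size in bytes, from <protocols/dumprestore.h>


def header_displacements(path, parse_header):
    """Histogram of (actual block index - claimed c_tapea) over all headers.

    parse_header is a caller-supplied (hypothetical) function: it takes one
    TP_BSIZE-byte block and returns the claimed c_tapea as an int if the
    block is a header, or None if it is a data block.
    """
    hist = Counter()
    with open(path, "rb") as f:
        index = 0  # actual position of the current block, in TP_BSIZE units
        while True:
            block = f.read(TP_BSIZE)
            if len(block) < TP_BSIZE:
                break  # end of file (or a trailing partial block)
            claimed = parse_header(block)
            if claimed is not None:
                # Negative: header is earlier than claimed; positive: later.
                hist[index - claimed] += 1
            index += 1
    return hist
```

On an intact dump every header should land in bucket 0; on the damaged dump
described here the buckets would be -10, 0, and 20.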

I wrote a program to read the dump records and reorder them so that the
headers are emitted at their claimed offsets.  It queues the headers and
data blocks separately, emitting a header whenever one is due at the current
output offset and a data block otherwise.  The program could then verify that
the correct number of data blocks was present to match the information in
the headers.  However, when I pipe the reordered block stream into restore,
there are still some issues.  For one thing, there is no way to verify the
order of the data blocks, and it seems that reordering might also have
occurred on those.  I have other copies of some of the large files that
were in the dump, and I will attempt to determine how the data blocks
have been reordered, but I haven't done that yet.
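
A minimal sketch of the queueing scheme, for illustration only (it is not
the actual program).  It assumes the records have already been classified
into (claimed, block) pairs, where claimed is the header's c_tapea value or
None for a data block, and it assumes a header never arrives more than
"window" blocks late; the default of 32 is an arbitrary choice above the
largest lateness of 20 observed here.  The names reorder and window are
made up for this example.

```python
from collections import deque


def reorder(records, window=32):
    """Yield blocks with each header moved back to its claimed position.

    records: iterable of (claimed, block) pairs in file order, where claimed
    is the header's c_tapea value, or None for a data block.
    window: assumed upper bound on how many blocks late a header can arrive.
    """
    headers = {}    # claimed position -> header block
    data = deque()  # data blocks, kept in arrival order
    pos = 0         # next output position
    seen = 0        # number of records read so far

    def next_block():
        nonlocal pos
        if pos in headers:
            block = headers.pop(pos)   # a header is due at this position
        elif data:
            block = data.popleft()     # otherwise emit the oldest data block
        else:
            raise ValueError(f"no block available for position {pos}")
        pos += 1
        return block

    for claimed, block in records:
        if claimed is None:
            data.append(block)
        else:
            headers[claimed] = block
        seen += 1
        # A position is final once any header claiming it must have arrived.
        while pos < seen - window:
            yield next_block()

    # End of input: flush whatever is still queued.
    while headers or data:
        yield next_block()
```

Data blocks are emitted strictly in arrival order, so this only repairs the
header positions; it cannot detect or undo any reordering among the data
blocks themselves.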

I was at a loss to explain how this kind of reordering could have occurred
until I read some of the dump source and saw that it uses multiple
processes to write the dump file.  I am running on a 2-core system (4 logical
CPUs with hyperthreading).  I strongly suspect a concurrency issue in the way
the dump tape is written; otherwise I don't see how the header blocks could
have been reordered in the way I observed.

>How-To-Repeat:
Although I have (unfortunately) already destroyed the original filesystem,
I was able to repeat the behavior on another filesystem using the following
command:

home# dump 0f - /mail | restore rfN -
  DUMP: Date of this level 0 dump: Sat May 21 22:57:40 2011
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/gvinum/mail_new (/mail) to standard output
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 11291623 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
unknown tape header type 1781888358
abort? [yn] y
dump core? [yn] n
  DUMP: Broken pipe
  DUMP: The ENTIRE dump is aborted.

This problem really needs to be looked into, because it is a disaster to
create an apparently successful dump with the idea of doing a simple
filesystem volume rebuild and then find out that it fails in the restore.

Reordering the dump stream to put the header blocks back in their
proper positions helps quite a bit, but I have not yet been able to recover
my data, because the data blocks are apparently reordered as well.  If the
reordering follows a systematic pattern, I might still be able to recover,
but if it results from a concurrency/synchronization problem it might well
be hopeless.

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: