Data corruption in cd9660 on FreeBSD 4.11?

Stephen McKay smckay at internode.on.net
Fri Jun 24 12:32:08 GMT 2005


Hi!

I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
My best theory so far is that cd9660 or perhaps the VFS layer is mishandling
2048-byte buffers (since they are smaller than one virtual memory page),
occasionally writing them to the wrong location in RAM.  Read on for why
I think so.

First up, I don't think this is the usual hardware problem since the machine
has done huge numbers of buildworlds (in 4.x and -current) without any of
the telltale signs (eg bus errors and segmentation violations).  There are
no error messages in /var/log/messages.  Also, it moonlights as a games
machine and plays Doom 3, Battlefield 1942, Neverwinter Nights and so forth
like a champ.  Memory, cpu, video, disk, networking are all just fine 100%
of the time.

The hardware is an ASUS P4P800 mobo (including onboard Marvell Yukon gigabit
ethernet) with a P4 2.8GHz cpu, 1GB RAM, Maxtor 120GB disk, Pioneer 103S
DVD-ROM, LiteOn SOHW-1673S DVD burner in an Antec Sonata case.

Now that I have a DVD burner, I make backups of my main machines (over NFS)
but have found that they often don't verify as 100% correct.  The symptom
is that, for some files, an entire 2048-byte DVD sector is replaced with
different (non-zero) data.  This occurs both when reading with the Pioneer
DVD-ROM and when reading with the LiteOn burner (though I don't test with
the Pioneer much as it is slower).

I emphasise that all burns have been 100% correct (ie the burning process
worked and this can be verified by reading on, say, my iBook), so all of
the hardware seems to be operating correctly (and swiftly, I might add).
The problem is that reading the iso9660 file system is not safe.

After some experimenting, I've found that the problem also occurs when
reading CDs, so I built a test CD (of photos of a recent wedding) and read
it over and over in testing.  I compare the CD with the original files (via
NFS) using diff.  When diff finds a difference, I save copies of the
differing files before they can be flushed from the cache.
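
Once two differing copies have been saved, the damage can be narrowed down
to individual sectors.  The following is a minimal sketch in Python, not the
tooling used above; the function name and the choice of Python are
assumptions made for illustration.  It compares two files in 2048-byte steps
and reports which block numbers differ.

```python
SECTOR = 2048  # size of one CD/DVD data sector in bytes


def differing_sectors(path_a, path_b, sector=SECTOR):
    """Return the indices of sector-sized blocks that differ between two files."""
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(sector)
            b = fb.read(sector)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

If the corruption really is one whole sector, the result for a bad file
should be a single block number, and the byte offset of the damage is that
number times 2048.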

I have calculated checksums for all 2048-byte blocks on the CD, so I can
tell whether any given block of 2048 bytes came from the CD and, if so,
which file it came from.  In all cases so far, the 2048-byte error has been
a block from another file, not a random corruption.
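
The per-block checksum lookup described above could be sketched roughly as
follows.  This is an illustration only: SHA-1, the function names and the
in-memory dictionary are assumptions, not the actual method used.  It hashes
every 2048-byte block of every file under a directory (the original files,
say) and lets a stray block be traced back to the file and block number it
came from.

```python
import hashlib
import os

SECTOR = 2048  # size of one CD/DVD data sector in bytes


def build_block_index(root, sector=SECTOR):
    """Map the digest of each sector-sized block under root to its locations."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                block_no = 0
                while True:
                    block = f.read(sector)
                    if not block:
                        break
                    # A short final block is hashed as-is, without padding.
                    digest = hashlib.sha1(block).hexdigest()
                    index.setdefault(digest, []).append((path, block_no))
                    block_no += 1
    return index


def identify_block(index, block):
    """Return the (path, block number) pairs whose contents match this block."""
    return index.get(hashlib.sha1(block).hexdigest(), [])
```

Feeding the bad 2048 bytes from a corrupted copy to identify_block() then
either names the file the block belongs to or returns an empty list, which
would indicate data that is not from the CD at all.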

I am starting to believe that, under high load, the cd9660 file system
code tells the ata driver to put a 2K block in the wrong spot in memory,
leaving some old junk in the gap in the file being read, and blasting some
other 2K block of memory.  It may not be cd9660 code per se that is wrong,
but a bug in the complex buffer handling code (getblk, getnewbuf, allocbuf,
etc).

Why do I believe it is writing to the wrong memory, rather than any number
of other flaws?  In two runs (out of many), unusual things occurred that
are consistent with memory being overwritten, rather than, say, a 2K block
just not being read at all: In one, an innocent sshd core-dumped (which
is something that has never happened except when running my cd9660 tests),
and in another, a previously OK cached NFS file became corrupted.

Explaining that last case further: I had been running a test script that
would mount the CD, compare files, unmount the CD, and repeat.  This meant
that the NFS copy of the files was read over and over and hence became
memory resident (there being enough space in 1GB of RAM for one copy of
the files, plus my normal programs).  Several tests passed without fault
(hence all the NFS files were cached and correct), when suddenly there
were multiple corruptions; call them file A and file B.  File A was the
usual corruption where a 2K block of another file was unexpectedly present
in the copy read from the CD, but in file B it was the NFS file that was
wrong.  In fact it contained the missing block from file A!  In short, the
fully memory resident NFS file B had been corrupted by reading file A from
the CD.
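
For anyone wanting to reproduce this, the mount/compare/unmount loop might
look something like the sketch below.  It is not the script described above:
the function names are hypothetical, the mount and unmount steps are passed
in as callables so that no particular device or mount point is assumed, and
it assumes file names on the CD match the reference tree (for example via
Rock Ridge).  The comparison reads both files in full rather than using
Python's filecmp module, because filecmp caches results between calls and
could skip re-reading the disc.

```python
import os


def same_bytes(path_a, path_b, chunk=64 * 1024):
    """Compare two files byte for byte, always reading both from the start."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                return False
            if not a:
                return True  # both files ended at the same point


def compare_trees(cd_root, ref_root):
    """Return relative paths of reference files that differ on (or are missing from) the CD."""
    mismatches = []
    for dirpath, _dirs, files in os.walk(ref_root):
        for name in files:
            ref_path = os.path.join(dirpath, name)
            rel = os.path.relpath(ref_path, ref_root)
            cd_path = os.path.join(cd_root, rel)
            if not os.path.exists(cd_path) or not same_bytes(ref_path, cd_path):
                mismatches.append(rel)
    return sorted(mismatches)


def run_cycles(cycles, cd_root, ref_root, mount, unmount):
    """Mount, compare, unmount, repeat; return (cycle, file) pairs that mismatched."""
    failures = []
    for cycle in range(cycles):
        mount()
        try:
            for rel in compare_trees(cd_root, ref_root):
                failures.append((cycle, rel))
        finally:
            unmount()  # always unmount, even if the comparison raised
    return failures
```

The mount and unmount callables would wrap whatever commands suit the
machine, for instance subprocess calls to mount and umount with the local
device name and mount point filled in.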

It's been pretty interesting hunting this problem, but now I'm sort of
stuck.  I believe that some 2K reads from DVDs and CDs end up in the wrong
place in RAM, but I can't find where this happens in the code (it's pretty
hard to work out just by reading it), and I can't rule out the possibility
that there's a hardware error here that I've just never run across before.

So, can anyone suggest any more tests I could try?  Or is there a kind of
hardware fault that could cause this substitution of whole blocks read from
CDs without causing any other problems?

And does anyone know of any commits made anywhere in the 5 years since
4.x split off from 5.x that may be relevant?  Yep.  5 years.  I have
started looking, but there's a fair bit of stuff in there...

Stephen.


More information about the freebsd-stable mailing list