Analysis of disk file block with ZFS checksum error

Tue Mar 4 13:40:54 UTC 2008

Joe Peterson wrote:
> Gavin Atkinson wrote:
>> Are the datestamps (Thu Jan 24 23:20:58 2008) found within the corrupt
>> block before or after the datestamp of the file it was found within?
>> i.e. was the corrupt block on the disk before or after the mp3 was
>> written there?
> 
> Hi Gavin, those dates are later than the original copy (I do not have
> the file timestamps to prove this, but according to my email record, I
> am pretty sure of this).  So the corrupt block is later than the
> original write.
> 
> If this is the case, I assume that the block got written, by mistake,
> into the middle of the mp3 file.  Someone else suggested that it could
> be caused by a bad transfer block number or bad drive command (corrupted
> on the way to the drive, since these are not checksummed in the
> hardware).  If the block went to the wrong place, AND if it was a HW
> glitch, I suppose the best ZFS could then do is retry the write (if its
> failure was even detected - still not sure if ZFS does a re-check of the
> disk data checksum after the disk write), not knowing until the later
> scrub that the block had corrupted a file.
> 
> I think that anything is possible, but I know I was getting periodic DMA
> timeouts, etc. around that time.  I hesitate, although it is tempting,
> to use this evidence to focus blame purely on bad HW, given that others
> seem to be seeing DMA problems too, and there is reasonable doubt
> whether their problems are HW related or not.  In my case, I have been
> free of DMA errors (cross your fingers) after re-installing FreeBSD
> completely (giving it a larger boot partition and redoing the ZFS slice
> too), and before this, I changed the IDE cable just to eliminate one
> more variable.  Therefore, there are too many variables to reach a firm
> conclusion, since even if the cable was "bad", I never saw one DMA error
> or other indication of anything wrong with HW from the Linux side (and
> I've been using that HW with both Linux and FreeBSD 6.2 for months now -
> no apparent flakiness of any kind on either system).  So either it *was*
> bad and FreeBSD 7.0 was being more "honest", FreeBSD's drivers and/or
> ZFS were stressing the HW and revealing weaknesses in the cable, or it
> was a SW issue that got cleared somehow when I re-installed.
> 
> Is it possible that the problem lies in the ATA drivers in FreeBSD or
> even in ZFS and just looks like HW issues?  I do not have enough
> info/expertise to know.  If not, then it may very well be true that HW
> problems are pretty widespread (and that disk HW cannot, in fact, be
> trusted), and there really *is* a strong need for ZFS *now* to protect
> our data.  If there is a possibility that SW could be involved, any
> hints on how to further debug this would be of great help to those still
> experiencing recent DMA errors.  I just want to be more sure one way or
> the other, but I know this issue is not an easy one (however, it's the
> kind of problem that should receive the highest priority, IMHO).
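
To make the misdirected-write scenario above concrete, here is a toy
model, not ZFS code.  It assumes what I understand to be the general
design: the checksum for a block is kept in the pointer that references
it rather than next to the data, and it is compared when the block is
read or scrubbed, not immediately after the write.  The ToyPool class,
the lands_at parameter and the block numbers are all made up for
illustration, and SHA-256 just stands in for whatever checksum the pool
really uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is a stand-in for the pool's real checksum algorithm.
    return hashlib.sha256(data).hexdigest()


class ToyPool:
    """Toy model: expected checksums live in pointers, apart from the data."""

    def __init__(self):
        self.disk = {}      # block number -> bytes actually stored
        self.pointers = {}  # block number -> checksum the filesystem expects

    def write(self, blkno, data, lands_at=None):
        # lands_at simulates a block number corrupted on the way to the drive.
        self.pointers[blkno] = checksum(data)
        self.disk[blkno if lands_at is None else lands_at] = data

    def scrub(self):
        # Re-read every referenced block and compare it with its pointer.
        return sorted(
            blkno
            for blkno, expected in self.pointers.items()
            if checksum(self.disk.get(blkno, b"")) != expected
        )


pool = ToyPool()
pool.write(7, b"mp3 audio frame")  # original file block, written correctly
# A later write meant for block 3 lands on block 7 instead.
pool.write(3, b"log: Thu Jan 24 23:20:58 2008", lands_at=7)
bad_blocks = pool.scrub()
print(bad_blocks)
```

In this model the write itself reports nothing wrong; only the scrub
flags both the block that was overwritten and the block that never
received its data.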

I'm not sure what happened to this thread, but I also had a lot of 
similar issues.  I was using a mirrored pair of brand-new SATA 
drives.  It was suggested that my controller was junk.

I'm starting to think there is a timing issue or some such problem with 
ZFS, since I can use the same drives in a gmirror with UFS, and never 
have any data problems (md5 checksums confirm it over-and-over).  I 
highly doubt that everyone is seeing similar issues just because ZFS 
is so intense.  I've had plenty of systems under severe disk load that 
have never exhibited corrupt files because of something like this.
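
For anyone who wants to repeat the md5 comparison, here is a rough
sketch of the kind of check I mean.  It is not the exact procedure I
ran; the function names and the idea of keeping a "manifest"
dictionary are just assumptions for illustration.  It hashes every
file under a directory, and a second pass later on can be compared
against the first.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    # Map each file's path (relative to root) to its md5 hex digest.
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def changed_files(before, after):
    # Files from the first pass whose digest differs or which have vanished.
    return sorted(path for path in before if after.get(path) != before[path])
```

Build one manifest right after copying the data, build another after
the disks have been under load for a while, and pass both to
changed_files(); an empty list means every file still hashes the same.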

I wish we could get a handle on this issue.  One common thread seems 
to be ATA/SATA disks.  Is your setup running 32-bit or 64-bit 
FreeBSD?  (If you already mentioned it, I'm sorry, I missed it.)

Eric