WRITE_DMA48 error causing loss of ZFS array

Mon Oct 29 11:24:11 PDT 2007

   I've experienced several DMA errors over the past few months with 
various incarnations of 7.0, all of which were fixed.
   It seems I have a new one. I don't know if there is a connection, but 
this only occurred after updating to 7.0-BETA1 last weekend.
   I have a small UFS mirror for /boot and everything else on one ZFS pool.
   I scrub my zpool in the early hours every Monday morning. Last Monday 
when I got to the console I saw DMA_ERRORs slowly scrolling up the screen. 
I could type 'root' at the login prompt on a virtual terminal, but it just 
hung. There was nothing I could do apart from reset.
   When it came back it was fine AFAICT. Later that day I got the problem 
again. Reset and all ok. I then, confusingly, managed to successfully 
scrub the whole pool with no problems.
   However, this morning I had the same symptoms again. Nothing got 
logged, as the pool seemed to be effectively unavailable, so here are a 
couple of screenshots:

http://webhost.salford.ac.uk/aix502/29102007(001).thb.jpg
http://webhost.salford.ac.uk/aix502/29102007(004).thb.jpg

The errors all seem to be on one drive. AFAICT it had probably been going 
on for hours when I got to it, and it seems like it would continue this 
way forever.
   I've looked in the smartctl output for the drive (I do a short offline 
test every day and a long offline test every Sunday) but there is nothing 
there. I ran the Hitachi Drive Fitness Test on the drive and no problems 
were reported.
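
   For reference, a sketch of the kind of schedule described above, 
written as /etc/crontab entries. The pool name "tank", the device 
/dev/ad4, the times and the binary paths are placeholders assumed for 
illustration, not the actual setup on this machine:

```
# /etc/crontab fragment (illustrative; names and times are assumptions)
# minute hour mday month wday who  command

# Weekly scrub of the pool, early Monday morning
0      3    *    *     1    root /sbin/zpool scrub tank

# Short SMART self-test Monday to Saturday, long self-test on Sunday
0      1    *    *     1-6  root /usr/local/sbin/smartctl -t short /dev/ad4
0      1    *    *     0    root /usr/local/sbin/smartctl -t long /dev/ad4
```

   The results of the self-tests can then be read back with 
"smartctl -l selftest" on the same device.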
   This is one of two drives on a JMB363 controller, which is in IDE mode, 
if that makes a difference. I've seen posts referring to problems with 
that controller, but I think they might've been dealing with AHCI mode only?
   Is this a known problem? I've seen mention of known problems with ata, 
but it's hard to get a clear picture of what is currently outstanding from 
searching the last few months of -current.
   Also, why do I lose my zpool and have to reset? One drive failing 
should not cause a problem for the zpool, as it has redundancy, yet I am 
effectively losing the whole pool due to this error.
   I'll be glad to provide any more info.
   Many thanks in advance.

-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837  Fax: +44 161 295 5888  www.pgp.com for PGP key