disk "flipped" - a known problem?

Alexander Motin mav at FreeBSD.org
Thu Feb 21 19:07:12 UTC 2013


On 26.01.2013 03:17, Jeremy Chadwick wrote:
> Okay, I've figured out the exact, 100% reproducible condition that
> causes the situation.  It took me a lot of tries and a digital pocket
> recorder to take verbal notes (there are just too many things to look at
> simultaneously), but I've figured it out.
> 
> I'm sorry for the verbosity, but it's necessary.
> 
> Assume the disk we're talking about is /dev/ada5.
> 
> 1. Prior to any issues, we have this:
> 
> root at icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> crw-r-----  1 root  operator  0x8c Jan 25 16:41 /dev/ada5
> crw-------  1 root  operator  0x75 Jan 25 16:35 /dev/pass5
> crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> 
> 2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not
> get a response (not going to discuss how/why that can happen).
> 
> 3. These types of messages are seen on console (naturally the CDB and
> request type will vary -- in this case it was because I was doing the dd
> zero'ing, thus tickling the bad sector/naughty firmware on the drive):
> 
> Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0
> Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
> Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset...
> Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113
> Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found
> Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED.  ACB: 61 80 80 77 01 40 00 00 00 00 00 00
> Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
> Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command
> 
> 4. Any I/O submit to ada5 during this time blocks (this is normal).
> 
> 5. **While this situation is happening**, something using xpt(4)
> attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5).
> This request also blocks (again, normal).
> 
> 6. Physical device falls off bus, or CAM kicks the disk off the bus.
> Doesn't matter which.  We see messages resembling this (boy am I tired
> of this interspersed output problem):
> 
> Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device
> Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device
> Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry
> Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone
> 
> 7. Standard I/O requests fail with errno=6 "Device not configured".
> xpt(4) requests also fail with the same errno.
> 
> 8. Device-wise, at this stage all we have is:
> 
> root at icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> 
> 9. Device comes back online for whatever reason.  FreeBSD sees the disk,
> blah blah blah:
> 
> Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5
> Jan 25 16:30:16 icarus kernel: ada5: <WDC WD1500ADFD-00NLR4 21.07QR4> ATA-7 SATA 1.x device
> Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589
> Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled
> Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C)
> Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14
> 
> ...um, where's pass5?
> 
> 10. /dev/pass5 is now completely (permanently) missing:
> 
> root at icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> crw-r-----  1 root  operator  0x99 Jan 25 16:42 /dev/ada5
> crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> 
> 11. Any further attempts to communicate via xpt(4) with ada5 fail.
> Detaching and reattaching the disk does not fix the issue; the only fix
> is to reboot the system.
> 
> 12. "camcontrol debug -IPXp scbus5" results in tons and tons of output
> all pertaining to xpt(4).  It looks like xpt(4) is in some kind of
> loop.
> 
> Below is my verbose boot (with non-kernel things removed), which
> also includes "camcontrol debug" output once things are in a bad state:
> 
> http://jdc.koitsu.org/freebsd/xpt_oddity.log
> 
> In this log you'll see that after 1 CAM timeout I yanked the drive, then
> roughly 30 seconds later reinserted it.
> 
> If you need me to turn on CAM debugging *prior* to the above, I can do
> that, just let me know.
> 
> The important step is #5.  Without that, the problem shown in #9/10/11
> does not happen.
> 
> It's a good thing I don't run smartd(8) -- most users I see using that
> software set the interval to something like 180s or 60s.  Imagine this
> frustration: "okay so the disk fell off the bus, but what, now I can't
> talk to it with SMART?  Uhhh... <reboots>  Err, works now?  Whatever".

I think, the problem may already be fixed in HEAD by r244014 by ken at .
I've just merged it to 9-STABLE at r247115. So if it is still possible
to reproduce the situation, it would be good to try.

-- 
Alexander Motin


More information about the freebsd-fs mailing list