System lockups caused by USB external HDD

Wed Jan 26 08:48:24 UTC 2011

On 01/24/11 13:27, Hans Petter Selasky wrote:
> On Monday 24 January 2011 12:08:47 CDP wrote:
>> On 01/24/11 11:34, Hans Petter Selasky wrote:
>>> On Monday 24 January 2011 10:00:53 CDP wrote:
>>>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>>>> On 24/01/2011, at 9:10, CDP wrote:
>>>>>> g_vfs_done():da0s2[WRITE(offset=xxxxxxxxxxxx, length=16384)]error = 5
>>>>>> [several more lines similar to the above]
>>>>>> panic: softdep_move_dependencies: need merge code
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0x... at kdb_backtrace+0x5e
>>>>>> #1 0x... at panic+0x182
>>>>>
>>>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>>>> cause the latter).
>>>>>
>>>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>>>> reject SMART commands so you might not be able to :(
>>>>
>>>> I have attached the output of smartctl -d sat -a /dev/da0. I haven't yet
>>>> run a SMART long test, for the simple reason that the disk goes into
>>>> sleep mode, which interrupts the test. I haven't bothered to keep it
>>>> awake for a long test, but I might just do that.
>>>>
>>>> However, I doubt it's a disk failure, since I do backups on it without
>>>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>>>> fails. And I am talking about over 150GB of data in one run, while
>>>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>>>> past, on SATA, and a few read/write errors never caused a system lockup.
>>>>
>>>> My feeling is that enough traffic on USB causes the problem, and that
>>>> this problem is only present in the new USB stack.
>>>> Unfortunately downgrading to 7.x is not an option because there are
>>>> things that won't work on this notebook.
>>>
>>> If you run a simple test like this:
>>>
>>> dd if=/dev/da0 of=/dev/null bs=65536
>>> dd if=/dev/da0 of=/dev/null bs=16384
>>>
>>> Do you then see any errors?
>>>
>>> Do you have a spare USB memory stick which you could run similar write
>>> tests on?
>>
>> Both reads fail with I/O error, while writes to an unused partition seem
>> to be fine (I interrupted the writes after a while):
>>
>> % dd if=/dev/da0 of=/dev/null bs=65536
>> dd: /dev/da0: Input/output error
>> 191732+0 records in
>> 191732+0 records out
>> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>>
>> % dd if=/dev/da0 of=/dev/null bs=16384
>> dd: /dev/da0: Input/output error
>> 126427+0 records in
>> 126427+0 records out
>> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=65536
>> ^C329378+0 records in
>> 329377+0 records out
>> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=16384
>> ^C679571+0 records in
>> 679571+0 records out
>> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>>
>> This is what I get in /var/log/messages when the I/O error occurs:
>> (da0:umass-sim0:0:0:0): AutoSense failed
>>
>> However, I experience no lockup. Maybe this situation is not handled
>> correctly at another level?
> 
> I haven't looked into the CAM or GEOM code that much, so I won't say too
> much about that. I believe USB/umass is not to blame. What you could do is
> add a conditional error printout in "umass_t_bbb_status_callback()" in
> /sys/dev/usb/storage/umass.c for when the error happens. If that error is
> not a USB transport error, then we are most likely seeing a SCSI issue in
> the layers above umass. Or, if you have access to a USB analyser, use that.
> There is now also the option to trace USB from the kernel itself, but that
> feature is still in early development.


You are right, I've tracked the problem down to CAM (cam_periph.c:
camperiphsensedone()).
I've changed the code to behave as it did in 7.3, and that mitigates the
problem. I no longer get "AutoSense failed" errors or any lockups/crashes,
not even when using softupdates on the external HDD. The pauses in disk
operations still happen, but they don't seem to cause any further issues.
I haven't looked into them.

I've attached a patch. I don't know if this behavior is correct, and I
hope someone who knows CAM can take a look at this issue.

Claudiu.


-------------- next part --------------

--- sys/cam/cam_periph.c.orig	2011-01-26 09:38:21.000000000 +0200
+++ sys/cam/cam_periph.c	2011-01-26 09:38:02.000000000 +0200
@@ -1024,7 +1024,9 @@
 	int		frozen = 0;
 	u_int		sense_key;
 	int		depth = done_ccb->ccb_h.recovery_depth;
+	int		xpt_done_ccb;
 
+	xpt_done_ccb = FALSE;
 	status = done_ccb->ccb_h.status;
 	if (status & CAM_DEV_QFRZN) {
 		frozen = 1;
@@ -1049,14 +1051,22 @@
 		if (sense_key != SSD_KEY_NO_SENSE) {
 			saved_ccb->ccb_h.status |=
 			    CAM_AUTOSNS_VALID;
-		} else {
+
+			xpt_done_ccb = TRUE;
+		} /*else {
 			saved_ccb->ccb_h.status &=
 			    ~CAM_STATUS_MASK;
 			saved_ccb->ccb_h.status |=
 			    CAM_AUTOSENSE_FAIL;
-		}
+		}*/
 		bcopy(saved_ccb, done_ccb, sizeof(union ccb));
 		xpt_free_ccb(saved_ccb);
+
+		periph->flags &= ~CAM_PERIPH_RECOVERY_INPROG;
+
+		if (xpt_done_ccb == FALSE)
+			xpt_action(done_ccb);
+
 		break;
 	}
 	default:
@@ -1084,7 +1094,9 @@
 	 */
 	if (frozen != 0)
 		done_ccb->ccb_h.status |= CAM_DEV_QFRZN;
-	(*done_ccb->ccb_h.cbfcnp)(periph, done_ccb);
+
+	if (xpt_done_ccb == TRUE)
+		(*done_ccb->ccb_h.cbfcnp)(periph, done_ccb);
 }
 
 static void