System lockups caused by USB external HDD

CDP dr.clau at gmail.com
Tue Jan 25 00:48:09 UTC 2011


On 01/24/11 13:27, Hans Petter Selasky wrote:
> On Monday 24 January 2011 12:08:47 CDP wrote:
>> On 01/24/11 11:34, Hans Petter Selasky wrote:
>>> On Monday 24 January 2011 10:00:53 CDP wrote:
>>>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>>>> On 24/01/2011, at 9:10, CDP wrote:
>>>>>> g_vfs_done():da0s2[WRITE(offset=xxxxxxxxxxxx, length=16384)]error = 5
>>>>>> [several more lines similar to the above]
>>>>>> panic: softdep_move_dependencies: need merge code
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0x... at kdb_backtrace+0x5e
>>>>>> #1 0x... at panic+0x182
>>>>>
>>>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>>>> cause the latter).
>>>>>
>>>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>>>> reject SMART commands so you might not be able to :(
>>>>
>>>> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
>>>> run a SMART long test for the simple reason that the disk is going into
>>>> sleep mode and interrupts it. Haven't bothered to keep it alive for a
>>>> long test but I might just do that.
>>>>
>>>> Although, I doubt it's a disk failure, since I do backups on it without
>>>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>>>> fails. And I am talking about over 150GB of data in one run, while
>>>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>>>> past, on SATA, and a few read/write errors never caused a system lockup.
>>>>
>>>> My feeling is that enough traffic on USB causes the problem, and that
>>>> this problem is only present in the new USB stack.
>>>> Unfortunately downgrading to 7.x is not an option because there are
>>>> things that won't work on this notebook.
>>>
>>> If you run a simple test like this:
>>>
>>> dd if=/dev/da0 of=/dev/null bs=65536
>>> dd if=/dev/da0 of=/dev/null bs=16384
>>>
>>> Do you then see any errors?
>>>
>>> Do you have a spare USB memory stick which you could run similar write
>>> tests on?
>>
>> Both reads fail with I/O error, while writes to an unused partition seem
>> to be fine (I interrupted the writes after a while):
>>
>> % dd if=/dev/da0 of=/dev/null bs=65536
>> dd: /dev/da0: Input/output error
>> 191732+0 records in
>> 191732+0 records out
>> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>>
>> % dd if=/dev/da0 of=/dev/null bs=16384
>> dd: /dev/da0: Input/output error
>> 126427+0 records in
>> 126427+0 records out
>> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=65536
>> ^C329378+0 records in
>> 329377+0 records out
>> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=16384
>> ^C679571+0 records in
>> 679571+0 records out
>> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>>
>> This is what I get in /var/log/messages when the I/O error occurs:
>> (da0:umass-sim0:0:0:0): AutoSense failed
>>
>> However, I experience no lockup. Maybe this situation is not handled
>> correctly at another level?
> 
> I haven't looked into the code of CAM or GEOM that much so I won't say too 
> much about that. I believe the USB/umass is not to blame. What you could do is 
> to add a conditional error printout in "umass_t_bbb_status_callback()" in 
> /sys/dev/usb/storage/umass.c when the error happens. If that error is not a 
> USB transport error, then we are most likely seeing a SCSI issue in layers 
> above umass. Or, if you have access to a USB analyser, use that. There is now also 
> the option to trace USB from the kernel itself, but the feature is in its 
> early development.
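
For anyone who wants to repeat the dd read tests quoted above and see exactly where the read stops, here is a rough sketch in Python. It only reads a device or file sequentially in fixed-size blocks and reports how many bytes were read before the first I/O error. The function name scan_blocks and its return shape are made up for illustration; it is not a replacement for dd and it does not touch umass or CAM.

```python
import os


def scan_blocks(path, block_size=65536):
    """Read `path` sequentially in `block_size` chunks.

    Returns (bytes_read, error): `error` is None if end of file was
    reached cleanly, otherwise the OSError raised by the failing read.
    (Hypothetical helper, named here only for illustration.)
    """
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                chunk = os.read(fd, block_size)
            except OSError as exc:
                # e.g. EIO, the "Input/output error" that dd reports
                return bytes_read, exc
            if not chunk:
                # Empty read means end of file / end of device
                return bytes_read, None
            bytes_read += len(chunk)
    finally:
        os.close(fd)
```

Calling scan_blocks("/dev/da0", 65536) and then scan_blocks("/dev/da0", 16384) mirrors the two dd invocations; a non-None error together with bytes_read gives the offset at which the read failed.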


The panics I was able to catch and inspect (the latest in
add_to_worklist(), ffs_softdep.c) all came from the FFS soft updates
code, so I tried disabling soft updates.
The system no longer panics. Operations on the USB HDD still stall, but
after several tens of seconds the system logs the 'AutoSense failed'
error and a bunch of write errors, and then the copy operation resumes.
md5 shows the copied files are identical to the source files.
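
The md5 comparison mentioned above can be scripted roughly as follows. This is only a sketch of the check, not the exact commands that were run; the helper names md5_of and same_content are invented for illustration.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(src, dst):
    """True if the source file and its copy have the same MD5 digest."""
    return md5_of(src) == md5_of(dst)
```

Running same_content() on each source file and its copy on the USB disk gives a quick yes/no answer per file after the copy has resumed and finished.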

In 7.x I don't recall seeing any errors or temporary stalls in disk
operations, so my guess is that the 'AutoSense failed' situation is
handled differently in 8.x than in 7.x.

Claudiu.



More information about the freebsd-usb mailing list