SIIS timeout with current r197392:

Fri Sep 25 09:26:00 UTC 2009

----- Original Message ----- 
From: "James R. Van Artsdalen" <james-freebsd-current at jrv.org>
> Pegasus Mc Cleaft wrote:
>> Hello Current,
>>
>> Since my latest build of amd64-current kernel and world (r197392) I am
>> getting strange timeout errors in my dmesg and eventual system 
>> instability.
>
> I believe mav has stated that error handling in SIIS isn't finished or
> is problematic in some way.
>
> I see similar problems: most of the time an error results in a hung
> device, requiring reboot.  This usually happens within a TB or two of
> intense I/O: I have not yet seen a 6 TB ZFS pool complete a "scrub" due
> to this.

Hi James,

    I believe I found the problem with my machine, and it _was_ my machine. 
The device that was hanging is an Asus CD-ROM drive. The error messages 
displayed were correct: I had a faulty SATA cable between the controller and 
the drive (funny how a SATA cable can go bad spontaneously). Reboots of the 
system did not clear the fault, but a full power down and power up would 
mask the fault for about an hour, and then it would start throwing the 
messages into the log every few seconds. It was this behaviour that led me 
to believe it was a problem with the SIIS driver. It wasn't until I noticed 
that on a reboot the system hung for a little while during POST while 
interrogating the drive that I suspected the hardware. After a cable change 
and a lot of swearing, the computer booted fine and the error has never 
reappeared.

    Some lessons learned:

    1) Debug messages _MAY_ actually be telling the truth!  :>
    2) Reboots and software resets won't be heard by a SATA device whose 
port has been scrambled by bad cabling.
    3) SATA cables may spontaneously disintegrate.
    4) Modern computers respond less to threats than my older machines :>

    This being said, I have seen the other fault where a device hangs during 
high load / activity.  Mine will, if it is going to do it, hang somewhere 
between midnight and 3am when I am running maintenance on the machine (find 
/ -name "*.core" -exec rm {} \; ).   It does exactly as you said: a 
drive hangs, usually with the activity LED still lit.  Sometimes the machine 
will continue on and ZFS will carry on in a degraded state. The odd thing 
about this is that it only started to do this when I was having problems with 
the CD-ROM on the SIIS card. The ZFS drives are on a completely different 
controller (JMicron). When the SIIS controller was waiting for the scrambled 
port to say hello, all sorts of weird things would happen.  The mouse would 
lock up for 2 seconds, the keyboard would lock, and if a key was 
pressed when it happened, it would trigger the key-repeat (much to my 
amusement while reading email: hit the delete key, have the keyboard 
hang, and watch it quickly run through deleting everything in my in-box :> ). 
My query is: when it is hung, waiting for the SATA port to respond, could it 
be possible for the JMicron ports to miss an event, or get a double IRQ, and 
cause the device to lock?

Best wishes,
Peg