siisch1: Error while READ LOG EXT

Wed Feb 8 21:27:25 UTC 2012

On Wed, Feb 08, 2012 at 04:00:57PM -0500, Mike Tancsa wrote:
> I have a 4 port eSata PCIe card with 3 external port multipliers attached on an AMD64 box (8G of RAM), RELENG8 from Feb1st.
> 
> siis0 at pci0:5:0:0:       class=0x010400 card=0x71241095 chip=0x31241095 rev=0x02 hdr=0x00
>     vendor     = 'Silicon Image Inc (Was: CMD Technology Inc)'
>     device     = 'PCI-X to Serial ATA Controller (SiI 3124)'
>     class      = mass storage
>     subclass   = RAID
>     bar   [10] = type Memory, range 64, base 0xb4408000, size 128, enabled
>     bar   [18] = type Memory, range 64, base 0xb4400000, size 32768, enabled
>     bar   [20] = type I/O Port, range 32, base 0x3000, size 16, enabled
>     cap 01[64] = powerspec 2  supports D0 D1 D2 D3  current D0
>     cap 07[40] = PCI-X 64-bit supports 133MHz, 2048 burst read, 12 split transactions
>     cap 05[54] = MSI supports 1 message, 64 bit enabled with 1 message
> 
> siis0: <SiI3124 SATA controller> port 0x3000-0x300f mem 0xb4408000-0xb440807f,0xb4400000-0xb4407fff irq 19 at device 0.0 on pci5
> siis0: [ITHREAD]
> siisch0: <SIIS channel> at channel 0 on siis0
> siisch0: [ITHREAD]
> siisch1: <SIIS channel> at channel 1 on siis0
> siisch1: [ITHREAD]
> siisch2: <SIIS channel> at channel 2 on siis0
> siisch2: [ITHREAD]
> siisch3: <SIIS channel> at channel 3 on siis0
> siisch3: [ITHREAD]
> 
> # camcontrol devlist
> <WDC WD2001FASS-00U0B0 01.00101>   at scbus0 target 0 lun 0 (pass0,ada0)
> <WDC WD2001FASS-00U0B0 01.00101>   at scbus0 target 1 lun 0 (pass1,ada1)
> <WDC WD2001FASS-00U0B0 01.00101>   at scbus0 target 2 lun 0 (pass2,ada2)
> <WDC WD2001FASS-00U0B0 01.00101>   at scbus0 target 3 lun 0 (pass3,ada3)
> <Port Multiplier 47261095 1f06>    at scbus0 target 15 lun 0 (pass4,pmp1)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 0 lun 0 (pass5,ada4)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 1 lun 0 (pass6,ada5)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 2 lun 0 (pass7,ada6)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 3 lun 0 (pass8,ada7)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 4 lun 0 (pass9,ada8)
> <Port Multiplier 37261095 1706>    at scbus1 target 15 lun 0 (pass10,pmp0)
> <Areca usrvar R001>                at scbus4 target 0 lun 0 (pass11,da0)
> <Areca backup1 R001>               at scbus4 target 0 lun 1 (pass12,da1)
> <Areca RAID controller R001>       at scbus4 target 16 lun 0 (pass13)
> <AMCC 9650SE-2LP DISK 4.10>        at scbus5 target 0 lun 0 (pass14,da2)
> <ST31000333AS SD35>                at scbus6 target 0 lun 0 (pass15,ada9)
> <ST31000528AS CC35>                at scbus7 target 0 lun 0 (pass16,ada10)
> <ST31000340AS SD1A>                at scbus8 target 0 lun 0 (pass17,ada11)
> <WDC WD1002FAEX-00Z3A0 05.01D05>   at scbus11 target 0 lun 0 (pass18,ada12)
> 
> 
> Ever since I added a new PM, I have been seeing a new error (READ LOG EXT) along with a the odd slot timeout error.
> 
> 
> Feb  7 23:49:32 backup3 kernel: siisch1:  ... waiting for slots 47000000
> Feb  7 23:49:32 backup3 kernel: siisch1: Timeout on slot 26
> Feb  7 23:49:32 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> Feb  7 23:49:32 backup3 kernel: siisch1:  ... waiting for slots 43000000
> Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 30
> Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> Feb  7 23:49:34 backup3 kernel: siisch1:  ... waiting for slots 03000000
> Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 25
> Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> Feb  7 23:49:34 backup3 kernel: siisch1:  ... waiting for slots 01000000
> Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 24
> Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000

This indicates the controller on channel 1 (siisch1) is "stalled"
waiting for underlying communication with the device attached to it.

> Feb  7 23:57:59 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 00:13:36 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 00:21:53 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 00:22:16 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 00:39:13 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 01:24:25 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 01:33:52 backup3 last message repeated 2 times
> Feb  8 01:43:45 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 01:50:31 backup3 last message repeated 2 times
> Feb  8 01:55:20 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 02:26:26 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 02:27:24 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 03:16:28 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 03:36:20 backup3 kernel: siisch1: Error while READ LOG EXT
> Feb  8 04:04:05 backup3 kernel: siisch1: Error while READ LOG EXT

This indicates the underlying device was handed a READ LOG EXT ATA
command (command 0x2f) and the device did not respond promptly
(resulting in the timeout messages you see).

> smartctl doesnt show any issues on the drives other than one that has some historical errors from a while ago.  What are these errors and do I need to worry about them ? The "READ LOG EXT" ones are new.
>
> {snipping SMART stats}

You're focused heavily on the READ LOG EXT command.  READ LOG EXT is
intended for accessing the GP Log section of a drive.  EXT stands for
"Extended".  "GP Log" means "General Purpose Log", and is where all
sorts of logging information regarding drive performance is stored.
It's usually stored within a reserved section of the platters, or in the
HPA area.  It's not within a "standard" user-accessible LBA/sector
region.  This is a completely separate log from that of SMART logs.

You can review the different types of "logs" on a device by reviewing
the ATA8-ACS specification here.  See Annex A, section A.1, page 362:

http://www.t13.org/documents/UploadedDocuments/docs2007/D1699r4a-ATA8-ACS.pdf

This is almost certainly a lower level problem with the disk that cannot
be addressed/solved via normal means.  Thus, my recommendation is to
replace the disk.

If you would rather not replace the disk, I can try to step you through
looking at the GPLog sections of the disk to see if you can trigger the
problem -- and I have a feeling you'll be able to, but I won't
necessarily be able to tell you where the actual problem lies
hardware-wise, nor will I be able to solve the problem.

Regarding the repeated errors at semi-regular (but not entirely)
intervals: are you using smartd?  Do you have a cronjob that issues
smartctl -a or smartctl -x commands at intervals?  I imagine any of
these could be tickling something lower level.

Also, please upgrade your smartmontools to 5.42.  It does provide some
further enhancements that are useful.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |