Need help diagnosing hardware failure

Doug Poland doug at polands.org
Mon May 10 08:52:22 PDT 2004


Hello,

Upon returning from a week's vacation, I was dismayed to find my home
file server (running 4.8-STABLE) had crashed.  The box in question has
an Adaptec host adapter

ahc0: <Adaptec 2940A Ultra SCSI adapter> port 0xf800-0xf8ff mem 0xfedfe000-0xfedfefff irq 10 at device 13.0 on pci0
aic7860: Ultra Single Channel A, SCSI Id=7, 3/253 SCBs

and seven identical SCSI drives 

judeah# dmesg | grep IBMRAID
da0: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da2: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da3: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da4: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da6: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
da5: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 


in a vinum striped volume...

judeah# more /etc/vinum.conf 
drive a device /dev/da0e
drive b device /dev/da1e
drive c device /dev/da2e
drive d device /dev/da3e
drive e device /dev/da4e
drive f device /dev/da5e
drive g device /dev/da6e

volume dataraid
  plex org striped 256k
      sd length 1920m drive a
      sd length 1920m drive b
      sd length 1920m drive c
      sd length 1920m drive d
      sd length 1920m drive e
      sd length 1920m drive f
      sd length 1920m drive g



Perusal of /var/log/messages shows...

May  3 11:17:31 judeah /kernel: (da1:ahc0:0:1:0): SCB 0x5a - timed out
May  3 11:17:31 judeah /kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
May  3 11:17:31 judeah /kernel: ahc0: Dumping Card State while idle, at SEQADDR 0x7
May  3 11:17:31 judeah /kernel: Card was paused
May  3 11:17:31 judeah /kernel: ACCUM = 0x97, SINDEX = 0x52, DINDEX = 0x8c, ARG_2 = 0x0
May  3 11:17:31 judeah /kernel: HCNT = 0x0 SCBPTR = 0x1
May  3 11:17:31 judeah /kernel: SCSISIGI[0x0] ERROR[0x40] SCSIBUSL[0x0] LASTPHASE[0x1] 
May  3 11:17:31 judeah /kernel: SCSISEQ[0x12] SBLKCTL[0x0] SCSIRATE[0x0] SEQCTL[0x10] 
May  3 11:17:31 judeah /kernel: SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0] 
May  3 11:17:31 judeah /kernel: SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80] 
May  3 11:17:31 judeah /kernel: DFCNTRL[0x0] DFSTATUS[0x29] 
May  3 11:17:31 judeah /kernel: STACK: 0x0 0x166 0x109 0x3
May  3 11:17:31 judeah /kernel: SCB count = 130
May  3 11:17:31 judeah /kernel: Kernel NEXTQSCB = 30
May  3 11:17:31 judeah /kernel: Card NEXTQSCB = 30
May  3 11:17:31 judeah /kernel: QINFIFO entries: 
May  3 11:17:31 judeah /kernel: Waiting Queue entries: 
May  3 11:17:31 judeah /kernel: Disconnected Queue entries: 2:90 
May  3 11:17:31 judeah /kernel: QOUTFIFO entries: 
May  3 11:17:31 judeah /kernel: Sequencer Free SCB List: 1 0 
May  3 11:17:31 judeah /kernel: Sequencer SCB Info: 
May  3 11:17:31 judeah /kernel: 0 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff] 
May  3 11:17:31 judeah /kernel: 1 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff] 
May  3 11:17:31 judeah /kernel: 2 SCB_CONTROL[0x66] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0x5a] 
May  3 11:17:31 judeah /kernel: Pending list: 
May  3 11:17:31 judeah /kernel: 90 SCB_CONTROL[0x62] SCB_SCSIID[0x17] SCB_LUN[0x0] 
May  3 11:17:31 judeah /kernel: Kernel Free SCB list: 82 88 14 115 12 83 120 92 45 8 16 5 59 124 31 29 38 18 73 42 93 64 19 7 74 100 113 75 24 3 86 71 20 108 6 67 68 125 105 97 110 34 54 87 106 25 61 109 123 47 44 66 53 94 84 76 65 77 72 9 69 32 17 55 119 1 22 91 4 112 56 27 102 62 13 15 128 50 33 51 81 37 57 28 99 117 85 36 41 11 121 49 0 80 35 39 40 95 26 96 10 58 118 122 127 111 2 126 70 98 89 21 60 46 48 78 43 101 23 79 52 63 129 103 104 107 116 114 
May  3 11:17:31 judeah /kernel: 
May  3 11:17:31 judeah /kernel: <<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
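
As a cross-check on the dump, the SCB_SCSIID bytes can be decoded to see
which target each SCB belongs to.  The sketch below assumes the
single-channel aic7xxx layout (target ID in the high nibble, host adapter
ID in the low nibble); that layout is not stated anywhere in the log, and
decode_scsiid is only a name made up for illustration.

```python
# Decode SCB_SCSIID bytes from the "Sequencer SCB Info" part of the dump.
# ASSUMPTION: target ID lives in the high nibble and the initiator (host
# adapter) ID in the low nibble.  The log itself does not say this.

def decode_scsiid(scsiid):
    """Return (target_id, initiator_id) for one SCB_SCSIID byte."""
    target_id = (scsiid >> 4) & 0x0F
    initiator_id = scsiid & 0x0F
    return target_id, initiator_id

# (SCB index, SCB_SCSIID, SCB_TAG) exactly as printed in the dump
sequencer_scbs = [(0, 0x67, 0xFF), (1, 0x67, 0xFF), (2, 0x17, 0x5A)]

for index, scsiid, tag in sequencer_scbs:
    target, initiator = decode_scsiid(scsiid)
    print(f"SCB {index}: target {target}, initiator {initiator}, "
          f"tag {tag:#x} ({tag} decimal)")
```

Under that assumption, SCB 2 (tag 0x5a, which is 90 decimal, the same SCB
listed in the disconnected queue and the pending list) decodes to target 1,
i.e. da1, with initiator ID 7, which agrees with the "SCSI Id=7" line from
the adapter probe.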


The box rebooted and failed to come up to its normal state because the
vinum volume that was running off this SCSI disk system failed to load.

May  3 11:22:01 judeah /kernel: sg[0] - Addr 0x1ddd000 : Length 4096
May  3 11:22:01 judeah /kernel: sg[1] - Addr 0x7be000 : Length 4096
May  3 11:22:01 judeah /kernel: (da1:ahc0:0:1:0): no longer in timeout, status = 34b
May  3 11:22:01 judeah /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
May  3 11:22:01 judeah /kernel: vinum: dataraid.p0.s1 is stale by force
May  3 11:22:01 judeah /kernel: vinum: dataraid.p0 is corrupt
May  3 11:22:01 judeah /kernel: fatal :dataraid.p0.s1 write error, block 1905465 for 8192 bytes
May  3 11:22:01 judeah /kernel: dataraid.p0.s1: user buffer block 13336624 for 8192 bytes
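
The two block numbers in the vinum error can be checked against the stripe
layout from /etc/vinum.conf.  This is a rough sketch, not vinum's actual
code: map_volume_block is a made-up helper, and the 265-sector offset for
the vinum label and configuration at the start of each drive is an
assumption that does not appear anywhere in the log.

```python
SECTOR_BYTES = 512
STRIPE_SECTORS = 256 * 1024 // SECTOR_BYTES  # "plex org striped 256k" -> 512 sectors
SUBDISKS = 7                                 # drives a..g, subdisks s0..s6

# ASSUMPTION: vinum keeps its label and configuration in the first 265
# sectors of each drive, so subdisk data starts after that.
VINUM_HEADER_SECTORS = 265

def map_volume_block(block):
    """Map a volume block to (subdisk index, subdisk offset, drive block)."""
    stripe, within = divmod(block, STRIPE_SECTORS)   # which stripe, offset inside it
    row, subdisk = divmod(stripe, SUBDISKS)          # stripes rotate across subdisks
    sd_offset = row * STRIPE_SECTORS + within
    return subdisk, sd_offset, sd_offset + VINUM_HEADER_SECTORS

subdisk, sd_offset, drive_block = map_volume_block(13336624)
print(f"subdisk s{subdisk}, subdisk offset {sd_offset}, drive block {drive_block}")
```

With those assumptions, user buffer block 13336624 lands on subdisk s1 at
drive block 1905465, the same block the write error reports, so the failed
write does appear to belong to the drive behind dataraid.p0.s1 (drive b,
/dev/da1e).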


It looks like SCSI disk da1 was timing out but recovered.  This is
speculation on my part.  Upon rebooting today, da1 seems to be OK?

May 10 07:03:00 judeah /kernel: da1 at ahc0 bus 0 target 1 lun 0
May 10 07:03:00 judeah /kernel: da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device 
May 10 07:03:00 judeah /kernel: da1: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
May 10 07:03:00 judeah /kernel: da1: 1920MB (3933040 512 byte sectors: 255H 63S/T 244C)
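
A quick arithmetic check of the probe line, just to confirm the drive
still reports the size and geometry expected of it.  Nothing here is
specific to FreeBSD; the variable names are only for illustration.

```python
# Figures copied from the da1 probe line
SECTORS = 3933040
SECTOR_BYTES = 512
HEADS, SECTORS_PER_TRACK, CYLINDERS = 255, 63, 244

size_bytes = SECTORS * SECTOR_BYTES
size_mb = size_bytes // 2**20            # whole megabytes (2^20 bytes)

# Sectors covered by the reported C/H/S geometry, and what is left over
chs_sectors = HEADS * SECTORS_PER_TRACK * CYLINDERS
leftover = SECTORS - chs_sectors

print(f"{size_bytes} bytes = {size_mb} MB")
print(f"C/H/S covers {chs_sectors} sectors, {leftover} beyond the last full cylinder")
```

The sector count works out to the reported 1920MB, and the remainder is
smaller than one cylinder (255 * 63 = 16065 sectors), so the geometry line
is at least self-consistent.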


So, the question: do I have a hardware failure?  If so, is it the
Adaptec 2940A Ultra controller or the SCSI disk?  When I get this
resolved, I'll obviously have to figure out how to fix my corrupt
vinum volume :(


-- 
Regards,
Doug



More information about the freebsd-questions mailing list