FreeBSD 4.8, ASR2120, SMP, degraded RAID1/mirror => storage failure

rysanek at fccps.cz rysanek at fccps.cz
Thu Oct 2 01:24:28 PDT 2003


Dear Mr. Long,

during the last week or so, I've been trying to get more information
about the problem in 4.8-RELEASE that this thread has been about.

Yesterday, just when I thought that maybe I had some interesting data
worth sending to you, I noticed that 4.9-RC1 was out. So I tested for the
symptoms in that and I have to admit that

           IT WORKS IN 4.9-RC1 !   Wonderful!

I.e., the machine does boot from a rebuilding array and array degradation
at runtime doesn't make the machine hang because of storage failure.
All of that with SMP, APIC_IO and HT enabled.
No need to include the irrelevant ISP driver (see the attachments for
an explanation of this comment).


In the newsgroups, I've noticed other people complaining about various
aac-based hardware under FreeBSD.
I am aware that there have been changes to dev/aac/* between 4.8 and 4.9.
I'm not sure whether you have managed to squash the bug, or if the remedy
was incidental, and whether or not the bug was in the aac drivers or
elsewhere in the system (APIC handling? DMA mapping?).
Therefore, just in case you were interested, the information I gathered in
4.8 is attached to this message.

Hmm. Now that this is solved, I'd like to focus on the defunct aaccli.
I guess I'd better start another thread related to that.

Thanks for the great job that you're doing in the FreeBSD team.
And, thanks for your patience with me.

Frank Rysanek
-------------- next part --------------
The following problem description applies exclusively to
4.8-RELEASE.

I have managed to carry out further research related to the topic of this 
e-mail thread, in the original 4.8-RELEASE. I have found some more 
deterministic symptoms of the problem, one other dependency apart from 
SMP+APIC_IO, but I'm stuck again.
A detailed explanation follows.


PROBLEM SUMMARY
---------------
With the GENERIC kernel, the ASR2120 (driver AAC) works fine under all 
circumstances.
With SMP+APIC_IO enabled or with "device isp" disabled, the controller
and the driver work fine as long as the array volumes are fine.
When an array becomes degraded at runtime, or when booting off a degraded 
(especially rebuilding) array, the system crashes miserably.
The bottom line is that my ASR2120 is not fault tolerant under SMP.


DISCOVERED CFG DEPENDENCIES
---------------------------
The ASR2120 controller and the aac driver WORK FINE only when both of
these conditions hold:
1) SMP+APIC_IO are disabled (a UP-only kernel)
2) 'device isp' is _enabled_ (though its PCI probe doesn't find anything)

And no, I don't have any Qlogic chips in my machine.
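
For reference, the one workable combination might look like this in a
4.x-style kernel config file. This is only a sketch of the relevant lines,
not the actual config used for these tests; the comments are illustrative.

```
# Workable: uniprocessor kernel, 'device isp' left in
#options 	SMP			# left commented out
#options 	APIC_IO			# left commented out
device		isp			# no Qlogic hardware present, yet needed
device		aac			# driver for the ASR2120

# Any of the defunct variants: enable SMP and APIC_IO above,
# and/or comment out 'device isp'.
```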


SYMPTOMS
--------
1) the "zero-padded FIB" AKA "unknown command from controller" upon 
   runtime disk fault or when booting from a rebuilding array.
   This is the original symptom.

2) During the kernel boot sequence, still with interrupts disabled, when 
   aac_startup() probes for containers, it finds none - as a result of
   the controller being stuck as per symptom 3).

3) let's focus on booting: during aac_init(), when the controller
   is notified of a ready "mailbox" for the first time, the controller
   pukes - the drive LEDs start flashing red and symptom 2) follows.
   Once interrupts are enabled, symptom 1) follows.
   I have discovered the precise moment when the controller pukes
   by inserting debug messages and DELAY(10 s) statements at various
   points in the code of aac_attach(), aac_init(), aac_sync_command()
   etc...
   Obviously the red flashing doesn't occur in the workable setup
   - the controller keeps rebuilding the array(s) merrily throughout
   FreeBSD boot.

4) with the working setup (see the CFG DEPENDENCIES section above),
   the MMIO assumes a different config than with any defunct setup.
   The physical address of the FIBs (or whatever it is) is different,
   I don't know why. See the two log snippets below. See also the
   attached tarball for more complete boot logs.
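
The "debug message plus long pause" technique from symptom 3) can be
sketched roughly as follows. This is a userspace stand-in, not the code
that was actually added to aac.c: frr_checkpoint() and frr_delay() are
hypothetical names, and frr_delay() merely records the requested pause
where a kernel build would call DELAY(), which takes microseconds.

```c
#include <stdarg.h>
#include <stdio.h>

/* Total pause requested so far, in microseconds (stand-in for DELAY()). */
static unsigned long frr_delay_requested_us;

/* Hypothetical stand-in: in the kernel this would be DELAY(usec). */
static void frr_delay(unsigned long usec)
{
    frr_delay_requested_us += usec;
}

/*
 * Print one "FRR: ..." checkpoint line, then pause so that the drive
 * LEDs can be watched before the next step runs.
 * Returns the number of characters printed.
 */
static int frr_checkpoint(unsigned long pause_us, const char *fmt, ...)
{
    va_list ap;
    int n = printf("FRR: ");

    va_start(ap, fmt);
    n += vprintf(fmt, ap);
    va_end(ap);
    n += printf("\n");

    frr_delay(pause_us);
    return n;
}
```

A call such as frr_checkpoint(10UL * 1000 * 1000, "arg0: %x", 0x2000u)
before each hardware access makes it possible to tell which step
immediately precedes the LEDs going red.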


The following are some variable dumps, logged by instrumentation that
I have inserted into aac.c.


This is the workable setup (UP && 'device isp' enabled):

FRR:  generic attach - aac_attach() called
FRR:   Disabling interrupts.
FRR:    aac_init(): initing controller
FRR:  -- Init structure contents: --
FRR:    aac_common is at e1980000
FRR:    ac_init is at    e1981000
FRR:    InitStructRevision = 3
FRR:    MiniPortRevision = 1
FRR:    AdapterFibsPhysicalAddress = 1c000
FRR:    AdapterFibsVirtualAddress = e1980000
FRR:    AdapterFibsSize = 1000
FRR:    AdapterFibAlign = 200
FRR:    PrintfBufferAddress = 1f184
FRR:    PrintfBufferSize = 100
FRR:    HostPhysMemPages = 3fccf
FRR:    HostElapsedSeconds = 0
FRR:   setting the outbound doorbell register to all one's.
FRR:    aac_common_busaddr = 1c000
FRR:    ac_init offset = 1000
FRR:   aac_sync_command() called.
FRR:   - populating the mailbox...
FRR:    aac_rx_set_mailbox() called.
FRR:    btag:         1
FRR:    handle:       dd96e000
FRR:    command:      5
FRR:    arg0:         1d000
FRR:    arg1:         0
FRR:    arg2:         0
FRR:    arg3:         0
FRR:   - clearing the sync cmd doorbell...
FRR:   - notifying the controller && waiting.
FRR:   - clearing the completion flag...


This is a defunct setup (SMP || 'device isp' disabled):

FRR:  generic attach - aac_attach() called
FRR:   Disabling interrupts.
FRR:    aac_init(): initing controller
FRR:  -- Init structure contents: --
FRR:    aac_common is at e198e000
FRR:    ac_init is at    e198f000
FRR:    InitStructRevision = 3
FRR:    MiniPortRevision = 1
FRR:    AdapterFibsPhysicalAddress = 1000
FRR:    AdapterFibsVirtualAddress = e198e000
FRR:    AdapterFibsSize = 1000
FRR:    AdapterFibAlign = 200
FRR:    PrintfBufferAddress = 4184
FRR:    PrintfBufferSize = 100
FRR:    HostPhysMemPages = 3fcce
FRR:    HostElapsedSeconds = 0
FRR:   setting the outbound doorbell register to all one's.
FRR:    aac_common_busaddr = 1000
FRR:    ac_init offset = 1000
FRR:   aac_sync_command() called.
FRR:   - populating the mailbox...
FRR:    aac_rx_set_mailbox() called.
FRR:    btag:         1
FRR:    handle:       dd97c000
FRR:    command:      5
FRR:    arg0:         2000
FRR:    arg1:         0
FRR:    arg2:         0
FRR:    arg3:         0
FRR:   - clearing the sync cmd doorbell...
FRR:   - notifying the controller && waiting.  
           ^^^ ASR2120 locks right here (LEDs go red) ^^^
FRR:   - clearing the completion flag...


The interesting point is that either SMP or the absence of 'device isp'
has the same effect on the physical address assigned. There's one
consistent value in the single workable setup and a different consistent 
value in all the (three) defunct setups that I've tested.
See the attached full logs.
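
As a cross-check of the two snippets above, the arithmetic can be written
down in a few lines of C. The struct and function names here are made up
for illustration; only the hexadecimal values come from the logs. The
sketch shows that the layout relative to aac_common_busaddr is the same
in both setups and that only the base address differs; it does not say
why the base differs.

```c
#include <stdint.h>

/* Values copied from the two FRR log snippets (all hexadecimal). */
struct frr_dump {
    uint32_t common_busaddr;  /* aac_common_busaddr              */
    uint32_t init_offset;     /* ac_init offset                  */
    uint32_t printf_buf;      /* PrintfBufferAddress             */
    uint32_t mailbox_arg0;    /* arg0 in aac_rx_set_mailbox()    */
};

static const struct frr_dump frr_working = { 0x1c000, 0x1000, 0x1f184, 0x1d000 };
static const struct frr_dump frr_defunct = { 0x01000, 0x1000, 0x04184, 0x02000 };

/* Offset of the printf buffer from the start of the common area. */
static uint32_t printf_offset(const struct frr_dump *d)
{
    return d->printf_buf - d->common_busaddr;
}

/* Nonzero if mailbox arg0 equals the base plus the ac_init offset. */
static int arg0_matches_init(const struct frr_dump *d)
{
    return d->mailbox_arg0 == d->common_busaddr + d->init_offset;
}
```

For both dumps printf_offset() yields 0x3184 and arg0_matches_init()
holds, so the mailbox argument in each case is simply the physical
address of ac_init (1d000 in the working setup, 2000 in the defunct one).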

Another funny point is that isp_pci_probe() doesn't detect any Qlogic
adapter on the PCI bus. I'm not able to determine where else the isp
driver could hook into the kernel's boot sequence and allocate some
physical memory locations (or whatever), thus shifting the MMIO window of
the ASR2120.
Spooky stuff.

No other irrelevant drivers apart from ISP have this effect.
At least that's what I determined by trial-and-error + "interval halving" 
in the kernel config file.
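
The "interval halving" over the config file amounts to a bisection, which
can be sketched as below. Everything here is hypothetical: boots_ok_fn
stands in for the manual step of rebuilding the kernel with a range of
drivers removed and booting from a degraded array, and the sketch assumes
that exactly one driver in the list matters.

```c
#include <string.h>

/*
 * Hypothetical experiment: returns nonzero if the machine still boots
 * correctly with drivers[lo..hi) removed from the kernel config.
 */
typedef int (*boots_ok_fn)(const char *const *drivers, int lo, int hi);

/*
 * Return the index of the single driver whose removal breaks the boot,
 * assuming exactly one such driver exists among drivers[0..n).
 */
static int bisect_culprit(const char *const *drivers, int n, boots_ok_fn boots_ok)
{
    int lo = 0, hi = n;

    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (!boots_ok(drivers, lo, mid))
            hi = mid;   /* removing the lower half broke it */
        else
            lo = mid;   /* lower half is innocent */
    }
    return lo;
}

/* Simulated experiment: the boot fails whenever "isp" is removed. */
static int simulated_boots_ok(const char *const *drivers, int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        if (strcmp(drivers[i], "isp") == 0)
            return 0;
    return 1;
}
```

With a list of n candidate drivers this needs about log2(n) kernel
rebuilds instead of n, which matters when each trial means a rebuild
and a reboot.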


More information about the freebsd-questions mailing list