3Ware 9000 series hangs under load

Scott Long scottl at samsco.org
Thu Oct 30 09:34:19 PDT 2008


Oliver Lehmann wrote:
> Hi,
> 
> I have problems with my 3ware controller. Under heavy I/O load (e.g.
> running 40 port builds throughout the day with tinderbox, which involves
> un-tarring a whole FreeBSD tree 40 times), my system hangs with the
> well-known
> 
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> 
> error. I've opened a ticket at 3ware and after half a month of
> dummy tests (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from 2nd-level support, and he
> told me:
> 
>   There are 2 things that you can try:
>   disable apic in your bootloader.conf file, or RMA the controller.
> 
>   The error that you have is generally caused by an interrupt problem,
>   defective backplane, bad drive or bad controller.
> 
> and after I told him that I intend to use the 2 CPUs I have and not
> fall back to one CPU forever, he responded:
> 
>   Yes I do understand about disabling APIC, but the feature is sometimes
>   not stable in all dual proc systems.  There are many variables, the
>   CPU's have to be matched down to the Lot #, the motherboard must have a
>   good design and the kernel supporting APIC must be stable. But, it is a
>   good test to see if it is software or hardware.
> 
> So what I did now was compile a kernel w/o apic/smp, and I have been
> running this configuration for 3 days, stressing the system w/o running
> into the swap_pager problem. Can it still be a controller problem, or is
> it more likely a problem with FreeBSD's smp/apic implementation or the
> board I'm using (Intel L440GX)?
> 
> I'm asking because I'm not sure which problem it is now, and before
> telling 3ware and having them respond "ok it is a FreeBSD problem"
> or "ok it is a board problem" I'd like to know what can be the case here.
> 
> (please keep me CCed, I'm not subscribed to smp@)
> 
> Further information (and the history) on this topic can be found here
> (and following):
> 
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html
> 
> 

The probability that it's a problem in the generic interrupt/APIC code 
in FreeBSD is low.  That code has matured quite well over the last 5 
years, and it is very solid for just about every other hardware 
configuration out there.  I'd suspect the following things in the 
following order:

1. Driver bug.  Driver might be losing an interrupt, or might be
deadlocking due to coding/design problems.
2. Defective controller.
3. Buggy firmware on the controller.  FreeBSD does tend to push I/O
controllers a lot harder than other OS's, resulting in strange bugs
sometimes being found.
4. Defective motherboard.

The fact that it's running fine with SMP/APIC disabled could easily mean
that it's not taking as high a load, and is thus avoiding problems.
It could also mean that latent bugs in the driver are not being exposed.
I don't have a lot of time to spend debugging this, but I'd suggest that
you either take up AMCC's offer to RMA the board, or put a spare ATA
drive in the chassis and set it up as a dump partition, then get a
crashdump of the system when it gets into this state.
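
For reference, a minimal sketch of the dump setup suggested above, written
as /etc/rc.conf entries.  The device name ad4s1b is purely hypothetical (it
assumes the spare ATA drive shows up as ad4 and has a "b" partition large
enough to hold the dump); substitute whatever the drive is called on your
system.

```sh
# /etc/rc.conf fragment (sketch only; the device name is an assumption)
dumpdev="/dev/ad4s1b"   # hypothetical partition on the spare ATA drive
dumpdir="/var/crash"    # directory where savecore(8) stores the dump at boot
```

With this in place (after a reboot, or after pointing dumpon(8) at the same
device by hand), a panic should write the dump to that partition and
savecore(8) should copy it into dumpdir on the next boot.  A hung machine
does not panic on its own, so the kernel would probably also need the
debugger options (KDB and DDB) so that a dump can be forced from the
console.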

Scott


