twa kernel panic under heavy IO

Mon Oct 24 11:32:56 PDT 2005

> -----Original Message-----
> From: Dan Rue [mailto:drue at therub.org] 
> Sent: Monday, October 24, 2005 11:23 AM
> To: Vinod Kashyap
> Cc: freebsd-stable at FreeBSD.org
> Subject: Re: twa kernel panic under heavy IO
> 
> On Mon, Oct 24, 2005 at 11:07:28AM -0700, Vinod Kashyap wrote:
> > > After going around with 3ware web support, this issue has been
> > > concluded, but not resolved.  I tried my 3ware 9500 on FreeBSD 5.3,
> > > 5.4, and 5-STABLE.  With all of these versions of OS and driver (I
> > > never changed the driver version manually), I received hard lock ups
> > > and reboots (though, interestingly, no kernel panics).
> > > 
> > > 3ware had me check and troubleshoot a number of possibilities, until
> > > they finally decided it was a hardware problem and issued me a
> > > replacement card.  However, in the meantime, I upgraded to FreeBSD
> > > 6.0RC1 and the machine is now working flawlessly.  I returned the
> > > replacement card unused.
> > > 
> > > I can only conclude that there is a large (timing?) bug in the twa
> > > driver in FreeBSD 5.3/5.4/5-STABLE (as opposed to an isolated
> > > hardware problem with my setup).
> > > 
> > > I have pasted the full conversation with 3ware on my website for 
> > > those interested here:
> > > http://therub.org/9500.txt (sorry for the poor formatting)
> > > 
> > > At one point, I received the following error message just before the
> > > machine locked up:
> > > 
> > > >Oct 12 11:36:13 leopard kernel: initiate_write_filepage: already started
> > > 
> > > I grepped for that error message in the FreeBSD kernel source, and
> > > found it in sys/ufs/ffs/ffs_softdep.c on line 3580.  What makes it
> > > really interesting is the comment above where the error is thrown:
> > > 
> > > if (pagedep->pd_state & IOSTARTED) {
> > >         /*
> > >          * This can only happen if there is a driver that does not
> > >          * understand chaining. Here biodone will reissue the call
> > >          * to strategy for the incomplete buffers.
> > >          */
> > >         printf("initiate_write_filepage: already started\n");
> > >         return;
> > > }
> > > 
> > > I know this is a 3ware issue.  I am posting this resolution response
> > > here in hopes that it may help someone else that hits this bug - and
> > > with the hope that publicly it will get the attention of the 3ware
> > > FreeBSD driver team/individual.
> > > 
> > 
> > The error messages you are seeing are consistent with bad hardware.
> > The hardware is becoming unavailable to the driver.
> > This other message "initiate_write_filepage..." is different, but did
> > you see the machine hang after this message got printed?  I don't
> > think it's related to the hang.
> > 
> 
> The initiate_write_filepage message occurred right before the hang.
> Here's the full log from that time:
> 
> Oct  6 17:00:32 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb0; Missing bits: [MC_RDY,]
> Oct  6 17:00:33 leopard last message repeated 399 times
> Oct  6 17:00:36 leopard kernel: ected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct  6 17:00:36 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct  6 17:00:36 leopard last message repeated 296 times
> Oct  6 17:01:37 leopard kernel: initiate_write_filepage: already started
> Oct  6 17:01:37 leopard last message repeated 83 times
> Oct  6 17:01:37 leopard kernel: twa0: ERROR: (0x05: 0x210b): Request timed out!: request = 0xc23fb0a0
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1108): Resetting controller...:
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x005e): Cache synchronized after power fail: unit=0
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
> Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1107): Controller reset done!:
> 

Ok, that message is preceded by those same messages that indicate
that the hardware became unavailable.  So, that message seems to
have been the result of the same hardware issue I mentioned.
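
For anyone who wants to check the same ordering in a longer log, here is a
minimal sketch in C.  It is only an illustration: the struct and function
names (log_summary, summarize_log) are hypothetical, it matches messages by
plain substring, and it assumes the syslog "last message repeated N times"
line refers to the line immediately before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Totals for the two message types discussed in this thread. */
struct log_summary {
    long hw_errors;      /* "Missing expected status bit(s)" lines, incl. repeats */
    long softdep_msgs;   /* "initiate_write_filepage" lines, incl. repeats */
    int  hw_error_first; /* 1 if a hw error preceded the first softdep message */
};

enum line_kind { KIND_OTHER, KIND_HW, KIND_SOFTDEP };

/*
 * Hypothetical helper: walk the log lines in order and fill in *out.
 * A "last message repeated N times" line is credited to whatever kind
 * the previous line was (an assumption about syslog compression).
 */
static void summarize_log(const char *const *lines, size_t n,
                          struct log_summary *out)
{
    enum line_kind last = KIND_OTHER;
    size_t i;

    assert(lines != NULL && out != NULL);
    out->hw_errors = 0;
    out->softdep_msgs = 0;
    out->hw_error_first = 0;

    for (i = 0; i < n; i++) {
        const char *rep = strstr(lines[i], "last message repeated ");
        long count = 1;

        if (rep != NULL) {
            /* Keep 'last' as it is; only pick up the repeat count. */
            if (sscanf(rep, "last message repeated %ld times", &count) != 1)
                continue;
        } else if (strstr(lines[i], "Missing expected status bit(s)") != NULL) {
            last = KIND_HW;
        } else if (strstr(lines[i], "initiate_write_filepage") != NULL) {
            last = KIND_SOFTDEP;
        } else {
            last = KIND_OTHER;
            continue;
        }

        if (last == KIND_HW) {
            out->hw_errors += count;
        } else if (last == KIND_SOFTDEP) {
            /* Record the ordering the first time a softdep message shows up. */
            if (out->softdep_msgs == 0 && out->hw_errors > 0)
                out->hw_error_first = 1;
            out->softdep_msgs += count;
        }
    }
}
```

Fed the log entries quoted above, one per array element, this should count
697 hardware errors and 84 initiate_write_filepage messages, with the
hardware errors coming first.  The truncated "ected status bit(s)" entry
does not match the substring and is not counted.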

> 
> If it's a hardware problem, why would it run fine on 6.0?
> The hang was very easy to trigger, and I've put the 6.0
> machine through the gauntlet trying to recreate the problem.
> 
That's a valid question.  It could be only a matter of time...

> Thanks for looking into this (again) for me, Dan
>