ffs snapshot lockup

Vivek Khera vivek at khera.org
Fri Oct 6 18:11:07 UTC 2006

On Oct 6, 2006, at 1:57 PM, Kris Kennaway wrote:

>> This is very strange. Your 3 instances of getty were just reading the
>> tty input, and all susceptible processes (like sshd) are waiting on net
>> events. No processes are blocked on the fs. One nfsd is serving the
>> request, and dump is active.
> To repeat something I said earlier: when creating a snapshot
> (e.g. which dump -L does), the entire system may become unresponsive
> until the snapshot completes, which can take many minutes.

I know snapshot takes a while -- we're used to that.

> How long are you waiting before pronouncing the system deadlocked?

Tens of minutes.

> What does ^T on the console (e.g. when trying to log in) show you?

Nothing. The console is non-responsive. The remote shells are
non-responsive to any input.

I'm now convinced it was all stemming from some bug in the bge driver (at
least for my specific chipset.)  Last night I put in an old spare  
3c905 NIC and turned off the motherboard bge via BIOS.

I can't make the machine lock up at all now, even with the watchdog
running and doing level 0 dumps.

Also, even though this NIC is only 10/100 and the prior was running  
at GigE speed, the system is *way* more responsive to network  
operations.  For example, when I logged in this morning my IMAP mail  
client took barely a second or so to open my inbox, whereas before
it would take upwards of 10 seconds.

This machine was always this way since it was first set up running  
5.3.  I can't believe I lived with it for so long...  I'd like to  
find a nice stable GigE NIC for it, since I know that the onboard bge  
is definitely sub-optimal with FreeBSD.  Dell's diagnostics don't  
find any hardware fault, for what that's worth.

Curiously, I have a handful of other Dell servers at the office which  
all have bge and run just great at GigE speed to the same switch.

If it does lock up again, I'll be sure to let you know!

