Big problems with 7.1 locking up :-(

Mon Jan 12 16:05:04 PST 2009

On Mon, 12 Jan 2009, Tomas Randa wrote:

> I have similar problems. The last "good" kernel I have is from the stable
> branch, October 8. After the next upgrade, I saw big performance problems. I
> tried ULE, 4BSD etc., but nothing helps except downgrading the system.
>
> Now I am trying 7.1-p1 and the problems are back. MySQL spends a lot
> of time in the status "waiting for opening table" or "waiting for close
> tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and
> an Areca SATA controller. Could the problem be in the "da" device, for example?

So far, this sounds like a different problem than the one others have been 
posting about, which involves full system freezes rather than specific 
processes wedging or responding poorly.  I'd suggest starting by using 
"procstat -k" on the process ID to look at where specific threads are waiting 
in the kernel.  Is it simply that MySQL is being unreasonably slow in certain 
situations, or does it actually entirely stop operating?
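
To tell "slow" from "stuck", it can help to take several "procstat -k" 
snapshots a few seconds apart and compare them: threads that show the same 
kernel stack every time are probably wedged.  A minimal Python sketch of that 
idea follows; the function names, the sample count and the interval are 
illustrative assumptions, and only the "procstat -k <pid>" invocation itself 
comes from the suggestion above.

```python
import subprocess
import time


def procstat_command(pid):
    """Build the argument list for `procstat -k <pid>`."""
    return ["procstat", "-k", str(int(pid))]


def sample_kernel_stacks(pid, samples=5, interval=2.0, run=subprocess.run):
    """Collect several snapshots of one process's kernel stacks.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without procstat being installed.
    """
    outputs = []
    for i in range(samples):
        result = run(procstat_command(pid), capture_output=True,
                     text=True, check=False)
        outputs.append(result.stdout)
        if i + 1 < samples:
            time.sleep(interval)  # wait before taking the next snapshot
    return outputs
```

Calling sample_kernel_stacks() with the PID of the MySQL server process and 
printing the returned strings one after another makes unchanged stacks easy to 
spot.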

If you're able to narrow down the date on the 7.x branch where the problem 
you're experiencing "begins", that would be most helpful.  I'd suggest leaving 
your userspace at its October 8th state, and sliding the kernel forward in a binary 
search until you've narrowed it down a bit.  Obviously, this takes a bit of 
patience, but narrowing it down could be quite informative.
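
The bisection itself can be sketched in a few lines of Python.  This is only an 
illustration of the search: is_bad() is a hypothetical callback standing in for 
"build a kernel from the 7.x sources as of that date, boot it, and see whether 
the problem shows up", and the dates in the usage note are placeholders.

```python
from datetime import date, timedelta


def bisect_kernel_dates(good, bad, is_bad):
    """Narrow a regression down to two adjacent dates.

    `good` is a date whose kernel is known to work, `bad` a later date
    whose kernel shows the problem.  `is_bad(d)` must return True if a
    kernel built from sources as of date `d` misbehaves.
    Returns (last_good, first_bad), which are one day apart.
    """
    if good >= bad:
        raise ValueError("'good' must be earlier than 'bad'")
    while (bad - good).days > 1:
        # Test the midpoint, then keep the half that still contains
        # the good-to-bad transition.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return good, bad
```

For example, bisect_kernel_dates(date(2008, 10, 8), date(2009, 1, 12), is_bad) 
needs only about log2(number of days) kernel builds, rather than one per day.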

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> Thanks Tomas Randa
>
> Garance A Drosihn wrote:
>> At 2:55 PM +0000 1/12/09, Robert Watson wrote:
>>> On Fri, 9 Jan 2009, Garance A Drosihn wrote:
>>> 
>>>> At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:
>>>>> On Jan 8, 2009, at 8:58 PM, Pete French wrote:
>>>>>> I have a number of HP 1U servers, all of which were running 7.0 
>>>>>> perfectly happily. I have been testing 7.1 in its various incarnations 
>>>>>> for the last couple of months on our test server and it has performed 
>>>>>> perfectly.
>>>>> 
>>>>> I noticed a problem with 7.0 on a couple of Dell servers.  [...] We've 
>>>>> since then compiled the kernel under the BSD scheduler to rule that out, 
>>>>> and so far so good.
>>>>> 
>>>>> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try 
>>>>> that?
>>>> 
>>>> FWIW, the other guy I know who is having this problem had already 
>>>> switched to using ULE under 7.0-release, and did not have any problems 
>>>> with it.  So *his* problem was probably not related to SCHED_ULE, unless 
>>>> something has recently changed there.
>>>> 
>>>> Turns out he hasn't reverted back to 7.0-release just yet, so he's going 
>>>> to try SCHED_4BSD and see if that helps his situation.
>>> 
>>> Scheduler changes always come with some risk of exposing bugs that have 
>>> existed in the code for a long time but never really manifested 
>>> themselves. ULE is well shaken-out, having been under development for at 
>>> least five years, but it is possible that some problems will become 
>>> visible as a result of the switch.  I would encourage people to stick with 
>>> ULE, but if you're having a stability problem then experimenting with 
>>> scheduler as a variable that could be triggering the problem may well be 
>>> useful to help track down the bug.
>> 
>> Just to follow up on this:  My friend did switch back to a 7.1 kernel with
>> SCHED_4BSD, and he still ran into problems.  The error messages weren't
>> the same, but errors did happen in the same high disk-I/O situations as
>> the lockup happened with SCHED_ULE.  At this point he's fallen back to
>> the 7.0-kernel that he had been running (which also has SCHED_ULE), and
>> all the problems have gone away.  So at the moment he's running with a
>> 7.0-ish kernel and the 7.1-release userland, without the hanging problems.
>> So the problem is something in the kernel, but it is *NOT* the scheduler
>> (at least, not in his case).
>> 
>> He is not eager to do a whole lot of experiments to track down the
>> problem, since this is happening on busy production machines and he
>> can't afford to have a lot of downtime on them (especially now that the
>> semester at RPI has started up).  The systems have some large (2 TB)
>> filesystems on them, and the lockups occur in high disk-I/O situations.
>> He's seeing the problem on one system which is a dual-CPU quad-core
>> Xeon, and another which is a 64-bit P4 with hyperthreading.  The one
>> thing in common between the two setups is that the boot drives + a
>> 3ware controller (with its array of RAID disks) is moved from one
>> machine to the other one:
>>
>>   "its a 3ware 9500 12 port model, the boot drive is connected to
>>    an ICH6 in IDE mode, and yes, I've run it in single, single with
>>    hyper threading, and 8 way mode.  All 64 bit."
>> 
>> We still have no idea where the problem really is.  For all we know,
>> someone spilled a Pepsi on it when he wasn't looking...
>> 
>