Big problems with 7.1 locking up :-(

Guy Helmer ghelmer at palisadesys.com
Thu Feb 12 10:16:16 PST 2009


Guy Helmer wrote:
> Pete French wrote:
>> I have a number of HP 1U servers, all of which were running 7.0
>> perfectly happily. I have been testing 7.1 in its various incarnations
>> for the last couple of months on our test server and it has performed
>> perfectly.
>>
>> So the last two days I have been round upgrading all our servers, knowing
>> that I had run the system stably on identical hardware for some time.
>>
>> Since then I have started seeing machines lock up. This always happens
>> under heavy disc load. When I bring the machine back up, it sometimes
>> fails to fsck due to a partially truncated inode. The lockups appear to
>> be disc related - on my mysql master machine it will come back up with
>> files somewhat shorter than those which have already been transmitted to
>> the slave (i.e. some data was in memory, and claimed to have been written
>> to the drive, but never made it onto the disc).
>>
>> The only time I have seen anything useful on the screen was during one
>> lockup where I got a message about a spin lock being held too long and
>> some comment in parentheses about it being a turnstile lock.
>>
>> Help! :-(
>>
>> I am now downgrading all the machines to 7.0 as fast as I can - though
>> the machine I am trying to compile it on has locked up once during the
>> compile, so I haven't got anywhere so far.
>>
>> The machines are HP Proliant DL360 G5s - they have an embedded P400i
>> RAID controller with a pair of mirrored drives connected. Each one has
>> both ethernets connected, bundled using lagg and LACP.
>>
>>   
> I can't tell whether my situation is related, but I am seeing lockups 
> on SMP Supermicro servers with both older (NetBurst-ish) and current 
> Xeon CPUs.  I have been dropping into the kernel debugger and getting 
> lock information and process backtraces, but so far nothing has been 
> conclusively identified.  I think the issue I'm seeing was introduced 
> sometime between October 2 and November 24 in the RELENG_7 branch, and 
> I suppose the next step is to do a binary search for the offending 
> change.
>
> Guy
>
FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE 
that are causing my Supermicro dual Xeon machines to wedge.  I did the 
binary search between 2008-10-02 and 2008-11-24 without reproducing any 
lockups, and then I went on to search between 2008-11-24 and 
2009-01-04.  An SMP kernel built from 2008-12-22 (r186409) sources was 
stable for over two weeks; a kernel built from 2008-12-29 (r186590) 
sources wedged in under 24 hours under moderate load.
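
For anyone wanting to narrow the r186409..r186590 window further, the
binary search described above can be sketched roughly as follows.  This is
only an illustration, not the procedure actually used: the function name
and the wedges() callback are hypothetical, and in practice each test means
building a kernel at that revision and leaving it under load long enough to
trust the result.  It also treats the revision numbers as one contiguous
range, which overstates the work, since not every revision touches RELENG_7.

```python
from typing import Callable


def first_bad_revision(good: int, bad: int, wedges: Callable[[int], bool]) -> int:
    """Return the earliest revision in (good, bad] for which wedges() is true.

    Assumes `good` is known stable, `bad` is known to wedge, and every
    revision after the first bad one also wedges.
    """
    if good >= bad:
        raise ValueError("good must be older than bad")
    while bad - good > 1:
        mid = (good + bad) // 2
        if wedges(mid):
            bad = mid    # culprit is at or before mid
        else:
            good = mid   # culprit is after mid
    return bad


if __name__ == "__main__":
    # Stand-in predicate for illustration only: it pretends an arbitrary
    # revision introduced the hang. A real test would build and soak a
    # kernel at revision r and report whether the machine wedged.
    pretend_culprit = 186500
    print(first_bad_revision(186409, 186590, lambda r: r >= pretend_culprit))
```

Each pass halves the window, so a span of 181 revisions needs at most
eight test kernels.  The slow part is the soak time per kernel, given that
the stable build here ran for over two weeks before being declared good.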

It appears that the significant changes between r186409 and r186590 were 
r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij - 
reverted bce changes).  My machines don't have bce interfaces, so I 
suspect the ATA changes.

Any thoughts?

Thanks,
Guy

-- 
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.



More information about the freebsd-stable mailing list