vrele: negative ref cnt (was: Re: Crash dumps not working correctly for amd64?)

Sean Chittenden sean at chittenden.org
Fri Feb 25 17:51:15 GMT 2005


>> So, buried in the last email I sent out was the actual panic string
>> for the dumps I've been having.  Has anyone else seen this, or is it
>> a known problem?  I haven't been able to extract a correct dump yet.
>> I just dd(1)'ed the swap partition, hoping that the next time it
>> dumps there will be a clean dump that I can extract a backtrace
>> from, but I'm not too hopeful that this will work as a fix.
>> Regardless, I'd assume that twa(4) + amd64 work fine together?  This
>> crash seems to emanate from the file system code, so I doubt it's
>> driver related.  Anyone know what could lead to this?  Feel free to
>> send me in the direction of fs@ if I'm not the only one having this
>> problem or if it isn't amd64 specific.  Right now I can generally
>> trigger this once every 24 hours, so it's becoming something of a
>> problem for me, and without a valid dump I'm not sure where to
>> begin.  -sc
>>
>>> # savecore -vf
>>> bounds number: 9
>>> checking for kernel dump on device /dev/da0s1b
>>> mediasize = 3221225472
>>> sectorsize = 512
>>> magic mismatch on last dump header on /dev/da0s1b
>>> forcing magic on /dev/da0s1b
>>> savecore: first and last dump headers disagree on /dev/da0s1b
>>> savecore: reboot after panic: vrele: negative ref cnt
>>> Checking for available free space
>>> Dump header from device /dev/da0s1b
>>>   Architecture: amd64
>>>   Architecture Version: 16777216
>>>   Dump Length: 2146631680B (2047 MB)
>>>   Blocksize: 512
>>>   Dumptime: Thu Dec 16 03:06:24 2004
>>>   Hostname: nfs.example.com
>>>   Magic: FreeBSD Kernel Dump
>>>   Version String: FreeBSD 5.3-STABLE #1: Wed Dec  8 22:20:38 PST 2004
>>>     root at nfs.example.com:/usr/obj/usr/src/sys/NFS
>>>   Panic String: vrele: negative ref cnt
>>>   Dump Parity: 1999448632
>>>   Bounds: 9
>>>   Dump Status: bad
>>> savecore: writing core to vmcore.9
>>> 2146631680
>
> twe and twa have worked great on my side.  Re: the dumps, that's a
> bizarre one.  What hardware do you have?

twa (9K Escalade, 8-port), 8x 200GB 7200RPM Maxtor SATA (7x RAID5, 1x
hot spare), Athlon64 3000, 2GB of RAM, and... here's the kicker: an
ASUS KV8 Deluxe mobo.  I'm not suspecting the hardware, however.

> I may be able to recreate it, if you've got the exact specs and the
> last time you cvsup'd to -STABLE.

Wednesday, Feb 16th.

The machine's handling an insane amount of load for one machine,
especially for this hardware class.  Once every 12-24 hours or so, the
machine craps out.  It doesn't crap out at peak load, nor at low load,
nor at any time that makes sense.  It's been doing this for about a
week, and I'm certain this bug has nothing to do with usage in any
way, shape, or form.  I've verified the RAID, the memory, and the CPU.

I have MySQL listening on lo0, same with memcached(8).  What's
interesting to me is that before the machine goes AWOL for 20 minutes
while it reboots, traffic on lo0 goes crazy, and the reboot comes
after lo0 goes into its FUBAR state.  I'm sure what I'm hitting is a
kernel bug and not a hardware bug, since there are clear indicators
that the system dynamics change, but I can't get any dumps out of the
machine to even begin diagnosing the problem, which is irritating, to
say the least.  The machine has panicked a few times since I dd(1)'ed
the swap partition, and I haven't seen a single core dump since, so
I'm not even convinced the above error is the correct one.  Strange,
no?  Anyway, I'm at a loss as to where to begin, since the only
diagnostic I have is that lo0 usage is an indicator of an impending
reboot.
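
For the record, the dd(1) pass over swap amounted to something like
this (da0s1b being the dump device from the savecore output above; the
block size was arbitrary, and swap has to be off while you scribble
over it):

    # swapoff /dev/da0s1b
    # dd if=/dev/zero of=/dev/da0s1b bs=1m
    # swapon /dev/da0s1b

And when a dump does land, savecore(8) can be pointed at the device
explicitly, e.g. "savecore -vf /var/crash /dev/da0s1b".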

Here's a graph of lo0.  The important thing to note is that when lo0
starts pushing 100+MB of data, it has no bearing on the actual load of
the machine or on the userland programs' delivery of content.
Everything looks nominal and is within normal parameters.
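
(For anyone who wants to watch the same thing live rather than from a
graph, "netstat -I lo0 -w 1" shows the per-second traffic.)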

[Attachment scrubbed: graph of lo0 traffic]

I sure hope someone has an "a ha!" brewing, 'cause this isn't making a
whole lot of sense to me.  I'm probably going to bind everything to a
non-lo0 interface (i.e., em0) in the hope that it's a locking issue
with lo0.  If I had to point to any one thing that's strange about
this machine's configuration, it's the very heavy use of lo0 by
multiple programs; the rebind is sketched below.
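
For the curious, the rebind amounts to something like this, with
192.0.2.10 standing in for whatever address ends up on em0 (the
address is made up; -d/-l/-p are memcached's usual daemonize, listen
address, and port flags):

    # memcached -d -l 192.0.2.10 -p 11211

and, for MySQL, in my.cnf:

    [mysqld]
    bind-address = 192.0.2.10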
tia.  -sc

-- 
Sean Chittenden

