New FreeBSD snapshots available: stable/10 (20150625 r284813)

Mon Jul 6 20:40:09 UTC 2015

On 7/2/15 11:00 AM, Glen Barber wrote:
> On Thu, Jul 02, 2015 at 10:52:00AM -0400, Kurt Lidl wrote:
>>> Kurt, can you re-enable the ipv6 line in rc.conf(5), and add '-tso6' to
>>> your rc.conf(5) lines?
>>>
>>>   ifconfig_bge0="DHCP"
>>>   ifconfig_bge0_ipv6="inet6 accept_rtadv -tso6"
>>>
>>
>> I tried this, and it panic'd in the same manner.  (Note - I've upgraded
>> this machine to the second 10.2-PRERELEASE build.)
>>
>
> Okay, thank you for testing.  The last commits that I see specifically
> referencing this bge(4) model were a long time ago, but TSO was
> mentioned.  It was worth a shot.

Sure, no problem.

>
>> [...]
>>
>> I've also seen (now that it's been running a bit longer), a couple of
>> other occurrences of the "spin lock held too long" panic. So while
>> having the IPv6 configuration in /etc/rc.conf causes this crash to
>> occur most of the time on boot, the same crash occurs at other times
>> too, which don't appear to be IPv6 related.
>>
>
> Can you update the PR with this information, please?

Already done by the time I sent the email.

>
>> 1) when making the requested change, I edited my /etc/rc.conf file,
>> and then issued "reboot".  The machine panic'd during the reboot
>> processing:
>>
>> root@spork:~ # reboot
>> Jul  2 09:48:53 spork reboot: rebooted by root
>> Jul  2 09:48:53 spork syslogd: exiting on signal 15
>> Waiting (max 60 seconds) for system process `vnlru' to stop...done
>> Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
>> Waiting (max 60 seconds) for system process `syncer' to stop...
>> Syncing disks, vnodes remaining...0 0 0 0 done
>> All buffers synced.
>> Uptime: 14h34m16s
>> GEOM_MIRROR: Device gswap: provider mirror/gswap destroyed.
>> GEOM_MIRROR: Device gswap destroyed.
>> pid 1 (init), uid 0: exited on signal 4
>> spin lock 0xc0cba338 (smp rendezvous) held by 0xfffff8000bbbe920 (tid
>> 100367) too long
>> timeout stopping cpus
>> panic: spin lock held too long
>> cpuid = 1
>> KDB: stack backtrace:
>> #0 0xc05757c0 at panic+0x20
>> #1 0xc0559250 at _mtx_lock_spin_failed+0x50
>> #2 0xc0559318 at _mtx_lock_spin_cookie+0xb8
>> #3 0xc08d801c at tick_get_timecount_mp+0xdc
>> #4 0xc05840c8 at binuptime+0x48
>> #5 0xc08a400c at timercb+0x6c
>> #6 0xc08d8380 at tick_intr+0x220
>> Uptime: 14h34m16s
>> Automatic reboot in 15 seconds - press a key on the console to abort
>> Rebooting...
>> timeout stopping cpus
>> timeout shutting down CPUs.
>>
>> SC Alert: Host System has Reset
>>
>> Note: the "SC Alert:" message comes from the Sparc's ALOM management system,
>> so that's from the hardware directly, not from FreeBSD's kernel.
>>
>
> Hmm.  Any chance this could be hardware (failure) related?

Highly unlikely.

First, both Chris and I see this same error on our V240 machines.

Also, I took the time this weekend to re-install from the
10.0-RELEASE media onto the other disks in this machine.[*]
My V240 has 4x72GB drives, so I now have 10.0-RELEASE running
on a ZFS mirror on disk0/disk1 and have the second 10.2-PRERELEASE
bits installed onto a ZFS mirror on disk2/disk3.  So I can boot
into either of those environments pretty easily.

When running 10.0-RELEASE, the hardware does not exhibit the
"spin lock held too long" message.

-Kurt

[*] This turned out to be unexpectedly hard.  I was able to boot
from the 10.0-RELEASE cdrom, and create a ZFS mirror, and install
to it, but when I rebooted, I got this error:

Trying to mount root from zfs:sys/ROOT/default []...
Mounting from zfs:sys/ROOT/default failed with error 45.

It took me a while to figure out what was going on.  In 10.0,
the sparc ZFS support probed all the disk devices, looking
for the disks in the boot zpool.  In 10.2, it only probes the
devices configured in the eeprom's "boot-device" setting.
I had installed the 10.2-ish bits into the zpool called "sys",
and when I reinstalled the 10.0 bits, I also put them into
a zpool named "sys".  So I had two entirely different "sys"
zpools, the first on disk0/disk1 and the second on disk2/disk3.

The 10.2 code could handle this (since it only looked at disk2/disk3),
and happily booted from disk2/disk3.
The 10.0 code, on the other hand, examined all the disks, found
devices that didn't match up, and gave up.  I ultimately ended up
reinstalling the 10.0-RELEASE software into a zpool named "sys0".
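
To illustrate the difference between the two probing behaviors, here
is a rough sketch in Python.  This is a toy model only, not the actual
loader or kernel code: the function name, the disk names, and the
numeric pool ids are all made up, and telling the two pools apart by
a per-pool id is my assumption about what "didn't match up" means.

```python
# Toy model only: not the actual FreeBSD loader or kernel code.
# Disk names, pool ids, and function names are hypothetical.

# Each disk carries a label: (pool name, pool id).  Two different
# pools can share a name but not an id (assumption for this sketch).
DISK_LABELS = {
    "disk0": ("sys", 1001),  # 10.0-RELEASE install
    "disk1": ("sys", 1001),
    "disk2": ("sys", 2002),  # 10.2-PRERELEASE install
    "disk3": ("sys", 2002),
}


def find_root_pool(pool_name, disks_to_probe, labels=DISK_LABELS):
    """Return the sorted member disks of pool_name, or raise if ambiguous."""
    members = [d for d in disks_to_probe if labels[d][0] == pool_name]
    ids = {labels[d][1] for d in members}
    if len(ids) != 1:
        raise RuntimeError(
            f"cannot assemble pool {pool_name!r}: ids {sorted(ids)}"
        )
    return sorted(members)


# 10.2-style: probe only the devices listed in "boot-device".
print(find_root_pool("sys", ["disk2", "disk3"]))

# 10.0-style: probe every disk; two different "sys" pools are found.
try:
    find_root_pool("sys", list(DISK_LABELS))
except RuntimeError as err:
    print("mount failed:", err)
```

In this model, renaming the pool on disk0/disk1 to "sys0" removes the
ambiguity, so probing every disk finds exactly one pool per name.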