SunFire X2200 ilo's bge1 DOWN/UP

Mon May 27 13:41:43 UTC 2013

On May 27, 2013, at 12:59 AM, Daniel Braniss wrote:

On Fri, May 24, 2013 at 05:31:13PM +0300, Daniel Braniss wrote:
hi, after upgrading to 9.1-stable, this particular hardware - SunFire X2200,

If you're truly running stable/9, and it's up-to-date, you should already have
SVN revisions 248858 and 250650, both of which have significant impact for
(a) the SunFire X2200 (r248858) and (b) the DOWN/UP problem (r250650).

Show me dmesg (bge(4) and brgphy(4) lines only) and 'ifconfig bge1' output.

bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem
0xfdff0000-0xfdffffff,0xfdfe0000-0xfdfeffff irq 17 at device 4.0 on pci6
bge0: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
miibus2: <MII bus> on bge0
brgphy0: <BCM5714 1000BASE-T media interface> PHY 1 on miibus2
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: Ethernet address: 00:1b:24:5d:5b:bd
bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem
0xfdfc0000-0xfdfcffff,0xfdfb0000-0xfdfbffff irq 18 at device 4.1 on pci6
bge1: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
miibus3: <MII bus> on bge1
brgphy1: <BCM5714 1000BASE-T media interface> PHY 1 on miibus3
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge1: Ethernet address: 00:1b:24:5d:5b:be

sf-10> ifconfig bge1
bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
       options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
       ether 00:1b:24:5d:5b:be
       nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
       media: Ethernet autoselect (100baseTX <full-duplex>)
       status: active

We saw similar things happening over here with a different Broadcom chipset, and the above revisions
helped significantly (URLs below):

http://svnweb.freebsd.org/base?view=revision&revision=248858
http://svnweb.freebsd.org/base?view=revision&revision=250650

is toggling bge1 DOWN/UP every few hours; this port is being used by the ILO.
To check, I upgraded another identical host, and the same problem appears.

What is the last known working revision?

I have no idea, but I have older versions, and I'll start from the oldest
(9.1-prerelease), but it will take time, since it takes hours until it happens.

There are ways you can speed up the replication time. I tend to flood a server with
TCP, though I've heard of it happening under UDP flood too.

Here's a nice way to flood a server with TCP (assuming you have SSH access to the
system via keys):

sh -c 'while :;do dd if=/dev/urandom of=/dev/stdout bs=1m count=1024 | ssh HOST2KILL /sbin/md5; done'

Run that about 16 times in separate screen sessions from various other hosts on your network,
taking care to replace "HOST2KILL" with the hostname or IP of the box with the SunFire X2200.
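
If separate screen sessions are inconvenient, the same pipeline can be started as
background jobs from one shell on each sending host. This is only a sketch: the
function name flood_host, its WORKERS argument and the FLOOD_PIDS variable are
made up for illustration, and it assumes key-based SSH access and BSD dd (GNU dd
spells the block size bs=1M).

```shell
#!/bin/sh
# flood_host HOST [WORKERS]
# Hypothetical helper: start WORKERS background loops (default 16), each
# repeatedly pushing 1 GB of random data over SSH into md5 on HOST.
# The PIDs of the background loops are collected in FLOOD_PIDS.
flood_host() {
    host=$1
    workers=${2:-16}
    FLOOD_PIDS=""
    i=0
    while [ "$i" -lt "$workers" ]; do
        (
            while :; do
                # Same pipeline as the one-liner; dd statistics are discarded.
                dd if=/dev/urandom bs=1m count=1024 2>/dev/null |
                    ssh "$host" /sbin/md5 >/dev/null
            done
        ) &
        FLOOD_PIDS="$FLOOD_PIDS $!"
        i=$((i + 1))
    done
}
```

Usage would be something like "flood_host HOST2KILL 16", and later
"kill $FLOOD_PIDS" to stop the loops; a transfer already in flight may run on
until its current dd/ssh pair finishes.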

Let that run for a while, and then when you think you've had a reset (if you weren't standing
there watching for one)…

grep 'bge.*DOWN' /var/log/messages

On a system that has booted and stayed up-and-running, there shouldn't be any messages like this:

bge0: link state changed to DOWN

When you actually get this message (if your experience is like ours), you'll be down for 90 seconds
while the NIC resets.
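
To tell a brief flap from a long reset, it can help to pair each DOWN with the
UP that follows it and print the gap. A rough sketch, assuming the default
syslog layout ("Mon DD HH:MM:SS host kernel: bgeN: link state changed to ...")
and that a DOWN/UP pair does not span midnight; the function name link_flaps is
made up for illustration.

```shell
#!/bin/sh
# link_flaps [FILE]
# Hypothetical helper: for every "link state changed to DOWN" line, find the
# next "UP" line for the same host and interface and print the seconds between.
# FILE defaults to /var/log/messages.
link_flaps() {
    awk '
    /link state changed to (DOWN|UP)/ {
        split($3, t, ":")                      # HH:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        iface = $6
        sub(/:$/, "", iface)                   # "bge1:" -> "bge1"
        key = $4 " " iface                     # host + interface
        if ($NF == "DOWN") {
            down_at[key] = now
            down_ts[key] = $1 " " $2 " " $3
            pending[key] = 1
        } else if (pending[key]) {
            printf "%s down %s for %d s\n", key, down_ts[key], now - down_at[key]
            pending[key] = 0
        }
    }' "${1:-/var/log/messages}"
}
```

Run as "link_flaps /var/log/messages"; each output line names the host, the
interface, when the link went down and how many seconds passed before it came
back.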

However, since you say you have some older 9.1 releases… I'd start by first trying to bring the
replication time of the problem down by using TCP and/or UDP floods. That way you'll be able to
test for resolution of the problem as you progress up to stable/9 (where the problem should be fixed
by the aforementioned SVN revisions -- specific to your hardware).

There is no correlation with time, since they happened at totally different times.
I rebooted both hosts at almost the same time.
One host:
uptime: 5:24PM  up  6:15, 0 users, load averages: 0.00, 0.00, 0.00
May 24 12:53:52 sf-04 kernel: bge1: link state changed to DOWN
May 24 12:53:55 sf-04 kernel: bge1: link state changed to UP
May 24 15:34:25 sf-04 kernel: bge1: link state changed to DOWN
May 24 15:34:28 sf-04 kernel: bge1: link state changed to UP

and
uptime: 5:24PM  up  6:14, 0 users, load averages: 0.00, 0.00, 0.00

May 24 16:30:44 sf-10 kernel: bge1: link state changed to DOWN
May 24 16:30:44 sf-10 kernel: bge1: link state changed to UP

This is not serious, the ilo (ssh) connection is ok, but it's annoying: we have
more than 10 of these hosts, and if I upgrade all of them, the logs will fill up
with this :-)

any ideas?

Well, you say the connection is OK… so it doesn't sound like a full reset as it
was in our case (we have a different chipset).

But I agree that a log full of those would be annoying.

Try getting up to stable/9 in its current state (note: stable/8 has all the
aforementioned revisions too).
--
Devin
