alpha/50659: reboot causes SRM console to loop on an endless error and needs a hard reset

Jens Röder j.roeder at tu-bs.de
Wed Apr 9 15:33:08 PDT 2003


Hello Wilko,

thanks for the quick response and good support.



On Wed, 9 Apr 2003, Wilko Bulte wrote:

> > The machine has about 1 GB RAM. Honestly I am not sure what "processor
>
> 1GB... that is overkill for a gateway, but hey, it should not hurt ;)

:-) Yes, I am happy that I got that machine, which stood unused for years
in a cluster because of a loud hard drive, which I simply removed. I hoped
for the stable hardware of a real Unix box, as everyone depends on that
gateway. Of course it can host a few users later, once I get it stable
with FreeBSD. I think FreeBSD should be secure enough to allow users on a
gateway as well.

> That is a kernel panic, not a memory problem ;)
>
> Most Alphas, and your AS500 too, have ECC (error correction) memory. That allows
> single bit memory errors to be corrected. The kernel will tell you if a
> correction was applied, these are the processor correctable errors I
> mentioned.

Hm, that sounds interesting. So does that mean that in the case of a
hardware memory problem I would get a kernel message and would not need to
run any memory checks?

> Unaligned accesses in kernel mode are Bad(TM). Check the handbook on
> creating more debug info on the crash please.

I am not sure whether I did the right thing, but there is a core file now
available at:

http://octopus.homeunix.net/jens@piero.ptch.nat.tu-bs.de.gz



> > At the moment I am also considering defective memory and will check that as soon as
> > I have a temporary replacement for that Institute gateway and a night
>
> Very unlikely, this looks like a problem in the kernel to me.
>
> > Meanwhile I have compiled a kernel with sufficient debugging enabled, in the
> > hope of getting proper error messages.
>
> Can you catch a crash dump maybe?

At least the kernel did not reboot with debugging enabled, so I could
write down the following for the 5.0-p7 version:


fatal kerneltrap:

trapentry	= 0x4	(unaligned access fault)
cpuid		= 0
faulting va	= 0xfffffc0031d12d0c
opcode		= 0x2d
register	= 0x9
pc		= 0xfffffe0004138bc0
ra		= 0xfffffe0004138bb4
sp		= 0xfffffe001da7db70
usp 		= 0x11fff628

curthread	= 0xfffffc003e2c87c0
pid 593, comm ipfw
Stopped at ipfw_ctl+0x1c0; or 	zero, s0,t2
			<zero=0x0,s0=0xfffffc0031d12d0c,t2=0x2710>


Unfortunately I am too new to this area and have never worked at the db>
prompt, so I need to do a lot of reading to handle this. Are there one or
two commands I could simply run to give you a proper error message? (By the
way, it is a generic kernel in this case.)

Again, when you use "ipfw show" on 5.0 on alpha, you get messages like
this:

ptchgate# ipfw show
00100         94      10410 allow ip from any to any via lo0
00200          0          0 deny ip from any to 127.0.0.0/8
pid 585 (ipfw): unaligned access: va=0x1200a80b4 pc=0x120001780
ra=0x120001764 op=ldq
pid 585 (ipfw): unaligned access: va=0x1200a80bc pc=0x120001784
ra=0x120001764 op=ldq
00300          0          0 deny ip from 127.0.0.0/8 to any
65000        921      89561 allow ip from any to any
65535          0          0 deny ip from any to any


A crash becomes more likely when my own set of rules is loaded and I then
list the rules.

This does not occur on 5.0 for i386, which seems to run stably so far.


> > I think the "unaligned access error" when listing the firewall rules
> > only showed up in the 5.0 version. I will probably downgrade to 4.7 or 4.8
> > (which is better to use?) again and recompile with ipfw2 then, and let you
> > know. Before that I will try to produce proper error messages with the
> > debug kernel of 5.0.
>
> I'd go for 4.8. Do you need any ipfw2 functionality?
>
> > Maybe you can try out the SRM console problem without upgrading to 5.0, as
> > I remember I first noticed it when I booted from floppy or CD and told
> > the machine to abort. At first I thought the error was my own fault
> > because of the abort. Again, 4.7 did not have that problem.
>
> I have a fresh 4.8 on my AS500 and that does not show me the problem.

OK, I will downgrade to 4.8 as soon as I have a proper crash dump from 5.0.


> What kind of PCI cards are in the machine? Can you post a SHOW CONF
> from the SRM ?

Of course, with pleasure:

.....................................................................
Firmware
SRM Console: V7.2-2
ARC Console: 4.58
PALcode: OpenVMS PALcode V1.20-0 Tru64 UNIX PALcode V1.22-0

Processor
DECchip (tm)21164A-2 Pass 2 500MHz 96KByte SCache
8MB BCache

Cia ASIC Pass 3

Memory Size 1024 Mb

Bank	Size/Sets	Base Addr	Speed
-----	---------	---------	------
00	512 Mb/1	000000000	Fast
01	512 Mb/1	020000000	Fast

BCache Size 8Mb

Tested Memory 33 Mbyte

PCI Bus

Bus 00	Slot 06: DECchip 21040 Network Controller
				ewa0.0.0.6.0

Bus 00 	Slot 08: Digital TGA2 Graphics Controller

Bus 00	Slot 09: ISP1020 SCSI Controller
			pka0.7.9.0 	SCSI Bus ID 7
			dka400.4.0.9.0	RRD46
			dka500.5.0.9.0	IBM DGHS18u

Bus 00 	Slot 10: Intel 8275EB PCI to Eisa Bridge

Bus 00 	Slot 12: Vendor: 10ec	Device: 8139 Sub_id 813910ec

...............................................................

I hope my eye-to-keyboard copy doesn't contain errors. :-)


Well, I hope there was something useful for debugging in my mail. I am
sorry if I appear very inexperienced with FreeBSD, but I am just getting
started.

With best regards from Germany,

Jens


-----------------------------------------------------------------------------
Physikalische und Theoretische Chemie der TU-Braunschweig
Jens Röder, Hans-Sommer Str.10, 38106 Braunschweig
-----------------------------------------------------------------------------


