[Testers wanted] /dev/console cleanups

Thu Nov 20 07:39:01 PST 2008

On Nov 20, 2008, at 4:03 AM, Jeremy Chadwick wrote:
>
> This has two problems, but I'm probably missing something:
>
> 1) See my original post, re: users of our systems use "dmesg" to find
> out what the status of the system is.  By "status" I don't mean "from
> the point the kernel finished to now", I literally mean they *expect*
> to see the kernel device messages and all that jazz.  No, I'm not
> making this up, nor am I arguing just to hear myself talk (despite
> popular belief).  I can bring these users into the discussion if people
> feel it would be useful.

Sorry for jumping in late, but...

cat /var/run/dmesg.boot

Is that acceptable? I know it depends on what end users want, but some  
of my old hosting customers really just wanted to see the specs of the  
box and nothing else. Making dmesg a shell script that just cats that  
file satisfied everyone who asked.

Also:

> On Thu, Nov 20, 2008 at 05:39:36PM +1100, Peter Jeremy wrote:
>> On 2008-Nov-19 02:47:31 -0800, Jeremy Chadwick <koitsu at freebsd.org> wrote:
>>> There's a known "issue" with the kernel message buffer though: it's not
>>> NULL'd out upon reboot.
>>
>> This is deliberate.  If the system panics, stuff that was in the
>> message buffer (and might not be on disk) can be read when the system
>> reboots.  If there is no crashdump, this might be the only record of
>> what happened.
>
> That doesn't sound deliberate at all -- it sounds like a quirk that
> people (you?) are relying on.  I do not think any piece of the FreeBSD
> system (e.g. savecore, etc.) relies on this behaviour.
>
> You're under the mentality that the information is *always* available
> after a panic/reboot -- it isn't.  I have 4 different Supermicro
> motherboards (all from different years) which will "most of the time"
> lose the msgbuf after rebooting from single-user -- but sometimes the
> msgbuf is retained.  And no, bad hardware is not responsible for the
> randomness of the problem.
>
> I think it's been discussed in the past how/why this can happen.  It has
> to do with what each BIOS manufacturer chooses to do with some parts of
> memory during start-up.  I'm sure the "Quick Boot" (e.g. no extensive
> memory test, which really doesn't test anything these days) option plays
> a role, and that option is enabled by default on all motherboards I've
> used in the past 10 years.

I've been involved with a few embedded systems, some BSD-based, some
not. In a few cases we've used custom BIOSes on the motherboard.

At least one BIOS SDK specifically describes this as a feature. What
is exposed to the end user as "Quick Boot" is actually several options
that the motherboard designer/BIOS configurer can select. One of them
specifies which chunks of memory should be preserved across a reboot,
up to "memory tests should be as non-destructive as possible".
While you can't *rely* on it, with careful use of atomic writes and
state checking you can pick up where you left off after a warm boot on
an embedded device that has no long-term storage, or gather crash
dumps and send them off over the network.
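
As a rough illustration of the "atomic writes and state checking" idea,
here is a minimal C sketch. Everything in it is hypothetical: the struct
layout, the names (warm_state, state_commit, state_resume) and the
marker value are invented for the example, and it assumes the record
lives in a region of RAM that the BIOS has been told to preserve. Real
hardware would also need cache flushes or memory barriers, which are
omitted here.

```c
#include <stdint.h>

/* Arbitrary marker meaning "record fully written"; not from any real SDK. */
#define STATE_VALID 0x53544154u

struct warm_state {
    uint32_t step;      /* progress counter to resume from */
    uint32_t step_inv;  /* bitwise complement of step, used as a check */
    uint32_t valid;     /* marker, always written last */
};

/*
 * Record progress. The marker is cleared first and set last, so a reset
 * in the middle leaves a record that state_resume() rejects. The pointer
 * is volatile so the compiler keeps the stores in program order.
 */
void state_commit(volatile struct warm_state *s, uint32_t step)
{
    s->valid = 0;
    s->step = step;
    s->step_inv = ~step;
    s->valid = STATE_VALID;
}

/*
 * After a reboot: return 1 and fill *step_out if the record survived
 * intact, or 0 if it looks clobbered (cold boot, memory test, bit rot).
 */
int state_resume(const volatile struct warm_state *s, uint32_t *step_out)
{
    uint32_t step = s->step;

    if (s->valid != STATE_VALID || step != (uint32_t)~s->step_inv)
        return 0;
    *step_out = step;
    return 1;
}
```

On a failed check the device simply starts from scratch, which is why
this only works as an optimization and never as something to depend on.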

Here, the dmesg buffer is a simple ring buffer in the kernel. The
start/end pointers and contents of the ring buffer are deliberately
not cleared after a reboot in FreeBSD to at least make the information
available if the BIOS didn't clobber it. This can be extremely useful
in those "a box on the other side of the world keeps rebooting with no
panic" cases, if you can tell the BIOS to skip the memory check. In one
case I even (ab)used some extra video RAM on the motherboard as a dmesg
buffer since the video BIOS didn't wipe it on boot, but the system's
BIOS insisted on doing a full memory test each time.

If you have a box that sometimes but not always loses the buffer,
it's probably because a few bits are being lost while the motherboard
turns the DRAM refresh off during the reboot. Some DIMMs can handle no
refresh for tens of seconds, some start popping bit errors rather
quickly. Take a look at sys/kern/subr_msgbuf.c:msgbuf_reinit. There's
a magic number it looks for as well as a checksum on the whole buffer
that's updated after every write. If anything changes, the kernel
throws out the whole buffer and starts fresh. If you boot with "-v"
you'll probably see "msgbuf cksum mismatch".
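
The magic-plus-checksum check can be sketched in a few lines of C. This
is a simplified illustration, not the code in subr_msgbuf.c: the struct
layout, the sketch_* names, the 64-byte size and the magic value are
all made up for the example, and the checksum is assumed to be a plain
sum of the buffer bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC 0x4d534742u  /* arbitrary marker, not FreeBSD's value */
#define SKETCH_SIZE  64u          /* tiny buffer, for illustration only */

struct sketch_msgbuf {
    uint32_t magic;               /* identifies a previously set up buffer */
    uint32_t wpos;                /* next write position in data[] */
    uint32_t cksum;               /* running sum of every byte in data[] */
    unsigned char data[SKETCH_SIZE];
};

/* Recompute the checksum from scratch over the whole buffer. */
uint32_t sketch_sum(const struct sketch_msgbuf *mb)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < SKETCH_SIZE; i++)
        sum += mb->data[i];
    return sum;
}

/* Throw everything away and start with an empty, valid buffer. */
void sketch_clear(struct sketch_msgbuf *mb)
{
    memset(mb, 0, sizeof(*mb));
    mb->magic = SKETCH_MAGIC;
}

/* Append one byte, adjusting the checksum for the byte being overwritten. */
void sketch_addchar(struct sketch_msgbuf *mb, unsigned char c)
{
    assert(mb->wpos < SKETCH_SIZE);
    mb->cksum += (uint32_t)c - mb->data[mb->wpos];
    mb->data[mb->wpos] = c;
    mb->wpos = (mb->wpos + 1) % SKETCH_SIZE;
}

/*
 * Called at boot on whatever is left in memory. Returns 1 if the old
 * contents passed every check and were kept, 0 if the buffer was reset.
 */
int sketch_reinit(struct sketch_msgbuf *mb)
{
    if (mb->magic != SKETCH_MAGIC ||
        mb->wpos >= SKETCH_SIZE ||
        sketch_sum(mb) != mb->cksum) {
        sketch_clear(mb);
        return 0;
    }
    return 1;
}
```

A single flipped bit in data[] changes the recomputed sum, so
sketch_reinit() resets the buffer, which matches the "sometimes kept,
sometimes lost" behavior described above.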

While I don't think any of the shipping FreeBSD tools rely on this  
behavior, I know I tried submitting a patch back in the FreeBSD-2.2(?)  
days to fix that "overlooked uninitialized buffer" and got educated by  
jkh pretty quickly. :)