6.1 kernel unable to find /dev ?

Sun Jun 4 03:35:31 UTC 2006

On Sat, 3 Jun 2006, Doug White wrote:
>
> This is usually indicative of bad RAM or a faulty processor. Since
> you seem to be having disk problems, it may just be due to the disk
> returning faulty data. Or there is a bad kernel module in the mix
> that is randomly corrupting data.

    Thanks for taking the time to comment, Doug.  Yeah, flaky hardware
seems to be a common cause, but I'm able to run hundreds of
buildworld/buildkernel loops for days at a time without a hiccup.  Boot
into a newer kernel, though, and the system stays up only for the 8
seconds it takes the kernel to try to hand things over to init.

> My gut feeling is that there is still a disconnect on what the root
> filesystem is. That or there is hidden corruption that 6.0 isn't
> noticing that 6.1 is.  Here's what I'd do next:
>
> 1. Capture the boot output from both the working 6.0 kernel and your
> broken 6.1 kernel and compare the two. If there are differences or
> errors being returned from the ATA controller or disks then those
> will need to be addressed.

    I have the boot messages logged from 6.0-RELEASE-p6 and
6.1-RELEASE-p1, and there are no salient differences, at least in the
ATA probes.  I do see things like an extra "AMD Features2=0x1<LAHF>"
for 6.1 that was not there in 6.0, and a bunch of "pci_link[n]" (where
[n] goes from 0 to 11) messages in 6.0 that aren't in 6.1.
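
    For anyone who wants to repeat the comparison, here is a rough
sketch of how two saved boot logs could be diffed, optionally limited
to lines containing a keyword.  The function name boot_log_diff and
the sample log lines are made up for illustration (only the "AMD
Features2" text comes from the logs above); real input would be the
captured dmesg output from each kernel.

```python
import difflib


def boot_log_diff(old_lines, new_lines, keyword=None):
    """Return the lines that appear in only one of two boot logs.

    Lines unique to the old log come back prefixed with "- ", lines
    unique to the new log with "+ ".  If keyword is given, only lines
    containing it (case-insensitive) are compared.
    """
    def keep(lines):
        if keyword is None:
            return list(lines)
        needle = keyword.lower()
        return [line for line in lines if needle in line.lower()]

    delta = difflib.ndiff(keep(old_lines), keep(new_lines))
    # ndiff also emits "  " (unchanged) and "? " (hint) lines; drop them.
    return [line for line in delta if line.startswith(("- ", "+ "))]


# Hypothetical sample data, not real boot output.
old_log = ["ad4: placeholder disk probe line", "pci_link0: placeholder"]
new_log = ["ad4: placeholder disk probe line", "AMD Features2=0x1<LAHF>"]

for line in boot_log_diff(old_log, new_log):
    print(line)
```

Passing keyword="ad4" (or "ata") narrows the comparison to the disk
probe lines, which is the part that matters for a root mount problem.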

> 2. Try a splat-over reinstall of 6.1-R from CD to force everything
> to match up. Mount the filesystems but don't mark them to be
> newfs'd. Install the GENERIC kernel only.

    I was thinking of a similar approach:  do a plain-jane 6.1
install to another drive, then attempt to mount the root filesystem
from the 6.0-friendly drive and see what happens.  If it mounts fine,
then I'm not much further ahead.  But if it doesn't mount, I'm hoping
I'll see some sort of clue as to why.

> If you are going to be tracking a branch, please read the
> instructions at the end of src/UPDATING on how to perform the build.
> There is a specific procedure and not following it can cause
> significant issues. While unlikely, it is possible to irreparably
> damage the system by not following the instructions to the letter.

    Yes, I keep an eye on /usr/src/UPDATING and follow the
instructions at the bottom, but I haven't seen anything in there to
suggest there might be an issue upgrading from 6.0 to 6.1.

    Hrm... here's something different:  when I try to boot 6.1p1 into
"safe mode", I get this:

| Trying to mount root from ufs:/dev/ad4s1a
| exec /sbin/init: error 8
| exec /sbin/init.bak: error 13
| exec /rescue/init: error 20
| init: not found in path
| /sbin/init:/sbin/oinit:/sbin/init.bak:/rescue/init:/stand/sysinstall
| panic: no init
| Uptime: 1s
| Cannot dump. No dump device defined.
| Automatic reboot in 15 seconds - press a key on the console to abort
| --> Press a key on the console to reboot,
| --> or switch off the system now.

    errno 8 is ENOEXEC and 13 is EACCES... but:

# file /boot/kernel/kernel /sbin/init /sbin/init.bak
/boot/kernel/kernel: ELF 64-bit LSB executable, AMD x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), not stripped
/sbin/init:          ELF 64-bit LSB executable, AMD x86-64, version 1 (FreeBSD), statically linked, stripped
/sbin/init.bak:      ELF 64-bit LSB executable, AMD x86-64, version 1 (FreeBSD), statically linked, stripped
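
    The fields that file(1) reports above can also be read straight
from the first 20 bytes of the ELF header, which is handy for
checking a binary on a disk mounted from another system.  This is
only a minimal sketch:  describe_elf is a made-up helper name, and
the lookup tables cover just a few common values, not the full ELF
specification.

```python
import struct

# Partial lookup tables; values not listed here are reported as numbers.
ELF_CLASS = {1: "32-bit", 2: "64-bit"}
ELF_DATA = {1: "LSB", 2: "MSB"}
ELF_OSABI = {0: "SYSV", 3: "Linux", 9: "FreeBSD"}
ELF_TYPE = {1: "relocatable", 2: "executable", 3: "shared object"}
ELF_MACHINE = {3: "Intel 80386", 62: "AMD x86-64"}


def describe_elf(header):
    """Decode the identification fields from the start of an ELF image."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    ei_class, ei_data, ei_osabi = header[4], header[5], header[7]
    # e_type and e_machine are 16-bit fields at offsets 16 and 18,
    # stored in the byte order named by EI_DATA.
    endian = "<" if ei_data == 1 else ">"
    e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return {
        "class": ELF_CLASS.get(ei_class, ei_class),
        "data": ELF_DATA.get(ei_data, ei_data),
        "osabi": ELF_OSABI.get(ei_osabi, ei_osabi),
        "type": ELF_TYPE.get(e_type, e_type),
        "machine": ELF_MACHINE.get(e_machine, e_machine),
    }


# Synthetic header for demonstration: 64-bit, little-endian, FreeBSD
# OSABI, executable, x86-64.
sample = b"\x7fELF" + bytes([2, 1, 1, 9]) + bytes(8) + struct.pack("<HH", 2, 62)
print(describe_elf(sample))
```

On a real system the input would be something like
open("/sbin/init", "rb").read(20).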

    Binary formats and kernel architecture match... is there another
reason why an ENOEXEC would come up?
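
    For reference, the three error numbers in the panic output can be
decoded with a few lines of Python.  This assumes the script runs on
FreeBSD (or another system that shares these low errno values), since
the numeric values are platform-specific; describe_errno is just an
illustrative helper name.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


# The codes reported for /sbin/init, /sbin/init.bak and /rescue/init.
for code in (8, 13, 20):
    name, message = describe_errno(code)
    print(code, name, message)
```

The third code, 20, decodes to ENOTDIR, which was not mentioned
above.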
-- 
Brian Tao (BT300, taob at risc.org)
"Though this be madness, yet there is method in't"