How to troubleshoot a frozen boot sequence

Billy Newsom billy at nlcc.us
Mon Jan 25 01:10:30 UTC 2010


I am not sure why, but here was my solution.

Through a lot of poking, I examined the Master Boot Record of each 
drive. Here is what I found out:

1. My backup drive (ad0) had the FreeBSD boot manager installed.
2. My main drive (twed0) had the FreeBSD MBR installed.

So, what is the problem? All I could figure was to install the boot manager 
(boot0, installed with boot0cfg) onto my main drive. Silly, but it worked.

Why, I don't have a clue. I do, by the way, remember purposely using this 
setup when I ran sysinstall to configure this machine. I felt that the ad0 
drive needed a boot manager (just in case it was used someplace else) and the 
main drive would not need a boot manager. But nothing ever indicated to me 
that a standard MBR on twed0 would not work if ad0 was missing.

Here is my partition table from twed0:
# /dev/twed0
g c60801 h255 s63
p 1 0xa5 63 976768002
a 1

Notice there is just one partition and it is active. But it wouldn't boot 
until I ran:
boot0cfg -B twed0

which keeps the slice table the same.
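
For anyone repeating this, the sequence can be sketched roughly as below. This is only an outline under assumptions: the listing above looks like the output of fdisk -p, the backup file path is made up for illustration, and the run/fix_mbr helpers are hypothetical names. By default it only prints the commands (DRY_RUN=1), since rewriting boot code on the wrong disk is hard to undo.

```shell
#!/bin/sh
# Sketch only: inspect, back up, then rewrite the MBR boot code on one disk.
# DISK and the backup path are illustrative assumptions, not from the post.
DISK="${DISK:-twed0}"
DRY_RUN="${DRY_RUN:-1}"     # default: print the commands, do not run them

run() {
        echo "+ $*"
        [ "$DRY_RUN" = "1" ] || "$@"
}

fix_mbr() {
        # Print the slice table in fdisk configuration-file format.
        run fdisk -p "$DISK"
        # Save the first sector (the MBR) before touching it.
        run dd if="/dev/$DISK" of="/root/$DISK.mbr" bs=512 count=1
        # Install the boot0 boot manager; -B leaves the slice table alone.
        run boot0cfg -B "$DISK"
        # Show the resulting boot0 settings.
        run boot0cfg -v "$DISK"
}

fix_mbr
```

Run it once as-is to review the commands, then with DRY_RUN=0 as root to actually apply them.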

Once I was done, the server would boot with or without the ad0 drive. To 
cope with a backup drive failure, I also had to mess with fstab:

1. I had to add the "noauto" option, as someone suggested.
2. I had to set the fsck pass number to 0 (3 didn't work); otherwise an fsck 
failure drops the boot into single-user mode.
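
With both changes, the /disk250 line would presumably read something like "/dev/ad0s1d /disk250 ufs rw,noauto 0 0" (a reconstruction from the two points above, not a copy of the real file). A small check along these lines can confirm an entry is safe before rebooting; check_fstab is a hypothetical helper name, and the second argument exists only so it can be tried against a test file.

```shell
#!/bin/sh
# Sketch: confirm that a mount point's fstab entry has "noauto" and fsck pass 0.
# Fields in fstab: device, mount point, type, options, dump, pass.
check_fstab() {
        # $1 = mount point, $2 = fstab path (defaults to /etc/fstab)
        awk -v mp="$1" '
                $1 !~ /^#/ && $2 == mp {
                        found = 1
                        if ($4 !~ /(^|,)noauto(,|$)/) {
                                print mp ": missing noauto"; bad = 1
                        }
                        if ($6 + 0 != 0) {
                                print mp ": fsck pass is " $6; bad = 1
                        }
                }
                END {
                        if (!found) { print mp ": no fstab entry"; exit 2 }
                        exit bad + 0
                }' "${2:-/etc/fstab}"
}
```

Calling check_fstab /disk250 returns 0 when the entry is fine, 1 when an option is wrong, and 2 when no entry exists.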

My question now is: should the script that mounts the drive during boot (too 
late, I already wrote one) also run fsck? I am not sure how fsck should be 
run, but I assume it is kind of important.

My main challenge was determining when to mount the disk. Here is my solution 
and my script so far that seems to work.

=====================
#!/bin/sh
# mounts my special drive
# TODO: Need to fsck it

# PROVIDE: mountbackup
# REQUIRE: mail
# KEYWORD: nojail

. /etc/rc.subr

name="mountbackup"
start_cmd="mountbackup_start"
stop_cmd=":"

THIS="/disk250"
HOSTNAME=`/bin/hostname`
MAILTO=root@${HOSTNAME}
TOD=`/bin/date`

mountbackup_start()
{
         local err
         # Mount "backup" filesystems.
         echo -n "Mounting $THIS Backup filesystems"
         mount $THIS
         err=$?
         echo '.'

         case ${err} in
         0)
                 ;;
         *)
                 echo "Mounting $THIS filesystems failed," \
                     " but it's okay for now. Sending mail to $MAILTO"
                 # Group the message lines in a subshell and pipe them to mail.
                 (echo " Mounting $THIS filesystems failed on boot!"
                  echo " "
                  echo "Host: $HOSTNAME      Date: $TOD") | \
                  mail -s "FAILURE to mount $THIS on $HOSTNAME" $MAILTO
                 ;;
         esac
}
load_rc_config $name
run_rc_command "$1"

=====================
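
One way the "TODO: Need to fsck it" in the script above might be filled in is sketched here. It assumes that fsck -p (preen mode) is acceptable for this disk and that fsck will check a filesystem named explicitly on the command line even when its fstab pass number is 0; both are assumptions to verify on the real machine. fsck_and_mount is a hypothetical helper, and the FSCK/MOUNT variables exist only so the logic can be tried without real disks.

```shell
#!/bin/sh
# Sketch: preen the backup filesystem, then mount it only if fsck succeeded.
# FSCK and MOUNT are overridable for testing; they default to the real tools.
FSCK="${FSCK:-fsck}"
MOUNT="${MOUNT:-mount}"

fsck_and_mount() {
        # $1 = mount point listed in /etc/fstab (for example /disk250)
        # -p "preens": repairs only safe inconsistencies, fails on anything worse.
        if ! $FSCK -p "$1"; then
                echo "fsck of $1 failed; leaving it unmounted" >&2
                return 1
        fi
        $MOUNT "$1"
}
```

In mountbackup_start, the plain "mount $THIS" line could then become "fsck_and_mount $THIS", with the existing err=$? handling left as it is so that a failed check still triggers the mail.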

Billy Newsom wrote:
> Nathan Vidican wrote:
>  > To me, it sounds like you have two issues to deal with here:
>  >
>  > #1 - booting off of the twed0 disk, what is your systems' BIOS currently
>  > set to boot from, from the way you describe it's almost as if the system
>  > is booting from ad0 - in which case yes, you will have to put a valid
>  > boot config onto twed0
> 
> I feel that I have run across a common and old "SCSI v IDE" battle (The 
> FreeBSD Handbook still talks about it). Even though I make the drive 
> controller (the twe = 3Ware SATA controller) as my first boot drive in 
> BIOS (effectively 0x80 as I understand it), FreeBSD does not ever pay 
> attention to the BIOS's numerical order. (See my reason below*) It wants 
> to find stuff on ad0 and boot that drive if it exists.
> 
> My supposition is that since I had twe0 and ad0 running during my 7.2 
> install, that the correct drive partition and MBR stuff were applied to 
> get it to boot AS-IS, but...
> 
> When the setup is not as it is now, it freezes at the boot loader, 
> attempting to find ad0.
> 
> It is either
> 
> a. Finding ad0 in fstab and really wishing it was there
> or
> b. The boot strap code is physically on ad0 and not twed0 because the 
> Sysinstall process never wrote it there.
> 
> I think it is b. If b, the boot process may be:
> 
> Stage 1: BIOS picks twe0 to be the first drive to attempt a boot.
> Stage 2: MBR (boot 0) -- located on twe0
> Stage 3: boot1 -- located on twed0 (BTX Boot Loader?)
> Stage 4: boot2 -- located on ad0 (FreeBSD/i386 bootstrap loader 1.1?)
> Stage 5: Boot Loader -- shows menu on twed0s1a
> Stage 6: Kernel boots up on twed0s1a
> 
> And so when I remove ad0 to simulate a backup drive failure, the stage 4 
> tries to run a missing bootstrap loader from twed0.
> 
> Stage 4: boot2 -- missing on twed0, system hangs.
> 
> I think this is happening because it is the BTX loader which may find 
> and concatenate the BIOS drives, getting confused, and switching the 
> boot to ad0 for just the one stage that finishes the bootstrap.
> 
> I think one solution is to (next time) not install my backup drive until 
> after Sysinstall is long done! I think it's a sysinstall bug, some of this.
> 
> * My Reason for saying that is my guess that the sysinstall program saw 
> the ad0 as something important, and included it in the chain of the 
> boot. For example, when I was done SLICING my drives in Sysinstall, the 
> silly thing then got the "w" write command and went out there and made 
> some (wrong) decisions under the assumption that ad0 would NATURALLY 
> (via BIOS) be part of the boot process. So the right code never got 
> written to twe0 in the right places. Sure, it got all the kernel and I 
> told it to put a standard FreeBSD MBR, but it must be missing something 
> on track 0.
> 
>  > #2 - you could add the flag 'noauto' to ad0 from within fstab - this
>  > will allow the system to boot without mounting the disk (alleviating the
>  > dreaded single-user-mode). Use a startup script in /usr/local/etc/rc.d
>  > to then mount the disk if available on bootup. I've done similar setups
>  > to this before where we were using external USB drives for backup and
>  > weren't 100% sure they'd always be connected in the case a server might
>  > be rebooted - worst case, you'll end up with it not mounted, but the
>  > system will still be up at least.
> 
> I will give it a try. I need to do something to correct this second 
> issue for certain. My ad0 is a good spare, but it's old.
> 
>  > --
>  > Nathan Vidican
>  > nathan at vidican.com
>  >
>  >
>  > On Fri, Jan 22, 2010 at 12:53 PM, Billy Newsom <billy at nlcc.us> wrote:
>  >
>  >     I am doing a test run on a production server. It has 2 hard drives.
>  >
>  >     ad0 (mounted on /disk250 in a single slice plus SWAP)
>  >     twed0 (mounted on / /var /usr and a SWAP)
>  >
>  >     The twed0 is a hardware mirror and my main drive.
>  >     ad0 is just for backups.
>  >
>  >     What the issue is, and you probably know where I'm heading. The boot
>  >     process freezes if I remove the ad0 (to test a drive failure
>  >     condition)
>  >
>  >     It freezes after saying:
>  >     BTX boot loader.... etc.
>  >
>  >     FreeBSD/i386 bootstrap loader 1.1
>  >     It spins for a second, then stops... unless I have ad0 in the
>  >     computer.
>  >     /boot/kernel/kernel text=0x7b03a0 data=0xcdee0 /
>  >
>  >     And it never gets to the boot menu.
>  >
>  >     So:
>  >
>  >     1. Should I put a new boot0config on the twed0 drive? If so do I
>  >     boot from a CD to do that?
>  >
>  >     I need to potentially do something also to my disk labels and my
>  >     fstab so that I don't boot to single user mode if drive ad0 fails. I
>  >     haven't done this exact type of thing before, so I am looking for a
>  >     little help.
>  >
>  >     my fstab:
>  >     /dev/ad0s1b      none       swap    sw         0  0
>  >     /dev/twed0s1b    none       swap    sw         0  0
>  >     /dev/twed0s1a    /          ufs     rw         1  1
>  >     /dev/ad0s1d      /disk250   ufs     rw         2  2
>  >     /dev/twed0s1e    /tmp       ufs     rw         2  2
>  >     /dev/twed0s1f    /usr       ufs     rw         2  2
>  >     /dev/twed0s1d    /var       ufs     rw         2  2
>  >     /dev/acd0        /cdrom     cd9660  ro,noauto  0  0
>  >
>  >
>  >     I tried to read the MBR from the twed0 drive, and the program
>  >     couldn't read it. The one from the ad0 drive is readable and I saved
>  >     a copy of it.
> 


