svn commit: r308089 - in head

Bruce Evans brde at optusnet.com.au
Thu Mar 9 08:29:02 UTC 2017


On Wed, 8 Mar 2017, Julian Elischer wrote:

> On 7/3/17 4:48 pm, Toomas Soome wrote:
>>> ...
>> The problem is deeper, the idea behind the nextboot is that it is 
>> attempting to provide recovery from failed boot, so if you set nextboot 
>> dataset, attempt to boot from it, you need to do 2 things: 1. detect the 
>> nextboot config, so you would actually be able to use it, and 2, you want 
>> to reset it as early as possible, because later you may not have a chance.
>> 
>> So it means the gptzfsboot has to read out the config to know where from it 
>> has to load the zfsloader, and gptzfsboot has to reset the config, so that 
>> if anything will go wrong, on next boot the fallback or “normal” boot 
>> will be done. Which means that either gptzfsboot has to know how to deal 
>> with geli in context of handling nextboot, or with geli, you just can not 
>> use nextboot config.
>> 
>> The similar issue is with using boot block area in zfs pool label - to be 
>> able to store and use gptzfsboot in pool label boot area, the boot1 either 
>> has to know how to read the geli, or geli must be able not to encrypt the 
>> bootblock area, or we just can not use that area [with geli]. All in all, 
>> it is another example of the chicken and the egg issue:)
>
> this is why the ORIGINAL nextboot in freebsd 3 (+ or -) wrote the data into 
> block 1 of the drive and read it from boot0, and rewrote block 1 after 
> zeroing out the entry.
> All using bios calls.
> 1/ read and remove ASAP,
> 2/ don't depend on the filesystem.. it may be dead, and that is why we are 
> redirecting somewhere else.

I didn't like this method.  Anything that writes to the disk increases
fragility.  I still use my version of biosboot (updated for elf and EDD)
and try not to see the nextboot code in it (I can't delete this code
since it would then show up even more in diffs).

My method (used mostly only interactively) is to depend on a filesystem
for boot2 and things loaded by boot2.  It is easy to maintain such a
file system somewhere (perhaps on removable read-only media), but can
be hard to find it or ensure that it is the one booted (this often bites
me when bringing up a new system from a USB drive; first it can be hard
to boot from USB, and then I forget how the standard boot0 misbehaves
and hit F5 which tends to switch to a Windows drive with an unusable MBR
on it).

Once a known-good partition (with a good file system including boot2 or
even loader on it) is found, it is easy to control things by booting
to the kernel on that.  My method for cycling through kernels to run
benchmarks on them is to copy the next kernel to a standard place
and boot to there.  The kernel is selected by an index in a text file.
Cycling is done by incrementing the index.  I only do this for rebooting
within a single partition, but could immediately use a variation that
switches between i386 and amd64 partitions.
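
A minimal sketch of the index-in-a-text-file cycling described above.  It
is an illustration only, not the actual script: the function name, the file
layout (a single decimal number in the file) and the temp-file-plus-rename
update are assumptions.  Copying the chosen kernel to the standard place is
left to the caller.

```python
import os
import tempfile


def next_kernel(index_path, kernels):
    """Return the kernel to boot next and advance the stored index.

    index_path and kernels are hypothetical: a text file holding one
    decimal index, and the list of kernel names to cycle through.
    """
    try:
        with open(index_path) as f:
            index = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        index = 0  # missing or garbled index: restart the cycle
    index %= len(kernels)
    chosen = kernels[index]

    # Write the incremented index to a temporary file and rename it over
    # the old one, so the index file always holds a complete number.
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{(index + 1) % len(kernels)}\n")
    os.replace(tmp_path, index_path)
    return chosen
```

Each call returns the kernel for this boot and leaves the index pointing at
the next one, so repeated reboots walk through the list and wrap around.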

To use this method for nextboot, the main cycle would have to be of
length 2, to switch between a known-good partition and the next try on
a not-known-good partition (after booting to the known-good partition,
it is easy to clean up the other partition and copy a hopefully-better
kernel or even a whole filesystem to it).

The index (of 0 or 1) for the main cycle can't be stored in the known-good
filesystem since after a crash booting a bad partition there is no safe
way to update the index there.  On x86, there should be space reserved
for OS use in CMOS, but unfortunately nothing is properly reserved there.
I have thought of using the alarm register[s] for this.  The alarm
register[s] are not normally used, and for cycles of nextbooting the OS
could simply turn off alarms and use them.  Add ECC to this and it becomes
very safe to use them.  At worst, another OS might boot in the middle of
the cycle and change the alarm setting.  Just boot to the known-good
partition when ECC detects an error, and trust ECC to detect the unlikely
interruption and change.
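
A minimal sketch of how a 0/1 cycle index might be protected in the three
alarm bytes, with a fallback to the known-good choice when the check fails.
Everything specific here is an assumption for illustration: the two
codewords, the correction limit, the function names, and the premise that
the alarm bytes can hold arbitrary values while alarms are disabled.  It
uses a simple two-codeword code rather than any particular ECC, and it does
not touch real hardware.

```python
# Hypothetical codewords, bitwise complements of each other, so they
# differ in all 24 bits.
CODEWORDS = {
    0: bytes([0xA5, 0x5A, 0xC3]),
    1: bytes([0x5A, 0xA5, 0x3C]),
}
MAX_CORRECTABLE = 3  # assumed limit on bit flips to repair


def encode_cycle(value):
    """Return the three bytes to store for cycle index 0 or 1."""
    return CODEWORDS[value]


def _distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def decode_cycle(alarm_bytes):
    """Return the stored index, or 0 (known-good) if the bytes look wrong."""
    if len(alarm_bytes) != 3:
        return 0
    for value, word in CODEWORDS.items():
        if _distance(alarm_bytes, word) <= MAX_CORRECTABLE:
            return value
    # Too far from both codewords, e.g. another OS set a real alarm.
    return 0
```

With these codewords an all-zero alarm setting is 12 bits away from either
codeword, so it decodes to the known-good choice rather than to a guess.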

Recently, I noticed that the entire msgbuf survives rebooting on amd64
systems with 16GB memory.  The msgbuf is in high memory for amd64 and
this seems to survive warm boots and is not clobbered by the BIOS.
But the BIOS on the same system scribbles over almost all memory below
4GB.  It leaves large portions of the msgbuf intact, but the msgbuf
is protected by a stupid 32-bit checksum and always detects the clobbering.
Change this to ECC and apply it to individual messages and we can recover
most of the message buffer on i386, without expanding it much.  Add
redundancy/ECC to recover more of it.  It also has atomic update problems
that are best fixed by redundancy and localized checksums, which could be
ECC except that that is a bit over-engineered.
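
A minimal sketch of the per-message checksum idea: each message carries its
own header and checksum, so clobbering one message does not invalidate the
rest.  The record layout (a 2-byte marker, a 2-byte length and a CRC32 of
the payload) and the function names are assumptions for illustration, not
the real msgbuf format, and a plain CRC32 only detects damage; it does not
correct it the way ECC would.

```python
import struct
import zlib

MAGIC = b"\xfe\xed"  # hypothetical record marker
HEADER = struct.Struct("<2sHI")  # marker, payload length, CRC32 of payload


def pack_message(text):
    """Frame one message with its own header and checksum."""
    payload = text.encode()
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def recover_messages(buf):
    """Return every message in buf whose checksum still matches."""
    recovered = []
    pos = 0
    while pos + HEADER.size <= len(buf):
        magic, length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = bytes(buf[start:start + length])
        if magic == MAGIC and len(payload) == length and zlib.crc32(payload) == crc:
            recovered.append(payload.decode(errors="replace"))
            pos = start + length  # skip past the intact record
        else:
            pos += 1  # damaged here: resynchronize on the next marker
    return recovered
```

Scanning byte by byte after a failed check is what lets the reader pick up
again at the next intact record instead of giving up on the whole buffer.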

If we can robustly preserve entire msgbufs across reboot, then it is
trivial to preserve a single status bit for nextboot.  Just not so easy
to do ECC for this bit in 512 bytes in boot0.  The memory bit could be
for extra checking of the RTC registers, with simple checksums instead
of ECC on both.

> the current nextboot is not nearly as useful and needs to be replaced as soon 
> as possible as a failed experiment.
> things we could do to improve nextboot functionality:
> 1/ declare a partition type freebsd-bootinfo that is just raw boot info.

Ugh, this is as bad as Windows using multiple precious (non-GPT) partitions
for itself.  My method needs a known-good partition, but this can be a
FreeBSD partition.  My method would actually have to put the decisions into
boot2, since there is not enough space in boot0, and avoid using loader:
- boot0: boot to known-good slice, say ad4s3
- boot2: this actually lives at the start of the slice, not in a partition,
   so is easy to find.  One reason I don't like loader is that it lives on
   a file system, so deferring decisions to it is not so robust.
- boot2 has to find the right drive and slice.  This is not so easy.  The
   default of the first FreeBSD slice is wrong in some of my configurations.
   This can be made more robust by hard-coding values in the boot2 binary.
- boot2 then has to find the right partition.  It almost always defaults
   to the 'a' partition, and loads /boot.config from there to possibly
   override the default.  It is safest to always make the known-good
   partition 'a'.
- before reading boot.config, boot2 does ECC on various places to find
   overrides.  If nextboot is inactive or the cycle-control value is 0,
   it reads boot.config from the known-good partition.  Otherwise, it
   reads boot.config from somewhere else, say the 'b' partition
   or just an alternative boot.config on the known-good partition.  It
   can contain anything, so the next try is not limited to 1 alternative.
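
The last step of the list above can be sketched as a small decision
function.  This is an illustration under stated assumptions, not boot2
code: the two paths, the function name, the rule that every checked place
must agree on 1, and the choice to always store 0 for the following boot
(so that a failed try falls back) are all hypothetical.

```python
KNOWN_GOOD_CONFIG = "a:/boot.config"        # hypothetical path
ALTERNATIVE_CONFIG = "a:/boot.config.alt"   # hypothetical path


def select_boot_config(cycle_values):
    """Pick the boot.config to read and the cycle value to store next.

    cycle_values holds one entry per place checked (for example the RTC
    alarm bytes and a preserved memory bit): 0 or 1 if that place passed
    its check, None if the check failed.
    """
    agree_on_try = bool(cycle_values) and all(v == 1 for v in cycle_values)
    if agree_on_try:
        # Reset the value before booting the alternative, so that a hang
        # or crash leads back to the known-good config on the next boot.
        return ALTERNATIVE_CONFIG, 0
    # Inactive, value 0, a failed check, or disagreement: stay safe.
    return KNOWN_GOOD_CONFIG, 0
```

Any doubt about the stored value, including having no readable place at
all, selects the known-good boot.config.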

> 2/ store the info in a known place in the freebsd-zfs partition (what andriy 
> is doing I believe)
> 3/ store it at the end of the freebsd-boot partition.
> It should be read by gptzfsboot and set into the environment (what comes 
> earlier in a gpt system?)  originally I read it using bios calls from boot0.
> that was of course a UFS system on a dedicated drive.

It is fundamentally insecure to allow booting from all over the place,
especially when selected by simple binary values written to raw drives.
My method actually works well to fix this.  Better write the alternative
boot.config to the same known-good partition as the non-alternative (this
is only impossible if the known-good partition is read-only).  Then the
simple binary value is not very robust, but all it does is select between
the known-good boot.config and the alternative one.  Security is attained
by never pointing the alternative one to an insecure partition.

Bruce

