loader(8) readin failed on 7.2R and later including 8.0R

Fri Dec 4 15:36:16 UTC 2009

On Thursday 03 December 2009 4:20:08 pm Hiroki Sato wrote:
> John Baldwin <jhb at freebsd.org> wrote
>   in <200912030803.29797.jhb at freebsd.org>:
> 
> jh> On Thursday 03 December 2009 5:29:13 am Hiroki Sato wrote:
> jh> > John Baldwin <jhb at freebsd.org> wrote
> jh> >   in <200912020948.05698.jhb at freebsd.org>:
> jh> >
> jh> > jh> On Tuesday 01 December 2009 12:13:39 pm Hiroki Sato wrote:
> jh> > jh> >  While the "load" command seemed to finish, the box got stuck just
> jh> > jh> >  after entering "boot" command.
> jh> > jh> >
> jh> > jh> >  Curiously, I have seen this symptom on only one of the more
> jh> > jh> >  than ten boxes I have upgraded so far; it is based on an old
> jh> > jh> >  Supermicro P4DPE motherboard[*].
> jh> > jh> >
> jh> > jh> >  [*] http://www.supermicro.com/products/motherboard/Xeon/E7500/P4DPE.cfm
> jh> > jh> >
> jh> > jh> >  Any workaround?  Booting from the release CDROMs (7.2R and 8.0R)
> jh> > jh> >  also fails.  On this box, "7.1R" or "7.1R's loader + 7.2R kernel"
> jh> > jh> >  worked fine.  It is possible that a change to loader(8) between
> jh> > jh> >  7.1R and 7.2R is the cause, but I am still not sure what it is...
> jh> > jh>
> jh> > jh> It may be related to the loader switching to using memory > 1MB for its
> jh> > jh> malloc().  Maybe try building the loader with
> jh> > jh> 'LOADER_NO_GPT_SUPPORT=yes' in /etc/src.conf?
> jh> >
> jh> >  Thanks, a recompiled loader with 'LOADER_NO_GPT_SUPPORT=yes' displayed
> jh> >  "elf32_loadimage: could not read symbols - skipped!" for 8.0R kernel.
> jh> >  This is the same as 7.1R's loader + 8.0R kernel case.
> jh>
> jh> Can you get the output of 'smap' from the loader?  Is the 8.0 kernel bigger
> jh> than the 7.x kernel?  If so, can you try trimming the 8.0 kernel a bit to see
> jh> if that changes things?
> 
>  Sure.  Output of smap on an 8.0R loader with LOADER_NO_GPT_SUPPORT=yes
>  was:
> 
> | OK smap
> | SMAP type=01 base=0000000000000000 len=000000000009f400
> | SMAP type=02 base=000000000009f400 len=0000000000000c00
> | SMAP type=02 base=00000000000dc000 len=0000000000024000
> | SMAP type=01 base=0000000000100000 len=0000000000e00000

So this is the region that ends up getting used for malloc:

	/* look for the first segment in 'extended' memory */
	if ((smap.type == SMAP_TYPE_MEMORY) && (smap.base == 0x100000)) {
	    bios_extmem = smap.length;

	...

    /* Set memtop to actual top of memory */
    memtop = memtop_copyin = 0x100000 + bios_extmem;

and then later:

#if defined(LOADER_BZIP2_SUPPORT) || defined(LOADER_FIREWIRE_SUPPORT) || defined(LOADER_GPT_SUPPORT) || defined(LOADER_ZFS_SUPPORT)
    heap_top = PTOV(memtop_copyin);
    memtop_copyin -= 0x300000;
    heap_bottom = PTOV(memtop_copyin);
#else

So memtop_copyin would start off as 0xf00000 (0x100000 + 0xe00000) but
would end up as 0xc00000 once the 3MB heap is carved off the top, and since
the kernel starts at 4MB, I think that only leaves about 8MB for the
kernel.  The loader probably needs to be more intelligent about using high
memory for malloc(): it should use the largest region > 1MB but < 4GB for
the heap instead of stealing memory from bios_extmem in the SMAP case.
Try the attached patch, which tries to make the loader smarter about
picking a memory region for the heap (warning: I haven't tested it myself
yet).
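
To make the arithmetic concrete, here is a minimal standalone sketch (this
is not the attached patch).  It redoes the calculation above for the first
SMAP entries reported in this thread and then picks the largest usable
region.  The struct layout, the constant names and both function names are
made up for illustration, and reading "> 1MB" as "base strictly above
0x100000" is an assumption on my part.

```c
#include <stddef.h>
#include <stdint.h>

#define SMAP_TYPE_MEMORY 1
#define ONE_MB      0x100000ULL
#define FOUR_GB     0x100000000ULL
#define KERNEL_BASE 0x400000ULL   /* kernel load address (4MB) */
#define HEAP_SIZE   0x300000ULL   /* 3MB carved off for the heap */

/* Hypothetical layout; the real loader structure may differ. */
struct smap_entry {
    uint64_t base;
    uint64_t length;
    uint32_t type;
};

/* Leading entries of the map reported in this thread. */
static const struct smap_entry reported_smap[] = {
    { 0x0000000000000000ULL, 0x000000000009f400ULL, 1 },
    { 0x000000000009f400ULL, 0x0000000000000c00ULL, 2 },
    { 0x00000000000dc000ULL, 0x0000000000024000ULL, 2 },
    { 0x0000000000100000ULL, 0x0000000000e00000ULL, 1 },
    { 0x0000000000f00000ULL, 0x0000000000100000ULL, 2 },
    { 0x0000000001000000ULL, 0x00000000beef0000ULL, 1 },
    { 0x00000000bfef0000ULL, 0x000000000000c000ULL, 3 },
    { 0x00000000bfefc000ULL, 0x0000000000004000ULL, 4 },
    { 0x00000000bff00000ULL, 0x0000000000080000ULL, 1 },
};
#define REPORTED_SMAP_COUNT (sizeof(reported_smap) / sizeof(reported_smap[0]))

/*
 * Existing scheme: the heap is taken from the top of the memory segment
 * that starts at 1MB.  Returns the bytes left between the kernel load
 * address and the bottom of the heap, or 0 if nothing is left.
 */
static uint64_t
kernel_room_current(const struct smap_entry *map, size_t n)
{
    uint64_t bios_extmem = 0, memtop_copyin;
    size_t i;

    for (i = 0; i < n; i++) {
        if (map[i].type == SMAP_TYPE_MEMORY && map[i].base == ONE_MB) {
            bios_extmem = map[i].length;
            break;
        }
    }
    memtop_copyin = ONE_MB + bios_extmem;
    if (memtop_copyin < KERNEL_BASE + HEAP_SIZE)
        return (0);
    memtop_copyin -= HEAP_SIZE;
    return (memtop_copyin - KERNEL_BASE);
}

/*
 * Proposed idea: find the largest memory region that starts above 1MB
 * and lies below 4GB (clipping a region that crosses 4GB).  Returns 1
 * and fills in base and len if a region was found, otherwise 0.
 */
static int
largest_high_region(const struct smap_entry *map, size_t n,
    uint64_t *base, uint64_t *len)
{
    uint64_t end, usable, best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (map[i].type != SMAP_TYPE_MEMORY)
            continue;
        if (map[i].base <= ONE_MB || map[i].base >= FOUR_GB)
            continue;
        end = map[i].base + map[i].length;
        if (end > FOUR_GB)
            end = FOUR_GB;
        usable = end - map[i].base;
        if (usable > best) {
            best = usable;
            *base = map[i].base;
            *len = usable;
        }
    }
    return (best != 0);
}
```

For the map above, kernel_room_current() gives 0x800000 (the 8MB figure),
while largest_high_region() selects the segment at 0x1000000, so a heap
placed there would no longer eat into the space below 0xf00000.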

> | SMAP type=02 base=0000000000f00000 len=0000000000100000
> | SMAP type=01 base=0000000001000000 len=00000000beef0000
> | SMAP type=03 base=00000000bfef0000 len=000000000000c000
> | SMAP type=04 base=00000000bfefc000 len=0000000000004000
> | SMAP type=01 base=00000000bff00000 len=0000000000080000
> | SMAP type=02 base=00000000bff80000 len=0000000000080000
> | SMAP type=02 base=00000000fec00000 len=0000000000010000
> | SMAP type=02 base=00000000fee00000 len=0000000000001000
> | SMAP type=02 base=00000000ff800000 len=0000000000400000
> | SMAP type=02 base=00000000fff00000 len=0000000000100000
> | OK
> 
>  Size difference between the two kernels was:
> 
> | -r-xr-xr-x  1 root  wheel   9708240 Dec  1 16:22 kernel.7/kernel
> | -r-xr-xr-x  1 root  wheel  11492703 Nov 21 15:48 kernel.8/kernel
> 
>  Then I rebuilt a smaller 8.0 kernel by removing some entries from the
>  kernel configuration file.  The size is now smaller than 7.1R kernel:
> 
> | -r-xr-xr-x  1 root  wheel  7710491 Dec  3 21:10 /boot/kernel.8X/kernel
> 
>  Loading the new kernel seemed to work fine with the recompiled 8.0R
>  loader, but it got stuck just after entering "boot":
> 
> | OK load /boot/kernel.8X/kernel
> | /boot/kernel.8X/kernel text=0x5a7664 data=0x88d74+0x82f04 syms=[0x4+0x6d290+0x4+0x987e3]
> | OK boot
> | /

I'm not sure why it would get stuck.  Can you add some debug printfs to see
how far it gets before it dies?  E.g. does it get to the point of calling
exec()?  If so, the hang is in the kernel in locore.S rather than in the
loader.

-- 
John Baldwin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: loader_heap.patch
Type: text/x-patch
Size: 5508 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20091204/ea82cfe4/loader_heap.bin