[amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

Thu Nov 17 08:59:44 UTC 2011

On 16.11.2011 17:16, John Baldwin wrote:
> On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
>> ...
>> WARNING: WITNESS option enabled, expect reduced performance.
>> Table 'FACP' at 0xba918a58
>> Table 'APIC' at 0xba918b50
>> Table 'SSDT' at 0xba918be8
>> Table 'MCFG' at 0xba918dc0
>> Table 'HPET' at 0xba918e00
>> ACPI: No SRAT table found
>> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
>> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
>> kldload: unexpected relocation type 67108875
>> kernel trap 12 with interrupts disabled
>>
>> The irritating detail is the load address of "zfs.ko", which is just
>> 0x370 bytes above the kernel load address ...
> 
> That isn't unusual.  Those are the addresses of the metadata provided by the 
> loader, not the base address of the kernel or zfs.ko object themselves.  The 
> unexpected relocation type is interesting however.  That value in hex is 
> 0x400000b.  0xb is the R_X86_64_32S relocation type which is normal for the 
> kernel.  I think you just have a single-bit memory error due to a failing 
> DIMM.

Thanks for the information about the load address semantics. The other
unexpected relocation type I observed was 268435457 == 0x10000001, which
also hints at a single-bit error (0x1 is a valid relocation type, with
one extra high bit set). But today the system failed with a different
error:

ath0: ...
ioapic0: routing interrupt 18 to ...
panic: vm_page_insert: page already inserted

This could of course also be caused by a single-bit error ...
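
The single-bit reading of the two relocation values can be checked
mechanically. The following is only a sketch: the function name is made
up for illustration, and the table lists just a few relocation type
numbers from the x86-64 ELF ABI, not the full set the kernel linker
accepts. It flips each of the 32 bits of the observed value in turn and
reports which flips land on a known type.

```python
# Subset of relocation type numbers from the x86-64 ELF ABI.
# (Illustrative subset; the kernel linker handles more than these.)
R_X86_64_TYPES = {
    0: "R_X86_64_NONE",
    1: "R_X86_64_64",
    2: "R_X86_64_PC32",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}


def single_bit_candidates(observed, valid=R_X86_64_TYPES, width=32):
    """Return (bit, name) pairs: flipping `bit` in `observed` gives a valid type.

    Hypothetical helper, not part of any FreeBSD tool.
    """
    hits = []
    for bit in range(width):
        candidate = observed ^ (1 << bit)
        if candidate in valid:
            hits.append((bit, valid[candidate]))
    return hits


# The two values reported by kldload in this thread.
for value in (67108875, 268435457):
    print(hex(value), single_bit_candidates(value))
# 0x400000b [(26, 'R_X86_64_32S')]
# 0x10000001 [(28, 'R_X86_64_64')]
```

Both values are exactly one bit away from a plausible relocation type,
but in different bit positions (26 and 28), so this alone does not
point at one particular stuck data line.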

The strange thing, though, is that the system runs perfectly stable
under load (e.g. "make -j8 world") and that the ZFS ARC grows to some
6GB (of 8GB RAM installed). With a bad DIMM, I'd expect ZFS checksum
errors to occur.

Anyway, I'll run memtest86+ (or whatever best supports my system with
8GB RAM) overnight.

The system boots reliably when switched off for less than a few hours
(I haven't determined the exact limit, but 3 hours are not sufficient
to reproduce the boot failure, while 10 hours cause the first boot
attempt to fail with 90% likelihood; the second one always succeeds).

I'm wondering whether the system RAM is not correctly initialized
after being powered off for 10 hours (but I do not understand why
3 hours should not lead to the exact same initial state). BTW: It
suffices to leave the system in power state S5 for 10 hours to cause
the boot failure, while after less than 3 hours (without any power or
at S5) the boot succeeds on the first attempt.

>> I had already assumed that memory was corrupted during early start-up,
>> but now I think that gptzfsboot writes the zfs kernel module over the
>> start of the loaded kernel. I'll try some more tests later today.
> 
> Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even 
> get to the point of the first kernel printf.

Yes, I see: in that case the failure would be less random than what I
observe (3 different kinds of panic, and different warning messages
before the panic occurs).

But I still do not understand how the symptoms can be interpreted:

1) The system booted reliably for many months
2) It boots reliably when powered off for only a few hours
3) It fails on the first boot attempt after 10 hours or more
4) It never shows signs of instability after a successful boot

Hmmm, perhaps there is a problem with components at room temperature
and the system is still significantly warmer after 3 hours?

I'll have to check for such a thermal effect too ...

Best regards, STefan