Where can I get help for debugging system crash ?

Valeri Galtsev galtsev at kicp.uchicago.edu
Wed Dec 21 18:12:07 UTC 2016


On Wed, December 21, 2016 11:49 am, Matthew Seaman wrote:
> On 21/12/2016 14:00, Manish Jain wrote:
>> I am running a FreeBSD 11 amd64 box. The box generally works well, but
>> every once in a while (about once a month), the system crashes, leaving
>> a large core file in /var/crash. I had a crash yesterday. The info.0 for
>> the last core reads:
>>
>> Dump header from device: /dev/ada0p3
>>    Architecture: amd64
>>    Architecture Version: 2
>>    Dump Length: 1012834304
>>    Blocksize: 512
>>    Dumptime: Tue Dec 20 19:05:28 2016
>>    Hostname: bourne.1dent1ty
>>    Magic: FreeBSD Kernel Dump
>>    Version String: FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29 01:43:23 UTC 2016
>>      root at releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC
>>    Panic String: page fault
>>    Dump Parity: 4025560426
>>    Bounds: 0
>>    Dump Status: good
>>
>> /dev/ada0p3 corresponds to my swap partition. My box has 2 solid state
>> disks, which provide ada0p1 (efi), ada0p2 (ufs), ada0p3 (swap), ada0p4
>> (ufs) and ada1s1 (ufs).
>>
>> I need help determining exactly what is causing the crash: is it a
>> hardware problem or an issue in the FreeBSD code? If anyone can point
>> me to the right channel, I will be grateful indeed.
>
> Hi, Manish,
>
> The best thing to do here is to open a PR with what details of the crash
> you can extract from the core dump.  You have a full system core, so you
> should be able to follow the instructions here, and extract a backtrace
> from the kernel:
>
> https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-gdb.html
>
> Alternatively, a textdump produces a textual report with lots of
> debugging information instead of a full memory dump.  It has to be set
> up before the crash and requires a modified kernel configuration,
> though.  See textdump(4) and
> http://www.etinc.com/122/Using-FreeBSD-Text-Dumps for details.
>
> If you can pin the problem down to a particular subsystem or device,
> then that should indicate which mailing list would be a good choice to
> discuss the problem.  If it doesn't appear to be in any device or
> sub-system specific part of the kernel, then try asking on
> freebsd-stable at ...

Thanks Matthew, this is very instructive.
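
If it helps when filling in the PR, the header fields from info.0 can be
pulled into a structured form. A minimal sketch in Python, assuming the file
keeps the "Key: value" layout quoted above; the name parse_dump_info and the
shortened sample text are illustrative only and not part of any FreeBSD tool:

```python
# Hypothetical helper, not part of savecore(8) or any other FreeBSD tool.
# SAMPLE is a shortened copy of the header quoted earlier in this thread.
SAMPLE = """\
Dump header from device: /dev/ada0p3
  Architecture: amd64
  Dump Length: 1012834304
  Version String: FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29
01:43:23 UTC 2016
  Panic String: page fault
  Dump Status: good
"""


def parse_dump_info(text):
    """Parse 'Key: value' lines into a dict, folding wrapped lines."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first ": " only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
            last_key = key
        elif last_key is not None:
            # No "Key: " prefix: treat it as a continuation of the
            # previous field (the version string wraps in the sample).
            fields[last_key] += " " + line
    return fields


info = parse_dump_info(SAMPLE)
for key in ("Panic String", "Version String", "Dump Status"):
    print(f"{key}: {info[key]}")
```

On a real system the text would come from reading /var/crash/info.0 rather
than from the SAMPLE string. This only tidies up the header; the backtrace
still has to come from the core itself, as Matthew describes.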

Manish, before opening a PR, though, I would first make sure there is
nothing fishy with _your_ hardware. Go through the same old routine first:
re-seat all cards. Check that all fans are spinning (especially the CPU
ones). Re-seat all memory modules (and CPUs). Check that all memory is from
the same batch. I've seen memory with the same specs but from mixed brands
cause crashes (very rarely, about once a year for any given machine, but
that was a 32-node cluster, so in any given month one of the machines in
the cluster almost certainly crashed). Try running with a single CPU (the
system always boots off the CPU in socket number 0), minimum memory, and no
additional cards in the expansion slots (unless you can pinpoint a
particular card via a panic inside a particular driver). The worst case is
a micro-crack in the system board ("motherboard" has been the jargon term
for a couple of decades). If you have other hardware with the same model of
system board, try moving everything into that box and see whether that box
crashes as well with your system on it.
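
As a rough check on the cluster arithmetic above, here is a minimal sketch in
Python. It assumes crashes are independent and arrive at a steady average
rate (a Poisson model); that is a simplifying assumption, not something
measured on that cluster, and the function names are made up for
illustration:

```python
import math


def expected_crashes(nodes, crashes_per_node_per_year, months=1.0):
    """Average number of crashes across the cluster in the given period."""
    return nodes * crashes_per_node_per_year * months / 12.0


def p_at_least_one_crash(nodes, crashes_per_node_per_year, months=1.0):
    """Chance of at least one crash in the period, under a Poisson model."""
    rate = expected_crashes(nodes, crashes_per_node_per_year, months)
    # P(no crash) = exp(-rate), so P(at least one) = 1 - exp(-rate).
    return 1.0 - math.exp(-rate)


# Figures from the text: 32 machines, each crashing about once a year.
cluster_rate = expected_crashes(32, 1.0)
cluster_p = p_at_least_one_crash(32, 1.0)
single_p = p_at_least_one_crash(1, 1.0)

print(f"expected crashes per month, whole cluster: {cluster_rate:.2f}")
print(f"P(at least one crash in a month), cluster: {cluster_p:.2f}")
print(f"P(at least one crash in a month), one box: {single_p:.2f}")
```

Under those assumptions the cluster sees between two and three crashes a
month on average, and a crash-free month is unlikely, even though each
individual machine looks almost perfectly stable. That is why a fault this
rare is easy to miss on a single box.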

Good luck!

Valeri

>
> 	Cheers,
>
> 	Matthew
>
>
>


++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++


More information about the freebsd-questions mailing list