Advice on kernel panics

Thu Jun 1 15:21:45 UTC 2017

On Thu, June 1, 2017 10:14 am, Raimo Niskanen wrote:
> On Thu, Jun 01, 2017 at 10:03:27AM -0500, Valeri Galtsev wrote:
>> On Thu, June 1, 2017 9:34 am, Ian Smith wrote:
>> > In freebsd-questions Digest, Vol 678, Issue 4, Message: 4
>> > On Thu, 1 Jun 2017 10:27:49 +0200 Raimo Niskanen
>> > <raimo+freebsd at erix.ericsson.se> wrote:
>> >  > On Thu, Jun 01, 2017 at 12:10:30AM -0500, Doug McIntyre wrote:
>> >  > > On Mon, May 29, 2017 at 11:20:43AM +0200, Raimo Niskanen wrote:
>> >  > > > I have a server that panics about every 3 days and need some
>> >  > > > advice on how to handle that.
>> >  > >
>> >  > > I'd expect it is some sort of hardware failure, as I would expect
>> >  > > kernel panics more on the order of once a decade with FreeBSD.
>> >  > > I.e., I've seen one or two on my hundred or so servers, but it's
>> >  > > pretty rare.
>> >  > >
>> >  > > Check and recheck your hardware items.
>> >  >
>> >  > I have removed one of four memory capsules - panicked again.  Will
>> >  > rotate through all of them...
>> >  >
>> >  > >
>> >  > > Run memtest86+. Check your drive hardware, turn on SMART
>> >  > > checking.
>> >  >
>> >  > I have run memtest86+ over night - no errors found.
>> >  >
>> >  > I have installed smartmontools - no errors found, short and long
>> >  > self tests on both disks run fine.  zpool scrub repaired 0 errors
>> >  > and has no known data errors.
>> >
>> > Everyone's suggesting hardware problems, and it's certainly worthwhile
>> > eliminating that possibility - but this could be a software/OS issue.
>>
>> I agree with Ian: it can be software, though that is less likely. I
>> have seen a few cases where a SCSI-attached external RAID (attached to
>> an LSI SCSI HBA) announced a status change (such as a finished rebuild
>> or a drive that timed out or failed) while other traffic was on the
>> SCSI bus, which confused the adapter and led to a kernel panic.
>>
>> That said, I would first check the hardware thoroughly. Andrea
>> mentioned an aged PS under heavy load, and aged power supplies are
>> prime suspects. Of all components, electrolytic capacitors degrade the
>> most and may even leak. Once degraded, they no longer filter ripple
>> sufficiently, so ripple exceeds tolerable levels at high currents. So:
>>
>> 1. Open the box and inspect the interior. On the system board
>> ("motherboard" has been its jargon name for over 30 years), inspect
>> the electrolytic capacitors around the CPU(s) and those that filter
>> the PCI (or PCI-X, or PCI-E) bus power leads. If any of them are bulged
>> or show traces of leaked electrolyte (usually a brown residue), throw
>> away the system board. The model of your box falls into the time span
>> when the worst electrolytic capacitors were used.
>
> I did not think this machine was old, but it has apparently been a few
> years...

If it was manufactured less than 5 years ago, then I'm mistaken (I do not
follow Dell server models closely...)

>
>>
>> 2. Re-seat all components (including expansion boards and memory; the
>> CPU is less likely, but I would do that too), and disconnect and
>> reconnect all connectors. Contacts, even gold-plated ones, sometimes
>> do oxidize.
>
> Will try.
>
>>
>> 3. Get a new power supply, not necessarily one designed for this
>> machine, but with the same connectors to the system board and a higher
>> power rating. Disconnect the box's own PS and power the box from the
>> new PS; see if it stops failing. (PSes have electrolytic capacitors
>> inside as well; other components do not degrade gradually, they either
>> work or die totally, except for ultra high frequency diodes and
>> transistors, and very high voltage diodes.)
>>
>> Good luck!
>>
>> Valeri
>
> Thank you!
>
>
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB
>
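
To put a rough number on the ripple argument in the quoted text above,
here is a minimal sketch in Python. It assumes the textbook approximation
for a reservoir capacitor behind a full-wave rectifier,
dV ~= I / (2 * f * C). The function name and the component values are
made up for illustration and are not measurements from any real supply.
It also ignores ESR, which rises as electrolytic capacitors age, so real
ripple would likely be worse than this estimate.

```python
def ripple_voltage(load_current_a, capacitance_f, line_freq_hz=50.0):
    """Estimate peak-to-peak ripple (volts) on a reservoir capacitor.

    Uses the full-wave rectifier approximation dV ~= I / (2 * f * C).
    This is a simplification: ESR and regulator behaviour are ignored.
    """
    if capacitance_f <= 0 or line_freq_hz <= 0:
        raise ValueError("capacitance and frequency must be positive")
    return load_current_a / (2.0 * line_freq_hz * capacitance_f)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the trend.
    healthy_c = 470e-6          # capacitor at its rated capacitance, farads
    aged_c = healthy_c / 2      # assume half the capacitance is lost with age
    for label, cap in (("healthy", healthy_c), ("aged", aged_c)):
        for current in (0.5, 1.0, 2.0):   # load current in amperes
            ripple = ripple_voltage(current, cap)
            print(f"{label:8s} I={current:.1f} A  ripple ~ {ripple:.1f} V")
```

In this model, halving the capacitance doubles the ripple, and so does
doubling the load current, which is consistent with a marginal supply
misbehaving only under heavy load.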


++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++