Monitoring server for crashes

Fri Aug 12 18:07:55 UTC 2016

Please provide the exact FreeBSD version. I recall an issue in 10.2 where a
cron job caused these exact symptoms; it was fixed by updating. I doubt this
is the problem here, but more precise version information can help narrow
down software-related issues.
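
If it helps, here is a minimal sketch for collecting that information in one
go. It assumes Python 3 is installed on the server (my assumption, nothing in
this thread says so); collect_version_info is just a name I made up. On
FreeBSD 10.0 and later it also calls freebsd-version -ku, which reports the
kernel and userland patch levels separately.

```python
import platform
import shutil
import subprocess


def collect_version_info():
    """Return OS version details worth pasting into a reply."""
    info = {"uname": " ".join(platform.uname())}
    # freebsd-version(1) only exists on FreeBSD 10.0+; skip it elsewhere.
    if shutil.which("freebsd-version"):
        out = subprocess.run(
            ["freebsd-version", "-ku"],
            capture_output=True, text=True, check=False,
        ).stdout.split()
        # Pad with "?" in case the command printed fewer than two lines.
        info["kernel"], info["userland"] = (out + ["?", "?"])[:2]
    return info


if __name__ == "__main__":
    for key, value in collect_version_info().items():
        print(f"{key}: {value}")
```

Running uname -a and freebsd-version -ku by hand gives the same information;
the script only saves some copying and pasting.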

On Fri, Aug 12, 2016 at 12:10 PM, Valeri Galtsev <galtsev at kicp.uchicago.edu>
wrote:

>
> On Fri, August 12, 2016 10:51 am, Robert Fitzpatrick wrote:
> > Valeri Galtsev wrote:
> >> Before doing such monitoring I would really do a good hardware test.
> >> Incidentally, who is the hardware manufacturer (just for my curiosity)?
> >> The usual suspects are: memory (poor/flaky memory, or a combination of
> >> memory with slightly different specs; even though these may work
> >> together, they can fail very rarely, like once every 6 months, which is
> >> really hard to troubleshoot: just avoid this). Another possibility:
> >> tripping the temperature threshold set in the BIOS. (These, BTW, will
> >> leave no tracks in crash, logs, etc.) Check this and bring the threshold
> >> some 15-20 F (8-11 C) up. Incidentally: which CPU(s) do you have? (I'm
> >> used to thinking AMD will withstand any abuse without failing: you can
> >> almost boil water on these; Intels are not as robust.) What I would do
> >> is: open the box, leave minimal hardware (run with a minimal amount of
> >> RAM, remove all extra cards, etc.) and see if you have the problem with
> >> this minimal hardware configuration. If not, start adding hardware:
> >> install all RAM first and test that it doesn't crash. Run memtest86 at
> >> this point for at least 48 hours (or at the very minimum 2-3 full loops
> >> of the test). In this configuration try to run the system and create
> >> significant CPU load (several multi-threaded "build world" runs can help
> >> do that), and simultaneously try to use all the RAM. Things are slightly
> >> different under heavy load. And so on: add the rest of the hardware and
> >> test... One more thing: check that your PS provides at least 30% more
> >> power than all the hardware may need. Marginally insufficient power may
> >> lead to unpredictable things on the PCI bus. Incidentally, how old is
> >> the power supply (and the rest of the hardware)? Electrolytic capacitors
> >> may lose capacitance with age, thus not filtering ripple well enough on
> >> the PS leads (capacitors inside the PS), on the CPU power leads, and on
> >> the PCI bus power lines (capacitors on the system board: check that they
> >> are not showing traces of leakage).
> >>
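
The 30% power margin mentioned above is easy to check on paper. A rough
sketch, assuming you can estimate the draw of each component from its data
sheet; the wattages below are made-up illustrative numbers, not measurements
from this server, and psu_has_headroom is a hypothetical helper name.

```python
def psu_has_headroom(psu_watts, component_watts, margin=0.30):
    """True if the supply is rated at least `margin` above the summed draw."""
    required = sum(component_watts) * (1.0 + margin)
    return psu_watts >= required


# Illustrative estimates only (watts); substitute data-sheet values.
estimated_draw = {
    "cpus": 100,
    "ram": 60,
    "disks_and_raid": 80,
    "board_and_fans": 90,
}

if __name__ == "__main__":
    total = sum(estimated_draw.values())
    print("estimated draw:", total, "W")
    print("450 W supply ok:", psu_has_headroom(450, estimated_draw.values()))
    print("400 W supply ok:", psu_has_headroom(400, estimated_draw.values()))
```

An aged supply may deliver less than its label says, so treat the rated
figure as optimistic.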
> >
> > Thanks for all the suggestions. I will check the temperature and other
> > info in the BIOS tonight. I really can't have the server down for a long
> > memory test, but I will make sure all memory is the same. The server is
> > an IBM x3650 with 2 quad-core Xeon L5420 CPUs, a mixture of drives on a
> > hardware ServeRAID 8k controller, and 12GB of RAM.
>
> Sounds like memory under heavy load. I definitely would:
>
> 1. re-seat all RAM modules.
>
> 2. While doing 1, check that all modules are the same brand and the same
> part number. I don't remember offhand whether your CPU has its own memory
> controller (as in AMD Opterons) or uses the older "memory bus" shared by
> all CPUs, with the memory controller sitting on the system board. In the
> latter case I would just stick an extra fan on that memory controller
> chip. If the memory controllers are on the CPU dies, then make sure that
> the memory modules connected to a given CPU are the same; they can be
> [somewhat] different from the ones connected to a different CPU.
> Basically: all RAM modules connected to the same memory controller should
> be the same.
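
The brand/part number check in item 2 can be done without opening the box if
dmidecode is installed (on FreeBSD it comes from ports; I am assuming it is
available here). A minimal sketch that groups populated slots by
manufacturer and part number from the text printed by dmidecode -t memory.
The function name and the sample text are made up for illustration; real
output has many more fields per slot.

```python
from collections import defaultdict


def group_dimms(dmidecode_text):
    """Group populated DIMM slots by (manufacturer, part number)."""
    groups = defaultdict(list)
    # dmidecode separates records with a blank line.
    for block in dmidecode_text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue
        fields = dict(ln.split(": ", 1) for ln in lines if ": " in ln)
        if fields.get("Size", "").startswith("No Module"):
            continue  # empty slot
        key = (fields.get("Manufacturer", "?"), fields.get("Part Number", "?"))
        groups[key].append(fields.get("Locator", "?"))
    return dict(groups)


# Invented sample in the shape of `dmidecode -t memory` output.
SAMPLE = """\
Handle 0x0029, DMI type 17, 21 bytes
Memory Device
    Size: 2048 MB
    Locator: DIMM1
    Manufacturer: VendorA
    Part Number: PN-0001

Handle 0x002A, DMI type 17, 21 bytes
Memory Device
    Size: 1024 MB
    Locator: DIMM2
    Manufacturer: VendorB
    Part Number: PN-0002

Handle 0x002B, DMI type 17, 21 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM3
"""

if __name__ == "__main__":
    for (vendor, part), slots in group_dimms(SAMPLE).items():
        print(vendor, part, "->", ", ".join(slots))
```

More than one group in the result means mixed modules. Some boards report
blank or generic manufacturer strings, so a look at the labels on the
modules is still the final word.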
>
> Do I get it correctly: this machine (purchased used) originally ran
> without problems for you (for multiple months), right?
>
> One more thing I wouldn't exclude: a used system board may have a fried
> PCI-express slot; if you have something in it, the machine will be flaky.
> I had that once ;-( If you can remove everything, or just move the extra
> cards to different slots, this may help you test this.
>
> Good luck!
>
> > I purchased it second hand in 2011. I have a screenshot of the product
> > data screen in the BIOS; it shows a diagnostics date of Aug 2009, and
> > all hardware should be original except the drives and memory. The load
> > comes primarily from a PostgreSQL database; the server also provides
> > DNS and LDAP services. I'm not sure heat is the issue: it mainly happens
> > at the same general time at night, while the heaviest load is definitely
> > during the day.
> >
> > I see now that most of the time it happens during the nightly dump of
> > the db, but it has happened once during the day and once a couple of
> > hours before backup. I'm leaning toward a memory issue and will
> > definitely visit the data center tonight and check the module types.
> > The db size has not changed much over time and this just started
> > recently. It is a SpamAssassin/ClamAV db that is purged and vacuumed
> > every night after dumping. I will disable that job and do the dump
> > manually tonight; 90% of the time it seems to go down during backup of
> > the largest db. Perhaps the crashes have corrupted a table. I
> > 'fsck -y' all mounts in single user mode after every crash.
> >
> > --
> > Robert
> >
> > _______________________________________________
> > freebsd-questions at freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> > To unsubscribe, send any mail to
> > "freebsd-questions-unsubscribe at freebsd.org"
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++
> Valeri Galtsev
> Sr System Administrator
> Department of Astronomy and Astrophysics
> Kavli Institute for Cosmological Physics
> University of Chicago
> Phone: 773-702-4247
> ++++++++++++++++++++++++++++++++++++++++