Monitoring server for crashes
William A. Mahaffey III
wam at hiwaay.net
Sat Aug 13 18:26:45 UTC 2016
On 08/13/16 09:33, Ian Smith wrote:
> In freebsd-questions Digest, Vol 636, Issue 7, Message: 10
> On Fri, 12 Aug 2016 11:51:50 -0400 Robert Fitzpatrick <robert at webtent.org> wrote:
> > Valeri Galtsev wrote:
> > > Before doing such monitoring I would really do a good hardware test.
> > > Incidentally, who is hardware manufacturer (just for my curiosity). The
> > > usual suspects are: memory (poor/flaky memory, or combination of memory
> > > with slightly different specs; these even though they may work together
> > > can lead to failure sometimes very rarely, like once every 6 Months which
> > > is really hard to troubleshoot: just avoid this). Another possibility:
> > > tripping temperature threshold set in BIOS. (These, BTW will leave no
> > > tracks in crash, logs etc.) Check this and bring threshold some 15-20 F (8
> > > - 11 C) up. Incidentally: which CPU(s) do you have? (I'm used to thinking
> > > AMD will withstand any abuse without failing: you almost can boil water on
> > > these, Intels are not as robust). What I would do is : open the box, leave
> > > minimal hardware (run with minimal amount of RAM, remove all extra cards
> > > etc) and see if you have problem with this minimal hardware configuration.
> > > If not, start adding hardware, install all RAM first, test if it doesn't
> > > crash. Run memtest86 at this point for at least 48 hours (or at the very
> > > minimum 2-3 full loops of test). In this configuration try to run system
> > > and create significant CPU load (several multi-thread "build world" can
> > > help do that), and simultaneously try to use all the RAM. Things are
> > > slightly different under heavy load. And so on - add the rest of hardware
> > > and test... One more thing: check if your PS provides at least 30% more
> > > power than all hardware may need. Marginally insufficient power may lead
> > > to unpredictable things on PCI bus. Incidentally, how old is power supply
> > > (and the rest of hardware)? Electrolytic capacitors may lose capacitance
> > > with age, thus not filtering well enough ripple on PS leads (capacitors
> > > inside PS), on CPU power leads and on PCI bus power lines (capacitors on
> > > system board - check that they are not showing traces of leakage).
>
> All good advice Valeri; not sure about messing with temps in BIOS though
> .. FreeBSD should be handling that ok via ACPI thermal Zones (versus
> _HOT and _CRT temperatures) which should cleanly shutdown at _CRT temp.
> That said, if it gets anywhere near that hot there's a serious issue ..
>
> > Thanks for all the suggestions, will check temp and other info in BIOS
> > tonight, I really can't have the server down for long memory test, will
> > make sure all memory is the same. The server is IBM x3650 with 2 Quad
> > Core Xeon L5420 a mixture of drives using hardware ServeRAID 8k and 12GB
> > of RAM. I purchased second hand in 2011. I have a screenshot of the
> > product data screen in the BIOS, it has a diagnostics date of Aug 2009
> > in the BIOS, all hardware should be original except drives and memory.
> > The load comes from a PostgreSQL database primarily, also provides DNS
> > and LDAP services. Not sure heat is the issue, mainly happens at the
> > same general time at night, heaviest load is definitely during the day.
>
> I guess you've checked with IBM re a BIOS update .. 2009 is a while ago.
>
> Apart from RAM (which rarely just 'goes bad', esp. on server grade gear,
> though "rarely happens" happens too):
>
> First thing I'd suspect at that age would be the power supply - can you
> swap it with another? Quickest fix if it works - and it was needed.
>
> Second would be temperature, possibly fan/s - which is also the primary
> cause of blown P/S in my experience. Below is a script I run from cron
> from 02:59 through 3:09 to record load averages and temperatures through
> daily maintenance from 3:01, every 10 seconds - for debugging a load
> average issue, not relevant here. Or you can run it over SSH at home,
> and read the last entries over breakfast, whether it crashes or not ..
>
> The lack of any messages - and you should see one if ACPI thermal zone
> detection and forced shutdown is working properly - suggests more of a
> hardware seizure, but at 10 second testing you could see if temps (and
> load) were a problem prior to crash, at least if it happens in a window.
>
> > I see now, most of the time it happens during dumping of the db each
> > night, but it has happened once during the day and once a couple of
> > hours before backup. I'm leaning toward a memory issue and will
> > definitely visit the data center tonight and see the types. The db size
> > has not changed much over time and this just started recently. It is a
> > SpamAssassin/ClamAV db and purges, vacuums every night after dumping. I
> > will disable and do dump manually tonight, 90% of the time it seems to
> > be going down during backup of the largest db. Perhaps the crashes have
> > caused a table to corrupt, I 'fsck -y' all mounts in single user mode
> > after every crash.
>
> Do the fscks log success or any problems then? If not, might be worth
> doing manual fsck to check?
>
> /etc/crontab:
> 59 2 * * * root /root/bin/loadavg_daily
>
> /root/bin/loadavg_daily:
> =======
> #!/bin/sh
> # 19Feb16 loadavg_daily .. every 10 seconds from 02:59 to 03:09 (run by cron)
> log='/root/loadavg_daily.log'
> secs=10		# seconds between samples
> i=0
> /root/bin/x200stat >> $log # or something else, or nothing ..
> # 60 samples x 10 seconds = 10 minutes
> while [ $i -lt 60 ]; do
>     # one log line per sample: uptime (load averages), then both zone temps
>     echo -n "`uptime` " >> $log
>     echo "`sysctl -n hw.acpi.thermal.tz0.temperature`" \
>          "`sysctl -n hw.acpi.thermal.tz1.temperature`" >> $log
>     sleep $secs
>     i=$((i + 1))
> done
> /root/bin/x200stat >> $log
> echo >> $log
> =======
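A rough sketch for reading that log back after a crash, so you don't have to eyeball every line. It assumes each sample line ends with temperatures in the "NN.NC" form that sysctl prints for these OIDs; the helper name hot_lines and the limit of 70 are made up for illustration, not anything from the script above.

```shell
#!/bin/sh
# hot_lines (hypothetical helper): read the log on stdin and print every line
# where any temperature field (e.g. "71.0C") is above the limit in $1 (Celsius).
hot_lines() {
    awk -v limit="$1" '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)?C$/ && $i + 0 > limit) { print; break }
    }'
}

# Typical use (70 is an arbitrary example limit, pick one that suits the box):
#   hot_lines 70 < /root/loadavg_daily.log | tail -5
```

If nothing comes out for the minutes before a crash, heat is probably not the trigger, which fits the "hardware seizure" theory better.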
>
> Check sysctl hw.acpi.thermal for your thermal zones of interest.
>
> HTH, Ian
>
Out of curiosity, I tried the above command under 9.3R:
[wam at kabini1, ~, 1:30:25pm] 581 % sysctl -n hw.acpi.thermal.tz1.temperature
sysctl: unknown oid 'hw.acpi.thermal.tz1.temperature'
[wam at kabini1, ~, 1:30:46pm] 582 % uname -a
FreeBSD kabini1.local 9.3-RELEASE-p33 FreeBSD 9.3-RELEASE-p33 #0: Wed Jan 13 17:55:39 UTC 2016 root at amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
[wam at kabini1, ~, 1:31:58pm] 583 %
When did it become available?
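For reference, a small sketch for seeing which zones a box actually exposes instead of assuming tz0 and tz1. As far as I understand it, the tzN nodes only show up when the machine's ACPI tables define thermal zones, so a missing tz1 may be down to the hardware rather than the release; treat that as an assumption to check. The helper name list_zones is made up for illustration; it only filters `sysctl hw.acpi.thermal` output read from stdin.

```shell
#!/bin/sh
# list_zones (hypothetical helper): read `sysctl hw.acpi.thermal` output on
# stdin and print the name of each zone that reports a temperature.
list_zones() {
    sed -n 's/^hw\.acpi\.thermal\.\(tz[0-9][0-9]*\)\.temperature:.*/\1/p'
}

# Typical use on the machine itself:
#   sysctl hw.acpi.thermal 2>/dev/null | list_zones
```

An empty result would mean no thermal zones are visible under hw.acpi.thermal at all, in which case a logging script has to get its temperatures from somewhere else.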
--
William A. Mahaffey III
----------------------------------------------------------------------
"The M1 Garand is without doubt the finest implement of war
ever devised by man."
-- Gen. George S. Patton Jr.