Xserve G5 keeps shutting down

Nathan Whitehorn nwhitehorn at freebsd.org
Thu Jun 23 04:48:21 UTC 2011


On 06/21/11 14:32, Paul Mather wrote:
> On Jun 20, 2011, at 7:59 PM, Nathan Whitehorn wrote:
>
>> On 06/20/11 15:22, Paul Mather wrote:
>>> I'm running FreeBSD/powerpc64 -CURRENT on an Xserve G5.  With a recent kernel, the system will not stay up for more than a few hours at a time. :-(
>>>
>>> I have no idea why the machine is shutting off.  There is no panic or crash dump and there is no indication in the logs of anything awry.  The system just powers down.  The times this has happened when I have been there have not indicated anything stressing the system (like all fans racing madly) and oftentimes the system has been relatively idle.  (Oddly, it never appears to my knowledge to have shut down when doing something potentially taxing, such as a make -j5 buildworld or the like.)
>>>
>>> The main thing I have noticed since building this new kernel is that the fans are now controlled automatically, i.e., there is now no need for the tickle-the-fan-controller cron job of yore, meaning the fans won't race when in single user mode (e.g., during an installworld).
>> If the temperature on any sensor exceeds its maximum value, it will cause the machine to shut off. There was at one point a problem with some of the sensor drivers that would sometimes report erroneous crazy values. Most of the known problems were fixed by andreast a few weeks ago, but it looks like you ran into another. My work desktop has a ds1775 and a max6690, and has no problems, but not an ad7417, so I would guess the problem lies there. Could you try commenting out line 116 of /sys/powerpc/powermac/powermac_thermal.c? That will cause it to spam the console (and dmesg) about the error, identifying the sensor, but not shut off the machine, so it should both keep your server on and let us work out the problem.
>
> I built a new kernel with the shutdown line identified above commented out.  The resultant system stayed up for several hours doing various -j5 buildworld/buildkernels but just now shut down. :-(  Unfortunately, nothing appeared on the console, so there is no logged reason for the shutdown.
>
> I started up the system again, but it shut down again after a few minutes of uptime.  When I started it up for the third (and last) time, I managed to grab this output from the temp/fan sysctls before it shut down (a minute or two after booting up):
>
> paul@backup:/home/paul>  sysctl -a | egrep 'dev.*temp|fans'
> machdep.manage_fans: 1
> dev.max6690.0.%pnpinfo: name=temp-monitor compat=max6690
> dev.max6690.0.sensor.sys_ctrlr_ambient.temp: 41.5C
> dev.max6690.0.sensor.sys_ctrlr_internal.temp: 50.1C
> dev.fcu.0.fans.cpu_a_1.minrpm: 1200
> dev.fcu.0.fans.cpu_a_1.maxrpm: 14000
> dev.fcu.0.fans.cpu_a_1.rpm: 1984
> dev.fcu.0.fans.cpu_a_2.minrpm: 1200
> dev.fcu.0.fans.cpu_a_2.maxrpm: 14000
> dev.fcu.0.fans.cpu_a_2.rpm: 1984
> dev.fcu.0.fans.cpu_a_3.minrpm: 1200
> dev.fcu.0.fans.cpu_a_3.maxrpm: 14000
> dev.fcu.0.fans.cpu_a_3.rpm: 1984
> dev.fcu.0.fans.cpu_b_1.minrpm: 1200
> dev.fcu.0.fans.cpu_b_1.maxrpm: 14000
> dev.fcu.0.fans.cpu_b_1.rpm: 1984
> dev.fcu.0.fans.cpu_b_2.minrpm: 1200
> dev.fcu.0.fans.cpu_b_2.maxrpm: 14000
> dev.fcu.0.fans.cpu_b_2.rpm: 1984
> dev.fcu.0.fans.cpu_b_3.minrpm: 1200
> dev.fcu.0.fans.cpu_b_3.maxrpm: 14000
> dev.fcu.0.fans.cpu_b_3.rpm: 1984
> dev.fcu.0.fans.sys_ctrlr_fan.minpwm: 40
> dev.fcu.0.fans.sys_ctrlr_fan.maxpwm: 100
> dev.fcu.0.fans.sys_ctrlr_fan.pwm: 54
> dev.fcu.0.fans.sys_ctrlr_fan.rpm: 11264
> dev.fcu.0.fans.pci_fan.minpwm: 40
> dev.fcu.0.fans.pci_fan.maxpwm: 100
> dev.fcu.0.fans.pci_fan.pwm: 48
> dev.fcu.0.fans.pci_fan.rpm: 9792
> dev.ad7417.0.sensor.cpu_a_ad7417_amb.temp: 36.7C
> dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C
> dev.ad7417.1.sensor.cpu_b_ad7417_amb.temp: 32.0C
> dev.ad7417.1.sensor.cpu_b_diode_temp.temp: 52.6C
> dev.ds1775.0.%pnpinfo: name=temp-monitor compat=lm75
>
>
> The cpu_{a,b}_diode_temp temperatures were higher during the buildworld (63--67C) and it stayed up at that time.
>
> I'm flummoxed at this point as to what is responsible for the shutdowns.  Are there any other hardware monitoring-related shutdowns in the kernel code?  The funny thing about the ad7417 device is that I only recently added it to my kernel config file as I noticed it had appeared in GENERIC.
>
> Tomorrow I'll build a GENERIC kernel with the shutdown line commented out, and see if I have any better luck with that.

Bizarre! So, on the console, it just had a logon prompt, and then Open 
Firmware again? Nothing at all in between?
-Nathan
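
Since nothing reached the console before the power-off, one way to get a record is to capture the temperature sysctls periodically and keep the last readings on disk. The following is only a minimal sketch of the parsing step, assuming the "name: value C" output format shown in the quoted sysctl listing. The function names (parse_temps, over_limit) and the 50.0C threshold are made up for illustration; the real per-sensor maximum values are not given anywhere in this thread.

```python
import re

# Matches lines such as "dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C".
TEMP_LINE = re.compile(r"^(dev\.\S+\.temp):\s*(-?\d+(?:\.\d+)?)C$")


def parse_temps(sysctl_text):
    """Return {sysctl_name: degrees_celsius} for every temperature line."""
    temps = {}
    for line in sysctl_text.splitlines():
        # Drop mail-quoting prefixes ("> ") and surrounding whitespace.
        match = TEMP_LINE.match(line.lstrip("> ").strip())
        if match:
            temps[match.group(1)] = float(match.group(2))
    return temps


def over_limit(temps, limit_c):
    """Return (name, temp) pairs strictly above limit_c, sorted by name."""
    return sorted((n, t) for n, t in temps.items() if t > limit_c)


# A few lines copied from the sysctl output quoted in this thread.
SAMPLE = """\
dev.max6690.0.sensor.sys_ctrlr_ambient.temp: 41.5C
dev.max6690.0.sensor.sys_ctrlr_internal.temp: 50.1C
dev.fcu.0.fans.cpu_a_1.rpm: 1984
dev.ad7417.0.sensor.cpu_a_ad7417_amb.temp: 36.7C
dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C
dev.ad7417.1.sensor.cpu_b_ad7417_amb.temp: 32.0C
dev.ad7417.1.sensor.cpu_b_diode_temp.temp: 52.6C
"""

if __name__ == "__main__":
    # 50.0 is an arbitrary illustrative limit, not a value from the driver.
    for name, temp in over_limit(parse_temps(SAMPLE), 50.0):
        print(f"{name}: {temp:.1f}C")
```

On the machine itself, the same functions could be fed the text of `sysctl -a` instead of SAMPLE and the result appended to a log file every minute or so, so that the last readings before a shutdown survive the power-off.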

