Xserve G5 keeps shutting down

Nathan Whitehorn nwhitehorn at freebsd.org
Sun Jun 26 00:52:43 UTC 2011


On 06/24/11 13:00, Paul Mather wrote:
> On Jun 23, 2011, at 12:48 AM, Nathan Whitehorn wrote:
>
>> On 06/21/11 14:32, Paul Mather wrote:
>>> On Jun 20, 2011, at 7:59 PM, Nathan Whitehorn wrote:
>>>
>>>> On 06/20/11 15:22, Paul Mather wrote:
>>>>> I'm running FreeBSD/powerpc64 -CURRENT on an Xserve G5.  With a recent kernel, the system will not stay up for more than a few hours at a time. :-(
>>>>>
>>>>> I have no idea why the machine is shutting off.  There is no panic or crash dump and there is no indication in the logs of anything awry.  The system just powers down.  The times this has happened when I have been there have not indicated anything stressing the system (like all fans racing madly) and oftentimes the system has been relatively idle.  (Oddly, to my knowledge it has never shut down when doing something potentially taxing, such as a make -j5 buildworld or the like.)
>>>>>
>>>>> The main thing I have noticed since building this new kernel is that the fans are now controlled automatically, i.e., there is now no need for the tickle-the-fan-controller cron job of yore, meaning the fans won't race when in single user mode (e.g., during an installworld).
>>>> If the temperature on any sensor exceeds its maximum value, it will cause the machine to shut off. There was at one point a problem with some of the sensor drivers that would sometimes report erroneous crazy values. Most of the known problems were fixed by andreast a few weeks ago, but it looks like you ran into another. My work desktop has a ds1775 and a max6690, and has no problems, but not an ad7417, so I would guess the problem lies there. Could you try commenting out line 116 of /sys/powerpc/powermac/powermac_thermal.c? That will cause it to spam the console (and dmesg) about the error, identifying the sensor, but not shut off the machine, which should both keep your server on and let us work out the problem.
>>> I built a new kernel with the shutdown line identified above commented out.  The resultant system stayed up for several hours doing various -j5 buildworld/buildkernels but just now shut down. :-(  Unfortunately, nothing appeared on the console, so there is no logged reason for the shutdown.
>>>
>>> I started up the system again, but it shut down again after a few minutes of uptime.  When I started it up for the third (and last) time, I managed to grab this output from the temp/fan sysctls before it shut down (a minute or two after booting up):
>>>
>>> paul@backup:/home/paul>   sysctl -a | egrep 'dev.*temp|fans'
>>> machdep.manage_fans: 1
>>> dev.max6690.0.%pnpinfo: name=temp-monitor compat=max6690
>>> dev.max6690.0.sensor.sys_ctrlr_ambient.temp: 41.5C
>>> dev.max6690.0.sensor.sys_ctrlr_internal.temp: 50.1C
>>> dev.fcu.0.fans.cpu_a_1.minrpm: 1200
>>> dev.fcu.0.fans.cpu_a_1.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_a_1.rpm: 1984
>>> dev.fcu.0.fans.cpu_a_2.minrpm: 1200
>>> dev.fcu.0.fans.cpu_a_2.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_a_2.rpm: 1984
>>> dev.fcu.0.fans.cpu_a_3.minrpm: 1200
>>> dev.fcu.0.fans.cpu_a_3.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_a_3.rpm: 1984
>>> dev.fcu.0.fans.cpu_b_1.minrpm: 1200
>>> dev.fcu.0.fans.cpu_b_1.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_b_1.rpm: 1984
>>> dev.fcu.0.fans.cpu_b_2.minrpm: 1200
>>> dev.fcu.0.fans.cpu_b_2.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_b_2.rpm: 1984
>>> dev.fcu.0.fans.cpu_b_3.minrpm: 1200
>>> dev.fcu.0.fans.cpu_b_3.maxrpm: 14000
>>> dev.fcu.0.fans.cpu_b_3.rpm: 1984
>>> dev.fcu.0.fans.sys_ctrlr_fan.minpwm: 40
>>> dev.fcu.0.fans.sys_ctrlr_fan.maxpwm: 100
>>> dev.fcu.0.fans.sys_ctrlr_fan.pwm: 54
>>> dev.fcu.0.fans.sys_ctrlr_fan.rpm: 11264
>>> dev.fcu.0.fans.pci_fan.minpwm: 40
>>> dev.fcu.0.fans.pci_fan.maxpwm: 100
>>> dev.fcu.0.fans.pci_fan.pwm: 48
>>> dev.fcu.0.fans.pci_fan.rpm: 9792
>>> dev.ad7417.0.sensor.cpu_a_ad7417_amb.temp: 36.7C
>>> dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C
>>> dev.ad7417.1.sensor.cpu_b_ad7417_amb.temp: 32.0C
>>> dev.ad7417.1.sensor.cpu_b_diode_temp.temp: 52.6C
>>> dev.ds1775.0.%pnpinfo: name=temp-monitor compat=lm75
>>>
>>>
>>> The cpu_{a,b}_diode_temp temperatures were higher during the buildworld (63--67C) and it stayed up at that time.
>>>
>>> I'm flummoxed at this point as to what is responsible for the shutdowns.  Are there any other hardware monitoring-related shutdowns in the kernel code?  The funny thing about the ad7417 device is that I only recently added it to my kernel config file as I noticed it had appeared in GENERIC.
>>>
>>> Tomorrow I'll build a GENERIC kernel with the shutdown line commented out, and see if I have any better luck with that.
>> Bizarre! So, on the console, it just had a logon prompt, and then Open Firmware again? Nothing at all in between?
>
> Not quite: on the console it just had a login prompt and then nothing else was output---the machine had powered off.

That's extraordinarily odd. And you said there was no fsck or other 
problem after the reboot? It looks like it shut down normally?

> Now, it seems things have gone from bad to worse.  I just built and installed a GENERIC64 kernel but it hangs just after probing pcm0 (Apple I2S Audio Controller).  (I omit this device from the Xserve G5 custom kernel config file I normally use, as its hardware lacks any graphics or sound.)  I rebuilt this kernel with the sound devices commented out, and, fortunately, this version of GENERIC64 will boot without hanging.  I'll see if it fares any better when it comes to the Xserve G5 staying up for more than a few hours.

Ack. Thanks for reporting this -- it's fixed now.

> I notice with the pwm-controlled fans that there is a minpwm and maxpwm (40 and 100 respectively on my system).  Is it possible to lower the minpwm?  I tried to do so via /boot/loader.conf (lowering it to 30) but that appears to have had no effect.  Here is my /boot/loader.conf:
>
> dev.fcu.0.fans.sys_ctrlr_fan.minpwm=30
> dev.fcu.0.fans.pci_fan.minpwm=30

I've turned the minimum back down. I had turned it up to 40 due to being 
frightened by an electronics-burning smell, so that should be fixed too.

> As I understand it from corresponding with Andreas Tobler in the past, the pwm value is a percentage reflecting the fan rpm between off and full rpms.  My dev.fcu.0.fans.sys_ctrlr_fan.pwm currently reports a value of 52 but the dev.fcu.0.fans.sys_ctrlr_fan.rpm of 11170 seems proportionally too high if this fan tops out at 14000 rpm like the CPU fans apparently do.  I know the Xserve G5 fans are supposed to run at higher rpms than the desktop PowerMac G5s, because they're smaller, but I'd like for my sys_ctrlr_fan not to run quite so fast as it is.

That is strange -- hopefully it runs a bit slower now. Andreas, any idea 
why it would be so non-linear?
-Nathan
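
For reference, the mismatch Paul describes can be put into numbers. The
sketch below is only illustrative and rests on two assumptions that nobody
in this thread has confirmed: that pwm is a percentage mapping linearly onto
fan speed, and that sys_ctrlr_fan tops out at the same 14000 rpm as the CPU
fans. The helper names (parse_sysctl, expected_rpm) are made up for this
example; the two sample lines simply restate the values quoted above in
sysctl's "name: value" form.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict of strings."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            values[name.strip()] = value.strip()
    return values


def expected_rpm(pwm_percent, max_rpm):
    """Speed a fan would run at if rpm scaled linearly with pwm (assumption)."""
    return pwm_percent * max_rpm / 100.0


# Values quoted in the message above, written as sysctl output lines.
sample = """\
dev.fcu.0.fans.sys_ctrlr_fan.pwm: 52
dev.fcu.0.fans.sys_ctrlr_fan.rpm: 11170
"""

readings = parse_sysctl(sample)
pwm = int(readings["dev.fcu.0.fans.sys_ctrlr_fan.pwm"])
rpm = int(readings["dev.fcu.0.fans.sys_ctrlr_fan.rpm"])

# Assumption: borrowed from dev.fcu.0.fans.cpu_*.maxrpm; the PWM fans do not
# report a maxrpm of their own in the sysctl output shown earlier.
assumed_max_rpm = 14000

linear = expected_rpm(pwm, assumed_max_rpm)
implied_percent = rpm / assumed_max_rpm * 100

print("linear model predicts %.0f rpm at pwm=%d" % (linear, pwm))
print("reported %d rpm is %.1f%% of the assumed maximum" % (rpm, implied_percent))
```

Under those assumptions a pwm of 52 would correspond to 7280 rpm, while the
reported 11170 rpm is roughly 80% of 14000. If the real maximum of this fan
is different, or the pwm-to-rpm curve is not linear, the comparison changes
accordingly.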

