misdetection of tz2 temperature

Wed Mar 7 06:08:11 UTC 2007

On Mon, 5 Mar 2007, Nate Lawson wrote:

> [Stefan removed from cc]
>
> Bruce Evans wrote:
>> On Mon, 5 Mar 2007, Bruce Evans wrote:
>>
>>> I now have a completely different acpi problem to ask about.  My HP
>>> nx6325 now shuts down an instant after booting FreeBSD with a 1 week
>>> old kernel, since the tz2 temperature is misdetected as 3413.3 degrees
>>> C.  All temperatures seemed to be detected correctly in 3+ week old
>>> kernels.  Only battery misdetection caused shutdowns (less
>>> cleanly via panics) in the old kernels.
>>
>> This seems to be fixed in -current.
>
> I recently committed a major reworking of the embedded controller
> driver.  See this message:
>
> http://lists.freebsd.org/pipermail/freebsd-current/2007-February/069525.html
>
> See this message for a list of things to try.  The goal is to diagnose
> why the EC is timing out.  The thermal misdetection is only a symptom.
>
> http://lists.freebsd.org/pipermail/freebsd-current/2007-February/069577.html
>
> The one I think would be most helpful is increasing the total time spent
> waiting, but I would appreciate your help seeing what combo of
> polling/total timeout works for you.
>
> debug.acpi.ec.timeout=1000  # 1 sec total

Changing ec.timeout to 1000 or 10000 and changing ec.poll_time to 100
or 10000 had no visible effect.  I don't seem to be getting any timeouts.
Setting ec.burst=0 fixes the problem, judging by the "_TMP value
is absurd" messages.

Diffs of the output of "sysctl -a | grep acpi" between an old kernel
without the problem and a new kernel with the problem:

% --- old	Mon Mar  5 22:18:52 2007
% +++ new	Wed Mar  7 16:39:12 2007
% @@ -2 +2 @@
% -debug.acpi.acpi_ca_version: 0x20051021
% +debug.acpi.acpi_ca_version: 20051021
% @@ -3,0 +4,3 @@
% +debug.acpi.ec.burst: 1
% +debug.acpi.ec.poll_time: 500
% +debug.acpi.ec.timeout: 500
% @@ -23 +26 @@
% -hw.acpi.acline: 1
% +hw.acpi.acline: 0

Another bug.  AC is connected.

% @@ -27,2 +30,2 @@
% -hw.acpi.thermal.tz0.temperature: 45.0C
% -hw.acpi.thermal.tz0.active: 3
% +hw.acpi.thermal.tz0.temperature: 71.0C
% +hw.acpi.thermal.tz0.active: 1

Probably correct.

% @@ -34,2 +37,2 @@
% -hw.acpi.thermal.tz0._ACx: 75.0C 65.0C 55.0C 15.9C -1 -1 -1 -1 -1 -1
% -hw.acpi.thermal.tz1.temperature: 43.0C
% +hw.acpi.thermal.tz0._ACx: 75.0C 60.0C 50.0C 40.0C -1 -1 -1 -1 -1 -1
% +hw.acpi.thermal.tz1.temperature: 56.0C

15.9C seems too low.

% @@ -43 +46 @@
% -hw.acpi.thermal.tz2.temperature: 29.5C
% +hw.acpi.thermal.tz2.temperature: 16.0C

16.0C is too low.  Room temperature is 26C.

Before the absurd values were ignored, 3400+C was printed here.  Now
the only evidence of the absurd values is the messages about them.
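For reference, ACPI's _TMP method reports temperature in tenths of a
kelvin, so the Celsius figures that sysctl prints come from subtracting
an offset and dividing by 10.  A minimal Python sketch of that
conversion follows.  The 2732 offset, the function names and the sanity
bounds are assumptions for illustration only; this is not the driver's
code.

```python
# Assumed offset: 0 degrees C expressed in tenths of a kelvin (273.2 K).
ZERO_C_TENTHS_K = 2732


def raw_to_celsius(raw):
    """Convert a raw _TMP reading (tenths of a kelvin) to degrees C."""
    return (raw - ZERO_C_TENTHS_K) / 10


def celsius_tenths_to_raw(tenths_c):
    """Convert tenths of a degree C back to a raw tenths-of-kelvin value."""
    return tenths_c + ZERO_C_TENTHS_K


def looks_absurd(raw, low_c=0.0, high_c=200.0):
    """Hypothetical sanity check; the bounds are made up for this sketch."""
    celsius = raw_to_celsius(raw)
    return celsius < low_c or celsius > high_c


# Readings quoted in this thread: 3413.3C, 16.0C and 30.7C.
for tenths in (34133, 160, 307):
    raw = celsius_tenths_to_raw(tenths)
    print(f"{tenths / 10:.1f}C -> raw {raw} (0x{raw:04x}), "
          f"absurd: {looks_absurd(raw)}")
```

Printing the raw value in hex makes it easier to compare a bad reading
against whatever the EC actually returned.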

% @@ -94,0 +98 @@
% +dev.cpu.1.%parent: acpi0
% @@ -96,0 +101,2 @@
% +dev.acpi_perf.1.%driver: acpi_perf
% +dev.acpi_perf.1.%parent: cpu1
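A comparison like the one above can also be made name by name rather
than line by line, which avoids noise when entries move around.  The
following Python sketch assumes each sysctl entry fits on one
"name: value" line; the helper names are hypothetical, and the sample
text is a small subset of the values shown above.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict."""
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:  # skip lines that do not look like 'name: value'
            result[name.strip()] = value.strip()
    return result


def changed(old_text, new_text):
    """Return {name: (old, new)} for differing names; None means missing."""
    old, new = parse_sysctl(old_text), parse_sysctl(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


OLD = """\
hw.acpi.acline: 1
hw.acpi.thermal.tz2.temperature: 29.5C
"""
NEW = """\
debug.acpi.ec.burst: 1
hw.acpi.acline: 0
hw.acpi.thermal.tz2.temperature: 16.0C
"""

for name, (before, after) in changed(OLD, NEW).items():
    print(f"{name}: {before} -> {after}")
```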

After setting ec.burst to 0, with no changes to timeouts, only one more
"absurd" message has been printed after 20 minutes, so the message seems
to have been for old state.

Diffs from this change:

% --- z3	Wed Mar  7 16:39:12 2007
% +++ z4	Wed Mar  7 16:42:09 2007
% @@ -4 +4 @@
% -debug.acpi.ec.burst: 1
% +debug.acpi.ec.burst: 0
% @@ -30,2 +30,2 @@
% -hw.acpi.thermal.tz0.temperature: 71.0C
% -hw.acpi.thermal.tz0.active: 1
% +hw.acpi.thermal.tz0.temperature: 52.0C
% +hw.acpi.thermal.tz0.active: 2
% @@ -37,2 +37,2 @@
% -hw.acpi.thermal.tz0._ACx: 75.0C 60.0C 50.0C 40.0C -1 -1 -1 -1 -1 -1
% -hw.acpi.thermal.tz1.temperature: 56.0C
% +hw.acpi.thermal.tz0._ACx: 75.0C 65.0C 50.0C 40.0C -1 -1 -1 -1 -1 -1
% +hw.acpi.thermal.tz1.temperature: 55.0C

Probably correct.

% @@ -46 +46 @@
% -hw.acpi.thermal.tz2.temperature: 16.0C
% +hw.acpi.thermal.tz2.temperature: 30.7C

Possibly correct.  30.7C still seems low.  ISTR tz2 usually being too
low in old kernels, but that may have only been the 15.9C ACx value.

With ec.burst=0, ec.timeout=3 seems to work but ec.timeout=2 causes
AE_NO_HARDWARE_RESPONSE errors.

With ec.burst=1, ec.timeout=2 seems to work but ec.timeout=1 causes
AE_NO_HARDWARE_RESPONSE errors.
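The thresholds above are what one would expect from a poll loop with a
total time budget: once the budget is shorter than the time the
hardware needs, every wait fails.  The Python sketch below is a toy
model of that behaviour, not the EC driver's logic.  The response
times (2500 and 1500 microseconds) are invented to match the
thresholds, and treating poll_time as microseconds and timeout as
milliseconds is an assumption.

```python
def wait_for_ec(ready_after_us, poll_us, timeout_ms):
    """Poll a simulated status flag every poll_us until timeout_ms elapses."""
    elapsed_us = 0
    limit_us = timeout_ms * 1000
    while elapsed_us < limit_us:
        if elapsed_us >= ready_after_us:  # simulated "hardware is ready"
            return "AE_OK"
        elapsed_us += poll_us             # simulated delay between polls
    return "AE_NO_HARDWARE_RESPONSE"


# Hypothetical response times, chosen only to reproduce the thresholds.
for label, ready_after_us in (("burst=0", 2500), ("burst=1", 1500)):
    for timeout_ms in (3, 2, 1):
        status = wait_for_ec(ready_after_us, 500, timeout_ms)
        print(f"{label} timeout={timeout_ms}: {status}")
```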

Toggling physical AC power doesn't change "hw.acpi.acline: 0".  ISTR
that this used to work.

Bruce