"ipmi0: KCS..." whines

David Wolfskill david at catwhisker.org
Fri Aug 12 21:43:42 UTC 2016


On Fri, Aug 12, 2016 at 11:54:38AM -0700, John Baldwin wrote:
> ...
> So the issue is probably that the BMC controller on your box is sometimes
> slow in responding.  The completion code is the third byte of the reply
> we wait to read after sending a request to the BMC via KCS.  However, the
> first two bytes just echo back the request ID and command we asked for, so
> it may be that the BMC echoes those back right away without waiting for
> whatever work it needs to do to handle the request to complete, but doesn't
> send the completion code (the status of the request) until the request is
> fully processed.
> 
> The driver is complaining that the BMC didn't respond with the completion
> code before its timeout expired.  The default timeout is MAX_TIMEOUT in
> sys/dev/ipmi/ipmivars.h which corresponds to 6 seconds.  It may be that
> occasionally some "background" task runs in the BMC OS that delays its
> responses to commands.  It could also be that whatever work the BMC has to do
> to read this specific value is actually timing out or having issues in the
> hardware, etc.
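
To make the reply layout described above concrete, here is a minimal
sketch in Python (not the driver's code).  The function name
split_kcs_reply is made up for illustration; it only assumes what the
quoted text says, namely that the first two bytes echo the request and
the third byte is the completion code.

```python
def split_kcs_reply(reply):
    """Split a raw reply into (echo_byte, command, completion_code, data).

    Hypothetical helper: bytes 0 and 1 echo the request, byte 2 is the
    completion code, and anything after that is response data.
    """
    if len(reply) < 3:
        # The situation the driver complains about: the echo bytes may
        # have arrived, but the completion code has not.
        raise ValueError("reply too short to contain a completion code")
    return reply[0], reply[1], reply[2], bytes(reply[3:])
```

A reply that stops after two bytes raises ValueError here, which is
roughly the case the "whine" reports.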

I could easily modify the stress-test loop to run "date" after each
"ipmitool" invocation.  (Pity we don't seem to have a sub-second format
in strftime().)

So... I tried the above (interspersing "date" commands while running
"ipmitool dcmi power reading" in a loop within script(1)).  I did not
get a whine at 32 repetitions; I got one at 64.

The total elapsed time was no more than 3 seconds (the difference between
the last and first timestamps was 2 seconds).
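
For sub-second resolution, a small wrapper could time each invocation
itself instead of relying on date(1).  The following is only a sketch
in Python, not what was actually run here; the function name time_runs
is invented for illustration.

```python
import subprocess
import time


def time_runs(argv, repetitions):
    """Run argv `repetitions` times; return a list of (elapsed_seconds, exit_status)."""
    results = []
    for _ in range(repetitions):
        # Monotonic clock: not affected by wall-clock adjustments.
        start = time.monotonic()
        proc = subprocess.run(argv,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        results.append((time.monotonic() - start, proc.returncode))
    return results
```

Calling time_runs(["ipmitool", "dcmi", "power", "reading"], 64) and
printing the slowest entries should show whether a single invocation
stalls or whether the delay is spread evenly across the run.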

> You could try increasing the timeout in MAX_TIMEOUT (just increase '6' to
> however many seconds you want to tolerate), but keep in mind that the CPU
> sits and spins polling for a reply, so the cure may be worse than the
> disease.  You might also try polling this sensor less often.

That's one of the "odd things" -- based on the change that was committed
(locally) I would expect that we issue the "ipmitool dcmi power reading"
command (along with a handful of others) once every 59 seconds.

The complete list of such commands (fed to ipmitool via stdin) is:

dcmi power reading
sensor
raw 0x06 0x52 0x07 0x5b 0x01 0x92
raw 0x30 0x70 0x4b 0x00 0x03
exit

> We could maybe use ppsratecheck() to rate limit the errors, but that's
> sort of papering over the problem that the BMC is timing out the request.

Well, in fairness, that's probably doing a slightly less brute-force bit
of "papering over" than the patch I had provided. :-}
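
For reference, the kind of per-second rate limiting being discussed can
be sketched in userland.  This is only an illustration of the idea in
Python, loosely modelled on how ppsratecheck() is described; it is not
the kernel code and not a proposed patch.  The class name RateCheck and
its parameters are invented here.

```python
import time


class RateCheck:
    """Allow at most max_per_sec events per fixed one-second window.

    A new window starts with the first event seen after the previous
    window has expired.
    """

    def __init__(self, max_per_sec, clock=time.monotonic):
        self.max_per_sec = max_per_sec
        self.clock = clock          # injectable so the logic can be tested
        self.window_start = None
        self.count = 0

    def allow(self):
        now = self.clock()
        if self.window_start is None or now - self.window_start >= 1.0:
            # Start a new window; this event is the first one in it.
            self.window_start = now
            self.count = 1
            return self.max_per_sec != 0
        self.count += 1
        return self.count <= self.max_per_sec
```

With RateCheck(1), a burst of timeouts would produce at most one
message per second; the timeouts themselves would still be happening,
which is the "papering over" part.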

> A larger option is to modify the IPMI driver to support interrupt-driven
> operation (and not just polled) in which case a longer timeout might not
> hurt so much (you at least wouldn't be spinning on the CPU for N seconds).
> ....
 
I wouldn't mind testing that, but I don't think I'm up to writing it.

Thanks!

Peace,
david
-- 
David H. Wolfskill				david at catwhisker.org
Those who would murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.