"ipmi0: KCS..." whines
David Wolfskill
david at catwhisker.org
Fri Aug 12 21:43:42 UTC 2016
On Fri, Aug 12, 2016 at 11:54:38AM -0700, John Baldwin wrote:
> ...
> So the issue is probably that the BMC on your box is sometimes
> slow in responding. The completion code is the third byte of the reply
> we wait to read after sending a request to the BMC via KCS. However, the
> first two bytes just echo back the request ID and command we asked for, so
> it may be that the BMC echoes those back right away without waiting for
> whatever work it needs to do to handle the request to complete, but doesn't
> send the completion code (the status of the request) until the request is
> fully processed.
>
> The driver is complaining that the BMC didn't respond with the completion
> code before its timeout expired. The default timeout is MAX_TIMEOUT in
> sys/dev/ipmi/ipmivars.h which corresponds to 6 seconds. It may be that
> occasionally some "background" task runs in the BMC OS that delays responses
> to handling commands. It could also be that whatever work the BMC has to do
> to read this specific value is actually timing out or having issues in the
> hardware, etc.
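The reply layout described above can be sketched as a small parser. This is an illustrative Python model, not code from the FreeBSD driver: the names KcsReply and parse_kcs_reply and the sample bytes in the usage note are hypothetical. It treats the first two bytes as the echoed header (the IPMI specification calls the first one NetFn/LUN) and the third as the completion code, and it represents a reply whose completion code has not arrived yet as None.

```python
from typing import NamedTuple, Optional


class KcsReply(NamedTuple):
    """Hypothetical container for one reply read back over KCS."""
    netfn_lun: int                  # first echoed byte
    command: int                    # second echoed byte
    completion_code: Optional[int]  # third byte, None if not yet received
    data: bytes                     # any payload after the completion code


def parse_kcs_reply(raw: bytes) -> KcsReply:
    """Split a raw reply into header, completion code, and payload."""
    if len(raw) < 2:
        raise ValueError("reply shorter than the two echoed header bytes")
    # The completion code is the third byte; it may be missing if the
    # BMC echoed the header but has not finished the request.
    completion = raw[2] if len(raw) >= 3 else None
    return KcsReply(raw[0], raw[1], completion, bytes(raw[3:]))
```

For example, parse_kcs_reply(bytes([0x2d, 0x02])) yields a reply whose completion_code is None, which corresponds to the case the driver complains about. In IPMI a completion code of 0x00 means the command completed normally.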
I could easily modify the stress-test loop to run "date" after each
"ipmitool" invocation. (Pity we don't seem to have a sub-second format
in strftime().)
So... I tried the above (interspersing "date" commands while running
"ipmitool dcmi power reading" in a loop within script(1)). I did not
get a whine at 32 repetitions; I got one at 64.
The total elapsed time was no more than 3 seconds (the difference between
the last and first timestamps was 2 seconds).
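A stress loop of this kind, with sub-second timestamps, might look like the Python sketch below. It is only an approximation of the test described above: the helper name timed_runs is hypothetical, and the timestamps still have to be matched by hand against the kernel messages, since the whine appears in the system log rather than in ipmitool's output.

```python
import subprocess
import time
from datetime import datetime


def timed_runs(argv, count):
    """Run argv `count` times.

    Returns one (wall-clock start, elapsed seconds, exit status)
    tuple per run.
    """
    results = []
    for _ in range(count):
        started = datetime.now()
        t0 = time.monotonic()
        proc = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append((started, time.monotonic() - t0, proc.returncode))
    return results


if __name__ == "__main__":
    # 64 repetitions is the count at which a whine showed up above.
    cmd = ["ipmitool", "dcmi", "power", "reading"]
    for started, elapsed, status in timed_runs(cmd, 64):
        stamp = started.isoformat(timespec="microseconds")
        print(f"{stamp} {elapsed:.6f}s exit={status}")
```

The per-run elapsed time makes it possible to see whether a single invocation stalls, which one-second date(1) output cannot show.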
> You could try increasing the timeout in MAX_TIMEOUT (just increase '6' to
> however many seconds you want to tolerate), but keep in mind that the CPU
> sits and spins polling for a reply, so the cure may be worse than the
> disease. You might also try polling this sensor less often.
That's one of the "odd things" -- based on the change that was committed
(locally) I would expect that we issue the "ipmitool dcmi power reading"
command (along with a handful of others) once every 59 seconds.
The complete list of such commands (fed to ipmitool via stdin) is:
    dcmi power reading
    sensor
    raw 0x06 0x52 0x07 0x5b 0x01 0x92
    raw 0x30 0x70 0x4b 0x00 0x03
    exit
> We could maybe use ppsratecheck() to rate limit the errors, but that's
> sort of papering over the problem that the BMC is timing out the request.
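The rate-limiting idea can be sketched in userspace. The Python class below is a simplified model of capping events per second, loosely in the spirit of ppsratecheck(); it is not the kernel implementation, and the name RateLimiter and the injectable clock are assumptions made for the example.

```python
import time


class RateLimiter:
    """Allow at most `max_per_sec` events in each one-second window."""

    def __init__(self, max_per_sec, clock=time.monotonic):
        self.max_per_sec = max_per_sec
        self.clock = clock
        self.window_start = None  # start time of the current window
        self.count = 0            # events seen in the current window

    def allow(self):
        """Return True if the caller may report this event."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= 1.0:
            # A new one-second window begins with this event.
            self.window_start = now
            self.count = 1
            return self.max_per_sec != 0
        self.count += 1
        # A negative limit means "never suppress".
        return self.max_per_sec < 0 or self.count <= self.max_per_sec
```

A caller would wrap the error message in "if limiter.allow():", so that repeated timeouts within the same second produce a bounded number of log lines. As noted above, this only quiets the messages; the BMC still times out.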
Well, in fairness, that's probably a slightly less brute-force bit
of "papering over" than the patch I had provided. :-}
> A larger option is to modify the IPMI driver to support interrupt-driven
> operation (and not just polled) in which case a longer timeout might not
> hurt so much (you at least wouldn't be spinning on the CPU for N seconds).
> ....
I wouldn't mind testing that, but I don't think I'm up to writing it.
Thanks!
Peace,
david
--
David H. Wolfskill david at catwhisker.org
Those who would murder in the name of God or prophet are blasphemous cowards.
See http://www.catwhisker.org/~david/publickey.gpg for my public key.