Re: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible

From: Chuck Silvers <chuq_at_chuq.com>
Date: Tue, 01 Nov 2022 21:55:51 UTC
Hi Ravi,

the usual path (the IPMI_SYSTEM_INTERFACE_ADDR_TYPE case) for an "application" request
(ie. one from userland) is:

ipmi_polled_enqueue_request()
sc->ipmi_enqueue_request()
ipmi_ioctl()

which uses the regular (low priority) queue.

in the non-IPMI_SYSTEM_INTERFACE_ADDR_TYPE case (aka. the IPMB case),
ipmi_ioctl() does issue two driver requests to set up the IPMB stuff before
then issuing the real application request with sc->ipmi_enqueue_request() as above.
I don't know much about this IPMB thing, it looks like a kind of pass-through
interface for communicating with another device that sits behind the main
BMC IPMI controller.

I suppose that it would make sense to change these two IPMB setup driver requests
to use the regular (low priority) queue (since sleeping is ok in that path),
but I don't think I have any way to test that since our systems don't use this path.

-Chuck


On Tue, Nov 01, 2022 at 11:53:57AM -0700, Ravi Pokala wrote:
> Hi Chuck,
> 
> As best as I can tell, the stack for handling a request from userland (e.g. `ipmitool') looks like this:
> 
> ipmi_polled_enqueue_request_highpri()
> kcs_driver_request_queue() /* assuming not panicked or dumping core */
> kcs_driver_request()
> sc->ipmi_driver_request()
> ipmi_submit_driver_request()
> ipmi_ioctl()
> 
> And the stack for handling kernel-internal requests (i.e. watchdog) is essentially the same, except for calling ipmi_submit_driver_request() directly rather than going through ipmi_ioctl(). Specifically, there doesn't seem to be anything that would put the two different types of requests on two different queues; it looks like everything goes onto the highpri queue.
> 
> What am I missing?
> 
> Thanks,
> 
> Ravi (rpokala@)
> 
> -----Original Message-----
> From: <owner-src-committers@freebsd.org> on behalf of Chuck Silvers <chs@FreeBSD.org>
> Date: 2022-11-01, Tuesday at 10:55
> To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org>, <dev-commits-src-main@FreeBSD.org>
> Subject: git: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible
> 
>     The branch main has been updated by chs:
> 
>     URL: https://cgit.FreeBSD.org/src/commit/?id=f0f3e3e961d311f1cd938f1319385e7f454525f1
> 
>     commit f0f3e3e961d311f1cd938f1319385e7f454525f1
>     Author:     Chuck Silvers <chs@FreeBSD.org>
>     AuthorDate: 2022-11-01 17:55:14 +0000
>     Commit:     Chuck Silvers <chs@FreeBSD.org>
>     CommitDate: 2022-11-01 17:55:14 +0000
> 
>         ipmi: use a queue for kcs driver requests when possible
> 
>         The ipmi watchdog pretimeout action can trigger unintentionally in
>         certain rare, complicated situations.  What we have seen at Netflix
>         is that the BMC can sometimes be sent a continuous stream of
>         writes to port 0x80, and due to what is a bug or misconfiguration
>         in the BMC software, this results in the BMC running out of memory,
>         becoming very slow to respond to KCS requests, and eventually being
>         rebooted by its own internal watchdog.  While that is going on in
>         the BMC, back in the host OS, a number of requests are pending in
>         the ipmi request queue, and the kcs_loop thread is working on
>         processing these requests.  All of the KCS accesses to process
>         those requests are timing out and eventually failing because the
>         BMC is responding very slowly or not at all, and the kcs_loop thread
>         is holding the IPMI_IO_LOCK the whole time that is going on.
>         Meanwhile the watchdogd process in the host is trying to pat the
>         BMC watchdog, and this process is sleeping waiting to get the
>         IPMI_IO_LOCK.  It's not entirely clear why the watchdogd process
>         is sleeping for this lock, because the intention is that a thread
>         holding the IPMI_IO_LOCK should not sleep and thus any thread
>         that wants the lock should just spin to wait for it.  My best guess
>         is that the kcs_loop thread is spinning waiting for the BMC to
>         respond for so long that it is eventually preempted, and during
>         the brief interval when the kcs_loop thread is not running,
>         the watchdogd thread notices that the lock holder is not running
>         and sleeps.  When the kcs_loop thread eventually finishes processing
>         one request, it drops the IPMI_IO_LOCK and then immediately takes the
>         lock again so it can process the next request in the queue.
>         Because the watchdogd thread is sleeping at this point, the kcs_loop
>         always wins the race to acquire the IPMI_IO_LOCK, thus starving
>         the watchdogd thread.  The callout for the watchdog pretimeout
>         would be reset by the watchdogd thread after its request to the BMC
>         watchdog completes, but since that request never processed, the
>         pretimeout callout eventually fires, even though there is nothing
>         actually wrong with the host.
> 
>         To prevent this saga from unfolding:
> 
>          - when kcs_driver_request() is called in a context where it can sleep,
>            queue the request and let the worker thread process it rather than
>            trying to process in the original thread.
>          - add a new high-priority queue for driver requests, so that the
>            watchdog patting requests will be processed as quickly as possible
>            even if lots of application requests have already been queued.
> 
>         With these two changes, the watchdog pretimeout action does not trigger
>         even if the BMC is completely out to lunch for long periods of time
>         (as long as the watchdogd check command does not also get stuck).
> 
>         Sponsored by:   Netflix
>         Reviewed by:    imp
>         Differential Revision:  https://reviews.freebsd.org/D36555
>     ---
>      sys/dev/ipmi/ipmi.c     | 33 ++++++++++++++++++++++++++++++---
>      sys/dev/ipmi/ipmi_kcs.c | 28 +++++++++++++++++++++++++++-
>      sys/dev/ipmi/ipmivars.h |  2 ++
>      3 files changed, 59 insertions(+), 4 deletions(-)
> 
>     diff --git a/sys/dev/ipmi/ipmi.c b/sys/dev/ipmi/ipmi.c
>     index d79690d55c68..b8705a81627b 100644
>     --- a/sys/dev/ipmi/ipmi.c
>     +++ b/sys/dev/ipmi/ipmi.c
>     @@ -208,6 +208,15 @@ ipmi_dtor(void *arg)
>      	IPMI_LOCK(sc);
>      	if (dev->ipmi_requests) {
>      		/* Throw away any pending requests for this device. */
>     +		TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests_highpri, ir_link,
>     +		    nreq) {
>     +			if (req->ir_owner == dev) {
>     +				TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req,
>     +				    ir_link);
>     +				dev->ipmi_requests--;
>     +				ipmi_free_request(req);
>     +			}
>     +		}
>      		TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests, ir_link,
>      		    nreq) {
>      			if (req->ir_owner == dev) {
>     @@ -579,13 +588,19 @@ ipmi_dequeue_request(struct ipmi_softc *sc)
> 
>      	IPMI_LOCK_ASSERT(sc);
> 
>     -	while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests))
>     +	while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests) &&
>     +	    TAILQ_EMPTY(&sc->ipmi_pending_requests_highpri))
>      		cv_wait(&sc->ipmi_request_added, &sc->ipmi_requests_lock);
>      	if (sc->ipmi_detaching)
>      		return (NULL);
> 
>     -	req = TAILQ_FIRST(&sc->ipmi_pending_requests);
>     -	TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link);
>     +	req = TAILQ_FIRST(&sc->ipmi_pending_requests_highpri);
>     +	if (req != NULL)
>     +		TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req, ir_link);
>     +	else {
>     +		req = TAILQ_FIRST(&sc->ipmi_pending_requests);
>     +		TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link);
>     +	}
>      	return (req);
>      }
> 
>     @@ -601,6 +616,17 @@ ipmi_polled_enqueue_request(struct ipmi_softc *sc, struct ipmi_request *req)
>      	return (0);
>      }
> 
>     +int
>     +ipmi_polled_enqueue_request_highpri(struct ipmi_softc *sc, struct ipmi_request *req)
>     +{
>     +
>     +	IPMI_LOCK_ASSERT(sc);
>     +
>     +	TAILQ_INSERT_TAIL(&sc->ipmi_pending_requests_highpri, req, ir_link);
>     +	cv_signal(&sc->ipmi_request_added);
>     +	return (0);
>     +}
>     +
>      /*
>       * Watchdog event handler.
>       */
>     @@ -817,6 +843,7 @@ ipmi_startup(void *arg)
>      	mtx_init(&sc->ipmi_requests_lock, "ipmi requests", NULL, MTX_DEF);
>      	mtx_init(&sc->ipmi_io_lock, "ipmi io", NULL, MTX_DEF);
>      	cv_init(&sc->ipmi_request_added, "ipmireq");
>     +	TAILQ_INIT(&sc->ipmi_pending_requests_highpri);
>      	TAILQ_INIT(&sc->ipmi_pending_requests);
> 
>      	/* Initialize interface-dependent state. */
>     diff --git a/sys/dev/ipmi/ipmi_kcs.c b/sys/dev/ipmi/ipmi_kcs.c
>     index df3b37614eb7..5908ec88e039 100644
>     --- a/sys/dev/ipmi/ipmi_kcs.c
>     +++ b/sys/dev/ipmi/ipmi_kcs.c
>     @@ -32,6 +32,7 @@ __FBSDID("$FreeBSD$");
>      #include <sys/param.h>
>      #include <sys/systm.h>
>      #include <sys/bus.h>
>     +#include <sys/conf.h>
>      #include <sys/condvar.h>
>      #include <sys/eventhandler.h>
>      #include <sys/kernel.h>
>     @@ -490,7 +491,21 @@ kcs_startup(struct ipmi_softc *sc)
>      }
> 
>      static int
>     -kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
>     +kcs_driver_request_queue(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
>     +{
>     +	int error;
>     +
>     +	IPMI_LOCK(sc);
>     +	ipmi_polled_enqueue_request_highpri(sc, req);
>     +	error = msleep(req, &sc->ipmi_requests_lock, 0, "ipmireq", timo);
>     +	if (error == 0)
>     +		error = req->ir_error;
>     +	IPMI_UNLOCK(sc);
>     +	return (error);
>     +}
>     +
>     +static int
>     +kcs_driver_request_poll(struct ipmi_softc *sc, struct ipmi_request *req)
>      {
>      	int i, ok;
> 
>     @@ -504,6 +519,17 @@ kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
>      	return (req->ir_error);
>      }
> 
>     +static int
>     +kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
>     +{
>     +
>     +	if (KERNEL_PANICKED() || dumping)
>     +		return (kcs_driver_request_poll(sc, req));
>     +	else
>     +		return (kcs_driver_request_queue(sc, req, timo));
>     +}
>     +
>     +
>      int
>      ipmi_kcs_attach(struct ipmi_softc *sc)
>      {
>     diff --git a/sys/dev/ipmi/ipmivars.h b/sys/dev/ipmi/ipmivars.h
>     index b0548ee3d7c3..1468e86be73b 100644
>     --- a/sys/dev/ipmi/ipmivars.h
>     +++ b/sys/dev/ipmi/ipmivars.h
>     @@ -111,6 +111,7 @@ struct ipmi_softc {
>      	uint8_t			ipmi_dev_support;	/* IPMI_ADS_* */
>      	struct cdev		*ipmi_cdev;
>      	TAILQ_HEAD(,ipmi_request) ipmi_pending_requests;
>     +	TAILQ_HEAD(,ipmi_request) ipmi_pending_requests_highpri;
>      	int			ipmi_driver_requests_polled;
>      	eventhandler_tag	ipmi_power_cycle_tag;
>      	eventhandler_tag	ipmi_watchdog_tag;
>     @@ -237,6 +238,7 @@ void	ipmi_complete_request(struct ipmi_softc *, struct ipmi_request *);
>      struct ipmi_request *ipmi_dequeue_request(struct ipmi_softc *);
>      void	ipmi_free_request(struct ipmi_request *);
>      int	ipmi_polled_enqueue_request(struct ipmi_softc *, struct ipmi_request *);
>     +int	ipmi_polled_enqueue_request_highpri(struct ipmi_softc *, struct ipmi_request *);
>      int	ipmi_submit_driver_request(struct ipmi_softc *, struct ipmi_request *,
>      	    int);
> 
>