Time sharing for interrupt threads

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Wed, 04 May 2022 21:48:39 UTC
My recent changes to the softclock threads (raising their priority) were to
address a livelock issue in which a constant stream of packets could starve
timeout events.  However, in the discussion on the review
(https://reviews.freebsd.org/D33693), several other livelock cases were
raised.  As a result, I've pursued a possibly more general solution to
livelock among interrupt threads.

Previously I have toyed with the notion of extending our existing interrupt
API to support cooperative time-sharing among interrupt handlers (where
handlers would have to explicitly reschedule themselves before returning).
This would be similar to what some drivers have tried to do in the past
by scheduling tasks on taskqueues to defer work.  However, this approach
was a bit clunky (and the implementation was also a bit clunky) as it
required driver changes.  It also required drivers to understand how much
work is "too much work" and to know when to yield.  Using tasks on
taskqueues had similar problems.  Another issue with the taskqueue
approach is that a task would try to yield by rescheduling itself, but
this didn't actually work: the taskqueue thread would just see the
requeued task and run it again right away, so the task always ended up
running to completion.
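
To make that failure mode concrete, here is a minimal sketch of the
pattern (the foo_* names and softc fields are invented for illustration;
only taskqueue_enqueue() and the task handler signature are the real
taskqueue(9) API):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>

static void
foo_rx_task(void *context, int pending)
{
	struct foo_softc *sc = context;	/* hypothetical driver softc */
	int budget = 64;

	/* Process a bounded batch of received packets. */
	while (foo_rx_ready(sc) && budget-- > 0)
		foo_rx_one(sc);

	/*
	 * Attempt to "yield" by requeueing the task if more work is
	 * pending.  In practice the taskqueue thread sees the requeued
	 * task as soon as this handler returns and runs it again
	 * immediately, so nothing else gets to run in between.
	 */
	if (foo_rx_ready(sc))
		taskqueue_enqueue(sc->sc_tq, &sc->sc_rx_task);
}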

iflib attempts to ameliorate the task requeue problem by forcing all
iflib devices to use a single pool of per-CPU threads so that each
thread can at least round-robin among the tasks for multiple devices.
However, this doesn't achieve time-sharing with other interrupt
handlers, whether for non-iflib NIC drivers or for non-network devices.
It also still has the problem of drivers having to guess at when to
yield.

Instead, the approach I now think might be most workable is to do
forced time-sharing via preemption from clock interrupts, similar to
what is done for userspace time-sharing threads.  However, this would
be a separate tier of time-sharing that applies only among
ithreads.  I have implemented an initial prototype of this in ULE and
Drew has tested it in some specific livelock use cases.  The general
idea of this approach is that when sched_clock() (which runs at stathz)
notices that an ithread has used a full time slice without going idle,
the ithread's priority is demoted by 4 (to force it into a separate
runq), and if there is a runnable thread waiting at the ithread's
original priority, a preemption is scheduled when the clock interrupt
finishes.  The ithread's "base" priority is restored when the ithread
goes idle (the priority isn't actually restored until the idle
ithread is resumed via sched_wakeup, but this gives the desired behavior
of undoing the demotion once the ithread stops running and goes idle).
One other constraint is that ithreads will only demote to PRI_MAX_ITHD
but no lower, so ithreads will still always be preferred to non-ithreads.
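
As a rough illustration only (this is not the actual patch, which lives
in ULE on the branch linked below, and the helper name here is made up),
the core of the idea on a stathz tick for an ithread that has burned a
full slice looks something like:

/* Sketch assumes <sys/proc.h>, <sys/sched.h>, <sys/runq.h>, <sys/priority.h>. */
static void
sched_demote_ithread(struct thread *td)
{
	u_char pri;

	/*
	 * Demote by one runq (RQ_PPQ == 4 priority levels), but never
	 * below PRI_MAX_ITHD so the ithread still beats non-ithreads.
	 */
	pri = min(td->td_priority + RQ_PPQ, PRI_MAX_ITHD);
	if (pri != td->td_priority) {
		sched_prio(td, pri);
		/*
		 * If another thread is now runnable at the ithread's
		 * old priority, flag a preemption to occur once the
		 * clock interrupt finishes.
		 */
	}
	/*
	 * The demotion is undone once the ithread goes idle (in
	 * practice when it is next woken via sched_wakeup()).
	 */
}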

The changes to implement this can be found on the swi_refactor
branch here: https://github.com/freebsd/freebsd-src/compare/main...bsdjhb:swi_refactor

One thing that Drew found when testing this initially, though, is that
while non-iflib NIC drivers (which run handlers at PI_NET) could
preempt each other, an iflib driver (which runs its gtaskqueue threads
at a hardcoded PI_SOFT) could not, or at least not as well.  My
suspicion is that it just takes several full slices for a PI_NET
ithread to demote down to PI_SOFT where the gtaskqueue iflib thread
would actually get a chance to run, and that even with a steady stream
of traffic the non-iflib driver occasionally catches up and briefly
goes idle, which resets its priority back to PI_NET.  As a quick hack,
Drew changed iflib to use PI_NET, and once that was done the iflib
driver was able to preempt the non-iflib driver smoothly.

From that experience, I suspect that if we do want to pursue this
approach, we may need to collapse our notion of ithread priorities
a bit.  That is, we may want priorities to look something like
this:

PI_REALTIME (PRI_MIN_ITHD + 0)  /* Reserved for "unusual" hardware like maybe clocks. */
PI_HARDWARE (PRI_MIN_ITHD + 4)  /* All "normal" hardware like NET/DISK/etc. */
PI_SOFT     (PRI_MIN_ITHD + 8)  /* Software handlers like netisr and others. */

This way, once a hardware handler uses up a time slice, it ends up
yielding to all other hardware handlers, and if it uses two slices it
can't even starve swis.  It's not really clear to me that we need
hard lines between different classes of interrupts nowadays, especially
if we force time-sharing between long-running handlers.  It's debatable
whether NVMe devices (PI_DISK) should always be preempted by NICs
(PI_NET), for example.  In the review Warner had proposed a separate
PI_FLASH for flash storage that might perhaps sit above PI_NET, but I
think it might be simplest to let the system auto-adjust to interrupt
load via time-sharing instead.
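
To make the arithmetic concrete, a hardware handler that never goes idle
would walk down the priorities like this under the proposed layout:

  start:           PI_HARDWARE (PRI_MIN_ITHD + 4)
  after 1 slice:   PRI_MIN_ITHD + 8   (behind fresh hardware handlers, level with swis)
  after 2 slices:  PRI_MIN_ITHD + 12  (below swis, but still above all non-ithreads)

and it snaps back to PI_HARDWARE as soon as it goes idle.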

In terms of softclock, I'm not quite sure whether it should run at
PI_HARDWARE or PI_SOFT in the above model.  I think it probably
wants to run at PI_HARDWARE, but it would likely still work even
at PI_SOFT in at least some livelock workloads.

If we do decide to move forward with some sort of time-sharing, I do
think we can possibly simplify iflib drivers by removing the global
taskqueue thread pool and reverting to just using dedicated ithreads.
This would avoid iflib drivers having to figure out when to yield and
instead let the handlers run to completion.

One further historical note: the reason we have different priorities for
ithreads at all is mostly just an artifact of moving away from spl.
In the old interrupt masking implementation in FreeBSD 4.x and before,
pending interrupts (both hardware and software) were tracked in a bitmask
(in fact, there were only 24 bits for hardware interrupts, so we enforced
artificial interrupt sharing by tying multiple hardware interrupts to the
same bit) and pending interrupts were handled via an ffs() loop on the
bitmask during splx().  When we added ithreads we kept priorities that
reflected the relative order of the bits set aside in the pending bitmask.
Certainly that is where the SWI_* values come from, and the relative
ordering of the PI_* interrupts is mostly, I think, a SWAG based on the
fact that at the time network devices needed lower latency than disks
(which were all spinning rust).  At least some versions of Solaris (the
one my Solaris Internals book describes) didn't assign differing priorities
to ithreads at all (all ithreads ran at the same priority, and I believe
the version of Solaris in question had per-CPU ithreads that handlers were
queued to when an interrupt asserted).  I'm less familiar with what other
systems do in terms of the relative priority of device interrupts.

Given all that, my question is whether other folks think that time-sharing
of interrupt threads is an idea worth pursuing (vs. having ithreads
always run to completion).  If so, do folks have feedback on the
design or on the question of collapsing ithread priorities?

-- 
John Baldwin