svn commit: r295557 - head/sys/dev/uart
Ian Lepore
ian at freebsd.org
Tue Feb 16 15:11:16 UTC 2016
On Tue, 2016-02-16 at 12:46 +1100, Bruce Evans wrote:
> On Mon, 15 Feb 2016, Ian Lepore wrote:
>
> > On Tue, 2016-02-16 at 11:01 +1100, Bruce Evans wrote:
> >> On Mon, 15 Feb 2016, Ian Lepore wrote:
> >>
> >>> On Tue, 2016-02-16 at 09:28 +1100, Bruce Evans wrote:
> >>>> On Mon, 15 Feb 2016, Michal Meloun wrote:
> >>>>
> >>>>> [...]
> >>>>> Please note that the ARM architecture does not have vectored
> >>>>> interrupts; the CPU must read the actual interrupt source from an
> >>>>> external interrupt controller (GIC) register. This register contains
> >>>>> a predefined value if no interrupt is active.
> >>>>>
> >>>>> 1 - CPU1: enters ns8250_bus_transmit() and sets IER_ETXRDY.
> >>>>> 2 - HW: UART interrupt is asserted, processed by GIC and signaled
> >>>>> to CPU2.
> >>>>> 3 - CPU2: enters interrupt service.
> >>>>
> >>>> It is blocked by uart_lock(), right?
> >>>>
> >>>>> 4 - CPU1: writes a character into the REG_DATA register.
> >>>>> 5 - HW: UART clears its interrupt request.
> >>>>> 6 - CPU2: reads the interrupt source register. No active interrupt
> >>>>> is found, a spurious interrupt is signaled, and the CPU leaves the
> >>>>> interrupt state.
> >>>>> 7 - CPU1: executes uart_barrier(). This function is not empty on ARM,
> >>>>> and can be slow in some cases.
> >>>>
> >>>> It is not empty even on x86, although it probably should be.
> >>>>
> >>>> BTW, if arm needs the barrier, then how does it work with
> >>>> bus_space_barrier() referenced in just 25 files in all of /sys/dev?
> >>>
> >>> With a hack, of course. In the arm interrupt-controller drivers we
> >>> always call bus_space_barrier() right before doing an EOI. It's not a
> >>> 100% solution, but in practice it seems to work pretty well.
> >>
> >> I thought about the x86 behaviour a bit more and now see that it does
> >> need barriers but not the ones given by bus_space_barrier(). All (?)
> >> interrupt handlers use mutexes (if not driver ones, then higher-level
> >> ones). These might give stronger or different ordering than given by
> >> bus_space_barrier(). On x86, they use the same memory bus lock as
> >> the bus_space_barrier(). This is needed to give ordering across
> >> CPUs. But for accessing a single device, you only need program order
> >> for a single CPU. This is automatic on x86 provided a mutex is used
> >> to prevent other CPUs accessing the same device. And if you don't use
> >> a mutex, then bus_space_barrier() cannot give the necessary ordering
> >> since it cannot prevent other CPUs from interfering.
> >>
> >> So how does bus_space_barrier() before EOI make much difference? It
> >> doesn't affect the order for a bunch of accesses on a single CPU.
> >> It must do more than a mutex to do something good across CPUs.
> >> Arguably, it is a bug in mutexes if they don't give synchronization
> >> for device memory.
> >>> ...
> >>> The hack code does a drain-write-buffer which doesn't g'tee that the
> >>> slow peripheral write has made it all the way to the device, but it
> >>> does at least g'tee that the write to the bus the peripheral is on has
> >>> been posted and ack'd by any bus<->bus bridge, and that seems to be
> >>> good enough in practice. (If there were multiple bridged busses
> >>> downstream it probably wouldn't be, but so far things aren't that
> >>> complicated inside the socs we support.)
> >>
> >> Hmm, so there is some automatic strong ordering but mutexes don't
> >> work for device memory?
> >
> > I guess you keep mentioning mutexes because on x86 their implementation
> > uses some of the same instructions that are involved in bus_space
> > barriers on x86? Otherwise I can't see what they have to do with
> > anything related to the spurious interrupts that happen on arm. (You
> > also mentioned multiple CPUs, which is not a requirement for this
> > trouble on arm, it'll happen with a single core.)
>
> Partly. I wasn't worrying about this "spurious" interrupt but locking
> in general. Now I don't see how mutexes can work unless they order
> device accesses in exactly the same way as ordinary memory accesses.
>
Mutexes on arm are implemented with entirely different instructions,
which do not cause any of this buffer draining or influence the
ordering of surrounding accesses to non-mutex data. When a mutex is
used to prevent concurrent hardware access between the interrupt and
non-interrupt parts of a driver, or multiple cores accessing the
hardware, a bus_space_barrier() is required after writing to the
hardware and before releasing the mutex.
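A rough sketch of the pattern I mean (names like foo_softc and FOO_REG_STAT
are made up here, not taken from any real driver):

	static void
	foo_intr(void *arg)
	{
		struct foo_softc *sc = arg;

		mtx_lock(&sc->sc_mtx);
		/* Service the device; this write can sit in a store buffer. */
		bus_space_write_4(sc->sc_bst, sc->sc_bsh, FOO_REG_STAT, ~0u);
		/*
		 * Push the write all the way out before dropping the lock,
		 * so whoever takes the lock next sees the device in the
		 * state we just wrote, not some time later.
		 */
		bus_space_barrier(sc->sc_bst, sc->sc_bsh, FOO_REG_STAT, 4,
		    BUS_SPACE_BARRIER_WRITE);
		mtx_unlock(&sc->sc_mtx);
	}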
Of course, such barriers don't exist in most drivers, especially ones
not written for soc-specific hardware, and to the degree those drivers
work, it's by accident. ::sigh:: There's no easy way to slip in a
"mostly fixes all drivers" hack like the EOI hack, short of doing a
barrier at the end of every bus_space access, and that's too expensive.
> > The piece of info you're missing might be the fact that memory-mapped
> > device registers on arm are mapped with the Device attribute which
> > gives stronger ordering than Normal memory. In particular, writes are
> > in order and not combined, but they are buffered.
>
> arm doesn't need the barrier after each output byte then, and in general
> bus_space_write_multi_N() shouldn't need a barrier after each write, but
> only it can reasonably know if it does, so any barrier(s) for it belong
> in it and not in callers.
>
There is certainly no need for a barrier call after each byte is
written into the fifo (on any hardware that I know of). A single
barrier at the end of the loop should suffice.
Hmmm, I wonder if that implies that another way to fix the original
problem, one that would work with buggy hardware too, would be:
enable interrupt
barrier
loop to fill the fifo
barrier
The first barrier would ensure the interrupt-enable reaches its
register before any data gets to the fifo (probably not needed but safe
and not-too-expensive). The second barrier should ensure that at least
one byte has gotten to the TX register, and that should result in
de-asserting the TXRDY before control leaves the uart interrupt routines.
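In ns8250 terms the transmit path would then look roughly like this (just a
sketch of the loop; locking and declarations elided, and 'ier' stands for
whatever value the driver has cached for REG_IER):

	uart_setreg(bas, REG_IER, ier | IER_ETXRDY);
	uart_barrier(bas);	/* enable reaches the device before any data */
	for (i = 0; i < sc->sc_txdatasz; i++)
		uart_setreg(bas, REG_DATA, sc->sc_txbuf[i]);
	uart_barrier(bas);	/* at least the first byte has hit the TX reg */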
> Drivers for arm could also do a bunch of unrelated writes (and reads?)
> followed by a single barrier. But this is even more MD.
>
That's what virtually all soc-specific drivers should be doing on arm,
but even that close to the metal, most of them don't. (I'm as guilty
as anyone on that front -- some of this knowledge was acquired after
drivers were written, and I haven't gone back and cleaned up.)
> > In some designs
> > there are multiple buffers, so there can be multiple writes that
> > haven't reached the hardware yet. A read from the same region will
> > stall until all writes to that region are done, and there is also an
> > instruction that specifically forces out the buffers and stalls until
> > they're empty.
>
> The multiple regions should be in separate bus spaces so that the barriers
> can be optimized to 1 at the end. I now see how the optimization can
> almost be done at the bus_space level -- drivers call
> bus_space_maybe_barrier() after every access, but this is a no-op on bus
> spaces with sufficient ordering iff the bus space hasn't changed;
> drivers call bus_space_sync_barriers() at the end. I think the DMA sync
> functions work a bit like this. A well-written interrupt handler would
> have just 1 bus_space_sync_barriers() at the end. This works like the
> EOI hack. If all arches have sufficiently strong ordering then
> bus_space_maybe_barrier() is not needed but more careful syncing is
> needed when switching the bus space.
>
There is a special place in my heart for any solution that includes a
function with "maybe" in the name.
Beyond that, I'm going to have to ponder this one. The part that
sounds expensive at runtime is keeping track of which bus_space was
last accessed so that barriers can be automatically inserted between
writes to different spaces. (Or semi-automatically if it's done only
on a maybe_barrier, but the bookkeeping is the same.)
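If somebody did want to try it, I picture the bookkeeping as something like
this (purely hypothetical, nothing like it exists today, and it hand-waves
preemption and exactly which handle gets flushed when the space changes):

	/* Hypothetical per-CPU record of the last bus space tag touched. */
	DPCPU_DEFINE(bus_space_tag_t, last_bst);

	static inline void
	bus_space_maybe_barrier(bus_space_tag_t bst, bus_space_handle_t bsh,
	    bus_size_t off, bus_size_t len, int flags)
	{
		/* Only pay for a real barrier when the bus space changes. */
		if (DPCPU_GET(last_bst) != bst) {
			bus_space_barrier(bst, bsh, off, len, flags);
			DPCPU_SET(last_bst, bst);
		}
	}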
> > Without doing the drain-write-buffer (or a device read) after each
> > write, the only g'tee you'd get is that each device sees the writes
> > directed at it in the order they were issued. With devices A and B,
> > you could write a sequence of A1 B1 A2 B2 A3 B3 and they could arrive
> > at the devices as A1 A2 B1 B2 A3 B3, or any other permutation, as long
> > as device A sees 123 and device B sees 123.
>
> I think it is even more important that the order is synchronized relative
> to memory for mutex locking. Mutex locking presumably doesn't issue
> bus barriers because that would be too wasteful for the usual case. So
> before almost every mutex unlock for a driver, there must be a
> bus_space_sync_barriers() call and this must flush all the write buffers
> for devices before all memory accesses for the mutex, so that the mutex
> works as expected. But is this automatic on arm? Ordinary memory is
> like a different bus space that has a different write buffer which might
> be flushed in a different order.
>
Looks like what I wrote in my first paragraphs might have been more in
context here. Ordinary memory is literally another bus here.
> > So on arm the need for barriers arises primarily when two different
> > devices interact with each other in some way and it matters that a
> > series of interleaved writes reaches the devices in the same relative
> > order they were issued by the cpu. That condition mostly comes up only
> > in terms of the PIC interacting with basically every other device.
>
> The code under discussion seems to provide a good example of why you
> need ordering relative to ordinary memory too. In the old version,
> CPU2 is waiting for the unlock. If it sees this before the device
> memory is flushed, then it might be confused. I guess if the device
> memory is strongly ordered (but buffered) CPU2 would not see reads out
> of order for the device memory, but it might see everything for device
> memory out of order relative to ordinary memory. However, the sprinkled
> bs barriers order everything.
>
Yes, the bus_space_barrier that is done for the EOI hack is also a
barrier for any preceding ordinary memory accesses. But it usually
happens right after a mutex has been released in an interrupt routine,
so there's still that big dose of "works by accident".
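For reference, the shape of that hack in the PIC drivers is roughly this
(softc field names made up; GICC_EOIR is the GIC end-of-interrupt register):

	/*
	 * Drain the write buffer before signalling EOI, so that any device
	 * writes done by the handler have been posted before the controller
	 * considers the interrupt finished.
	 */
	bus_space_barrier(sc->pic_bst, sc->pic_bsh, 0, 4,
	    BUS_SPACE_BARRIER_WRITE);
	bus_space_write_4(sc->pic_bst, sc->pic_bsh, GICC_EOIR, irq);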
> > I expect trouble to show up any time now as we start implementing DMA
> > drivers in socs that have generic DMA engines that are only loosely
> > coupled to the devices they're moving data for. That just seems like
> > another place where a single driver is coordinating the actions of two
> > different pieces of hardware that may be on different busses, and it's
> > ripe for the lack of barriers to cause rare or intermittent failures.
>
> I don't understand DMA syncs for most NICs either. They seem too simple
> to work. Mutex locking isn't going to control the order of host DMA.
>
On arm, the cache sync operations for DMA all end with the same drain
write buffer sequence used in the EOI hack.
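So in a typical TX path the calls line up something like this (generic names
again; FOO_REG_TXSTART is made up):

	/* Make the descriptor ring and buffer visible to the device... */
	bus_dmamap_sync(sc->sc_dtag, sc->sc_dmap, BUS_DMASYNC_PREWRITE);
	/* ...then kick the doorbell register to start the transfer. */
	bus_space_write_4(sc->sc_bst, sc->sc_bsh, FOO_REG_TXSTART, 1);
	bus_space_barrier(sc->sc_bst, sc->sc_bsh, FOO_REG_TXSTART, 4,
	    BUS_SPACE_BARRIER_WRITE);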
-- Ian