Re: git: e769bc771843 - main - sym(4): Employ memory barriers also on x86

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Sun, 01 Feb 2026 05:31:20 UTC
On Thu, Jan 29, 2026 at 10:32:14PM +0100, Marius Strobl wrote:
> On Wed, Jan 28, 2026 at 11:35:09PM +0200, Konstantin Belousov wrote:
> > On Tue, Jan 27, 2026 at 10:56:04PM +0100, Marius Strobl wrote:
> > > On Tue, Jan 27, 2026 at 01:13:04AM +0200, Konstantin Belousov wrote:
> > > > On Mon, Jan 26, 2026 at 09:30:58PM +0100, Marius Strobl wrote:
> > > > > On Mon, Jan 26, 2026 at 06:34:49PM +0200, Konstantin Belousov wrote:
> > > > > > On Mon, Jan 26, 2026 at 03:57:45PM +0000, Marius Strobl wrote:
> > > > > > > The branch main has been updated by marius:
> > > > > > > 
> > > > > > > URL: https://cgit.FreeBSD.org/src/commit/?id=e769bc77184312b6137a9b180c97b87c0760b849
> > > > > > > 
> > > > > > > commit e769bc77184312b6137a9b180c97b87c0760b849
> > > > > > > Author:     Marius Strobl <marius@FreeBSD.org>
> > > > > > > AuthorDate: 2026-01-26 13:58:57 +0000
> > > > > > > Commit:     Marius Strobl <marius@FreeBSD.org>
> > > > > > > CommitDate: 2026-01-26 15:54:48 +0000
> > > > > > > 
> > > > > > >     sym(4): Employ memory barriers also on x86
> > > > > > >     
> > > > > > >     In an MP world, it doesn't hold that x86 requires no memory barriers.
> > > > > > It does hold.  x86 is much more strongly ordered than all other arches
> > > > > > we currently support.
> > > > > 
> > > > > If it does hold, then why is atomic_thread_fence_seq_cst() employing
> > > > > a StoreLoad barrier even on amd64?
> > > > > I agree that x86 is more strongly ordered than the other supported
> > > > > architectures, though.
> > > > Well, it depends on the purpose.
> > > > 
> > > > Can you please explain what is the purpose of this specific barrier, and
> > > > where is the reciprocal barrier for it?
> > > > 
> > > > Often drivers for advanced devices do need fences.  For instance, from
> > > > my experience with the Mellanox networking cards, there are some structures
> > > > that are located in regular cacheable memory.  The readiness of the structure
> > > > for the card is indicated by a write to some location.  If this location is
> > > > a BAR, then at least on x86 we do not need any barriers. But if it is also
> > > > in regular memory, the visibility of the writes to the structure before
> > > > the write to a signalling variable must be enforced.
> > > > 
> > > > This is done normally by atomic_thread_fence_rel(), which on x86 becomes
> > > > just a compiler barrier, since the ordering is guaranteed by the CPU (but
> > > > not the compiler).
> > > > 
> > > > In this situation, using rmb() (which is a fence) really degrades
> > > > performance at high rates.
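
(For illustration only, the publish pattern described above amounts to
roughly the following; the structure and field names are hypothetical
and this is only a sketch, not taken from a real driver:)

    #include <sys/types.h>
    #include <machine/atomic.h>

    /* Hypothetical descriptor living in regular cacheable (WB) memory. */
    struct xdesc {
        uint64_t          paddr;
        uint32_t          len;
        volatile uint32_t owner;    /* 0 = driver owns it, 1 = hardware */
    };

    static void
    xdesc_post(struct xdesc *d, uint64_t paddr, uint32_t len)
    {
        d->paddr = paddr;
        d->len = len;
        /*
         * Order the descriptor stores above before the publishing store
         * below.  On x86 this is a compiler barrier only, since TSO
         * already keeps the stores in program order.
         */
        atomic_thread_fence_rel();
        d->owner = 1;
    }

(If the signalling location were a doorbell register in a BAR instead,
the bus space write would, at least on x86, need no extra barrier, as
noted above.)
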
> > > 
> > > The problem at hand is reads from different memory locations (neither
> > > in a BAR) apparently getting reordered after having kicked the chip. As
> > > a result, the data read doesn't match its flag.
> > Nice description, thanks.
> > So writes are to the WB memory, and they are ordered by hardware.
> > And then, reads need to be correctly ordered.  Did I understand correctly?
> 
> Well, as part of making this mix and match work as expected, the
> intent is to map the HCB memory uncacheable. Prior to the advent of
> VM_MEMATTR_*, the !x86 way of indicating this to bus_dmamem_alloc(9)
> was BUS_DMA_COHERENT. Then later in 2db99100, BUS_DMA_NOCACHE was
> hooked up to VM_MEMATTR_UNCACHEABLE for x86.
> As it turns out, this still differs across architectures today; arm
> still supports BUS_DMA_COHERENT only for requesting uncacheable DMA
> memory and x86 still uses BUS_DMA_NOCACHE only. On arm64 and riscv,
> BUS_DMA_COHERENT seems to effectively be an alias for BUS_DMA_NOCACHE.
There was a recent proposal by Michal Meloun (mmel) to fix this
architecturally, but I have not seen recent progress.

> 
> So in short, the intent is to map the HCB memory uncacheable, but it
> happens to end up as write-back on x86 currently.
> 
> What is the expected effect of VM_MEMATTR_UNCACHEABLE on load/store
> re-ordering on x86? In your Mellanox MAC example you indicate that at
> least stores to bus space memory, which presumably is also mapped
> VM_MEMATTR_UNCACHEABLE, are executed in program order (but stores
> to VM_MEMATTR_WRITE_BACK mapped memory may get re-ordered). In
> general, load/store re-ordering isn't only a factor of uncacheable
> memory, though.

On AMD:
Loads and stores to any UC memory locations are guaranteed to happen in
the program order. Moreover, any access to UC memory flushes the CPU write
buffers. The manual explicitly allows reads from cacheable memory to
bypass UC accesses.

On Intel:
The SDM says that 'I/O instructions' cannot be reordered with any other
reads and writes, and I assume that 'I/O instructions' include UC accesses.

> 
> If VM_MEMATTR_UNCACHEABLE on x86 guarantees that loads and stores
> to different locations of the same uncacheable memory are executed
> in program order
It certainly does, both for Intel and AMD, according to their manuals.

> allocating the HCB memory with BUS_DMA_NOCACHE
> may be an alternate approach to using barriers. In a quick test with
> e769bc77 reverted but BUS_DMA_NOCACHE passed in addition, this at
> least doesn't have a measurable impact on performance with real
> amd64 hardware.
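
Just to illustrate what that quick test amounts to in code, here is a
sketch with made-up softc field names (tag creation and error handling
are omitted):

    /*
     * Sketch only: ask for an uncacheable mapping of the HCB memory.
     * On x86 this maps to VM_MEMATTR_UNCACHEABLE; per the caveat
     * above, other architectures may want BUS_DMA_COHERENT instead.
     */
    error = bus_dmamem_alloc(np->hcb_dmat, (void **)&np->hcb,
        BUS_DMA_NOWAIT | BUS_DMA_ZERO | BUS_DMA_NOCACHE,
        &np->hcb_dmamap);
    if (error != 0)
        return (error);
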
> 
> > This is enforced by atomic_thread_fence_acq().  On x86, due to the TSO
> > model, it only uses compiler membar.
>  
> That doesn't mean that atomic_thread_fence_acq() is a portable
> approach for DMA and bus space across architectures, though.
Sure.

> 
> > > Several factors contribute to this scenario. First off, this hardware
> > > doesn't have shiny doorbell registers but is a rather convoluted design
> > > dating back to the early days of PCI, using a heavy mix of registers
> > > in BAR space and DMAed control data, with addresses being patched into
> > > programs that are transferred to the controller RAM by the driver or
> > > may reside in host memory etc. Additionally, things don't work
> > > equally across all supported chips, as only newer ones provide load/
> > > store instructions, for example.
> > > As such, the operations of these chips might very well escape the bus
> > > snooping of more modern machines and optimizations therein. There are
> > > PCI bridges which only synchronize DMA themselves on interrupts for
> > > example.
> > I do not quite follow the part about 'escaping the bus snooping'.
> > I have no idea about this part for !x86, but on x86 hardware cannot
> > access RAM without routing the transactions through the memory controller
> > of the CPU.
> 
> In theory, yes; I seem to remember a talk about a 2017 paper by
> UoC people presenting that DMA snooping turned out not to work as
> advertised by Intel, though. That was an unexpected discovery;
> however, the paper actually had a networking topic, so I'm not sure
> what it was.
> 
> > > For drivers, we generally would want to express DMA synchronization
> > > and bus space access ordering needs in terms of bus_dmamap_sync(9)
> > > and bus_space_barrier(9) as there may be buffers, caches, IOMMUs etc.
> > > involved on/in the bus. The latter are not taken into account by
> > > atomic_thread_fence_*(9). Apparently, this is also of relevance for
> > > x86, as otherwise BUS_SPACE_BARRIER_READ would be a compiler barrier
> > > there at most.
> > For WB memory, caches are always coherent (on x86), and IOMMUs do not
> > change that. It is interesting that IOMMUs on x86 might use non-coherent
> > access to the main memory itself for things like page table reads or
> > updates, or fault log entry writes. But they definitely do not cause
> > non-coherence of the DMA.
>  
> What's needed in this case is consistency across different memory
> locations, not just coherency in a single one.
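
For what it's worth, expressing that in terms of bus_dmamap_sync(9), as
suggested above, would look roughly like this on the consumer side (the
field and constant names are made up, not the actual sym(4) ones):

    /*
     * Sketch: make the CPU view of the DMAed control data consistent
     * before looking at both the completion flag and the data it
     * guards, then hand the memory back to the device.
     */
    bus_dmamap_sync(np->hcb_dmat, np->hcb_dmamap, BUS_DMASYNC_POSTREAD);
    if (cp->host_status == HS_COMPLETE) {
        /* ... consume the rest of the DMAed status ... */
    }
    bus_dmamap_sync(np->hcb_dmat, np->hcb_dmamap, BUS_DMASYNC_PREREAD);
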
> 
> > > With things like index or bank switching registers, it's generally
> > > also not the case that there are pairs of bus space barriers, i. e.
> > > also reciprocal ones, see e. g. the example in the BARRIERS section
> > > of bus_space.9.
> > > 
> > > Due to the mess with these chips and depending on the architecture,
> > > barriers for both DMAed memory and bus space might actually be
> > > required. That's why, already before this change, powerpc used both
> > > sync and eieio, and the comment above the macros was already talking
> > > about using I/O barriers, too.
> > > 
> > > Actually, I would have expected this hardware to have aged out by
> > > now. However, apparently it still is a thing with OpenStack, so it
> > > makes sense to keep this driver working, with performance not being
> > > of great concern.
> > > I can change the driver back to duplicate *mb(9) as it was before
> > > this change, or back this change out completely, if you absolutely
> > > dislike it. I won't waste time working on an alternate approach
> > > to what the Linux version does, though, especially not given that
> > > the Linux version presumably gets considerably more exposure and
> > > testing.
> > 
> > What I would like is to remove the global rmb/wmb/mb definitions. I doubt
> > that there is a common semantic for them on different FreeBSD arches; they
> > are only similar by name.
> 
> Apparently, *mb(9) initially were brought over from Linux as-is
> as far as the instructions used go, with these macros having
> common semantics across architectures in Linux. The only exception
> where FreeBSD differs now is powerpc64, where rmb(9)/wmb(9) use
> the weaker lwsync, while Linux uses sync. Depending on the store
> order the kernel is executed in, this may be correct.
> Anyway, some other drivers already provide their own copies of
> *mb(9), so removing them from atomic.h may lead to even more
> (possibly incorrect) diversity.

The point is that the use of *mb() is typically blindly copied from the
Linux sources, where it hopefully matches the Linux kernel memory
model, which is based on barriers.  FreeBSD uses a C11-like model, and
*mb() really does not match the semantics of the Linux counterpart,
because the reciprocal barriers/fences must be paired.
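
To make that concrete: in the C11-like model a fence only orders
anything with respect to its reciprocal counterpart, so a lone
rmb()/wmb() dropped in where Linux had one does not by itself give the
pairing the model requires. A minimal sketch with hypothetical names:

    #include <sys/types.h>
    #include <machine/atomic.h>
    #include <machine/cpu.h>

    /* Hypothetical flag/data pair shared with another CPU. */
    struct shared {
        int          msg;
        volatile int ready;
    };

    /* Producer: the release fence pairs with the acquire fence below. */
    static void
    produce(struct shared *s, int v)
    {
        s->msg = v;
        atomic_thread_fence_rel();      /* pairs with fence_acq() */
        s->ready = 1;
    }

    /*
     * Consumer: without this acquire, the release above orders nothing
     * for this reader.
     */
    static int
    consume(struct shared *s)
    {
        while (s->ready == 0)
            cpu_spinwait();
        atomic_thread_fence_acq();      /* pairs with fence_rel() */
        return (s->msg);
    }

For a device on the far side of a bus rather than another CPU, the
bus_dmamap_sync(9)/bus_space_barrier(9) layer discussed earlier still
comes on top of this.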