svn commit: r196119 - in stable/8/sys: . amd64/include/xen cddl/contrib/opensolaris contrib/dev/acpica contrib/pf dev/ata dev/cxgb dev/sound/usb dev/usb dev/usb/controller dev/usb/input dev/usb/mis...

Bjoern A. Zeeb bz at
Wed Aug 12 12:20:09 UTC 2009

On Wed, 12 Aug 2009, Bjoern A. Zeeb wrote:

> Author: bz
> Date: Wed Aug 12 10:32:20 2009
> New Revision: 196119
> URL:
> Log:
>  MFC r196118:
>    Put minimum alignment on the dpcpu and vnet section so that ld
>    when adding the __start_ symbol knows the expected section alignment
>    and can place the __start_ symbol correctly.
>    These sections will not support symbols with super-cache line alignment
>    requirements.
>    For full details, see posting to freebsd-current, 2009-08-10,
>    Message-ID: <20090810133111.C93661 at>.
>    Debugging and testing patches by:
>                  Kamigishi Rei (spambox,
>                  np, lstewart, jhb, kib, rwatson
>    Tested by:    Kamigishi Rei, lstewart
>    Reviewed by:  kib
>  Approved by:	re

Just as a follow-up here as well:

[ The following samples, numbers, etc. generally assume arch=amd64: ]

There are two different kinds of places where we find dynamic per-cpu
(dpcpu) data:
(1) the so-called 'master copy', which is a linker set holding the
     BSS/initialized compile-time defaults.
(2) a copy for each CPU, copied into allocated memory at run time.

The problem seen was that individual members of the set ended up
un-aligned at run-time.

Dumping the linker set (master copy), things look like this for example:

ffffffff8168f8e9 g       *ABS*  0000000000000000 __start_set_pcpu
ffffffff8168f900 l    d  set_pcpu       0000000000000000
ffffffff8168f900 g     O set_pcpu       0000000000000068 pcpu_entry_sched_switch_stats
ffffffff8168f980 l     O set_pcpu       0000000000000800 pcpu_entry_modspace
ffffffff81690180 g     O set_pcpu       0000000000000038 pcpu_entry_epair_dpcpu
ffffffff81690200 g     O set_pcpu       0000000000000500 pcpu_entry_nws
ffffffff81690700 g       *ABS*  0000000000000000 __stop_set_pcpu

The members of the linker set (master copy) are all well aligned within
the set: for example pcpu_entry_nws: 0xffffffff81690200 % 128 = 0

Looking at elfdump for the kernel this is also what we would expect:
entry: 32
         sh_name: set_pcpu
         sh_type: SHT_PROGBITS
         sh_flags: SHF_WRITE|SHF_ALLOC
         sh_addr: 0xffffffff8168f900
         sh_offset: 21559552
         sh_size: 3584
         sh_link: 0
         sh_info: 0
         sh_addralign: 128                 <<<<
         sh_entsize: 0

The problem is with __start_set_pcpu, the symbol ld adds to mark
the beginning of the section. The address of __start_set_pcpu is
not well-aligned, not even pointer-aligned:
0xffffffff8168f8e9 % 8 = 1.

When now copying the 'master copy' to a dpcpu area the aligned
symbols become un-aligned.


The dpcpu area for one CPU starts at, in this example, 0xffff0000.

Copying the master copy from the objdump above, starting at
__start_set_pcpu, will put __start_set_pcpu at 0xffff0000 but the
first symbol, pcpu_entry_sched_switch_stats, at 0xffff0017.


Two problems become obvious:
(1) the symbols are now un-aligned in the per-cpu area.
(2) due to the offset we did not copy the entire dpcpu area, so
     some symbols at the end might not have been properly initialized.

While (2) may lead to run-time problems, it is usually not a memory
corruption problem, as the dpcpu area is usually allocated in whole
pages.  So unless the dpcpu symbols end exactly page-aligned there
should be no corruption.

(1), in contrast, may lead to other effects, like a lock spanning
multiple cache lines, thus no longer permitting atomic reads and
being open to races.  The results are panics with truncated pointers
trying to access invalid memory regions, etc.  On architectures
like arm, this will immediately fault, as arm does not allow
un-aligned reads.

So one solution to the problem would be to allocate enough memory to
also account for the offset needed for proper alignment, and then to
copy the 'master copy' to a possibly unaligned start address, making
the symbols themselves properly aligned again:

dpcpu area starts at 0xffff0000
     |*** unused *******|~~~~~~~~|---------------------------...
                 0xffff0069      0xffff0080

In this sample __start_set_pcpu would be at 0xffff0069 and
pcpu_entry_sched_switch_stats at 0xffff0080 and thus properly
aligned again.  With this things will work.

Looking further at the problem, you may have already noticed that
the section for the 'master copy' starts at 0xffffffff8168f900
and that __start_set_pcpu is outside of that section, at
0xffffffff8168f8e9.  Looking at a section dump from
`readelf -S kernel` you would notice that __start_set_pcpu
directly follows the end of the previous section.

The reasons for this are twofold:
(1) ld places the __start_ symbol at '.' (the location counter),
which at that point is still at the end of the previous section,
as the new one (set_pcpu) has not yet been added; effectively
__start_set_pcpu = ALIGN(0).
(2) because we define the section manually, so that it is
writeable, ld, at the point of writing the __start_ symbol, does
not yet have any section alignment information.  That is the
reason for the ALIGN(0) in (1).

The expected behaviour would be for ld to put the *ABS* symbol at
the address where the section begins, once it is known, or to fix
up the address afterwards.  This could arguably be considered a
bug in ld that we should fix post-8.0.

One possible workaround would be to force the __start_ symbol
and the section to be equally aligned, and thus to the same
address, using linker scripts.  The drawbacks are that we would
need to touch the fragile linker scripts for each of the sections
we add, and for all architectures individually.  As the enforcement
of alignment would live in a different place from the actual set
creation, adding the alignment might easily be forgotten.
The advantage would be that we could always be sure that __start_
is at the same address where the section starts.
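For illustration only, such a linker script change might look roughly
like the following fragment (hypothetical, not an actual change to
sys/conf/ldscript.amd64; the alignment value mirrors the 128-byte
cache line used above):

```
/* hypothetical SECTIONS fragment; keep ALIGN() in sync with CACHE_LINE_SIZE */
. = ALIGN(128);
__start_set_pcpu = .;
set_pcpu : { *(set_pcpu) }
__stop_set_pcpu = .;
```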

Another solution is to put a minimum alignment on the objects inside
the section, in a way that keeps it in a single place in the source
code.  The section directive in the respective header file, which is
included by each implementation file, is the ideal place for this.
While cache-line alignment seems to be the widest alignment
restriction currently in use, one drawback, as with the ldscript
solution above, is that a symbol could enforce a wider alignment
restriction onto the section, making the __start_ symbol and the
section beginning diverge again.  Example:
    0xffffffff8168f700  __start_set_pcpu
    0xffffffff8168f800  set_pcpu
    0xffffffff8168f800  pcpu_entry_sched_switch_stats
if we were to put an alignment of 1024 on pcpu_entry_sched_switch_stats.
This is unlikely to happen.

With the minimum alignment in place, ld, at the time of placing the
__start_ symbol, already knows about the section alignment
and will place it correctly at the section beginning, effectively
doing __start_set_pcpu = ALIGN(CACHE_LINE_SHIFT) at ".".

Summary: The minimum alignment seems to be the least intrusive
solution and is believed to work for the moment.  In addition it
documents that the dpcpu and similar sections will not support
super-cache-line alignment requirements.
The long-term solution would be to fix ld to DTRT.

> Modified: stable/8/sys/net/vnet.h
> ==============================================================================
> --- stable/8/sys/net/vnet.h	Wed Aug 12 10:26:03 2009	(r196118)
> +++ stable/8/sys/net/vnet.h	Wed Aug 12 10:32:20 2009	(r196119)
> @@ -185,12 +185,14 @@ extern struct sx vnet_sxlock;
>  * Virtual network stack memory allocator, which allows global variables to
>  * be automatically instantiated for each network stack instance.
>  */
> +__asm__(
> #if defined(__arm__)
> -__asm__(".section " VNET_SETNAME ", \"aw\", %progbits");
> +	".section " VNET_SETNAME ", \"aw\", %progbits\n"
> #else
> -__asm__(".section " VNET_SETNAME ", \"aw\", @progbits");
> +	".section " VNET_SETNAME ", \"aw\", @progbits\n"
> #endif
> -__asm__(".previous");
> +	"\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
> +	"\t.previous");
> #define	VNET_NAME(n)		vnet_entry_##n
> #define	VNET_DECLARE(t, n)	extern t VNET_NAME(n)
> Modified: stable/8/sys/sys/pcpu.h
> ==============================================================================
> --- stable/8/sys/sys/pcpu.h	Wed Aug 12 10:26:03 2009	(r196118)
> +++ stable/8/sys/sys/pcpu.h	Wed Aug 12 10:32:20 2009	(r196119)
> @@ -56,12 +56,14 @@ struct thread;
> extern uintptr_t *__start_set_pcpu;
> extern uintptr_t *__stop_set_pcpu;
> +__asm__(
> #if defined(__arm__)
> -__asm__(".section set_pcpu, \"aw\", %progbits");
> +	".section set_pcpu, \"aw\", %progbits\n"
> #else
> -__asm__(".section set_pcpu, \"aw\", @progbits");
> +	".section set_pcpu, \"aw\", @progbits\n"
> #endif
> -__asm__(".previous");
> +	"\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
> +	"\t.previous");
> /*
>  * Array of dynamic pcpu base offsets.  Indexed by id.

Bjoern A. Zeeb                      The greatest risk is not taking one.
