ELF .data section variables and RWX bits

Anthony Naggs tony at ubik.demon.co.uk
Sat May 24 17:00:23 PDT 2003


In article <3ECFB146.4000700 at example.com>, Shill <devnull at example.com>
writes
>I wrote a small program stub to measure execution time on an Athlon 4:

Okay.

>_start:
>  rdtsc
>  mov ebp, eax
>  xor eax, eax
>  cpuid
>  ; BEGIN TIMED CODE
>
>  ; END TIMED CODE
>  xor eax, eax
>  cpuid
>  rdtsc
>  sub ebp, eax
>  xor eax, eax
>  cpuid
>  neg ebp
>
>Note: CPUID is used only as a serializing instruction.

You write that like you know what it means, but I'm not convinced that
you do.

>Let n be the number of cycles required to execute the code
>between the two RDTSC instructions. At the end of the stub,
>ebp is equal to n modulo 2^32.
>
>The stub alone (consistently) requires 104 cycles to execute.
>
>So far, so good.
>
>I wanted to time the latency of a store, 

You have timed your programs, but I don't know what you mean here by
"Latency".

>so I declared a single
>variable within the .data section:
>
>SECTION .data
>
>X: dd 0x12345678
>
>I timed three different programs:
>P1) mov ebx, [X]               ; load i.e. read only
>P2) mov dword [X], 0xaabbccdd  ; store i.e. write only
>P3) add dword [X], byte 0x4C   ; load/execute/store i.e. read+write
>
>P1 requires 170 cycles.
>P2 requires 12000 cycles on average (MIN=10000 and MAX=46000)
>P3 requires 22500 cycles on average (MIN=14500 and MAX=72000)
>
>NASM gives the ELF .data section the following properties:
>  progbits (i.e. explicit contents, unlike the .bss section)
>  alloc (i.e. load into memory when the program is run)
>  noexec (i.e. clear the allow execute bit)
>  write (i.e. set the allow write bit)
>  align=4 (i.e. start the section at a multiple of 4)
>
>A cache miss might explain why P1 requires 170 cycles but it does not
>explain P2 or P3, as far as I know.
>
>My guess is that the first time X is written to, an exception occurs
>(perhaps a TLB miss) and the operating system (FreeBSD in my case) is
>required to update something, somewhere.
>
>Could it be that FreeBSD does not set the write bit for the page where X
>is stored until X is *actually* written to? But then why would P3 take
>much longer than P2?
>
>As you can see, I am very confused. 

Somewhat confused.

>I would be very grateful if an ELF
>and/or ld guru could step in and share his insight.

I don't know (off-hand anyway) ELF, ld or how FreeBSD diddles with page
attributes.  I do however know a fair bit about how microprocessors work.

1. Ideally your data segment should be aligned on a 4096 byte boundary;
   this is the normal page size for IA32 (32-bit 386, 486, Pentium,
   etc.) memory.
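
   For example, NASM lets you request a bigger alignment on the section
   itself.  (This is only a sketch; whether the section really lands on
   a page boundary also depends on how ld lays out the final segments.)

       SECTION .data align=4096    ; ask for page (4096 byte) alignment

       X: dd 0x12345678            ; X sits at the start of the section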

2. The big delays in the program that does the write are probably 
   because the OS loads the program data into "copy on write" pages.
     (This is generally favoured where the same page of a program 
     contains both data & code, or where the same program image may be
     shared by multiple concurrent instances.  Specifically for FreeBSD,
     Googling for "copy on write" returns lots of matches.)
   Such pages cause the OS to intervene on the first write fault,
   creating a new page with the same content before resuming the write
   operation.

   To get a reasonable time measurement you should:
   a.  Write to the data address before running your test; this moves
       the copy-on-write operation out of your measurement.
   b.  (optional) Put a loop around your test code that runs the test
       multiple times (rather than running the program lots of times).
       This will probably give faster times, as your test code will
       likely be cached by the CPU the first time through.  (Exact
       implementation of your loop, timing of interrupts, etc. will
       affect this somewhat.)  A rough sketch of both is below.
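
   The loop count and label names here are just placeholders, and the
   RDTSC/CPUID framing roughly follows your stub:

       mov dword [X], 0          ; a. touch the page up front, so any
                                 ;    copy-on-write fault happens here

       mov esi, 1000             ; b. repeat the measurement many times
   time_loop:
       rdtsc
       mov ebp, eax              ; low 32 bits of the start count
       xor eax, eax
       cpuid                     ; serialize before the timed code

       ; BEGIN TIMED CODE
       mov dword [X], 0xaabbccdd
       ; END TIMED CODE

       xor eax, eax
       cpuid                     ; serialize after the timed code
       rdtsc
       sub eax, ebp              ; eax = elapsed cycles (modulo 2^32)
       ; store or accumulate eax here for averaging afterwards

       dec esi
       jnz time_loop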

3. On Pentium & Athlon processor families it is not very appropriate to 
   time single instructions.  These processors break down instructions 
   into RISC style mini-instructions.  They also do dynamic analysis of
   which instructions depend on each other, e.g. for register contents 
   being valid.  They then try to execute multiple RISC mini-
   instructions, corresponding to more than one x86 instruction, in
   each clock cycle.

   Instructions may even be executed in a different order (which
   requires very clever work to undo when an interrupt is received or a
   trap is generated).  Serializing instructions, such as CPUID, force
   the CPU to complete any outstanding out-of-order work before
   executing that instruction on its own.

   This means the time taken to execute an instruction is substantially
   affected by where it is placed in the program.  Conversely an extra
   instruction placed in a program could either be executed with no
   extra cycles (it "pairs" with another instruction) or add many extra
   cycles.  Loading a register whose value is needed by subsequent
   instructions can do this (though the L1 & L2 caches usually make it
   quite cheap).  Extra instructions are more likely to affect the
   behaviour of the L1 code cache, e.g. by making a loop too big to fit
   completely.

   It is more useful to time your extra instructions in the context of
   the function you are changing.
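
   As a rough illustration (the register choice is arbitrary and the
   exact cycle counts will differ between CPUs), the cost of an ADD
   depends on whether it has to wait for the instruction before it:

       ; dependent chain: each ADD needs the previous result, so the
       ; CPU cannot overlap them
       mov  eax, [X]
       add  eax, 1
       add  eax, 1
       add  eax, 1

       ; independent: these ADDs touch different registers, so the CPU
       ; can issue several of them in the same clock cycle
       mov  eax, [X]
       add  eax, 1
       add  ebx, 1
       add  ecx, 1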

4. Read/modify/write instructions are compact and work well on the 8086 
   but not so well on the i486 & up.  Intel suggest using a register 
   based programming model: load register, modify register, store 
   register.  These instructions can usually be scheduled by a compiler 
   with other instructions for maximum speed.  (Sorry I can't find a 
   reference for this just now.)
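
   For example, the read/modify/write in your P3 could be split into the
   register form (just a sketch; whether it actually wins here is
   something to measure):

       ; read/modify/write, as in P3
       add  dword [X], byte 0x4C

       ; register based: load, modify, store
       mov  eax, [X]
       add  eax, 0x4C
       mov  [X], eax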


Tony

