UPDATE Re: making use of userland dtrace on FreeBSD

Thu Dec 27 05:47:09 UTC 2012

On 12/26/12 9:32 PM, Peter Wemm wrote:
> On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein <bright at mu.org> wrote:
>> On 12/26/12 8:21 PM, Peter Wemm wrote:
>>> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein <bright at mu.org> wrote:
>>>
>>>> What would be the drawbacks?  I don't want to hurt freebsd for heavy
>>>> performance, but I think this functionality should work out of the box
>>>> for most people.
>>> The drawbacks are mostly performance related.  It defeats certain
>>> hardware optimizations for call/return on leaf functions.  It'll
>>> mostly affect things like math, crypto, compression and multimedia
>>> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we
>>> generally don't seem to care about that sort of performance anyway, so
>>> what's one more loss?
>>
>> Can you clarify some?  If it was somewhat easy to re-add
>> -fomit-frame-pointer to critical libraries like this, then that would be OK?
> No, you can't add MD flags like this.  The way to do it is to follow
> things like PIC, WARNS, etc., where you can override defaults on a
> per-directory basis while respecting the system-wide user overrides.
>
> Remember, -fno-omit-frame-pointer is the default on i386 (except at
> high -O levels with gcc; I don't know where clang, the default
> compiler, draws the line).  Other platforms don't even have frame
> pointers.  You can't just scatter that switch around the place.

Agreed!  The documentation around -fno-omit-frame-pointer is a bit
strange.  The manual page says:
>            -O also turns on -fomit-frame-pointer on machines where doing so
>            does not interfere with debugging.
It then goes on to say, under the actual option, that it's turned on
at -O, -O2, -O3, etc.

>
>> To be honest, I'm not sure if you're serious about "generally don't seem to
>> care" or just feel defeated on the issue and we should care.
> We took quite a performance beating because of not using the
> tuned-by-perl assembler code in openssl on amd64, for example.  This
> flows through to benchmarks on things like apache throughput with
> mod_ssl.  Or throughput on stunnel(1).
I don't recall if I was involved in that discussion, but that is troubling.

>
> My drive-by comment about not seeming to care any more is that people
> (except for Bruce) generally don't actually measure the performance
> impact of their changes any more.  The last time this was widespread
> was when Kris Kennaway used to be constantly abusing machines and
> reporting the effects as measured by ministat(1).
>
> If somebody were to say "this change makes world take 15% longer to
> compile but has no meaningful effect on things like bzip2, openssl
> throughput etc" and posted the actual ministat output to back it up
> then there wouldn't even be a question on performance at all.  It'd
> only be "is 15% more build time worth ubiquitous dtrace?"  And that's a
> far easier thing to answer.
>
> A hand-wave leads to bikesheds.  Actual numbers are bikeshed repellant.
>
> I myself have killed patches that turned out to be premature
> optimizations because they didn't actually make any difference.  For
> example, I never committed the lazy TLB shootdown to amd64 because it
> made things slower on the hardware of the day: Opteron silicon had
> *hardware* address space tags on its TLB, and the lazy shootdown code
> only added more synchronization work and overhead.  E.g., buildworld
> was around 2% slower with the patches.
>
> Another example was the mtxpool code that caused cache line thrashing.
> If we cared about performance that would never have gone in. Sure, it
> compiled and worked, but the costs weren't quantified till much later
> and we realized how much trouble they were beyond a certain usage
> level.
>
> What's 2%?  It multiplies out.. 2% here, 1% there.. 3% over there,
> 0.5% somewhere else.. before you know it, there's a pretty big overall
> hit.
I see.  Well, I will run some numbers and report back.
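
To put the "it multiplies out" point in concrete terms, here is a rough
sketch in Python.  The four percentages are the ones quoted above; the
assumption (mine, not measured) is that each slowdown scales total run
time independently, so the costs compound rather than simply add.  The
name compound() is just for illustration.

```python
# Slowdowns quoted above, as fractions: 2%, 1%, 3%, 0.5%.
# Assumption: each one independently scales total run time.
slowdowns = [0.02, 0.01, 0.03, 0.005]


def compound(fractions):
    """Return the combined slowdown when each cost multiplies run time."""
    total = 1.0
    for f in fractions:
        total *= 1.0 + f
    return total - 1.0


combined = compound(slowdowns)
print(f"simple sum: {sum(slowdowns):.2%}")
print(f"compounded: {combined:.2%}")
```

With these four figures the compounded hit is a little over 6.6%,
slightly more than the plain sum of 6.5%.  The gap grows as more small
regressions pile up.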

>
>>> Of course it wouldn't be required with dwarf unwinding awareness, but
>>> we don't have that.
>>>
>>> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging
>>> is compiled in because there's no unwinder for doing stack traces.  We
>>> need a dwarf2+ unwinder and somebody to instrument the call frame
>>> state through the remaining assembler code.
>>>
>> How much work is that exactly?  I've only been a gdb user, not a hacker.
> gdb has a stack unwinder.  kdb/ddb/stack(9) do not.  There's
> well-established GPL code to do it, as well as libunwind and variants.
> Basically what this code has to do is run the dwarf2+ state machine to
> find all the call/return frames instead of assuming the compiler did
> it.  Heck, even glibc has a dwarf2 unwinder built into it as part of
> their exception processing system.
>
> I'm not entirely sure what more work src/lib/libelf and
> src/lib/libdwarf need.  It looks like it's got just enough implemented
> to support ctfconvert etc. and doesn't have an unwinder in it.
>
This really seems beyond my skill level / time allotment.  Let's see 
where the numbers put us in terms of system performance and then we can 
make a call on it.
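
For anyone following along, this is roughly what the simple case looks
like: the stack walk that works when frame pointers are kept.  It is a
toy model in Python, not code from any real unwinder.  The addresses,
the dict standing in for stack memory, and the function name are all
made up, and the frame layout assumes the usual x86-style prologue
(saved frame pointer at fp, return address one word above it).

```python
# Toy model of a frame-pointer stack walk.  Everything here is
# hypothetical: the addresses, the dict-as-memory, the function name.
WORD = 8  # bytes per stack slot on a 64-bit machine


def walk_frame_pointers(memory, fp, max_frames=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    Assumed layout of each frame:
      memory[fp]        -> caller's saved frame pointer
      memory[fp + WORD] -> return address into the caller
    A saved frame pointer of 0 marks the outermost frame.
    """
    trace = []
    while fp != 0 and len(trace) < max_frames:
        trace.append(memory[fp + WORD])
        fp = memory[fp]
    return trace


# Three nested calls: main -> parse -> leaf.  The stack grows down.
memory = {
    0x7000: 0x0,    0x7008: 0x400100,  # main's frame, returns to startup code
    0x6FC0: 0x7000, 0x6FC8: 0x400200,  # parse's frame, returns into main
    0x6F80: 0x6FC0, 0x6F88: 0x400300,  # leaf's frame, returns into parse
}
trace = walk_frame_pointers(memory, 0x6F80)
print([hex(addr) for addr in trace])
```

Once the compiler omits the frame pointer, that chain is no longer
maintained and a walk like this has nothing to follow.  That is the gap
a dwarf2+ unwinder would fill by computing each caller's frame from the
call frame information instead.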

I'd rather give up a few % of perf for the power of dtrace, but not if
that % is in the double digits.

-Alfred