Nvidia, TLS and __thread keyword -- an observation

Wed Jun 18 01:42:45 PDT 2003

Marcel Moolenaar wrote:
> I'm not sure you understand the issue (I can easily be wrong, I just
> don't see the evidence in your statement). To support the __thread
> keyword, our thread library needs to create the TLS as defined in the
> binary and its dependent shared libraries by virtue of the .tdata and
> .tbss sections/segments, based on the image of the TLS as constructed
> by the RTLD for the initial set of modules (created for the initial
> thread) and amended by TLS space defined in the dynamically loaded
> libraries; and the TLS has to be created for every new thread at the
> time the thread itself is created.

Most of these issues can be sidestepped.  The correct approach is
actually the Microsoft published model, which is a rehash of another
even older model:

For each shared object, be it a library or dlopen'ed .so, you need:

1)	Process attach (currently, .init)
2)	Process detach (currently, .fini)
3)	Thread attach (you are implying this with .tdata and .tbss)
4)	Thread detach (you are implying this with .tdata and .tbss)

Really, you want an explicit interface, rather than an implied one.
This may mean "implying" the creation of .tini and .tfin sections
that deal with the .tdata/.tbss setup and teardown, or it may mean
some other approach.
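
To make the four events above concrete, here is a minimal sketch of
what an explicit per-object entry point might look like.  It is
loosely modeled on the reason codes passed to DllMain in the Microsoft
model; every name in it (mod_event, mod_entry_fn, mod_load,
mod_thread_event, mod_unload_all, demo_entry) is hypothetical and not
part of any existing RTLD or thread library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes for the four events listed above. */
enum mod_event {
    MOD_PROCESS_ATTACH,
    MOD_PROCESS_DETACH,
    MOD_THREAD_ATTACH,
    MOD_THREAD_DETACH,
    MOD_EVENT_COUNT
};

/* Hypothetical per-object entry point; returns 0 on success. */
typedef int (*mod_entry_fn)(enum mod_event ev, void *thread_ctx);

#define MOD_MAX 8

mod_entry_fn mod_table[MOD_MAX];    /* loaded objects, in load order */
size_t mod_count;

/* Load: record the entry point and deliver process attach. */
int mod_load(mod_entry_fn entry)
{
    if (entry == NULL || mod_count == MOD_MAX)
        return -1;
    mod_table[mod_count++] = entry;
    return entry(MOD_PROCESS_ATTACH, NULL);
}

/*
 * Deliver a thread event to every loaded object: attach in load
 * order, detach in reverse load order.  Returns the failure count.
 */
int mod_thread_event(enum mod_event ev, void *thread_ctx)
{
    int failures = 0;
    size_t i;

    assert(ev == MOD_THREAD_ATTACH || ev == MOD_THREAD_DETACH);
    for (i = 0; i < mod_count; i++) {
        size_t idx = (ev == MOD_THREAD_ATTACH) ? i : mod_count - 1 - i;

        if (mod_table[idx](ev, thread_ctx) != 0)
            failures++;
    }
    return failures;
}

/* Unload everything: deliver process detach in reverse load order. */
void mod_unload_all(void)
{
    while (mod_count > 0)
        (void)mod_table[--mod_count](MOD_PROCESS_DETACH, NULL);
}

/* Example object that only counts the events it receives. */
int demo_seen[MOD_EVENT_COUNT];

int demo_entry(enum mod_event ev, void *thread_ctx)
{
    (void)thread_ctx;
    demo_seen[ev]++;
    return 0;
}
```

A thread library built this way would call mod_thread_event() with
MOD_THREAD_ATTACH from its thread start path and MOD_THREAD_DETACH
from its exit path; what each object does with .tdata/.tbss then
stays inside its own entry point.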

Actually, putting the relocation table in a per-thread code table
would also resolve the relocation issues that lead people to want
locks around the RTLD references, but it's probably more correct to
resolve this by serializing the thread attach/detach process instead.

> The static TLS model requires the least amount of work: add support
> to allocate the TLS image for every thread creation and point the
> thread pointer to it in a way compatible with the runtime spec.

There would be no difference between static vs. dynamic, for the most
part, if one were to use the .tini/.tfin approach.  The trouble you
are anticipating here all comes from having defined things as a data
interface, rather than using accessor/mutator functions to operate on
the data and to hook the thread attach/detach events.  Attaches are
implicit for any existing threads at time of load, and detaches are
implicit for any existing threads at time of unload.  That means you
need to deal with them in the same fashion as create/delete, and you
need to deal with out-of-order events; this is a restriction you
already have to live with anyway.
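
As an illustration of the accessor/mutator style, here is a minimal
sketch built on the standard POSIX key interface (pthread_key_create,
pthread_getspecific, pthread_setspecific).  The attach happens lazily
inside the accessor, and the key destructor plays the role of the
thread detach hook.  The draw_ctx structure and all draw_* names are
hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state a library might otherwise expose as data. */
struct draw_ctx {
    int frame;
};

static pthread_key_t draw_key;
static pthread_once_t draw_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t draw_lock = PTHREAD_MUTEX_INITIALIZER;
static int draw_detached;           /* thread detach events seen so far */

/* Thread detach hook: runs at thread exit if the thread has a context. */
static void draw_detach(void *p)
{
    pthread_mutex_lock(&draw_lock);
    draw_detached++;
    pthread_mutex_unlock(&draw_lock);
    free(p);
}

static void draw_key_init(void)
{
    int rc = pthread_key_create(&draw_key, draw_detach);

    assert(rc == 0);
    (void)rc;
}

/* Accessor: thread attach happens lazily, on first use in each thread. */
struct draw_ctx *draw_ctx_get(void)
{
    struct draw_ctx *c;

    pthread_once(&draw_once, draw_key_init);
    c = pthread_getspecific(draw_key);
    if (c == NULL) {
        c = calloc(1, sizeof(*c));
        if (c != NULL && pthread_setspecific(draw_key, c) != 0) {
            free(c);
            c = NULL;
        }
    }
    return c;
}

/* Mutator: returns 0 on success, -1 if no context could be attached. */
int draw_set_frame(int frame)
{
    struct draw_ctx *c = draw_ctx_get();

    if (c == NULL)
        return -1;
    c->frame = frame;
    return 0;
}

static void *draw_worker(void *arg)
{
    (void)draw_set_frame(*(int *)arg);
    return NULL;
}

/* Run up to 16 short-lived threads; return the total detach count so far. */
int draw_demo(int n)
{
    pthread_t tid[16];
    int frames[16];
    int i, started = 0, count;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++) {
        frames[i] = i + 1;
        if (pthread_create(&tid[started], NULL, draw_worker, &frames[i]) == 0)
            started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_lock(&draw_lock);
    count = draw_detached;
    pthread_mutex_unlock(&draw_lock);
    return count;
}
```

Because callers only ever go through draw_ctx_get() and
draw_set_frame(), a thread that already existed when the library was
loaded is handled the same way as one created afterwards.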

> The dynamic TLS model requires more substantial changes and involves
> RTLD as well. This is the model that requires __tls_get_addr().

I don't believe that this is true.  I think the code examples omit
the case where you have triggered functions to deal with explicit
attach/detach events.  And all such events can be made explicit, and
serialized.  They are rare enough that there should be almost zero
cost relative to doing the same thing with static construction, which
would use a linker set to gather the .tini/.tfin function lists
together for call on thread start/stop.
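
A rough sketch of the linker set idea, under some loud assumptions: an
ELF target and a GNU-compatible toolchain, where the linker defines
__start_<name> and __stop_<name> symbols for any section whose name is
a valid C identifier.  The section names tini_set and tfin_set, the
THR_HOOK macro, and the mod_a/mod_b hooks are all made up for this
example; they stand in for the .tini/.tfin lists and are not an
existing interface.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef void (*thr_hook_fn)(void *thread_ctx);

/* Drop a pointer to fn into the ELF section named by set. */
#define THR_HOOK(set, fn)                                   \
    static thr_hook_fn const set##_ptr_##fn                 \
        __attribute__((section(#set), used)) = fn

/* Section bounds supplied by the linker (assumed toolchain behavior). */
extern thr_hook_fn const __start_tini_set[], __stop_tini_set[];
extern thr_hook_fn const __start_tfin_set[], __stop_tfin_set[];

static pthread_mutex_t thr_hook_lock = PTHREAD_MUTEX_INITIALIZER;

/* Number of thread-start hooks gathered at link time. */
size_t thr_tini_count(void)
{
    return (size_t)(__stop_tini_set - __start_tini_set);
}

/* Thread start: run every gathered start hook, serialized. */
void thr_run_tini(void *thread_ctx)
{
    thr_hook_fn const *p;

    pthread_mutex_lock(&thr_hook_lock);
    for (p = __start_tini_set; p < __stop_tini_set; p++) {
        assert(*p != NULL);
        (*p)(thread_ctx);
    }
    pthread_mutex_unlock(&thr_hook_lock);
}

/* Thread stop: run every gathered stop hook in reverse, serialized. */
void thr_run_tfin(void *thread_ctx)
{
    thr_hook_fn const *p;

    pthread_mutex_lock(&thr_hook_lock);
    for (p = __stop_tfin_set; p > __start_tfin_set; ) {
        --p;
        assert(*p != NULL);
        (*p)(thread_ctx);
    }
    pthread_mutex_unlock(&thr_hook_lock);
}

/* Two hypothetical modules, each contributing a start and a stop hook. */
int thr_attached, thr_detached;

static void mod_a_tini(void *c) { (void)c; thr_attached++; }
static void mod_a_tfin(void *c) { (void)c; thr_detached++; }
static void mod_b_tini(void *c) { (void)c; thr_attached++; }
static void mod_b_tfin(void *c) { (void)c; thr_detached++; }

THR_HOOK(tini_set, mod_a_tini);
THR_HOOK(tfin_set, mod_a_tfin);
THR_HOOK(tini_set, mod_b_tini);
THR_HOOK(tfin_set, mod_b_tfin);
```

The thread library would call thr_run_tini() on thread start and
thr_run_tfin() on thread stop.  The single mutex is the serialization
mentioned above; since these events are rare, it should not matter
much that it is coarse.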

Now some general comments:

Realize that no matter how you approach this, there is going to be
additional runtime overhead to thread creation/termination (join,
exit, etc.) for implicit TLS support, even if you get referencing it
down to 1 instruction.  That penalty is what you pay in exchange for
the ability to use implicit TLS via compiler extension.
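
For reference, a small sketch of where that creation-time work comes
from, assuming a compiler and runtime that already support the
__thread extension.  Each new thread gets its own copy of the
initialized (.tdata) and zeroed (.tbss) variables, so something has to
build that copy every time a thread is created.  The tls_* names are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Initialized: placed in .tdata; every thread starts from the value 1. */
__thread int tls_hits = 1;
/* Uninitialized: placed in .tbss; every thread starts from 0. */
__thread int tls_scratch;

static void *tls_worker(void *arg)
{
    (void)arg;
    /* Sees the pristine image, not the creating thread's changes. */
    return (void *)(intptr_t)(tls_hits * 10 + tls_scratch);
}

/*
 * Modify the caller's copies, then report what a freshly created
 * thread observes.  Returns -1 if the thread could not be created.
 */
int tls_fresh_thread_view(void)
{
    pthread_t tid;
    void *ret = NULL;
    int rc;

    tls_hits = 5;
    tls_scratch = 7;
    if (pthread_create(&tid, NULL, tls_worker, NULL) != 0)
        return -1;
    rc = pthread_join(tid, &ret);
    assert(rc == 0);
    (void)rc;
    return (int)(intptr_t)ret;
}
```

The access itself is cheap; the per-thread copy of the image is the
part that is paid for on every create and torn down on every exit.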

I expect that programs with high performance requirements that use a
lot of threads will tend not to use __thread, in much the same way
that C++ programmers tend not to use RTTI or exceptions -- both
available language features whose costs exceed their benefits.

I believe that thread lifetime is proportional to the desirability of
implicit TLS, and that thread count is inversely proportional.  People
writing code that expects to be used by threaded programs need to be
aware of this.  For example, an OpenGL interface onto a MySQL database
or an LDAP server or a DNS server would likely suck, due to the thread
impedance mismatch between the implementations.  Likewise, thread count
would make OpenGL undesirable technology for implementing a web browser
that didn't explicitly and heavily rely on a worker thread pool and the
HTTP 1.1 persistent connections approach.  Even then, you would still
be screwed by any web servers that sent a response with neither a
"Content-Length:" header nor chunked encoding, since in order to
signal end of data, they would have to close the connection on you,
losing you your persistence.

Just some things to think about before you throw out explicit context
for frame rate improvements that are unusable to any code that isn't
a game or a benchmark... 8-(.

-- Terry