NVIDIA and TLS

Mon Jun 16 18:37:44 PDT 2003

On Mon, 16 Jun 2003, Julian Elischer wrote:
> 
> I'm not making comments about their qualifications in graphics,
> just that it's sad when the threading interface is distorted by graphics
> people.. In effect by insisting on having the TLS values accessible
> at the lowest high-performance parts, they are exerting "un-natural"
> pressure on the development of threads. :-/

The design of the ELF TLS spec had nothing to do with OpenGL or other
graphics people.  When this support was added to the Linux C library
and toolchain, we started using it because it met our requirements for
high performance thread-local storage.  Granted, we worked with the
GNU libc developers to iron out a few issues, but we had nothing to do
with the specification itself.  The ELF TLS spec was designed to meet
the needs of a class of applications, which OpenGL happens to fall
into.  Fast thread-local storage is a good thing in general, beyond
the scope of 3D graphics alone.

> I'm saying that if it were done like this:
> 
> __thread local_context_t *lc;
> medium_level_OpenGL_function() 
> {
> 	int linestodo=1000;
> 	local_context_t *drawing_context;
> 	int i;
> 
> 	drawing_context = lc;
> 	for (i = linestodo; i; i--) {
> 		OpenGL_Low_level_Thingy(drawing_context, arg1, arg2);
> 	}
> }
> 
> then the performance of TLS wouldn't be so crucial;
> It would still be relatively ok (maybe 5 instructions)
> but it wouldn't have to be 1 instruction
> 
> In fact if they were implemented in the following way:
> 
> __inline  OpenGL_Low_level_Thingy(local_context_t *drawing_context,
> arg1, arg2)
> {
> 	__asm "blah blah " /* load args to known regs */
> 	call library entrypoint  /* with args in known regs...*/
> }
> 
> you would have the fastest version of all without
> any requirement for making the TLS so specialised. (purely register
> transfer)

Umm, you clearly don't understand what I've been talking about.
Upon entry into libGL, i.e. when an OpenGL API entrypoint is
called by the application, things like the current context or
current dispatch table are fetched from thread-local storage.
Internally, we pass these pointers around as required.

So, we might have something like this:

	void glBegin(GLenum mode)
	{
		// Grab the current dispatch pointer from TLS
		__GLdispatch *dispatch = GET_CURRENT_DISPATCH();

		// Call into the driver backend
		dispatch->Begin(mode);
	}

or, in x86 assembly:

	glBegin:
		mov %gs:__gl_dispatch@ntpoff, %eax
		jmp *__begin_offset(%eax)

(note that Andy mentioned this example in his original email)

This would jump to a function inside the driver like this:

	void __internal_Begin(GLenum mode)
	{
		__GLcontext *ctx = GET_CURRENT_CONTEXT();

		do_something(ctx, ...);
		do_something_else(ctx, ...);
		// and so on
	}

Once we're inside the driver, we know what the current context
or other thread-local variable's value is.  Two critical
points:

1) We have to fetch the value from TLS at least once per entry
   into the driver.
2) Some of the driver backend functions are very small;
   typically, the more performance-critical a function is,
   the smaller it is.

In general, you want to avoid things like pthread_getspecific()
inside the dispatch layer and your 6-instruction implementation
of glColor4f or glNormal3f (which can be called millions of
times per frame).
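
To make the contrast concrete, here is a minimal sketch in C of the
two ways to fetch a per-thread pointer.  It is an illustration under
stated assumptions, not the driver's code: context_t and the
set_/get_ helper names are hypothetical.  The first path goes through
the POSIX thread-specific data API, which costs a library call per
lookup; the second uses a __thread variable, which a toolchain with
ELF TLS support can typically compile to a single segment-relative
load on x86.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical context type, for illustration only. */
typedef struct { int id; } context_t;

/* Path 1: POSIX thread-specific data.  Every lookup is a call into
 * the threads library. */
static pthread_key_t  ctx_key;
static pthread_once_t ctx_key_once = PTHREAD_ONCE_INIT;

static void make_ctx_key(void)
{
    /* No destructor: the caller owns the context. */
    pthread_key_create(&ctx_key, NULL);
}

static void set_context_slow(context_t *c)
{
    pthread_once(&ctx_key_once, make_ctx_key);
    pthread_setspecific(ctx_key, c);
}

static context_t *get_context_slow(void)
{
    pthread_once(&ctx_key_once, make_ctx_key);
    return pthread_getspecific(ctx_key);
}

/* Path 2: ELF TLS via the __thread storage class (a GCC extension).
 * Each thread gets its own copy of current_ctx. */
static __thread context_t *current_ctx;

static void set_context_fast(context_t *c)
{
    current_ctx = c;
}

static context_t *get_context_fast(void)
{
    return current_ctx;
}
```

Both getters return the same pointer for the calling thread; the
difference is only in how many instructions it takes to get there,
which is what matters when the caller is itself a handful of
instructions long.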

> I have no intention of wanting you to context switch to
> another thread.. I'm just saying that it's a pity you don't just
> go the route that other libraries have and make the time critical
> functions just have the value at hand already.

If the time critical function is a 6-instruction function
at the top-level of the API (that is, called directly from
the dispatch layer), how do you get this value other than
looking it up out of thread-local storage?  Caching that
variable in TLS with a fast access mechanism qualifies as
"keeping the value at hand" in my books.

> No, I think you misunderstand..
> 
> A single thread would always use the same context.
> I'm not saying otherwise..
> I'm just wishing that you would keep the value of its address
> around in a local stack variable a bit more instead of deriving it
> with %gs all the time.

It is, once you've looked it up.  Problem is, if the only
work you do inside the library is copy three floats off the
stack (parameters to the GL API call) into a DMA buffer,
set a bit in a bitmask and return, the time spent accessing
the current context becomes a large percentage of the time
you spend in the library for that call, period.  Understand?

> If the leaf functions on OpenGL were to be implemented with 
> asm interfaces (you said they were hand optimised anyhow)
> and the callers would cache the drawing context pointer in a local
> register, then my ability to give you a TLS pointer in
> 
> 
> getTLS:	lea %eax,%gs(mumble)
> 	movl mumble(%eax), %eax
> 	ret
> 
> would be fast enough as the cost of the extra function call would be
> amortised over many low-level calls.
> (actually it'd probably be faster than what you have now I think)

There is no caller of these functions.  In the fast paths, there
is no function call, once you get inside the library.  That's
the whole point.

	1) Application calls OpenGL function.

	2) OpenGL API dispatch function looks up the dispatch
	   pointer from TLS, and jumps through it.  2 insns.

	3) Driver function looks up the current context out of
	   TLS, copies some data into a buffer, and returns.
	   Maybe 6 insns or so.

	4) Application continues on.

The cost of TLS access becomes significant when the driver is
doing less than a dozen non-TLS-lookup instructions for the
important API calls.
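
The four steps above can be sketched in C roughly as follows.  This
is a sketch under stated assumptions, not the actual libGL source:
gl_context, gl_dispatch, DIRTY_NORMAL, my_glNormal3f and
my_make_current are hypothetical names, and a plain float array
stands in for the DMA buffer.  The point is that each layer does one
TLS load and almost nothing else.

```c
#include <stddef.h>

/* All names here are hypothetical stand-ins for illustration. */
#define DIRTY_NORMAL 0x1u

typedef struct {
    float    dma[1024];  /* stand-in for a DMA command buffer */
    size_t   used;       /* number of floats written so far */
    unsigned dirty;      /* bitmask of state touched since last flush */
} gl_context;

typedef struct {
    void (*Normal3f)(float x, float y, float z);
} gl_dispatch;

/* Per-thread current context and dispatch table, kept in ELF TLS. */
static __thread gl_context        *current_context;
static __thread const gl_dispatch *current_dispatch;

/* Step 3: driver function.  One TLS load, three stores into the
 * buffer, one bit set, return.  A real driver would also handle a
 * full buffer; that is omitted here. */
static void internal_Normal3f(float x, float y, float z)
{
    gl_context *ctx = current_context;
    float *p = &ctx->dma[ctx->used];

    p[0] = x;
    p[1] = y;
    p[2] = z;
    ctx->used  += 3;
    ctx->dirty |= DIRTY_NORMAL;
}

static const gl_dispatch driver_dispatch = { internal_Normal3f };

/* Step 2: API entrypoint.  One TLS load, then an indirect call
 * (a tail jump in hand-written assembly). */
void my_glNormal3f(float x, float y, float z)
{
    current_dispatch->Normal3f(x, y, z);
}

/* Binds a context to the calling thread, as a "make current" call
 * would; done once, outside the fast path. */
void my_make_current(gl_context *ctx)
{
    current_context  = ctx;
    current_dispatch = &driver_dispatch;
}
```

An application thread would call my_make_current() once and then
my_glNormal3f() as often as it likes (steps 1 and 4).  There is no
caller inside the library that could hold the context in a local
variable on behalf of internal_Normal3f().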

-- 
Gareth Hughes (gareth at nvidia.com)
OpenGL Developer, NVIDIA Corporation