Regarding `cmpb $0, %fs:__tls_guard@tpoff`: the per-function-call overhead comes from the requirement that dynamic initialization happen on first use:
> Block variables with static or thread(since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. --- https://en.cppreference.com/w/cpp/language/storage_duration
From https://maskray.me/blog/2021-02-14-all-about-thread-local-st...
> If you know x does not need dynamic initialization, C++20 `constinit` can make it as efficient as the plain old `__thread`. `[[clang::require_constant_initialization]]` can be used with older language standards.
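To make the difference concrete, here is a minimal sketch contrasting a dynamically initialized `thread_local` with a constant-initialized one. The names `make_seed`, `dyn_counter`, `const_counter`, `bump_dynamic` and `bump_constant` are made up for illustration, and the exact code a compiler emits depends on the compiler, flags and TLS model, so treat the comments as the expected tendency rather than a guarantee.

```cpp
// Hypothetical initializer that cannot be evaluated at compile time.
int make_seed() {
    static int calls = 0;
    return 100 + calls++;
}

// Dynamic initialization: accesses typically go through a compiler-generated
// wrapper that checks a guard (such as __tls_guard) before touching the variable.
thread_local int dyn_counter = make_seed();

// Constant initialization. With constinit the compiler rejects the declaration
// if the initializer is not a constant expression, so no guard check is needed.
// The feature-test macro keeps this sketch compiling before C++20.
#if defined(__cpp_constinit)
constinit thread_local int const_counter = 0;
#else
thread_local int const_counter = 0;
#endif

int bump_dynamic() { return ++dyn_counter; }
int bump_constant() { return ++const_counter; }
```

Comparing the assembly of `bump_dynamic` and `bump_constant` (for example with `-O2 -S`) should show whether the guard check is present in each case.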
Regarding `data16 lea tls_obj(%rip),%rdi` in the general-dynamic TLS model, yeah it's for linker optimization. The local-dynamic TLS model doesn't have data16 or rex prefixes.
Regarding "Why don’t we just use the same code as before — the movl instruction — with the dynamic linker substituting the right value for tls_obj@tpoff?"
Because -fpic/-fPIC was designed to support dlopen.
The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case the condition you mention can be guaranteed:
"you would need the TLS areas of all the shared libraries to be allocated contiguously"
With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.
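A toy model of the two addressing schemes may help show where the extra indirection comes from. This is only a sketch: `ToyThread`, `addr_initial_exec`, `addr_general_dynamic` and `toy_dlopen` are invented names, and the layout is deliberately simplified and is not how glibc actually lays out its data structures.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy model only: names and layout are illustrative, not glibc's.
struct ToyThread {
    // One contiguous area for modules known at program start.
    std::vector<unsigned char> static_block;
    // One separately allocated block per module loaded later.
    std::vector<std::unique_ptr<unsigned char[]>> dtv;
};

// Initial-exec style: the offset from the thread pointer is fixed at load
// time, so the address is a single addition.
unsigned char* addr_initial_exec(ToyThread& t, std::size_t tpoff) {
    return t.static_block.data() + tpoff;
}

// General-dynamic style: first find the module's block, then add the offset.
// The table lookup is the extra indirection.
unsigned char* addr_general_dynamic(ToyThread& t, std::size_t module,
                                    std::size_t offset) {
    return t.dtv[module].get() + offset;
}

// Models dlopen: the new module's TLS block lands wherever the allocator
// puts it, not next to static_block. Returns the module's index in dtv.
std::size_t toy_dlopen(ToyThread& t, std::size_t tls_size) {
    t.dtv.push_back(std::make_unique<unsigned char[]>(tls_size));
    return t.dtv.size() - 1;
}
```

In this model the static block cannot grow in place once threads are running, which is why a later-loaded module gets its own block and a lookup step.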
Regarding "... and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could."
`GL_TLS_GENERATION_OFFSET` in glibc is for the lazy TLS allocation scheme. I don't want to spend my valuable time on its implementation...
It is almost infeasible to fix on the glibc side.
> the per-function-call overhead is due to dynamic initialization on first use requirement
Thanks - I didn’t realize this was mandated by the standard as opposed to “permitted” as one possibility (similar to how, e.g., a constructor of a global variable can be called before main, upon first use, or anywhere in between according to the standard). Updated the post with this point.
> The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”
Indeed I didn’t mention -ftls-model=initial-exec originally (I now added it based on reader feedback; it can work when it will work, which for my use case is a toss-up I guess…), but my point is that you could allocate the TLSes contiguously even if dlopen was used, and I describe how you could do it in the post, albeit in a somewhat hand-wavy way. This is totally not how things were done and I presume one reason is that you don’t carve out chunks of the address space for a use case like this as described in my approach - I just think it would be nice if things worked that way.
Actually sounds like it isn't mandated by the standard after all; it's mandated for block thread_locals but not for thread_locals in the global scope:
3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.
This allows the constructor to be called at any point before the first use, similarly to "normal" globals, though implementations made different tradeoffs in these two cases.
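A small sketch of the distinction, with made-up names (`init_block`, `init_global`, `use_block`, `use_global`). The block-scope initializer must not run before control reaches the declaration. For the namespace-scope variable the only guarantee is that it has run by the time of the first odr-use, so the sketch does not assume anything about when exactly it runs before that.

```cpp
int block_inits = 0;   // counts runs of the block-scope initializer
int global_inits = 0;  // counts runs of the namespace-scope initializer

int init_block()  { ++block_inits;  return 1; }
int init_global() { ++global_inits; return 2; }

// Namespace scope: only required to be initialized before its first odr-use.
thread_local int at_namespace_scope = init_global();

int use_block() {
    // Block scope: initialized the first time control passes through here.
    thread_local int at_block_scope = init_block();
    return at_block_scope;
}

int use_global() { return at_namespace_scope; }
```

On a single thread, `block_inits` stays 0 until the first call to `use_block()`, while `global_inits` is 1 after the first call to `use_global()` regardless of whether the implementation initialized the variable eagerly or lazily.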
My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case.
There are multiple reasons including super simple implementation:
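The code that accompanied this comment is not reproduced here. For readers unfamiliar with the encoding, a minimal sketch of unsigned LEB128 looks roughly like this; it is not the commenter's code, the function names are made up, and the decoder assumes well-formed input of at most 10 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the ULEB128 encoding of value to out: 7 payload bits per byte,
// with the high bit set on every byte except the last.
void encode_uleb128(std::uint64_t value, std::vector<std::uint8_t>& out) {
    do {
        std::uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0) byte |= 0x80;  // more bytes follow
        out.push_back(byte);
    } while (value != 0);
}

// Decodes one value starting at pos and advances pos past it.
// Assumes well-formed input; no bounds or overflow checks.
std::uint64_t decode_uleb128(const std::vector<std::uint8_t>& in,
                             std::size_t& pos) {
    std::uint64_t result = 0;
    unsigned shift = 0;
    std::uint8_t byte;
    do {
        byte = in[pos++];
        result |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    return result;
}
```

For example, 624485 encodes to the three bytes `0xE5 0x8E 0x26`, and any value below 128 takes a single byte.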
Seems pretty OK too and doesn’t force a branch per byte. (Imagine a memcpy instead of the first line if the strict-aliasing UB annoys you.) I guess it does do somewhat more work if the majority of the inputs fit in a byte.
Big thanks for the recent performance changes!
The "many small inefficiencies" point resonates – it definitely shows how performance is hurt in many small areas.
(I aim to write blog posts every 2-3 weeks, but this latest one was postponed...
I wrote this in a relatively short time so that the gap would not be too long, and I really should take time to refine the post.)