
TLS performance overhead and cost on GNU/Linux - ingve
http://david-grs.github.io/tls_performance_overhead_cost_linux/
======
rmk
Title should be renamed to start with 'Thread Local Storage...', to make it
clear it's not Transport Layer Security...

------
_ikke_
TLS here is referring to Thread Local Storage, not Transport Layer Security.

~~~
atonse
Thanks. Even though I was reading about thread local storage, I kept reading
to see how this sped up TLS sockets :)

------
koverstreet
Really want to know who in the hell thought it was necessary to do lazy
allocation - we don't do lazy allocation for .bss, why would it be necessary
for .tbss? Especially since kernel is already doing lazy allocation (when
pages are first touched)?

AFAICT there's no good reason why TLS access couldn't be just a normal
segmented access and nothing more, dynamic loading or no... same as it is in
the kernel.

Edit: And it's not like these are big in practice, at least in anything I have
installed:

    
    
      find /lib /usr/lib -type f|grep '\.so$'|xargs -n1 size -A|&grep -E '^.(tdata|tbss)'|awk '{print $2}'
    

says that the absolute biggest is 152 bytes, and most are a lot smaller.

~~~
dfox
The only sane implementation of TLS without additional indirection in the
presence of dynamic linking would involve per-thread page tables, which on x86
at least mostly negates any performance benefit of threads over processes.
Kernel code does not have this problem, since kernel modules are not demand
paged and there is nothing they can be shared with, so the kernel can do
whatever fixups it wishes.

x86-style segmentation has essentially nothing to do with TLS, apart from the
fact that it can be abused to implement it without having to reserve an
otherwise useful general-purpose register (reserving a register is how it is
implemented on typical RISC platforms).

TLS areas are in general very small, because very often the only thing stored
there is a pointer to some internally managed, heap-allocated per-thread
struct that is then accessed directly. The reasons for this are twofold: it is
faster in tight loops, and there are platforms that do not support
linker-visible TLS.

~~~
koverstreet
I think you're confusing a couple different issues...

At the core, percpu variables/sections in the kernel are done exactly the same
way as the TLS is implemented in userspace, except with gs instead of fs.

The thing that I'm complaining about is that on top of that, glibc is doing
lazy allocation for no apparent (sane) reason.

------
defined
Please expand "TLS" to "Thread-local Storage" in the misleading title.

~~~
defined
Why a downvote for this? I don't understand. Others have pointed this out,
too.

------
BrandonBradley
Acronym soup

