
Regarding `cmpb $0, %fs:__tls_guard@tpoff`: the per-function-call overhead is due to the dynamic-initialization-on-first-use requirement:

> Block variables with static or thread (since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. --- https://en.cppreference.com/w/cpp/language/storage_duration

From https://maskray.me/blog/2021-02-14-all-about-thread-local-st...

> If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old `__thread`. [[clang::require_constant_initialization]] can be used with older language standards.
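
To make the guard concrete, here is a minimal C++20 sketch (`compute` is a made-up, non-constant initializer; only `t` pays the per-call check):

    int compute();                      // assumed dynamic initializer
    thread_local int t = compute();     // dynamic init: accesses are guarded
    constinit thread_local int u = 42;  // constant init: as cheap as __thread
    int get_t() { return t; }           // emits the __tls_guard check
    int get_u() { return u; }           // a single %fs-relative load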

Regarding `data16 lea tls_obj(%rip),%rdi` in the general-dynamic TLS model: yeah, the padding prefixes are there for linker optimization (TLS relaxation), so the sequence has a fixed length the linker can rewrite. The local-dynamic TLS model doesn't have data16 or rex prefixes.

Regarding "Why don’t we just use the same code as before — the movl instruction — with the dynamic linker substituting the right value for tls_obj@tpoff?"

Because -fpic/-fPIC was designed to support dlopen. The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that "you would need the TLS areas of all the shared libraries to be allocated contiguously:"

    # x86-64 initial-exec sequence
    movq ref@GOTTPOFF(%rip), %rax   # load the TP-relative offset from the GOT
    movl %fs:(%rax), %eax           # load the variable via the thread pointer
With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.
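
A rough sketch of that extra indirection, assuming glibc-style per-module TLS blocks (names and layout simplified; the real logic lives in the dynamic loader):

    #include <cstddef>

    // Each thread keeps a table (the "DTV") mapping module IDs to that
    // thread's TLS block; dlopen appends entries for new libraries.
    struct tls_index { std::size_t module, offset; };
    thread_local void *dtv[64];  // fixed size here; grown dynamically in reality

    // General-dynamic code calls the moral equivalent of this helper
    // instead of performing a single fixed %fs-relative load.
    void *tls_get_addr_sketch(tls_index ti) {
      return (char *)dtv[ti.module] + ti.offset;
    }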

Regarding "... and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could."

`GL_TLS_GENERATION_OFFSET` in glibc is for the lazy TLS allocation scheme. I don't want to spend my valuable time on its implementation... It is almost infeasible to fix on the glibc side.


> the per-function-call overhead is due to dynamic initialization on first use requirement

Thanks - I didn’t realize this was mandated by the standard, as opposed to “permitted” as one possibility (similarly to how, e.g., a constructor of a global variable can be called before main, upon first use, or anywhere in between, according to the standard). I’ve updated the post with this point.

> The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”

Indeed, I didn’t mention -ftls-model=initial-exec originally (I’ve now added it based on reader feedback; it works when it works, which for my use case is a toss-up, I guess…), but my point is that you could allocate the TLS areas contiguously even if dlopen were used, and I describe how in the post, albeit in a somewhat hand-wavy way (see the sketch below). This is totally not how things were done, and I presume one reason is that you don’t carve out chunks of the address space for a use case like this, as my approach requires - I just think it would be nice if things worked that way.
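
Here is a minimal sketch of that reservation idea (names and the 4 GiB size are made up, and per-thread bookkeeping is glossed over): reserve a big contiguous region up front, then commit pieces as libraries are dlopen’ed, so fixed thread-pointer-relative offsets stay valid.

    #include <cassert>
    #include <cstddef>
    #include <sys/mman.h>

    static char *tls_base, *tls_cur;

    // At program start: reserve address space without committing memory.
    void tls_reserve() {
      tls_base = tls_cur = (char *)mmap(nullptr, 1ull << 32, PROT_NONE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      assert(tls_base != (char *)MAP_FAILED);
    }

    // On dlopen: commit the next chunk (n assumed page-aligned here).
    char *tls_carve(std::size_t n) {
      char *p = tls_cur;
      mprotect(p, n, PROT_READ | PROT_WRITE);
      tls_cur += n;
      return p;
    }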


Actually, it sounds like it isn’t mandated by the standard after all; it’s mandated for block-scope thread_locals but not for thread_locals at global scope:

> 3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.

This allows the constructor to be called at any point before the first use, similarly to “normal” globals, though implementations made different tradeoffs in these two cases.


I have placed a lot of focus on code navigation. Here is what I mentioned in my post:

  nmap('J', '<cmd>Telescope lsp_definitions<cr>', 'Definitions')
  nmap('<M-,>', '<cmd>Telescope lsp_references<CR>', 'References')

  nmap('H', '<cmd>pop<cr>', 'Tag stack backward')
  nmap('L', '<cmd>tag<cr>', 'Tag stack forward')

  nmap('xn', function() M.lsp.words.jump(vim.v.count1) end, 'Next reference')
  nmap('xp', function() M.lsp.words.jump(-vim.v.count1) end, 'Prev reference')


The scheme proposed in this blog post is also called PrefixVarInt.

Signed integers can be represented with either zigzag encoding or sign extension. For the most common one-byte encoding, zigzag encoding is a worse scheme. https://maskray.me/blog/2024-03-09-a-compact-relocation-form...
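
For reference, a quick sketch of the standard zigzag mapping (not code from the post):

    #include <cstdint>

    // Zigzag interleaves signed values: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    uint64_t zigzag(int64_t v)    { return ((uint64_t)v << 1) ^ (uint64_t)(v >> 63); }
    int64_t  unzigzag(uint64_t u) { return (int64_t)(u >> 1) ^ -(int64_t)(u & 1); }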

My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case. There are multiple reasons, including its super simple implementation:

    static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
      uint64_t acc = 0, shift = 0, byte;
      do {
        byte = *(*buf)++;
        // ULEB128 (sleb_uleb==128): strip the continuation bit.
        // SLEB128 (sleb_uleb==64): on the final byte, a set sign bit makes
        // `byte - 128` wrap around, sign-extending the result.
        acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
        shift += 7;
      } while (byte >= 128);
      return acc;
    }
    
    uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
    int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }
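
The matching encoders are just as short (a sketch following the standard LEB128 definitions, not code from the post):

    static void write_uleb128(unsigned char **buf, uint64_t v) {
      while (v >= 128) {
        *(*buf)++ = 0x80 | (v & 0x7f);  // set the continuation bit
        v >>= 7;
      }
      *(*buf)++ = v;
    }

    static void write_sleb128(unsigned char **buf, int64_t v) {
      for (;;) {
        unsigned char b = v & 0x7f;
        v >>= 7;  // arithmetic shift preserves the sign
        // Stop once the remaining bits are pure sign extension of bit 6.
        if ((v == 0 && !(b & 0x40)) || (v == -1 && (b & 0x40))) {
          *(*buf)++ = b;
          return;
        }
        *(*buf)++ = b | 0x80;
      }
    }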


  static int64_t read_sprefix(unsigned char **buf) {
    uint64_t x = *(uint64_t *)*buf;
    unsigned n = stdc_trailing_zeros(x) + 1; /* C23 <stdbit.h> */
    assert(n <= 8); /* handles values up to 2**56 - 1 */
    *buf += n;
    return (int64_t)(x << (64 - 8*n)) >> (64 - 7*n);
  }
Seems pretty OK too and doesn’t force a branch per byte. (Imagine a memcpy instead of the first line if the strict-aliasing UB annoys you.) I guess it does do somewhat more work if the majority of the inputs fit in a byte.
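
A matching encoder sketch for this prefix scheme (my construction, not from the thread; assumes a little-endian target and at least 8 writable bytes at *buf):

  #include <cassert>
  #include <cstdint>
  #include <cstring>

  static void write_sprefix(unsigned char **buf, int64_t v) {
    assert((v >> 55) == 0 || (v >> 55) == -1); /* must fit in 7*8 bits */
    // Pick the smallest n in 1..8 such that v fits in 7n bits, signed.
    unsigned n = 1;
    while (n < 8 && (v >> (7*n - 1)) != 0 && (v >> (7*n - 1)) != -1)
      n++;
    // The low byte gets n-1 trailing zeros, then a 1; payload sits above.
    uint64_t enc = ((uint64_t)v << n) | (1ull << (n - 1));
    memcpy(*buf, &enc, 8);  /* writes 8 bytes; only the first n matter */
    *buf += n;
  }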


Here is the glibc feature request: https://sourceware.org/bugzilla/show_bug.cgi?id=31959 ("Feature request: special static-pie capable of loading the interpreter from a relative path")


I use rr almost every day, along with a gdb frontend: cgdb.

    rr record /tmp/Debug/bin/llvm-mc a.s && rr replay -d cgdb

I've had success stories with some bugs that were only reproducible with LTO. Without rr, they would have been a significant challenge.

It would be nice if the Linux kernel could be debugged with rr. Has anyone had success with a kernel under rr+qemu? :)


What's the benefit of using cgdb when you can use gdb's `layout src`?


tl;dr

- Mapping symbols describe data in code and instruction set transitions (e.g. A32<=>T32).

- A pending LLVM integrated assembler patch will eliminate almost all mapping symbols without breaking disassemblers: https://github.com/llvm/llvm-project/pull/99718

- The RISC-V mapping symbol extension `$x<ISA>` (not yet implemented) would raise questions where relocatable files are more heterogeneous.

- Mach-O LC_DATA_IN_CODE describes ranges.

- Compressed .strtab and .symtab might be beneficial.


Big thanks for the recent performance changes! The "many small inefficiencies" point resonates – it definitely shows how performance is hurt in many small areas.

(I aim to write blog posts every 2-3 weeks, but this latest one was postponed... I wrote it in a relatively short time so that the gap would not be too long, and I really should take the time to refine it.)


Thanks!


https://mirrors.edge.kernel.org/pub/tools/llvm/ provides a PGO-optimized LLVM toolchain. It is likely much faster than a distro-provided Clang.

You might also want to replace the default malloc with mimalloc/snmalloc, which might yield a ~10% performance boost.


oh this is nice! Thanks!


Thx. Fixed

