
Caches are everywhere, and systems now have five or more layers of different memories, from L1 cache down to network-attached storage, plus accelerator memories such as a GPU's.

I would like a language designed to target a hierarchical-memory system: one that forces me to write single-threaded batched (or "blocked") algorithms with few cache misses.
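
As a concrete example of the kind of code such a language would enforce, here is a hand-blocked loop nest, sketched in Zig since the thread turns to it below. The matrix size and the 64-element tile are assumptions to be tuned against the target's L1 cache, and N is assumed to be a multiple of the tile:

    const N = 512; // hypothetical square matrix dimension
    const TILE = 64; // assumed tile size; tune to your L1 cache

    // c += a * b, processed tile by tile so each tile of a, b and c
    // stays resident in cache while it is being reused.
    fn blockedMatmul(c: *[N][N]f32, a: *const [N][N]f32, b: *const [N][N]f32) void {
        var ii: usize = 0;
        while (ii < N) : (ii += TILE) {
            var kk: usize = 0;
            while (kk < N) : (kk += TILE) {
                var jj: usize = 0;
                while (jj < N) : (jj += TILE) {
                    var i = ii;
                    while (i < ii + TILE) : (i += 1) {
                        var k = kk;
                        while (k < kk + TILE) : (k += 1) {
                            const aik = a[i][k];
                            var j = jj;
                            while (j < jj + TILE) : (j += 1) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }
            }
        }
    }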

How many pieces of code are out there where an easy 4x speedup could be achieved today if they had been written with batched operations from the start? (It also shows how limited compilers are at autovectorization.)

Rust gives me guarantees regarding memory safety; I would like a language that gives me guarantees that SIMD instructions and the first two cache levels are used correctly, without my having to read the source code and compiler output.
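
For the SIMD half of that wish, explicit vector types already express it in the source instead of hoping the autovectorizer fires. A minimal sketch in 0.7-era Zig (the std.meta.Vector spelling of that era; the 8-lane width is my assumption, matching 256-bit AVX f32 lanes):

    const std = @import("std");

    const V = std.meta.Vector(8, f32); // 8 x f32 = one 256-bit register (assumed target)

    // Batched a[i] = a[i] * s + b[i], expressed on whole vectors: the
    // arithmetic below is SIMD by construction, not by autovectorizer luck.
    fn scaleAdd(a: []V, b: []const V, s: f32) void {
        const sv = @splat(8, s);
        for (a) |*av, i| {
            av.* = av.* * sv + b[i];
        }
    }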



Having worked on hash table implementations in C, and having done everything to minimize cache misses, e.g. using tiny 8-bit bloom filters within a cache line to avoid further cache line probes, I now prefer Zig to C, because I believe it makes memory alignment far more explicit in the type system.
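
To make the trick concrete, here is a hypothetical bucket layout in 0.7-era Zig (the field names and sizes are mine, not from any particular implementation): the one-byte filter is consulted inside the cache line we have already loaded, and the overflow chain is only touched on a "maybe".

    // Hypothetical 64-byte bucket: everything needed to reject a lookup
    // lives in the single cache line we have already paid to load.
    const Bucket = extern struct {
        filter: u8, // tiny 8-bit bloom filter over keys pushed to overflow lines
        len: u8, // number of occupied inline slots
        tags: [6]u8, // 1-byte fingerprints for the inline keys
        keys: [6]u64, // the inline keys themselves
        overflow: ?*Bucket, // followed only when the filter says "maybe"
    };

    // Map a hash to one of the 8 filter bits (0.7-era @truncate signature).
    fn filterBit(hash: u64) u8 {
        return @as(u8, 1) << @truncate(u3, hash >> 56);
    }

    // false means "definitely absent": the overflow line is never probed,
    // so a miss costs zero extra cache line loads.
    fn mayOverflow(b: *const Bucket, hash: u64) bool {
        return (b.filter & filterBit(hash)) != 0;
    }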

You can even align the stack memory for a function, and this is all upfront in the documentation. You don't need arcane compiler-specific pragmas; Zig just makes it easy. Zig's alignment options are so much more powerful, neat, and accessible than C's, right down to allocations with custom alignments, all first class at the language level. Compare that with C's malloc() and posix_memalign(). Implementing a direct I/O system in Zig recently was also a breeze.
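
A sketch of those two features together (0.7-era Zig, where allocators are passed as *std.mem.Allocator; the 512- and 4096-byte alignments are the usual direct I/O requirements, assumed here):

    const std = @import("std");

    // A stack buffer with explicit alignment, straight in the type:
    // no pragmas, no posix_memalign, just align(...) on the declaration.
    fn zeroSector() void {
        var sector: [512]u8 align(512) = undefined;
        std.mem.set(u8, &sector, 0);
        // ... hand `sector` to a direct I/O write here ...
    }

    // Heap allocation with a custom alignment, first class in the allocator
    // API; the alignment is carried in the returned slice's type.
    fn allocDioBuffer(allocator: *std.mem.Allocator, len: usize) ![]align(4096) u8 {
        return allocator.alignedAlloc(u8, 4096, len);
    }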

I also appreciate Zig's approach to memory management, where even the choice of allocator is considered important, and for things like async/await, Zig's explicitness around memory requirements is brilliant. Zig's @frameSize builtin (https://ziglang.org/documentation/0.7.1/#frameSize) will tell you exactly how much memory a whole chain of functions will need to run async, including their stack variables. You can even choose where you want their async frames to be stored: global, heap or stack.
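
A minimal sketch of that workflow, using the 0.7-era async primitives the link above documents (the fetch function and its suspend point are made up for illustration):

    const std = @import("std");

    fn fetch(url: []const u8) void {
        _ = url;
        // Pretend we park here waiting on a socket; `suspend` is what
        // makes fetch an async function with a frame.
        suspend;
    }

    pub fn main() !void {
        // The exact memory needed to run fetch as an async frame,
        // stack variables included, known at compile time:
        std.debug.print("fetch frame: {} bytes\n", .{@frameSize(fetch)});

        // Choose where the frame lives - here the heap, but a global or a
        // stack variable of type @Frame(fetch) works the same way.
        const allocator = std.heap.page_allocator;
        var frame = try allocator.create(@Frame(fetch));
        defer allocator.destroy(frame);

        frame.* = async fetch("https://example.com");
        resume frame; // run it from the suspend point to completion
    }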

Again and again, Zig's design decisions have just been spot on. Huge kudos to Andy Kelley and the Zig communities.


  > How many pieces of code are out there where an easy 4x speedup could be achieved
  > today if they had been written with batched operations from the start? (It also
  > shows how limited compilers are at autovectorization.)
Even before auto-vectorization, I'd love a working automated "loop unrolling with interleaving" that handles large functions. There is a pragma for this in Clang, but when I tried it in clang-9 it didn't work; I'll have to try again, as v9 is a bit old now. Once this is well supported, it will make it easy to "fill the pipe" on multiple-issue cores when doing batched operations, without having to manually unroll the loops as is done in VPP, for example: https://gerrit.fd.io/r/gitweb?p=vpp.git;a=blob;f=src/vnet/ip...

Manual unrolling works, but getting the same effect with a simple pragma on top of the loop looks so much more attractive ;)
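
For contrast, the manual version being replaced looks roughly like this (a sketch of VPP-style quad unrolling, written here in Zig rather than VPP's C; the function and names are illustrative):

    // Four independent accumulators break the dependency chain so a
    // multiple-issue core can keep several loads and adds in flight;
    // the scalar tail loop mops up the remainder.
    fn sumUnrolled(xs: []const u64) u64 {
        var s0: u64 = 0;
        var s1: u64 = 0;
        var s2: u64 = 0;
        var s3: u64 = 0;
        var i: usize = 0;
        while (i + 4 <= xs.len) : (i += 4) {
            s0 +%= xs[i];
            s1 +%= xs[i + 1];
            s2 +%= xs[i + 2];
            s3 +%= xs[i + 3];
        }
        while (i < xs.len) : (i += 1) {
            s0 +%= xs[i];
        }
        return s0 +% s1 +% s2 +% s3;
    }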



