Towards fearless SIMD, 7 years later (linebender.org)
176 points by raphlinus 3 days ago | 173 comments





I’ve said it before and I’ll say it again: Rust feels like a Python developer’s idea of a high-performance computing language. It’s a great language for many kinds of applications — just not when you need to squeeze out every bit of performance from advanced hardware.

Even before getting into SIMD, try using Rust for concurrent, succinct, or external-memory data structures. It quickly becomes clear where the friction is.

Cargo is fantastic — clean, ergonomic, and a joy compared to many toolchains. But it’s much easier to keep things simple when you don’t have to support dozens of AVX-512 variants, AMX, SME, different CUDA generations, ROCm, or any of the other modern hardware capabilities.

Standardising SIMD in the standard library — in Rust or C++ — has always been a questionable idea. Most of these APIs cater to operations that compilers already auto-vectorize reasonably well, and they barely touch the recent capabilities of SIMD. Just consider how hard it is to build any meaningful abstraction over the predicate/register models across AVX-512, SVE, and RVV.

RVV aside, this should illustrate the point: https://www.modular.com/blog/understanding-simd-infinite-com...


I've written a fair bit of SIMD code in Rust, and it definitely had lots of sore spots.

The main advantage was that, because Rust doesn't use TBAA, it's completely legal (and safe, if you use bytemuck) to freely cast pointers and values around. TBAA in C++ makes it much easier to hit undefined behavior.
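As a concrete sketch (not from my actual code, and assuming the bytemuck crate), this kind of reinterpretation is checked and safe in Rust, where the equivalent reinterpret_cast in C++ would run into strict aliasing:

    // Hypothetical example: view a &[u8] as &[u32] without unsafe.
    // bytemuck::cast_slice panics if the length or alignment doesn't fit.
    fn sum_words(bytes: &[u8]) -> u32 {
        let words: &[u32] = bytemuck::cast_slice(bytes);
        words.iter().copied().sum()
    }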

But also, because of various miscompilations, Rust refuses (or at least refused) to pass SIMD arguments in registers, so every non-inlined function call passes arguments via the stack. There were also miscompilations if you enabled a target_feature for just one function, so we ended up just passing `-C target-cpu=...` globally, and if we wanted to support a different microarchitecture, we just recompiled the whole program. On top of that, there's no good way to check to see what microarchitecture you're compiling for, so we had to resort to specifying the target cpu in multiple places, with comments reminding us to keep the places in sync.
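For context, the pattern we ended up avoiding (per-function target_feature plus runtime detection) looks roughly like this; a sketch with made-up helper names, not our actual code:

    // Hypothetical sketch of per-function target_feature + runtime dispatch.
    #[cfg(target_arch = "x86_64")]
    mod avx2_impl {
        use std::arch::x86_64::*;

        // Only sound to call after confirming AVX2 is available at runtime.
        #[target_feature(enable = "avx2")]
        pub unsafe fn add_bytes(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
            let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
            let mut out = [0u8; 32];
            _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi8(va, vb));
            out
        }
    }

    fn add_bytes(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // SAFETY: we just verified the CPU supports AVX2.
                return unsafe { avx2_impl::add_bytes(a, b) };
            }
        }
        // Scalar fallback.
        core::array::from_fn(|i| a[i].wrapping_add(b[i]))
    }

The global `-C target-cpu=...` approach sidesteps both the ABI issue and the per-function miscompilations, at the cost of one rebuild per microarchitecture.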


I don't think Rust is particularly problematic here. As long as you don't want to do funky things like use immutable argument memory as temporary scratch space (with you restoring the values afterwards of course), all it means is some `unsafe`ing at worst, compared to C/C++. And there are some safe abstractions you can make over loads/stores (everything else being safe, even if not yet marked as such).

Do agree that a standard SIMD type is rather pointless, if not immediately, then in like 5 years. (and, seemingly, both Rust and C++ are like over 10 years behind on SIMD, so they're already out-of-date)

Maybe somewhat useful if you just want the simple ~8x speedup, and not squeeze out the last 1.4x or whatever, but autovectorization should be capable of covering a significant amount of such.


… except for byte-level processing, variable-length codecs, or mixed-precision numerics. That never works with autovectorization and can’t be solved with general-purpose SIMD wrappers. For me the solution was to implement those manually, and even at a scale of just 2 libraries I’ve ended up with somewhat different project layouts & dispatch mechanisms: https://github.com/ashvardanian/SimSIMD , https://github.com/ashvardanian/StringZilla

One big family not covered there is sparse data structures and related algorithms. I’ve only started integrating scatter/gather in AVX-512 and SVE, and on synthetic benchmarks both look promising: https://github.com/ashvardanian/less_slow.cpp/releases/tag/v...

Those should probably unlock a much wider set of applications for SIMD, but designing libraries for those may benefit from yet another project structure.


You started with this... take:

> Rust feels like a Python developer’s idea of a high-performance computing language. It’s a great language for many kinds of applications — just not when you need to squeeze out every bit of performance from advanced hardware.

And went on to say that Rust in particular is problematic for:

> byte-level processing

It's particularly odd for you to say this given that the memchr Rust crate is just as fast as stringzilla for substring search. And is generally faster in cases where the needle is invariant, because stringzilla doesn't have APIs for amortizing searcher construction.
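For readers unfamiliar with what I mean by amortizing searcher construction, here's a rough sketch using the memchr crate (the surrounding function is made up for illustration):

    use memchr::memmem;

    // Build the searcher once, reuse it across many haystacks.
    fn count_matches(needle: &[u8], haystacks: &[&[u8]]) -> usize {
        let finder = memmem::Finder::new(needle); // construction cost paid once
        haystacks.iter().filter(|h| finder.find(h).is_some()).count()
    }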

We've had a discussion about this before where I provided receipts[1] and we have not had a meeting of the minds on this point. The thing I'm trying to achieve here is to point out that your claims are contested and there is evidence that you're wrong. And so I'd caution readers to also in turn question your higher level claims about Rust being a "Python developer's idea of a high-performance computing language."

[1]: https://old.reddit.com/r/rust/comments/1ayngf6/memchr_vs_str...


Hey! I just mean that there is a very large category of developers, generally coming from the Python world, expecting that switching to Rust is supposed to solve every performance-oriented problem, providing a State-of-the-Art solution magically :)

Sadly, programming doesn't seem to work that way. There are always tradeoffs. Rust does some things very well, but it wouldn't be my first choice for others.

MemChr is a lovely package, and there are a few other really cool SIMD projects in the Rust ecosystem. Regardless, my development velocity for HPC-related projects is higher in C/C++, and I get much more flexibility to leverage newer hardware features, like AMX and SME.


Don't worry his comment is obvious clickbait (complete with a plug for a random blogpost written by him). People that matter (people that are actually writing simd for their day job) can immediately spot a poseur (especially relative to you and Raph).

To be clear, I don't support this. StringZilla is a real and useful project, and its performance is competitive with memchr. They aren't a poser.

(There are reasons to use StringZilla over memchr beyond performance. StringZilla provides a number of interesting string operations beyond just substring search. The memchr crate is far more specialized.)


Thanks, Andrew! Love your work too - keeps me on my toes ;) PS: Hoping to share a few more Rusty bits later this year.

Ah yeah, gather/scatter are indeed a rather problematic thing for autovectorization. That said, with no-alias info (which Rust has a lot of) it's possible: https://rust.godbolt.org/z/zTfo9nxhd.

Unfortunately it doesn't get autovectorized without the unsafes, but theoretically it should be possible-ish for bounds checking to be autovectorized (the most problematic aspect being that it might be anywhere from annoying to impossible to ensure that, in the case of multiple panic/UB sources, the proper one happens first).
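Roughly the shape of loop that godbolt link is about (my reconstruction, not the exact code): an indexed gather where removing the bounds checks via get_unchecked is what lets LLVM vectorize the loads.

    /// Hypothetical sketch. SAFETY contract: every element of `idx` must be
    /// in bounds for `src`; with the bounds checks gone, LLVM is free to
    /// emit vector gathers for the loads.
    pub unsafe fn gather(dst: &mut [f32], src: &[f32], idx: &[u32]) {
        assert_eq!(dst.len(), idx.len());
        for (d, &i) in dst.iter_mut().zip(idx) {
            *d = unsafe { *src.get_unchecked(i as usize) };
        }
    }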

I'd imagine in any non-trivial situation you'd want a custom layer of abstractions over whatever the language provides for all languages. For that a portable-simd thing is actually a rather good base, on which you could add custom arch-specific abstractions/ops as desired.

Not sure what's problematic with mixed-precision (I know SVE is rather weird for mixed-width elements, but that's about it?), though I primarily don't care about float stuff generally. Also no clue what's problematic with byte-level stuff.

Indeed there are still a bunch of things that you want proper manual SIMD for (hell the SIMDful project I work on has an entire DSL for doing nice SIMD), but autovectorization still covers a good amount.


> except for byte-level processing, variable-length codecs, or mixed-precision numerics. That never works with autovectorization and can’t be solved with general-purpose SIMD wrappers.

Counterexamples: Chromium's byte-level HTML scanning, several var-len bit packing codecs, and Gemma.cpp's matmul is mixed-precision (fp8->bf16->fp32->fp64). All written with the Highway general-purpose SIMD wrapper. Please revise your post or expand upon the structure/dispatch concern.


Interesting references! I remember that Gemma.cpp used Highway, but I haven't checked the others much.

Here is a puzzle, then. Let's say we are checking a single register of bytes for element-wise equality with another register of the same size. In AVX2, the output is another YMM register of 0xFF or 0x00 values. In AVX-512, for full ZMM-wide comparisons, it's a 64-bit mask in the K register.

I struggle to see a good way to abstract such things, even for two consecutive SIMD generations on x86.
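In Rust intrinsic terms, the two shapes in question look like this (just to illustrate on x86_64; the AVX-512 intrinsics need a fairly recent toolchain):

    use std::arch::x86_64::*;

    #[target_feature(enable = "avx2")]
    unsafe fn eq_avx2(a: __m256i, b: __m256i) -> __m256i {
        _mm256_cmpeq_epi8(a, b) // 32 lanes of 0xFF / 0x00
    }

    #[target_feature(enable = "avx512bw")]
    unsafe fn eq_avx512(a: __m512i, b: __m512i) -> __mmask64 {
        _mm512_cmpeq_epi8_mask(a, b) // one bit per lane, in a k register
    }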


A simple enough abstraction is to have the mask type have inputs of both element size and count; so for a ≤256-bit product of those you do the homogeneous-bit elements, and for 512-bit you do the packed-bit values. Conversion methods on the mask can convert those to either an explicit full vector, or a packed bit integer, as you need. (and clang can optimize out unnecessary conversions between the two)
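As a rough Rust-flavoured sketch of that idea (names are mine, not from any existing crate): the mask stays opaque, and only the conversions pin down a representation.

    pub trait SimdMask: Copy {
        type Vector; // "all bits set per lane" form (the AVX2-style result)
        type Bits;   // packed-bit form (the AVX-512 k-register-style result)

        fn to_vector(self) -> Self::Vector; // no-op where lanes are the native form
        fn to_bits(self) -> Self::Bits;     // no-op where packed bits are native
        fn count_true(self) -> usize;
        fn all_false(self) -> bool;
    }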

Yes indeed, this is what we do :) There is an opaque Mask type for which operations such as CountTrue, AllFalse etc are provided. If you really want the one or other representation, VecFromMask and BitsFromMask/StoreMaskBits convert as required. The former is a no-op on AVX2.

As to the AVX2 comparison, for ICL I think we can do two of those per cycle, so no more than the throughput of an AVX-512 mask comparison. The question is what we do with it afterwards - for VQSort (one of the few applications where comparisons are really the bottleneck?) we are much happier to have the packed bits, because that can feed into vpcompress or a LUT implementing that.


Sure, that would make sense, but in AVX-512, there is also a comparison variant for 2x256-bit inputs that outputs a 32-bit mask and another one for 2x128 inputs and a 16-bit output mask. I would use those in different ways, occasionally mixing with the old AVX2 variant, depending on what I’m doing. I have tried to create a generalizable SIMD framework several times in the past, mostly in 2015-17, and still don’t have a good abstraction even between AVX2 and AVX-512.

You'd just produce the ≤AVX2-style result; clang will switch to using the mask-returning comparison instrs if the usage ends up being a mask (gcc doesn't do such fancy things, so I guess getting the last bits of speedup from this strategy depends on how much control over compiler selection you choose to have).

And if you really care about doing different things for ≤AVX2 vs AVX-512 masks, you're definitely in the "squeeze out the last 1.4x" camp and not the "simple ~8x speedup" one. And, realistically, you won't care about this on every single comparison, so where you really want to explicitly use mask-returning comparisons you could just switch to using intrinsics directly (or a different abstraction over them) temporarily.

As a side-note, mask-returning comparisons have 2x less throughput than homogeneous-bit-returning ones for xmm/ymm as far as uops.info data goes[1] (and use port 5!), so this strategy is kinda just what you want really.

(another strategy can be to just return the arch-specific result type, still preferring the ≤AVX2 style (with, if desired, a separate set of comparison ops for the AVX-512-mask output), and making mask-consuming ops polymorphic over the two)

[1]: https://uops.info/table.html?search=cmpeqb%20ymm%29&cb_lat=o...


> Do agree that a standard SIMD type is rather pointless, if not immediately, then in like 5 years. (and, seemingly, both Rust and C++ are like over 10 years behind on SIMD, so they're already out-of-date)

What language would you consider to have cutting-edge SIMD support?

I've dipped my toes into SIMD with Rust, [1] on stable with platform-specific intrinsics (SSE2, AVX2, NEON). I would have liked to use stable `std::simd`. I learned that (particularly on AVX2) getting things into the right lanes efficiently is a pain. I would have liked to just use `simd_swizzle!` for that part, and mix that with intrinsics calls. My approach of writing a small C++ or unstable Rust program that does the swizzling and then copying the intrinsics operations it chose into my program's "source" code worked, but I prefer to not have a manual copy'n'paste step between compilation and assembly.

If there's something much better out there in another language, well, I'd be very interested to see it.

[1] I wrote this: https://github.com/infiniteathlete/pixelfmt/blob/main/docs/s...
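For reference, what I'd have liked to write with nightly std::simd looks roughly like this (a sketch; portable_simd is unstable and the API may shift):

    #![feature(portable_simd)]
    use std::simd::{simd_swizzle, Simd};

    // Interleave the low halves of two byte vectors (punpcklbw-style);
    // the indices are compile-time constants into the concatenation of a and b.
    fn interleave_low(a: Simd<u8, 16>, b: Simd<u8, 16>) -> Simd<u8, 16> {
        simd_swizzle!(a, b, [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23])
    }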


Don't think any language has standardized SIMD that's particularly nice; Highway is probably a quite nice library on C++, though I haven't used it enough to get comfortable.

The thing I use for my projects is Singeli[1], a DSL specifically made for SIMD stuff (though it's capable of generally sanely doing abstractions over types/operations/loops; it's just a fancy code generator). Obligatory disclaimer that I'm one of the two people working on its design. It's far from a nice experience starting from nothing, but it's pretty nice for what I do.

CBQN's the main place it's used, can click around its source: https://github.com/dzaima/CBQN/tree/develop/src/singeli/src

Its goal isn't necessarily to unify architectures, but rather make it as easy as possible to make abstractions that do; as such its built-in includes for x86 don't have arbitrary shuffling, but do provide a sane interface over the cases that are supported in a single instruction (not including constant creation/loading), and those can run on NEON unchanged (assuming they're ran on 128-bit vectors, of course, as NEON doesn't support larger ones); and, with Singeli just generating C/C++ currently, you can just map in __builtin_shufflevector if desired. e.g. here's your AVX2 `pre`:

    include 'skin/c' # defines infix a+b & a*b etc to run __add/__mul/... (yes, those aren't here by default, and you can define custom infix/prefix ops)
    include 'arch/c' # defines __add & __mul to do C ops
    include 'arch/iintrinsic/basic' # not necessary for a shuffle, but provides basic x86 arith ops
    include 'arch/iintrinsic/select' # x86 shuffles; there's similar 'arch/neon_intrin/basic' & 'arch/neon_intrin/select' for NEON

    fn pre(inp: [32]i8, out: *[32]i8) : void = {
      store{out, 0, vec_shuffle{16, inp, merge{ # 16 specifies to repeat per 16-elt lane
        range{8}*2+1,            # lower half: 8 Y components; compile-time index calculations
        range{4}*4, range{4}*4+2 # upper half: (4 * U), (4 * V).
      }}}
    }
As a more fancy thing, I've got this working (via bodging together the definitions in CBQN with some sugar to make this pretty; not including all that boilerplate here), compilable to SSE2/AVX2/NEON producing a 4x unrolled core loop, plus tail handling (via reading past the end and doing a load-blend-store if necessary because that's what CBQN's fine with; could easily define a fancy_loop such that it does a scalar tail though). (also can be compiled to RVV via currently-unpublished mappings; no need to unroll for RVV; can choose to do either a stripmined loop or one with a separate tail):

    fn sigmoid{E}(r:*E, x:*E, n:ux) : void = {
      def V = arch_preferred_vector{E}
      @fancy_loop{V,4}(r in tup{'dst',r}, x, M in 'mask' over n) {
        # this loop body is generated 3 times for x86 & ARM - with x being a 4-elt tuple (core unrolled loop); a 1-elt tuple and no masking; a 1-elt tuple and masking
        if (any_hom{M, ...(x!=x)}) {
          emit{void, 'abort'}
        }
        r{x / __sqrt{1 + x*x}}
      }
      # were it not for a bug in tuple loop var mutation in Singeli having undesired pervasion, this would be possible:
      # @fancy_loop{V,4}(r, x, M in 'mask' over n) {
      #   if (...) ...
      #   r = x / __sqrt{1 + x*x}
      # }
    }
    export{'sigmoid', sigmoid{f32}}
(e.g. generated C for AVX2: https://godbolt.org/z/KTeGsazKP)

(I'm not actually particularly expecting interest in Singeli; I just like writing stuff :) )

[1]: https://github.com/mlochbaum/Singeli


Thank you!

> Highway is probably a quite nice library on C++, though I haven't used it enough to get comfortable.

Yeah, I've seen several references to it, including this article, so it's on my list to check out some time.

> The thing I use for my projects is Singeli[1], a DSL specifically made for SIMD stuff

That is...not what I'd imagined you'd say! But I am intrigued.


Thanks for posting this, I'll take a look. It wasn't on my radar, but the idea of doing a DSL specifically for SIMD is something I've been thinking about and also starting to explore myself.

One of the things I find interesting about SIMD is that a lot of behavior that is “undefined” for scalar types in C-derived languages is explicitly fully defined when you use SIMD intrinsics with the same integral types. UB exists to hide the fact that major CPU architectures give different results for basic ALU operations in some cases. SIMD makes no such pretense of abstraction. If I am using AVX-512 I explicitly get the full Intel architecture experience, the implementation details are not hidden behind UB. Same with ARM, etc.

For example, shift overflows are masked on x86, zero-filled on ARM, and undefined in C/C++. In SIMD-land, none of this is hidden and so you design your code to leverage the reality that those instructions behave differently, whereas in C/C++ only the behavior they have in common is “defined”.

The vector ISAs are sufficiently different from each other (and normal CPUs) that it is like trying to build a compiler that can automagically produce optimized code for both CPUs and GPUs from the same source tree. I am not optimistic that this will happen anytime soon. AVX-512 essentially started life as a GPU ISA, which probably explains the interesting fact that a modern x86 CPU core has more AVX-512 registers than x86 registers.


Of course. Many things are UB because dictating a policy for all machines doesn’t make sense. Whereas AVX is a specification for a specific hardware capability.

> Even before getting into SIMD, try using Rust for concurrent, succinct, or external-memory data structures. It quickly becomes clear where the friction is.

It's the exact opposite for me: I use concurrent data structures more often in Rust than I do in C++ because I don't have to worry about dumb data race bugs. If one of my Bevy systems is slow, I slap par_iter() on the query and if it compiles it probably works, or at least fails for a not-stupid reason.


Concurrent data structures are rarely faster than "lock + non-concurrent data structure" and if you're putting constructs like par_iter() in a lot of places, there's a good chance you would be better off with the "dumb" pattern than the concurrent data structure.

The same goes with Arc - if you're using it a lot there's a good chance that code with a lot of Arcs is slower than equivalent GC-ed code.


While it's true that par_iter() uses a concurrent data structure under the hood, it's specifically designed to use work-stealing to avoid needing threads to communicate.

Why would putting a lock over a global workqueue be faster than per-thread workqueues that don't require inter-thread communication (except in the case where work-stealing is required)?


Atomics are very expensive operations. Lock/unlock is two atomics. Many concurrent data structures will end up doing many more atomic operations than you expect.

The general wins of concurrent data structures come when you really are accessing them truly concurrently - as in when many threads on many cores are heavily contending for access and you need to make global progress.


Sure! However, the work-stealing queue in rayon [1] uses three atomic operations instead of the two atomic operations for a mutex for a global lock. The difference, however, is the three atomic operations for the thread-local queue should be uncontended, whereas a global lock on a global work queue would experience contention from every thread trying to access it for jobs.

Between the choices of "single work sharing queue with a big mutex on it that all threads access for work" vs "per-thread work-stealing queue that's uncontended for the cost of one extra atomic," in what situations would the work-sharing queue with the global mutex outperform? Perhaps if there's a small number of jobs, and there's not enough time for the work-stealing algorithm to distribute jobs to the worker threads before the work-sharing algorithm has already finished.

[1]: https://github.com/crossbeam-rs/crossbeam/blob/423e46fe20471...


Run a benchmark. With low contention, the lock will outperform. Atomics are very expensive assembly instructions.

Fedor Pikus has a good talk on this at cppcon 2019.


I've seen the talk! The issue with using a global lock on a global work queue is that, unless the work items have drastically different compute times, there _will_ be high contention on the lock.

I ran a benchmark [1], which shows that this is correct:

Results on quad-core Intel Linux box:

    $ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
    Benchmark 1: target/release/testit
      Time (mean ± σ):      2.526 s ±  0.139 s    [User: 4.709 s, System: 11.425 s]
      Range (min … max):    2.391 s …  2.730 s    10 runs

    Benchmark 2: env USE_RAYON=1 target/release/testit
      Time (mean ± σ):     174.1 ms ±   0.9 ms    [User: 212.1 ms, System: 121.1 ms]
      Range (min … max):   173.1 ms … 175.4 ms    16 runs

    Summary
      env USE_RAYON=1 target/release/testit ran
       14.51 ± 0.80 times faster than target/release/testit

Results on M1 Pro:

    $ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
    Benchmark 1: target/release/testit
      Time (mean ± σ):     692.2 ms ±   8.3 ms    [User: 491.4 ms, System: 5693.6 ms]
      Range (min … max):   683.2 ms … 704.5 ms    10 runs

    Benchmark 2: env USE_RAYON=1 target/release/testit
      Time (mean ± σ):      63.0 ms ±   2.1 ms    [User: 97.7 ms, System: 47.0 ms]
      Range (min … max):    61.0 ms …  71.2 ms    44 runs

    Summary
      env USE_RAYON=1 target/release/testit ran
       10.99 ± 0.39 times faster than target/release/testit

[1]: https://play.rust-lang.org/?version=stable&mode=debug&editio... (I'm just using the rust playground as a pastebin; the actual benchmarks were run locally)
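Since the playground link is truncated above, here's a minimal sketch of the shape of the comparison (not the exact benchmark code; `work` is just a stand-in for the per-item job):

    use rayon::prelude::*;
    use std::sync::Mutex;

    // Stand-in for the per-item job.
    fn work(x: u64) -> u64 {
        (0..1_000u64).fold(x, |acc, i| acc.wrapping_mul(6364136223846793005).wrapping_add(i))
    }

    // Work-sharing: one global queue behind a mutex; every pop contends on the lock.
    fn with_global_lock(items: &[u64]) -> u64 {
        let queue = Mutex::new(items.to_vec());
        let total = Mutex::new(0u64);
        std::thread::scope(|s| {
            let threads = std::thread::available_parallelism().map_or(4, |n| n.get());
            for _ in 0..threads {
                s.spawn(|| {
                    let mut local = 0u64;
                    loop {
                        let Some(x) = queue.lock().unwrap().pop() else { break };
                        local ^= work(x);
                    }
                    *total.lock().unwrap() ^= local;
                });
            }
        });
        total.into_inner().unwrap()
    }

    // Work-stealing: rayon splits the slice per thread, stealing only on imbalance.
    fn with_rayon(items: &[u64]) -> u64 {
        items.par_iter().map(|&x| work(x)).reduce(|| 0, |a, b| a ^ b)
    }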


Ah, yes. It's good that you ran a benchmark. However, if I read the code correctly, your version with a lock does an extra memcopy. When the work function is as short as AES encoding of a block of memory, that memcopy is quite a big cost and the queue is going to be quite heavily contended.

> Atomics are very expensive operations.

Atomics are very expensive when contended - which is also the case where locks would introduce blocking and reduced performance. Uncontended atomics are relatively cheap.


I suggest you run a microbenchmark. Uncontended atomics are some of the most expensive assembly instructions out there. They also acquire global system locks on certain stages of execution.

Yep, including draining the store buffer.

We've gotten our ThreadPool barrier+wait to use only acq/rel, but not yet the work stealing. Does anyone have experience with that already?


FWIW, work stealing can be pretty expensive. It is only efficient under a narrow set of assumptions about the workload.

What do you think about 'par_iter' having to wait out any work imbalance before returning? With a 'when_all'-like primitive you can continue execution on any thread without tying one up waiting. P.S. As someone who doesn't have a Rust job, I'd like to see an example of how Rust deals with task-based systems, if you have a public one at hand, of course.

The rayon library uses work stealing for this. Its parallel iterators offer some control of splitting strategies and granularity, so you can tune the trade-off between full utilization and cost of moving things between threads.

Additionally, in Bevy, independent queries (systems) are executed in parallel, so there's always something else to do, except your one worst loop.


Rust is easy to understand as "a language by browser writers for writing browsers." That statement alone gives you most of Rust's design choices:

* Safety over everything else

* Very good single threaded performance

* Javascript-like syntax and a Javascript-like package manager

I have been working on some software in Rust recently that needs bit and byte manipulation, and we have "unsafe" everywhere and hugely complicated spaghetti compared to the equivalent code in C.


> I have been working on some software in Rust recently that needs bit and byte manipulation, and we have "unsafe" everywhere and hugely complicated spaghetti compared to the equivalent code in C.

I'm curious what makes this so different from my experience. I rarely ever have to write "unsafe", and I'm writing quite low-level engine code that certainly uses bit manipulation. In fact, crates like bitflags and fixedbitset make it so easy that I tend to get dinged in code reviews for using bit flags when structs of booleans would be simpler :)
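For instance, the sort of thing bitflags makes a one-liner (a sketch; the flag names are invented for illustration):

    use bitflags::bitflags;

    bitflags! {
        #[derive(Clone, Copy, Debug, PartialEq, Eq)]
        struct RenderFlags: u8 {
            const VISIBLE      = 0b0000_0001;
            const CASTS_SHADOW = 0b0000_0010;
            const TRANSPARENT  = 0b0000_0100;
        }
    }

    fn needs_depth_sort(f: RenderFlags) -> bool {
        f.contains(RenderFlags::VISIBLE | RenderFlags::TRANSPARENT)
    }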


> I'm curious what makes this so different from my experience. I rarely ever have to write "unsafe", and I'm writing quite low-level engine code that certainly uses bit manipulation.

Perhaps your usecase is similar enough to eg JavaScript engines? Because that's a usecase that browser writers would at least have in mind?


It's not. It's Bevy.

What effects this is separating your unsafe into unsafe abstractions. If you're careful you don't have to write too much unsafe. Granted it's not easy.


A game engine is much more similar to a browser engine than you think. The operations you have discussed, using bits as flags, are also the most basic forms of binary-level manipulation out there. Things like tagged pointers and bit flags fit nicely and neatly into an encapsulated unsafe abstraction (provided you want to add an extra 1000 lines of code for it).

As far as I can tell, any time you want to rely on the exact binary layout of something in memory, you need unsafe. As a corollary, any time you want to bit cast from one type to another, you need unsafe. This means that things like succinct data structures and building network protocols need quite a bit of unsafe everywhere. The former needs you to do things like "store 14 bits of X here and 12 bits there..." The latter needs control of bit and byte layout because you want to carefully eliminate implementation-defined compiler behavior.
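Concretely, by "store 14 bits of X here and 12 bits there" I mean packing like the following sketch (field widths invented); the unsafe shows up once you start viewing whole arrays of these words through differently-typed slices or raw pointers:

    // Hypothetical 14-bit / 12-bit packing into one u32 word.
    fn pack(x: u16, y: u16) -> u32 {
        debug_assert!(x < (1 << 14) && y < (1 << 12));
        ((y as u32) << 14) | (x as u32 & 0x3FFF)
    }

    fn unpack(w: u32) -> (u16, u16) {
        ((w & 0x3FFF) as u16, ((w >> 14) & 0x0FFF) as u16)
    }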


Take a look into bytemuck or zerocopy. I haven't used unsafe when doing byte level manipulation in a long time.

> As far as I can tell, any time you want to rely on the exact binary layout of something in memory, you need unsafe. As a corollary, any time you want to bit cast from one type to another, you need unsafe.

No, that's what bytemuck is for. If bytemuck didn't exist, sure, I'd be using a lot of unsafe.
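For example, a rough sketch of the bytemuck route (struct and field names invented; assumes bytemuck's derive feature):

    use bytemuck::{Pod, Zeroable};

    #[repr(C)]
    #[derive(Clone, Copy, Pod, Zeroable)]
    struct PacketHeader {
        version: u8,
        flags: u8,
        length: u16, // repr(C) with no padding is what makes the Pod derive legal
    }

    // Returns None if the slice is too short or misaligned for u16.
    fn parse_header(bytes: &[u8]) -> Option<&PacketHeader> {
        bytemuck::try_from_bytes(bytes.get(..4)?).ok()
    }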


Bytemuck handles the latter, not the former. In the applications I am working on, both matter.

I can't quite parse your statement, but if you mean that bytemuck doesn't let you "rely on the exact binary layout of something in memory", then yes, it does: it lets you cast from a Pod type to another Pod type, which exposes the memory layout of both types.

And a jet engine and IC engine are engines, but putting your car engine into a Boeing and vice versa would be an unwise decision.

I'd argue you're over-abstracting the differences. They have different purposes. Game engines need high performance, while browsers need to enable wide selection of APIs.


> And a jet engine and IC engine are engines, but putting your car engine into a Boeing and vice versa would be an unwise decision.

Sure, but you'd expect the design software for a car engine to be at least half-way relevant for a jet engine in a pinch.

Compared to eg the software an architect would use to design a bridge.


Browsers need very high performance, too.

To be fair to my analogy, both ICE and jet engines need high RPMs and pulling power too, just not on the same level.

Browser layout and rendering have some elements similar to a game's, but it isn't on the same level. A game can usually assume it's trusted to run a given shader; in a browser that's an attack vector. Security design will impact game engines and browsers differently.

Granted, people are doing their darnedest to make games in the lowest common denominator technology, i.e. Electron.


> * Javascript-like syntax and a Javascript-like package manager

I think this is not serious criticism. The "javascript-like package manager" remark actually refers to fixing a major problem plaguing the developer experience of legacy programming languages such as C or C++. Java has one, .NET has one, every single mainstream programming language has one. Except C or C++.

Rust might be riddled with "the emperor has no clothes" aspects, but having a package manager is not it.


That's not a negative aspect of Rust, and I'm not picking out a list of things I dislike about Rust. Making a statement that isn't unequivocally positive does not equal criticizing something.

Cargo is, however, a similarity with JS. On the whole a good one. Also, cargo works much more like npm than maven, for example.


Cargo has been co-created by Yehuda Katz, who worked on Ruby's Bundler before. Cargo has been designed after npm, so it definitely took lessons from it, but it doesn't make sense to just broadly attribute this to Rust being JavaScript-adjacent.

The Rust syntax does not come from JavaScript. It even conflicts with it, using `let` and `const` differently, since the `let` in Rust comes from OCaml, not JS.

Both JS and Rust copy from the same C/C++ roots. Rust's curly brace flavor is more similar to Go and Swift. The original author of Rust liked a lot of languages with different syntaxes, but the C-like syntax has been a pragmatic choice to avoid putting off the target audience of C++ programmers:

http://venge.net/graydon/talks/intro-talk-2.pdf


Most likely because many keep forgetting that bit and byte manipulation in C is a mix of implementation defined and UB, depending on how it is coded.

What's the problem of using toolchain-specific features instead of behavior defined by the standard? Isn't this the bread and butter of embedded development and the reason why some behavior is purposely left undefined in the standard?

Security exploits is the problem, because what common people that don't read WG14 mailings think the word undefined means, and what everyone else involved with creating compilers understand what they are allowed to do, is not the same.

> Security exploits is the problem, because what common people that don't read WG14 mailings think the word undefined means, and what everyone else involved with creating compilers understand what they are allowed to do, is not the same.

Your comment lacks credibility. Your hypothetical scenario would only be conceivable if a) a team was well versed enough in C to adopt a specific toolchain to leverage implementation-defined behavior covering behavior left undefined by the C standard, and b) somehow that same team decided on a whim to replace their toolchain with some other random toolchain without even being aware of their toolchain-specific code. This is far from a realistic scenario, and reads more like mindless complaints about UB coming from a place of ignorance.


CVE database, the ongoing liability laws in cybersecurity across several nations, and companies that also happen to be C and C++ compiler vendors, are my credibility.

It is incredible how, for the last 50 years, we keep getting ad hominems from folks who think they actually know it all about security, and that only clueless junior developers don't know what they are doing.


> [...] and the reason why some behavior is purposely left undefined in the standard?

It's one of the saner reasons why there's UB in C and C++. But there's lots of crazier reasons.


> It's one of the saner reasons why there's UB in C and C++.

It is the reason why the standard purposely leaves some areas undefined.

C and C++ detractors talk a lot about UB, but they always show that their understanding of the subject is at best superficial. They parrot UB as if it were this major gotcha, when it is literally behavior the standardization committees made a call not to define, purposely leaving it open so that any implementation can still be conforming even if it decides to implement behavior not defined in the standard. It is that simple, but somehow people parrot UB as if it were this major gotcha. Baffling.


Baffling is the ignorance of the ways of WG14 and WG21 while pretending to be a know-it-all; maybe update yourself on the relevant papers for C2y and C++26, aimed at replacing references to UB with erroneous behaviour, or implementation-defined behaviour.

This subthread is kinda strange with people claiming IB and UB are the same, considering that C/C++ have had clearly delineated definitions of undefined behavior versus implementation-defined behavior for decades in their term definitions.

No, we do understand the difference and the comment still stands. UB is a choice by the standard committee.

Signed integer overflow is undefined because it’s not even clear if you can detect it happens in all implementations. Do you want a conditional after every integer add?


> Signed integer overflow is undefined because it’s not even clear if you can detect it happens in all implementations. Do you want a conditional after every integer add?

Huh? You can just make it implementation defined, and most implementations would declare it to work like twos-complement wrap-around. Just like unsigned integer overflow is already defined in the standard.

Where does the conditional you are talking about come from? Unsigned integer overflow doesn't have any conditionals either.

Compilers like GCC already support wrap-around for signed integers with a command line option. It's spelled "-fwrapv", and I don't think it involves any conditionals.


> You can just make it implementation defined

No you can't. Let me repeat again. One very possible implementation is to crash on signed integer overflow (`-ftrapv`). But if it cannot easily detect it, then it must either: a) check every integer add. b) use some less reliable detection method (random sampling, etc).

To be well defined means the implementation will reliably do the same thing. A procedure which sometimes does one thing or another, is not well defined.

> Unsigned integer overflow doesn't have any conditionals either.

That's because there is no detection needed. The instruction set for unsigned integers is designed to do that.

> -fwrapv

Here is the documentation:

> This option instructs the compiler to assume that signed arithmetic overflow of addition, subtraction and multiplication wraps around using twos-complement representation.

https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

What this means is that if your hardware wraps on signed integer overflow, you can tell GCC that you are perfectly happy with that behavior. Then GCC understands and will handle it just fine.

If your hardware does something else, `fwrapv` will not help you.


But an implementation wouldn't choose to specify its behavior for an operation to be something that it knows it cannot do efficiently, if the goal is to be efficient, for extremely obvious reasons.

If your hardware does twos complement, the compiler should choose to define the implementation-defined behavior to be twos complement. If the hardware traps, the implementation-defined behavior would be to trap. And so on.

And, as long as the goal isn't to exploit the overflows for fancy optimizations unrelated to what the hardware does, that'll be perfectly fine and produce code that doesn't do anything extra over what you'd expect, and won't ever summon cthulhu (as long as the behavior the implementation defines isn't to summon cthulhu).

(that's all assuming that there's a single best definable specific behavior for a given op for a given architecture, which isn't actually true; e.g. ARM scalar shifts do `shift %= width`, but SIMD shifts shift the other direction on a negative amount, and give 0 on `abs(shift) >= width`, only looking at the low 8 bits of the shift amount)


> (that's all assuming that there's a single best definable specific behavior for a given op for a given architecture, which isn't actually true; e.g. ARM scalar shifts do `shift %= width`, but SIMD shifts shift the other direction on a negative amount, and give 0 on `abs(shift) >= width`, only looking at the low 8 bits of the shift amount)

In that case, your implementation defined behaviour could still be that your compiler is allowed to non-deterministically pick between any of the allowed behaviours. But nasal demons would still be verboten.


> your implementation defined behaviour could still be that your compiler is allowed to non-deterministically

No. Non-determinism is a key difference between undefined and implementation defined.

Remember back in school - a function is well defined if it has exactly one output for every input.

Let me turn this question around since we keep going in circles. If the standard said the words you just said instead of “undefined” how would that make C better? Is there some bad implication I am missing?


> No. Non-determinism is a key difference between undefined and implementation defined.

Actually, I think that's the difference in the C spec between implementation defined and unspecified. Eg function arguments can be evaluated in any order the compiler feels like, but it's not UB (because the only thing that's allowed to vary is the order, nasal demons are still verboten.)

You are right that it seems like the C standard expects implementation defined behaviour to be deterministic. That's annoying and should probably be fixed.

> If the standard said the words you just said instead of “undefined” how would that make C better? Is there some bad implication I am missing?

A lot better! If you know that eg shifts (or signed integer addition) always produce a value (even if non-deterministically) out of a documented set of valid options, that would make your program much more easy to reason about for the programmer, than UB's nasal demons that fly back in time.

Informally, many people already try to reason about their C programs like this.

See also how eg malloc returns an address, but apart from some basic requirements on alignment etc, the spec doesn't specify what address gets returned: the implementation can pick non-deterministically.


Yes, we are in complete agreement.

> If your hardware does something else, `fwrapv` will not help you.

Yes, and both you and your compiler know what your hardware is doing.

If you have hardware that does something else, you could do another implementation defined thing. Eg one's complement or trapping or whatever it is your hardware is doing.

>> Unsigned integer overflow doesn't have any conditionals either.

> That's because there is no detection needed. The instruction set for unsigned integers is designed to do that.

It's designed to do that on some (well, almost all) computers. Just like it's designed to do the same for signed integers. Most computers don't even have a special instruction for signed integer addition, because they are twos-complement machines.

---

Implementation defined could also be:

'Do whatever your processor does with an 'add' instruction; but don't let anything weird travel backwards in time, or make additional assumptions about execution paths not happening.'


> Do whatever your processor does with an 'add' instruction; but don't let anything weird travel backwards in time, or make additional assumptions about execution paths not happening.'

That’s exactly what the standard means by undefined. Anything more you are reading into it is an absurd hypothetical. If you find such an implementation - let me know so I can avoid it. Clang and gcc do reasonable things - as you have pointed out there are helpful flags to configure it too.


> Clang and gcc do reasonable things - as you have pointed out there are helpful flags to configure it too.

Without these specific flags, they don't do 'reasonable things'. They happily assume that UB can't happen and use that information to declare certain code paths dead and optimise them away.

The simplest example is something like this:

    signed char x = 20;
    while(x < x+1) {
        do_something();
        x++;
    }
GCC and Clang when told to optimise will happily assume that you have an infinite loop here. (Unless explicitly told otherwise via the compiler flags we mentioned.)

That’s a completely reasonable thing to happen - no computer melting here. And as you said you can configure your desired behavior.

Did you know you can accidentally write infinite loops in rust?


The parent post has a mistake - it should use an "int" type instead of "signed char", otherwise the implicit promotion done by a+b breaks the desired showcase.

That fixed, it wouldn't be an infinite loop if "x+1" did what the hardware does, i.e. twos complement: "x < x+1" would become false for x==INT_MAX. And yet clang considers it UB, so the loop is UB once it hits the wraparound point, and that can result in unintentional computer melting: https://godbolt.org/z/833KzYY5G

Of course you still generally need to have whatever harmful code you don't want ran here to still be present in the binary somewhere, and "melt_computer();" is a rather weird thing to have, but a "perform_heavy_work();" or "tolerate_high_temps_for_a_bit();" or "void debug_override_temperature_readings(){...}" are more realistic. (of course you might need special permissions to do those, but, uh, it's possible to get said permissions, especially if the point of the software is to operate those things.. and there are plenty of harmful things one can do without special perms)

(edit: actually, even without that char↔int fix, the parent post still shows UB because infinite loops are UB in C (as long as the condition isn't a compile-time constant) (also the x++ is UB on reaching the char limit); copy-pasting it directly into harmless_loop still results in computer melting)


No, infinite loops aren't generally UB in C. Only when they essentially "don't do anything".

Assume that `do_something();` does some IO, and C is perfectly happy with that loop.

You could have also added extra conditions to make the loop look less infinite to the C compiler. Eg do a check in the loop body, and optionally break from it.

> Of course you still generally need to have whatever harmful code you don't want ran here to still be present in the binary somewhere, [...]

Well, assume that we are eg looking at the 'sudo' program. In any case, almost any code can become harmful, if memory gets corrupted, and if attackers control the input.


Oops, yeah, need to remove the `do_something();` for the loop to be UB.

> Well, assume that we are eg looking at the 'sudo' program. In any case, almost any code can become harmful, if memory gets corrupted, and if attackers control the input.

Not really; OOB stores, sure, but, other than that, even something basic like username comparison or something resulting in OOB loads on too-long names could technically be completely safe if OOB loads were defined as "either returns an arbitrary value, or crashes", just resulting in superfluous rejections.

And, in a safe language, no matter how wrongly you'd implement flag/argument parsing (besides some equivalent of just straight up passing the args to system() or equivalent), as long as the final thing actually processing the request & comparing passwords doesn't assume anything specific of the parsed internal data format, it could cause no actual exploitable issues.


> and that can result in unintentional computer melting:

If your program already has the capability to melt your computer, then you can accidentally trigger that with a bug. That’s a massive if, and that’s a risk of any bug in such a program.

Once again, you can already configure the behavior to avoid this (ftrapv, or fwrapv) which is as much as rust will do.
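For reference, the Rust behavior being compared against: overflow panics in debug builds, wraps in release builds unless overflow-checks is enabled, and the integer APIs let you pick per call site (a sketch):

    fn overflow_options(a: i32, b: i32) {
        let _wrapped = a.wrapping_add(b);                  // always two's-complement wrap
        let _checked = a.checked_add(b);                   // Option<i32>, None on overflow
        let (_value, _overflowed) = a.overflowing_add(b);  // wrap plus an overflow flag
        let _saturated = a.saturating_add(b);              // clamp to i32::MIN / i32::MAX
    }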


But UB extends that to "a bug anywhere can potentially cause anything to go wrong", whereas traditional logic bugs require the bug to be functionally related to the bad-if-misused code (for which you can do things like carefully vet the potentially-harmful code and be sure your program will never do anything harmful regardless of how many bugs all other code in the program has).

See all the exploits that go from a single OOB write (or other sources of UB) to arbitrary code execution; it's really not that hard. Whereas such spooky-bugs-at-a-distance is just plain impossible in safe Rust or Java.


> Whereas such spooky-bugs-at-a-distance is just plain impossible in safe Rust or Java.

There are all kinds of bugs in these compilers all the time. Not to mention what Java might do with a multithreading bug. I don’t think you can assert that.

And once again - you can match those languages intended treatment of signed integers by using a compiler flag.

I think clang should insert a ret at the end of the function. Gcc chooses to infinite loop instead which is far better. This is unfortunate for clang.


> There are all kinds of bugs in these compilers all the time. I don’t think you can assert that.

That's a separate discussion; generally you need specific conditions in the source code to have compiler bugs affect your code (and for any given compiler bug it's much more likely for it to visibly break your program (i.e. be a clear release blocker) than quietly break on some specific user input; assuming you have appropriate testing).

I don't think I've ever even heard of any exploitable cases of spooky-bug-at-a-distance in Java, whereas such are commonplace for C projects.

Bad multi-threading in Java still won't produce spooky bugs at a distance - worst you may get is a long torn on a 32-bit boundary on 32-bit systems, or reordered reads/writes.

> I think clang should insert a ret at the end of the function.

......But UB means it's not required to do that... ...that's, like, the main thing the discussion is about. This isn't any form of issue or bug in clang to be fixed, it's intentionally chosen that that behavior is fine on UB.

(to be clear, despite all this, I think UB is a fair enough thing, and am perfectly fine with compilers "exploiting" it for perf (I've even had a case of compiler optimizations on integer overflow fixing a bug! found out when I realized I didn't run certain tests on ubsan/debug builds), and, indeed, outside of OOB stores, is generally not too exploitable; but it's still very far from not being potentially very problematic (and the "potential"ness is generally independent from specific code), especially for projects where safety matters)


Yeah you definitely wanna review section 3 of the C standard again because it literally doesn't say that lol

Please read the context. Let me quote myself:

> A conforming implementation may choose to melt your computer - and you can choose to not use that implementation. But clang and gcc will never melt your Linux.

And for the 3rd time, a real world case that might ruin your computer is an embedded device without memory protection.


Your conception of UB is flawed.

This has nothing to do with memory protection. If a process has the ability to do something, then it may do so. It wouldn't be because GCC and clang choose to melt your computer, it is because your program has a bug, and the consequences of that bug can be anything, including jumping into random executable code that happens to ruin the computer.


What are you responding to here?

> If a process has the ability to do something, then it may do so

You agree with me. Once again, on an embedded device writing over a buffer may do absolutely anything. Thanks to memory protection, what a process can do is more limited. The C spec is written for both cases.

No gcc and clang will not insert code to start rm’ing files when you overflow a signed integer.


> gcc and clang will not insert code to start rm’ing files when you overflow a signed integer.

It could. If it detects that a code path cannot be taken without causing overflow, it will assume the code path cannot be taken and will optimise it away. There is no need even to return from the function; if you reach this code anyway, it can run whatever function is in the binary after it. If you're unlucky, that's a function to remove temporary files, which would be run with bogus arguments.


If you are unlucky, it's going to execute whatever user data a malicious attacker added by carefully crafted input.

(Yes, these days that's harder, but you can still do a more complicated version of the same attacks. See return oriented programming.)


You are upset we aren’t familiar with papers and proposals which are not yet agreed on as standard? And for unreleased C++ versions?

What's upsetting is comments that confidently state inaccurate or clearly wrong statements, thereby spreading misconceptions

> What's upsetting is comments that confidently state inaccurate or clearly wrong statements, thereby spreading misconceptions

The bulk of the people in this thread clearly have no grasp of what UB is, even though they are very vocal in the way they parrot their misconceptions and outright absurdities. They would never write half the absurd and misguided claims they did if they were even aware the standards explicitly use the term "non-portable" to define UB.


Nah, thankfully liability legislation is finally happening so folks will actually bother to learn standards and how compilers approach them.

Or eventually face consequences.


Now we are waiting for new legislation and overhaul of the tech system to make your point?

Nah, it is already here depending on the country one lives in; I don't need to make any point.

Also I stand by my comments even if they hurt the feelings of people who probably never opened a single copy of the ISO standards, not even the table of contents.


Ok, I thought you were telling us we won't get UB until we read papers which are proposals for future standards. If you find any corrections in this thread about existing standards, please comment.

You’re describing implementation-defined behavior, not undefined behavior. IB does something but the committee doesn’t say what. UB does anything and the compiler can assume you never intended to cause it.

Exactly. The classic example is invalid pointer deref. It’s too costly to check every deref against every allocation (outside of special debug modes). So the system usually can only detect if there is a virtual memory page fault, in which case it can crash.

In embedded systems without virtual memory there is no validity checking at all. Or it could periodically check (random sampling).

So the standard makes the reasonable choice to leave it undefined. If you can detect it and crash, that’s great. If not and you accidentally overwrite your programs instructions, anything can happen.

Rust only avoids UB in so far as it relies on default clang behavior on modern hardware.


> Rust only avoids UB in so far as it relies on default clang behavior on modern hardware.

That's not true.

Rust behaviour avoids UB by not compiling with invalid reference deref because it checks life time.

Clang's behaviour is likely a crash but anything can still happen. It is literally undefined.


> Clang's behaviour is likely a crash but anything can still happen.

You act as if clang itself is random. Clang will do a handful of things, none of which will melt your computer.

A given implementation may choose to do anything. Clang and gcc on major operating systems do reasonable things.

> It is literally undefined.

Ok but I’m explaining why undefined is the right thing.


The code generated by clang can do anything.

Ok, clang will not directly melt my computer because clang just generates code, but imagine this:

     if (temperature_too_high) 
          lower_temperature();
     
if somewhere else the code accesses an invalid pointer, clang may decide to remove this condition altogether because it thinks it must be dead code, for example. And running this code will melt the computer even though the programmer thought this wouldn't be possible.

That’s not true.

A conforming implementation may choose to melt your computer - and you can choose to not use that implementation. But clang and gcc will never melt your Linux.

Once again, a real life example where that is relevant is an embedded device with no memory protection. If you overwrite a buffer you can overwrite the OS code - breaking your computer and requiring you to flash the memory.

It seems there is a lot of misguided accusations going around about who misunderstands UB.


> But clang and gcc will never melt your Linux.

If you use gcc or clang to compile your kernel, they will happily melt your Linux.


If you're referring to what might happen when you overwrite a buffer in privileged kernel code, yes - but that is true regardless of clang, gcc, or what the C standard says about UB.

Re Rust, that’s simply not true. Rust relies on compile time validity checks that have nothing to do with virtual memory.

You confused two parts. I am not saying that Rust has the same pointer deref, I’m saying that if you try to specify rust you will find parts that implicitly rely on clang or hardware default. In other words the behavior is unspecified.

The borrow checker has a formal proof and it is completely independent from hardware.

Gatorade gives you electrolytes.

> Exactly. The classic example is invalid pointer deref. It’s too costly to check every deref against every allocation (outside of special debug modes). So the system usually can only detect if there is a virtual memory page fault, in which case it can crash.

You could still define what happens.

Eg you could define that whatever error might happen because of invalid pointer deref can't travel back in time. (UB famously can travel back in time: UB at any time during the execution makes the whole execution UB.)


UB doesn't allow arbitrary time travel; any time travel you see should be explainable by happening to run code that undoes what was done previously, and where it's not possible to explain it as such it's a compiler bug (ref: wg14 member: https://news.ycombinator.com/item?id=40836898).

This is now clearly standardized, but even without that it's pretty trivial to see how compilers would generally abide by this - if you call an external function that does arbitrary things, it may exit(), so the call can't be optimized out even if it's followed by UB. (printf & co are effectively arbitrary calls for these purposes for multiple reasons). And all other behavior is just writing things to memory, which can be undone by happening to write what was there before. (ok there's also volatile, and a bug in clang (not gcc though): https://github.com/llvm/llvm-project/issues/102237)


Check the ensuing discussion under the comment you linked to. Some limited time travel is still allowed for UB.

There's still no time travel in any of those posts; some may require an explanation of "the compiler replaced the UB division with a call to 'bar'" or something, but "calling 'bar'" is still a subset of the "anything" that UB can result in. If there's a specific example you think doesn't follow that, I'd appreciate a specific reference.

(and singron's mention of SIGSEGV handling is entirely irrelevant; the C standard only cares about the abstract machine; in general real memory state can temporarily appear in arbitrarily weird states in practice if interrupted by a signal even without UB, e.g. https://godbolt.org/z/91ModhbEY will have all a[0..15] written even if there's a handled trap on the write to b[0] (or otherwise an interrupt between the two SIMD writes) (using restrict for simplicity, but taking an 'int n' argument and using that for the loop bound will result in ~the same thing without restrict, just with less direct assembly); the same applies even to Rust, though of course with Rust you shouldn't be able to have a situation where you trap on a safe write; could still have a random interrupt happen in the middle though)


And then it was a decision on the part of compiler vendors to define it to do insane things like time travel. Not the standards committee. Compiler vendors could have just as well defined it to wrap.

Some compiler decided to define overflow as wrapping. Such as GCC/clang when passing the -fwrapv flag.

Most projects don't use that flag though, why not?

Note that if you assume -fwrapv, you're not writing in C anymore; you're using a vendor-specific dialect.


Everyone writes in vendor-specific dialects of C. Both POSIX and Win32 define behaviour that isn't defined in C (such as rules for unaligned pointers), while also undefining behaviour that is defined in C (such as what happens if you call fopen when one of the functions in your program is called "open"). Even on embedded platforms, there's no "main" function in freestanding C - your entry point is a vendor-specific extension.

> Everyone writes in vendor-specific dialects of C.

That's not true. Some people try to write fully spec compliant C programs that would behave the same on every compliant C compiler.


Time travel is a consequence of UB anywhere in an execution making the whole execution UB.

The committee could have said that UB only makes the rest of the execution UB, not what happened before.


> It is the reason why the standard purposely leaves some areas undefined.

No. It is _one_ of the reasons, not _the_ reason.

For example, signed integer overflow. If all you were worried about was the behaviour of the underlying hardware, you would declare this _implementation defined_. (With the vast majority of implementations today picking twos-complement.)

Instead, even compilers that target two's-complement machines leave this deliberately undefined, so they can exploit this corner for performance gains without actually having to prove that leaving out certain checks would be sound under two's complement.


Exactly this. Most of the art of it is avoiding UB and sticking to implementation-defined behavior.

I love your username, btw :)

> Just consider how hard it is to build any meaningful abstraction over the predicate/register models across AVX-512, SVE, and RVV.

Note that Highway, mentioned in the post, does take care of this, which is no easy feat but also proof that it is doable.


> Rust feels like a Python developer’s idea of a high-performance computing language.

I might be wrong, but I think it sounds more like Rust doesn't move as far away from the C or C++ way of doing things as you want it to. At the very least, Rust is no worse than C or C++ at any of the things you mentioned.


In my (biased) experience, Rust is much harder to use for advanced projects, than C and C++. On the bright side, it’s also harder to misuse :)

I've found it much easier to use, from writing a web service to writing a high-performance DB that outperforms RocksDB, and clearly people are using it for things like writing operating systems as well as game engines. I'm not sure what in your mind falls under "advanced projects", but I suspect it's something like number crunching (although you link StringZilla, so not sure).

I'm still not seeing any description of specific challenges you feel are harder in Rust than in C/C++. In my mind Rust is completely equivalent in being able to accomplish the same tasks.


> easier to use from writing a web service to writing a high performance DB

Have you looked at what the actual implementation code of tokio / axum etc looks like?

I can't comment on DB because that is quite far out of my wheelhouse but regarding game engines, I've almost universally seen people revert to storing objects in arrays and passing around "handles" which is literally pointers with extra steps... But I guess at least you are protected from some issues there.

Rust forces absurd levels of abstraction onto code very early, and if you are wrong you need to make sweeping refactors.

I really want to love the language, but building bottom up is extremely difficult with it, it may be Stockholm syndrome and familiarity but I can generally get much further into a project much quicker in C++.

In Rust it feels like I'm fighting the language and it's not the borrow checker.

There are so many great ideas in Rust and I keep trying it because of them, don't get me wrong I do think a safer language is the future but Rust isn't there yet. Every time I try it it's better though.


I have actually looked at the interior of Tokio. While there's a lot of complexity there, most of that complexity is in making a highly performing multi-threaded work-stealing runtime rather than anything Rust-specific. Indeed, I find it easier to reason about than in C++ because the ownership rules are enforced and where they're violated is clearly annotated with unsafe & lots of documentation explaining why it's safe. I suspect you'll see similar things in axum but I can't speak to that codebase specifically.

I don't think you'd be able to do something meaningfully easier in C++. I think it is Stockholm syndrome. It took me about 2 months to get familiar enough that it wasn't uncomfortable anymore, and 8 months to reach the "I can generate code faster than I ever could in C++" point.

For context consider that I'd spent the prior ~11 years coding on and off in C++ professionally and the prior 10 years before that as a kid learning C++ so C++ was definitely a language I felt at home in. I spend almost no time compared to C++ trying to make the build system work and I can quickly pull in high quality components others have written vs in C++ where I either have to implement it myself, spend time on build integration, or figure out a way to do without.


Oh I completely agree Tokio is a complex piece of software, but my point was you need to implement a decent amount of that even for trivial cases where Tokio isn't needed and therefore the default response is to just pull in Tokio.

Part of the complexity for me, at least in async Rust, was that it is some weird mix between coroutines and completion handlers. It's a pretty nice abstraction once you wrap your head around it, but how many Rust devs would be able to do that?

Should I find a performance regression and need to dive into the code, there are a lot of unknown unknowns to explore, because what is actually happening is hidden from you by the abstraction, and the Async book's response is "don't worry about that, just use tokio". And the "advanced" section is incomplete.

I've almost always found the complexity of these systems to be reasonable when I've dug through it, but I had to manually mess with Rust code to understand the async model, the book failed to explain what it is and when it clicked it made a ton of sense.

Maybe my issue is that Rust just forces abstractions down your throat and says "trust me you need it" but I would much rather discover why and even find a more relevant abstraction.

The last example I made still stands: there is so much complexity caused by abstractions that the wgpu "book" recommends an older version of winit, because the current version made API changes that require a lot of code change. Winit 0.30 released almost a year ago; I understand that it is likely a volunteer maintainer there, but if the language allowed for easy refactoring, why has it been a year?

The fact that the webgpu standard is still in active development likely adds to that.


Async rust is purely a state machine equivalent to c++ coroutines. There’s not really any more magic than that within the language itself. In other words, async rust is syntactic sugar and nothing else.
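To make that concrete, here is a minimal hand-written equivalent (hypothetical names, not from any real codebase) of what an await-free async fn like `async fn double(x: u32) -> u32 { x * 2 }` conceptually desugars to: a type whose poll() drives a state machine, which in this degenerate case completes on the first poll.

  use std::future::Future;
  use std::pin::Pin;
  use std::task::{Context, Poll};

  // Single-state machine: no .await points, so poll() finishes immediately.
  struct Double {
      x: u32,
  }

  impl Future for Double {
      type Output = u32;

      fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
          Poll::Ready(self.x * 2)
      }
  }

Each .await in a real async fn adds a state plus whatever locals live across it; the compiler generates all of this for you.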

Tokio is a work-stealing runtime that then layers other requirements on it (e.g. futures must be Send to support work stealing). This is similar to Apple’s libdispatch.

I’m not sure what performance problems or other magic you’re referring to, but it’s hard to answer non-specific gripes with information. It sounds like you’re saying it should be easier to roll your own runtime and that there’s slow progress on standardizing some of the pieces. But there are a lot of runtimes you can pick from (async-std, smol, glommio, monoio, etc). It’s fine that implementing a runtime is complex because a lot of that complexity is inherent (Rust adds some extra boilerplate, but that’s all it really is).

I haven’t really found Rust’s async to be all that hard to grok. It’s got some annoying boilerplate in places where there are impedance mismatches in very rare, hyper-specific cases, but those don’t normally come up.


From what I've played around with, Rust futures are not entirely coroutines but a mixture between coroutines and a sender/receiver model. I'm pretty sure C++ will end up somewhere similar in the end, or at least most libraries will model it that way.

Effectively the language has been built around the idea of a runtime being needed at all, whereas C++ coroutines (and coroutines in general) don't need a runtime. They are just state machines with syntactic sugar.

In Rust, something that gets awaited is expected to inform the runtime of when it should be continued. You cannot just have a bunch of coroutines that call each other; you must build a basic runtime even if it is just doing normal coroutines.

Once that clicked for me it made a lot more sense, but it was pretty annoying seeing syntax for coroutines and not being able to just use them as coroutines.

And I have no specific performance gripes; it's more that instead of just needing to understand the CS fundamentals, I also need to understand how the language decided to implement those CS fundamentals and abstract them away.

I think the Rust async model being easy for you to grok is possibly clouding the complexity of it. If you know the complexity of what is needed to do multithreaded async well, then it's not hard to wrap your head around why it was done that way, and as I said I found the implementation beautiful when I understood it, but there was no documentation that mentions it and the only documentation is surface level.

As one of the other comments here said, it's like Python devs wanted to write a low-level high-performance language. All the knobs are there, but they decided to hide them away behind abstractions that theoretically make the happy path easy, but slightly off the happy path you get thrown into the deep end. In the C/C++ world there isn't any kiddie pool; all the ugliness is on display.


> All the knobs are there, but they decided to hide them away behind abstractions that theoretically make the happy path easy, but slightly off the happy path you get thrown into the deep end. In the C/C++ world there isn't any kiddie pool; all the ugliness is on display.

That's a perfect characterization - you can still get to the ugliness when you need it (& I don't think it's such a sheer cliff but YMMV). The "python devs wanted to write a low level high perf language" was derogatory suggesting that those authors didn't know what they were doing and crippled the language. It's quite the opposite I think though - they knew exactly what they were doing & made it super easy for developers to write correct code with a simpler mental model of the code.

As for "async as coroutines", you're right that it's a bit more complicated than C++ coroutines, but that's why there's an explicit coroutine mechanic headed your way: https://doc.rust-lang.org/std/ops/trait.Coroutine.html


> The "python devs wanted to write a low level high perf language" was derogatory suggesting that those authors didn't know what they were doing

I didn't mean it as derogatory at all actually, more a matter of it being pushed to developers that have never used a systems-level language and had to deal with the types of algorithms at the systems level, and actually most of them can even do it because the abstractions are great. But those same developers can't see how those abstractions could be implemented and wouldn't know where to look.

Hey, at least this chain of comments has helped me understand my dislike of where Rust is positioned, and as I said in another comment the language authors are making the necessary changes to appease people like me. I'm happy to hear about the coroutine mechanic because it adds another checkbox that Rust is at the very least trying to position itself correctly.

You are the first Rust "evangelist" that I've had the privilege to chat with that actually didn't make me want to drop the language and go back to C++.

You pointed out where rust is strong without just shouting "you are holding it wrong". Thanks for the great experience.


Neat. I think Rust even today basically is good enough to almost never touch c++ again. The strengths outmatch the weaknesses significantly and you end up at a significantly higher output velocity because the software is more stable and higher quality.

> I didn't mean it as derogatory at all actually, more a matter of it being pushed to developers that have never used a systems-level language and had to deal with the types of algorithms at the systems level, and actually most of them can even do it because the abstractions are great. But those same developers can't see how those abstractions could be implemented and wouldn't know where to look.

While there are certainly people who came to Rust from higher-level languages, and Rust’s safety guarantees make systems-level programming more accessible, the language authors and many early users are very clearly systems engineers who understand the space. That’s why noalias is such a critical part of the language, and it turned out that LLVM implemented it incorrectly in a bunch of places because no C++ code actually exercised that attribute all that much (noalias is also a good chunk of the reason why a straight port of C/C++ code often ends up performing slightly better). Rust is the first language I’ve seen where I’d feel comfortable designing a high-performance system and then having someone with a web background also contributing without creating all sorts of problems. That kind of collaboration just isn’t possible in C++.


Commenting on two loosely-coupled things here!

The WGPU/Winit compatibility situation is indeed a mess. (Throw EGUI in there too if you're into that...) I was able to get all of them onto the latest, but it was an adventure of troubleshooting, and asking for help on Discord, Github etc. I think this could have been avoided by adding current and/or practical examples to the WGPU repo and/or docs. So, documentation problem maybe? Most of the examples are too provincial or trivial to use as a basis for building a practical project. (Or upgrading one!)

Count me as someone who can't grok Async. I've tried several times. It doesn't stick to my brain, and/or I'm not sure how to keep it from propagating through the code. I've come to the realization that I will probably never get it, and/or see eye-to-eye with the rust community on this. Most of the web ecosystem will remain off-limits.


> I can't comment on DB because that is quite far out of my wheelhouse but regarding game engines, I've almost universally seen people revert to storing objects in arrays and passing around "handles" which is literally pointers with extra steps... But I guess at least you are protected from some issues there.

Regarding game engines, the empirical evidence shows that Bevy has been moving at least as fast as, if not faster than†, comparable C++ engines. I listed all the features I landed in the past year here: https://news.ycombinator.com/item?id=42945730

We love to argue on message boards about theoretical or anecdotal productivity of programming languages, but at the end of the day the only thing that matters is what people have actually been doing with the language, and for Rust the answer is "quite a lot, actually".

†In my view, Bevy has been moving significantly faster than comparable C++ engines, but I don't need to argue that to make my point.


Don't take it as an attack on game engines in Rust. It really isn't, it's more an observation that people happily work around the fundamental language design and that should be an indicator to the language authors.

And they are definitely listening though; when I last took a dive into Rust I fought a lot of weird borrow checker edge cases that I could easily verify as correct. The borrow checker now no longer complains about some of those cases.

As I said I think it's a language maturity thing. It will get there.


The borrow checker changes you’re talking about are several years old. The next-gen borrow checker should solve the remaining annoyances, but that’s years away.

But I disagree that Rust code breaks down to emulating pointers by way of handles.


Not entirely what I was saying; it's one of the patterns I've seen, and those that use arrays and indexes into arrays as "handles" will also vehemently disagree with me when I say they just implemented "we have pointers at home".

Those are pointers; there's slightly more protection with them, but depending on how you model it, use-after-free bugs are back on the table when doing that. At least buffer overflows shouldn't be possible, since that is runtime checked.
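For readers unfamiliar with the pattern, a stripped-down sketch (hypothetical names, not from any particular engine) of the index-as-handle approach being debated here:

  // Minimal arena: handles are just indices. Real arenas recycle freed slots on
  // insert; without generation counters, a stale handle can then silently alias
  // a new object, which is the use-after-free-like hazard described above.
  // (This toy version only ever appends, so stale handles here just read None.)
  struct Arena<T> {
      items: Vec<Option<T>>,
  }

  #[derive(Clone, Copy)]
  struct Handle(usize);

  impl<T> Arena<T> {
      fn insert(&mut self, item: T) -> Handle {
          self.items.push(Some(item));
          Handle(self.items.len() - 1)
      }

      fn get(&self, h: Handle) -> Option<&T> {
          self.items.get(h.0).and_then(|slot| slot.as_ref())
      }

      fn remove(&mut self, h: Handle) {
          if let Some(slot) = self.items.get_mut(h.0) {
              *slot = None;
          }
      }
  }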


It’s possible that use after free isn’t actually an issue, because Rust’s ownership system lets you express lifetimes that the compiler enforces (or you enforce it at runtime yourself).

But handles aren’t uncommon even in c/c++ - it’s nothing to do with the language and more how the author thinks the particular problem domain is best modeled.


> I'm still not seeing any description of specific challenges you feel are harder in Rust than in C/C++

It's harder to model domains in Rust. C has the same problem and it is a big reason why the industry standard for game dev is C++


Care to give an example? I think GUIs remain the softest domain-modeling area, but that’s about how you do next-gen GUI toolkits that are super high performance, safe, and lower overhead vs what Rust has today. But that’s an underserved niche anyway vs C++ toolkits or Electron.

The other weak parts might be ecosystem immaturity (eg Unreal vs Bevy) but that’s not a language modelling issue.


Rust makes it hard to represent mutable graphs. If I have a value that can be updated by multiple GUI nodes, I have to architect my way around the borrow checker in Rust. Is it safe? Yes. Is it performant? For a skilled Rust dev, yes. But it takes me >25% more time compared to C++, and 25% of a year-long project is 3 months.
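As one illustration of the kind of rearchitecting being described (hypothetical, not taken from any particular GUI framework): shared mutable state typically ends up behind Rc<RefCell<...>> or similar, trading the compile-time borrow check for a runtime one.

  use std::cell::RefCell;
  use std::rc::Rc;

  // Several "GUI nodes" share one value; borrow_mut() panics if borrows overlap,
  // so the aliasing rule is enforced at runtime instead of compile time.
  struct Node {
      value: Rc<RefCell<i32>>,
  }

  impl Node {
      fn bump(&self) {
          *self.value.borrow_mut() += 1;
      }
  }

  fn main() {
      let shared = Rc::new(RefCell::new(0));
      let nodes = [Node { value: shared.clone() }, Node { value: shared.clone() }];
      for n in &nodes {
          n.bump();
      }
      assert_eq!(*shared.borrow(), 2);
  }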

Could you expand on this? What exactly does "model domains" mean?

They wrote 2 whole paragraphs about SIMD.

And to me SIMD does not seem significantly harder in Rust than in C++ but I've only done a little Rust SIMD so I'm willing to be wrong on that (although others have said it's not much harder than in C++ so not sure).

In my experience Rust is much easier for advanced projects. Once you get into high performance code, C++ takes much more time, for two reasons.

1. With C++ you have to think a lot. Like, a lot. To convince yourself that what you are doing is safe, and that the lifetimes and races of various objects are safe. Rust takes a huge load off, by failing to compile your mistakes. You know that for the code that doesn't use "unsafe", if it compiled, then you don't even have to think about races or lifetimes.

2. C++ has so many places where copies sneak in. Tracking down needless copies in C++, for object types where it's not as simple as "just disable copy construct & copy assign", can be tricky and is extremely brittle to future changes of code. And not just for the CPU cost of copies, but RAM costs too.

And I say that as someone who's been coding C++ on a daily basis since the 1990s.


I don’t think Rust is worse at SIMD code, though.

Python doesn't have anything remotely similar to Cargo that isn't written in Rust (uv is). It's exactly backwards. Python is benefitting from Rust philosophy.

I guess it comes down to application. If you don't attempt to find the most general solution, you can dodge those pitfalls. Case in point: abstracting over AVX-512, SVE, and RVV may be tough, but picking one is fine (on nightly only for now), and with the right abstractions it can be almost as straightforward as using normal scalar values. I don't have a solution on the CUDA variants either; have been hard-coding that as well (Cudarc lib with CUDA-version feature gates and GPU-series-specific code). Haven't hit a brick wall yet, but might... or might not.
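For a flavour of the "almost as straightforward as scalar" claim, a tiny nightly-only sketch using std::simd (portable SIMD); names and the fixed lane count are illustrative only:

  #![feature(portable_simd)]
  use std::simd::f32x8;

  // Lane-wise a*x + y with ordinary operators; the lane count is fixed here,
  // which is exactly the portability trade-off discussed elsewhere in the thread.
  fn axpy8(a: f32x8, x: f32x8, y: f32x8) -> f32x8 {
      a * x + y
  }

  fn main() {
      let r = axpy8(f32x8::splat(3.0), f32x8::splat(2.0), f32x8::splat(1.0));
      println!("{:?}", r.to_array());
  }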

> Rust feels like a Python developer’s idea of a high-performance computing language.

I'd say the language complexity (and it keeps piling more things) means it's pretty far from Python's ideas.

Julia is closer.


Rust is an extraordinarily simple language. Compare it to the complexity of C++.

We can say that Mars is more similar to Earth than Jupiter even though Sagittarius A* exists.

Unsure which celestial body is supposed to be which in this comparison :p

If you only compare against C++, maybe. Otherwise, no. C++ and Rust are in a class of their own when it comes to compile times and complexity.

Well, C++ is the only major language (other than C itself) that occupies the same space, so it's natural that they are compared. And Rust is an order of magnitude simpler than C++'s almost comical web of intricacy.

I’m seriously interested: what is the best language/framework/method you know of that will squeeze out every bit of performance from advanced hardware?


Seems rustc nightly does successfully vectorize the first sigmoid example: https://rust.godbolt.org/z/e1WYexqWY

Also there's progress on making safe intrinsics safe: https://github.com/rust-lang/stdarch/pull/1714
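For reference, a scalar loop in the spirit of that example (the exact code is behind the godbolt link above; this sketch uses the x / sqrt(1 + x^2) form, which has no libm call and so leaves the optimizer free to emit packed sqrt and divide):

  pub fn sigmoid_scalar(xs: &mut [f32]) {
      // Plain scalar loop; the linked godbolt shows nightly rustc
      // auto-vectorizing the article's version of this with opt-level=3.
      for x in xs.iter_mut() {
          *x /= (1.0 + *x * *x).sqrt();
      }
  }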


Very interesting! I posted a vector and quaternion lib here a few weeks ago, and got great feedback on the state of SIMD on these things. I have since gone on a deep dive, and implemented wrapper types in a similar way to this library. Used macros to keep repetition down. Split into three main sections:

  - Floating point primitives. Like this lib. Basically, copied `core::simd`'s API. Will delete this part once core::simd is stable. `f32x8`, `f64x4` types etc, with standard operator overloads, and utility methods like `splat`, `to_array` etc.

  - Vec and Quaternion analogs. Same idea, similar API. Vec3x8, Quaternionx8 etc.

  - Code to convert slices of floating point values, or non-SIMD vectors and quaternions to SIMD ones, including (partial) handling of accessing valid lanes in the last chunk.
I've incorporated these `x8` types into a WIP molecular dynamics application; relatively painless after setting up the infra. Would love to try `Vec3x16` etc, but 512-bit types aren't stable yet. But from Github activity on Rust, it sounds like this is right around the corner!

Of note, as others pointed out in the thread here I mentioned, the other vector etc libs are using the AoS approach, where a single f32x4 value etc is used to represent a Vec3 etc. While with this SoA approach, a `Vec3x8` is for performing operations on 8 Vec3s at once.

The article had interesting and surprising points on AVX-512 (needed for f32x16, Vec3x16 etc). Not sure what the implications of exposing this in a public library are; it might be a trap if the user isn't on one of the AMD Zen CPUs mentioned.

From a few examples, I seem to get 2-4x speedup from using the x8 intrinsics, over scalar (non-SIMD) operations.
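For anyone curious what such a wrapper boils down to before the macro layer, a stripped-down sketch (assuming x86-64 with AVX available; the naming mirrors the description above, but the code is illustrative rather than the actual library):

  #![allow(non_camel_case_types)]
  use core::arch::x86_64::*;
  use core::ops::Add;

  #[derive(Clone, Copy)]
  pub struct f32x8(__m256);

  impl f32x8 {
      pub fn splat(v: f32) -> Self {
          // In real code, gate this behind #[target_feature(enable = "avx")]
          // or a runtime is_x86_feature_detected!("avx") check.
          unsafe { Self(_mm256_set1_ps(v)) }
      }

      pub fn to_array(self) -> [f32; 8] {
          unsafe { core::mem::transmute(self.0) }
      }
  }

  impl Add for f32x8 {
      type Output = Self;

      fn add(self, rhs: Self) -> Self {
          unsafe { Self(_mm256_add_ps(self.0, rhs.0)) }
      }
  }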


Why do the x4/x8 types seem to be the default in rust?

A portable SIMD feature should encourage portable SIMD and not a specific vector register size.


Amen! I really do not understand this. It has been 7 years since SVE was introduced.

Writing an application in terms of a specific lane count loses performance portability - either it's too many, or too few, for the particular CPU. And it also enables/encourages antipatterns like putting RGB in Vec4.


This sounds like a great idea. I went with this approach because I'm new to SIMD, so I aped the most promising API (core::simd), extending it naturally.

I need to think through the consequences. It might involve feature gates, and/or an enum. So, for example, instead of:

  pub struct f32x8(__m256);
It might be this internally, with some method to auto-choose variant based on system capability?:

  pub enum f32_simd {
    X8(__m256),
    X16(__m512), 
  }
etc. Thoughts?

I'm not proficient in Rust, but API wise I'd conceptually define types like f32xn or f32s, which have the number of elements that fit into a vector register for your target architecture, so 4 for NEON/SSE, 8 for AVX and 16 for AVX512.

I can recommend looking at the Highway library: https://github.com/google/highway
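A rough sketch of that idea in Rust terms (nightly portable SIMD; the cfg conditions here are simplified guesses, not a complete feature matrix):

  #![feature(portable_simd)]
  #![allow(non_camel_case_types)]

  // "Native width" alias: code is written against f32s and the lane count
  // follows whatever target features the crate is compiled with.
  #[cfg(target_feature = "avx512f")]
  pub type f32s = std::simd::f32x16;
  #[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
  pub type f32s = std::simd::f32x8;
  #[cfg(not(any(target_feature = "avx2", target_feature = "avx512f")))]
  pub type f32s = std::simd::f32x4;

Note this still fixes the width at compile time per build; truly vector-length-agnostic code in the SVE/RVV sense needs more than a type alias, which is where libraries like Highway earn their keep.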



I’ve been playing around with SIMD since uni lectures about 10 years ago. Back then I started with OpenMP, then moved to x86 intrinsics with AVX. Lately I’ve been exploring portable SIMD for a side project where I’m (re)writing a Numpy-like library in Rust, mostly sticking to the standard library. Portable SIMD has been super helpful so far.

I’m on an M-series MacBook now but still want to target x86 as well, and without portable SIMD that would’ve been a headache.

If anyone’s curious, the project is here: https://github.com/IgorSusmelj/rustynum. It's just an exercise for learning Rust, but I’m having a lot of fun with it.


A problem for RISC-V is going to be that there's currently no way for user code to detect the presence of RVV. I have no idea how you can do multiversioning with that limitation.

The solution is to ask the OS to detect it for you. Linux offers a syscall for this (riscv_hwprobe). Has the drawback that it requires OS support, of course. But RVV requires OS support anyway (e.g. managing mstatus, saving vector registers on context switch), so that seems fine to me.

There is some work on an OS-agnostic feature detection C API: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s.... Still quite new though, and potentially might change (as it did a month ago).

Or also getauxval? Highway has code to check for this, including that vectors are at least 128 bits: https://github.com/google/highway/blob/master/hwy/targets.cc...
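For the Linux case, a hedged sketch of the getauxval route using the libc crate (this assumes the kernel reports the single-letter 'V' bit in AT_HWCAP; it only tells you RVV exists, not the vector length, for which riscv_hwprobe is the better interface):

  // Linux/RISC-V only: hwcap exposes single-letter extensions as bits,
  // so 'V' is bit ('V' - 'A') == 21.
  #[cfg(all(target_os = "linux", target_arch = "riscv64"))]
  fn has_rvv() -> bool {
      let hwcap = unsafe { libc::getauxval(libc::AT_HWCAP) } as u64;
      (hwcap & (1 << (b'V' - b'A'))) != 0
  }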

Oh? Isn't that what this does? std::arch::is_riscv_feature_detected!("v")

Hmm… now that I actually experiment with it, I can't get it to return `true` on hardware that does support it, unless I also compile with -Ctarget-feature=+v. And if I do, then the binary crashes with SIGILL before getting to that point on hardware without rvv.

So if it's always equal to cfg!(target_feature="v"), then what does that even mean?

I have created https://github.com/rust-lang/rust/issues/139139


It's a shame that the `MISA` CSR is in the 'Privileged Architecture' spec, otherwise you could just check bit 21 for 'V', but that appears to only be available in the highest privilege machine-mode.

Presumably your OS could trap attempts to read the CSR and allow it, but if not then it's a fatal error and your program shits the bed, otherwise you rely on some OS-specific way of getting that info at runtime.


What happens there when you try to execute a missing RVV instruction? On other archs you get a SIGILL, which you can handle.

With RISC-V being an open ISA, a vendor can freely add some non-RVV thing in the instruction space used by RVV if it doesn't desire to support RVV. And that already exists with Xtheadvector (aka RVV 0.7.1), where T-Head made hardware with pre-ratification RVV that's rather incompatible with the ratified RVV 1.0 but still generally uses the same encoding space.

It's "reserved" which is basically the same as C's UB - anything can happen (nasal dragons etc.) so you can't rely on it.

I guess being unprivileged, this still guarantees the dragons are restricted to your process; otherwise the chip would have a security problem. So as a heuristic you could fork your process, try to execute SIMD operations (detecting known nonstandard versions and failing them), and if the results seem fine, send an "ok" flag up a pipe (or compile this into a separate test executable).

Just because it's reserved doesn't mean anything can happen. On x86_64, you get a clearly defined error when you use reserved bits and the like. I don't know if RISC-V is the same, but if it isn't it's because they chose to be vague, not because that's what it means to be reserved.

For what it's worth, the RISC-V unprivileged spec says:

> The behavior upon decoding a reserved instruction is UNSPECIFIED.

> Some platforms may require that opcodes reserved for standard use raise an illegal-instruction exception. Other platforms may permit reserved opcode space be used for non-conforming extensions.

The RVA23 says: (https://github.com/riscv/riscv-profiles/blob/main/src/rva23-...)

> Implementations are strongly recommended to raise illegal-instruction exceptions when attempting to execute unimplemented opcodes or access unimplemented CSRs.

and has an optional extension "Ssstrict":

> Ssstrict No non-conforming extensions are present. Attempts to execute unimplemented opcodes or access unimplemented CSRs in the standard or reserved encoding spaces raises an illegal instruction exception that results in a contained trap to the supervisor-mode trap handler.



