I particularly suspect that if something like Cranelift keeps evolving, it will eventually reach throughput parity with LLVM, likely without implementing all of the optimizations that LLVM has. It shouldn't be assumed that just because LLVM has an optimization, that optimization is profitable anywhere but LLVM, or at all.
Final thought, someone should try this with B3. https://webkit.org/docs/b3/
People think the C compiler community is dominated by GCC and Clang, and it is, but there are literally 1000s of implementations out there in the wild. Most are necessary because we need code generated for some obscure processor architecture that's completely proprietary, but you can create that "backend" in LLVM itself: it's just a new target architecture alongside e.g. x86.
The great thing about LLVM is that it's effectively the quickest (and probably the best) way for a language to generate machine code without putting in too much effort. Whether that language is a research language or an existing industry language (say, C), that kind of established base is hugely valuable.
A great example of a good monoculture is the Go monoculture. Sure, there's gccgo, but the proportion of people using that vs. the reference implementation is minimal, and that reduced fragmentation is actually a good thing for practitioners (which most engineers are, not PL researchers).
It’s good to have divergence. Competition is good. Otherwise people stop trying new things.
Not that it substantially detracts from your point, but I strongly doubt this. Or did you mean a heavily restricted subset of C++? A C++ front end alone is so complex to build that these guys make a living off of licensing their front end code: https://www.edg.com/
(Fun fact: Microsoft rebuilt IntelliSense for C++ on the EDG front end. Yes, that Microsoft with the MSVC compiler. See https://devblogs.microsoft.com/cppblog/rebuilding-intellisen... and https://old.reddit.com/r/cpp/comments/bdt8ep/does_msvc_still...)
Even without compatibility cruft, you're looking at multiple 100k LOC if their code base is anything to go by. That's man-years, not man-months...
There have been "write a C++ compiler" classes that did it in a semester. I think they also cover the standard library (but that might be a year-long course).
I'd be curious where these classes draw the line. Do you happen to have a syllabus or so? I don't doubt you can implement a meaningful portion of C++ in a semester, but converting 500 pages of standardese into code within as many hours seems like an impossible goal for a class to me.
LLVM was created to replace emitting C, by providing programmers a way to turn source code into a representation that is lower-level than C without having to write the whole optimization and assembly code generation pipeline.
IBM had several LLVM-like projects during the '70s, and that is how their surviving IBM i and z/OS platforms work anyway, with language environments that AOT-compile at installation time.
Likewise, there were projects like the Amsterdam Compiler Kit, among others, during the early '80s.
Biggest issue is the cultural aversion to returning structs and tagged unions.
I'm guessing this could create some divergence in terms of what is supported by the compiler but I'm curious how much that would matter in reality - for day-to-day serious project development. I'm not familiar with language dev at the compiler level, so I'm curious to hear if that's practical or sane.
If Rust were a huge community, okay, but face it, it is not. It is better, therefore, to focus their efforts where they can make a difference. A new X where the existing ones are just fine (this includes being well maintained) is a waste of resources.
There are many possible good answers to the above question. However I'm not sure they apply, and worse I believe they will split resources that could be used to make something else better.
So.. the claim that the Rust community is not big enough to achieve this is wrong, since they have already done it..
The reason they are doing it, is that LLVM is not fine: it is super _super_ slow. People want Rust to compile instantaneously, and are willing to pay people full time to work on that.
D, for example, compiles much faster than C and C++, and does this by having its own backend for unoptimized builds. I don't know how big the D community is, so I can't compare its size to the Rust community's, but they did it, and it paid off for them big time, so I don't see why it wouldn't pay off for Rust as well.
What I said was that Rust is better off focusing on problems that are not solved well by other people. A fast modern web browser (with whatever features are lacking), for example.
Source? LLVM is fast for what it does.
What people usually complain about is rustc being slow overall, not the LLVM passes.
The LLVM phases are usually the dominating factor in Rust compile times (the other big single contender is the linking phase). However, when the Rust developers point this out, they are also careful to mention that this may be due to the rustc frontend generating suboptimal IR as input to LLVM; we can both acknowledge that LLVM is often the bottleneck for Rust compilation while also not framing it as a failure on LLVM's part (though at the same time it is uncontroversial to state that LLVM does err on the side of superior codegen versus minimal compilation time, hence the niche that alternative compilers like Cranelift seek to fill).
1) much easier for the Rust community to contribute to the compiler from end to end.
2) lower coordination cost with LLVM, giving complete, Rust-focused control over code generation/optimisation. Think about e.g. fixing noalias (see the sketch after this list).
3) lower maintenance cost for LLVM integration/fork.
It's also obvious that this needs to be weighed against the loss of LLVM's accumulated technology and contributors. That is easy to underestimate (although I think 2) and 3) are also easy to underestimate).
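On the noalias point, a minimal sketch (my code, not the parent's): Rust's `&mut` references are non-aliasing everywhere, roughly what C's unchecked `restrict` opts into, and rustc has repeatedly had to toggle the corresponding LLVM `noalias` annotations off and on because of LLVM miscompilations:

    // Two &mut references may be assumed never to alias, so the compiler
    // can keep *a in a register across the store to *b. rustc emits this
    // as LLVM `noalias`, which it has disabled more than once over the
    // years while LLVM bugs around it were fixed.
    fn bump_both(a: &mut i32, b: &mut i32) {
        *a += 1;
        *b += 1;
        *a += 1; // no reload needed: b cannot point at *a
    }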
Sure, it is harder to contribute to the backend, but does it matter? I've been doing C++ for years and never looked at the backend.
I'll grant lower coordination costs. However, I believe they do not outweigh the advantages of the other LLVM contributions.
If they need to fork LLVM, that is a problem. Either merge it back in and be done (with some tests so whatever they need doesn't break), or there is a compelling reason why LLVM won't work with their changes.
You can file bug reports, but not every part of the project is going to receive the same level of attention or care from core developers, and not everyone has the same priority. For example the Glasgow Haskell Compiler had to post-process LLVM generated assembly for years because we lacked the ability to attach data directly next to functions in an object file (i.e. at an offset directly preceding the function). Not doing this resulted in serious, meaningful performance drops. That was only fixed because GHC developers, not any other LLVM users, fixed it after finding the situation untenable after so long. But it required feature design, coordination, and care like anything else and did not happen immediately. On the other hand the post-processing stuff was a huge hack and broke in somewhat strange ways. We had other priorities. In the end GHC, LLVM, and LLVM users benefitted, but it was not exactly ideal or easy, necessarily.
On the other hand, "normal" code generation bugs like register misallocation or whatever, caused by extreme cases, were occasionally fixed by upstream developers, or patches were merged quickly. But absolutely none of this was as simple as you think. LLVM is largely a toolchain designed for a C compiler, and things like this show. Rust has similarly stressed LLVM in interesting ways. Good luck if your language has interesting aliasing semantics! (I gave up on trying to integrate LLVM plugins into our build system so that the code generator could better understand e.g. stack and heap registers never aliased. That would have resulted in better code, but I gave up because it turns out writing and distributing plugins for random LLVM versions your users want to use isn't fun or easy, which is a direct result of LLVM's fast-moving release policy -- and it is objectively better to generate worse code if it's more reliable to do so, without question.)
Finally, LLVM's compilation time issues are very real. Almost every project that uses LLVM in my experience ends up having to either A) just accept the fact LLVM will probably eat up a non-negligible amount of the compilation time, or B) you have to spend a lot of time tuning the pass sets and finding the right set of passes that work based on your design and architecture (e.g. earlier passes outside of LLVM, in your own IR, might make later passes not very worth it). This isn't exactly LLVM's fault, basically, but it's worth keeping in mind. Even for GHC, a language with heavy "frontend complexity", you might suspect type checking or whatever would dwarf stuff -- but the LLVM backend measurably increased build times on large projects.
> Either merge it back in and be done
It's weird how you think coordination costs aren't a big deal and then immediately say afterwards "just merge it back in and be done". Yeah, that's how it works, definitely. You just email the patch and it gets accepted, every time. Just "merge it back in". Going to go out on a limb and say you've never actually done this kind of work before? For the record, Rust has maintained various levels of LLVM patches for years at this point. They may or may not maintain various ones now, but I wouldn't be surprised if they still did. Ebbs and flows.
I'm not saying LLVM isn't a good project, or that it is not worth using. It's a great project! If you're writing a compiler, you should think about it seriously. If I was writing a statically typed language it'd be my first choice unless my needs were extreme or exotic. But if you think the people working on this Rust backend are somehow unaware of what they're dealing with, or what problems they deal with, I'm going to go out on a limb and suggest that: they actually do understand the problem domain much, much better than you.
Based on my own experience, I strongly suspect this backend will not only be profitable in terms of compilation time, which is a serious and meaningful metric for users, but will also be more easily understood and grokked by the core developers. And Cranelift itself will benefit, which will extend into other Rust projects.
C++ isn't a great language, but learning C++ is the least difficult part of contributing to LLVM.
I think this works so well because language designers tend to understand compilers better than they understand other software.
It's extreme but it's a good idea because it treats compilation time like an actual budget, which it is. You can't just add things endlessly. But it's not easy to achieve in practice.
It would be nice if a GCC-replacement compiler made speed of building code the goal. I'll even accept the speed of building the compiler itself (after it was compiled with gcc, clang, msvc...) as the benchmark, if that is faster.
Just curious, do you have any examples of these "limitations" you speak of? Sounds like a very interesting read.
It’s not even about what language you’re compiling. It’s about the IR that goes into llvm or whatever you would use instead of llvm. If that IR generally does C-like things and can only describe types and aliasing to the level of fidelity that C can (I.e. structured assembly with crude hacks that let you sometimes pretend that you have a super janky abstract machine), then llvm is great. Otherwise it’s a missed opportunity.
Do you have any examples off-hand? I presume caring about patchpoints and OSR is fair game to start with?
Also, if you want to use LLVM as a backend for your project and expect to build LLVM as part of a vendored package: the LLVM libraries with debug symbols on my machine were about 3GB. Also not ideal.
Like, LLVM tries not to add UB, but design choices it made to support optimization with UB do sometimes result in new UB being introduced, like the horror show that happens with `undef` and code versioning. (Two uses of the same `undef` value aren't required to observe the same bits, so duplicating an expression while versioning code can silently change behavior.)
So, I think that optimizing with UB internally is fine but only if it's some kind of bounded UB where you promise something stronger than nasal demons.
Do you mind expanding more on these points or directing me to some places where I can learn more about them? Compilers are a fairly new field for me, so anything I can learn about their design decisions and tradeoffs is worth its weight in gold.
Another tool for compiler research using modern approaches with type safe languages.
FreePascal, for example, has its own x86, ARM, MIPS, SPARC and PowerPC backends.
And GCC being a pain to work with is a deliberate decision by Stallman, to avoid his baby being expanded upon by corporations.
LLVM's extensible architecture was its most critical property; its permissive license is an unfortunate side effect of a rewrite.
If GCC had come up with the "GCC Runtime Library Exception" way back in the day, and provided a modular architecture, half the innovation happening around LLVM might have happened around GCC instead. (Might, not "would have"; we can only speculate on what alternate history might have occurred.)
It turns out that hasn't happened yet with LLVM, so allowing such things under the LGPL may have worked.
To me, it sounds more like political propaganda from a few idealists who want to justify why, instead of participating in a joint project, they want to develop everything themselves from scratch in their favourite technology. There is even a common term for this: "Not Invented Here" syndrome.
Why is that strange? You now have a diverse set of frontends and a monoculture on the backend. A world with Chrome, Chromium, Edge, Brave, and the Yandex browser is still a browser engine monoculture.
For example, if you're worried that one of the compilers might be malicious, you can use the other compiler to check on it: https://dwheeler.com/trusting-trust
Even if you're not worried about malicious compilers, you can generate code, compile it with multiple compilers, run the results on the same inputs, and see when the outputs differ. This has been used as a fuzzing technique to detect subtle errors in compilers.
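A minimal sketch of that idea (my code; the binary names are hypothetical placeholders):

    use std::process::Command;

    // Run the same test input through binaries produced by two different
    // compilers and report whether their outputs diverge.
    fn outputs_differ(input: &str) -> bool {
        let run = |bin: &str| {
            Command::new(bin)
                .arg(input)
                .output()
                .expect("failed to run binary")
                .stdout
        };
        run("./prog_built_with_gcc") != run("./prog_built_with_clang")
    }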
This still requires the use of a trusted compiler, though. Comparing two compilers arbitrarily shows whether there is consensus; it does not give guarantees about correctness.
From the link:

> In the DDC technique, source code is compiled twice: once with a second (trusted) compiler (using the source code of the compiler's parent), and then the compiler source code is compiled using the result of the first compilation. If the result is bit-for-bit identical with the untrusted executable, then the source code accurately represents the executable.
As discussed in detail in that dissertation, if you are using diverse double compiling to look for malicious compilers, the trusted compiler does not have to be perfect or even non-malicious. The trusted compiler could be malicious itself. The only thing you're trusting is that the trusted compiler does not have the same triggers or payloads as the compiler it is testing. The diverse double compiling check merely determines whether or not the source code matches the executable given certain assumptions. The compiler could still be malicious, but at that point the maliciousness would be revealed in its source code, which makes the revelation of any malicious code much, much easier.
You're absolutely right about the general case merely showing consistency, not correctness. I completely agree. But that still is useful. If two compilers agree on something, there is a decent chance that their behavior is correct. If two compilers disagree on something, perhaps that is an area where the spec allows disagreement, but if that is not the case then at least one of the compilers is wrong. The check by itself won't tell you which one is wrong, but at least it will tell you where to look. For a lot of compiler bugs, having some sample code that causes the problem is the key first step.
They kept getting owned until they supposedly found a pretty dumb hack which just appended the backdoor to the final compilation on the build server...
No clue if it was just a story though, as I personally haven't experienced anything like that before.
Is this guy human? This is amazing, and this guy should be given an award.
It's great to have all three, as they each have different characteristics in terms of speed, generated code, debug support, platform support, etc. Supporting these three also helps maintain proper semantic separation of code gen from front end.
I find Rust (the spec, though also the implementation) quite safe and practical (a balance). It deserves some independent implementations to secure a long and stable future.
On the other hand, I want to use it on non-ARM embedded platforms, where current cross-compilation through C produces unusably big binaries. I dream this might increase hope for that, too, eventually.
Where is the Rust spec? Unless something happened really quickly that I was not aware of, there is only the implementation.
“Compiling development builds at least as fast as Go would be table stakes for us to consider Rust“
Go was designed from the ground up to have super fast compile times. In fact, there are some significant language issues related to that design decision.
Using one of the primary design goals that shaped the language's structure as "table stakes" is almost certainly going to require a lot of effort, with some serious unintended consequences.
Improving compilation times sounds good. Aiming high is good. But reaching "best of breed performance" is a major initiative.
There are a number of technical implementation challenges in the compiler.
It is a large project, and Rust's got a really intense stability policy.
The compiler was bootstrapped very early, when the rate of change of the language itself was still "multiple things per day." This introduced significant architectural debt.
There have been multiple projects that have re-written massive parts of the compiler, and more ongoing. For example, non-lexical lifetimes required inventing an entire additional intermediate language, re-writing the compiler to use it, and making sure that everything kept working while doing so.
More recently, the compiler has been transitioning from a classic multi-pass architecture to a more Roslyn-like, "query-based" one. Again, this is being done entirely "in-flight", while keeping a project that's used by a lot of folks stable. The rust-analyzer effort has made this project even more interesting; a "librarification" strategy is underway to make the compiler more modular.
For some numbers on this kind of thing, https://twitter.com/steveklabnik/status/1211667962379276288/... and https://twitter.com/steveklabnik/status/1211717308143587334/...
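To sketch what "query-based" means here (a toy illustration of mine, not rustc's actual API): derived facts become memoized, on-demand queries rather than sequential whole-program passes, so an edit only recomputes the queries that depend on it:

    use std::collections::HashMap;

    // Toy query database: each derived fact (here, the type of an item)
    // is computed on demand and cached, instead of in a fixed pass.
    struct QueryDb {
        types: HashMap<String, String>, // cache: item name -> inferred type
    }

    impl QueryDb {
        fn type_of(&mut self, item: &str) -> String {
            if let Some(t) = self.types.get(item) {
                return t.clone(); // cache hit: nothing recomputed
            }
            // A real compiler would parse/resolve/infer here, recording
            // which queries this one read so that invalidation after an
            // edit stays precise.
            let t = format!("type({})", item);
            self.types.insert(item.to_string(), t.clone());
            t
        }
    }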
I know the code won't stop running, but I wonder how soon it stops being idiomatic. If it's not idiomatic, it's harder to maintain due to unfamiliar style and structure. Does Rust have measures to deal with this issue?
It is not a problem. A lot of processing steps go on before the meat of the work gets done; many new idioms end up boiling away entirely as part of this process. Like, the borrow checker doesn't even know about loops; by the time the code gets there, it's all been turned into plain old gotos. The further you get into the compiler, the simpler of a language it gets, and everything is defined in terms of sugar of the next IR down.
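As a simplified illustration of that desugaring (my sketch, not compiler output): a `for` loop is expanded into a plain `loop` + `match` before the borrow checker ever runs, and MIR then lowers that further into basic blocks and gotos:

    fn main() {
        // Source:
        for x in 0..10 {
            println!("{}", x);
        }

        // Roughly what it becomes (simplified; the real expansion goes
        // through IntoIterator):
        let mut iter = (0..10).into_iter();
        loop {
            match iter.next() {
                Some(x) => println!("{}", x),
                None => break,
            }
        }
    }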
During my time in rust, the major changes in idiomatic code have been around Results/Errors, async/futures, and a few macros and syntactic sugar goodies have evolved. None of these evolutions were problematic to migrate to, and all of them were moving in the right direction, IMO.
And while I'm sure the folks who work on these languages are wonderfully intelligent people, let's dispel this notion that you need to be a super genius to implement a compiler or something like that!
It seems magical, like one of the hardest things you could program-- but take a look through crafting interpreters, if you will: http://craftinginterpreters.com/
"Nothing is particularly hard if you break it down into small jobs." - Henry Ford
I modified my original question to avoid a potential distraction from what I want to talk about. Thanks!
It's up to Rust's compiler to verify all the contracts made, because in order for the binary to be "zero-cost", none of those checks are done at runtime. If you looked at the output assembly, you would see what looks like very carefully written code that shows no explicit signs of protecting itself. I.e., there are no swaths of boilerplate assembly doing borrow checking, out-of-bounds checking, etc.
The canonical example is iterator chains with complex logic compiling down into vectorised and unrolled loops. Powerful logical abstractions are used by the compiler to generate code that does what it says to do without the runtime cost of closures or whatever other logical but not mechanically necessary things you have.
Iterators can't go out of bounds, so the compiler can elide those checks. There are still some runtime costs at the intersection of safety, ergonomics, and performance. Bounds checking, overflow checking. But they have escape hatches and are relatively rare in the language. Most things do compile out.
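A minimal sketch of the kind of code being described (mine, not from the thread):

    // Iterator version: no indexing, so there are no bounds checks to pay
    // for or elide; in release builds this typically becomes a vectorized,
    // unrolled loop.
    fn sum_squares(v: &[i32]) -> i32 {
        v.iter().map(|x| x * x).sum()
    }

    // Indexed version: each v[i] implies a bounds check, though here the
    // optimizer can usually prove i < v.len() and remove it anyway.
    fn sum_squares_indexed(v: &[i32]) -> i32 {
        let mut acc = 0;
        for i in 0..v.len() {
            acc += v[i] * v[i];
        }
        acc
    }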
Here is a chart from last September showing where the time goes in compiling a large Rust codebase (rustc itself):
(Scroll down to the large horizontal bars once dependencies have been built.) (Sorry if GitHub is down at the moment; try later if it doesn't load.)
The blue part of each bar is time in the frontend, the purple part is time in LLVM. The largest bar (rustc) spans 105 seconds in LLVM out of 140 total, or 75% in LLVM. Many of the subcrates are even more dominated by LLVM time, for example look at rustc_metadata or rustc_traits where >95% of compile time is spent in LLVM.
A magic instant backend is unrealistic, so Rust will need to move some of the current backend work to frontend work for things that can be done more efficiently on the frontend. But the fact remains that there is an opportunity for big improvement from a much faster backend.
I'm basically a layperson when it comes to this topic, but that's my understanding of one of the potential benefits of a different backend.
For example, a bit over 75% of the time needed to compile the regex crate can be attributed to codegen- and optimization-related events, with a bit over 64% of that time spent in LLVM-related events specifically. Granted, I'm not certain whether this is a release or debug build, but it does show that there is room for significant wins by switching backends.
As for why C can compile quickly with an LLVM backend while Rust can't, I'm not sure. I've read in the past that rustc generates pretty bad LLVM IR to pass to the backend, and it takes time for LLVM to sort through that, but there's probably some other factors in there too.
Part of the reason LLVM runs so much faster on C than on Rust is that Clang is smarter about generating less/better IR from the start, so LLVM's optimizer has less of a hole to dig itself out of.
Clang is in a worse spot than rustc is in terms of emitting good LLVM IR, since it has no IR aside from the AST. By contrast, rustc has MIR which is more amenable to optimizations. At this point I'm fairly sure the problem is just that C code naturally generates less IR than Rust code does. All those function calls that go into iterators, array indexing, containers, etc. etc. add up quick.
MIR optimization can help close the gap, but that still involves generating "garbage" that takes extra time to codegen (in debug) or optimize out (in release).
Being smarter about generating IR, whether MIR or LLVM IR, is still an area rustc has a lot of room to improve, even given Rust's idioms. E.g. stuff like this: https://github.com/rust-lang/rust/issues/69715
(My screen is small so it's tough for me to read these results, to be honest...)
Also of note, this blog post isn't speculation; they posted numbers from actually doing it.
Their secret? Multiple backends with different kinds of optimizations.
You don't need to compile for the ultimate release performance when in the middle of compile-debug-edit cycle.
You don't need to compile your program in one go using GHC's LLVM backend; many times a GHCi session is more than enough.
You can't use C's casts (undef for out of range float -> int conversions, for example), arithmetic (undef for signed overflow), or shift operators (implementation-defined behavior for signed right shifts, undefined behavior for left shifts into the signbit or shift counts not in [0, n)). You can work around these by defining functions with the semantics that your language needs, but they get gross pretty quickly (they are both much more verbose and more error-prone than having an IR with the semantics you really want, and they require optimizer heroics to reassemble them into the instructions you really want to generate). Alternatively, you can use intrinsics or compiler builtins, but then you're effectively locking yourself to a single backed anyway, and might as well use its IR.
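To make that concrete, here is a sketch (mine, written in Rust for brevity rather than as emitted C) of the kinds of fully-defined semantics a frontend typically wants, none of which map directly onto C's operators:

    // Shift with a defined result for every count: mask the count rather
    // than inheriting C's UB/implementation-defined cases.
    fn ashr(x: i32, n: u32) -> i32 {
        x >> (n & 31)
    }

    // Float -> int with saturation and NaN -> 0 (Rust's `as` cast has
    // exactly these semantics today; C's out-of-range cast is undefined).
    fn f64_to_i32(x: f64) -> i32 {
        if x.is_nan() {
            0
        } else if x >= i32::MAX as f64 {
            i32::MAX
        } else if x <= i32::MIN as f64 {
            i32::MIN
        } else {
            x as i32
        }
    }

    // Two's-complement wrapping addition; C's signed `+` is UB on overflow.
    fn wadd(a: i32, b: i32) -> i32 {
        a.wrapping_add(b)
    }

Written as C helper functions, each of these needs branches or bit tricks that the optimizer then has to pattern-match back into single instructions, which is exactly the "optimizer heroics" problem mentioned above.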
The issues around memory models (especially aliasing, but also support for unaligned access, dynamic layouts, etc) are worse.
Even LLVM IR is too tightly coupled to the semantics of C and C++ to be easily usable as a really generic IR for arbitrary languages (Rust, Swift, and the new Fortran front end have all had some struggles with this, and they're more C-like than most languages). C is much worse in this regard.
The behavior of shift operations on signed integers will be fixed in C++20 and C2x, as part of the effort to require two's-complement representation. It is a massive potential source of UB in currently standardized C and C++.
All the other problems listed remain.
* It's missing several useful operators, such as classic bit manipulation (count trailing zeros, byteswap), or even 8- and 16-bit arithmetic. Checked arithmetic is another useful one that's not present (or even really possible in C's ABI); see the sketch after this list.
* Signed integer overflow is UB.
* Utterly no support for SIMD types.
* Proper IEEE 754 floating-point control is kind of spotty, although it tends to be as bad or worse in most other languages.
* ABI control is poor. You can't come up with any way to return multiple register values, for example.
* Anything that's not a vanilla function isn't supported. No uparg function support (required for Pascal), multiple entry points (required for Fortran), or zero-cost exception handling (required for C++). Hell, even computed goto isn't actually supported.
And all of this is assuming you have strong control over how you expect implementation-defined behavior (e.g., sizeof(int)) to work.
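For concreteness, a sketch (mine) of what a few of the missing pieces above look like in a language that has them; each lowers to a short fixed instruction sequence via compiler intrinsics rather than a library call:

    // Bit manipulation C has no operators for: count trailing zeros, byteswap.
    fn bits(x: u32) -> (u32, u32) {
        (x.trailing_zeros(), x.swap_bytes())
    }

    // Checked arithmetic: result plus an overflow flag. Returning this pair
    // cleanly is what C's ABI makes awkward.
    fn checked(a: i32, b: i32) -> Option<i32> {
        a.checked_add(b)
    }

    // Multiple values returned in registers, no out-parameters needed.
    fn divmod(a: u64, b: u64) -> (u64, u64) {
        (a / b, a % b)
    }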
But even the "better C" languages tend to not really attempt to expand C structurally. The changes amount to fixing the egregious semantics (fixed-size types, no int-promotion, define signed overflow, etc.), add vector types and other operators, maybe tweak ABI a little bit, and add a whole lot of syntactic sugar. And those languages that explore beyond C's limited structural repertoire do so at the cost of C's specificity.
That said, ever since the last time someone asked me this kind of question, I've been trying to design a portable assembly language.
>> I've been trying to design a portable assembly language.
Couldn't something like Forth fulfill this role?
Do you mind expanding on this or pointing me to places where I can read more?
LLVM's floating point instructions assume that there is no floating point environment, and there's no real facility to indicate that floating point instructions might be affected by it. To remedy this, they've been working on adding constrained floating point intrinsics.
More specifically, they assume the environment is set to the default rounding mode (round-to-nearest), all exceptions are masked, and no one will ever care about sticky bits.
It looks like the only major compiler that actually supports IEEE 754 correctly is icc. MSVC, gcc, and clang all optimize floating-point operations without considering if the dynamic environment is the same.
On a more theoretical level there has been some research on what a better intermediate language would look like. One project I found interesting was Mu VM, which offers some niceties for compiling languages with a garbage collector.
IRs should be terse, simple and dumb. I'm not sure any "real" programming language fits that.
When you have a higher level language with more accurately defined semantics, running it all through C would risk introducing undefined behaviour.
With an IR you can control and define the semantics more closely to what your language needs.
No, it wouldn't. When you target C you need to write a proper backend for its abstract machine, rather than naively rewriting code, of course.
The C abstract machine is a fine IR, especially in the later editions of the standard.
Outside of the language-specific front end, compilers generally have no knowledge of the programming language itself. There is no technical advantage to transforming Rust into C when it comes to the middle and back ends, which form the bulk of the compiler.
There are no language-specific optimization opportunities. There are, of course, restrictions on what you can do in some languages that eliminate optimization opportunities, but you're not suddenly going to be able to take advantage of those opportunities by transforming your code into a language that lacks the restrictions, because then you change the semantics of your code. (For example, Rust's debug-mode integer addition panics on overflow; translating it to C's `+`, which is undefined on signed overflow, would change what the program means rather than unlock a valid optimization.)
There is a key one: the ability to use any C compiler out there (including proprietary ones). This allows you to target all platforms out there.
Plus very few platforms have only support for C and nothing else, unless we are speaking about esoteric embedded CPUs.
Then you can also add custom CPUs and systems, FPGAs, etc. Those are way more rare, but still something people use daily in some industries.
This "C is faster than C++" is a bit dated by now.
"I said C++ is not faster than C" implies that C++ compilers don't beat C compilers, which as many in HPC, HFT and GPGPU computing domains know is false for years now, and no restrict doesn't help that much against template metaprogramming and constexpr.
Template metaprogramming and constexpr don't make code faster in HPC or GPGPU; they help reduce the redundancy of your code, for example if you want a generic algorithm over float, double, int, complex.
What helps speed is being able to control memory allocations and having the tools to place the data your kernel requires in registers, L1 cache or L2 cache as needed (and similarly for GPUs).
On current architectures, what is hard to optimize is memory and data movement: if your data is in the wrong place or not prefetched at the right time, that will be literally 100 times more costly than an addition saved by constexpr.
"Scientific Computing: C++ Versus Fortran" (1997)
"Micro-Optimisation in C++: HFT and Beyond"
"The Speed Game: Automated Trading Systems in C++"
"When a Microsecond Is an Eternity: High Performance Trading Systems in C++"
Anyway, I stand by what I say and I'm backed by my high performance code:
- Writing matrix multiplication that is as fast as Assembly, complete with analysis and control of register allocation, L1 and L2 cache tiling, and avoidance of TLB cache misses:
- Code, including caveat about hyperthreading: https://github.com/numforge/laser/blob/master/laser/primitiv...
- The code is all pure Nim and is as fast as or faster than OpenBLAS when multithreaded; caveat, the single-threaded kernels are slightly slower, but it scales better on multiple cores.
- I've also written my own multithreading runtime. It scales better and has lower overhead than Intel TBB. There is no constexpr; you need type erasure to handle everything people can use a multithreading runtime for. Same comparison on GEMM: https://github.com/mratsim/weave/tree/v0.4.0/benchmarks/matm...
- More resources on the importance of memory bandwidth: optimizing convolutions https://github.com/numforge/laser/wiki/Convolution-optimisat...
- Optimizing matrix multiplication on GPUs: https://github.com/NervanaSystems/maxas/wiki/SGEMM, again it's all about memory and caches optimization
- Let's switch to another domain with critical perf needs: cryptography. Even when the bounds of iterating over a bigint are known at compile time, compilers are very bad at producing optimized code; see GCC vs Clang https://gcc.godbolt.org/z/2h768y
- And crypto is the one thing where integer templates are very useful since you know the bounds.
- Another domain? VM interpretation. The slowness there is due to function call overhead and/or switch dispatching and not properly using hardware prefetchers. Same thing: C++ constexpr doesn't help, it's lower-level. See resources: https://github.com/status-im/nimbus/wiki/Interpreter-optimiz...
Also all the polyhedral research, and deep learning compiler research including the Halide compiler, Taichi, Tiramisu, Legion, DaCE confirm that memory is the big bottleneck.
Now, since you want to move past theory, and you mentioned HPC: pick your algorithm; it could be matrix multiplication, QR decomposition, Cholesky, ... Any fast C++ code (or C, or Fortran, or Assembly) that you find will be fast because of careful memory layout and use of all levels of caches, not constexpr.
If you have your own library in one of those domains I would be also very happy to have a look.
As a simple example, let's pick an out-of-place transposition kernel to transpose a matrix. Show me how you use constexpr and template metaprogramming to speed it up.
Here is a detailed analysis of the impact of 1D tiling and 2D tiling: https://github.com/numforge/laser/blob/master/benchmarks/tra...; throughput can be increased 4x with proper usage of memory caches.
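For readers following along, a minimal sketch of the 2D tiling being described (my code, not the linked benchmark):

    // Transpose in TILE x TILE blocks so that both the reads from src and
    // the writes to dst stay cache-resident, instead of one side striding
    // through all of memory.
    const TILE: usize = 32; // tunable; chosen so a pair of blocks fits in L1

    fn transpose(dst: &mut [f64], src: &[f64], rows: usize, cols: usize) {
        assert_eq!(src.len(), rows * cols);
        assert_eq!(dst.len(), rows * cols);
        for ii in (0..rows).step_by(TILE) {
            for jj in (0..cols).step_by(TILE) {
                for i in ii..(ii + TILE).min(rows) {
                    for j in jj..(jj + TILE).min(cols) {
                        // dst is the cols x rows transpose, row-major
                        dst[j * rows + i] = src[i * cols + j];
                    }
                }
            }
        }
    }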
I guess that is why NVidia has spent 10 years doing hardware design to optimize their cards for C++ execution.
Apparently that was wasted money, they should have kept using C.
I switched to practical applications and walk the talk showing my code, and then you back off and want to go back to opinions.
I see now that you want me to back myself up with experts, since reproducible code and runnable benchmarks are not enough.
Apparently you recognize Nvidia as an expert so let's talk about CuDNN where optimizing convolution is all about memory layout, source: https://github.com/soumith/convnet-benchmarks/issues/93#issu... and it's not about C vs C++ vs PTX.
Or let's hear about what Nvidia says about optimizing GEMM: https://github.com/NVIDIA/cutlass/blob/master/media/docs/eff..., it's all about memory locality and tiling.
Or maybe Stanford, the US government and Nvidia Research are also wrong when pouring significant research in Legion? https://legion.stanford.edu/
> Legion is a data-centric parallel programming system for writing portable high performance programs targeted at distributed heterogeneous architectures. Legion presents abstractions which allow programmers to describe properties of program data (e.g. independence, locality). By making the Legion programming system aware of the structure of program data, it can automate many of the tedious tasks programmers currently face, including correctly extracting task- and data-level parallelism and moving data around complex memory hierarchies. A novel mapping interface provides explicit programmer controlled placement of data in the memory hierarchy and assignment of tasks to processors in a way that is orthogonal to correctness, thereby enabling easy porting and tuning of Legion applications to new architectures.
Are you saying they should have just called it a day once they were done with C++?
Or you can read the DaCE paper on how to beat CuBLAS and CuDNN: https://arxiv.org/pdf/1902.10345.pdf; it's all about data movement. See section 6.4, Case Study III: Quantum Transport (optimizing transistor heat dissipation): Nvidia's strided matrix multiplication was improved upon by over 30%, and that part is pure Assembly; the improvement came from better utilizing the hardware caches.
But then, since you saw it was a losing battle going down that path, you pulled the hardware rabbit trick out of the magician's hat.
So we moved from the "C++ is not faster than C" assertion to memory layouts, hardware design and data representation.
Now you are even asserting that it's not about C vs C++ vs PTX, and going down quantum transport lane?
There is no C vs C++ issue. You keep saying that constexpr and template metaprogramming matter in high performance computing and GPGPU; I have given you links, benchmarks and actual code showing that what makes the difference is memory locality.
Ergo, as long as your language is low-level enough to control that locality, be it C, C++, Fortran, Rust, Nim, Zig, ..., you can achieve speedups of several orders of magnitude, and that control is absolutely required for high performance.
Constexpr and template metaprogramming don't matter in high performance computing, prove me wrong, walk the talk, don't drink the kool-aid.
There are plenty of well studied computation kernels you can use: matrix multiplication, convolution, ray-tracing, recurrent neural network, laplacian, video encoding, Cholesky decomposition, Gaussian filter, Jacobi, Heat, Gauss Seidel, ...
Regarding features that may not be properly optimized, there are exceptions and move semantics I guess, but exceptions are often avoided like the plague anyway and code can be refactored to get the same effect as move semantics.
Am I missing something?
(C itself is not specified very thoroughly, but a given C implementation is, in the sense that it only does one thing for a given line of code.)
If by speeding up you mean compile times and not runtime behavior, then there's also an unstable compiler flag that allows adding specific LLVM passes.
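(If I remember right, it's along the lines of `rustc -C passes=<extra LLVM passes>`, and nightly's `-Z time-passes` is handy for seeing where the time actually goes; treat the exact spellings as approximate.)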
It would greatly improve the reading experience of your blog if you made the footnotes/references clickable.
For example when you say:
> I’ve taken the chart from the 2016 MIR blog post
I have to scroll to the end of the page to find the blog post (and then scroll back to resume reading). If the reference were clickable it would be great. It would be even better if [MIR blog post] were an actual link itself.
E.g. LLVM output is A, but the new one is B; how do they deal with different results between backends?
On a small project, personally I use --release sometimes during development because the compile time doesn't matter that much and the resulting executable is much faster: if I don't use --release I can get a misleading sense of UX during development.
I had a graph traversal program written in Python. I ported it to Rust, and the runtime was identical -- 68.4 seconds, down to the tenth of a second. (Kinda blew my mind -- I had to triple check that I was running and timing what I thought I was!) I had a bit of a crisis of faith.
I poked at it a few times over the next week, then finally got on the IRC channel and quickly received the advice mentioned above. Same input, with --release: 6.2 seconds.
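A middle ground worth knowing about (my note, not the parent's): Cargo also lets you raise the optimization level of debug builds in Cargo.toml, which keeps most of the compile-speed win while making the binary much less pathologically slow:

    [profile.dev]
    opt-level = 1   # some optimization, still reasonably fast builds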
As a developer I usually have a pretty powerful machine, and I've found that debug mode is a good way to approximate slow computers, and something that is unbearably slow in debug will bother some users later on.
The performance degradation might not be even, but in my experience it generally approximates a slower system pretty well, without having to use one. You can deal with the few edge cases individually.
If you really care about performance on slower computers, then at some point you'll need to use one for real.
But at least this way you have a fast feedback loop.