That is, they are compiling C with GCC, while Rust uses LLVM as its backend. Compiling their C code with clang reveals that it, too, is slightly slower than the C code compiled by GCC, and no faster than Rust.
The actual conclusion here is that the GCC toolchain is slightly faster than the LLVM toolchain for this particular use case, but that isn't really news - it happens all the time.
If one wants to focus this on Rust, then the downside of Rust vs. C here is that C has a GCC toolchain that happens to be faster for this use case, while Rust does not.
To answer your question: Yes, that particular benchmark was done with GCC and yes, it would have been better to use clang/LLVM for a more fair comparison.
We of course also compared gcc with clang/LLVM and found that clang was 0.8% slower, so it does not explain the difference entirely.
> We of course also compared gcc with clang/LLVM and found that clang was 0.8% slower, so it does not explain the difference entirely.
I think it would be quite useful and interesting to figure out why there is still a small difference between clang and Rust. There might be some low hanging fruit in how Rust generates LLVM-IR that could be fixed.
Another thing to point out: our 6% - 11% difference is already quite low. Netbricks [1] has a similar comparison between a Rust and a C network function (only the NF, driver in C in both cases) and they find 14% (LPM) to 20% (synthetic NF) difference.
Fun fact: we've a system where Rust is faster than C despite still using more instructions, so yeah, neither instructions nor cycles tell the whole story...
> Fun fact: we've a system where Rust is faster than C despite still using more instructions
I think Bryan Cantrill gave a pretty good explanation of the differences you might see when rewriting something in Rust[1], and one of the things he looked at to see what was going on was Cycles Per Instruction. Instruction count itself means little if the instructions themselves have very different performance profiles and require different numbers of cycles to complete.
Edit: You are tracking and reporting that, so it's not like I'm telling you anything you don't know. I still think the included article is well worth reading though.
Not only that, but your tight loop that takes up most of the execution time might have fewer instructions, whereas the rest of the program might have more, making the overall total higher but "instructions per time spent in function" lower.
Since the resources of that paper are no longer available, do you happen to know if they used Clang for compiling their C code?
A 10% perf difference between LLVM and GCC for different applications is on the order of what the hundreds of Phoronix benchmarks show every time a new version of these toolchains is released for general applications (i.e. not for micro-kernels).
The Rust toolchain distributed by the Rust project bundles an LLVM version that's very close to LLVM master. IIRC, this version has some patches on top to allow Rust to query some backend information, but I'm not sure if this information is still up-to-date.
Rust can, however, work with an external LLVM. When installing Rust through a package manager on Linux, typical distros like Debian, Red Hat, Ubuntu, etc. configure Rust to use the system-wide LLVM binaries. So if you have Debian with, say, LLVM 8.0 installed, you can just compare the installed clang 8.0 with the Rust version installed by the distro.
That will not compare the "bleeding edge" clang vs Rust toolchains, but it would be a fair comparison of Clang vs Rust at a particular point in those toolchains' lives.
I think that comparing the best implementation of Rust vs the best implementation of C is an interesting thing to compare.
But if that's what this post is comparing, most of the content is probably incorrect, because the main reason the C binaries perform better is that the backend of the C implementation used is better than the Rust backend _for this particular application_. This isn't really news. There are hundreds of benchmarks comparing C and C++ using the GCC and LLVM backends, and each is slightly better than the other, depending on the application. You don't even need to write code in two languages to show this.
The authors appear to be aiming to compare how language differences affect the quality of the generated code. For that, you want to compare two equivalent-quality language frontends using an equivalent-quality optimization pipeline and code-generation backend.
This is, in general, not feasible. But for all languages sharing the same LLVM or GCC backend, doing this comparison is trivial, and would limit the differences to either the frontend implementation, or the language itself.
That actually depends. In some cases I've seen Clang pull off superhuman feats of vectorization, especially on ARM64. You paste code into the Godbolt Compiler Explorer, and GCC produces fairly readable assembly while Clang uses every trick in the book to vectorize. In those cases Clang can be substantially faster. But code needs to be written with vectorization in mind for that to happen (short loops of known size, and a few other restrictions like that). Most of the time it's a few percent behind GCC.
I similarly found that C was slightly faster than Rust in a microbenchmark called LPATHBench: https://gist.github.com/bjourne/4599a387d24c80906475b26b8ac9... That was with clang, which performed significantly better than gcc. clang appears to generate very good code for tree traversals.
But this was way back in 2016. Numbers from microbenchmarks that old are hardly relevant anymore.
Yeah, their initial explanation is that it's the "safety features", but I was under the impression that literally everything related to Rust's safety happens at compile time. Their big selling point is the "zero cost abstractions".
Bounds checking is a “zero cost abstraction” in the sense that it is both “pay for what you use” and “you couldn’t implement it yourself any better.” That said, the implementation is not as fast as it theoretically can be, because the compiler isn’t always removing bounds checks that are provably true.
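For a concrete picture (a minimal sketch; whether the elision actually fires depends on the compiler and LLVM version), hoisting a single assert above a loop often lets LLVM prove the per-iteration checks redundant and drop them:

```rust
/// Sums the first `n` elements of `data`.
fn sum_prefix(data: &[u64], n: usize) -> u64 {
    // One up-front check. After this, LLVM can usually prove that every
    // `data[i]` below is in bounds and elide the per-iteration checks;
    // without the assert, each index may keep a branch to the panic path.
    assert!(n <= data.len());
    let mut total = 0;
    for i in 0..n {
        total += data[i];
    }
    total
}
```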
When const generic lands, you will be able to make those assertions! A basic implementation is in nightly.
This is taking zero-cost abstraction to the extreme, and I think it waters down the concept to the point that it almost isn't useful.
One can argue that any feature is a zero-cost abstraction for the exact set of trade-offs it has (in the limit: "you couldn't implement a feature that executes this exact sequence of instructions 'mov ...' any better if you did it by hand").
I think focusing on a hypothetical perfect coder is more useful, someone who never fails an assertion or bounds check. This person does not need any safety/convenience features (unlike us real humans), but they can still use some such features without a penalty. Bounds checks are not one of them. (But... At least they don't have much of a pervasive/whole program cost.)
Maybe! I think that the concept is a bit more focused than most people think it is. I can appreciate this perspective though. It's a fuzzy concept, so there's decent room to poke and prod around the edges :)
Yeah, that's fair. I think there's a strong argument that bounds checking satisfies the "don't pay for what you don't use" rule (although one could say that every slice having to carry around its length, doubling its size, is a pervasive cost, but that's fairly weak, given the other things it enables), but I think the other rule is much, much less clear.
You're misunderstanding what "you don't pay for what you don't use" means. It means that a program that does not contain [] in it does not include code for []. There's no "mandatory runtime" code for []. It doesn't mean you can use [] in some special way and somehow get less code.
And even if we were talking about your understanding, I don't see why get_unchecked would be disqualified. unsafe is a part of the language for a reason.
I know, the meaning of "zero cost abstraction" is really "zero cost to not use the abstraction."
But that's like saying if you don't use rust you don't pay for it. Just because there is the unsafe escape hatch in the language, you don't get to label the language as zero cost abstraction. Because practically speaking there is a lot more runtime cost to rust than people will tell you.
Many langauges (almost anything without a runtime) meet that definition of zero cost abstraction then.
If you use a Rust construct that does bounds-checked accesses, you're explicitly using bounds checks. You're not paying for anything beyond the bounds checks that you're using.
If you use a Rust construct that elides the bounds checks (e.g. get_unchecked), there are no bounds checks, and you don't pay any cost for the fact that Rust supports a bounds-checked subscript operator.
If you want to compare Rust's subscript operator, do so with an explicitly bounds-checked version of the C code. That's the appropriate comparison for "zero-cost abstraction", because the whole point of "abstraction" is that it's not a change to the observed runtime behavior; it's just a language abstraction.
I don't entirely agree. As an analogy, suppose that you want to pass a pointer to an object as a function argument. As the programmer, you expect the object will necessarily remain allocated as long as the function uses it, since it's kept alive elsewhere, but you might have made a mistake. Before Rust, your options would be:
1. Use a C-style raw pointer. This has no overhead but is unsafe.
2. Use a reference counted or garbage collected pointer. This provides safety but is a non-zero-overhead abstraction. In other situations, reference counting can be necessary to manage unpredictable object lifetimes, but in this case we're assuming it would only be used to guard against mistakes, so the cost represents overhead. Even if some language implements reference counting just as fast as you can implement reference counting in C, that doesn't make it zero-overhead.
But Rust adds a new option:
3. Use a borrow-checked reference. This is safe, yet zero-overhead compared to the unsafe option, as long as your usage pattern can be expressed in terms of Rust lifetimes.
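A minimal sketch of the three options (the function names are mine, purely for illustration):

```rust
use std::rc::Rc;

// Option 1: raw pointer. No overhead, but the caller must guarantee
// the object outlives the call.
unsafe fn read_raw(p: *const u32) -> u32 {
    *p
}

// Option 2: reference counting. Safe, but every clone/drop pays a
// count update, even when the lifetime was statically predictable.
fn read_rc(p: Rc<u32>) -> u32 {
    *p // the Rc is dropped here, decrementing the count
}

// Option 3: borrow-checked reference. Safe, and compiles to the same
// code as option 1; the lifetime is verified entirely at compile time.
fn read_ref(p: &u32) -> u32 {
    *p
}

fn main() {
    let x = Rc::new(7u32);
    unsafe { assert_eq!(read_raw(Rc::as_ptr(&x)), 7) };
    assert_eq!(read_rc(Rc::clone(&x)), 7);
    assert_eq!(read_ref(&x), 7);
}
```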
Going back to array indexing, the analogous situation is that you want to access an index that you, the programmer, expect to always be in bounds. Option 1 corresponds to unchecked indexing, and option 2 corresponds to bounds-checked indexing. But Rust brings nothing new in this area: there is no option 3.
Yet an option 3 is theoretically possible: safe unchecked indexing if you can prove to the compiler that the index can never be out of bounds. This can be done in languages with dependent types or built-in proof systems. (It can even be done in Rust with a certain neat hack [1], but with a lot of limitations.)
I'm not saying Rust is bad for not having dependent types. As far as I know, it's an open research problem how to make those features feel ergonomic and easy to use, even to the extent that Rust lifetimes are (i.e. not totally). And on the flipside, bounds checks usually don't cause very much overhead in practice, thanks to CPU branch prediction, plus the optimizer can sometimes remove them.
But I'd say that Rust's choice not to go there (...yet...) justifies calling its current approach a non-zero-overhead abstraction.
It's literally the definition. And very little qualifies. An abstraction is zero-cost if there's no runtime penalty, including if the code you'd write by hand to accomplish this goal is no better than what the compiler synthesized for you via the abstraction.
If the use of the abstraction forces your data model into a suboptimal representation, it's not zero-cost. If the compiler emits code that's worse than the straightforward manual implementation, it's not zero-cost. If the emitted code involves runtime checks that aren't necessary without the abstraction, it's not zero-cost.
For example, reference-counting is an abstraction over tracking the lifetime of your object. In some cases (where ownership isn't clear) the retain count introduced by reference counting is required and therefore not a cost, but in most cases the point at which the object ends up freed is actually predictable, and therefore the cost of all the reference counting is something that would have been avoided without the abstraction. Therefore, reference-counting as a replacement for manual alloc/free is generally not zero-cost.
Or how about iteration. Rust has an external iterator model (where you construct an iterator and run that, rather than passing a callback to the collection). In most languages an external iterator model is a non-zero cost, because you need to construct the iterator object and work with that, which is a bit of overhead compared to what the collection could do with an internal iterator model. In Rust, external iterators are frequently zero-cost because they usually get inlined. So you can write a high-level loop that includes a filter and map and the end result is very frequently just as good as (if not better than) what you'd have done if you wrote the iteration/filter/map by hand without Rust's iterator abstraction.
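A sketch of the kind of pipeline meant here (easy to spot-check on godbolt.org; the exact output varies by compiler version):

```rust
// High level: external iterator with a filter and a map...
fn sum_even_squares(xs: &[u32]) -> u32 {
    xs.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

// ...by hand: the explicit loop. After inlining, both versions
// typically compile to the same machine code.
fn sum_even_squares_by_hand(xs: &[u32]) -> u32 {
    let mut total = 0;
    for &x in xs {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}
```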
I understand what you're saying, but that's not how I understand people to use the term "zero cost abstraction". That term usually refers to things you can use and pay zero cost for using. It does not usually refer to abstractions that impose a cost for use, but no additional cost if you don't use them, as you suggest.
A quick Google search of the top hits for the term seems to align with my understanding.
Maybe there is a term for what you are talking about, like "pay for what you use".
What part of my comment made you think I didn't understand that?
Was it when I wrote: "it really [is] zero cost to not use the abstraction"? Or when I wrote: "if you don't use rust you don't pay for it"?
I understand the concept fine. It is rust's marketing gimmick and not a useful tool to describe the language.
Rust is no more "zero cost" than C++, Fortran, Ada, or Objective C, and even D can make the claim to some extent (you can opt out of the GC and use malloc/free directly). I'm sure there are plenty more I'm missing. If you use the Java GC that doesn't collect (used for real-time programs), even that probably fits the description.
Under your description, an opt-in generational GC is zero cost. You can't have vector access be checked (along with other small performance costs) and then use the excuse that you can write code in a way that doesn't use checked access or pay that feature's cost. I can do that in a lot of languages.
Also, to some extent you should take the ecosystem into consideration, where a large chunk of it goes to great lengths to avoid unsafe, such as using a vector for homogeneous objects to get around pointer semantics. That incurs a huge cost, and it should be counted against the language, because the language makes the other ways too difficult, or the main libraries rely on those techniques. (And if you don't count Rust's ersatz standard library, then you can't count Java's standard library either, since you can always write Java code that doesn't use a GC or extra features - I've done it a few times.)
I saw so much promise in Rust, but it seems to have gone to complete crap.
> What part of my comment made you think I didn't understand that?
This part:
> But that's like saying if you don't use rust you don't pay for it. Just because there is the unsafe escape hatch in the language, you don't get to label the language as zero cost abstraction.
Rust isn't "zero-cost" because of the unsafe hatch; that's completely orthogonal. It's zero-cost because if you don't use a feature you don't pay for it. The fact that you need unsafe to get non-checked subscripting isn't particularly relevant to the fact that using non-checked subscripting in Rust means you're not paying for the existence of checked subscripting.
> Under your description, an opt-in generational GC is zero cost.
You're conflating implementation with semantics. If you have a choice between different allocation strategies that all result in the same observable runtime behavior, using a garbage collector over manual alloc/free is a cost. With manual alloc/free there's no runtime tracking to determine when an object should be freed, it's entirely static. Using a GC dramatically simplifies this from the developer's perspective and avoids a whole class of bugs, but comes with a runtime cost. Meanwhile for single-owner models, Rust's Box type has no runtime overhead compared to alloc/free, since there's no runtime tracking, the alloc and free points are determined statically by the compiler.
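A tiny sketch of that last point:

```rust
fn main() {
    // One heap allocation, as with malloc.
    let b = Box::new([0u8; 64]);
    println!("{}", b[0]);
    // No runtime tracking: the compiler inserts the free right here,
    // at the statically known end of `b`'s scope.
}
```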
C++ popularized the term. And if C++ is commonly described as having designed its features to be zero-cost (or you only pay for what you choose), there's nothing wrong with Rust describing the same concept with the same term.
What languages don't have a runtime? Even C has one (albeit a very small one). Nobody labels Rust the language a zero-cost abstraction (that'd be silly - there is a cost to learning it!). Rather, it tries to provide zero-cost abstractions. A great example is that there are no green threads in Rust: they were consciously removed because they penalized Rust performance regardless of whether people used them or not.
This is exactly how you'd write the feature, by hand, if you were implementing the language.
That the optimizer could, but does not, do as much optimization as it theoretically can, means that it has more work to do. But that's different than the feature being written in a sub-optimal way.
I don't know if you edited your comment, or if it was just my pre-coffee reading comprehension missed the part "because the compiler isn’t always removing bounds checks that are provably true."
This got me wondering why we don't include dedicated hardware for bounds checking in CPU architectures. Intel made an attempt with MPX but from a brief glance on Wikipedia it looks like a fail (slower than pure software approaches like AddressSanitizer, among other issues).
I really like dependent types as a concept! But they're not really in any mainstream languages, so I haven't had much opportunity to play around with them. :(
There are! It is done in compiled Lisp and Scheme runtimes such as SBCL to remove provably redundant bounds checking. For example, in (when (< x 4) (when (>= x 0) (...))) SBCL will infer that if x is an integer, it must have one of the values 0, 1, 2 or 3. The knowledge is useful because if x is used as the index of an array containing 4 or more items, the bounds checking can be elided.
FWIW it isn't hard to write safe Rust code that elides bounds checks, but one needs to write proper code.
If one pinpoints a performance issue due to bounds checks, it is also trivial to disable them soundly by writing proper `unsafe` code where it makes a difference.
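For instance, a sketch of what "proper `unsafe`" tends to look like: check the invariant once, document it, and keep the unsafe block minimal:

```rust
fn sum_pairs(a: &[u32], b: &[u32]) -> u32 {
    // A single up-front check instead of two bounds checks per iteration.
    assert_eq!(a.len(), b.len());
    let mut total = 0;
    for i in 0..a.len() {
        // SAFETY: i < a.len(), and a.len() == b.len() by the assert above.
        total += unsafe { a.get_unchecked(i) + b.get_unchecked(i) };
    }
    total
}
```

(In this particular shape, iterating with `zip` would get the same effect safely; the escape hatch matters in the cases where the optimizer can't see the invariant.)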
Bounds checks are always on. Integer overflow checks are disabled in release mode. Bounds checks are necessary for memory safety, whereas integer overflow checking isn't.
There are unsafe methods on Vec that allow unchecked access to the underlying array, but you need to write them in an unsafe block (some performance-sensitive code does this).
The explanation I heard from a Rust core dev (explaining why Rust tends to be slightly slower than C++) is that the borrow checker pushes programmers toward a slightly different coding style: where C programmers tend to pass references around, Rust programmers tend to make copies to avoid fighting the borrow checker.
Passing the smaller reference will be faster in single-threaded code. In multi-threaded code I'd expect making a thread-local copy to be a win, and indeed multi-threaded Rust programs tend to be quicker than their C/C++ counterparts.
I don't know... while it may not be the case here, I wouldn't be exactly shocked if the fact that Rust restricts your ability to do some things ends up having a performance penalty, even if any safety checks themselves happen at compile time.
The very different load/store amounts, higher rate of instructions being retired, and seemingly better L1 behaviour seem interesting though; does compiling their C code under clang yield the same behaviour, or is it closer to GCC's?
Currently Rust doesn't make proper use of its aliasing rules because of bugs in LLVM. This can reduce performance by forcing unnecessary memory accesses. As an experiment you could try the `-Zmutable-noalias=yes` option, which should enable these optimizations (but exposes you to those bugs). See masklynn's comment: https://news.ycombinator.com/item?id=20948779
In general Rust suffers a bit because LLVM is primarily a C++ compiler backend. For example, infinite loops without side effects are UB in C++ but are supposed to be well defined in Rust. This is even an issue for C, because `while(1) {}` is valid in C but UB in C++. https://github.com/rust-lang/rust/issues/28728
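The problematic shape is as small as this (a sketch of the issue tracked above):

```rust
// Rust intends this to be a well-defined way to park forever. LLVM's
// C++-derived forward-progress assumptions have historically allowed
// such side-effect-free loops to be miscompiled (rust-lang/rust#28728).
fn park() -> ! {
    loop {}
}
```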
Yeah, enabling noalias optimizations is on our todo-list. I think that’s one of the most interesting performance features of Rust. It could be the thing that makes Rust faster than C, ultimately.
TBF it wasn't just LLVM: in https://github.com/rust-lang/rust/issues/54878 "nikic" and "comex" provided C test cases which failed in both clang[0] and gcc[1]. Though GCC has since fixed the issue while it remains open on LLVM.
The infinite-loop Rust issue is old, and the `asm ""` fix has been known for a long time. I wonder what is behind this seeming preference for performance over correctness.
C is the wrong language to compare to. More precisely, Rust should be able to beat C, routinely. Presumably it will, when it gets more optimizer attention. With any luck, that improvement can go into LLVM proper, and speed up many other languages besides.
Why should Rust beat C? First, it does not suffer from the pointer aliasing faults C has. Second, const really means const, where in C the compiler has to assume any pointer-to-const might be used to write through. Because alias analysis is ironclad, copies can be elided even if some function borrows a reference to the copy.
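A sketch of what that guarantee buys (modulo whether rustc currently emits `noalias` to LLVM, which is discussed elsewhere in this thread):

```rust
// `a` and `b` are both &mut, so they are guaranteed not to alias: the
// compiler may return the constant 1 without reloading `*a`. The C
// equivalent must reload unless the pointers are marked `restrict`.
fn store_store_load(a: &mut i32, b: &mut i32) -> i32 {
    *a = 1;
    *b = 2;
    *a
}
```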
At a larger scale, improved abstraction tools mean more powerful libraries may be written and used, that would be unavailable to the C coder.
It is routine for C++ programs to be substantially faster than a C program that attempts the same job, even though C++ suffers from the same aliasing flaws C has. In the finance world, you would be laughed out of the room for proposing C for a performance-critical task.
This is why articles about other languages comparing them to some fraction of C performance are amusing. C is a low bar. If you can't beat C, you are Doing It Wrong. In Rust's case, that would be "still doing it wrong"; we may assume fixing this is on the schedule, after various "getting it right" and "compiling faster" goals are met.
> It is routine for C++ programs to be substantially faster than a C program that attempts the same job, even though C++ suffers from the same aliasing flaws C has. In the finance world, you would be laughed out of the room for proposing C for a performance-critical task.
That depends on how you write your C++ programs. Virtual functions, runtime type information, STL, RAII ownership can all be performance hits in C++ when compared to C.
Rust has optimization issues of its own apparently.
C forces you to write code closer to what assembly looks like than Rust or C++. That's good if you can write good assembly, and bad if you cannot. That's particularly bad for writing robust code that doesn't have security vulnerabilities, because the C programmer is required to manually write all checks correctly, while C++ and Rust have abstractions that automatically write the checks for you.
You don't have to use them in those languages if you don't want to.
As I mentioned in the other thread, I think this article is comparing GCC vs LLVM, so the info in the post is kind of moot. But if the authors really believe that Rust bound checks are the culprit of the performance difference, they can just remove them. Takes one line of code, and would show if they are right or wrong.
Indeed, RAII is a zero-cost abstraction compared to running the same destructors manually. On the other hand, C++ makes it really easy to write programs that copy and destroy things when it isn't necessary.
For example, if a function takes a `std::string` as an argument (by value), any string you pass in will be copied into a new allocation, which will then have to be deallocated. That's fine if the function really needs its own allocation – but it might not. In that case you can avoid the copy by changing the argument type to `const std::string &` or `std::string_view` (the latter being new in C++17)... but the difference is subtle enough that even an experienced programmer might not notice the extraneous copy.
Don't believe me? Consider that in 2014, "std::string was responsible for almost half of all allocations in the Chrome browser process"! [1]
(Rust does a better job here by requiring more explicitness if you want to make expensive copies.)
Oh, there's also an issue where the presence of a destructor pessimizes the calling convention for passing and returning objects of that type by value, but only slightly, and the issue will be addressed in the future. [2]
Even if you take the string as `const std::string &`, you will still end up with an implicit string construction if someone passes a `char*`. Sure, you can use std::string_view as the argument type to prevent this simple case, but it doesn't work in all cases.
Consider a std::map<std::string, Blah>, for example: if you have a char* and you want to index into the map, you are also going to end up with an intermediate string construction, because STL associative containers don't have heterogeneous lookup. Note that this was fixed in std::map::find in C++14 [0], but it is still there for operator[].
Given a modern compiler and library, a by-value string temporary can be passed down a chain of calls with no allocations beyond the first, and returned, likewise.
Passing a reference means the optimizer cannot optimize accesses, because it doesn't know what other pointers might be aliasing it. string_view has the same problem.
Quotes about Chromium and Firefox are likewise obsolete. Old code, old coding standards. Neither uses RAII, so they pay a 20-30% runtime penalty. With a modern library, short strings do no allocation. (IIRC Firefox uses 16-bit characters, so they get less.)
That said, if the runtime performance of code trafficking in string objects matters, you are Doing It Wrong.
> Given a modern compiler and library, a by-value string temporary can be passed down a chain of calls with no allocations beyond the first, and returned, likewise.
For calls (as opposed to returns): If you use std::move, yes. Otherwise, no. But using std::move is similar to changing the type in that it requires noticing the problem first.
> Passing a reference means the optimizer cannot optimize accesses, because it doesn't know what other pointers might be aliasing it. string_view has the same problem.
For `const std::string &`, the optimizer has to assume that the string pointer and length could change, which is indeed a problem as it has to keep reloading them if you call, e.g., `c_str()` or `size()` multiple times (with other things in between). For both `const std::string &` and `std::string_view`, the optimizer has to assume that the string data could change, but not the pointer or length, which is much less of a problem because you don't usually repeatedly load the same piece of string data in a loop. Therefore, `std::string_view` is a decent choice.
Passing `std::string` by value does indeed have the advantage that the compiler could theoretically assume the string data is not aliased. But I just checked, and none of Clang, GCC, MSVC, or ICC actually do so.
But moving can pessimize string-passing and -returning. Just write the code in the clearest way possible, and optimize hot paths where it turns out to matter, according to measurements. People have been demonstrated to be very poor at picking which those are, a priori.
> Quotes about Chromium and Firefox are likewise obsolete
Yeah, because neither of those teams had any idea what they were doing, obviously. Even with C++ 11 at their disposal. But somehow, interestingly enough, you do.
> Neither uses RAII, so they pay a 20-30% runtime penalty
Who knows where you pulled that number from.
I guess then, according to you at least, there are no performance penalties to be paid for any C++ language features. Please, continue to impart your knowledge and unreferenced benchmark tests on us all.
RAII protects you from, e.g., leaking memory. But C can just leak memory, and there isn't really anything substantially wrong with that (you can't get UB from leaking memory).
Checking error values, releasing all resources, doing bound checks, and all the other things that both C++ and Rust do by default are more expensive than doing nothing. Sure, they are zero-cost, in the sense that if you were to write C code that does those, the C code wouldn't be faster. But as mentioned, C code is not required to do these.
With C, you have to manually write the code for doing those things. With C++ and Rust, you have to write the code to opt-out of doing those things (with Rust, removing a bound check is a one liner).
It's a matter of language defaults. I know what the better default for me is.
You wrote that in C++ and Rust it is necessary to "opt out" of properly freeing memory. That is a false statement. It is not just incidentally false, it is fundamentally false.
If your constructor allocates memory with `new` and you don't write a destructor, then nothing deletes the memory. You don't need to "opt out". It is hard to guess where you get this idea.
It is bad form to code memory leaks, so normally one doesn't. Instead, one normally allocates in such a way that the automatically-generated destructor frees the memory.
But neither case requires writing more code to "opt in" or to "opt out". It is the same in Rust.
We all know the acronym stands for "Resource Acquisition Is Initialization" - and that in the real world the two often have no reason to be semantically coupled. IOW lots of optimization patterns involve decoupling initialization from allocation, and the RAII meme encourages programmers to codify the coupling because C++'s shortcomings make it attractive.
That is what RAII stands for, but in practice good use of C++ RAII is all about explicitly separating interface, allocation, and ownership, such that you only pay for as much initialization as you need when you need it.
This takes great care which takes time and leads to slow development. I think of C++ as worth it when this level of care is necessary to create a well-performing system.
I would agree that most toy examples don't do a good job of explaining how to use RAII in a careful, high-performing way.
> In the finance world, you would be laughed out of the room for proposing C for a performance-critical task.
This has not been my experience and I'm greatly relieved that I haven't been subjected to such a hostile and destructive working environment in finance as yourself.
The correct response is always: "Show the benchmark numbers and make your case that the trade off is worth it." Then there is discussion on how compelling a case has been made.
> The correct response is always: "Show the benchmark numbers and make your case that the trade off is worth it." Then there is discussion on how compelling a case has been made.
Where do you work that allows you to implement everything twice, once in C and once in C++, for a proper comparison to be made?
(I realize that's not quite what you said, but if we're going with 'benchmark', then presumably those places have already done the benchmarks and decided that C++ is generally superior performance-wise.)
As an example: usually there is a default C++ implementation, because C++ makes things easier for the programmer, so it makes sense if you're at the end of the food chain that requires performance (i.e. you're not using a memory-managed language in the first place, because that would be easier still). So then you benchmark it thoroughly. Get stuck into all the perf numbers and analyse them. You note why you think you can shave X off: the latency is due to i, j and k, measures that look as though they could be improved, so that's probably worth a go. Now you can (at least sometimes), in your lab, replace one part with (say, in this example) C code and see if your perf stat ratios improve the way you think they should. Yes, it is difficult. Yes, there are plenty of times where you see nothing, because the slowness from the code you changed actually manifests in cache misses elsewhere.
Micro-benchmarks sadly have a way of showing nothing much at that end of things. Showing it's no worse/better doesn't even mean it's no worse, if the additional/reduced pressure on caches doesn't result in much in the way of additional/reduced cache spills. So you end up doing micro-optimisations that, each taken alone, don't do much, but when you get them all together you cross a threshold that brings a considerable improvement. I believe SQLite did something like this, although probably in a more restrained fashion, as they have other goals than being the lowest-latency solution against similarly motivated competitors.
Then of course at some point you get a reputation for success so you show the problem and the numbers identifying it and just say "I want to invest X of my time trying Y which I believe has a reasonable chance of helping" And you are told to focus on that and please let them know if you have any more good news. ;-) Sometimes your ideas are genuinely a wash and not worth merging but it is noted that it was worth trying and you're thanked for your efforts.
As somebody else noted in this thread the game has moved beyond much of this kind of thing in some, possibly many instances.
I felt it was a reasonable answer; if you think not, I probably can't help you. I have no idea what you think I'm agreeing or disagreeing with. Did you reply to the wrong comment?
Having spent 15 years in finance, I've never seen C used for performance-critical code. It's always been C++. Some shops use Java or C# with the GC effectively disabled, but those are rare. If you can preallocate all of the memory you need, this may be fine, but not all trading systems have that luxury.
I'm also sorry if your experience has been similarly hostile. This is not a comment about which languages get used in which shops.
Whatever your technology choice you would expect to have to make a case, using evidence, and have that case properly listened to (or why did they hire you?). For performance critical technology you would expect that evidence to include benchmark numbers.
Any other approach to considering technology suggestions, involving "laughing out of the room" in the absence of evidence, is completely unprofessional, very silly, utterly rude, puerile, immature and wholly unacceptable. (And just quietly, on that basis alone, I'd be pretty confident my solutions are faster than any put together by a team that ignores evidence and belittles talent. Again, I say this is not about one technology vs another; this is about "laughing out of the room." Having said that, I've sometimes used C and it has worked really quite well in those particular cases. I can say the same of C++, FWIW.)
I have not encountered any hostility. But if you don't know how to get better performance from C++ than you can get from C and lots of extra work, you probably will not be hired, most places.
It's not about hostility, it's about competence. Some places just demand competence more than others.
The laughter would not be about measurable performance; you might well be able to get C code to go as fast. The laughter would be over your apparent fear of the C++ compiler, and the suggestion to spend their money on a predictably inferior, more expensive, less maintainable solution.
You might as well suggest installing fluorescent lights in the office, or 10baseT network hubs, or CRT monitors. Those would get a jolly laugh, too, completely without hostility, but you still might not be invited back.
This conversation is now bordering on unhelpful. C has uses, some of them compelling. Every finance shop I've worked in uses the Linux kernel and drivers extensively. These are (mostly? all?) written in C. As is git. Nobody laughs. C++ has uses, some of them compelling. For a given case where those uses dominate, nobody laughs. Claiming C is as obsolete as CRT monitors or 10baseT, without any supporting evidence? Yeah, no thanks. Someone could take that the wrong way. Actually, isn't this what Alexandrescu says about C++, because he's all about D nowadays? What Eckel did when he became all about Java (or is he now all Scala?). The last time I saw footage of Stanley Lippman speaking, he was being brutally disparaging of C++ and invoking in his defence Richard Gabriel describing C as "a virus", to make himself look less hostile in comparison to other language warriors and wars. [2]
It just isn't a meaningful conversation without a specific case, with specific requirements, where a specific solution is advanced with appropriate evidence. If in your experience there has never been a case for C over C++ in your own work in the very broad finance domain: great, your experience differs from mine. I've had circumstances where I've made the case, then implemented and released it, and watched the p&l go quite well as a result. I do know how to program in C++ where that is appropriate, and indeed I do. No, really. Also Java, and Python, and (gasp, even) Perl, and... Like any and every experienced programmer, right?
Claiming someone is "scared of the C++ compiler" in the absence of evidence for that is probably not your best comment here. That is actually, IMHO, the very worst thing about C++ as a language: the casual, derogatory, frequently unsubstantiated statements that pervade the surrounding culture. Linus had something to say about it back in the day [1], so it's not like the C++ folk have an exclusive license on that kind of attitude... His tone there doesn't make me laugh.
Two Cases you might like to consider. (1) Optimising the very last cycle matters, how hard you have to work to shake it out, not so much. Maybe you haven't had that, um, pleasure. (2) You have a reason to want to parse and transform the source. (Given (1) above the reason for doing so might be obvious but I probably shouldn't say too much more about it, also you might laugh! ;-)
Maybe in ncmncm's circle they laugh at C developers, but trying to extrapolate that across the finance industry is not credible.
Performance at the sharp end comes from preallocating memory, reducing syscalls, removing PIC (drop the GOT), static linking (enable LTO), using LTO, using simd where possible, hand rolling important functions in assembly, cpusets, reducing network hops, attempting to get the program to fit in L1 cache, putting the code into the switch, moving to fpga, moving to asic.
There's loads more optimizations but almost none of them have anything to do with C++ over C.
> You do all those things with C++ (or, for fpga, asic, etc., without). There is exactly zero benefit to dropping to C first.
That's my point. The performance tuning is generally not between C++ or C hence why you cannot extrapolate your experiences to the rest of the finance industry.
>And, nobody suggested anybody actually laughs at C developers.
You said, "In the finance world, you would be laughed out of the room for proposing C for a performance-critical task."
Then in discussing the nature of the laughter you said "The laughter would not be about measurable performance; you might well be able to get C code to go as fast. The laughter would be over your apparent fear of the C++ compiler, and the suggestion to spend their money on a predictably inferior, more expensive, less maintainable solution."
We have a case in English, as in all languages in its cohort, for discussion of counterfactuals, signaled by "would". If there were anyone proposing C for performance-critical code, they would be laughed at. Happily, none are.
This conversation is no longer of any benefit, but for the sake of you and your co-workers... Saying that C is predictably inferior, more expensive, and less maintainable shows a blatant amount of incompetence about how either language operates. C++ only offers abstractions to (debatably) more easily allow for an object-oriented design for codebases. Semantically, everything that you can do in C++, you can do in C.
There are plenty of features in C++ that are beneficial, particularly the handful of mechanisms that allow you to move away from #defines, but to compare performance between C and C++ simply shows ignorance.
I didn't say there was hostility towards one tech stack or another, I was just stating my observations/experience. In my experience, once a firm has one or two tech stacks, they don't really want to deviate too far from what's entrenched because, well, they need to have the talent on hand to support everything. This becomes exponentially harder if every developer chooses their language/framework of choice. Generally, though, if there's a compelling reason to break from the core competencies of the firm, exceptions may be granted (this was how Python first got its foothold in my first firm - I could easily, e.g. generate, Python wrappers over my C++ APIs that allowed me to do everything my business users wanted while still maintaining compliance control and audits over data changes).
Typically, you can usually tell the age of the trading firm by the tech stack. Older companies are usually on C++ and have millions upon millions of existing legacy code that works, so there's little impetus to rewrite in the latest shiny. Younger firms are more likely to have embraced Java and younger yet, C#. This doesn't always hold true, though, it's just a generalization of my experience.
In my 15 years in finance, I spent 9 years at my first firm, and when I joined it was purely C++, with some Perl to move data around between vendors and systems. There had been some Java, but we had a bad interaction with Sun Microsystems (at the time) and our CEO came down hard and kicked Java to the curb (and we migrated off of Sun/Solaris onto Intel x86 & Linux & C++). Towards the last several years I was at that shop, C#/WPF had been embraced for all new UI work, and even some rewrites of existing MFC apps were in progress. Python had also largely replaced Perl for all new scripting/data munging and had also been embraced by the quants. All of the servers still ran on Linux and were in C++ when I left ~6 years ago. There was even talk of migrating from Sybase on Linux to MS SQL Server. That didn't happen in my time, but from former colleagues I've kept in touch with, they've pushed forward on that. I don't know if they're using SQL Server on Windows or on Linux, now that that's an option.
My current shop is the oldest I've ever worked for (founded in the 70s) and is entirely a Windows/C# shop. They brought me on this year to help with their build-out of Python at the firm (mostly for our quants). I'm actually really enjoying this position: they're new to Python, but I've got 15+ years of experience with it, there's a lot of greenfield development and next to no existing legacy code, so I largely get to drive style and architecture.
I've also worked in finance, and nobody I've known reaches for C++ over C for performance - only for convenience/comfort. For performance, one examines the compiled output, analyzes the instructions, tries to prevent pipeline stalls, etc. And if the compiler is generating unsatisfactory code, uses assembly for those parts.
C absolutely has them: you can opt out of aliasing on an individual basis (with `restrict`), but aliasing flaws are the default.
So much so that rustc regularly has to disable their noalias annotations because its pervasive noaliasing exercises rarely used LLVM paths and leads to miscompilation. Outputting noalias is currently disabled again in IR generation[0][1].
Note from downthread comments in [1] that it's completely possible to create valid C code which miscompiles on both GCC and LLVM (though GCC has been fixed since), it's just unlikely that a regular C dev would write the mix of unrolled loops, inlined functions and restrict annotations which triggers the issue.
FWIW the last round was an issue with the interaction between noaliasing and loop unrolling (after inlining) as unrolling would fail to "split" the noalias between unrolled iteration, so it was not the noalias handling which was broken but the loop unrolling pass.
And my understanding is that it's since been fixed in GCC.
I don't doubt for a second there will be new miscompilations discovered after that one's fixed though.
One hopes that the optimizer has arranged to check the bounds at the start of the loop, and not in it. Apparently Rust's doesn't, yet, but the checking it does usually seems to overlap with other stalls, so it might not matter much in practice.
The actual bounds checking instructions are often negligible in terms of overhead, but it can inflict damage by inhibiting other optimizations and loop optimizations, like reduction or vectorization.
Ok, but please don't be rude or post shallow dismissals here. If you know more, it would be great to share some of what you know. Then we can all learn something. Alternatively, it's fine not to post even when someone else is wrong.
What the parent is saying is essentially right. C++'s compile-time polymorphism and lambdas can allow code faster than C, given the right subset of the language. For instance, std::sort regularly outperforms qsort(3).
> It's the inexperienced "C/C++" (no such thing, BTW) that are 'choosing subsets'
That's blatantly not true. Stroustrup himself contributed to the joint strike fighter C++ coding standard, which has rules like "Allocation/deallocation from/to the free store (heap) shall not occur after initialization", which means that the STL is off the table, for example.
Beyond that, there's pieces like RTTI/dynamic_cast that should be avoided for a lot of reasons, perf being one of them.
Most of the STL does no allocation. The containers allocate only at known places, permitted during initialization. They also know how to allocate from provided bespoke stores, which are permitted, in JSF code, even after initialization is done.
So, no, the STL is very much on the table, as in any competently constructed work product.
Looking at performance counter data is good, but I would have liked to see a real validation of the hypothesis that bounds checking is to blame for the extra branches and instructions. That is, modify the Rust compiler to not emit bounds checks (or maybe there is even a flag for this?) and look at performance and counters. I would imagine that this would bring the data for Rust to pretty much the same as C. But other compiler (micro-)optimizations might be at play as well.
Also, from the paper's Conclusions: "The cost of [Rust's] safety and security features are only 2% - 10% of throughput on modern out-of-order CPUs." 10% network throughput is a lot.
This, IMO, is absolutely correct (it is a dark idea to have a "let's be unsafe for more performance" flag), but maybe an experimental build of the Rust compiler could have this as a configuration option? Possibly the toolchain could warn every step of the way if such a 'tainted' module is ever linked, etc.
It just seems like this sort of question is going to recur, and being able to persistently track the overhead of checking (it would allow you to monitor specific performance improvements) is much nicer than having someone do a one-off experiment.
If it is implemented, it will be used. And people will put it in their own builds.
We already have one “secret” escape flag feature, and people do use it, as much as we don’t talk about it and tell people not to use it when they find it.
Maybe put a tainted flag in it that causes the linker or runtime to fail? Then don't open source or release the modifications to allow the linker/runtime to avoid that failure check and refuse to let anyone check in a "fix" that allows this check to be skipped to an official build...
This seems like an incredibly important cost. Surely it's worth doing a bit of ugly magic to be able to keep track of it persistently.
Thanks to both of you for the insightful discussion. A flag would be helpful for testing, but it's true that if it's there, it will be used. Still, this can be tracked as part of a CI system by keeping around a patch for disabling bounds checks and regularly building and benchmarking a patched version. Less nice, but should get the job done.
I’m not 100% sure if there’s a source exactly, but we don’t like safety and correctness to depend on what flags you pass or do not pass. We don’t offer a fast-math flag either for similar reasons.
The odd one out is overflow, and that’s only because it’s well defined (a “program error”) and not UB to overflow in Rust. This gets checked in debug but not currently release, though the spec allows for it.
What do you think of Julia's macro-based approach?
That is, there are `@inbounds` and `@fastmath` macros that turn off bounds checking/enable fast-math flags in the following expression.
`@fastmath` works simply by swapping functions (eg `+`) with versions (eg, `Base.FastMath.add_fast`) that have the appropriate llvm flags.
When testing Julia libraries, all `@inbounds` are ignored (ie, it'll emit bounds checks anyway).
I assume it's already possible for a user to similarly implement `inbounds!` and `fastmath!` macros in Rust to substitute `[]` for `.get_unchecked()`, etc. (I haven't checked if there are already crates.) But it sounds like it should be easy enough for folks to check performance sensitive regions this way (in particular, loops that may need these flags to vectorize).
I guess my thought is that much of correctness comes from the compiler being able to make assertions that some type (and thus some memory address) will only be used in a correct way at compile time, etc, etc.
For example if we were dynamically linking a Rust crate into a Rust binary is it necessary to check boundaries in both or can some of that be deferred because we can assume the binary that will link has already done the boundary checks, etc?
I know it's a bit contrived since ideally we'd just compile statically, but I think it's still potentially valid. If both pieces of software have the guarantees then ideally you can factor out some of the overhead.
Not really: indexing out of bounds without this check would invoke undefined behaviour. A compile time flag would not be able to distinguish the cases where a bounds check is required for the program to be correct, from the cases where the index is provably within bounds and so is unnecessary.
Who wants a compile-time flag that makes valid programs have undefined behaviour? Nobody, especially when you consider that UB in any language really does mean undefined: in the best case the program crashes, in the worst it deletes all your files.
What's wanted is a way to tell the compiler "no, in this specific case which I have determined to be a bottleneck in my program, I want to omit bounds checking because due to XYZ it's impossible for the index to ever be out of bounds" and that's exactly what this method provides.
They can just profile to find which functions in their program are consuming the most CPU, check whether those functions have any bounds checks, and if so, write the single line of code required to tell the compiler "trust me, it is impossible for this index to ever be out of bounds, a bounds check is not necessary".
If they are right, and bounds checks are the issue, doing this should recover the performance difference.
yeah, I wonder in which world they live. If I could sell a limb to get 10% more audio plug-ins in my DAW, you could be sure that my bedtime book would be "Life pro-tips for quadruple amputees"
2-10% for an already fast user space driver is nothing.
State of the art for a lot of these use cases is still the kernel driver which is ~7 times slower. Sure, all that stuff is moving to XDP/eBPF/AF_XDP, but that is still ~20-30% slower than a user-space driver.
Also, these 2-10% only show up when underclocking the CPU while running the unrealistic benchmark of forwarding packets bidirectionally on only one core (trivial to parallelize).
In the end it's about 6-12 cycles spent more in the driver. That's not a lot if you have a non-trivial application on top of it.
Fortunately for your body, this problem is easily solvable with hardware. Modern DAWs' performance scales well with multithreading, for regular use cases at least.
I don't know your use case, but generally, if you have so many VSTs processing on a single track that it loads a core of a modern CPU, you're doing either something really creative, sculpting a sound, or some heavy-handed audio restoration. Both are candidates for freezing/rendering to a stem. YMMV, of course.
As an aside, can I just say how happy I am that the language I generally find a joy to use is nearly as fast as C? It's my daily driver, and I have almost no complaints.
Nearly?
Rust is already faster than C in quite a few scenarios and benchmarks. The other comment also said Rust is faster when C is compiled through clang.
I wonder if some of the bounds checks could be eliminated by using iterators instead of loops? It is common when coming from C to Rust to sometimes avoid complicated iterators because you imagine it can't be fast, so you use a loop, but usually the iterator really is faster.
And I believe the checked math can be eliminated just by explicitly stating you want to use unchecked math. It doesn't require unsafe to do so.
> I wonder if some of the bounds checks could be eliminated by using iterators instead of loops
It can and often is. Don't use [] to index into data if you can afford not to. Anecdotally, rustc is also much better at generating SIMD-friendly code with iterators than idiomatic C/C++, but that depends largely on what you're doing.
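A sketch of the usual before/after (whether the iterator form actually vectorizes depends on the element type and the target):

```rust
// Indexed: `a[i]` is provably in bounds, but `b[i]` keeps a bounds
// check (and its panic path) in every iteration, which also tends to
// block vectorization.
fn dot_indexed(a: &[u32], b: &[u32]) -> u32 {
    let mut sum = 0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

// Iterators: `zip` reconciles the two lengths once, up front, so the
// loop body carries no checks at all.
fn dot_iter(a: &[u32], b: &[u32]) -> u32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```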
As an aside, I seem to recall .NET making an interesting optimisation here, such that if you access an array using, for example, `data[20]`, bounds checks are omitted for lower indexes.
Rust and/or LLVM will make this optimization as well. The easiest example I can find of this is in the buffered IO code [1], but there have been others
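In Rust the same trick is usually written by checking the highest index up front (a sketch; the elision is easy to confirm in the generated assembly):

```rust
fn header_fields(buf: &[u8]) -> (u8, u8, u8) {
    // One check against the highest index needed...
    let head = &buf[..3];
    // ...after which the individual accesses are provably in bounds,
    // so no further checks are emitted.
    (head[0], head[1], head[2])
}
```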
Anyone asking for "source" should have to show that they've searched first. Because it's right there. As noted in the other comment, it's in the literal documentation for Rust.
Low effort commenting is usually shunned here, yet the ever prevalent "source?" seems to get a free pass all the time.
Really people should say: "source? Because I tried https://www.google.com/search?q=rustc+is+also+much+better+at... and found no sources whatsoever about this issue that has been discussed to death on dozens of forums yet I was somehow incapable of finding a single thing on the matter"
Why should someone spend time typing out easily searchable content for you? In fact, why should forums with a focus on high quality content even answer you?
Anyone who asks for sources should be required to do the same on demand. Whenever, wherever.
I've certainly seen requests for sources used obnoxiously by people who felt zero need to back up their side of the argument with sources. But it's also often a legitimate request to see what the OP is talking about.
The internet is vast. I sometimes have difficulty finding things I personally saw before and know exist. Slightly different search terms can turn up drastically different results and, with five million visitors per month to HN, it is a melting pot of people with vastly different backgrounds.
Well, I mostly agree when the first Google result answers the claim.
I've clicked on most links from your Google query and was not able to find anything that proves the claim.
Most of the links are about explicit SIMD, something that is off topic for the initial claim and something that C/C++ does far better (so many SIMD libs); C/C++ is also better at hybrid SIMD (SPMD, OpenMP 5, OpenACC, etc.).
"Anecdotally, rustc is also much better at generating SIMD-friendly code with iterators than idiomatic C/C++"
The author is talking about auto-vectorization.
Firstly, C/C++ supports AVX-512 while Rust does not (in stable, at least).
Secondly, real-world LLVM/GCC devs are paid to improve C/C++ performance; Rust performance improvements are only a side effect.
And iterators in C++ are idiomatic, so I see no reason why Rust iterators would auto-vectorize better than C++ ones.
And C++ does not enforce bounds checks (but allows them with at()). Anyway, bounds checks are a totally useless concept, because they slow down release performance and capture far less information for debugging than ASAN.
So the initial claim is non obvious and I see no answer on the Google query.
If nobody is able to prove it I will believe the claim is probably wrong.
Could it not serve as an educational tool for the person being asked: when they write a statement of a certain kind, a source citation should really accompany it?
Integer overflow checks are disabled by default in release builds; the author noted they were explicitly enabled in this project for additional safety.
(This is just for the normal operators, if you use e.g. checked_add() or wrapping_add() then you can be explicit about what behavior you want.)
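For illustration, a minimal sketch of those explicit variants on a u8, with values chosen to overflow:

    fn main() {
        let x: u8 = 250;
        assert_eq!(x.checked_add(10), None);          // reports overflow as None
        assert_eq!(x.wrapping_add(10), 4);            // wraps modulo 2^8
        assert_eq!(x.saturating_add(10), 255);        // clamps at u8::MAX
        assert_eq!(x.overflowing_add(10), (4, true)); // wrapped value plus a flag
        // A plain `x + 10` panics when overflow checks are on (debug builds,
        // or release with overflow-checks = true) and wraps otherwise.
    }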
[edit: oops, the overflow checks were only enabled for a separate test, see note from the author below]
Iterators had a lot of work done on them for this reason, although it doesn't always work.
E.g., at one point the difference between using a signed and an unsigned value in a loop was preventing rustc from optimizing out the checks, which resulted in terrible performance for that loop, but it was non-trivial to figure this out.
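I don't have the original case at hand, but here is a hedged, hypothetical sketch of the general shape of the problem: a signed counter forces a cast on every index, and the extra range reasoning can keep the bounds check alive.

    // Hypothetical illustration, not the exact loop from the anecdote:
    fn sum_signed(data: &[u32]) -> u32 {
        let mut sum = 0;
        let mut i: i32 = 0;
        while i < data.len() as i32 {
            sum += data[i as usize]; // the i32 -> usize cast can defeat
            i += 1;                  // bounds-check elimination
        }
        sum
    }

    // The iterator version gives the optimizer nothing to worry about:
    fn sum_unsigned(data: &[u32]) -> u32 {
        data.iter().sum()
    }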
It'd be interesting to see how Rust performs compared to vector packet processing (https://fd.io/technology/). My hunch is that it's not very easy to implement something similar in Rust that can keep the instruction cache hot, especially going by the data from this paper.
We can see in this specific case there was better cache locality and more data was served from the L1 and L2 cache with a drop in L3 cache misses (no hits, because it didn’t have to look in L3 for anything).
6 cycles for bounds checks that the branch predictor never had to rewind on is nothing in comparison to saving a couple trips to L3.
> We can see in this specific case there was better cache locality
Dramatically better L1 and L2 cache behavior. It seems clear that the additional instruction load of the Rust driver is partially made up by the excellent cache utilization.
This "Rust vs C" document is just one part of a larger analysis of network driver implementations in many languages; C, Rust, Go, C#, Java, OCaml, Haskell, Swift, Javascript and Python. Have a look at the top level README.md of that GitHub repo.
Unless/until a lot more is written in Rust... not much. It uses slightly more base RAM to load the binary. Some of the bloat is things that in C programs would be dynamically linked in - it isn't that Rust is doing more, it's that C gets to share a lot of stuff and Rust has to bring its own.
I don't know. I wouldn't be surprised that it loaded the whole thing. How could the OS predict how much to load (or wait on)? Waiting for a page to load just for the next function call would be hugely expensive.
In general, the effect of bloat is not visible in benchmarks like these where the goal is to run something small many many times, with ample memory available, and as little else on the system adding noise to the results as possible. It's the same reason you see "Java is faster than C" benchmark results, yet everyone knows how the former actually performs in practice.
The effects of larger memory usage don't become obvious until other applications start contending for it and/or swapping happens, and it's conveniently also something that is not as easily blamed on one application "being slow", which is why it doesn't receive nearly as much attention as it should.
It does take extra space, but ideally you'd store the exceptional error-handling code out of line, so that it doesn't take up cache in the common case.
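One common pattern for that in Rust (the names here are illustrative): mark the error path cold and keep it out of line, so the hot path stays compact in the instruction cache.

    // Rare path: never inlined, hinted as unlikely to be taken.
    #[cold]
    #[inline(never)]
    fn reject_packet(len: usize) -> ! {
        panic!("packet too short: {} bytes", len);
    }

    fn first_byte(data: &[u8]) -> u8 {
        if data.len() < 14 {
            reject_packet(data.len()); // error handling moved out of line
        }
        data[0] // hot path; the length check also elides the bounds check
    }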
Our results probably only hold true for workloads with a low IPC. The test case is also a very limited forwarder, but real network functions also have a relatively low IPC in my experience (don't have any numbers to back up this claim, though).
If they built with debug there would be dramatically more load/store uops than you see in the benchmark. Debug mode builds disable optimizations and store variables back to memory after most expressions to aid debugging.
I wish they included the Rust benchmark code so we can play around and improve it. If the issue really is the bounds checks, then there are many ways to massage Rust code to make it obvious they're not needed. But it seems you can't really play with it without running on actual hardware.
Really no point to running it without real hardware, it will perform completely differently if you don't have the MMIO accesses and DMA by the NIC in there.
We'll have a VirtIO driver soon, but that's bottlenecked by the hypervisor and not really useful for performance tests.
With fake memory mapping it would not access the same memory, but you'd still run the same code, right? So the timing/details of access may change a lot, but the profile of executed code would be still meaningful. For example, if you run a loop with range checks, and a loop without them, it doesn't matter what's happening in the loop, as long as it stays consistent between the runs - you can still tell things about the loop overhead.
Would need a somewhat realistic emulation of the NIC setting/clearing these flags. Also, the by far slowest steps are the MMIO access because it involves a full PCIe round trip which is hard to emulate.
It will behave differently at the level we are looking at here. It's easier to use real hardware; we specifically chose this NIC because it's probably the most common 10G NIC on servers.
(1) Runtime. Even though C has "no" runtime, there is some runtime overhead: Variable allocation / de-allocation and alignment. Function invocation / call stack setup. Et cetera. I don't know enough about Rust to understand what its exact "runtime" is like, but I'd bet it's not exactly like C. (Rust is strongly-typed, is it not? Does it do run-time type-checking?) Exception handling in particular requires a non-negligible amount of run-time overhead and may account for a significant proportion of the difference between the performance of Rust and C.
(2) Optimization. Rust and C are going to be compiled with either GCC or LLVM. The way both work is that they compile the program to an internal representation of opcodes, which then gets optimized, and platform-specific, optimized machine code is then emitted. In some cases the specific set of opcodes or optimizations used with various conventions / data structures may be more or less efficient. Over time this will improve.
Others have pointed out that Rust's type-checking is all done at compile time. Rust also does not have exception handling.
To be more specific: errors in Rust are either propagated by function return values (usually in the form of `Result` or `Option` types) which do not require any kind of runtime support, or by panicking, which unwinds the stack for clearing-up of resources and then halts the running thread. (Panics cannot be caught, although they can [EDIT: sometimes but not always; see burntsushi's comment below] be observed by a parent thread if they have occurred in a child thread.) Rust's "runtime", therefore is negligible, much like C's: it consists mainly of functions which can be called for things like stack unwinding or backtrace support.
Do note the caveats in the docs though. It isn't guaranteed to catch all panics. Indeed, unwinding can be disabled completely with a compiler option (instead, panics will abort).
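For reference, a minimal sketch of observing a panic with std::panic::catch_unwind; note this only works under the default panic = "unwind" strategy, not when compiled with -C panic=abort.

    use std::panic;

    fn main() {
        let result = panic::catch_unwind(|| {
            let v = vec![1, 2, 3];
            v[10] // out-of-bounds indexing panics
        });
        assert!(result.is_err()); // the panic was caught, not fatal
    }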
EDIT: Sorry, I don't mean 'runtime type checks' are slowing down Rust, but rather that Rust performs more general runtime safety checks (like RefCell).
You're correct - what I had meant to say is that Rust performs more runtime checks than C in several cases, like RefCell. Therefore, Rust can be slower than C because of safety (but not because of type checking itself).
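A tiny sketch of the kind of runtime check RefCell performs: the borrow rules are enforced with a counter at runtime instead of at compile time.

    use std::cell::RefCell;

    fn main() {
        let cell = RefCell::new(42);

        let _w = cell.borrow_mut();          // dynamic exclusive borrow
        assert!(cell.try_borrow().is_err()); // a second borrow fails at runtime,
                                             // where &mut would fail to compile
    }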
> I don't know enough about Rust to understand what its exact "runtime" is like, but I'd bet it's not exactly like C.
It is.
> Rust is strongly-typed, is it not?
Yes.
> Does it do run-time type-checking?
No.
> Exception handling in particular requires a non-negligible amount of run-time overhead and may account for a significant proportion of the difference between the performance of Rust and C.
Rust does not have exceptions; it has panics (fatal errors). It is correct for a panic to abort the program. You can provide a user-defined panic runtime that does whatever you want (unwinding, aborting, looping forever, `ud2`, etc.).
Well again -- I don't know how Rust does it specifically, but when you compile C++ programs in g++ with exceptions disabled you pretty consistently see a 10% speed improvement right off the bat and the performance falls more closely in line with the performance of C code.
Specifics aside the broader point is that the runtimes between the two languages will differ.
Zero-cost exception handling is only so cheap by excluding a lot of things from the cost/benefit analysis. It adds no extra code to the exception-free path. But it has a few major impacts:
* Exception tables are not particularly small, and can involve extra relocations, which can increase the time it takes to load the binary.
* Every function call where there is a variable with a destructor in scope has an implicit try/catch wrapped around it. This can increase code size tremendously, which hits your instruction cache. And, unlike the unwind tables, the relevant data is in the middle of your code sections, so you can't do tricks like lazy loading of the data.
* Every time you call a function that may throw, you need to break up the basic blocks. So every optimization in the compiler that cannot work across basic block boundaries is going to perform much more poorly with ZCEH.
* Of the various kinds of basic block edges, the exceptional edges are the hardest to reason about (computed goto edges being the second hardest). So many optimizations that can work across basic blocks are still going to bail out in ZCEH conditions.
* It's possible to ameliorate the optimization costs if the compiler can figure out that the callee isn't going to throw an exception and turn a may-throw call into a may-not-throw call. But memory allocation may throw (by default), so anything that might allocate memory (say, adding a value to a std::vector) potentially throws an exception and inhibits optimization.
But IIRC Rust doesn't statically link glibc by default.
So the difference in runtime overhead is more than 10x in machine code size. That doesn't translate to a 10x slowdown, and a lot of that is probably baked-in optimization, but Rust certainly has a runtime and associated overhead, and it's a lot larger than C's.
AFAIK Rust type checks only during compilation and won't allow you to cast values of one type into another without clearly defining how that should work. So if there are any type-related checks done during runtime, they happen because you added them.
For what it's worth, it is more accurate to benchmark on your own machine; you may find different or interesting outcomes, such as errors, or Rust being slower.
We should not forget that Rust chooses safety over speed, so it does array bounds checks and such at runtime, because programmers make mistakes. Sadly, that introduces slight overhead.
Newbie to Rust, but surely there must be a way to disable bounds checking in Rust, right?? Like C++ std::vector has at() (bounds checked) and operator[] (plain dereference, unchecked), or other languages have get/unsafe_get (OCaml); surely Rust has a way to disable bounds checking as well (or disables it for optimised builds)
> Newbie to Rust, but surely there must be a way to disable bounds checking in Rust, right??
There's an unsafe method to not do bounds-checking, keeping in mind that indexing outside the collection is UB. It's usually a better idea to e.g. use iterators, or try and nudge the optimiser towards removing the bounds checks.
> disables it for optimised builds
A compiler flag that adds UB to a valid (though not correct) program is not considered a great idea by the Rust team.
As a design choice, Rust prefers not to change the semantics of methods depending on the context, but instead to expose different methods. Arrays offer get_unchecked and get_unchecked_mut methods, which can only be called inside unsafe blocks.
The one exception I'm aware of is that integer overflow panics in debug builds, and silently wraps in release builds. But in most other cases, there will be separate methods with different semantics, requiring an unsafe context as appropriate.
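Side by side, the separate methods look like this (a sketch with made-up values):

    fn demo(data: &[u32]) -> u32 {
        assert!(data.len() >= 3);
        let a = data[0];                           // checked; panics if out of range
        let b = data.get(1).copied().unwrap_or(0); // fallible; returns Option
        let c = unsafe { *data.get_unchecked(2) }; // unchecked; out of range is UB
        a + b + c
    }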
On how "fast" C is, guys, sorry to say this, but in my experience actually (1) even assembler is not very "fast", (2) for much the same reasons C is not very fast, (3) often Fortran is faster than C, and (4) PL/I can be the fastest of all of (1)-(4). More generally, how Fortran and PL/I can be faster than C should be exploited quite generally but apparently has not been fully generally.
Why? How can Fortran and PL/I beat assembler and C?
In particular, first, Fortran can easily beat C in subroutines that receive multidimensional arrays as parameters. Here PL/I also can beat C for the same reasons. Second, PL/I beats both Fortran and C in string handling.
How, why, what the heck is going on? In simple terms, first, Fortran gets to compile the multidimensional array handling within the language and C does not. In C, the programmer has to write the array addressing logic (e.g., row major or column major), which looks to the C compiler like just more code, so the compiler has to treat that handling as general-purpose code; the Fortran compiler KNOWS that the work is array handling and gets to make better use of registers and to store intermediate results where the Fortran source code can't see them and, thus, to take some liberties. That is, in Fortran the array handling is in the language, where the compiler can take liberties, while a C compiler has to treat the work as ordinary code and cannot. PL/I has the same advantage. Second, similarly for string handling in PL/I.
In particular, commonly in C and Fortran, string handling is via subroutines that to the linkage editor look like external references that must be linked. That indirection, stack manipulation, register save, restore, etc., PL/I doesn't have to do.
Scientific, engineering, analytical code is awash in multidimensional arrays, and there Fortran and PL/I can beat C. Now nearly all code is awash in string handling, and there PL/I can beat both Fortran and C.
For assembler, the old remark was that in practice an assembler programmer would write his own code that was less well designed and, thus, slower than C, Fortran, and PL/I. Further, the assembler programmer would be tempted to have a large library for arrays and strings that would be external subroutine calls likely with the overhead. Sure, in principle, assembler should win, but in practice writing code good enough to win is usually a bit too much work.
More generally, compiled functionality offered directly by the language can be faster than libraries of external subroutines, methods, etc.
How current is this impression? There continue to be first-rate Fortran compilers that probably win on numerics for exactly the reasons you state, but I would be surprised if PL/I has a toolchain that's kept up with modern architectures.
String handling, on most modern toolchains, mostly goes to inline functions or builtins that the compiler has been (rather pragmatically, if a bit disgustingly) made aware of. It's very unlikely that strcpy or memmem is going through external linkage.
My experience is old, but my summary point remains: Language defined and compiler supported functionality usually is faster than libraries of external functions.
The relevance here: C isn't the fastest thing around and can be beaten. Maybe similarly for Rust.
For PL/I and strings, for the IBM versions on S/360 and S/370, the compiler might have used some of the hardware instructions for string handling.
I'd be surprised if in C strcpy, etc., were not still external: they USED to be, and then someone might have written their own version, e.g., for their own flavor of strings, and linked to it; for compatibility, strcpy would just about have to be an external function call still.
The compiler (off the top of my head this applies to GCC, clang, and MSVC; don't know about ICC) generally understands the C string functions at a fundamental level and will emit optimized code instead of a function call: it can use hardware instructions and may know the size of the buffer at compile time. For instance, on GCC you have to specify -fno-builtin to say "you don't know what you think you know about the C library; actually emit those function calls instead of trying to optimize the idea of those calls."
I get it now: My view of HN just went down to the sewer: We're into the old, bitter, go out back and fight it out, religious computer language wars. And there C is one of the grand sacred subjects:
C is the greatest. The greatest, best ever. The greatest, fastest of all time, never to be improved on. Pure genius. Pure gold. And C programmers are like people who climb Mount Everest barefoot with no tools and the only REAL programmers.
Yup, C religion.
My view, bluntly: C was a toy language for an 8 KB DEC computer, well known to be a toy at the time. It's still a toy language and so primitive that it is a huge waste of effort in computing. There's no excuse for it and hasn't been even from the first day it was written, even for operating system code, since even at the time Multics was written in PL/I. And C is so primitive, in its K&R definition, that for string and array functionality, it is SLOW.
Besides, the idea that we are stuck with strcpy, etc. as in the original K&R external library means that we can't implement array bounds checking, reports on memory usage, etc. with an upgraded library of external functions.
My points here are rock solid: Functionality defined in the language and implemented in the compiler can be a lot faster than implementations with libraries of external functions. On strings, C and Fortran are examples. On arrays, C is an example.
If Rust is not a lot faster than K&R C, then I'm surprised at Rust.
All those hero barefoot Mount Everest climbers might notice that they are really tired, their feet hurt, and they very much need some better equipment!
Ah, programming language religious wars!!!!!
Grow up HN; set aside the toys of childhood. Be rational and set aside religious wars.
You appear to be making your points in a way in which no developments from the past 10-20 or so years are intruding. Can you point to an extant PL/I compiler that produces good code on modern architectures? How sure are you that the issues that separated C, Fortran, and PL/I historically are even remotely relevant now?
I would imagine inlined functions (and for the brave, interprocedural analysis) renders many of your points moot. Fortran still has an edge on numerics as I understand it, but I don't think it's all that decisive.
The other reason that baking functionality into the language is problematic is that you wind up having a few things that go fast, while you neglect to build facilities that optimize other people's data structures. So you get the world's fastest dense array (say) but your sparse array is resolutely mediocre. Instead of a language packed with special-cases, I would much rather a minimal language with good analyses (many of which can be supported by language design; I think Rust has a good future here to support good analyses by how carefully it tracks lifetimes and how it doesn't fling aliases around with the abandon of C).
> You appear to be making your points in a way in which no developments from the past 10-20 or so years are intruding. Can you point to an extant PL/I compiler that produces good code on modern architectures? How sure are you that the issues that separated C, Fortran, and PL/I historically are even remotely relevant now?
It's simple. Again, once again, over again, yet again, one more time, the main point is that functionality in a language definition and implemented in a compiler is faster than that functionality implemented in external functions. Simple. Dirt simple. An old point, and still both true and relevant and not changed in the last "10-20" years.
C is not supposed to have changed from K&R -- that it is backwards compatible was one of its main selling points and main reason to put up with such meager functionality that had programmers digging with a teaspoon.
Fortran and PL/I just illustrate the point. The point is thus illustrated. For this illustration, I can go back to old Fortran 66 and PL/I F level version 4. Still the point is illustrated and remains true.
And for the changes in C, does the language yet have definitions and support for multidimensional arrays and something decent in strings? And what about malloc and free -- have they been brought up to date?
I did NOT propose, I am NOT proposing, using Fortran or PL/I now. Instead, again, once again, over again, ..., I just made a point about language design and implementation. That this evidence is old is fully appropriate if we regard C as defined in K&R and not changed since. If people changed it, then it's a different language and likely not fully compatible with the past.
So, this thread was about C: I assumed C hasn't changed so assumed K&R C.
I don't want to use C but have, recently in one situation was essentially forced into it. But after that case, I left C. I'm not using Fortran or PL/I, either.
I'm using Microsoft's .NET version of Visual Basic (VB). Right, laughter, please, except it appears that .NET VB and .NET C# differ essentially only in syntactic sugar (there are translations both directions), and I prefer the VB flavor of that sugar as easier to teach, learn, read, and write than C# which borrowed some of the C syntax that K&R asserted was idiosyncratic and in places really obscure.
For this thread, I was surprised that Rust would not run circles around old, standard K&R C. Soooo, the answer is in part that the C of this thread is not old K&R C. Okay, C has changed and is no longer seriously backwards compatible. Okay. Okay to know, but for a huge list of reasons in my work I'm eager to f'get about C and its digging with a teaspoon.
If the new versions of C do not do well handling multidimensional arrays passed as arguments, then C should still be considered TOO SLOW, and too clumsy.
But, whether or not I use C, and whatever I think of it, the point remains: Functionality in the language is faster than that implemented via external functions.
Of course, this point is important in general when considering extensible languages. Sooo, maybe if we could do for syntax as much as has been done for semantics, then we could do better on extensible languages. But then we could extend in 1000s of ways, maybe once for each application, and then have to worry about the 1000s of cases of documentation, etc. So extensibility, when we do it, has pros and cons.
I see: Discussing language design, especially for sacred C, is heresy in the religious language wars. Forbidden topic. Who'd uh thunk -- HN is full of hyper sensitive and hostile religious bigots.
> C is not supposed to have changed from K&R -- that it is backwards compatible was one of its main selling points and main reason to put up with such meager functionality that had programmers digging with a teaspoon.
C has gained new functionality and features. It just hasn't deprecated the old ones. K&R C will still compile with a modern C compiler, but it won't be as good as modern (C99/C17) C.
I really wasn't interested in the role of time. I was just struck that Rust was not a lot faster than C. Even with strcpy, malloc, free, etc. all compiled, C should still be slow due to essentially no support for arrays with more than one subscript. Thus I would expect that a modern language with good array support would beat C and am surprised that Rust does not.
Array support is now very old stuff; gee, now we'd like balanced binary tree support, maybe AVL trees, maybe like the C++ collection classes, but compiled into the language rather than as classes with too much indirection. And we'd like good multi-threading support. Some of that is in the .NET languages; then they have a shot at beating C. I would expect that any language taken seriously today would beat C, whether the K&R version, the ANSI version, or some recent version.
And I'd like a lot more, e.g., some good semantic guarantees from static code analysis.
BTW in my remark above, have to swap "syntax" and "semantics" -- we have good progress on syntax but less good on the more difficult semantics.
A practical challenge is given some code, say, 100 KLOC, maybe 1 MLOC, have some static analysis that reports some useful information. Then have some transformations that have some useful guaranteed properties. If current languages do not admit such, then try to come up with a new language that does and still is easy enough to use and fast enough.
Gee, I assumed that after all these years we'd get some progress -- that Rust doesn't run circles around C is disappointing.
> I really wasn't interested in the role of time. I was just struck that Rust was not a lot faster than C. Even with strcpy, malloc, free, etc. all compiled, C should still be slow due to essentially no support for arrays with more than one subscript. Thus I would expect that a modern language with good array support would beat C and am surprised that Rust does not.
What do you think "array support" means to the compiler? Fortran has historically been faster just because of the aliasing rules, but C has the restrict keyword now that brings it up to snuff there. The standard library inlining is more a hack around the fact that the C stdlib is dynamically linked, and that doesn't always make sense for heavily used functions.
I'm not really sure where you got this idea that C is a low bar. For instance here's C meeting or beating Fortran on every benchmark except one (where statistically they're probably tied). https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Soooo, those C compilers cheat on the language as defined in K&R etc.!!! WOW!!
With that cheating, C gets to catch up with, say, what PL/I was doing with string manipulations in, say, 1969!!!
So, this is the 50th anniversary!! C string handling is right up to date as of 50 years ago!!!
There is still the issue of a C programmer having to calculate array indices that Fortran and PL/I can do back, way back there, 50+ years ago!!!! What was it, Fortran 66 or some such???
> It doesn't "cheat"; the semantics of the C standard library are just as much a part of the language as the rest.
As I read K&R, strcpy, etc., were there, in a library of external functions, as a convenience and optional for the user, and NOT part of the definition of the C language. YMMV.
The situation definitely is "cheating": in K&R, strcpy is clearly (in syntax, semantics, and original implementation) an external library function. K&R has the definition of the language. One of the most important parts of C is that it doesn't change. It was clear in K&R that strcpy could be implemented by any user in their own external code.
So, a compiler that implements strcpy with its own in-line code, not with external names, is cheating on the language definition.
C as in K&R was designed to run on a DEC computer with 8 KB of main memory. So, the K&R definition of C was primitive.
Since strings in C as in K&R are so primitive, really a disaster, a situation well understood when K&R was written, some programmers might have done a much better string implementation via external function calls. So, given some old C code, they could have linked in their own external calls not mentioned in K&R and for parts of the old code that called strcpy linked in their own version of that function that worked with the new string functionality. All to have been expected.
So, some compiler writers have gotten more performance from C by cheating some on the language and put in compiler options to justify the cheating.
With this cheating, the compiler writers are making my main point: functionality defined in the language and implemented in the compiler via in-line code can be faster than functionality implemented in external function calls. That is, the more recent C compiler writers so strongly agreed with this point that they cheated on the language definition to get the gains. This point was made very clearly by IBM as they pointed out that the string functionality in the PL/I language and the IBM compilers was faster than the many Fortran string packages implemented by external functions.
Being faster than K&R style C is easy. So, here I was struck by the point that Rust is not always a LOT faster than C, that is, K&R C. But, maybe Rust IS a lot faster than C with a compiler that does no cheating!
Who said? K&R is pre ANSI C, so pre any standardization.
The concept of the C standard library dates to ANSI C, so to say that they "changed" the language in that they more closely documented the behavior of the language and the standard library means that, for all intents and purposes, the standard library is part of the language.
Which is exactly the point you are making about Fortran and PL/I: they have standardized functions that perform in a documented manner and can therefore be optimized by the compiler.
That C compilers understand the defined behavior of strcpy et al means that they are allowed to optimize the implementation.
C/C++ compilers will happily optimize standard library functions. A call to printf that doesn't use any % formatting will usually get converted into a call to puts, for example. And strlen of a string literal is of course replaced with a constant integer, and strcpy will be converted to memcpy if the length is known. And memcpys of known length are of course emitted with appropriate move sequences.
This is a great comment, but it's in the wrong place. The OP is about using Rust/C to compose network and packet handling software; using Fortran there would be completely pointless. There is no easy gain to be found from vectorization in networking.
I was not bringing up vectorization in Fortran. E.g., in Fortran we can have
      Real X(20, 30, 40), All
      Call Addall(X, 20, 30, 40, All)

and have

C     Sum all elements of an L x M x N array whose bounds are
C     passed in as arguments along with the array itself.
      Subroutine Addall (Z, L, M, N, Result)
      Integer L, M, N, I, J, K
      Real Z(L, M, N), Result
      Result = 0.0
      Do 100 I = 1, L
      Do 100 J = 1, M
      Do 100 K = 1, N
      Result = Result + Z(I, J, K)
  100 Continue
      Return
      End
So what is good here is that in the subroutine the parameter Z gets its array bounds PASSED from the calling routine, the statement

      Real Z(L, M, N)

honors those bounds, and the code then does the indexing arithmetic (column major, in Fortran's case) using the bounds passed as arguments to the parameters L, M, N. That is, in the subroutine we can have the array bounds as parameters, and still Fortran does the array indexing arithmetic.

Can't do that in C: in C the programmer has to do the array indexing with their own code in the subroutine. Thus to the compiler this code is just ordinary code and has to be treated as such. In contrast, the array indexing Fortran does is not ordinary code, thus permitting Fortran to make better use of registers and, for example, leaving intermediate values in registers and not writing them back to storage in variables known to the source code.

Here vectorization has nothing to do with it.

Bluntly, C doesn't support arrays, and that's a bummer for the programmer and also for execution speed.

My point is on topic because this thread is about C performance, and I've brought up the point that functionality, in this case array indexing arithmetic, defined in the language and implemented in the compiler can be faster than the programmer implementing the functionality in their own code, especially, in the case of strings, if that implementation is via external routines.
> leaving intermediate values in registers and not writing them back to storage in variables known to the source code
C compilers (with optimization on) don't spill intermediates either.
Your decades of experience are appreciated, but apparently your decades of experience don't cover looking at what optimizing C compilers learned to do in the last 30 years or so.
> Bluntly, C doesn't support arrays, and that's a bummer for the programmer
That's true.
> and also execution speed.
That isn't, or at least you haven't demonstrated it.