A potential lesson here (i.e. I am applying confirmation bias to retroactively view this article as justification for a strongly held opinion, lol):
Unless you are gonna benchmark something, for details like this you should pretty much always just trust the damn compiler and write the code in the most maintainable way.
This comes up in code review a LOT at my work:
- "you can write this simpler with XYZ"
- "but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access/some other combination of assumptions about what the compiler will generate and what is slow on a CPU"
I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.
Blog author here. I somewhat agree, somewhat disagree. This line makes me uneasy:
> I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.
One of my philosophies is that death by a thousand cuts is fine, but death by ten thousand cuts isn’t. A team of 10 engineers can probably fix most of a thousand cuts in two or three months. But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.
Now for some context, my background is video games. Which means the team knows they need to hit an aggressive performance bar. This isn’t true for many projects. shared_ptr is a canonical example of death by ten thousand cuts.
That said, I strongly agree with the principle of “just do the simple thing”. However I think it’s important to have “sane defaults”. A project can easily have a thousand or ten thousand papercuts that kill performance. But you can’t microbench every tiny decision. And microbenches are only a vague approximation of what actually matters.
I’m also wary of “the compiler will make it fast”. Because that’s true… until it’s not! Although these days you don’t have any choice but to lean heavily on the compiler and “trust but verify”.
No one wants a super complex solution if it’s not needed. However I am very amenable to “do a slightly more complex thing if you know it’s correct and we can never think about this ever again”. It’s much easier to do the fast thing upfront than for someone else to try and speed it up in two years when we’re doing a papercut pass.
> But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.
I am reminded of the lovely nanosecond/microsecond talk by Grace Hopper. If your code does a little bit of setup and then spends all of its time in a single hotspot, fine. But if your code is full of microsecond-suboptimal speed bumps, they can hide your hotspot altogether. And a flat-ish flame graph looks fine: nothing stands out as a problem!
It's valuable to do micro-benchmarks, not just to hone your optimization skills, but to learn optimal patterns in your language of choice. Then, when you're "in the zone" and laying down new code, you just do the optimal thing reflexively. Or, when you're reviewing or rewriting something, those micro-hotspots jump out and grab your attention.
There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.
> There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.
Laziness in terms of using entirely inappropriate algorithms, sure.
Laziness in not microbenchmarking minutia? It shouldn't be. There's a limit on how much that can hurt you. I would say much less than a factor of ten, but let's go with 10x just for argument's sake. If you have a CPU that's 500x faster, and use easy code that's 10x slower, you're doing just fine. This is not the problem with modern unresponsiveness.
When I rewrite python code in C, I often hit 1000x speedups, and sub-100x is rare. And that's line-for-line. When I fix an accidentally-quadratic issue, for example, I've seen speedups in the billions without even changing the language.
People have lionized Knuth's quote about premature optimization, and used that to ignore performance issues across the board. Since the early '00s, we have not seen a 500x improvement in CPU speed. It's less than 2x on frequency, and let's say 8x on core-count for most users (which doesn't help your single-core lazy programmer). In my experience, programmers will make projections based on a 500x-faster processors that will never arrive, because it's easier than honing their skills and keeping them sharp. And even if these magical THz-frequency chips arrive, if you have three layers of 10x slowdowns, you're back down to GHz.
This has been my experience too. I wrote a text CRDT last year which improved on automerge’s (then) 5-minute runtime. My code currently takes 6ms to do the same work.
Automerge’s design assumed this stuff would always be slow, so they had this whole frontend / backend code split so they could put the expensive operations on worker threads. Good optimizations in the right places make all that complexity unnecessary. The new automerge is shaping up to be simpler as well as faster.
> When I rewrite python code in C, I often hit 1000x speedups, and sub-100x is rare. And that's line-for-line. When I fix an accidentally-quadratic issue, for example, I've seen speedups in the billions without even changing the language.
And neither of those is a microbenchmark thing, which is kind of my point. I'm surprised language would hurt that much, but that's enough to break things on its own without any layering.
> Since the early '00s, we have not seen a 500x improvement in CPU speed. It's less than 2x on frequency, and let's say 8x on core-count for most users (which doesn't help your single-core lazy programmer).
I don't think people are talking about 2004 when they talk about the responsiveness of ancient software on ancient hardware. I interpret that as more like an Apple II. But instructions per clock have also gone up a lot since the Pentium 4 days, and having more than one core in your CPU has a huge impact even for single-threaded programs.
> And neither of those is a microbenchmark thing, which is kind of my point. I'm surprised language would hurt that much...
The point I'm making here is that every line matters -- not just the hotspots. If you're surprised that language can have that much impact, perhaps it's time to learn a bit about performance issues that you're being dismissive of?
> I don't think people are talking about 2004 when they talk about the responsiveness of ancient software on ancient hardware.
No, I was responding to your mention of a 500x improvement in hardware. That pipedream ended in the early 00s, and people still talk like Moore's law will absolve their inattentive coding practice. And that felt fine in the decades we went from kHz to GHz, but it's unacceptable today.
> The point I'm making here is that every line matters -- not just the hotspots.
That depends on how it would have performed if you only transformed the hottest 10% into C.
But I was mainly responding to the idea that micro-optimizations are needed to keep general software snappy, and I don't think they are. If one language is that much faster, that's not micro-optimization.
> No, I was responding to your mention of a 500x improvement in hardware.
What do you mean "No"? I was talking about current hardware being 500x faster than 1988 hardware, which it is. If that's not what you meant by "ancient software on ancient hardware", fine, but that's what my 500x was talking about.
> people still talk like Moore's law will absolve their inattentive coding practice
I'm not trying to excuse inattentive coding. I'm trying to say certain kinds of attention are important and others aren't.
Python is easily 200x slower than C for equivalent code. The reason why Python is usable is because it uses a lot of C libraries that do the heavy lifting.
> And a flat-ish flame graph looks fine: nothing stands out as a problem!
If your program is still slow, that would also indicate that everything is a problem, i.e. the ten thousand cuts. Start optimizing at some obvious spots and then see what happens.
Haha "death by a thousand cuts" is exactly the phrase I encounter in these debates!
And actually I still disagree - e.g. I once took over a DMA management firmware and the TL told me "we are really trying to avoid DBATC so we take care to write every line efficiently". But the thing was that once you have a holistic understanding of the system's performance you tend to find _only a small fraction of the code ever affects the metrics you care about_!
E.g. in that case the CPU was so rarely the bottleneck that it really didn't matter, we could have rewritten half the code in Python (if we'd had the memory) without hurting latency or throughput.
Admittedly I can see how games or like JS engines might be a kinda special case here, where the OVERALL compute bandwidth begins to become a concern (almost like an HPC system) and maybe then every line really does count.
Very little of your code is in hot loops. If the code that takes half a millisecond per frame could be twice as fast, but the hot loop is very optimized, then it doesn't really matter. And that's what I would think of by default for having many many cuts. Better to spend the optimization effort elsewhere.
> shared_ptr is a canonical example of death by ten thousand cuts
Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.
Very few programs have "hot loops" as small, contained things. "hot loops" is at this point as much of a bad trope as "but premature optimizations!" is.
> Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.
Why spend months re-writing code you could have just written correctly the first time if a single person had spent an hour messing around with benchmarks to figure out the proper guidance?
> Very few programs have "hot loops" as small, contained things. "hot loops" is at this point as much of a bad trope as "but premature optimizations!" is.
If you say so. I can pretty easily and meaningfully divide my current program into "happens once per frame or less" and "happens hundreds or more times in a row".
> Why spend months re-writing code you could have just written correctly the first time if a single person had spent an hour messing around with benchmarks to figure out the proper guidance?
If they could have figured it out that easily I understand even less how this example works.
Let me put it this way: If a benchmark can change huge swaths of significantly performance important code, then it's "hot enough to bother" by orders of magnitude. I thought we were talking about microbenchmarks for individual cuts!
I generally agree, but it’s also not obvious to me in Rust (or in Go) whether passing by reference or by copy is more maintainable or clear. I guess what I want is some guidance on what I should do by default, which you sort of give with “do what is more maintainable”, but I can’t tell what that means in practice (I’ve been told to default to pass-by-reference in the past because most traits take &self and not self).
> I’ve been told to default to pass-by-reference in the past because most traits take &self and not self
This is only blanket advice for designing traits, because as the trait author you don't know what concrete type the downstream user is going to want to use, and taking `&self` in that circumstance is the choice that is friendliest to both Copy and non-Copy types.
If you're just writing a non-generic function and you do know what concrete types you're using, the flowchart is pretty simple:
1. If the type is not Copy, then pass by-ref if you just need to read the value, pass by-mutable-ref if you just need to mutate the value, and pass by-value if you want to consume the value.
2. If the type is Copy, then pass by-value, but if your type is really big or if benchmarking has determined that this is a critical code path then pass by-ref.
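A minimal sketch of that flowchart in code (the types and function names here are made up for illustration):

```rust
struct Config { name: String }        // not Copy: it owns a String

#[derive(Clone, Copy)]
struct Point { x: f32, y: f32 }       // small and Copy

// Rule 1: non-Copy types.
fn read_config(cfg: &Config) { println!("{}", cfg.name); }  // just read -> &T
fn rename(cfg: &mut Config) { cfg.name.push_str("-v2"); }   // mutate -> &mut T
fn consume(cfg: Config) -> String { cfg.name }              // consume -> T

// Rule 2: Copy types pass by-value by default.
fn length(p: Point) -> f32 { (p.x * p.x + p.y * p.y).sqrt() }
```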
I would still consider myself a go novice, but I have been burned a number of times passing simple objects by reference and then that object gets mutated causing subtle bugs. Also, go is happy to blow your foot off if you take the reference of a loop variable. Although, there is a proposal to fix that.
Generally I find that fewer bugs get introduced when using copy instead of pass by reference, but I’m sure others have the opposite opinion.
Usually (read: near universally) something that provides interior mutability will implement a runtime check to protect you (eg, a lock), but you could make a footgun if you were cavalier about it. And recently I misused a framework (I kept a value returned by the framework around that should've been dropped) without realizing it was holding locks internal to the framework, and ended up with a tricky-to-diagnose deadlock (though it was only tricky because I hadn't read the documentation as closely as I should've).
I have had the same results. Passing by copy is simpler and less bug-prone and reduces the urge to "just set the value since I have a reference to the object" which is a well-paved road to significant pain.
And the objects have to get surprisingly large before passing by reference really makes a difference.
That kind of problem doesn't happen in Rust though, because if all you have is a reference then you cannot mutate anything. You can only mutate things if you have a mutable reference to them, which makes the caller jump through hoops to acquire one for you. They may have to have ownership of the thing, or use a Mutex or RWLock to acquire exclusive access to it, before they can call your method. Nothing can be mutated by accident, or in a surprising place.
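A minimal illustration of that point (hypothetical names):

```rust
struct Counter { n: u32 }

fn read_only(c: &Counter) {
    // c.n += 1; // compile error: cannot assign through a `&` reference
    println!("{}", c.n);
}

fn bump(c: &mut Counter) {
    c.n += 1; // fine: the caller had to hand over exclusive access
}

fn main() {
    let mut c = Counter { n: 0 };
    read_only(&c);
    bump(&mut c); // the possibility of mutation is visible at the call site
}
```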
Yeah, const semantics are one of many things I would have preferred to have in Go over generics. You can sort of emulate them with pass by copy, but (1) that has performance implications and (2) if your data contains a reference, that can still be mutated.
Ouch, I actually feel Ada was right in having the for-loop 'variable' be constant inside the body of the loop. If you want a modifiable loop variable, use a loop/while loop, but at least the first question in peer-review becomes 'why not a for-loop here'?
I don’t think that solves the problem because the loop itself still needs to mutate the variable. The issue often arises when people fork a new goroutine that closes over the loop variable inside the loop body—the loop often completes before any of the threads start running, so many/all of the threads read the loop variable at its max value (e.g., if we are forking 10 goroutines that just print their loop variable and quit, then there’s a good chance every goroutine will print 10 rather than printing 1-10).
Wouldn't any sane fork system capture the value of the loop variable instead of a reference to it? Why would anyone ever want to capture a reference to a loop variable that will be moved under one's feet? I'm sorry I'm struggling to understand what we're doing here and why.
No, the capturing is done by the closure and as far as I know capturing by reference is pretty standard for closures across languages. The forking system just runs the closure in a thread, agnostic to any capturing details.
I don't think one of them is more clear or maintainable universally, but in a lot of contexts, there might be an obvious choice. As a trivial example (that isn't quite fair given that the topic is about structs), it will almost never be more clear or maintainable to pass a shared reference to an integer (although there may be cases where a mutable reference might make sense). I don't think there's much need for one to have precedence over the other by default; if anything, I see the discussion about performance tradeoffs not being worth fretting about in the absence of actual measurement to be an argument _against_ one of them being inherently preferable.
Yeah totally agree it's not always/usually obvious. But there are cases where there's a clear readability/assumed-performance tradeoff, and in those cases I say always prefer readability (unless you benchmark).
> but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access
I really dislike these takes. I see engineers optimize these cases and then go ahead and make two separate SQL queries that could be one, ruining any false optimization gains they got by lord knows how many times.
Yeah, you can loop over that 100 element list twice doing basic computation if you want, it's not going to make a difference for many engineering workloads, but could make a big difference in readability.
This x 10000! If I had a dime for every time I provided this exact feedback in code reviews… I find it surprising that a lot of devs in the tech industry are obsessed with pointless micro-optimizations and don’t care about writing maintainable, testable, simple code. My final comment is always to not try to outsmart compilers/JVMs because they tend to do a much better job than developers.
Please, don’t optimize unless you have reasons to do so and numbers backing that up.
My advice is the opposite: if you want to make performance justifications for code, you need a benchmarking suite. I have them for a lot of my projects. (Rust’s criterion is a delight). A good benchmark suite is a subtle thing to write - you want real world testing data, and benchmarks for a range of scenarios. The benchmarks should be run often. For some changes, I'll rerun my benchmarks multiple times for a single commit. I benchmark the speed of a few operations, serialisation size, wasm bundle size and some other things.
Having real benchmarking data is eye opening. It was shocking to me how much the wasm bundle size increased when I added serialisation code. The time to serialise / deserialise a big chunk of data for me is 0.5ms - so fast that it’s not worth more microoptimizations. Lots of changes I think will make the code slower have no impact whatsoever on performance. And my instincts are so often wrong. About 50% of microoptimizations I try either have no effect or make the code slightly slower. And it’s quite common for changes that shouldn’t change performance at all to cause significant performance regressions for unexpected reasons.
I’ve also learned how important “short circuit” cases can be for performance. Adding a single early check for the trivial case in a function can sometimes improve end to end performance by 15-20%, which in already well tuned code is massive.
Performance work is really fun. But if you do performance tuning without measuring the results, you’re driving blindfolded. You’re as likely to make your code worse as you are to make it better. Add benchmarks.
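For anyone who hasn't used it, a criterion benchmark is only a few lines. A minimal sketch (the serialize function is a made-up stand-in; Cargo.toml needs `criterion` under [dev-dependencies] and a [[bench]] entry with `harness = false`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for whatever operation you actually care about.
fn serialize(data: &[u64]) -> Vec<u8> {
    data.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn bench_serialize(c: &mut Criterion) {
    let data: Vec<u64> = (0..10_000).collect();
    c.bench_function("serialize 10k u64s", |b| {
        // black_box stops the optimizer from deleting the work under test
        b.iter(|| serialize(black_box(&data)))
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```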
This is true for application code. But, Rust is trying to sell itself as a systems language and an embedded language and a language you can write kernel modules in. Memory budget matters in these cases.
If memory budget matters, you have a memory budget. So you should be measuring and you can actually tell where you need improvements.
But in practice what we see overwhelmingly is that people want to do this stuff but they aren't measuring, because measuring is boring whereas making the code more complicated to show off how much you think you know about optimisation is easy. Knock it off.
Works really nice for me, of course, I actually measure what I'm doing.
You edited your comment, so I guess I will too: Rust doesn't actually have linear types. Linear types ("must use or compile error") would be tricky to provide; Aria blogged about it back in the day. So that's definitely going to be a problem with your refactoring.
Yup, we have a good benchmarking suite to make sure changes don't cause regressions and that optimization changes actually work.
That said, I think asking a developer to write everything they do twice so that they can A/B test is overboard. You can come back and really aggressively optimize later, but I think the "default" should be the fast thing, rather than the slow & easy thing.
Time to market can matter in fintech as much as anywhere else.
Edit: also optimizations often interact with each other and you can't always benchmark all possible combinations, so sometime you do have to rely on experience or reasoning from first principles to decide if an optimisation is worth it or not.
We could, yeah. I'm sure you're capable of re-writing everything you do twice. It's just a huge waste of time.
You get just as much benefit by assigning a performance refactor to an engineer when needed vs literally halving (or worse) the whole team's productivity.
When you're writing in a language like Rust or C++, you know broadly what the compiler optimizes and what it probably won't. You do need to do the work yourself a lot of the time.
One neat thing here is that the compiler is aware of which types are `Copy` and not internally mutable (not containing an `UnsafeCell`). For these types, passing `&T` and `T` are equivalent, so the compiler could just choose the faster option.
Even if it's not smart enough to do that today, it could implement this optimization in the future. This could work even without inlining, since the Rust calling convention is unstable, and an optimization based on type size could be incorporated into it.
It would be nice if Rust could do this, but it breaks backwards compatibility. Some existing code depends on pointer values of &T being equal or not equal.
As an addendum, LLVM can automatically perform the “&T-as-T” optimization (without inlining) in some cases where the callee function is in the same compilation unit and known to not care about the pointer value. However, these types of optimizations tend to be fragile, easily disturbed when things get slightly complex.
Indeed, but the compiler is still capable of doing it on a case-by-case basis. Quite often the observed semantics are identical and it's easy for the backend to see that a pointer has been created only to be immediately dereferenced.
This seems to be an unpopular opinion, but I feel similarly about how people sometimes toss out `inline` (and, even more suspect, `inline(always)`) annotations on Rust functions like candy on Halloween, and there are almost never any actual measurements of whether it helps in the cases where it's used. It's not even that I think it really hurts that much in most of the stuff I've worked on (which tends to be more sensitive to concurrency design and network round trips), but I can't help but worry that people using stuff like this when they don't seem to fully grasp the implications is a recipe for trouble.
Yeah inline is an absolute classic for this. The number of uncommented __attribute__((always_inline))s I see in C code drives me crazy. There are absolutely legitimate reasons to use that attribute but there should ALWAYS be a comment about why, so that later readers know in what conditions they can safely remove it.
In the old days I saw C code with massive overuse of the "register" keyword, meaning this variable should be kept in a register. Someone had over 12 of these in one function, back when that was waaaay more variables than processors had registers.
Good thing modern register allocation by compilers makes this irrelevant.
Even if you do benchmark something, maintainability can be more important than a marginal performance improvement.
I've seen this happen a lot with JavaScript, particularly in the last 5-10 years as JS engines have developed increasingly sophisticated approaches to performance. Today's optimization can be tomorrow's de-optimization. Even given an unchanging landscape of compiler/interpreter, tightly-optimized code can become de-optimized when updated and extended, as compared to maintainable code that may not suffer much performance degradation upon extension.
How are you using the word simpler? Because to me that implies a combination of being more obvious and having fewer lines of code. Something that a benchmark shouldn't be involved in.
For example, asking someone to delete 10 lines of code and instead use go's `net.SplitHostPort` would be an example of "simpler".
I've read that good generals worry about tactics and great generals worry about logistics.
Good programmers play code golf, great programmers write readable and maintainable code.
Your example seems reasonable but programmers also like to act like the smartest one in the room. I often come across tricky and borderline obfuscated code because somebody wanted to look clever. This is a logistical nightmare.
Ugh, you are right but then someone comes and uses this to rationalize not including things like map, filter and reduce in a language because they are supposedly too complicated and you can just do it with a for loop
I work in a Rust codebase that uses a lot of functional functions, and I’ll say this: on average the imperative style takes less lines of code and less indentation. I also find it more readable personally, and idiomatic.
Just because we’re on the topic of performance: the rust optimizer can sometimes generate better code if you use map / filter / etc. The slice iterator in any context is a huge win over manual array iteration because it only needs to do bounds checking once.
Javascript (v8, last I checked) is the opposite. Simple for loops almost always outperform anything else.
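To make the Rust claim concrete, the comparison is roughly this pair (a sketch; whether the indexed version's checks actually get elided depends on what the optimizer can prove):

```rust
// Indexed loop: every `v[i]` is a potential bounds check the compiler
// has to prove away.
fn sum_indexed(v: &[f32]) -> f32 {
    let mut total = 0.0;
    for i in 0..v.len() {
        total += v[i];
    }
    total
}

// Slice iterator: no per-element indexing, so nothing to prove.
fn sum_iter(v: &[f32]) -> f32 {
    v.iter().sum()
}
```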
I've seen cases where an iterator was better, but I've also seen gains from using an imperative loop with manual indexing. Loop conditions and the occasional assertion can be enough to elide bounds checks. (Though sometimes the compiler gets too paranoid about integer overflow.)
Most of the time you should just write whatever's clear/convenient but sometimes it's worth trying both and scrutinizing godbolt.
Without knowing the domain, you have no way of knowing that.
It’s also much easier to stick to for() loops in javascript as you code than it is to rewrite everything later when tuning for performance. If that’s something you expect to need to do.
Functional iteration is good for the same reason we use for loops over while loops, and while loops over goto: they are more constrained, more clearly communicate intent, and are therefore easier to reason about.
Yeah, I agree with that. Especially reduce/fold, which I find is almost always better written as loop. Filter would be a good example of the opposite for me: almost always much clearer written functionally.
I suspect this has more to do with the lack of first-class sequence comprehensions in the syntax. If you had to write imperative style, but all loops were HOFs, it would hardly be ergonomic, either. OTOH a good query language is much more readable.
That's what they're saying: if someone says anything more complicated is faster, they challenge them to benchmark it. Usually, it turns out whoever argues the "is faster" point doesn't bother to benchmark it and the simpler code-wise thing wins out. So yes - the benchmark goes to performance, simplicity is in lines-of-code, cyclomatic, "in the eye of the beholder" or whatever other metric you choose, but usually it's obvious.
> I always ask them to either prove it or write the simple thing
This is cumulative though; one instance may not make much difference, but 100,000 of them will. There's that story of Chrome slowing down that was eventually discovered to be caused by strings, each one of which was too little to make a difference on its own.
> I always ask them to either prove it or write the simple thing.
Even if they do, they also need to make a case that in this specific case, performance matters enough to pessimize code simplicity and maintainability.
Also, if performance is that critical, it's imperative to benchmark again after each compiler release to guard against codegen regressions. And benchmark after changing this piece of code. Otherwise, we can say that performance doesn't really matter.
I think you're missing the point of the previous comment, they are saying a good proxy for it being worth it to optimize, is if you're willing to benchmark it.
My favorite way of thinking. It should also be applied to question whether a given feature or function needs to exist at all, where the best outcome is deleting the whole thing.
I don’t feel like this gave a satisfactory answer to the question. Since everything was inlined, the argument passing convention made no difference in the micro benchmarks. But what happens when it does not inline? Then you would actually be testing by-borrow vs by-copy instead of how good rust is at optimizing.
> Then you would actually be testing by-borrow vs by-copy instead of how good rust is at optimizing.
I don’t think the question is actually: “what is faster in practice, a by-copy method call or a by-borrow method call”, I think the question is: “as an implementer, which semantics should I choose when I’m writing my function”.
For the second question: “Rust is usually pretty good at aggressively inlining, so… if you’re willing to trust Rust’s compiler, you’re often okay going with by-copy implementations, but you should keep an eye on it”. Whereas, as you note, for the first question it’s not an answer.
But, I do think if someone was going to put more work into it I’d be very curious what the answer to the first question is. If I’m choosing to implement with by-copy semantics and trusting the Rust compiler to hopefully inline things for me, I’d like to know the implications in the cases when it doesn’t.
Blog author here. This feels like the best summary in this comment section.
The root question is indeed “what semantics should I use”. And the answer I came up with was “the compiler does a lot of magic so by-copy seems pretty good”. I agree with the previous commenter this is not a satisfying conclusion!
My experience with Rust is that it requires a moderate amount of trust in the compiler. Iterator code is another example where the compiler should produce near optimal code. Emphasis on should!
When the value size is small (whatever "small" means for a particular architecture) I'd say the "trust the compiler" suggestion is reasonable. When the size grows there should be no more "trust", unless the compiler can decipher whether it is safe to use a ref instead of a value based on value size (we assume that the function does not mutate the value).
> When the size grows there should be no more "trust", unless the compiler can decipher whether it is safe to use a ref instead of a value based on value size
I believe that the Rust compiler at least does exactly that. Large structs will be passed by reference under the hood even if it passed by value in the code. I suspect C++ compilers do the same, although I'm not sure about that.
>"Were the other benchmarks run in debug mode / with optimizations turned off or something like that?"
Why would I do something like that? Of course all builds are release mode, optimize for speed.
Rust - Windows - By-Copy: 14124, By-Borrow: 8150
C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396
>"Why is performance so much better in this case?"
Not sure and not in a mood to investigate. I do know if cache locality and branch prediction stars line up properly the performance difference can be staggering. Maybe LLVM has accomplished something nice in this department.
Such stark difference in performance is really fishy and needs a deeper analysis. I just checked it with different compilers (all based on LLVM) and here are the results:
Is this Windows or Linux? Also what CPU? Your results for rust on LLVM15 are pretty much close to mine but clang++ on LLVM15 is almost twice as slow. I really want to find reason.
Sorry for being dumb, I thought I'd enabled it in ./cargo/config.toml for rust. Checked it again and it appears that I did not (the file name was wrong). -march=native for Clang also did better than just avx2
No need to bother with LLVM16 and / or GCC. Enabling avx2 for Clang compiler produced following results that are more or less in line with what was expected.
Either is way faster than the result from the original post, and no, rust does not win this "competition". C++ is a bit faster (see also the results from the other poster above) but not by much.
It may be the default fp mode for msvc is causing C++ to suffer here ( https://learn.microsoft.com/en-us/cpp/build/reference/fp-spe... ). It looks like the default behavior is quite conservative in what it allows to happen - like it won't even use FMAs with the default behavior?
I feel like they got excited by their C++ code being so much slower and curious about the "weird" C++ result and forgot to figure out the original question.
Rust - Windows - By-Copy: 14124, By-Borrow: 8150
C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396 Delta: -0.0227428%
So it appears that C++ - Windows LLVM 15 beats Rust by large margin.
In Rust it's considered idiomatic to pass things by-value whenever you can. Usually this is also the most performant option, since it avoids dereferencing in the callee.
Of course, if your struct is truly enormous, you may want to break this rule to avoid large copies. But in that case you probably want to Box<T> the struct anyway.
Of course, if your struct contains something that can't be copied--like a Vec<T>--you'll have to decide whether to clone the whole struct (and thus the vector in it), pass the struct by-borrow, or find some other solution.
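For example, something like this (hypothetical struct) is where you make that call:

```rust
#[derive(Clone)]
struct Settings {
    name: String,
    values: Vec<f64>, // not Copy, so the whole struct isn't either
}

// By-borrow: cheap to pass, and the caller keeps ownership.
fn total(s: &Settings) -> f64 {
    s.values.iter().sum()
}

// By-value: consumes the struct (or forces the caller to .clone(),
// which also copies the Vec's contents).
fn into_values(s: Settings) -> Vec<f64> {
    s.values
}
```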
When the compiler can see both the callee and the caller (the most common case), why should it even matter? If you pass by value, but don't actually do anything that mutates the copy, and there are no aliasing concerns, surely the compiler can make it by-ref under the hood?
(The other direction is trickier, since by-ref implies the desire to observe aliasing, even though that's usually not expected in practice - but the compiler cannot tell.)
> Safely borrowing things by reference is one of Rust's headline features
Sure, but it's worth noting that references in Rust do not exist merely to avoid passing by-value. They also exist to make it easier to deal with Rust's ownership semantics: they let you pass things to a function without also requiring the function to "pass back" those things as returned values. In other words, references let you do `fn foo(x: &Bar)` rather than `fn foo(x: Bar) -> Bar`. This is a unique and interesting consequence of languages with by-default move semantics.
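In code, the two shapes look like this (a minimal sketch):

```rust
struct Bar { n: u32 }

// Without references, ownership has to be threaded through and back:
fn foo_roundtrip(x: Bar) -> Bar {
    println!("{}", x.n);
    x // hand the value back so the caller can keep using it
}

// A shared borrow avoids the round trip entirely:
fn foo(x: &Bar) {
    println!("{}", x.n);
}
```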
This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out". The compiler can then decide how to best implement it for that specific size and architecture.
So far as I can tell, it's not quite the same thing since these still have pointer semantics (and thus have to deal with aliasing etc). The in/out approach is more generic, since "in" can map to a pointer where it makes sense, and to a copy where it does not.
Better yet when you prohibit such arguments from aliasing (or at least make no-alias the default) - now the compiler can also implement "in out" by copying the value back and forth, if it's faster than indirection.
It can return multiple values, so this doesn’t matter much for value types, but it would be nice to be able to specify that a pointer arg is an out-param sometimes and enforce that it is not read from while handling allocation in the caller.
The difference between passing by reference vs. by value is observable when comparing pointers to the original vs. to the argument. This difference may be unobservable in Ada though (not sure), so Ada would have more freedom choosing between the two.
A question to the Rust experts, would lifetime annotations 'a in Rust have similar benefit as "in" or "in out" or "out" in Ada and other languages? With the additional benefit in Rust where the compiler can deduce those automatically for most cases?
As a sibling comment points out, "in" is effectively equivalent to "&T", and "inout" is effectively equivalent to "&mut T". Rust is missing purely "out" parameters, but that isn't a very common case, and I'm not sure how much value there is in saying "this reference can't be read" since references are always guaranteed to be valid in Rust.
Out params are needed to ensure zero-copy in-place initialization of large objects, i.e. what C++ does with placement new. You can sort of fake it with MaybeUninit<> https://doc.rust-lang.org/std/mem/union.MaybeUninit.html but it's quite fiddly, requires unsafe code, and has undesirable effects on the layout of containing objects.
Dlang can also qualify parameters as in, out, and inout; although I don't know to what degree the compiler is able to use that for optimization purposes (it is used for safety checks IIRC)
At the minimum, the compiler can look at the size of the type to decide whether copying or indirection makes more sense. The size is stable for a given parameter signature and architecture, so the ABI is stable.
But also, the ABI only matters between two pieces that are separately compiled. A static binary optimized at link-time doesn't have to care.
This is one of those questions where you really, honestly, do need to look at a very low level.
Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using `*dst++ = *src++` loops rather than the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.
The simple loop was something like X cycles/byte, while the library version was P + (Q cycles/byte) but the difference was such that the crossover point was about 8 bytes. So, scraping out the simple memcpy implementations from the code was about a weekly thing for me.
At this point, we discovered that our C compiler would pass structs by value (This was the early-ish days of ANSI C and was a surprise to some of my older coworkers.) and benchmarked that.
And discovered that its copy code was worse than the simple `*dst++ = *src++` loops. By about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)
If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.
I always prefer by-borrow. That's because in the future this struct may become non-copy and that means some unnecessary refactoring.
My thinking is a bit like "don't take ownership if not needed" - the "not needed" part is the most important thing. Don't require things that are not needed.
If a struct might lose Copy you shouldn't implement Copy at all, to preserve forward compatibility. You can still derive Clone in most cases; using .clone() does not per se add any overhead.
Exactly why IMHO the rust stdlib is so easy to understand. Ownership only when required as a design principle tends to make the design of the overall system more consistent / easier to approach.
> Blech! Having to explicitly borrow temporary values is super gross.
I don’t think you ever have to write code like this. Implement your math traits in terms for both value and reference types like the standard library does.
Go down to Trait Implementations for scalar types, for instance i32 [1]
impl Add<&i32> for &i32
impl Add<&i32> for i32
impl Add<i32> for &i32
impl Add<i32> for i32
Once you do that your ergonomics should be exactly the same as with built in scalar types.
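For a custom type, that looks something like this (hypothetical Vec3, with the reference impls just forwarding to the by-value one):

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 { x: f32, y: f32, z: f32 }

// The by-value impl does the actual work...
impl Add for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 {
        Vec3 { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
    }
}

// ...and the reference impls forward to it, copying the small operands.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { self + *rhs }
}
impl Add<Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 { *self + rhs }
}
impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { *self + *rhs }
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
    // All four combinations now work without explicit derefs:
    assert_eq!(a + b, a + &b);
    assert_eq!(&a + b, &a + &b);
}
```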
True, that does work for traits. But it's super annoying if you have to write multiple copies of the same thing. That can get out of control quick if you need to implement every combination.
And that doesn't help at all if you're writing a "free function" like 3D primitive intersection functions. I suppose you could change that simple function into a generic function that takes AsDeref? Bleh.
It's compiled, so, without any investigation at all, I would have been disappointed if there were any significant difference in the code emitted in these cases. I would expect the compiler to do the efficient thing based on usage rather than the particular syntax. I may have too much faith in the compiler.
I'd expect your claim to be true whenever the callee is inlined into the caller. In this case, the compiler has all the relevant information at the right point in time. As other commenters have pointed out, by enabling inlining the author has gone down a rabbit hole somewhat unrelated to the question, because any copies can be simply elided.
If there's no inlining at play, I'd expect vast differences to be possible. For example, imagine a chain of 3 functions - f calls g, g calls h, where one of the arguments is a 1kB struct and the options are passing by copy or by borrowing. In this case, each stack frame will be 1kB in size in the copy case and there will be a large performance overhead as opposed to the by-reference case. One would expect simply calling the function to be similar in overhead to an uncached memory load.
Within a single crate the inlining is possible; with multiple crates it's only possible with LTO enabled (and I'm not sure how _probable_ it is that the inlining would occur).
In either case, the difference between a 32 byte and 8 byte argument in terms of overhead is likely meaningless - the sort of thing to be optimized if profiling says it's a problem as opposed to ahead of time.
> Within a single crate (more specifically, codegen unit) the inlining is possible
Cross-crate inlining happens all the time. In order to be eligible for inlining, a function needs to have its IR included in the object's metadata. This happens automatically for every generic function (it's the only way monomorphization can work), and for non-generic functions can be enabled manually via the `#[inline]` attribute (which does not force inlining, it only makes it possible to inline at the backend's discretion).
However, as you say, if you have LTO enabled then "cross-crate" inlining can happen regardless, since it's all just one giant compilation unit at that point.
At the VERY end of the article, the author points out "Oh, btw, I used MSVC for the C++ compilation, when I used clang things changed!"
So, what the author actually measured was the difference between llvm and msvc throughout the article. Particularly when they talked about rust being better at autovectorization than C++.
Incorrect. Clang C++ vs MSVC C++ is very comparable, and noticeably worse for f64 by-ref. Clang C++ is still slower than Rust by a large margin. Using Clang C++ throughout would not change any conclusion (or lack thereof).
I'd be interested to know what the benchmarks of the two rust solutions are when inlining is disabled so we can get an idea of the different performance characteristics of each function call even if it's not a very realistic scenario.
The other question I have is which style should you use when writing a library? It's obviously not possible to benchmark all the software that will call your library but you still want to consider readability, performance as well as other factors such as common convention.
I would go with the version that gives the clean user interface (that is, by copy in this case). If it turns out that the other version is significantly more performant and this additional performance is critical for the end users consider adding the by-borrow option.
The clarity of the code using a particular library is such a big (but often under-appreciated) benefit that I would heavily lean in this direction when considering interface options. My 2c.
The assumption behind such arguments is when a performance problem does arise, a profiler will point to a single, easy to fix, smoking gun. Unfortunately this is not always the case. Performance problems can be hard to diagnose and hard to fix. A lot of damage has been done by unexamined / dogmatic "root of all evil" mantra.
In the vast majority of situations (1) you'll prematurely optimize in the wrong place and (2) yes the profiler will point to a single, easy-to-fix smoking gun.
Situations otherwise are the exception, rather than the rule, and it takes an expert to (1) recognize those situations and (2) know exactly how to write optimized code in that situation.
That's why "don't prematurely optimize" is a good rule of thumb - because it works the majority of the time, and it takes experience to know when not to apply it.
> In the vast majority of situations [..] yes the profiler will point to a single, easy-to-fix smoking gun.
[citation needed]
This claim depends hugely on the industry you're actually working in and the problem space. Things like UIs & games basically never have a single, easy-to-fix smoking gun. The entire app is more or less a hotspot - be it interactive performance, startup performance, RAM usage, or general responsiveness.
And once you've gone down the route of "build it first, optimize it later" you're pretty much fucked when you get to the "optimize" step, because now your performance mistakes are basically unfixable without a rewrite - every layer of your architecture has issues that you can't fix without drastic overhauls. It would have been much easier to do some up-front measurements, get some guidelines in place (even if they aren't perfect), and then build the app.
This is ridiculous. If you had the knowledge, you'd apply it, and you can't acquire it without practice, which is going to involve heuristics (which you falsely call "dogma" in order to ad-hominem my argument) like this one, which exists precisely because you need something for when you don't have the specialized domain expertise.
> The true root of all evil is unexamined dogma.
This is so absurd that it doesn't deserve a reply.
This advice hinges hugely on what "start simple" really means. There's a ton of counter-examples here where that just isn't true at all depending on what you're calling "simple". In particular, JIT'd languages can be especially problematic here. An example would be using Java's Streams interfaces to do something that could be done without much difficulty with a regular boring ol' for loop. At the end of the day you're hoping the JIT will eventually compile the streams version down to the same machine code the for loop version would have produced. But it won't do that consistently, and you've still wasted time before it does.
Trusting the compiler also means knowing what the compiler actually understands & handles vs. what's a library-provided abstraction that's maybe too bloated for its own good and that quickly becomes "not simple" depending on your language of choice.
I agree in general, but the side-effect of doing this is that no matter how fast your hardware gets, your software will always end up optimized to the new hardware. So over time your software gets slower and slower but performance stays consistent as hardware gets faster.
> * Sprinkling & around everything in math expressions does make them ugly. Maybe rust needs an asBorrow or similar?
Do you mean AsRef, or do you mean magic which automatically borrows parameters and is specifically what rust does not do any more than e.g. C does?
Though you can probably get both if the by-ref version is faster (or more convenient internally): wrap the by-ref function in a by-value wrapper which is #[inline]-ed. This way the interface is by-value but the actual parameter passing is by-ref (as the value-consuming wrapper will be inlined and essentially removed).
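A sketch of that wrapper pattern (the `dot` function here is made up):

```rust
#[derive(Clone, Copy)]
pub struct Vec3 { pub x: f32, pub y: f32, pub z: f32 }

// The real work takes references...
fn dot_ref(a: &Vec3, b: &Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

// ...and a thin #[inline] facade keeps the public interface by-value.
#[inline]
pub fn dot(a: Vec3, b: Vec3) -> f32 {
    dot_ref(&a, &b) // inlined away, so no extra copy in practice
}
```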
The benchmarks lack the standard deviation, so the results may well be equivalent. Don't roll your own micro-benchmark runners.
References may get optimized to copies where possible and sound (i.e. blittable and const), a common heuristic involves the size of a cache line (64b on most modern ISAs, including x86_64).
Using a Vector4 would have pushed the structure size beyond the 64b heuristic. You would also need to disable inlining for the measured methods.
It was also (needlessly) using 2 different compilers, MSVC and LLVM. This is just a bad way to compare things all around.
And, for simple operations like this, you really should just look at the assembly output. If you are only generating 20ish instructions, then look at those 20 instructions rather than trying to heuristically guess what is happening.
Note that this is from 2019, so it's probably worth re-benchmarking to see if anything has changed in the interim. Can we get the year added to the title?
For this code, the compiler inlined the call. So there should be no difference between pass by copy or pass by reference, which is what was measured. Where it could matter is when the code isn’t inlined. But with small structs it might not matter all that much.
It does sometimes matter though. One optimization I’ve seen in a few places is to box the error type, so that a result doesn’t copy the (usually empty) error by value on the stack. That actually makes a small performance difference, on the order of about 5-10%.
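A sketch of that boxing trick (hypothetical error type; the point is that `Result<u32, Box<ParseError>>` only moves a pointer-sized error slot around on the happy path):

```rust
struct ParseError {
    line: usize,
    context: [u8; 128], // imagine a big diagnostic blob
}

fn parse(input: &str) -> Result<u32, Box<ParseError>> {
    input.parse().map_err(|_| {
        Box::new(ParseError { line: 1, context: [0; 128] })
    })
}
```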
Folks, processors continue to give smaller and smaller gains every year. Something has to give. If you have critical path code that absolutely must max out the core, then this type of analysis (as pedantic as it is) is useful in the long run.
It's not like you can do arithmetic with references, so maybe the ergonomics of by-value vs. by-reference shouldn't really be that different.
The cost of by-value lies in memory copies, while the cost of by-reference lies in dereferencing pointers where the values are needed, which might mean many more memory reads are needed than with by-value (depends on what you're doing). So it's just hard to tell which will do better in general -- there's no answer to that.
For a library, maybe providing by-value and by-reference interfaces should be good (except that will bloat the library). For everything else just use by-value as it has the best ergonomics.
This actually bothers me. I think the rust performance here is praiseworthy. What bothers me is that we piled complexity over complexity at the hardware and compiler levels, and ended up in a situation where you have no way to get a reasonable understanding of how low-level code will perform. Nowadays the main reason to program in a "low level" language is that you know that on average the compiler will be able to do a better job because the language doesn't have abstractions that map poorly to the hardware model. But for much of it you can forget about "I know what the hardware is going to do"
But at the risk of loss of respect, I'll wait for Rust2ShinyNewLanguage to solve this.
All I know is I hope I'm smart enough to understand ShinyNewLanguage's compiler. Or maybe even build it.
I've got several projects that could use some additional Boxes of structures, and borrow instead of move, and maybe a few more complex reference counting mechanics.
Rust forced me to understand what that meant. That's good for building a better engineer.
But it's not fun to work with.
I hope the next experience is better. Sorry Rustaceans.
There is no single answer to this question because it's going to depend completely on call patterns further up. Especially in regards to how much of the rest of the running program's data fits in L1 cache, and most especially in regards to what's going on in terms of concurrency.
The benchmark made here could completely fall apart once more threads are added.
Modern computer architectures are non-uniform in terms of any kind of memory accesses. The same logical operations can have extremely varied costs depending on how the whole program flow goes.
My first thought was "now what is the calling convention for float parameters again? they are passed in registers right? the compiler can probably arrange so they don't have to actually be copied" and then I realized it will probably even inline it.
Anyway, assuming it's not inlined I would guess pass-by-copy, maybe with an occasional exception in code with heavy register pressure.
Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
> Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
FWIW the AMD64 SysV v1.0 psABI allows structures of up to 8 members to be passed via registers. Though older revisions limit that to 2 (and it's unclear whether MS's divergent ABI allows aggregates to be splat at all).
As sad as it's unsurprising, it does not look like LLVM (linux?) has followed up, on godbolt a 2-struct passes everything via registers but a 3-struct passes everything via the stack. Maybe there's a magic flag to use the 1.0 ABI, but a quick googling didn't reveal one. ICC doesn't seem to have followed up either.
How can an abi allow but not require passing them via registers? Isn't the point of a convention that everyone does it the same way? Otherwise the caller could put it on the stack and the callee look for it in a register?
Actually reading the word abi made a little neuron light up. To what extent does rust even follow that abi?
> Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
Depending on calling convention, the structure may be spread out into registers.
Anyone know why seemingly knowledgeable people (like the person who wrote this article) don't use micro benchmarking frameworks when they run these tests?
Also, whenever you do one of these, please post the full source with it. There's no reason to leave your readers in the dark, wondering what could be going on, which is exactly what I'm doing now, because there's almost no excuse for c++ to be slower in a task than rust--it's just a matter of how much work you need to put in to make it get there.
I see now. I looked twice. I think most people stop after a section called "Conclusion" that ends with "Thanks for reading." It doesn't help that the formatting then leaves a large gap between sections that doesn't indicate there's more after that.
For C++ I guess you could make the claim that it's just too annoying to take a dependency on something like google-benchmark or whatever, since C++ dependency management is such a mess to deal with in general.
But yeah I have no idea why a benchmark framework wasn't used for Rust.
The general usability impact matters slightly less than it looks here, in part because the `do_math` with references in the article has two extra &s, and in part because methods autoreference when called like x.f().
Performance-wise, if you're likely to touch every element in a type anyway, err on the side of copies. They are going to have to end up in registers eventually anyway, so you might as well let the caller find out the best way to put them there.
The Rust-test implements the traits Add, Sub, Mul by value. This makes the few references less important in the total test. The ergonomics argument is motivated by using these traits. Otherwise, references would have had the same ergonomics.
But also, the struct is 3x32 bits, and can simply derive the Copy trait. It is barely larger than a u64, which is the size of the reference.
But life is only simpler when Copy and Clone can be derived.
I haven't benchmarked that, but in Rust `ScalarPair`s (i.e., structs that have up to two scalars) are passed in two registers, while bigger structs are always passed by pointer. Therefore, passing bigger structs by move will require the compiler to copy them, while with references it is not required to, so references may be faster in that case.
I understand that this is an example for the purposes of answering the given question, but when actually doing things with 3D vertices one should be thinking in terms of structures of arrays (see the sketch below). As someone said here already: good generals worry about tactics and great generals worry about logistics.
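For the unfamiliar, the contrast looks like this (a sketch):

```rust
// Array of structures (AoS): each vertex's fields are interleaved in memory.
struct Vertex { x: f32, y: f32, z: f32 }
type MeshAoS = Vec<Vertex>;

// Structure of arrays (SoA): each field gets its own contiguous array,
// which is friendlier to SIMD and cache when you touch one field at a time.
struct MeshSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}
```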
You are comparing two completely different compilers; I wouldn't worry all that much about the difference between rust and C++. If you do want to compare them directly, why not use LLVM for C++ as well? That will highlight any language-specific differences.
This is not really surprising in such a case. The Rust compiler is pretty good at optimizing out unneeded copies. Here it does see that the copied value is not used after the function call, so it should simply not emit the copies in the final assembly.
Minor nit: many of the differences in the article aren't really specific to Rust vs. C++, but rather differences between LLVM and whatever compiler backend MSVC uses.
This is one of the problems I have with writing Rust code. You have to think about so many mundane details that you barely have time left to think about more important and more interesting things.
Having written a lot of C: you spend basically all your time thinking about "mundane details", and worse, if you make a mistake, you often don't know it until the code is running on some customer's servers and a ticket escalated three times lands on you with vague information about rare crashes. Good luck remembering which bit of code you wrote six months ago might be causing the problem.
I'll take Rust shouting at me for missing "mundane details" any day of the week.
Well, it is a systems programming language. Thinking about exactly how the language passes bits around is the whole point. Rust should specify a stable ABI already so that everyone can form a good mental model of what their code becomes once compiled.
True, I probably was using Rust for the wrong type of problem; i.e., I was hoping to write a user-level application with a graphical UI at the time.
Rust is probably better used for writing fast low-level libraries that you call from higher-level languages, possibly ones with a garbage collector, so you don't waste time thinking about memory management while you design and write your high-level application.
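A minimal sketch of that pattern, exposing a Rust function over the C ABI so a higher-level language can call it through its FFI (assumes crate-type = "cdylib"; the function name is hypothetical):

    // Build as a cdylib, then load from Python, Ruby, etc.
    #[no_mangle]
    pub extern "C" fn sum_f64(data: *const f64, len: usize) -> f64 {
        // Safety: the caller must pass a valid pointer/length pair.
        let slice = unsafe { std::slice::from_raw_parts(data, len) };
        slice.iter().sum()
    }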
My (brief) experience with Rust was that, while I had to struggle to learn the borrow checker, I didn't have lots of "mundane details" to worry about; if anything, fewer than in C(++).
As a Rust beginner who likes to learn the hard way, I think I have some insight into why Rust seems cumbersome and/or hard for programmers trying it.
Rust uses syntax that feels familiar but means completely different things than it does in pretty much any other language.
For example, '=' doesn't mean assign a handle or make a copy; by default it means move.
'let' doesn't mean create a name for something. It means create physical space for something (of known size) that values can be moved into or out of.
You don't deal with objects and values of primitive types. Instead, everything in Rust is a value. When you move, you move the value. If you compare, you compare by value. If you pass something from a variable into a function, you move the value into the function.
And when the space where you keep the value goes out of scope, the value dies with it, unless it was moved out somewhere else.
Scope for variables (which are just named spaces for values) ends with the end of the block, but some values, created by functions and returned from them, can die sooner if they are not moved into any named space, even in the middle of the line where the function call produced them.
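A small sketch of those semantics in action:

    struct Thing {
        data: String, // String is not Copy, so Thing moves on `=`
    }

    fn consume(t: Thing) {
        println!("{}", t.data);
    } // the value moved into `t` dies here

    fn main() {
        let a = Thing { data: String::from("hi") };
        let b = a;                  // `=` moves: `a` can no longer be used
        // println!("{}", a.data);  // error[E0382]: use of moved value
        consume(b);                 // moves `b` into the function
        // `b` is unusable here; the value already died inside consume()
    }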
Everything else stems from that fixed-size, moved-value semantics. If you don't want to move the value into a function when you call it, you need to pass something else instead, so you create and pass in a borrow. But then you have to ensure that the value doesn't die or get moved anywhere (even inside the container you borrowed it from) before all borrows of it die.
Because of this you are better off with borrows that are short-lived and local. Often it's better to keep the index of an element of a Vec instead of a borrow of that element. If you must create types that contain borrows, know that they become borrows themselves, and you need to treat them exactly the same way, limiting their scope and lifetime.
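For example, a sketch of the index-instead-of-borrow pattern:

    struct Selection {
        index: usize, // instead of holding a &'a element
    }

    fn main() {
        let mut items = vec![10, 20, 30];
        let sel = Selection { index: 1 };
        items.push(40); // fine: no borrow of `items` is outstanding
        println!("{}", items[sel.index]);
        // Had `sel` held `&items[1]` instead, the push above would
        // not have compiled.
    }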
It's hard when you come from any other language, because borrows are superficially similar to pointers or references to objects. So you try to use them as such, and crash into the compiler, because they are not that. What's worse, their syntax is very minimalistic, which triggers the intuition that they must be the fast, optimal solution for many problems. They certainly can be, once you fully internalize their limitations, but not a moment sooner.
Another thing is that values in Rust must have a fixed size. So even as simple a thing as a string requires a bit of hackery. Basically, the default strategy in Rust for something of variable size is to allocate it on the heap and treat a pointer to it (possibly with some other fixed-size data, like a length) as the fixed-size value you can move around, clone, and borrow.
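You can see the fixed-size handle directly: a String is three words (pointer, capacity, length) no matter how long the text is, at least on current implementations:

    fn main() {
        let short = String::from("hi");
        let long = String::from("a much, much longer string than that");
        // Both handles are the same fixed size; the bytes live on the heap.
        assert_eq!(std::mem::size_of_val(&short), std::mem::size_of_val(&long));
        assert_eq!(
            std::mem::size_of::<String>(),
            3 * std::mem::size_of::<usize>()
        );
    }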
So if you want the semantics you know from other languages, you can't just use basic Rust syntax.
You need constructs such as Box and Rc, Cell, RefCell. Make your things clonable, and sometimes even copyable, and initially avoid creating borrows whenever possible. When you do, Rust becomes as flexible as any other language and you can use it pretty much just as comfortably. Then the value semantics shines: you can very easily compare your data by value, order it, create operators for it, and create a hash for it so you can keep it in HashMaps and HashSets. Then it's delightful.
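A sketch of that payoff with a made-up type: once the value-semantics traits are derived, it slots straight into hashed and ordered collections:

    use std::collections::HashMap;

    #[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
    struct Key {
        id: u32,
        name: String,
    }

    fn main() {
        let mut scores = HashMap::new();
        scores.insert(Key { id: 1, name: "a".into() }, 100);
        // Equal by value, hashed by value: lookup just works.
        println!("{:?}", scores.get(&Key { id: 1, name: "a".into() }));
    }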
My advice: when you create a long-lived type, just wrap it in Rc and treat this Rc as your 'object'. And avoid borrows in your types unless you have a very good (measured) performance reason to have them, or you are creating something obviously dependent and usually short-lived, like an iterator.
> For example '=' doesn't mean assign handle or copy. It by default means move.
Some complain about this, but the fact is there's no such thing as a zero-overhead "copy" for non-trivial types. C++ started out with = meaning clone the object, which was an even bigger footgun, and support for move had to be added after the fact.
Yeah. Making = mean copy is a really bad idea. I very much like the solution in Rust, where an attempt to move out something that can't be moved out results in an automatic copy if the type implements the Copy trait.
It's a very elegant solution for simple, small data types. But it further obscures how meaningfully Rust differs from everything else, because thanks to that, = sometimes does mean copy.
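A quick sketch of both behaviors side by side:

    #[derive(Clone, Copy)]
    struct Point { x: i32, y: i32 } // small, so Copy is cheap

    fn main() {
        let a = Point { x: 1, y: 2 };
        let b = a;                  // Copy type: `=` copies, `a` survives
        println!("{} {}", a.x, b.y);

        let s = String::from("hi"); // String is not Copy
        let t = s;                  // `=` moves: `s` is gone
        // println!("{}", s);       // error[E0382] if uncommented
        println!("{}", t);
    }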
Rc, Cell & RefCell are supposed to be rare. For example, I've got a 2,000-line Rust program in front of me and I've used Arc 3 times and RwLock once; that's all.
You need to structure your program as a Directed Acyclic Graph (DAG), with things interacting only with the things below them in the graph.
Then occasionally you might need to break the DAG structure by using Rc, Cell & RefCell, etc...
The thing is, not everything can be expressed as a DAG.
And finding that out towards the end of writing your program, after hours of fighting with the borrow checker, is extremely unpleasant.
And I don't think I ever landed in a situation where I could fix the discrepancy by sprinkling in a few Rcs, RefCells, and such.
So I prefer to write with RefCells from the start, and once I've got the thing working and I'm feeling ambitious, I look at which parts could be borrows instead and swap them out.
> How do you express as a DAG a tree where nodes need to keep references to their children and parents?
Rewrite your program in a form where it does not contain a tree.
If you want an actual tree as a data structure, see the trees crate.
> Rust is not Forth. I can write whatever I want and there's nothing wrong with that.
And other people write Haskell code in Python :p. If your code style doesn't match the language you are using you are going to have a lot of unnecessary friction.
Sibling and parent pointers are almost universally a sign that an abstract data structure (and associated algorithm) has been mistaken for concrete. The exception that comes to mind first is Knuth's dancing links, and its obscurity is an indication of the rarity of actually needing these pointers. In any case, it's also a poster child for using indices rather than pointers.
The main objects in my program are expression trees. I manipulate them, cut them, merge them, compare them, splice one into the other. Rc's enable me to have full flexibility and share a tremendous amount of data across objects in my program.
Rust is an absolutely wonderful language for this problem, thanks to Rc's, enums, value semantics, auto-derived traits, the ability to implement traits for existing types, and of course speed.
I'm not implementing specific algorithms; I'm making them up as I go, although I've used some simple ones like topological sort, or A* that eventually turned into plain breadth-first search because I have no idea how far I am from the solution.
> I'm not implementing specific algorithms. I'm making them up as I go
It's mind-boggling to me that people are using a systems programming language for mathematical research, especially if they don't know yet what the final algorithms will look like.
If you further wrap the Rc in an Option, you can set crosslinks and backlinks to None when you are dropping your data, to get rid of the problem of crosslinks or backlinks making reference counting leak memory. Then you just need to be mindful not to lose your handle to a cycle of nodes before you break the cycle by setting some crosslinks to None.
You can fairly easily refactor your almost-tree code to adapt it to that additional Option wrapper.
Of course you might instead opt to introduce a garbage-collector crate into your project. These usually provide a garbage-collected Rc equivalent, which makes swapping it out very easy.
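A hedged sketch of the Option approach (Weak is the more conventional tool for back-links, but this shows the idea):

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        value: i32,
        next: RefCell<Option<Rc<Node>>>,
    }

    fn main() {
        let a = Rc::new(Node { value: 1, next: RefCell::new(None) });
        let b = Rc::new(Node { value: 2, next: RefCell::new(Some(a.clone())) });
        *a.next.borrow_mut() = Some(b.clone()); // a -> b -> a: a cycle

        // Break the cycle before the handles go away, or both Rcs leak.
        *a.next.borrow_mut() = None;
    }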
Rc's really are a very useful first approach to making anything complex in Rust.
I usually have something like

    struct NodeStruct {
        my_data: i32,
        link: Node,
    }

and

    struct Node(Rc<NodeStruct>);

or

    struct Node(Option<Rc<NodeStruct>>);

if I need cross-links.
The great thing is you can then add 'methods' to your type with

    impl Node { /* ... */ }

or define operators and other traits with:

    impl Add<Node> for Node { /* ... */ }

Sometimes, when I need mutability, I even wrap the NodeStruct in a RefCell.
It seems like a lot of wrappers, but thanks to them you can have very nice code that uses this type, with pretty much 'normal modern language' semantics plus value semantics, and it is still blazing fast.
When you implement Ord, Eq, and Hash, they all go through all the wrappers and let you treat your final type Node as a comparable, sortable, hashable, and cheaply clonable value. Dereferencing also goes through all or most of the wrappers automatically.
The dependency is on the ABI, which can be OS-dependent. Also, it is a depressingly manual optimization to make: compilers don't know when it is safe to change a reference to a copy (for example) without an analysis of downstream code that they don't perform.
C and C++ are also set up such that the compiler can do that optimization; they just don't. I'm pretty sure the Rust compiler is in the same boat: it has the information but doesn't do the optimization.
Because Rust is a compiled language, which means you compile your code for a certain architecture. Who said anything about syscalls? There are systems that don't use syscalls.
There are no syscalls or equivalent operating-system calls in the code paths measured. The architecture is also independent of the operating system, with exceptions in some languages for the calling convention (not in Rust, AFAIK; or at least Rust makes no guarantees there).
However, in practice Rust's calling convention does actually depend on the operating system: on Linux, Rust will make use of the stack red zone, while on Windows it doesn't. (Some codegen in LLVM also depends on the operating system.)