A potential lesson here (i.e. I am applying confirmation bias to retroactively view this article as justification for a strongly held opinion, lol):
Unless you are gonna benchmark something, for details like this you should pretty much always just trust the damn compiler and write the code in the most maintainable way.
This comes up in code review a LOT at my work:
- "you can write this simpler with XYZ"
- "but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access/some other combination of assumptions about what the compiler will generate and what is slow on a CPU"
I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.
Blog author here. I somewhat agree, somewhat disagree. This line makes me uneasy:
> I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.
One of my philosophies is that death by a thousand cuts is fine, but death by ten thousand cuts isn’t. A team of 10 engineers can probably fix most of a thousand cuts in two or three months. But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.
Now for some context, my background is video games. Which means the team knows they need to hit an aggressive performance bar. This isn’t true for many projects. shared_ptr is a canonical example of death by ten thousand cuts.
That said, I strongly agree with the principle of “just do the simple thing”. However I think it’s important to have “sane defaults”. A project can easily have a thousand or ten thousand papercuts that kill performance. But you can’t microbench every tiny decision. And microbenches are only a vague approximation of what actually matters.
I’m also wary of “the compiler will make it fast”. Because that’s true… until it’s not! Although these days you don’t have any choice but to lean heavily on the compiler and “trust but verify”.
No one wants a super complex solution if it’s not needed. However I am very amenable to “do a slightly more complex thing if you know it’s correct and we can never think about this ever again”. It’s much easier to do the fast thing upfront than for someone else to try and speed it up in two years when we’re doing a papercut pass.
> But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.
I am reminded of the lovely nanosecond/microsecond talk by Grace Hopper. If your code does a little bit of setup and then spends all of its time in a single hotspot, fine. But if your code is full of microsecond-suboptimal speed bumps, they can hide your hotspot altogether. And a flat-ish flame graph looks fine: nothing stands out as a problem!
It's valuable to do micro-benchmarks, not just to hone your optimization skills, but to learn optimal patterns in your language of choice. Then, when you're "in the zone" and laying down new code, you just do the optimal thing reflexively. Or, when you're reviewing or rewriting something, those micro-hotspots jump out and grab your attention.
There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.
> There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.
Laziness in terms of using entirely inappropriate algorithms, sure.
Laziness in not microbenchmarking minutia? It shouldn't be. There's a limit on how much that can hurt you. I would say much less than a factor of ten, but let's go with 10x just for argument's sake. If you have a CPU that's 500x faster, and use easy code that's 10x slower, you're doing just fine. This is not the problem with modern unresponsiveness.
When I rewrite python code in C, I often hit 1000x speedups, and sub-100x is rare. And that's line-for-line. When I fix an accidentally-quadratic issue, for example, I've seen speedups in the billions without even changing the language.
People have lionized Knuth's quote about premature optimization, and used that to ignore performance issues across the board. Since the early '00s, we have not seen a 500x improvement in CPU speed. It's less than 2x on frequency, and let's say 8x on core-count for most users (which doesn't help your single-core lazy programmer). In my experience, programmers will make projections based on a 500x-faster processors that will never arrive, because it's easier than honing their skills and keeping them sharp. And even if these magical THz-frequency chips arrive, if you have three layers of 10x slowdowns, you're back down to GHz.
This has been my experience too. I wrote a text CRDT last year which improved on automerge’s (then) 5-minute runtime. My code currently takes 6ms to do the same work.
Automerge’s design assumed this stuff would always be slow, so they had this whole frontend / backend code split so they could put the expensive operations on worker threads. Good optimizations in the right places make all that complexity unnecessary. The new automerge is shaping up to be simpler as well as faster.
> When I rewrite python code in C, I often hit 1000x speedups, and sub-100x is rare. And that's line-for-line. When I fix an accidentally-quadratic issue, for example, I've seen speedups in the billions without even changing the language.
And neither of those is a microbenchmark thing, which is kind of my point. I'm surprised language would hurt that much, but that's enough to break things on its own without any layering.
> Since the early '00s, we have not seen a 500x improvement in CPU speed. It's less than 2x on frequency, and let's say 8x on core-count for most users (which doesn't help your single-core lazy programmer).
I don't think people are talking about 2004 when they talk about the responsiveness of ancient software on ancient hardware. I interpret that as more like an Apple II. But instructions per clock have also gone up a lot since the Pentium 4 days, and having more than one core in your CPU has a huge impact even for single-threaded programs.
> And neither of those is a microbenchmark thing, which is kind of my point. I'm surprised language would hurt that much...
The point I'm making here is that every line matters -- not just the hotspots. If you're surprised that language can have that much impact, perhaps it's time to learn a bit about performance issues that you're being dismissive of?
> I don't think people are talking about 2004 when they talk about the responsiveness of ancient software on ancient hardware.
No, I was responding to your mention of a 500x improvement in hardware. That pipedream ended in the early 00s, and people still talk like Moore's law will absolve their inattentive coding practice. And that felt fine in the decades we went from kHz to GHz, but it's unacceptable today.
> The point I'm making here is that every line matters -- not just the hotspots.
That depends on how it would have performed if you only transformed the hottest 10% into C.
But I was mainly responding to the idea that micro-optimizations are needed to keep general software snappy, and I don't think they are. If one language is that much faster, that's not micro-optimization.
> No, I was responding to your mention of a 500x improvement in hardware.
What do you mean "No"? I was talking about current hardware being 500x faster than 1988 hardware, which it is. If that's not what you meant by "ancient software on ancient hardware", fine, but that's what my 500x was talking about.
> people still talk like Moore's law will absolve their inattentive coding practice
I'm not trying to excuse inattentive coding. I'm trying to say certain kinds of attention are important and others aren't.
Python is easily 200x slower than C for equivalent code. The reason why Python is usable is because it uses a lot of C libraries that do the heavy lifting.
> And a flat-ish flame graph looks fine: nothing stands out as a problem!
If your program is still slow, that would also indicate that everything is a problem, i.e. the ten thousand cuts. Start optimizing at some obvious spots and then see what happens.
Haha "death by a thousand cuts" is exactly the phrase I encounter in these debates!
And actually I still disagree - e.g. I once took over a DMA management firmware and the TL told me "we are really trying to avoid DBATC so we take care to write every line efficiently". But the thing was that once you have a holistic understanding of the system's performance you tend to find _only a small fraction of the code ever affects the metrics you care about_!
E.g. in that case the CPU was so rarely the bottleneck that it really didn't matter, we could have rewritten half the code in Python (if we'd had the memory) without hurting latency or throughput.
Admittedly I can see how games or like JS engines might be a kinda special case here, where the OVERALL compute bandwidth begins to become a concern (almost like an HPC system) and maybe then every line really does count.
Very little of your code is in hot loops. If the code that takes half a millisecond per frame could be twice as fast, but the hot loop is very optimized, then it doesn't really matter. And that's what I would think of by default for having many many cuts. Better to spend the optimization effort elsewhere.
> shared_ptr is a canonical example of death by ten thousand cuts
Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.
Very few programs have "hot loops" as small, contained things. "hot loops" is at this point as much of a bad trope as "but premature optimizations!" is.
> Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.
Why spend months re-writing code you could have just written correctly the first time if a single person had spent an hour messing around with benchmarks to figure out the proper guidance?
> Very few programs have "hot loops" as small, contained things. "hot loops" is at this point as much of a bad trope as "but premature optimizations!" is.
If you say so. I can pretty easily and meaningfully divide my current program into "happens once per frame or less" and "happens hundreds or more times in a row".
> Why spend months re-writing code you could have just written correctly the first time if a single person had spent an hour messing around with benchmarks to figure out the proper guidance?
If they could have figured it out that easily I understand even less how this example works.
Let me put it this way: If a benchmark can change huge swaths of significantly performance important code, then it's "hot enough to bother" by orders of magnitude. I thought we were talking about microbenchmarks for individual cuts!
I generally agree, but it’s also not obvious to me in Rust (or in Go) whether passing by reference or by copy is more maintainable or clear. I guess what I want is some guidance on what I should do by default, which you sort of give with “do what is more maintainable”, but I can’t tell what that means in practice (I’ve been told to default to pass-by-reference in the past because most traits take &self and not self).
> I’ve been told to default to pass-by-reference in the past because most traits take &self and not self
This is only blanket advice for designing traits, because as the trait author you don't know what concrete type the downstream user is going to want to use, and taking `&self` in that circumstance is the choice that is friendliest to both Copy and non-Copy types.
If you're just writing a non-generic function and you do know what concrete types you're using, the flowchart is pretty simple:
1. If the type is not Copy, then pass by-ref if you just need to read the value, pass by-mutable-ref if you just need to mutate the value, and pass by-value if you want to consume the value.
2. If the type is Copy, then pass by-value, but if your type is really big or if benchmarking has determined that this is a critical code path then pass by-ref.
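A minimal sketch of that flowchart in code (the types and function names here are made up for illustration):

```rust
struct Config { name: String }        // not Copy: it owns a String

#[derive(Clone, Copy)]
struct Point { x: f32, y: f32 }       // small and Copy

// Rule 1: non-Copy types.
fn read_config(cfg: &Config) { println!("{}", cfg.name); }  // just read -> &T
fn rename(cfg: &mut Config) { cfg.name.push_str("-v2"); }   // mutate -> &mut T
fn consume(cfg: Config) -> String { cfg.name }              // consume -> T

// Rule 2: Copy types pass by-value by default.
fn length(p: Point) -> f32 { (p.x * p.x + p.y * p.y).sqrt() }
```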
I would still consider myself a go novice, but I have been burned a number of times passing simple objects by reference and then that object gets mutated causing subtle bugs. Also, go is happy to blow your foot off if you take the reference of a loop variable. Although, there is a proposal to fix that.
Generally I find that fewer bugs get introduced when using copy instead of pass by reference, but I’m sure others have the opposite opinion.
Usually (read: near universally) something that provides interior mutability will implement a runtime check to protect you (eg, a lock), but you could make a footgun if you were cavalier about it. And recently I misused a framework (I kept a value returned by the framework around that should've been dropped) without realizing it was holding locks internal to the framework, and ended up with a tricky-to-diagnose deadlock (though it was only tricky because I hadn't read the documentation as closely as I should've).
I have had the same results. Passing by copy is simpler and less bug-prone and reduces the urge to "just set the value since I have a reference to the object" which is a well-paved road to significant pain.
And the objects have to get surprisingly large before passing by reference really makes a difference.
That kind of problem doesn't happen in Rust though, because if all you have is a reference then you cannot mutate anything. You can only mutate things if you have a mutable reference to them, which makes the caller jump through hoops to acquire one for you. They may have to have ownership of the thing, or use a Mutex or RWLock to acquire exclusive access to it, before they can call your method. Nothing can be mutated by accident, or in a surprising place.
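A minimal illustration of that point (hypothetical names):

```rust
struct Counter { n: u32 }

fn read_only(c: &Counter) {
    // c.n += 1; // compile error: cannot assign through a `&` reference
    println!("{}", c.n);
}

fn bump(c: &mut Counter) {
    c.n += 1; // fine: the caller had to hand over exclusive access
}

fn main() {
    let mut c = Counter { n: 0 };
    read_only(&c);
    bump(&mut c); // the possibility of mutation is visible at the call site
}
```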
Yeah, const semantics are one of many things I would have preferred to have in Go over generics. You can sort of emulate them with pass by copy, but (1) that has performance implications and (2) if your data contains a reference, that can still be mutated.
Ouch, I actually feel Ada was right in having the for-loop 'variable' be constant inside the body of the loop. If you want a modifiable loop variable, use a loop/while loop, but at least the first question in peer-review becomes 'why not a for-loop here'?
I don’t think that solves the problem because the loop itself still needs to mutate the variable. The issue often arises when people fork a new goroutine that closes over the loop variable inside the loop body—the loop often completes before any of the threads start running, so many/all of the threads read the loop variable at its max value (e.g., if we are forking 10 goroutines that just print their loop variable and quit, then there’s a good chance every goroutine will print 10 rather than printing 1-10).
Wouldn't any sane fork system capture the value of the loop variable instead of a reference to it? Why would anyone ever want to capture a reference to a loop variable that will be moved under one's feet? I'm sorry I'm struggling to understand what we're doing here and why.
No, the capturing is done by the closure and as far as I know capturing by reference is pretty standard for closures across languages. The forking system just runs the closure in a thread, agnostic to any capturing details.
I don't think one of them is more clear or maintainable universally, but in a lot of contexts, there might be an obvious choice. As a trivial example (that isn't quite fair given that the topic is about structs), it will almost never be more clear or maintainable to pass a shared reference to an integer (although there may be cases where a mutable reference might make sense). I don't think there's much need for one to have precedence over the other by default; if anything, I see the discussion about performance tradeoffs not being worth fretting about in the absence of actual measurement to be an argument _against_ one of them being inherently preferable.
Yeah totally agree it's not always/usually obvious. But there are cases where there's a clear readability/assumed-performance tradeoff, and in those cases I say always prefer readability (unless you benchmark).
> but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access
I really dislike these takes. I see engineers optimize these cases and then go ahead and make two separate SQL queries that could be one, ruining any false optimization gains they got by lord knows how many times.
Yeah, you can loop over that 100 element list twice doing basic computation if you want, it's not going to make a difference for many engineering workloads, but could make a big difference in readability.
This x 10000! If I had a dime for every time I provided this exact feedback in code reviews… I find it surprising that a lot of devs in the tech industry are obsessed with pointless micro-optimizations and don’t care about writing maintainable, testable, simple code. My final comment is always to not try to outsmart compilers/JVMs because they tend to do a much better job than developers.
Please, don’t optimize unless you have reasons to do so and numbers backing that up.
My advice is the opposite: if you want to make performance justifications for code, you need a benchmarking suite. I have them for a lot of my projects. (Rust’s criterion is a delight). A good benchmark suite is a subtle thing to write - you want real world testing data, and benchmarks for a range of scenarios. The benchmarks should be run often. For some changes, I'll rerun my benchmarks multiple times for a single commit. I benchmark the speed of a few operations, serialisation size, wasm bundle size and some other things.
Having real benchmarking data is eye opening. It was shocking to me how much the wasm bundle size increased when I added serialisation code. The time to serialise / deserialise a big chunk of data for me is 0.5ms - so fast that it’s not worth more microoptimizations. Lots of changes I think will make the code slower have no impact whatsoever on performance. And my instincts are so often wrong. About 50% of microoptimizations I try either have no effect or make the code slightly slower. And it’s quite common for changes that shouldn’t change performance at all to cause significant performance regressions for unexpected reasons.
I’ve also learned how important “short circuit” cases can be for performance. Adding a single early check for the trivial case in a function can sometimes improve end to end performance by 15-20%, which in already well tuned code is massive.
Performance work is really fun. But if you do performance tuning without measuring the results, you’re driving blindfolded. You’re as likely to make your code worse as you are to make it better. Add benchmarks.
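For anyone who hasn't used it, a criterion benchmark is only a few lines. A minimal sketch (the serialize function is a made-up stand-in; Cargo.toml needs `criterion` under [dev-dependencies] and a [[bench]] entry with `harness = false`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for whatever operation you actually care about.
fn serialize(data: &[u64]) -> Vec<u8> {
    data.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn bench_serialize(c: &mut Criterion) {
    let data: Vec<u64> = (0..10_000).collect();
    c.bench_function("serialize 10k u64s", |b| {
        // black_box stops the optimizer from deleting the work under test
        b.iter(|| serialize(black_box(&data)))
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```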
This is true for application code. But, Rust is trying to sell itself as a systems language and an embedded language and a language you can write kernel modules in. Memory budget matters in these cases.
If memory budget matters, you have a memory budget. So you should be measuring and you can actually tell where you need improvements.
But in practice what we see overwhelmingly is that people want to do this stuff but they aren't measuring, because measuring is boring whereas making the code more complicated to show off how much you think you know about optimisation is easy. Knock it off.
Works really nice for me, of course, I actually measure what I'm doing.
You edited your comment, so I guess I will too: Rust doesn't actually have linear types. Linear types ("must use or compile error") would be tricky to provide; Aria blogged about it back in the day. So that's definitely going to be a problem with your refactoring.
Yup, we have a good benchmarking suite to make sure changes don't cause regressions and that optimization changes actually work.
That said, I think asking a developer to write everything they do twice so that they can A/B test is overboard. You can come back and really aggressively optimize later, but I think the "default" should be the fast thing, rather than the slow & easy thing.
Time to market can matter in fintech as much as anywhere else.
Edit: also optimizations often interact with each other and you can't always benchmark all possible combinations, so sometime you do have to rely on experience or reasoning from first principles to decide if an optimisation is worth it or not.
We could, yeah. I'm sure you're capable of re-writing everything you do twice. It's just a huge waste of time.
You get just as much benefit by assigning a performance refactor to an engineer when needed vs literally halving (or worse) the whole team's productivity.
When you're writing in a language like Rust or C++, you know broadly what the compiler optimizes and what it probably won't. You do need to do the work yourself a lot of the time.
One neat thing here is that the compiler is aware of which types are `Copy` and not internally mutable (not containing an `UnsafeCell`). For these types, passing `&T` and `T` are equivalent, so the compiler could just choose the faster option.
Even if it's not smart enough to do that today, it could implement this optimization in the future. This could work even without inlining, since the Rust calling convention is unstable, and an optimization based on type size could be incorporated into it.
It would be nice if Rust could do this, but it breaks backwards compatibility. Some existing code depends on pointer values of &T being equal or not equal.
As an addendum, LLVM can automatically perform the “&T-as-T” optimization (without inlining) in some cases where the callee function is in the same compilation unit and known to not care about the pointer value. However, these types of optimizations tend to be fragile, easily disturbed when things get slightly complex.
Indeed, but the compiler is still capable of doing it on a case-by-case basis. Quite often the observed semantics are identical and it's easy for the backend to see that a pointer has been created only to be immediately dereferenced.
This seems to be an unpopular opinion, but I feel similarly about how people sometimes toss out `inline` (and, even more suspect, `inline(always)`) annotations on Rust functions like candy on Halloween, and there are almost never any actual measurements of whether it helps in the cases where it's used. It's not even that I think it really hurts that much in most of the stuff I've worked on (which tends to be more sensitive to concurrency design and network round trips), but I can't help but worry that people using stuff like this when they don't seem to fully grasp the implications is a recipe for trouble.
Yeah inline is an absolute classic for this. The number of uncommented __attribute__((always_inline))s I see in C code drives me crazy. There are absolutely legitimate reasons to use that attribute but there should ALWAYS be a comment about why, so that later readers know in what conditions they can safely remove it.
In the old days I saw C code with massive overuse of the "register" keyword, meaning this variable should be kept in a register. Someone had over 12 of these in one function, back when that was waaaay more variables than processors had registers.
Good thing modern register allocation by compilers makes this irrelevant.
Even if you do benchmark something, maintainability can be more important than a marginal performance improvement.
I've seen this happen a lot with JavaScript, particularly in the last 5-10 years as JS engines have developed increasingly sophisticated approaches to performance. Today's optimization can be tomorrow's de-optimization. Even given an unchanging landscape of compiler/interpreter, tightly-optimized code can become de-optimized when updated and extended, as compared to maintainable code that may not suffer much performance degradation upon extension.
How are you using the word simpler? Because to me that implies a combination of being more obvious and having fewer lines of code. Something that a benchmark shouldn't be involved in.
For example, asking someone to delete 10 lines of code and instead use go's `net.SplitHostPort` would be an example of "simpler".
I've read that good generals worry about tactics and great generals worry about logistics.
Good programmers play code golf, great programmers write readable and maintainable code.
Your example seems reasonable but programmers also like to act like the smartest one in the room. I often come across tricky and borderline obfuscated code because somebody wanted to look clever. This is a logistical nightmare.
Ugh, you are right but then someone comes and uses this to rationalize not including things like map, filter and reduce in a language because they are supposedly too complicated and you can just do it with a for loop
I work in a Rust codebase that uses a lot of functional functions, and I’ll say this: on average the imperative style takes less lines of code and less indentation. I also find it more readable personally, and idiomatic.
Just because we’re on the topic of performance: the rust optimizer can sometimes generate better code if you use map / filter / etc. The slice iterator in any context is a huge win over manual array iteration because it only needs to do bounds checking once.
Javascript (v8, last I checked) is the opposite. Simple for loops almost always outperform anything else.
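To make the Rust claim concrete, the comparison is roughly this pair (a sketch; whether the indexed version's checks actually get elided depends on what the optimizer can prove):

```rust
// Indexed loop: every `v[i]` is a potential bounds check the compiler
// has to prove away.
fn sum_indexed(v: &[f32]) -> f32 {
    let mut total = 0.0;
    for i in 0..v.len() {
        total += v[i];
    }
    total
}

// Slice iterator: no per-element indexing, so nothing to prove.
fn sum_iter(v: &[f32]) -> f32 {
    v.iter().sum()
}
```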
I've seen cases where an iterator was better, but I've also seen gains from using an imperative loop with manual indexing. Loop conditions and the occasional assertion can be enough to elide bounds checks. (Though sometimes the compiler gets too paranoid about integer overflow.)
Most of the time you should just write whatever's clear/convenient but sometimes it's worth trying both and scrutinizing godbolt.
Without knowing the domain, you have no way of knowing that.
It’s also much easier to stick to for() loops in javascript as you code than it is to rewrite everything later when tuning for performance. If that’s something you expect to need to do.
Functional iteration is good for the same reason we use for loops over while loops, and while loops over goto: they are more constrained, more clearly communicate intent, and are therefore easier to reason about.
Yeah, I agree with that. Especially reduce/fold, which I find is almost always better written as loop. Filter would be a good example of the opposite for me: almost always much clearer written functionally.
I suspect this has more to do with the lack of first-class sequence comprehensions in the syntax. If you had to write imperative style, but all loops were HOFs, it would hardly be ergonomic, either. OTOH a good query language is much more readable.
That's what they're saying: if someone says anything more complicated is faster, they challenge them to benchmark it. Usually, it turns out whoever argues the "is faster" point doesn't bother to benchmark it and the simpler code-wise thing wins out. So yes - the benchmark goes to performance, simplicity is in lines-of-code, cyclomatic, "in the eye of the beholder" or whatever other metric you choose, but usually it's obvious.
> I always ask them to either prove it or write the simple thing
This is cumulative though; one instance may not make much difference, but 100,000 of them will. There's that story of Chrome slowing down that was eventually discovered to be caused by strings, each one of which was too little to make a difference on its own.
> I always ask them to either prove it or write the simple thing.
Even if they do, they also need to make a case that in this specific case, performance matters enough to pessimize code simplicity and maintainability.
Also, if performance is that critical, it's imperative to benchmark again after each compiler release to guard against codegen regressions. And benchmark after changing this piece of code. Otherwise, we can say that performance doesn't really matter.
I think you're missing the point of the previous comment, they are saying a good proxy for it being worth it to optimize, is if you're willing to benchmark it.
My favorite way of thinking. It should also be applied to question whether a given feature or function needs to exist at all, where the best outcome is deleting the whole thing.
I don’t feel like this gave a satisfactory answer to the question. Since everything was inlined, the argument passing convention made no difference in the micro benchmarks. But what happens when it does not inline? Then you would actually be testing by-borrow vs by-copy instead of how good rust is at optimizing.
> Then you would actually be testing by-borrow vs by-copy instead of how good rust is at optimizing.
I don’t think the question is actually: “what is faster in practice, a by-copy method call or a by-borrow method call”, I think the question is: “as an implementer, which semantics should I choose when I’m writing my function”.
For the second question: “Rust is usually pretty good at aggressively inlining, so… if you’re willing to trust Rust’s compiler, you’re often okay going with by-copy implementations, but you should keep an eye on it”. Whereas, as you note, for the first question it’s not an answer.
But, I do think if someone was going to put more work into it I’d be very curious what the answer to the first question is. If I’m choosing to implement with by-copy semantics and trusting the Rust compiler to hopefully inline things for me, I’d like to know the implications in the cases when it doesn’t.
Blog author here. This feels like the best summary in this comment section.
The root question is indeed “what semantics should I use”. And the answer I came up with was “the compiler does a lot of magic so by-copy seems pretty good”. I agree with the previous commenter this is not a satisfying conclusion!
My experience with Rust is that it requires a moderate amount of trust in the compiler. Iterator code is another example where the compiler should produce near optimal code. Emphasis on should!
When the value size is small (whatever "small" means for a particular architecture) I'd say the "trust the compiler" suggestion is reasonable. When the size grows there should be no more "trust", unless the compiler can decipher whether it is safe to use a ref instead of a value based on value size (we assume that the function does not mutate the value).
> When the size grows there should be no more "trust", unless the compiler can decipher whether it is safe to use a ref instead of a value based on value size
I believe that the Rust compiler at least does exactly that. Large structs will be passed by reference under the hood even if it passed by value in the code. I suspect C++ compilers do the same, although I'm not sure about that.
>"Were the other benchmarks run in debug mode / with optimizations turned off or something like that?"
Why would I do something like that? Of course all builds are release mode, optimize for speed.
Rust - Windows - By-Copy: 14124, By-Borrow: 8150
C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396
>"Why is performance so much better in this case?"
Not sure and not in a mood to investigate. I do know if cache locality and branch prediction stars line up properly the performance difference can be staggering. Maybe LLVM has accomplished something nice in this department.
Such stark difference in performance is really fishy and needs a deeper analysis. I just checked it with different compilers (all based on LLVM) and here are the results:
Is this Windows or Linux? Also what CPU? Your results for rust on LLVM15 are pretty much close to mine but clang++ on LLVM15 is almost twice as slow. I really want to find reason.
Sorry for being dumb, I thought I'd enabled it in ./cargo/config.toml for rust. Checked it again and it appears that I did not (the file name was wrong). -march=native for Clang also did better than just avx2
No need to bother with LLVM16 and / or GCC. Enabling avx2 for Clang compiler produced following results that are more or less in line with what was expected.
Either is way faster than the result from the original post, and no, rust does not win this "competition". C++ is a bit faster (see also the results from the other poster above) but not by much.
It may be the default fp mode for msvc is causing C++ to suffer here ( https://learn.microsoft.com/en-us/cpp/build/reference/fp-spe... ). It looks like the default behavior is quite conservative in what it allows to happen - like it won't even use FMAs with the default behavior?
I feel like they got excited by their C++ code being so much slower and curious about the "weird" C++ result and forgot to figure out the original question.
Rust - Windows - By-Copy: 14124, By-Borrow: 8150
C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396 Delta: -0.0227428%
So it appears that C++ - Windows LLVM 15 beats Rust by large margin.
In Rust it's considered idiomatic to pass things by-value whenever you can. Usually this is also the most performant option, since it avoids dereferencing in the callee.
Of course, if your struct is truly enormous, you may want to break this rule to avoid large copies. But in that case you probably want to Box<T> the struct anyway.
Of course, if your struct contains something that can't be copied--like a Vec<T>--you'll have to decide whether to clone the whole struct (and thus the vector in it), pass the struct by-borrow, or find some other solution.
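For example, something like this (hypothetical struct) is where you make that call:

```rust
#[derive(Clone)]
struct Settings {
    name: String,
    values: Vec<f64>, // not Copy, so the whole struct isn't either
}

// By-borrow: cheap to pass, and the caller keeps ownership.
fn total(s: &Settings) -> f64 {
    s.values.iter().sum()
}

// By-value: consumes the struct (or forces the caller to .clone(),
// which also copies the Vec's contents).
fn into_values(s: Settings) -> Vec<f64> {
    s.values
}
```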
When the compiler can see both the callee and the caller (the most common case), why should it even matter? If you pass by value, but don't actually do anything that mutates the copy, and there are no aliasing concerns, surely the compiler can make it by-ref under the hood?
(The other direction is trickier, since by-ref implies the desire to observe aliasing, even though that's usually not expected in practice - but the compiler cannot tell.)
> Safely borrowing things by reference is one of Rust's headline features
Sure, but it's worth noting that references in Rust do not exist merely to avoid passing by-value. They also exist to make it easier to deal with Rust's ownership semantics: they let you pass things to a function without also requiring the function to "pass back" those things as returned values. In other words, references let you do `fn foo(x: &Bar)` rather than `fn foo(x: Bar) -> Bar`. This is a unique and interesting consequence of languages with by-default move semantics.
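In code, the two shapes look like this (a minimal sketch):

```rust
struct Bar { n: u32 }

// Without references, ownership has to be threaded through and back:
fn foo_roundtrip(x: Bar) -> Bar {
    println!("{}", x.n);
    x // hand the value back so the caller can keep using it
}

// A shared borrow avoids the round trip entirely:
fn foo(x: &Bar) {
    println!("{}", x.n);
}
```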
This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out". The compiler can then decide how to best implement it for that specific size and architecture.
So far as I can tell, it's not quite the same thing since these still have pointer semantics (and thus have to deal with aliasing etc). The in/out approach is more generic, since "in" can map to a pointer where it makes sense, and to a copy where it does not.
Better yet when you prohibit such arguments from aliasing (or at least make no-alias the default) - now the compiler can also implement "in out" by copying the value back and forth, if it's faster than indirection.
It can return multiple values, so this doesn’t matter much for value types, but it would be nice to be able to specify that a pointer arg is an out-param sometimes and enforce that it is not read from while handling allocation in the caller.
The difference between passing by reference vs. by value is observable when comparing pointers to the original vs. to the argument. This difference may be unobservable in Ada though (not sure), so Ada would have more freedom choosing between the two.
A question to the Rust experts, would lifetime annotations 'a in Rust have similar benefit as "in" or "in out" or "out" in Ada and other languages? With the additional benefit in Rust where the compiler can deduce those automatically for most cases?
As a sibling comment points out, "in" is effectively equivalent to "&T", and "inout" is effectively equivalent to "&mut T". Rust is missing purely "out" parameters, but that isn't a very common case, and I'm not sure how much value there is in saying "this reference can't be read" since references are always guaranteed to be valid in Rust.
Out params are needed to ensure zero-copy in-place initialization of large objects, i.e. what C++ does with placement new. You can sort of fake it with MaybeUninit<> https://doc.rust-lang.org/std/mem/union.MaybeUninit.html but it's quite fiddly, requires unsafe code, and has undesirable effects on the layout of containing objects.
Dlang can also qualify parameters as in, out, and inout; although I don't know to what degree the compiler is able to use that for optimization purposes (it is used for safety checks IIRC)
At the minimum, the compiler can look at the size of the type to decide whether copying or indirection makes more sense. The size is stable for a given parameter signature and architecture, so the ABI is stable.
But also, the ABI only matters between two pieces that are separately compiled. A static binary optimized at link-time doesn't have to care.
This is one of those questions where you really, honestly, do need to look at a very low level.
Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using `*dst++ = *src++` loops rather than the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.
The simple loop was something like X cycles/byte, while the library version was P + (Q cycles/byte) but the difference was such that the crossover point was about 8 bytes. So, scraping out the simple memcpy implementations from the code was about a weekly thing for me.
At this point, we discovered that our C compiler would pass structs by value (This was the early-ish days of ANSI C and was a surprise to some of my older coworkers.) and benchmarked that.
And discovered that its copy code was worse than the simple `*dst++ = *src++` loops. By about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)
If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.
I always prefer by-borrow. That's because in the future this struct may become non-copy and that means some unnecessary refactoring.
My thinking is a bit like "don't take ownership if not needed" - the "not needed" part is the most important thing. Don't require things that are not needed.
If a struct might lose Copy you shouldn't implement Copy at all, to preserve forward compatibility. You can still derive Clone in most cases; using .clone() does not per se add any overhead.
Exactly why IMHO the rust stdlib is so easy to understand. Ownership only when required as a design principle tends to make the design of the overall system more consistent / easier to approach.
> Blech! Having to explicitly borrow temporary values is super gross.
I don’t think you ever have to write code like this. Implement your math traits in terms for both value and reference types like the standard library does.
Go down to Trait Implementations for scalar types, for instance i32 [1]
impl Add<&i32> for &i32
impl Add<&i32> for i32
impl Add<i32> for &i32
impl Add<i32> for i32
Once you do that your ergonomics should be exactly the same as with built in scalar types.
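For a custom type, that looks something like this (hypothetical Vec3, with the reference impls just forwarding to the by-value one):

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 { x: f32, y: f32, z: f32 }

// The by-value impl does the actual work...
impl Add for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 {
        Vec3 { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
    }
}

// ...and the reference impls forward to it, copying the small operands.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { self + *rhs }
}
impl Add<Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 { *self + rhs }
}
impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { *self + *rhs }
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
    // All four combinations now work without explicit derefs:
    assert_eq!(a + b, a + &b);
    assert_eq!(&a + b, &a + &b);
}
```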
True, that does work for traits. But it's super annoying if you have to write multiple copies of the same thing. That can get out of control quick if you need to implement every combination.
And that doesn't help at all if you're writing a "free function" like 3D primitive intersection functions. I suppose you could change that simple function into a generic function that takes AsDeref? Bleh.
It's compiled, so, without any investigation at all, I would have been disappointed if there were any significant difference in the code emitted in these cases. I would expect the compiler to do the efficient thing based on usage rather than the particular syntax. I may have too much faith in the compiler.
I'd expect your claim to be true whenever the callee is inlined into the caller. In this case, the compiler has all the relevant information at the right point in time. As other commenters have pointed out, by enabling inlining the author has gone down a rabbit hole somewhat unrelated to the question, because any copies can be simply elided.
If there's no inlining at play, I'd expect vast differences to be possible. For example, imagine a chain of 3 functions - f calls g, g calls h, where one of the arguments is a 1kB struct and the options are passing by copy or by borrowing. In this case, each stack frame will be 1kB in size in the copy case and there will be a large performance overhead as opposed to the by-reference case. One would expect simply calling the function to be similar in overhead to an uncached memory load.
Within a single crate the inlining is possible; with multiple crates it's only possible with LTO enabled (and I'm not sure how _probable_ it is that the inlining would occur).
In either case, the difference between a 32 byte and 8 byte argument in terms of overhead is likely meaningless - the sort of thing to be optimized if profiling says it's a problem as opposed to ahead of time.
> Within a single crate (more specifically, codegen unit) the inlining is possible
Cross-crate inlining happens all the time. In order to be eligible for inlining, a function needs to have its IR included in the object's metadata. This happens automatically for every generic function (it's the only way monomorphization can work), and for non-generic functions can be enabled manually via the `#[inline]` attribute (which does not force inlining, it only makes it possible to inline at the backend's discretion).
However, as you say, if you have LTO enabled then "cross-crate" inlining can happen regardless, since it's all just one giant compilation unit at that point.
At the VERY end of the article, the author points out "Oh, btw, I used MSVC for the C++ compilation, when I used clang things changed!"
So, what the author actually measured was the difference between llvm and msvc throughout the article. Particularly when they talked about rust being better at autovectorization than C++.
Incorrect. Clang C++ vs MSVC C++ is very comparable, and noticeably worse for f64 by-ref. Clang C++ is still slower than Rust by a large margin. Using Clang C++ throughout would not change any conclusion (or lack thereof).
I'd be interested to know what the benchmarks of the two rust solutions are when inlining is disabled so we can get an idea of the different performance characteristics of each function call even if it's not a very realistic scenario.
The other question I have is which style should you use when writing a library? It's obviously not possible to benchmark all the software that will call your library but you still want to consider readability, performance as well as other factors such as common convention.
I would go with the version that gives the clean user interface (that is, by copy in this case). If it turns out that the other version is significantly more performant and this additional performance is critical for the end users consider adding the by-borrow option.
The clarity of the code using a particular library is such a big (but often under-appreciated) benefit that I would heavily lean in this direction when considering interface options. My 2c.
The assumption behind such arguments is when a performance problem does arise, a profiler will point to a single, easy to fix, smoking gun. Unfortunately this is not always the case. Performance problems can be hard to diagnose and hard to fix. A lot of damage has been done by unexamined / dogmatic "root of all evil" mantra.
In the vast majority of situations (1) you'll prematurely optimize in the wrong place and (2) yes the profiler will point to a single, easy-to-fix smoking gun.
Situations otherwise are the exception, rather than the rule, and it takes an expert to (1) recognize those situations and (2) know exactly how to write optimized code in that situation.
That's why "don't prematurely optimize" is a good rule of thumb - because it works the majority of the time, and it takes experience to know when not to apply it.
> In the vast majority of situations [..] yes the profiler will point to a single, easy-to-fix smoking gun.
[citation needed]
This claim depends hugely on the industry you're actually working in and the problem space. Things like UIs & games basically never have a single, easy-to-fix smoking gun. The entire app is more or less a hotspot - be it interactive performance, startup performance, RAM usage, or general responsiveness.
And once you've gone down the route of "build it first, optimize it later" you're pretty much fucked when you get to the "optimize" step, because now your performance mistakes are basically unfixable without a rewrite - every layer of your architecture has issues that you can't fix without drastic overhauls. It would have been much easier to do some up-front measurements, get some guidelines in place (even if they aren't perfect), and then build the app.
This is ridiculous. If you had the knowledge, you'd apply it, and you can't acquire it without practice, which is going to involve heuristics (which you falsely call "dogma" in order to ad-hominem my argument) like this one, which exists precisely because you need something for when you don't have the specialized domain expertise.
> The true root of all evil is unexamined dogma.
This is so absurd that it doesn't deserve a reply.
This advice hinges hugely on what "start simple" really means. There's a ton of counter-examples here where that just isn't true at all depending on what you're calling "simple". In particular, JIT'd languages can be especially problematic here. An example would be using Java's Streams interfaces to do something that could be done without much difficulty with a regular boring ol' for loop. At the end of the day you're hoping the JIT will eventually compile the streams version down to the same machine code the for loop version would have produced. But it won't do that consistently, and you've still wasted time before it does.
Trusting the compiler also means knowing what the compiler actually understands & handles vs. what's a library-provided abstraction that's maybe too bloated for its own good and that quickly becomes "not simple" depending on your language of choice.
I agree in general, but the side-effect of doing this is that no matter how fast your hardware gets, your software will always end up optimized to the new hardware. So over time your software gets slower and slower but performance stays consistent as hardware gets faster.
> * Sprinkling & around everything in math expressions does make them ugly. Maybe rust needs an asBorrow or similar?
Do you mean AsRef, or do you mean magic which automatically borrows parameters and is specifically what rust does not do any more than e.g. C does?
Though you can probably get both if the by-ref version is faster (or more convenient internally): wrap the by-ref function in a by-value wrapper which is #[inline]-ed. This way the interface is by-value but the actual parameter passing is by-ref (as the value-consuming wrapper will be inlined and essentially removed).
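A sketch of that wrapper pattern (the `dot` function here is made up):

```rust
#[derive(Clone, Copy)]
pub struct Vec3 { pub x: f32, pub y: f32, pub z: f32 }

// The real work takes references...
fn dot_ref(a: &Vec3, b: &Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

// ...and a thin #[inline] facade keeps the public interface by-value.
#[inline]
pub fn dot(a: Vec3, b: Vec3) -> f32 {
    dot_ref(&a, &b) // inlined away, so no extra copy in practice
}
```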
The benchmarks lack the standard deviation, so the results may well be equivalent. Don't roll your own micro-benchmark runners.
References may get optimized to copies where possible and sound (i.e. blittable and const), a common heuristic involves the size of a cache line (64b on most modern ISAs, including x86_64).
Using a Vector4 would have pushed the structure size beyond the 64b heuristic. You would also need to disable inlining for the measured methods.
It was also (needlessly) using 2 different compilers, MSVC and LLVM. This is just a bad way to compare things all around.
And, for simple operations like this, you really should just look at the assembly output. If you are only generating 20ish instructions, then look at those 20 instructions rather than trying to heuristically guess what is happening.
Note that this is from 2019, so it's probably worth re-benchmarking to see if anything has changed in the interim. Can we get the year added to the title?
For this code, the compiler inlined the call. So there should be no difference between pass by copy or pass by reference, which is what was measured. Where it could matter is when the code isn’t inlined. But with small structs it might not matter all that much.
It does sometimes matter though. One optimization I’ve seen in a few places is to box the error type, so that a result doesn’t copy the (usually empty) error by value on the stack. That actually makes a small performance difference, on the order of about 5-10%.
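A sketch of that boxing trick (hypothetical error type; the point is that `Result<u32, Box<ParseError>>` only moves a pointer-sized error slot around on the happy path):

```rust
struct ParseError {
    line: usize,
    context: [u8; 128], // imagine a big diagnostic blob
}

fn parse(input: &str) -> Result<u32, Box<ParseError>> {
    input.parse().map_err(|_| {
        Box::new(ParseError { line: 1, context: [0; 128] })
    })
}
```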
Folks, processors continue to give smaller and smaller gains every year. Something has to give. If you have critical path code that absolutely must max out the core, then this type of analysis (as pedantic as it is) is useful in the long run.
It's not like you can do arithmetic with references, so maybe the ergonomics of by-value vs. by-reference shouldn't really be that different.
The cost of by-value lies in memory copies, while the cost of by-reference lies in dereferencing pointers where the values are needed, which might mean many more memory reads are needed than with by-value (depends on what you're doing). So it's just hard to tell which will do better in general -- there's no answer to that.
For a library, maybe providing by-value and by-reference interfaces should be good (except that will bloat the library). For everything else just use by-value as it has the best ergonomics.
This actually bothers me. I think the rust performance here is praiseworthy. What bothers me is that we piled complexity over complexity at the hardware and compiler levels, and ended up in a situation where you have no way to get a reasonable understanding of how low-level code will perform. Nowadays the main reason to program in a "low level" language is that you know that on average the compiler will be able to do a better job because the language doesn't have abstractions that map poorly to the hardware model. But for much of it you can forget about "I know what the hardware is going to do"
But at the risk of loss of respect, I'll wait for Rust2ShinyNewLanguage to solve this.
All I know is I hope I'm smart enough to understand ShinyNewLanguage's compiler. Or maybe even build it.
I've got several projects that could use some additional Boxes of structures, and borrow instead of move, and maybe a few more complex reference counting mechanics.
Rust forced me to understand what that meant. That's good for building a better engineer.
But it's not fun to work with.
I hope the next experience is better. Sorry Rustaceans.
There is no single answer to this question because it's going to depend completely on call patterns further up. Especially in regards to how much of the rest of the running program's data fits in L1 cache, and most especially in regards to what's going on in terms of concurrency.
The benchmark made here could completely fall apart once more threads are added.
Modern computer architectures are non-uniform in terms of any kind of memory accesses. The same logical operations can have extremely varied costs depending on how the whole program flow goes.
My first thought was "now what is the calling convention for float parameters again? they are passed in registers right? the compiler can probably arrange so they don't have to actually be copied" and then I realized it will probably even inline it.
Anyway, assuming it's not inlined I would guess pass-by-copy, maybe with an occasional exception in code with heavy register pressure.
Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
> Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
FWIW the AMD64 SysV v1.0 psABI allows structures of up to 8 members to be passed via registers. Though older revisions limit that to 2 (and it's unclear whether MS's divergent ABI allows aggregates to be splat at all).
As sad as it's unsurprising, it does not look like LLVM (linux?) has followed up, on godbolt a 2-struct passes everything via registers but a 3-struct passes everything via the stack. Maybe there's a magic flag to use the 1.0 ABI, but a quick googling didn't reveal one. ICC doesn't seem to have followed up either.
How can an abi allow but not require passing them via registers? Isn't the point of a convention that everyone does it the same way? Otherwise the caller could put it on the stack and the callee look for it in a register?
Actually reading the word abi made a little neuron light up. To what extent does rust even follow that abi?
> Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.
Depending on calling convention, the structure may be spread out into registers.
Anyone know why seemingly knowledgeable people (like the person who wrote this article) don't use micro benchmarking frameworks when they run these tests?
Also, whenever you do one of these, please post the full source with it. There's no reason to leave your readers in the dark, wondering what could be going on, which is exactly what I'm doing now, because there's almost no excuse for c++ to be slower in a task than rust--it's just a matter of how much work you need to put in to make it get there.
I see now. I looked twice. I think most people stop after a section called "Conclusion" that ends with "Thanks for reading." It doesn't help that the formatting then leaves a large gap between sections that doesn't indicate there's more after that.
For C++ I guess you could make the claim that it's just too annoying to take a dependency on something like google-benchmark or whatever, since C++ dependency management is such a mess to deal with in general.
But yeah I have no idea why a benchmark framework wasn't used for Rust.
The general usability impact matters slightly less than it looks here, in part because the `do_math` with references in the article has two extra &s, and in part because methods autoreference when called like x.f().
Performance-wise, if you're likely to touch every element in a type anyway, err on the side of copies. They are going to have to end up in registers eventually anyway, so you might as well let the caller find out the best way to put them there.
The Rust-test implements the traits Add, Sub, Mul by value. This makes the few references less important in the total test. The ergonomics argument is motivated by using these traits. Otherwise, references would have had the same ergonomics.
But also, the struct is 3x32 bits, and can simply derive the Copy trait. It is barely larger than a u64, which is the size of the reference.
But life is only simpler when Copy and Clone can be derived.
I haven't benchmarked that, but in Rust `ScalarPair`s (i.e., structs that have up to two scalars) are passed in two registers, while bigger structs are always passed by pointer. Therefore, passing bigger structs by move will require the compiler to copy them, while with references it is not required to, so references may be faster in that case.
I understand that this is an example for the purposes of answering the given question, but when actually doing things with 3D vertices one should be thinking in terms of structures of arrays (see the sketch below). As someone said here already: good generals worry about tactics and great generals worry about logistics.
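For the unfamiliar, the contrast looks like this (a sketch):

```rust
// Array of structures (AoS): each vertex's fields are interleaved in memory.
struct Vertex { x: f32, y: f32, z: f32 }
type MeshAoS = Vec<Vertex>;

// Structure of arrays (SoA): each field gets its own contiguous array,
// which is friendlier to SIMD and cache when you touch one field at a time.
struct MeshSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}
```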
You are comparing two completely different compilers; I wouldn't worry all that much about the difference between rust and C++. If you do want to compare them directly, why not use LLVM for C++ as well? That will highlight any language-specific differences.
This is not really surprising in such a case. The Rust compiler is pretty good at optimizing out unneeded copies. Here it does see that the copied value is not used after the function call, so it should simply not emit the copies in the final assembly.
Minor nit: many of the differences in the article aren't really specific to Rust vs. C++, but rather differences between LLVM and whatever compiler backend MSVC uses.
This is one of the problems I have with writing Rust code. You have to think about so many mundane details that you barely have time left to think about more important and more interesting things.
Having written a lot of C: you spend basically all your time thinking about "mundane details", and worse, if you make a mistake, you often don't know it until the code is running on some customer's servers and a ticket escalated three times lands on you with vague information about rare crashes. Good luck remembering which bit of code you wrote six months ago might be causing the problem.
I'll take Rust shouting at me for missing "mundane details" any day of the week.
Well, it is a systems programming language. Thinking about exactly how the language passes bits around is the whole point. Rust should specify a stable ABI already so that everyone can form a good mental model of what their code becomes once compiled.
True, I probably was using Rust for the wrong type of problem; i.e., I was hoping to write a user-level application with a graphical UI at the time.
Rust is probably better used for writing fast low-level libraries that you call from higher-level languages, possibly ones with a garbage collector, so you don't waste time thinking about memory management while you design and write your high-level application.
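A minimal sketch of that pattern, exposing a Rust function over the C ABI so a higher-level language can call it through its FFI (assumes crate-type = "cdylib"; the function name is hypothetical):

    // Build as a cdylib, then load from Python, Ruby, etc.
    #[no_mangle]
    pub extern "C" fn sum_f64(data: *const f64, len: usize) -> f64 {
        // Safety: the caller must pass a valid pointer/length pair.
        let slice = unsafe { std::slice::from_raw_parts(data, len) };
        slice.iter().sum()
    }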
My (brief) experience with Rust was that, while I had to struggle to learn the borrow checker, I didn't have lots of "mundane details" to worry about; if anything, fewer than in C(++).
As a Rust beginner who likes to learn the hard way, I think I have some insight into why Rust seems cumbersome and/or hard for programmers trying it.
Rust uses syntax that feels familiar but means completely different things than it does in pretty much any other language.
For example, '=' doesn't mean assign a handle or make a copy; by default it means move.
'let' doesn't mean create a name for something. It means create physical space for something (of known size) that values can be moved into or out of.
You don't deal with objects and values of primitive types. Instead, everything in Rust is a value. When you move, you move the value. If you compare, you compare by value. If you pass something from a variable into a function, you move the value into the function.
And when the space where you keep the value goes out of scope, the value dies with it, unless it was moved out somewhere else.
Scope for variables (which are just named spaces for values) ends with the end of the block, but some values, created by functions and returned from them, can die sooner if they are not moved into any named space, even in the middle of the line where the function call produced them.
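A small sketch of those semantics in action:

    struct Thing {
        data: String, // String is not Copy, so Thing moves on `=`
    }

    fn consume(t: Thing) {
        println!("{}", t.data);
    } // the value moved into `t` dies here

    fn main() {
        let a = Thing { data: String::from("hi") };
        let b = a;                  // `=` moves: `a` can no longer be used
        // println!("{}", a.data);  // error[E0382]: use of moved value
        consume(b);                 // moves `b` into the function
        // `b` is unusable here; the value already died inside consume()
    }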
Everything else stems from that fixed-size, moved-value semantics. If you don't want to move the value into a function when you call it, you need to pass something else instead, so you create and pass in a borrow. But then you have to ensure that the value doesn't die or get moved anywhere (even inside the container you borrowed it from) before all borrows of it die.
Because of this you are better off with borrows that are short-lived and local. Often it's better to keep the index of an element of a Vec instead of a borrow of that element. If you must create types that contain borrows, know that they become borrows themselves, and you need to treat them exactly the same way, limiting their scope and lifetime.
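For example, a sketch of the index-instead-of-borrow pattern:

    struct Selection {
        index: usize, // instead of holding a &'a element
    }

    fn main() {
        let mut items = vec![10, 20, 30];
        let sel = Selection { index: 1 };
        items.push(40); // fine: no borrow of `items` is outstanding
        println!("{}", items[sel.index]);
        // Had `sel` held `&items[1]` instead, the push above would
        // not have compiled.
    }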
It's hard when you come from any other language, because borrows are superficially similar to pointers or references to objects. So you try to use them as such, and crash into the compiler, because they are not that. What's worse, their syntax is very minimalistic, which triggers the intuition that they must be the fast, optimal solution for many problems. They certainly can be, once you fully internalize their limitations, but not a moment sooner.
Another thing is that values in Rust must have a fixed size. So even as simple a thing as a string requires a bit of hackery. Basically, the default strategy in Rust for something of variable size is to allocate it on the heap and treat a pointer to it (possibly with some other fixed-size data, like a length) as the fixed-size value you can move around, clone, and borrow.
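You can see the fixed-size handle directly: a String is three words (pointer, capacity, length) no matter how long the text is, at least on current implementations:

    fn main() {
        let short = String::from("hi");
        let long = String::from("a much, much longer string than that");
        // Both handles are the same fixed size; the bytes live on the heap.
        assert_eq!(std::mem::size_of_val(&short), std::mem::size_of_val(&long));
        assert_eq!(
            std::mem::size_of::<String>(),
            3 * std::mem::size_of::<usize>()
        );
    }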
So if you want the semantics you know from other languages, you can't just use basic Rust syntax.
You need constructs such as Box and Rc, Cell, RefCell. Make your things clonable, and sometimes even copyable, and initially avoid creating borrows whenever possible. When you do, Rust becomes as flexible as any other language and you can use it pretty much just as comfortably. Then the value semantics shines: you can very easily compare your data by value, order it, create operators for it, and create a hash for it so you can keep it in HashMaps and HashSets. Then it's delightful.
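A sketch of that payoff with a made-up type: once the value-semantics traits are derived, it slots straight into hashed and ordered collections:

    use std::collections::HashMap;

    #[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
    struct Key {
        id: u32,
        name: String,
    }

    fn main() {
        let mut scores = HashMap::new();
        scores.insert(Key { id: 1, name: "a".into() }, 100);
        // Equal by value, hashed by value: lookup just works.
        println!("{:?}", scores.get(&Key { id: 1, name: "a".into() }));
    }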
My advice: when you create a long-lived type, just wrap it in Rc and treat this Rc as your 'object'. And avoid borrows in your types unless you have a very good (measured) performance reason to have them, or you are creating something obviously dependent and usually short-lived, like an iterator.
> For example '=' doesn't mean assign handle or copy. It by default means move.
Some complain about this, but the fact is there's no such thing as a zero-overhead "copy" for non-trivial types. C++ started out with = meaning clone the object, which was an even bigger footgun, and support for move had to be added after the fact.
Yeah. Making = mean copy is a really bad idea. I very much like the solution in Rust, where an attempt to move out something that can't be moved out results in an automatic copy if the type implements the Copy trait.
It's a very elegant solution for simple, small data types. But it further obscures how meaningfully Rust differs from everything else, because thanks to that, = sometimes does mean copy.
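A quick sketch of both behaviors side by side:

    #[derive(Clone, Copy)]
    struct Point { x: i32, y: i32 } // small, so Copy is cheap

    fn main() {
        let a = Point { x: 1, y: 2 };
        let b = a;                  // Copy type: `=` copies, `a` survives
        println!("{} {}", a.x, b.y);

        let s = String::from("hi"); // String is not Copy
        let t = s;                  // `=` moves: `s` is gone
        // println!("{}", s);       // error[E0382] if uncommented
        println!("{}", t);
    }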
Rc, Cell & RefCell are supposed to be rare. For example, I've got a 2,000-line Rust program in front of me and I've used Arc 3 times and RwLock once; that's all.
You need to structure your program as a Directed Acyclic Graph (DAG), with things interacting only with the things below them in the graph.
Then occasionally you might need to break the DAG structure by using Rc, Cell & RefCell, etc...
The thing is, not everything can be expressed as a DAG.
And finding that out towards the end of writing your program, after hours of fighting with the borrow checker, is extremely unpleasant.
And I don't think I ever landed in a situation where I could fix the discrepancy by sprinkling in a few Rcs, RefCells, and such.
So I prefer to write with RefCells from the start, and once I've got the thing working and I'm feeling ambitious, I look at which parts could be borrows instead and swap them out.
> How do you express as a DAG a tree where nodes need to keep references to their children and parents?
Rewrite your program in a form where it does not contain a tree.
If you want an actual tree as a data structure, see the trees crate.
> Rust is not Forth. I can write whatever I want and there's nothing wrong with that.
And other people write Haskell code in Python :p. If your code style doesn't match the language you are using you are going to have a lot of unnecessary friction.
Sibling and parent pointers are almost universally a sign that an abstract data structure (and associated algorithm) has been mistaken for concrete. The exception that comes to mind first is Knuth's dancing links, and its obscurity is an indication of the rarity of actually needing these pointers. In any case, it's also a poster child for using indices rather than pointers.
The main objects in my program are expression trees. I manipulate them, cut them, merge them, compare them, splice one into the other. Rc's enable me to have full flexibility and share a tremendous amount of data across objects in my program.
Rust is an absolutely wonderful language for this problem, thanks to Rc's, enums, value semantics, auto-derived traits, the ability to implement traits for existing types, and of course speed.
I'm not implementing specific algorithms; I'm making them up as I go, although I've used some simple ones like topological sort, or A* that eventually turned into plain breadth-first search because I have no idea how far I am from the solution.
> I'm not implementing specific algorithms. I'm making them up as I go
It's mind-boggling to me that people are using a systems programming language for mathematical research, especially if they don't know yet what the final algorithms will look like.
If you further wrap the Rc in an Option, you can set crosslinks and backlinks to None when you are dropping your data, to get rid of the problem of crosslinks or backlinks making reference counting leak memory. Then you just need to be mindful not to lose your handle to a cycle of nodes before you break the cycle by setting some crosslinks to None.
You can fairly easily refactor your almost-tree code to adapt it to that additional Option wrapper.
Of course you might instead opt to introduce a garbage-collector crate into your project. These usually provide a garbage-collected Rc equivalent, which makes swapping it out very easy.
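A hedged sketch of the Option approach (Weak is the more conventional tool for back-links, but this shows the idea):

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        value: i32,
        next: RefCell<Option<Rc<Node>>>,
    }

    fn main() {
        let a = Rc::new(Node { value: 1, next: RefCell::new(None) });
        let b = Rc::new(Node { value: 2, next: RefCell::new(Some(a.clone())) });
        *a.next.borrow_mut() = Some(b.clone()); // a -> b -> a: a cycle

        // Break the cycle before the handles go away, or both Rcs leak.
        *a.next.borrow_mut() = None;
    }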
Rc's really are a very useful first approach to making anything complex in Rust.
I usually have something like

    struct NodeStruct {
        my_data: i32,
        link: Node,
    }

and

    struct Node(Rc<NodeStruct>);

or

    struct Node(Option<Rc<NodeStruct>>);

if I need cross-links.
The great thing is you can then add 'methods' to your type with

    impl Node { /* ... */ }

or define operators and other traits with:

    impl Add<Node> for Node { /* ... */ }

Sometimes, when I need mutability, I even wrap the NodeStruct in a RefCell.
It seems like a lot of wrappers, but thanks to them you can have very nice code that uses this type, with pretty much 'normal modern language' semantics plus value semantics, and it is still blazing fast.
When you implement Ord, Eq, and Hash, they all go through all the wrappers and let you treat your final type Node as a comparable, sortable, hashable, and cheaply clonable value. Dereferencing also goes through all or most of the wrappers automatically.
The dependency is on the ABI, which can be OS-dependent. Also, it is a depressingly manual optimization to make: compilers don't know when it is safe to change a reference to a copy (for example) without an analysis of downstream code that they don't perform.
C and C++ are also set up such that the compiler can do that optimization; they just don't. I'm pretty sure the Rust compiler is in the same boat: it has the information but doesn't do the optimization.
Because Rust is a compiled language, which means you compile your code for a certain architecture. Who said anything about syscalls? There are systems that don't use syscalls.
There are no syscalls or equivalent operating-system calls in the code paths measured. The architecture is also independent of the operating system, with exceptions in some languages for the calling convention (not in Rust, AFAIK; or at least Rust makes no guarantees there).
However, in practice Rust's calling convention does actually depend on the operating system: on Linux, Rust will make use of the stack red zone, while on Windows it doesn't. (Some codegen in LLVM also depends on the operating system.)