I think it is a little ironic that he speaks of performance culture but simultaneously advises using dynamic dispatch and avoiding parametric polymorphism. I can see the justification in non-critical code paths, but serialisation is a pretty important part of most networked software nowadays, so I do not think that smaller binaries and faster compilation times (better developer experience) justify a performance hit in the form of dynamic dispatch through crates like miniserde.
Performance culture has you measure the actual performance implications, then make an informed decision. Is the code on a performance-critical path? Maybe some of your serialization code is, but it's extremely unlikely that a dynamic dispatch when parsing command line args is the reason your app is slow. Also be aware that highly inlined code does nicely in microbenchmarks but might have significantly negative performance implications in a larger system when it blows out the I-cache.
> Also be aware that highly inlined code does nicely in microbenchmarks but might have significantly negative performance implications in a larger system when it blows out the I-cache.
I see this assertion a lot, but I have never actually seen a system in which inlining that would otherwise be a win in terms of performance becomes a loss in a large system. LLVM developers seem to agree, because LLVM is quite aggressive in inlining (the joke is that LLVM's inlining heuristic is "yes").
I'd be curious to see any examples of I$ effects from inlining specifically mattering in practice in large systems.
Fiora refactored the MMU code emitted by Dolphin to a far jump, which had significant performance improvements over inlining the code [0]. She had an article about it in PoC || GTFO [1].
Interesting, that's a good case. Though it's a bit of an extreme one, because it's jitcode for a CPU emulator. I'm not sure how relevant that is to Rust, though it's certainly worth keeping in mind.
In my experience, i$ is much bigger than everyone thinks, and they over-emphasize optimizing for it whenever someone brings up code size. It can soak up a lot. That said, for JITs, where code is not accessed very often and in weird patterns, it can matter quite a lot.
Hm, I've profiled a lot of software over the years and not once have instruction cache misses been a problem, even in large, template-rich, Boost-heavy C++ codebases.
Systems that JIT large amounts of code (HHVM, etc) deal with this trade-off all the time. See e.g. https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/KeithA... for an old discussion of some of the issues (e.g. inlined/specialized versions of memcpy were slower overall than a standalone-slower outlined version).
You're asking for something which is a bit awkward to find, because it requires a bunch of code in a loop to pressure the cache, and then someone noticing that inlining one thing versus not makes all the difference.
The most likely people to be able to answer this one would be game devs or video codec hackers, at a guess.
I do know that inlining choices can have massive effects on executable size. I've seen more people complain about this kind of thing. It's most noticeable when controlling the inlining of a runtime library function in a language a bit more high level than Rust - I'm thinking of Delphi, with its managed strings, arrays, interfaces etc.
One somewhat related example I can think of is how the V8 JavaScript engine switched from a baseline compiler to a baseline interpreter. The interpreter version has less startup latency (because compiling to bytecode is less work) and uses less RAM (because bytecodes are more compact).
It isn't exactly about inlining, but it is an example where optimizing for size also optimized for speed at the same time.
Any time you have an error/exception/abort path, you almost never want to inline it (LLVM probably has attributes to prevent that, but I'm not sure whether Rust uses them). Also, LLVM does get a little too aggressive with things like unrolling, so I wouldn't be surprised if it inlined too aggressively too.
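For what it's worth, Rust does expose this directly: `#[cold]` and `#[inline(never)]` are stable attributes you can put on the failure path yourself. A minimal sketch (the function names are just illustrative):

```rust
// Keep the rarely-taken failure path out of line so it doesn't bloat the
// caller's body when the hot path gets inlined.
#[cold]
#[inline(never)]
fn fail(msg: &str) -> ! {
    panic!("invariant violated: {}", msg);
}

fn checked_div(a: u32, b: u32) -> u32 {
    if b == 0 {
        fail("division by zero");
    }
    a / b
}

fn main() {
    println!("{}", checked_div(10, 2));
}
```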
I do agree with you with the measuring aspect. Part of building high-performance systems is being able to measure performance in an accurate and actionable way, and consequently optimise code paths that have significant performance impacts.
However, I do believe that a competent engineer has the judgement to see, roughly, where performance hits are likely to arise and to optimise accordingly. Command-line arguments would likely not fall under this mandate, but serialisation to stdout is a good candidate for well-designed and well-optimised code. A nice side-effect is that this also avoids significant refactors down the line when you need to, in this case, change your serialisation from dynamic dispatch to static dispatch.
One thing I like about Rust a lot is that it lets you choose which hit you want to take when it comes to polymorphism. You can manually (and without much difficulty!) prefer dynamic dispatch if it's important to you to keep your binary size small, but you can also choose static dispatch and allow some replicated copies of your parametric code if that's what you want to optimize for.
It's also worth noting that if you are using polymorphic functions that are only ever called in your code with a single known type parameter, then your program should be just as efficient as if you wrote a monomorphic version with that fixed type in both runtime _and_ binary size, which means the only disadvantage to the polymorphism there is compile time.
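A minimal sketch of the two options (names are just illustrative):

```rust
use std::fmt::Display;

// Static dispatch: monomorphized per concrete type, direct (inlinable)
// calls, potentially several copies of the code in the binary.
fn print_static<T: Display>(value: &T) {
    println!("{}", value);
}

// Dynamic dispatch: a single compiled copy, calls go through a vtable.
fn print_dyn(value: &dyn Display) {
    println!("{}", value);
}

fn main() {
    print_static(&42);
    print_static(&"hello");
    print_dyn(&42);
    print_dyn(&"hello");
}
```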
It would be nice if Box<dyn Trait> implemented Trait. Then we would be able to write the function only once with parametric polymorphism and then at callsite decide if we want to monomorphize for the given type or not.
You can provide this implementation yourself easily enough though. I agree it's maybe not ideal that this needs to be done for every Trait you want this behavior for.
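Something like this, with a made-up example trait:

```rust
trait Draw {
    fn draw(&self) -> String;
}

// Forward the trait through the box, so `Box<dyn Draw>` itself satisfies
// `T: Draw` and can be handed to generic code unchanged.
impl Draw for Box<dyn Draw> {
    fn draw(&self) -> String {
        (**self).draw()
    }
}

fn render<T: Draw>(item: &T) -> String {
    item.draw()
}

struct Circle;

impl Draw for Circle {
    fn draw(&self) -> String {
        "circle".to_string()
    }
}

fn main() {
    let boxed: Box<dyn Draw> = Box::new(Circle);
    // `render` is monomorphized for `Box<dyn Draw>`, but the inner call is dynamic.
    println!("{}", render(&boxed));
}
```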
I’ve been using C and C++ (sorry, Rust), like, forever, and I think I should hate them with every fiber of my soul. A high-level programming language is “supposed” to let me forget, for the higher being’s sake, all machine-specific details and focus on the logic of the problem at hand. (Hello, FORTRAN.)
If you, as the programmer, would prefer not to know and remember machine-specific details, you can use Rust or C++ easily. Just don't expect that you will always get the best performance possible for your hardware. And I think that is justified and reasonable.
I've always thought the fact that parametric polymorphism always results in monomorphised code is just a temporary deficiency in the language/compiler.
I think if a problem can be solved with either, the default choice should generally be parametric polymorphism rather than subtype polymorphism, simply from a logical point of view.
Haskell is often described as passing around "dictionaries" corresponding to class instances. Presumably Rust could add the same functionality depending on how a type parameter is declared (eg, `fn foo<T: Foo>() ..` denotes a monomorphised function whereas `fn foo<%T: Foo>() ..` could denote a non-monomorphised function which at runtime takes arguments specifying the size and alignment of `T` as well as a vtable corresponding to the `Foo` trait). This would also make polymorphic recursion possible.
Swift takes this approach: monomorphisation is an implementation detail/optimisation.
In languages like Rust and Swift, where values don't have a uniform representation (that is, not always a pointer), this takes a lot of infrastructure, and a lot of performance/optimiser work to get reasonable performance for common code: the Swift compiler has quite a bit of code devoted to making generically-typed values behave mostly like statically-typed ones, with minimal performance cliffs.
Rust's approach is that this sort of vtable-/dictionary-passing has to be done explicitly (with a trait object), and such values have restrictions.
To a very large extent, we already have this with `fn foo(foo: &dyn T)` (or `foo: Box<dyn T>` for the owned version). What I would find even more interesting is the compiler much more aggressively factoring out the common code from the multiple instances, ideally compiling it only once and putting only that one version in the binary.
`T` there would be a trait though, not a type. You should still be able to have, for example, `fn foo<%T>(v: Vec<T>)` in which case you can still call the function with a regular `Vec<i32>`, since if the size/alignment is simply passed as an argument at runtime, you can operate on the existing non-boxed representations of data. The only thing that's different is the specialisation of the instructions essentially happens at runtime rather than at compile time.
I have, however, thought that it should be done automatically based on heuristics, but since the notion of "zero-cost abstraction" is considered the default, I don't think this would be desirable.
Well, I have this anecdote. We switched from serde to our own serialization/deserialization scheme (it still uses serde, but only for the JSON part), which is heavily based on dynamic dispatch, and actually made it faster.
It wasn't an apples-to-apples comparison, but it was several times faster at the time (my memory doesn't serve me well, but something around 3x to 5x). Compile times also went down (well, at the time :) ). It was mostly due to how some of serde's features work (flatten and tagged enums), though.
I made a separate, cleaner experiment (https://github.com/idubrov/dynser), which does not show as dramatic an improvement (again, it wasn't apples to apples; there were other factors which I don't remember), but it does show some.
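If it helps to picture the approach, here's a rough, simplified sketch of what serialization through dynamic dispatch can look like; this is my own illustration, not the actual dynser or miniserde code:

```rust
use std::collections::BTreeMap;

// A simplified value model: every type lowers itself into this, so the
// JSON writer below is compiled exactly once.
enum Value {
    Num(f64),
    Str(String),
    Map(BTreeMap<String, Value>),
}

// Object-safe trait: one vtable per type, instead of one instantiation of
// the whole serializer per (type, format) pair.
trait ToValue {
    fn to_value(&self) -> Value;
}

struct Point {
    label: String,
    x: f64,
    y: f64,
}

impl ToValue for Point {
    fn to_value(&self) -> Value {
        let mut m = BTreeMap::new();
        m.insert("label".to_string(), Value::Str(self.label.clone()));
        m.insert("x".to_string(), Value::Num(self.x));
        m.insert("y".to_string(), Value::Num(self.y));
        Value::Map(m)
    }
}

// Only this function knows about JSON, and it takes a trait object.
fn to_json(value: &dyn ToValue) -> String {
    fn render(v: &Value) -> String {
        match v {
            Value::Num(n) => n.to_string(),
            Value::Str(s) => format!("{:?}", s),
            Value::Map(m) => {
                let fields: Vec<String> = m
                    .iter()
                    .map(|(k, v)| format!("{:?}:{}", k, render(v)))
                    .collect();
                format!("{{{}}}", fields.join(","))
            }
        }
    }
    render(&value.to_value())
}

fn main() {
    let p = Point { label: "origin".to_string(), x: 0.0, y: 0.0 };
    println!("{}", to_json(&p));
}
```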
Can you do both? Fast compile times and slightly slower execution for debug builds using dynamic dispatch, and long compile times with fast execution for release builds using static polymorphism, from the same code base.
It could probably be done using conditional compilation, which Cargo supports (rough sketch below), though that would require the programmer to write two versions of the same code. I doubt that it is possible for the compiler to do this optimisation automatically.
Recall that dynamic dispatch does not require you, as the programmer, to know which implementation is being used for a given polymorphic method or function - I find it difficult to see how the compiler could work out which implementation is being referred to and generate statically dispatched code without the programmer being explicit. If it could, there would be no need to be explicit at all (and consequently no need for static dispatch in Rust code), and all you would need for polymorphism is dynamic dispatch. However, although this would be incredibly convenient and ergonomic, the Rust compiler is unfortunately not capable of magic.
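To illustrate the "two versions behind a cfg" idea from above, a rough sketch (purely illustrative; the debug/release split here just piggybacks on `debug_assertions`):

```rust
use std::io::Write;

// Release builds: generic, monomorphized per writer type (static dispatch).
#[cfg(not(debug_assertions))]
fn log_line<W: Write>(out: &mut W, msg: &str) -> std::io::Result<()> {
    writeln!(out, "{}", msg)
}

// Debug builds: one copy compiled against a trait object (dynamic dispatch).
#[cfg(debug_assertions)]
fn log_line(out: &mut dyn Write, msg: &str) -> std::io::Result<()> {
    writeln!(out, "{}", msg)
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    // `&mut Vec<u8>` coerces to `&mut dyn Write` in the debug version and
    // instantiates `W = Vec<u8>` in the release version, so the call site
    // is identical either way.
    log_line(&mut buf, "hello")?;
    Ok(())
}
```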
I think it's considered old hat by now for JITs in languages with polymorphism to inline a little bit of dynamic dispatch code into the call sites. The branch predictor gets to work its magic and removes the call overhead in a high number of cases.
I think I read somewhere that Javascript engines do something similar, with some extra code to de-optimize when you fiddle with the object prototype.
I’m not super familiar with the details, but all the major JavaScript engines definitely do extremely involved runtime optimizations, and I wouldn’t be surprised at all if the case you described is one of them.