I think it is a little ironic that he speaks of performance culture but simultaneously advises using dynamic dispatch and avoiding parametric polymorphism. I can see the justification in non-critical code paths, but serialisation is a pretty important part of most networked software nowadays, so I do not think that smaller binaries and faster compilation times (better developer experience) justify a performance hit in the form of dynamic dispatch through crates like miniserde.
Performance culture has you measure the actual performance implications, then make an informed decision. Is the code on a performance-critical path? Maybe some of your serialization code is, but it's extremely unlikely that a dynamic dispatch when parsing command line args is the reason your app is slow. Also be aware that highly inlined code does nicely in microbenchmarks but might have significantly negative performance implications in a larger system when it blows out the I-cache.
> Also be aware that highly inlined code does nicely in microbenchmarks but might have significantly negative performance implications in a larger system when it blows out the I-cache.
I see this assertion a lot, but I have never actually seen a system in which inlining that would otherwise be a win in terms of performance becomes a loss in a large system. LLVM developers seem to agree, because LLVM is quite aggressive in inlining (the joke is that LLVM's inlining heuristic is "yes").
I'd be curious to see any examples of I$ effects from inlining specifically mattering in practice in large systems.
Fiora refactored the MMU code emitted by Dolphin to a far jump, which had significant performance improvements over inlining the code [0]. She had an article about it in PoC || GTFO [1].
Interesting, that's a good case. Though it's a bit of an extreme one, because it's jitcode for a CPU emulator. I'm not sure how relevant that is to Rust, though it's certainly worth keeping in mind.
In my experience, i$ is much bigger than everyone thinks, and they over-emphasize optimizing for it whenever someone brings up code size. It can soak up a lot. That said, for JITs, where code is not accessed very often and in weird patterns, it can matter quite a lot.
Hm, I've profiled a lot of software over the years and not once have instruction cache misses been a problem, even in large, template-rich, Boost-heavy C++ codebases.
Systems that JIT large amounts of code (HHVM, etc) deal with this trade-off all the time. See e.g. https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/KeithA... for an old discussion of some of the issues (e.g. inlined/specialized versions of memcpy were slower overall than a standalone-slower outlined version).
You're asking for something which is a bit awkward to find, because it requires a bunch of code in a loop to pressure the cache, and then someone noticing that inlining one thing versus not makes all the difference.
The most likely people to be able to answer this one would be game devs or video codec hackers, at a guess.
I do know that inlining choices can have massive effects on executable size. I've seen more people complain about this kind of thing. It's most noticeable when controlling the inlining of a runtime library function in a language a bit more high level than Rust - I'm thinking of Delphi, with its managed strings, arrays, interfaces etc.
One somewhat related example I can think of is how the V8 JavaScript engine switched from a baseline compiler to a baseline interpreter. The interpreter version has less startup latency (because compiling to bytecode is less work) and uses less RAM (because bytecodes are more compact).
It isn't exactly about inlining, but it is an example where optimizing for size also optimized for speed at the same time.
Any time you have an error/exception/abort path, you almost never want to inline it (LLVM probably has attributes to prevent that, but I'm not sure whether Rust uses them). Also, LLVM does get a little too aggressive with things like unrolling, so I wouldn't be surprised if it inlined too aggressively too.
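For what it's worth, Rust does expose this directly: `#[cold]` and `#[inline(never)]` are stable attributes you can put on the failure path yourself. A minimal sketch (the function names are just illustrative):

```rust
// Keep the rarely-taken failure path out of line so it doesn't bloat the
// caller's body when the hot path gets inlined.
#[cold]
#[inline(never)]
fn fail(msg: &str) -> ! {
    panic!("invariant violated: {}", msg);
}

fn checked_div(a: u32, b: u32) -> u32 {
    if b == 0 {
        fail("division by zero");
    }
    a / b
}

fn main() {
    println!("{}", checked_div(10, 2));
}
```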
I do agree with you with the measuring aspect. Part of building high-performance systems is being able to measure performance in an accurate and actionable way, and consequently optimise code paths that have significant performance impacts.
However, I do believe that a competent engineer has the judgement to see, roughly, where performance hits are likely to arise and to optimise accordingly. Command-line arguments would likely not fall under this mandate, but serialisation to stdout is a good candidate for well-designed and well-optimised code. A nice side-effect is that this also avoids significant refactors down the line when you need to, in this case, change your serialisation from dynamic dispatch to static dispatch.
One thing I like about Rust a lot is that it lets you choose which hit you want to take when it comes to polymorphism. You can manually (and without much difficulty!) prefer dynamic dispatch if it's important to you to keep your binary size small, but you can also choose static dispatch and allow some replicated copies of your parametric code if that's what you want to optimize for.
It's also worth noting that if you are using polymorphic functions that are only ever called in your code with a single known type parameter, then your program should be just as efficient as if you wrote a monomorphic version with that fixed type in both runtime _and_ binary size, which means the only disadvantage to the polymorphism there is compile time.
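A minimal sketch of the two options (names are just illustrative):

```rust
use std::fmt::Display;

// Static dispatch: monomorphized per concrete type, direct (inlinable)
// calls, potentially several copies of the code in the binary.
fn print_static<T: Display>(value: &T) {
    println!("{}", value);
}

// Dynamic dispatch: a single compiled copy, calls go through a vtable.
fn print_dyn(value: &dyn Display) {
    println!("{}", value);
}

fn main() {
    print_static(&42);
    print_static(&"hello");
    print_dyn(&42);
    print_dyn(&"hello");
}
```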
It would be nice if Box<dyn Trait> implemented Trait. Then we would be able to write the function only once with parametric polymorphism and then at callsite decide if we want to monomorphize for the given type or not.
You can provide this implementation yourself easily enough though. I agree it's maybe not ideal that this needs to be done for every Trait you want this behavior for.
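Something like this, with a made-up example trait:

```rust
trait Draw {
    fn draw(&self) -> String;
}

// Forward the trait through the box, so `Box<dyn Draw>` itself satisfies
// `T: Draw` and can be handed to generic code unchanged.
impl Draw for Box<dyn Draw> {
    fn draw(&self) -> String {
        (**self).draw()
    }
}

fn render<T: Draw>(item: &T) -> String {
    item.draw()
}

struct Circle;

impl Draw for Circle {
    fn draw(&self) -> String {
        "circle".to_string()
    }
}

fn main() {
    let boxed: Box<dyn Draw> = Box::new(Circle);
    // `render` is monomorphized for `Box<dyn Draw>`, but the inner call is dynamic.
    println!("{}", render(&boxed));
}
```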
I’ve been using C and C++ (sorry, Rust), like, forever, and I think I should hate them with every fiber of my soul. A high-level programming language is “supposed” to let me forget, for the higher being’s sake, all machine-specific details and focus on the logic of the problem at hand. (Hello, FORTRAN.)
If you, as the programmer, would prefer not to know and remember machine-specific details, you can use Rust or C++ easily. Just don't expect that you will always get the best performance possible for your hardware. And I think that is justified and reasonable.
I've always thought the fact that parametric polymorphism always results in monomorphised code is just a temporary deficiency in the language/compiler.
I think if a problem can be solved with either, the default choice should generally be parametric polymorphism rather than subtype polymorphism, simply from a logical point of view.
Haskell is often described as passing around "dictionaries" corresponding to class instances. Presumably Rust could add the same functionality depending on how a type parameter is declared (eg, `fn foo<T: Foo>() ..` denotes a monomorphised function whereas `fn foo<%T: Foo>() ..` could denote a non-monomorphised function which at runtime takes arguments specifying the size and alignment of `T` as well as a vtable corresponding to the `Foo` trait). This would also make polymorphic recursion possible.
Swift takes this approach: monomorphisation is an implementation detail/optimisation.
In languages like Rust and Swift, where values don't have a uniform representation (that is, not always a pointer), this takes a lot of infrastructure, and a lot of performance/optimiser work to get reasonable performance for common code: the Swift compiler has quite a bit of code devoted to making generically-typed values behave mostly like statically-typed ones, with minimal performance cliffs.
Rust's approach is that this sort of vtable-/dictionary-passing has to be done explicitly (with a trait object), and such values have restrictions.
To a very large extent, we already have this with `fn foo(foo: &dyn T)` (or `foo: Box<dyn T>` for the owned version). What I would find even more interesting is the compiler much more aggressively factoring out the common code from the multiple instances, ideally compiling it only once and putting only that one version in the binary.
`T` there would be a trait though, not a type. You should still be able to have, for example, `fn foo<%T>(v: Vec<T>)` in which case you can still call the function with a regular `Vec<i32>`, since if the size/alignment is simply passed as an argument at runtime, you can operate on the existing non-boxed representations of data. The only thing that's different is the specialisation of the instructions essentially happens at runtime rather than at compile time.
I have, however, thought that it should be done automatically based on heuristics, but since the notion of "zero-cost abstraction" is considered the default, I don't think this would be desirable.
Well, I have this anecdote. We switched from serde to our own serialization/deserialization scheme (it still uses serde, but only for the JSON part), which is heavily based on dynamic dispatch, and actually made it faster.
It wasn't an apples-to-apples comparison, but it was several times faster at the time (my memory doesn't serve me well, but something around 3x to 5x). Compile times also went down (well, at the time :) ). It was mostly due to how some of serde's features work (flatten and tagged enums), though.
I made a separate, cleaner experiment (https://github.com/idubrov/dynser), which does not show as dramatic an improvement (again, it wasn't apples to apples; there were other factors which I don't remember), but it does show some.
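If it helps to picture the approach, here's a rough, simplified sketch of what serialization through dynamic dispatch can look like; this is my own illustration, not the actual dynser or miniserde code:

```rust
use std::collections::BTreeMap;

// A simplified value model: every type lowers itself into this, so the
// JSON writer below is compiled exactly once.
enum Value {
    Num(f64),
    Str(String),
    Map(BTreeMap<String, Value>),
}

// Object-safe trait: one vtable per type, instead of one instantiation of
// the whole serializer per (type, format) pair.
trait ToValue {
    fn to_value(&self) -> Value;
}

struct Point {
    label: String,
    x: f64,
    y: f64,
}

impl ToValue for Point {
    fn to_value(&self) -> Value {
        let mut m = BTreeMap::new();
        m.insert("label".to_string(), Value::Str(self.label.clone()));
        m.insert("x".to_string(), Value::Num(self.x));
        m.insert("y".to_string(), Value::Num(self.y));
        Value::Map(m)
    }
}

// Only this function knows about JSON, and it takes a trait object.
fn to_json(value: &dyn ToValue) -> String {
    fn render(v: &Value) -> String {
        match v {
            Value::Num(n) => n.to_string(),
            Value::Str(s) => format!("{:?}", s),
            Value::Map(m) => {
                let fields: Vec<String> = m
                    .iter()
                    .map(|(k, v)| format!("{:?}:{}", k, render(v)))
                    .collect();
                format!("{{{}}}", fields.join(","))
            }
        }
    }
    render(&value.to_value())
}

fn main() {
    let p = Point { label: "origin".to_string(), x: 0.0, y: 0.0 };
    println!("{}", to_json(&p));
}
```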
Can you do both? Fast compile times and slightly slower execution for debug builds using dynamic dispatch, and long compile times with fast execution for release builds using static polymorphism, from the same code base.
It could probably be done using conditional compilation, which Cargo supports (rough sketch below), though that would require the programmer to write two versions of the same code. I doubt that it is possible for the compiler to do this optimisation automatically.
Recall that dynamic dispatch does not require you, as the programmer, to know which implementation is being used for a given polymorphic method or function - I find it difficult to see how the compiler could work out which implementation is being referred to and generate statically dispatched code without the programmer being explicit. If it could, there would be no need to be explicit at all (and consequently no need for static dispatch in Rust code), and all you would need for polymorphism is dynamic dispatch. However, although this would be incredibly convenient and ergonomic, the Rust compiler is unfortunately not capable of magic.
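To illustrate the "two versions behind a cfg" idea from above, a rough sketch (purely illustrative; the debug/release split here just piggybacks on `debug_assertions`):

```rust
use std::io::Write;

// Release builds: generic, monomorphized per writer type (static dispatch).
#[cfg(not(debug_assertions))]
fn log_line<W: Write>(out: &mut W, msg: &str) -> std::io::Result<()> {
    writeln!(out, "{}", msg)
}

// Debug builds: one copy compiled against a trait object (dynamic dispatch).
#[cfg(debug_assertions)]
fn log_line(out: &mut dyn Write, msg: &str) -> std::io::Result<()> {
    writeln!(out, "{}", msg)
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    // `&mut Vec<u8>` coerces to `&mut dyn Write` in the debug version and
    // instantiates `W = Vec<u8>` in the release version, so the call site
    // is identical either way.
    log_line(&mut buf, "hello")?;
    Ok(())
}
```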
I think it's considered old hat by now for JITs in languages with polymorphism to inline a little bit of dynamic dispatch code into the call sites. The branch predictor gets to work its magic and removes the call overhead in a high number of cases.
I think I read somewhere that Javascript engines do something similar, with some extra code to de-optimize when you fiddle with the object prototype.
I’m not super familiar with the details, but all the major JavaScript engines definitely do extremely involved runtime optimizations, and I wouldn’t be surprised at all if the case you described is one of them.