A Rust optimization story (quickwit.io)
192 points by francoismassot on Oct 23, 2021 | 45 comments



For what it's worth, it appears a C++ version [0] of that binary search algorithm with Clang 13 requires 9.45 cycles [1]. The clang++-generated ASM is quite similar to the rustc-generated ASM in the article. Not sure what accounts for the difference from the C++ implementations the author was comparing against.

Regardless, it's neat to see the two languages are so close in performance here. I wonder if down the line, the richer information about lifetimes, etc. that Rust provides will allow optimizations beyond those available to C++ -- or if this is already the case.

[0] https://godbolt.org/z/Mj7PWevex [1] https://bit.ly/3CawbUe


I'm not aware of any theoretical optimizations that lifetimes would enable. Aliasing information, absolutely, but the lifetimes themselves are purely used to reject a given program, and currently have no other impact. Maybe you could do something neat with reusing storage slots, though it seems like LLVM is already pretty good at that?
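
To make the aliasing point concrete, here is a minimal sketch (the function and the folding claim are mine, not from the thread). Because two &mut references are guaranteed not to alias, rustc can emit noalias metadata, and LLVM can then prove the store through y leaves *x untouched:

    // With &mut guaranteed not to alias, the store to *y provably
    // does not modify *x, so the second read of *x can reuse the
    // value already in a register (and `a + *x` folds to 2 * a).
    pub fn no_alias_demo(x: &mut i32, y: &mut i32) -> i32 {
        let a = *x; // load *x once
        *y = 0;     // cannot invalidate `a`
        a + *x      // no reload needed
    }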


There's an opportunity to avoid copying certain values due to move semantics, but last I checked it doesn't work yet because of LLVM. But it could one day:

https://stackoverflow.com/questions/38571270/can-rust-optimi...

Edit: This answer is pretty old, so it may have changed by now, but I couldn't find anything more recent in a quick search
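
For illustration, here is the shape of the missed optimization the linked answer describes (the function is hypothetical, and whether the copy actually appears depends on the rustc/LLVM version):

    // Returning a large value by move is semantically free, but it can
    // still be lowered as a memcpy from the local into the caller's
    // return slot instead of being constructed there in place.
    fn make_buffer() -> [u8; 4096] {
        let mut buf = [0u8; 4096];
        buf[0] = 42;
        buf // a "move" that may compile to a 4 KiB copy
    }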


That specific optimization still does not happen.


Reusing storage slots definitely sounds plausible, another possible option is automatically allocating values with the same lifetime into memory arenas, so that you only have to do one alloc and one free for an arbitrary number of objects with the same or similar lifetimes.
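
As a toy sketch of that idea (entirely mine): crates like typed-arena and bumpalo implement this properly; the hypothetical part is the compiler doing the placement automatically based on inferred lifetimes.

    // One Vec is the "arena": values with the same lifetime share its
    // single allocation, and everything is freed at once on drop.
    struct Arena {
        storage: Vec<u64>,
    }

    impl Arena {
        fn new() -> Self {
            Arena { storage: Vec::new() }
        }

        // Hand out an index rather than a reference, to sidestep
        // borrow-checker details in this toy version.
        fn alloc(&mut self, value: u64) -> usize {
            self.storage.push(value);
            self.storage.len() - 1
        }
    }

    fn main() {
        let mut arena = Arena::new();
        let a = arena.alloc(1);
        let b = arena.alloc(2);
        println!("{}", arena.storage[a] + arena.storage[b]);
    } // one free for everything allocated above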


That doesn't really make much sense for Rust, unfortunately.

That is, for stack objects, LLVM/etc already do this and lifetimes don't add any useful new information. A function will generally adjust the stack pointer once to allocate and free space for all its locals together.

Beyond that, lifetimes are answering the wrong question. For one thing, the language goes to great lengths to squash and stretch lifetimes to accept more programs- so function signatures are typically as general as possible (IOW they capture as little lifetime information as possible). This even includes entirely "forgetting" information about how/where things are allocated.

Actually using lifetimes or "regions" to control or track allocations probably makes more sense in a higher-level, perhaps more allocation-happy, language. You might be interested in looking at Vale, for example: https://vale.dev/


I did not know there was a CPU instruction timing simulator online!

https://uica.uops.info/


Casey Muratori released a video just last week where he uses it near the end as well. It's definitely worth a watch, but be warned: it's basically part 3 of 3, and while the context from the first two videos definitely helps, that's over 5 hours of (fascinating) content: https://youtu.be/1tEqsQ55-8I


That was also a significant takeaway from this for me! How cool!


Cool, but these tools aren't all that useful in practice, since memory access dominates performance in the real world.


Sure, both things matter, but this gives good insight into one of them (the cycles) at least. What would be really neat is if it also simulated memory and changed its behavior under some set of assumptions. Maybe such a thing exists?


It's not really possible by just running the program; the whole issue is that the cycles are usually dwarfed by a cache miss.

You can just assume you don't miss the cache, of course.


I wonder what critics of optimizer-oriented programming, such as Marcel Weiher (mpweiher), think of this post. Weiher has been repeatedly critical of Swift for relying way too much on LLVM's optimizations; I wonder if Rust, with its emphasis on zero-overhead abstraction, falls into the same category for him.


I am not familiar with 'mpweiher's opinions on optimizers, but I do want to make an important distinction: there are optimizations like "this abstraction will get inlined" or "the compiler is aware of this idiom" that will happen essentially 100% of the time. C++ pioneered the approach of "zero cost abstractions" that depend on a smart compiler to get you good code, with Swift and Rust following in its footsteps. This is how you can have a nice thing like an array type whose subscript function doesn't literally turn into a function call, etc.
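
As a sketch of that first, reliable category (my own example, not 'mpweiher's): the idiomatic iterator version below reliably compiles to the same tight loop as the manual one, with no closure call left per element.

    // Both functions compile to essentially the same loop; the iterator
    // adaptors and the closure are inlined away, and the bounds check in
    // the manual version is elided because i < xs.len() is known.
    pub fn sum_doubled_idiomatic(xs: &[u32]) -> u32 {
        xs.iter()
            .map(|&x| x.wrapping_mul(2))
            .fold(0u32, |acc, x| acc.wrapping_add(x))
    }

    pub fn sum_doubled_manual(xs: &[u32]) -> u32 {
        let mut total = 0u32;
        for i in 0..xs.len() {
            total = total.wrapping_add(xs[i].wrapping_mul(2));
        }
        total
    }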

What happened in this blog post is that the code was relying on a higher-level optimization that compilers aim to hit but necessarily cannot in every case: things like vectorization, loop unrolling, or branch elimination are keyed on heuristics that frequently change. In general this is not a big deal, because the compiler will pick something reasonable, but in the hottest sections of your code this kind of thing can kill performance. When people say they can beat a compiler, it's in these kinds of places where you'd do it, and on that point I do agree that compiler optimizations can be limited in their benefit.

IMO (which is heavily biased towards "zero-overhead abstractions") this is still a strong showing for smart compilers. In 99% of cases, the compiler gives you what you want with less code, and then you profile to find the spots where it needs a bit of help.


What I came away wondering is whether it would make sense to just check in the asm with the optimizations you like once you finagle it. The author mentions being concerned about regression in future compiler versions (which seems reasonable, since it already happened once), and this would solve that problem. The trade-off is that you would not get future optimizations for that architecture, but that seems like a pretty small risk.


I know some projects that use that policy. Here's an example off the top of my head: https://github.com/apple-open-source-mirror/objc4/blob/fd675.... It looks fairly reasonable, with method calls and such, but after inlining and optimization it compiles down to the handful of instructions the authors intended.


I agree!

... But I do wish I had some knobs in stable Rust to hint the compiler.

Mark a condition as unpredictable, a branch as likely, a loop as likely to be long, etc.

Some of this exists as intrinsics, but those are not accessible in stable Rust.


One thing that is stable is marking a function as cold, which usually has the effect of marking the calling basic block as unlikely.
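
A small sketch of that trick (the function names are made up):

    // #[cold] is stable: it tells the optimizer the function is rarely
    // called, so branches leading to it are usually laid out as the
    // unlikely path, keeping the hot path a straight fall-through.
    #[cold]
    #[inline(never)]
    fn overflow_fallback() -> u64 {
        0 // rare, slow path
    }

    pub fn increment(counter: u64) -> u64 {
        match counter.checked_add(1) {
            Some(next) => next,          // hot path
            None => overflow_fallback(), // treated as unlikely
        }
    }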


C++ has slowly added some of these for branches; perhaps we will see compiler extensions for loops and such as well. I am sure Rust will get some of these at some point too.


Idle question: how did we end up with the term "inverted index" for this sort of thing?

The term "index" is used because of an analogy to the index of a book. But the so-called "inverted" index is the same way round as a book's index: you look up a word and it gives something analogous to page numbers.


Books (especially technical books) often have two indices: the forward index at the beginning of the book, which gives you a list of chapters/sections and where to find them; and the reverse index at the end, which gives you a list of terms and, for each term, a list of places to find it.


It's the opposite of a 'forward index'; the Wikipedia page (and its references) covers some of the background. I think you're overestimating how much this terminology is based on one particular kind of book index: indexing, even before electronic computers, was more sophisticated than that.

https://en.wikipedia.org/wiki/Inverted_index
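
A toy sketch of the two shapes being discussed (my own example): a forward index maps document -> terms, while an inverted index maps term -> documents.

    use std::collections::HashMap;

    // Build an inverted index: for each term, the list of document ids
    // in which it appears.
    fn build_inverted(docs: &[&str]) -> HashMap<String, Vec<usize>> {
        let mut inverted: HashMap<String, Vec<usize>> = HashMap::new();
        for (doc_id, text) in docs.iter().enumerate() {
            for term in text.split_whitespace() {
                inverted.entry(term.to_string()).or_default().push(doc_id);
            }
        }
        inverted
    }

    fn main() {
        let docs = ["rust is fast", "rust is safe"]; // the forward view
        let index = build_inverted(&docs);
        println!("{:?}", index["rust"]); // [0, 1]
    }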


Looking at that, I get the impression that the story is:

- someone introduced the term "inverted file" for this sort of thing (which is sensible terminology)

- later people started considering these things to be a sort of database index

- so they started saying "inverted index" without paying attention to what that implies as an English phrase


I don't think that's the story. Again, I think you're overemphasizing a particular kind of book index you're familiar with. Here's a usage of 'direct' and 'inverted' index from 1903 that I just googled up:

https://books.google.com/books?id=O0AwAQAAMAAJ&pg=PA422&dq=%...


That's an index of deeds, and the distinction there is that "direct" is indexed by the grantor and "inverted" is indexed by grantee.

I don't think that's the same thing at all.


I think it shows quite clearly that your theory, that Information Retrieval nerds just got confused and misapplied a particular kind of book index, is inaccurate, and that IR nerds have been around for longer than one might think.


Sometimes coining a new term is so attractive that people cannot resist. Imagine having that on your resume.


I am not very experienced with SIMD, but wouldn't a SIMD-based binary search be faster?

In this particular case (assuming SIMD can do 4 comparisons at once), we have an array with 128 elements, so we could compare the needle against indices {24, 48, 72, 96} at once and narrow the search down to roughly a fifth of the array in one step. Repeat this until we get the answer.

This is a very approximate idea, and there are a lot of edge cases and things to consider. But couldn't we solve this problem in 4-5 cycles with a SIMD binary search?
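
A rough x86-64 sketch of that 4-way split (layout and names are mine, not from the article; note that the four pivot loads below are the scalar "gather" the reply warns about):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // Compare the needle against four pivots of a sorted [i32; 128] at
    // once; the popcount of the comparison mask says how many pivots the
    // needle exceeds, i.e. which of the five sub-ranges to search next.
    #[cfg(target_arch = "x86_64")]
    unsafe fn pick_bucket(arr: &[i32; 128], needle: i32) -> usize {
        let pivots = _mm_set_epi32(arr[96], arr[72], arr[48], arr[24]);
        let nv = _mm_set1_epi32(needle);
        let gt = _mm_cmpgt_epi32(nv, pivots); // lane = all-ones if needle > pivot
        let mask = _mm_movemask_ps(_mm_castsi128_ps(gt));
        mask.count_ones() as usize // 0..=4: index of the next sub-range
    }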


SIMD gathers are slow on commodity hardware, often as slow as loading and inserting each element one at a time.


Excellent writeup, very succinct and easy to follow.


Nice writeup, especially the explanation of the low-level asm stuff!


Pretty interesting that a branchless linear search would be faster than a branching binary search! It makes sense; it just goes against the normal intuition. Just goes to show that benchmarking is always important.
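
For reference, the branchless linear variant is tiny (a sketch in the spirit of the article, not its exact code): instead of branching on each comparison, count how many elements are below the needle, which the compiler can lower to cmov/SIMD.

    // Branchless lower bound: every comparison becomes straight-line
    // arithmetic (a 0-or-1 add), so there is nothing to mispredict.
    pub fn lower_bound_branchless(arr: &[u32], needle: u32) -> usize {
        arr.iter().map(|&x| usize::from(x < needle)).sum()
    }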


> it just goes against the normal intuition.

To me it's perfectly intuitive.

However, the other example he gave blew me away:

    return if cond { ... } else { ... }
The compiler evaluating both results and choosing the right one using a cmov, rather than using a branch to evaluate just one result, was not what I would have expected. My, how times have changed.
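
A minimal sketch of the pattern (my example): with cheap, side-effect-free arms, the compiler is free to compute both and select with a cmov.

    // Both arms may be evaluated; a conditional move then picks the
    // result, avoiding a potentially mispredicted branch.
    pub fn select(cond: bool, a: u64, b: u64) -> u64 {
        if cond { a.wrapping_add(1) } else { b.wrapping_mul(3) }
    }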


So many interesting findings in the post.

The tantivy binary search and the Rust standard library binary search have only one difference at the end, knowing the size of the slice ahead of time, right?

Could the Rust compiler infer the size ahead of time?


With inlining maybe?


I assume that the lack of cmov generation affects C++ too with clang?


It is too early to assume that. LLVM just fails to optimize that specific piece of library code; when the author implements it himself, LLVM does choose cmov instructions. It can still find the optimization, but either something is preventing it in the first case, or LLVM's cost model thinks the alternative is faster.


If the author is reading, the phrase,

> Please follow me in my rabbit hole.

... means something rather different to what you intended, which is probably,

> Please follow me down the rabbit hole.

But thank you for the giggle. :-)


It means the same thing. The author is just clarifying that this is their rabbit hole they were in.

What else would it possibly mean?


Yeah, the full paragraph is "Today I will be your rabbit guide. Please follow me in my rabbit hole.", which, to me as a native speaker, makes it pretty clear that the author is guiding you through their rabbit hole of Rust optimization and not something more juvenile.


Does this mean that RabbitMQ must be renamed, like Coq, for the "psychological safety" of students?


Haha, this reminds me of a previous company I worked at, whose backend was at one point so architected as to enqueue every single HTTP request onto RabbitMQ to be processed, and then consume the response and return it to the client. (Don't ask me why - for whatever reason, they wanted an entirely async system.)

Anyway, RabbitMQ was so unreliable that they ended up having a 'Rabbit Watcher' service, devoted to monitoring Rabbit and alerting them when it went down. And when that service proved unreliable, they added a Rabbit Watcher Watcher. Presumably it was when that service failed that they finally decided to ditch RabbitMQ...


We have a service here that is just terrible and crashes a lot, sometimes multiple times a day. And instead of rewriting it, they have a program that watches it and restarts it if it fails...

I can't believe the same nonsense exists elsewhere. Ours has nothing to do with RabbitMQ, though.


Haha, at a multi-billion-dollar company too, in my case. The truth is that companies don't tend towards the best code possible; they tend towards the worst code allowable that still serves their purposes.


I think this is an incorrect description of the reason for renaming Coq.



