Hacker News new | past | comments | ask | show | jobs | submit login
Why Hashbrown Does a Double Lookup (gankro.github.io)
356 points by mbrubeck 36 days ago | hide | past | web | favorite | 112 comments

If memory overhead (load factor) isn't a big issue, readers may also find cuckoo hashing rather interesting. (Theoretically, it has a worst-case constant lookup time. Learned about it in one of my university algorithms classes, but have yet to see an implementation in practice.)


"If memory overhead (load factor) isn't a big issue, readers may also find cuckoo hashing rather interesting."

The issue with memory overhead (load factor) is actually something that Cuckoo hashing solves compared to linear probing. Bucketized Cuckoo hashing supports much higher load factors than linear probing. You can have your cake and eat it too.

I have implemented hash tables with open addressing and tombstones in the past (linear probing, triangular probing etc.) and I much prefer Cuckoo hashing now.

When Cuckoo hashing is combined with a bloom filter in the first position to reduce cache misses to the second position, it's almost perfect in every way: fast (drastically reduced cache misses), constant lookup time in the worst case (not so for linear probing), higher load factors (80% or more for Cuckoo hashing vs at most 50% for linear probing - this also affects amortized resize latency), not to mention elegant simplicity.

In case anyone is interested, here are some design details and decisions for an optimized bucketized Cuckoo hash table with bloom filters implemented for Node.js:


How does cuckoo hashing work for 80% load factor. It may be possible that you fill out both the positions and your eviction algorithm is then no longer constant time for worst case.

AFAIK, cuckoo works best when load factor is less than 50%. Wikipedia seems to also agree with me.

Did you read the link?

> Each bucket contains up to 8 elements to support a hash table load factor of 80% or higher, unlike linear or chained hash tables which support a load factor (elements / capacity) of at most 50% and which waste the other 50% of memory.

Here’s a practical implementation in Twitter’s cache backend, https://github.com/twitter/pelikan/tree/master/src/storage/c...

I assume that’s constant lookup time assuming well distributed hash values? The absolute worst case for a hash table is when every key has the same hash, and then you must fall back to something that’s at best log-n time.

It's constant-time lookup (only two possible places to check). But if you give it the "return 4" hash function, it will fail to insert after the second or third entry. The result is the same, you have to fall back to something (changing the hash function; making the table larger; open addressing with chaining; etc.)

That’s boring. Any algorithm can be constant time if you’re allowed to fail after a constant amount of work.

It will rebuild the hashtable when that happens. As long as the load factor remains below 50%, (for the most basic implementation) insertions will remain constant anyways. You don't need to allocate new tables.

How is that possible? I insert three pathological keys. The third one collides with both, so it rebuilds the table. The third one still collides with both, so it rebuilds the table again. Repeat. Eventually either we fail or we end up in something that’s worse than constant time.

Have you read about how the algorithm works?

A bit. When you collide in both tables, you switch to a different hash function and try again. With real-world data, that should work great. But with pathological data (i.e. the worst case behavior) that will never succeed.

If you phrase the problem as "design a better hash table, given the hash function h(x) = 4", then the obvious first thing to do is discard the provided hash function, and use any reasonable hash table implementation with a new hash function, say h'(x) = x. (If x "is an object" instead of a primitive type; then some underlying representation of x in a register is a pointer; that memory address can typically be converted to an integer that can be the hash value.)

Thus, I would argue that the problem reduces to the case where you have a reasonable hash function.

The standard Java java.util.HashMap currently falls back to a Tree(Map) if a bucket gets too full (more than 5? Not sure, a LinkedList if not too full). But I wonder, couldn't you not use a different hash and have use a HashMap as a bucket?

You could, but it’s theoretically possible to have keys whose hashes always collide, even with different hash functions.

This is not a purely theoretical concern, I should note. There are practical DoS attacks that can be carried out by deliberately crafting colliding keys. The current state of the art seems to be to make it impractical for attackers to accomplish this by incorporating secret data into the hash function, rather than by ensuring the theoretical worst case is acceptable. Which is probably the right approach.

My limit is 128. 5 is very common with a high load factor and fast hash function.

This seems to assume that all information needed to determine if two keys are equal is incorporated into the resulting hash value.

It's premised on the idea that you can dynamically change the hash function such that eventually you will find one with fewer than some fixed number of items sharing the same hash value.

There are two problems: 1) It completely undermines the "worst-case constant lookup", because there is no upper bound on the number of times you might have to change the hash function and rebuild the table before you find one with sufficiently few collisions.

2) It is a much stronger requirement on the keys than other hash tables have. With other hash tables I can simply omit "difficult to hash" data from my hash function, on the basis that enough other data is hashed to avoid a significant performance penalty. With this implementation, the entire hash table will simply fail.

This additional failure mode is also a security concern - if an untrusted user can affect what data is added to the table, it could cause a severe DoS where the table will either enter an infinite loop trying to find a non-conflicting hash function, or have an unexpected failure mode (most people don't expect their hash tables to randomly fail). Even if it is somehow designed such that a hash function can always be found, the attacker could spend time up-front finding values that take a long time for that hash function to be reached, causing numerous re-hashes of the table.

What are you talking about?

> It's premised on the idea that you can dynamically change the hash function such that eventually you will find one with fewer than some fixed number of items sharing the same hash value.

Why would you think there are any fixed numbers? The OP is about a stdlib hash table (for Rust). Comparable hash table implementations all resize the table when necessary to decrease the load factor. Performance with respect to load factor still matters, of course, less resizing is better, but there are no fixed numbers of anything.

> It completely undermines the "worst-case constant lookup", because there is no upper bound on the number of times you might have to change the hash function and rebuild the table before you find one with sufficiently few collisions.

Why would you need to change the hash function or rebuild anything during lookup? It's lookup. Nobody is changing anything.

> It is a much stronger requirement on the keys than other hash tables have.

What hash table doesn't require keys to be hashable?

Those are really questions and issues of implementation than of theory.

1. Tabulation hashing provides guarantees against chosen key attacks.

2. In addition, bucketized cuckoo hashing provides guarantees on the probability of resize (low compared to linear probing), and multiple resizes (very low compared to linear probing).

3. Further, most hash table implementations that I know of have layers of defense, including limiting the number of sequential resizes for a given key.

Putting everything together, if anything, I would be more concerned with the failure mode for linear probing (scanning the whole table before resizing) than for Cuckoo hashing (scanning two fixed-size buckets before resizing).

> a security concern - if an untrusted user can affect what data is added to the table, it could cause a severe DoS

Isn’t that a general problem [1] that is addressed with randomization of the input data before hashing?

[1] https://www.purehacking.com/blog/josh-zlatin/introduction-to...

The author is also in the thread on /r/rust if you have any questions for them: https://www.reddit.com/r/rust/comments/b38cwz/why_hashbrown_...

I went to university with him. Really nice guy. Every time we talk, I say something like "any day now dude, I'm going to really get into Rust" and then I never do because I'm awful.

I've been saying the same since last year. I'm finally picking it up just trying to use it for regular everyday scripting work. It's been slow but atleast I'm starting now.

Slow indeed. Too slow...

This is pretty amazing, i really like the fact that a fastest hash table implementation is going to be part of the Rust's stdlib.

Where does it say that this is the fastest hash table implementation?

It's accepted knowledge that the Google Swisstable is currently the fastest hashtable around. (The C++ implementation) But there are three variants, and this looks like the slowest of the three, but the best for dynamic languages.

It should depend a lot on what you put into it and the access pattern, no? You can talk about averages, about complexity, about benchmarks, but I feel like "fastest hashtable around" is a very odd expression, as if to desire the world to be a lot less complex than it is and for there to be exactly one best.

As I said there are two best, plus the worse third variant. Not implemented by google, only described, but now apparently implemented in Rust. Haven't checked closely, because Rust is kinda unreadable compared to C what exactly is going on. Need to see the SIMD assembly also. You cannot really trust much what's going on with Rust, as there's too much hype and lies, but it undeniably got better recently.

Is there anywhere we can read more about who says Google Swisstable is the fastest and why?

So - what's the motivation to use "open addressing" vs chaining, which I thought was the more common approach to solving this.

I assume there must be a substantial performance gain for this to be used as it seems significantly more complicated, any information on how much better it is?

Open addressing has better cache performance, and for most workloads is much faster than chaining implementations. Hashbrown's implementation is based on Google's SwissTable, and they explained the reasoning behind their choices in this CppCon talk: https://www.youtube.com/watch?v=ncHmEUmJZf4

Cliff Click (of JVM fame) did a presentation once on an open addressing hash table implementation that he believed was lock free (it used CAS and possibly some memory barriers). I wouldn't want to argue with Cliff Click on concurrency but everybody makes mistakes. I haven't checked back if anyone found any bugs yet.

I'm also curious if or how that implementation would translate to other languages. What he did might not be declarable in the Rust ownership model.

How does open addressing handle deletions? I never figured this out.

Update: So I just read the article (I didn't have the chance earlier) and I see it explains tombstones, which seem like a pretty clever solution I wasn't thinking about. What I had been confused about, though was the other more-obvious attempt at a solution, which is backshifting. I think I had gotten stuck is what you do with the spot that opens up after the backshift... it might have been part of another probe chain (or multiple, for that matter) that had jumped over it before, so how do you find one of these chains to move back an item into this slot (which you would have to do)?

The article mentions a couple of options, one of which is tombstones. In a chaining implementation, the load factor is straightforward: num_items / table_size. Adding items increases the load factor and deleting items reduces it.

With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.

The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has be recreated. This is true even if the average number of items remains relatively constant over time, and is unlike a chaining implementation.

In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.

You can eliminate tombstones over time by:

  # any time a tombstone immediately preceeds a empty, it can be marked empty
  [ $ _ ] -> [ _ _ ]
  # any time you lookup a key, it can be swapped with a tombstone immediately preceeding it
  [ $ A ] -> [ A $ ]
  # (moving the tombstone closer to a empty that will destroy it)
  # if you don't have iterators, you can also jump over other keys
  [ $ B A ] -> [ A B $ ]
  [ $ B C A ] -> [ A B C $ ]  # etc
  # (this will cause a iterator at A to yield B again, or at B to skip A)
How well this keeps the load factor down depends on how aggressively you look for tombstone disposal opportunities, but it does keep it down.

Mutating the table on lookup seems pretty gross, though.

I would argue that "the table" is not mutated, only the internal state of its implementation. Every time you access any information, a cache at some layer below you is updated. Is that also gross?

Yes. Normally you can have one thread writing to a data structure OR many threads reading the data structure at any given time and not need to worry about them causing problems. (This situation is common enough that we have things called "reader-writer mutexes" or "shared-exclusive" mutexes.)

As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.

But you don't need to write every time, only on occasion, so you can actually use a read write lock and in the nominal case many threads can read just fine.

That said, it's probably still better to avoid this unless it's absolutely necessary to modify the underlying structure sometimes, I recently had to do this for an LRU cache.

Right. And in Rust implementing the hash table that way will suddenly make the table no longer flagged as "Sync" by the compiler, so you will be unable to share it between threads.

That's a great point. It's still a possible optimization with a compare-and-swap or if you can determine that you're in a single threaded context.

Eh, it’s the classic amortized approach. Whoever you ca “touch” the data and you’re right there already due to a lookup, it makes sense to amortize your data structure housekeeping IMO.

TBH, the right answer is always due to the users use case (Amortization and housekeeping really help with purely functional data structures), and benchmark data.

Well, it still works (just slower) if you only do fixups during {table[key] = val} operations. But honestly, if you're using a probabilistic data structure like a hash table, the ship has already sailed on gross implementation details.

No, that's the common approach with chained lists: Move found list item to first.

If you want more material, the "Hash Tables" chapter in my book walks through it:


Reading the article reveals that you can either use tombstones or move the elements back into the slot they "would have been in" if the deleted element was never added

This is actually described quite well in the OP.

Oh I see, thanks. I'm on my phone about to go to a meeting and haven't had a chance to read the article yet.

The easiest one is tombstones (i.e. "this item is deleted") to keep the chain alive, or backshifting (i.e. moving all items in the chain forward one slot).

For back-shifting, I'd prefer to actually seek for the tail item in the list (just before you prove the key doesn't exist) and move that BACK to the evicted slot, rather than update all of the keys.

However the tombstone idea fits better with minimizing mutations and improving the potential for concurrency if the design only cares that inserts/deletes are atomic (not that they're propagated in parallel instantly).

For the 'move back' idea to be that safe I'd still want to use a tombstone value, but it would need to also include a pointer to the moved location. The list of tombstones would need to be submitted (as in, message queue) to a kind of garbage collection thread that sat on the messages for a pre-determined validity time and then did a scrub of the key/value to make sure the compaction still made sense. A shorter interval might also allow for that deferred compaction to be replaced by a new entry instead.

I don't like any of that as a generic solution though, as the trade-offs with each bit of extra effort and code complexity seem likely to be premature optimization and sources of potential bugs when not considered in a specific problem domain.

I'm not sure that backshifting is as easy as "moving all items in the chain forward by one slot". Consider the hashtable [A, B1, B2, _, _], where one element is subsequently added after B2, giving [A, B1, B2, X, _]. Now when we remove B1 and shift B2 forward one slot ([A, B2, _, X, _]), we have to shift X forward if it hashes to the second or the third slot in the table, but not the fourth. So there might be multiple chains involved; if it was one contiguous chain only, we could simply arrange for the "hole" in it to be filled with no need for shifting all the items in it, and it would be quite efficient. However, it seems that it's not so simple.

Yup, the chains can be interleaved. Often it's best to save the hash of each key as well as the key itself. Comparing the full hash (rather than the modulus of the hash) will eliminate many keys faster than doing a full key comparison, and having the full hash available means recreation on expansion is cheap, as all the keys don't need rehashing. The full hashes can then be used to distinguish between different chains if you decide to do backfilling, again cheaper than recalculating key hashes.

Of course, having the hashes available also speeds up recreation, should the tombstone approach be used.

Basically, keep the full hashes around :)

Certainly not with open addressing as it will destroy all the cache-line advantages. with seperate chaining it's very common, esp. for resizing.

The key is usually indirected, which means a pointer, which is usually bigger than the hash code.

(And of course you could use parallel arrays if you're super concerned about cache lines, though the tradeoffs would very much depend on whether hash misses or hits are the expected mode.)

And if it’s not a pointer but a word-size integer, just use an invertible hash function (aka permutation aka block cipher) so you can store the hash code in place of the key :)

Pointer chasing is very expensive if your chains end up in different cache lines. The bottleneck on a lot of application isn't how fast your instructions run, but how fast you can get data to them.

All* high performance hashtables use open addressing, because chaining tends to mean (multiple) indirection to addresses outside the table.

* not sure if that's literally true, but I've never seen anyone do chaining in performance-sensitive applications, and all the papers on fast hash tables use some way of open addressing.

I think it largely depends on whether you're using intrusive data structures or not. Even in open addressing you still must always load the object to verify the correct key, so collisions still result in [wasted] random memory access. If the hash table uses dedicated nodes then you have to indirect twice just to verify the key. But in all the hash tables I've written the hash node is already a part of the object, so there's no additional indirection.

However, Rust is allergic to intrusive data structures. Which is a little ironic as single ownership ostensibly would otherwise make intrusive data structures more practical as you're already effectively precluded from inserting an object into multiple hash tables and so there's no potential for contention over the object's internal node fields.

I suppose linear probing also permits prefetching of successive buckets, though ideally collisions should be the exception so such prefetching might just waste bandwidth.

But surely Rust hashtables just store the objects in the table itself? Rust has value semantics, after all. No indirections unless you opt in and `Box` your objects.

AFAIU it depends on the object and its traits. I'm curious how Rust's hash tables are used in practice, i.e. whether the common use case is copying vs taking ownership of the object itself. Though either way what really matters is whether indirection is required to verify the key (e.g. String).

In C I prefer red-black trees over hash tables because memory management is easier, you get smooth performance characteristics (no resize hiccups), and you're immune from computational complexity attacks without sweating over hashing algorithms. But I use intrusive node fields for my trees, and also tend to keep keys (even string keys) internal to the object as well, which substantially closes the cache performance gap.

Ownership is the usual, at least in the ones I’ve seen. Otherwise you have to start reasoning about lifetimes and that’s trickier.

The main case that I know of where you really don't want to use open addressing (using tombstones) is when you have both of the following requirements: a) need to repeatedly insert and delete items, and b) can't tolerate (relatively) really bad worst case insertion performance.

I can't speak directly to the performance of non-tombstone based open addressing implementations, but it definitely seems like the need to move things around after deletions would be significantly more work than the fastest chaining implementation I've come up with.

If there is metadata available at every slot then you can track overflows rather than tombstones. These let you terminate the search earlier (you don't need to find an empty) and can be cleaned up on erase, because they are a property of the keys in the map rather than the keys that are no longer in the map.

folly's F14 maps use this strategy, and handle erase + insert workloads fine with no rehashing needed. https://github.com/facebook/folly/blob/master/folly/containe...

Interesting, both the Qt and Java ones seem to use chaining, but I guess they're not designed with these sorts of extremely demanding applications in mind.

One thing to note is that this performance tradeoff is actually the reverse of what it was ~20 years ago. This is because CPU speeds have improved dramatically more than RAM speeds over a similar period.

Open addressing does do more comparisons than chaining. It used to save time to traverse the linked list rather than to spend CPU cycles on those additional comparisons. Now, because CPU cycles are relatively cheaper, the opposite is true.

The reason the Qt and Java hashtables use chaining is simply that the code in question was initially written back when the tradeoff ran the other way.

If hash codes are stored inline alongside keys, code comparisons can be made without an indirection while key comparison (often strings, seldom anything without an indirection) usually needs an indirection. Hash codes are very cheap to compare.

These indirections have always been costly on any machine with virtual memory. TLB misses aren't free. I'd have put any estimate on the tradeoff being the other way around to more than 25 years.

The java.util.HashMap implementation uses chaining, but for example the google guava libraries that are frequently used are open addressing based.

AFAIK an important factor to get good performance out of a hash table is to dimension the table so that collisions are rare. In this scenario it's not too surprising that the additional cost of sometimes having to iterate through the table beats chasing pointers.

Furthermore as the article explains they can use SIMD to search several buckets at once, something that wouldn't really be possible if they didn't exist in contiguous memory.

Answer: SIMD.

What looks like "two loops" is really "do two simple SIMD operations".

I think rather the answer is:

Tombstones in place of deleted elements mean that finding out that key is not in the table requires going to a different place than finding the right place to finally put it. Hence two lookups. SIMD just makes keeping two lookups separate economical.

>And so our "two loops" are usually just "do two simple SIMD operations". Not quite so offensive when you say it like that!

No, it's not quite so offensive, but this doesn't explain why it's the best option. Is there no equally-fast way to write the first-tombstone implementation with SIMD instructions? The answer seems to be in the sketch of the implementation, which I'm having trouble understanding.

EDIT: I'm watching the original SwissTable talk now... would it really have been worse to use 2 bits for empty/tombstone/full/sentinel, and 6 bits for hash prefix?

EDIT 2: More implementation info. Tombstones are actually rare, because if any element in your 16-wide chunk is empty, you don't have to create a tombstone. In the very best case (a perfectly distributed hash function), your hashmap has to be ~94% full before it's even possible to fail this. Because tombstones are so rare, it's better to save the single bit for extra confidence in the hash prefix.

So, here is my understanding of the implementation and its rationale:

* Every bucket in the backing array has a corresponding byte in a metadata array

* 1 bit of this byte stores whether the bucket is empty, the other 7 bits are for a hash prefix

* SIMD instructions search 16 bytes at a time, checking: this bucket is not empty, this bucket's key matches the first 7 bits of my key

* Since 16 buckets are checked for emptiness at the same time, you can avoid creating a tombstone for a bucket if any of the other 15 buckets are empty (just set it to empty, i.e. set the first bit to 0)

* This means that tombstones are very unlikely- you'll probably rehash before you get to the load factor where you start seeing tombstones

* Since tombstones are so unlikely, it's more valuable to add an extra bit to the hash prefix than it is to quickly find tombstones

My question remains: why can't the first search return the offset of the first empty bucket? In this loop, why is there not an else that saves the offset?: https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a...

Ok, I got it. They're exactly the same. Either way you'd need to do a second search, because you're trying to differentiate between 3 states: "probably a match", "empty", or "deleted". A much better way than stealing a bit from the hash prefix is using a special value that represents "empty or deleted", and that's exactly what SwissTable does: https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a...

All good points in the article.

> it was doing something that was so offensive to people who care about collection performance

Hmm. It also helps here to go back to academia. Big O notation doesn't usually express coefficients/constants, it usually only deals with exponents.[1] The Wikipedia page has a good explanation as to why.

Opinion: coefficients/constants are, however, useful if you're running over a network or some other latency-bound operation.

[1]: https://en.wikipedia.org/wiki/Big_O_notation#Properties

The whole point of the article is that constant factors work in Hashbrown's favour.

What's the memory efficiency compared to the previous implementation? AIUI robin hood hashing with backwards shift deletion could achieve rather high load ratios, and thus keep memory footprint small. What I read about tombstone based open addressing suggests that it requires a lower load factor and thus more memory.

Another alternative: Robind Hood Hashing


(Might actually be the same technique, it's not quite clear!)

Uh...this is what Rust uses currently. ;)

So how does this degrade for bigger hash tables? Surely the two loops implementation is less efficient since your cache is trashed by the time you do the second look up

This varies not with the size of the whole hash table, but with the distance from any given index to the first empty bucket. By keeping the load factor constant, you can grow the hash table as much as you want without degrading the expected performance.

Good point, thanks!

Not necessarily, as the sibling mentioned load of the table matters, but also the fact that the single loop has to do other housekeeping may also mean that it could ruin any cache prefetching that could be happening because of the less predicable pattern. Ideally the housekeeping variables are going to be in registers but that's not always going to be possible. In this case, two loops that look at the data sequentially are going to predict very very nicely and probably load more of the cache as it's searching than anything that could be random.

TL;DR: You first need to look for the hash "tombstone", then you need to find the actual insertion which may be very far away.

Raymond Hettinger (HN commenter and all around Smart Dude (tm)) has a good talk from PyCon 2017 about the evolution of Python dictionaries that goes into tombstones and so much more: http://www.youtube.com/watch?v=npw4s1QTmPg

> . For instance, when compiled with sse2 support it can search 16 buckets at once.

Does Neon get similar order of magnitude benefit?

I wonder if there is any research into hardware architectures optimized for Rust.

The things Rust wants more than other languages are support for fast array bounds checks and fast integer overflow checks.

Unfortunately none of the current popular architectures seem to have either.

Actually, I'd say that support for fast range checks is less important in Rust than in some other languages, because the Rust compiler is quite good at removing the checks when a value can be proven to be in the right range. In general, Rust seems to cope quite well with current CPU architectures, although I do think that there's an open question as to how to support viable parallelism that not quite as fine-grained as current ILP approaches (pipelining, superscalar), nor quite as coarse as to be effectively exploited via multiple threads. ISTM that this is one portion of the design space that Rust might address quite nicely.

x86 seems reasonable to me:

Fast array bounds checking is, handling the lower bound by unsigned integer math :

  cmp index,bound
  ja crash_handler

Overflow handling is mostly

   jo crash_handler
That's 1 or 2 instructions and the jump has perfect prediction. I'd assume the cost is negligible compared with jump mispredicts and cache misses.

I read somewhere the main problem is LLVM not being quite capable of optimizing these decently: It has to handle all these extra jumps which causes lots of extra basic blocks.

Yeah, part of the unexplored optimization potential that Rust has is that LLVM was originally written with C in mind, and Rust achieves great results with it, but sometimes it can't use the additional info that the better typesystem provides because nobody wrote the code in LLVM, or because LLVM has bugs so the Rust devs have to turn off that additional info.

A jump, but assume not taken and don't populate the branch predictor might be useful, otherwise you can end up with a death by a thousand cuts thing. Superfluous branches thrash your branch predictor tables similar to how cache pressure thrashs.

I seem to recall, that at least jo has terrible pipelining implications because of the dependency on the flag register that way, no idea about the bounds check (might be a better way)

I don't see why JO would be any different from other Jcc, especially since JL is SF ≠ OF. Maybe you were thinking of the problem with INC/DEC and CF, but even then that's a problem of INC/DEC, not of Jcc.

It's not usually a big difference, but on modern Intel there is a some difference in performance between the different CMP/JCC options. The more common ones will "fuse" with the CMP and be executed as a single µop, but the rarer ones (like JO and JS) do not fuse with CMP, and thus can add a cycle of delay (and have the overhead associated with executing another µop). The optimization is called "macro op fusion". Details here https://en.wikichip.org/wiki/macro-operation_fusion and here https://www.agner.org/optimize/microarchitecture.pdf (pages 108 and 125).

Fixing that wouldn't seem to require any architectural changes in x86 though, it's just that Intel hasn't cared enough about JO and friends to optimize them this way.

Unfortunately, fast array bounds checks and fast integer overflow checks are anathema to speculative execution so aren't happening any time soon.

I'm becoming convinced that we really need to just go back to the Alpha 21164 architecture and stamp out multiple copies with really fast interconnect.

Could you explain your reasoning? I would think that bounds and overflowing checking should have fantastic performance on modern processors. The almost-always-false test gets correctly predicted as false, speculative execution chooses the correct branch without delay, and a few cycles later (but without visible delay) some simple math confirms that the correct branch was chosen. In my mind, the hardware support is already there and almost perfect; it's the languages and compilers that are lagging. But maybe I'm missing something obvious. Where do you think the issue is?

> he almost-always-false test gets correctly predicted as false, speculative execution chooses the correct branch without delay, and a few cycles later (but without visible delay) some simple math confirms that the correct branch was chosen.

So, why don't compilers do this already? Nothing stops them. The answer is that they lose a ton of performance.

There is a lot of housekeeping in order to keep track of speculative execution and it happens on every single mathematical operation in your regime. And you pay that continually for an operation that happens "very rarely" as you point out.

Overflow/underflow probably isn't so bad, but array bounds checks are very bad. You have at least one bound that is variable and so causes an extra "Compare to X" where X is a non-zero constant and so needs to be loaded. So your tight loops all look like "increment I; compare I to 0; compare I to X; do real work" That's two extra instructions that the speculative execution system has to keep track of every loop and it eats a lot of the speculative execution resources very quickly.

An in-order chip like the 21164 doesn't execute the loops any faster, but it doesn't have to pay hardware chip area because of speculative execution. Given that everything is moving to high multi-core, it would be better to let a stuck core give way to another quickly than to try speculate deeper--especially as SSDs continue to increase in speed--and especially because, quite often in a battery operated world, your would rather shut down while waiting.

In addition, if you trap on these, it will break an enormous amount of software in existence. Look at how many people went absolutely ape over gcc suddenly doing aggressive optimizations on undefined behavior.

(And, by the way, every microprocessor used to compute underflow on arithmetic instructions (8080, Z80, 6800, etc.), so you need to ask why that went away.)

I had a bit of a hard time following your argument, but nkurz is right here in the sense that things like overflow and bounds checks are practically the poster child for things that perform well on wide OoO processors. They are leaves in dependency graph and hence don't add to dependency chains, and on wide designs the ALU op(s) will often execute almost for free.

It is in-order designs that take a slightly larger relative penalty since in some sense "all instructions matter" there so bounds checks cost similar to surrounding operations that do real work.

You make a lot of valid observation about the cost of actually implementing a modern, deep and wide out of order architecture: it is much harder and uses more silicon that some in-order designs, but once you have it (and largely that's what we have today as general purpose CPUs), the type of branches that are involved in overflow and bounds checks are not costly.

> ... So your tight loops all look like "increment I; compare I to 0; compare I to X; do real work" ...

Compilers are generally smart enough to hoist these comparisons out of the loop, at least in a static, AOT-compiled language like Rust.

You're right though that in-order chips are a lot more power-efficient, and the in-order approach is the one that's taken in most RISC-V implementations (which seems to be a highly comparable architecture to the Alpha you mention).

In Rust you can probably get away with not speculating and thus having "imprecise exceptions" for overflow and maybe array reads (if you don't care about side channels), since panic essentially destroys all non-Mutex-protected accessible memory anyway and Mutex-protected memory is made inaccessible via poisoning (as long as they don't get moved through things like library calls and atomic ops).

It would require a minor compatibility break, significant checking work and changes to unsafe code to make it work, but it should be doable.

Array writes do need precise checking though, since otherwise you can write to arbitrary memory.

Could the concept of "ownership" be optimized for somehow in hardware?

I'm not a Rust expert, but I think ownership and borrowing go away after compile-time.

Of course, but what pre-compile architectural decisions could be made to accommodate such a higher level construct?

The point of the borrow checker is that it's run at compile time rather than runtime. Putting something similar into the architecture necessarily implies a runtime check, which is similar to the idea of a smart pointer, which exists in Rust as well as other languages.

That’s correct.

> fast integer overflow checks

Rust's implicit overflow checks are only in debug mode I think. Overflow is defined to wrap in release mode.

Yes, due to the hit in performance. We'd love to turn them on unconditionally, and the language is spec'd so that we can someday, but since overflow is not a memory safety issue on its own, and there is a performance hit when checking, this is the current compromise.

I suspect ADA would like those too, if available.

The Rust ABI isn't fixed and it uses LLVM. That's like asking if there's hardware architectures optimized for LLVM.

There is nothing rust specific about this algorithmic.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact