
You can see the table here, it's not a lot of code: https://github.com/stanford-futuredata/index-baselines/blob/.... Each bucket just has 8 keys and you want to test whether one of them is equal to the key you're searching for.



That's pretty clever. There's an 8-way SIMD comparison that matches all 8 keys of a bucket against the probe key in a single 256-bit AVX2 operation.
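For anyone who wants to see what that looks like in intrinsics, here's a minimal sketch. It assumes 32-bit keys stored contiguously in an 8-slot bucket so the whole bucket fits in one register; the actual repo code may lay things out differently.

    #include <immintrin.h>   // AVX2 intrinsics (compile with -mavx2)
    #include <cstdint>

    // One bucket: 8 contiguous keys. The 32-bit key size is an assumption made
    // here so a full bucket fills exactly one 256-bit register.
    struct Bucket {
        uint32_t keys[8];
    };

    // Returns an 8-bit mask with bit i set if keys[i] == key (0 means "no match").
    static inline int bucket_match_mask(const Bucket* b, uint32_t key) {
        __m256i probe = _mm256_set1_epi32((int)key);                  // broadcast probe key to all 8 lanes
        __m256i slots = _mm256_loadu_si256((const __m256i*)b->keys);  // one 256-bit load grabs the whole bucket
        __m256i eq    = _mm256_cmpeq_epi32(probe, slots);             // single 8-way comparison
        return _mm256_movemask_ps(_mm256_castsi256_ps(eq));           // compress lane results into a bitmask
    }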

So in this Cuckoo hashing scheme, it's not 1 item per location and 2 candidate buckets per key (as in the classical "textbook" Cuckoo Hash), but instead 8 items per bucket (checked very quickly with 8-way 256-bit AVX2 comparisons), giving 16 effective locations (2 buckets, 8 slots per bucket) for each key.
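The two-bucket lookup on top of the helper above is just as short. A sketch, with placeholder hash functions (h1/h2 are made up for illustration, not whatever the repo actually uses):

    #include <cstddef>

    // Placeholder hash functions, for illustration only.
    static inline size_t h1(uint32_t k) { return k * 0x9E3779B1u; }
    static inline size_t h2(uint32_t k) { return k * 0x85EBCA77u; }

    // Check all 16 candidate slots for `key`: two 256-bit loads, two 8-way compares.
    bool cuckoo_contains(const Bucket* table, size_t num_buckets, uint32_t key) {
        size_t i1 = h1(key) % num_buckets;
        size_t i2 = h2(key) % num_buckets;
        return bucket_match_mask(&table[i1], key) != 0
            || bucket_match_mask(&table[i2], key) != 0;
    }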

I was wondering how you were getting 99% occupancy and other such numbers. I'm not exactly up-to-date on the latest Cuckoo techniques, apparently. But even then, this simple implementation you have is far deeper and more useful than the one discussed on, say... Wikipedia.

---------

A few notes for those who aren't up-to-date with the latest architectures:

1. Intel machines can perform 2x 256-bit loads per clock cycle, but ONLY using AVX, AVX2, or AVX512 instructions. Two loads per clock can be sustained through the L1 and L2 caches, only slowing down at L3. Latency characteristics differ between the L1 and L2 levels, of course.

2. Most AVX2 instructions have superb numbers: executing in a single clock, and often several per clock thanks to superscalar execution. Skylake supports 2x 256-bit comparisons per clock cycle, simultaneously with the two loads: ports 0 and 1 can each do a 256-bit comparison while ports 2 and 3 do the loads, and ports 4, 5, and 6 are still open to perform other work (see the sketch below).
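Putting both points together: a loop like the sketch below issues two independent 256-bit loads and two 8-way compares per iteration, which is roughly the instruction mix that lets the load ports and vector ALU ports work in parallel. This is purely illustrative (and assumes GCC/Clang builtins), not the repo's code.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Count how many 32-bit elements of `data` equal `key` (any tail past a
    // multiple of 16 is ignored for brevity). Each iteration has two independent
    // 256-bit loads (ports 2/3) and two 256-bit compares (ports 0/1), so the
    // out-of-order core can keep both pairs of ports busy.
    size_t count_matches(const uint32_t* data, size_t n, uint32_t key) {
        __m256i probe = _mm256_set1_epi32((int)key);
        size_t count = 0;
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m256i a = _mm256_loadu_si256((const __m256i*)(data + i));
            __m256i b = _mm256_loadu_si256((const __m256i*)(data + i + 8));
            int ma = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(probe, a)));
            int mb = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(probe, b)));
            count += (size_t)__builtin_popcount((unsigned)ma);  // GCC/Clang builtin
            count += (size_t)__builtin_popcount((unsigned)mb);
        }
        return count;
    }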

So effectively, the code checks ~16 bucket locations of a Cuckoo hash in roughly the same time as the total latency of a single DDR4 RAM access. In fact, their reported ~36ns is damn close to the total memory latency of the Skylake system as a whole. The implementation is likely bottlenecked on the memory controller and can't get much faster.

http://www.corsair.com/~/media/corsair/blog/2015-09/ddr3_vs_...

Note that this Corsair chart is from 2015, and DDR4 RAM Latencies as well as memory-controllers have gotten better.

Good job with the implementation! That's an outstanding level of speed you've achieved. I stand by what I said earlier: you've basically built a 400MPH truck and barely spent any sentences on it, lol. That's the really interesting part of your post IMO.

-----------------

I think I better understand the conclusion of the blogpost as well:

> Does all this mean that learned indexes are a bad idea? Not at all: the paper makes a great observation that, when cycles are cheap relative to memory accesses, compute-intensive function approximations can be beneficial for lookups, and ML models may be better at approximating some functions than existing data structures.

The Google machine-learning paper is an indicator of how slow RAM is more than anything else. RAM is so slow that spending a ton of cycles on a machine-learning model may (in some cases) be more efficient than making an extra memory access because your index guessed the wrong location.

The Cuckoo hash has two major benefits. First, it's effectively checking 16 locations at once (AVX2 comparisons, plus the processor speculatively loading both candidate buckets at the same time).

Second, a Cuckoo hash lookup provably needs at most two probes. So two memory accesses (which the processor can issue "speculatively" in parallel) is the worst that can happen. Indeed, the latency numbers reported here are close to the practical limits of the x86 processor! A machine-learning algorithm can't do much better than that.
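To make the "speculative" part concrete, here's a variant of the earlier lookup sketch that touches both candidate buckets before comparing, so the two potential cache misses overlap instead of happening back to back. An out-of-order core largely does this on its own for two independent loads; the explicit prefetches just make the idea visible.

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Bucket, h1, h2, and bucket_match_mask are the placeholders from the
    // sketches above, not the authors' actual code.
    bool cuckoo_contains_prefetch(const Bucket* table, size_t num_buckets, uint32_t key) {
        size_t i1 = h1(key) % num_buckets;
        size_t i2 = h2(key) % num_buckets;
        _mm_prefetch((const char*)&table[i1], _MM_HINT_T0);  // start both bucket fetches...
        _mm_prefetch((const char*)&table[i2], _MM_HINT_T0);  // ...so their latencies overlap
        // Worst case: these two bucket reads. Never a third access.
        return bucket_match_mask(&table[i1], key) != 0
            || bucket_match_mask(&table[i2], key) != 0;
    }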


Yup, this is a great explanation. Cuckoo hashes actually do much better in terms of load if you use buckets with multiple items, as we did here. The classical one with one item per location can get "stuck" with unresolvable cycles between elements at 50% load, but when you have these buckets, there are many more ways to resolve collisions, which lets you greatly increase the load. 99% is extreme but definitely doable. Writing a fuller post on this is definitely a good idea -- we may do it in the future.
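In case it helps anyone picture why the extra slots matter, here's a minimal, standalone insertion sketch (illustrative only, not the actual code from the repo). With 16 candidate slots per key instead of 2, most inserts find a free slot immediately, and the eviction chains that do occur stay short even at very high load.

    #include <cstdint>
    #include <cstddef>
    #include <cstdlib>

    // Bucketized cuckoo insertion: 8 slots per bucket, 2 candidate buckets per key.
    // All names and constants are illustrative. EMPTY marks a free slot, and
    // duplicate checking is omitted for brevity.
    constexpr uint32_t EMPTY = 0;
    constexpr int SLOTS = 8;
    constexpr int MAX_KICKS = 500;  // give up (and rehash/resize) after this many evictions

    struct Bucket { uint32_t keys[SLOTS]; };

    static inline size_t h1(uint32_t k) { return k * 0x9E3779B1u; }  // placeholder hashes
    static inline size_t h2(uint32_t k) { return k * 0x85EBCA77u; }

    // Put `key` into any free slot of bucket `b`, if one exists.
    static bool try_place(Bucket* b, uint32_t key) {
        for (int s = 0; s < SLOTS; ++s) {
            if (b->keys[s] == EMPTY) { b->keys[s] = key; return true; }
        }
        return false;
    }

    bool cuckoo_insert(Bucket* table, size_t num_buckets, uint32_t key) {
        size_t i1 = h1(key) % num_buckets;
        size_t i2 = h2(key) % num_buckets;
        // 16 candidate slots: usually one of them is free and we stop here.
        if (try_place(&table[i1], key) || try_place(&table[i2], key)) return true;

        // Both buckets full: evict a random occupant and relocate it to its
        // other candidate bucket, repeating until something finds a free slot.
        size_t cur = i1;
        for (int kick = 0; kick < MAX_KICKS; ++kick) {
            int victim_slot = rand() % SLOTS;
            uint32_t victim = table[cur].keys[victim_slot];
            table[cur].keys[victim_slot] = key;  // new key takes the victim's slot

            key = victim;                        // now relocate the displaced key
            size_t a = h1(key) % num_buckets;
            size_t b = h2(key) % num_buckets;
            cur = (cur == a) ? b : a;            // its other candidate bucket
            if (try_place(&table[cur], key)) return true;
        }
        return false;  // a real table would rehash or grow at this point
    }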



