Classical Data Structures That Can Outperform Learned Indexes (stanford.edu)
139 points by chmaynard 4 months ago | 20 comments



We actually tested cuckoo hashing against learned hash-maps and in our setup it was slower. One of the main reasons is that cuckoo hashing is quite sensitive to the payload, and we used a scenario where we assumed records with meta-data fields, as used by many systems at Google. We will update our technical report soon to include those comparisons as well.

However, we do agree with the authors of the blog post that there are cases in which traditional data structures will remain the best choice; as is so often the case, it depends. In addition, it might be possible to combine learned models with cuckoo hashing, so these are not even mutually exclusive ideas.

--(from original authors of the Learned Index paper)


We'd love to hear about the size of the payloads used. With the code we posted, larger records take 10-15 ns more per query, which is still faster than the numbers in the paper, but of course YMMV based on the table implementation and other factors.
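For what it's worth, one reason payload size matters so much for a table like this is the bucket layout, which decides how many cache lines a single probe touches. A purely hypothetical illustration (not the layout from the paper or from the posted code):

    #include <cstdint>

    struct RecordWithMetadata {        // a record carrying meta-data fields, as many systems use
        uint64_t key;
        uint64_t timestamp;
        uint32_t version;
        uint32_t flags;
        char     payload[40];
    };                                 // 64 bytes total: a full cache line per record

    // Layout A: records stored inline in the bucket. An 8-slot bucket now spans
    // 8 cache lines, so probing a bucket is no longer a one-cache-line operation.
    struct InlineBucket {
        RecordWithMetadata slots[8];   // 512 bytes
    };

    // Layout B: only the keys live in the bucket. A hit costs one extra
    // dereference, but the probe itself stays within a cache line or two.
    struct IndirectBucket {
        uint64_t keys[8];              // 64 bytes
        RecordWithMetadata* slots[8];  // 64 bytes of pointers
    };

Which layout wins depends on record size and hit rate, which is presumably why the payload assumptions matter so much in both sets of measurements.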


The point that ML practitioners are making is not that machine learning beats every other algorithm, or even that they beat humans at specific tasks.

The point they're making is that ML practitioners are acceptable at every task. You learn ML and you can have great indexes for your databases, inverse kinematics, cancer treatments, and YouTube movie generation that actually gets views, all using the same theory.

It's like car assembly in the 1980s: the fact that humans could assemble cars with a greater success rate than machines, and could fix the ones the machines didn't assemble correctly, didn't matter. Humans still beat machines on various metrics in car assembly, and automation won anyway.

That you can write a program that does 1% better at a specific task than an algorithm that can solve every task won't matter.


Consider this scenario. You can buy two different databases, one using ML-learned indexes and the other using the hash discussed in the article.

The hash database is 2x faster, up to 20x more memory efficient, and doesn't need training. It works reliably on your data from the day you spin it up and through any changes to the stored data.

Meanwhile, the ML-indexed database is going to be much slower out of the box; it will only be fast after learning the data. Worse, if the data changes significantly, it will need to be trained again before it performs well. And after all this, it still won't be as fast as the hash database.

For development, the ML-based index technology won't be faster to build either. If anything, it will probably be much slower, as there are all sorts of tricky nuances to building production-grade ML systems. When do you train? How do you validate the training data? What do you do if the results don't validate to a sufficient accuracy?

I think it's fascinating that ML can be applied to problems like indexing, but if you can produce a deterministic theoretically optimal algorithm, you should use that instead of ML every time.


> The point they're making is that ML practitioners are acceptable at every task. You learn ML and you can have great indexes for your databases, inverse kinematics, cancer treatments, and YouTube movie generation that actually gets views, all using the same theory.

Just to make sure I'm understanding you right - the claim is that you want an ML person on your team (or you want everyone on your team to know ML the way that everyone on your team knows calculus and Shakespeare, i.e., so they can brush up on it as needed), not that human domain knowledge in ML replaces human domain knowledge of what a database does, or what inverse kinematics is, or how to measure cancer, or how to film movies or draw animations, right?

The latter is definitely a potential future, but I don't think we're there yet (but I might be wrong!).


We're definitely not there yet. Right now it's just one domain after another (slowly) getting replaced by ML.

It'll expand though.


But I don't think these are domains getting replaced by ML, any more than domains were replaced by calculus, or machinery, or even computers. Sure, cancer researchers need to know something about programming now, but they also continue to need to know more things about biology. Is that different for ML?


It kind of is. For a specific example, a decade ago face recognition, speech recognition and machine translation each incorporated a lot of domain-specific knowledge and were largely disjoint fields, since doing anything remotely useful required the specialized knowledge that each subfield had accumulated over the preceding decade. But nowadays a face recognition ML expert can achieve state-of-the-art speech recognition while not knowing a thing about phonetics, and a speech recognition expert can train a decent end-to-end MT solution while being ignorant of all the language/text-specific cases that required particular attention and specialized solutions and subsystems some 5 years ago.

This is quite a big practical shift in ML; we've moved from research on specialized solutions for separate problems to generalized solutions that work on large classes of problems with minimal adaptation, greatly reducing the need for domain-specific knowledge. I mean, it's still usually useful, but not absolutely necessary as it was before.


Don't you need to have domain expertise anyway to engineer your features? What exactly will you train your speech recognition models on, if you have no understanding at all of speech?

It may definitely lower the barrier of entry to those fields, but I'm not sure it has removed it altogether, at least not just yet.


With modern deep learning methods, you can often get state-of-the-art results without any feature engineering whatsoever - you do need a decent, efficient representation of the raw data, but that's about it.

That's the whole point, in many domains the accumulated knowledge about feature engineering isn't necessary anymore, since you can train a deep network to learn the same features implicitly from data, and (depending on the problem) it's quite possible that the learned features in the initial layers will be better than anything people have engineered before.

For your speech example, speech recognition systems used to contain explicit phonetics subsystems (where the chosen set of phonemes mattered), with separate acoustic models and language models. But now you can also get decent results from an end-to-end system, by throwing all the phonetic knowledge you had into the trash bin and training an end-to-end model from the sound data (though not a waveform but a frequency-domain representation, e.g. https://en.wikipedia.org/wiki/Mel-frequency_cepstrum - but that's not "understanding of speech", it's understanding of the audio data format) straight to the output text characters.


We are not talking about 1% better, but an order of magnitude better.


The important bits of the article: "the cuckoo hash, can achieve 5-20x less space overhead than learned indexes, and that it can be surprisingly fast on modern hardware, running nearly 2x faster".


The authors of the fine article did not test against cache-friendly binary search, for one. That approach can easily give a speedup of 2x or more.
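To illustrate one flavor of this, here is a minimal sketch of a branchless lower_bound over a sorted array (an Eytzinger layout is another common option); this is a generic sketch of the technique, not code from either the paper or the blog post:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Index of the first element >= key (i.e. lower_bound) in a sorted array.
    // The conditional move in the loop avoids a hard-to-predict branch per step.
    size_t branchless_lower_bound(const std::vector<uint32_t>& a, uint32_t key) {
        if (a.empty()) return 0;
        const uint32_t* base = a.data();
        size_t n = a.size();
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;   // typically compiles to a cmov
            n -= half;
        }
        return (base - a.data()) + (*base < key ? 1 : 0);
    }

Combined with explicit prefetching or an Eytzinger (BFS-order) layout, speedups of roughly that order over std::lower_bound are commonly reported.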

Another thing they didn't test against is a data-dependent choice of the middle index, which would play nicely with RLE encoding of the first bytes of the keys, especially if one chooses a reasonably big number of keys per page: at least 1024, not the small number of 128 (or was it 256?) used in the paper. Especially for log-structured merge trees, where learned indices can be used safely.

Anyway, the original paper was very sloppy at establishing a baseline, and very disappointing because of that.


I liked the article, but seriously... a SIMD-friendly Cuckoo hash? Erm... can we have a blog post on the design of THAT?? I'm going through their GitHub and they also make a reference to "FAST", which is a cache-friendly, SIMD-friendly tree implementation.

That sounds... non-trivial.

It's like reading a post that says "Oh yeah, trucks can beat trains. We made a truck that goes 400 miles an hour, for example, and its tested performance beats the stuff that Google is doing." This just begs the question: how did you make a 400mph truck?

Similarly, tell me more about this SIMD-based Cuckoo Hash. That sounds awesome. I can see why Google can beat a "normal" Cuckoo Hash, but one that has been SIMD-tuned sounds outrageous.


You can see the table here, it's not a lot of code: https://github.com/stanford-futuredata/index-baselines/blob/.... Each bucket just has 8 keys and you want to test whether one of them is equal to the key you're searching for.
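Roughly, that probe looks something like the following; this is a simplified sketch rather than the exact code in the repo, assuming 32-bit keys and AVX2:

    #include <immintrin.h>   // AVX2 intrinsics; compile with -mavx2
    #include <cstdint>

    struct Bucket {
        alignas(32) uint32_t keys[8];   // one 256-bit register's worth of keys
        uint32_t values[8];             // payloads, or offsets into a record array
    };

    // Returns the slot index of `key` inside `bucket`, or -1 if it is absent.
    inline int find_in_bucket(const Bucket& bucket, uint32_t key) {
        __m256i probe  = _mm256_set1_epi32(static_cast<int>(key));    // broadcast the key 8 times
        __m256i stored = _mm256_load_si256(reinterpret_cast<const __m256i*>(bucket.keys));
        __m256i eq     = _mm256_cmpeq_epi32(stored, probe);           // 8 comparisons at once
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));       // one bit per matching slot
        return mask != 0 ? __builtin_ctz(mask) : -1;                  // GCC/Clang builtin
    }

The whole bucket check is a handful of instructions and touches a single cache line of keys.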


That's pretty intelligent. There's an 8-way SIMD comparison operation that matches up with the 256-bit AVX2 register.

So in the case of Cuckoo hashing, it's not one item per bucket and two buckets per hash... (as per the classical "textbook" Cuckoo hash), but instead 8 items per bucket (compared very quickly with an 8-way 256-bit AVX2 comparison), with 16 effective locations (2 buckets, 8 locations per bucket) for each key.
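In other words, something along these lines; my own sketch of the idea, reusing the bucket probe sketched above (the multiplicative hash functions here are placeholders, not whatever the repo actually uses):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct CuckooTable {
        std::vector<Bucket> buckets;   // number of buckets is a power of two

        size_t h1(uint32_t key) const { return (key * 0x9E3779B9u) & (buckets.size() - 1); }
        size_t h2(uint32_t key) const { return (key * 0x85EBCA6Bu) & (buckets.size() - 1); }

        // Checks the key's two candidate buckets: at most 16 slots, at most two probes.
        bool lookup(uint32_t key, uint32_t& value_out) const {
            const Bucket& b1 = buckets[h1(key)];
            int slot = find_in_bucket(b1, key);
            if (slot >= 0) { value_out = b1.values[slot]; return true; }

            const Bucket& b2 = buckets[h2(key)];
            slot = find_in_bucket(b2, key);
            if (slot >= 0) { value_out = b2.values[slot]; return true; }
            return false;
        }
    };

Since the two bucket addresses don't depend on each other, the processor can speculate past the first miss and have both loads in flight, which is the "two memory locations at the same time" point made further down.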

I was wondering how you were getting 99% occupancy and other such numbers. I'm not exactly up to date on the latest Cuckoo techniques, apparently. But even then, this simple implementation you have is far deeper and more useful than the one discussed on, say... Wikipedia.

---------

A few notes for those who aren't up-to-date with the latest architectures:

1. Intel machines can perform 2x 256-bit loads per clock cycle, but ONLY using AVX, AVX2, or AVX512 instructions. Two-loads per-clock can be sustained through the L1 and L2 cache, only slowing down at the L3 cache. Latency characteristics differ of course between L1 and L2 levels.

2. Most AVX2 instructions have superb numbers: executing in a single clock, or even more than one per clock thanks to super-scalar execution. Skylake supports 2x 256-bit comparisons per clock cycle, simultaneously with the two loads (ports 0 and 1 can each do a 256-bit comparison while ports 2 and 3 do loads; Intel Skylake will still have ports 4, 5, and 6 open as well to perform more tasks).

So effectively, the code checks ~16 bucket locations of a Cuckoo hash in roughly the same time as the total latency of a DDR4 RAM access. In fact, their ~36ns is damn close to the total latency of the Skylake system as a whole. The implementation is likely memory-controller bottlenecked and can't get much faster.

http://www.corsair.com/~/media/corsair/blog/2015-09/ddr3_vs_...

Note that this Corsair chart is from 2015, and DDR4 RAM Latencies as well as memory-controllers have gotten better.

Good job with the implementation! That's an outstanding level of speed you've achieved. I stand by what I said earlier: you've basically made a 400MPH truck and barely spent any sentences on it, lol. That's the really interesting part of your post IMO.

-----------------

I think I better understand the conclusion of the blogpost as well:

> Does all this mean that learned indexes are a bad idea? Not at all: the paper makes a great observation that, when cycles are cheap relative to memory accesses, compute-intensive function approximations can be beneficial for lookups, and ML models may be better at approximating some functions than existing data structures.

The Google machine-learning paper is an indicator of how slow RAM is more than anything else. RAM is so slow that spending a ton of cycles doing machine learning may (in some cases) be more efficient than accessing memory and being wrong.

The Cuckoo hash has two major benefits. First, it's basically checking 16 locations at once (AVX2 instructions, plus speculatively looking at two memory locations at the same time).

Secondly, a Cuckoo hash lookup provably needs at most two probes, so two memory accesses (which the processor can perform "speculatively", in parallel) are the worst that can happen. Indeed, the latency numbers reported here are close to the practical limits of the x86 processor! A machine-learning algorithm can't do much better than that.


Yup, this is a great explanation. Cuckoo hashes actually do much better in terms of load if you use buckets with multiple items, as we did here. The classical one with one item per location can get "stuck" with unresolvable cycles between elements at 50% load, but when you have these buckets, there are many more ways to resolve collisions, which lets you greatly increase the load. 99% is extreme but definitely doable. Writing a fuller post on this is definitely a good idea -- we may do it in the future.
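For the curious, the insert path that makes those load factors reachable is, roughly, a random-walk eviction over these multi-slot buckets. A simplified sketch (not the repo's actual code), building on the table sketched above, where empty slots hold a kEmptyKey sentinel:

    #include <cstdint>
    #include <cstdlib>
    #include <utility>

    const uint32_t kEmptyKey = 0xFFFFFFFFu;   // sentinel; assumes real keys never take this value
    const int kMaxKicks = 500;                // give up (and, e.g., grow/rehash) after this many evictions

    // Places (key, value) in any free slot of bucket `b`; false if the bucket is full.
    bool try_place(Bucket& b, uint32_t key, uint32_t value) {
        for (int slot = 0; slot < 8; ++slot) {
            if (b.keys[slot] == kEmptyKey) {
                b.keys[slot] = key;
                b.values[slot] = value;
                return true;
            }
        }
        return false;
    }

    bool insert(CuckooTable& t, uint32_t key, uint32_t value) {
        for (int kick = 0; kick < kMaxKicks; ++kick) {
            // Any of the 16 candidate slots (8 per bucket, 2 buckets) will do.
            if (try_place(t.buckets[t.h1(key)], key, value)) return true;
            if (try_place(t.buckets[t.h2(key)], key, value)) return true;

            // Both buckets full: kick out a random victim and retry with it.
            // The victim still has an alternate bucket of its own, so the walk continues.
            Bucket& b = t.buckets[t.h1(key)];
            int victim = std::rand() % 8;
            std::swap(key, b.keys[victim]);
            std::swap(value, b.values[victim]);
        }
        return false;   // effectively full; a real implementation would rehash or grow
    }

With one item per location, the eviction chain has exactly one place to go at each step, which is why it gets stuck around 50% load; with 8 slots per bucket, each kick has 8 possible victims, each with its own alternate bucket to move to, so the walk almost always finds a hole even at very high occupancy.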


I missed the original research; what guarantees are made by the learned indices? Correctness?


Here's some info about the original research they are referencing http://www.eggie5.com/127-paper-review-the-case-for-learned-...



