
Classical Data Structures That Can Outperform Learned Indexes - chmaynard
http://dawn.cs.stanford.edu/2018/01/11/index-baselines/
======
edchi
We actually tested cuckoo hashing against learned hash-maps, and in our setup
it was slower. One of the main reasons is that cuckoo hashing is quite
sensitive to the payload, and we used a scenario where we assumed records with
meta-data fields, as used by many systems at Google. We will update our
technical report soon to include those comparisons as well.

However, we do agree with the authors of the blog post that there are cases in
which traditional data structures will remain the best choice; as so often, it
depends. In addition, it might be possible to combine learned models with
cuckoo hashing, so the two are not even mutually exclusive ideas.

--(from original authors of the Learned Index paper)

~~~
mateiz
We'd love to hear about the size of the payloads used. With the code we
posted, larger records take 10-15 ns more per query, which is still faster
than the numbers in the paper, but of course YMMV depending on the table
implementation and other factors.

------
candiodari
The point that ML practitioners are making is not that machine learning beats
every other algorithm, or even that it beats humans at specific tasks.

The point they're making is that ML practitioners are acceptable at every
task. You learn ML and you can have great indexes for your databases, inverse
kinematics, cancer treatments, and YouTube movie generation that actually gets
views, all using the same theory.

It's like car assembly in the 1980s: humans could assemble cars with a greater
success rate than machines, and could fix the ones the machines didn't
assemble correctly, but it didn't matter. Humans still beat machines on
various metrics in car assembly.

That you can write a program that does 1% better at a specific task than an
algorithm that can solve every task won't matter.

~~~
geofft
> _The point they're making is that ML practitioners are acceptable at every
> task. You learn ML and you can have great indexes for your databases,
> inverse kinematics, cancer treatments, and YouTube movie generation that
> actually gets views, all using the same theory._

Just to make sure I'm understanding you right - the claim is that you want an
ML person on your team (or you want everyone on your team to know ML the way
that everyone on your team knows calculus and Shakespeare, i.e., so they can
brush up on it as needed), not that human domain knowledge in ML replaces
human domain knowledge of what a database does, or what inverse kinematics is,
or how to measure cancer, or how to film movies or draw animations, right?

The latter is definitely a potential future, but I don't think we're there yet
(but I might be wrong!).

~~~
candiodari
We're definitely not there yet. Right now it's just one domain after another
(slowly) getting replaced by ML.

It'll expand though.

~~~
geofft
But I don't think these are domains getting _replaced_ by ML, any more than
domains were replaced by calculus, or machinery, or even computers. Sure,
cancer researchers need to know something about programming now, but they also
continue to need to know more things about biology. Is that different for ML?

~~~
PeterisP
It kind of is. For a specific example: a decade ago, face recognition, speech
recognition and machine translation each incorporated a _lot_ of
domain-specific knowledge and were largely disjoint fields, since doing
anything remotely useful required the specialized knowledge that each subfield
had accumulated over the preceding decade. But nowadays a face recognition ML
expert can achieve state-of-the-art speech recognition without knowing a thing
about phonetics, and a speech recognition expert can train a decent end-to-end
MT solution while being ignorant of all the language- and text-specific cases
that required particular attention and specialized subsystems some 5 years
ago.

This is quite a big practical shift in ML: we have moved from research on
specialized solutions for separate problems to generalized solutions that work
on large classes of problems with minimal adaptation, greatly reducing the
need for domain-specific knowledge. I mean, it's still usually useful, but not
absolutely necessary as it was before.

~~~
tincholio
Don't you still need domain expertise to engineer your features? What exactly
will you train your speech recognition models on if you have no understanding
of speech at all?

It may well lower the barrier to entry to those fields, but I'm not sure it
has removed it altogether, at least not just yet.

~~~
PeterisP
With modern deep learning methods, you can often get state-of-the-art results
without any feature engineering whatsoever - you do need a decent, efficient
representation of the raw data, but that's about it.

That's the whole point: in many domains, the accumulated knowledge about
feature engineering isn't necessary anymore, since you can train a deep
network to learn the same features implicitly from data, and (depending on the
problem) it's quite possible that the features learned in the initial layers
will be _better_ than anything people have engineered before.

For your speech example, speech recognition models used to contain explicit
phonetics systems (where the chosen set of phonemes mattered), separating
acoustic models and language models. But now you can also get decent results
from an end-to-end system: throw all the phonetic knowledge you had into the
trash bin and train an end-to-end model straight from the sound data (though
not from a raw waveform but from a frequency-domain representation, e.g.
[https://en.wikipedia.org/wiki/Mel-frequency_cepstrum](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
- but that's understanding of the audio data format, not "understanding of
speech") to the output text characters.

------
zzzcpan
The important bits of the article: "the cuckoo hash, can achieve 5-20x less
space overhead than learned indexes, and that it can be surprisingly fast on
modern hardware, running nearly 2x faster".

------
thesz
For one, the authors of the fine article did not test against cache-friendly
binary search. That approach alone can easily give a speedup of 2x or more.
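
For a concrete sketch of what "cache-friendly binary search" can mean, here is
a minimal Eytzinger (BFS-order) layout with a branchless lower bound. This is
my own illustration of the general technique, not code from the article:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Eytzinger (BFS-order) layout: the root is at index 1 and the children of
// index i are at 2i and 2i+1, so the first few levels of every search share
// the same few cache lines and stay hot in L1.
// Builds t[1..n] from a sorted array via an in-order walk of the implicit tree.
static void eytzinger_build(const std::vector<int>& sorted, std::vector<int>& t,
                            std::size_t& pos, std::size_t i) {
    if (i < t.size()) {
        eytzinger_build(sorted, t, pos, 2 * i);
        t[i] = sorted[pos++];
        eytzinger_build(sorted, t, pos, 2 * i + 1);
    }
}

// Branchless lower bound: returns the smallest element >= key, or -1 if none.
static int eytzinger_lower_bound(const std::vector<int>& t, int key) {
    std::size_t i = 1;
    while (i < t.size())
        i = 2 * i + (t[i] < key);              // branchless descent
    i >>= __builtin_ffsll((long long)~i);      // cancel trailing right turns (GCC/Clang builtin)
    return i ? t[i] : -1;
}
```

Build once from a sorted array (t needs n+1 slots, with index 0 unused);
lookups then touch roughly one new cache line per tree level instead of
jumping all over the sorted array.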

Another thing they didn't test against is a data-dependent choice of the
middle index, which would play nicely with RLE encoding of the first bytes of
the keys, especially if one chooses a reasonably big number of keys per page:
at least 1024, not the small number of 128 (or was it 256?) in the paper. This
holds especially for log-structured merge trees, where learned indices can be
used safely.

Anyway, the original paper was very sloppy about establishing baselines, and
very disappointing because of that.

------
dragontamer
I liked the article, but seriously... a SIMD-friendly Cuckoo hash? Erm... can
we have a blog post on the design of THAT?? I'm going through their GitHub,
and they also make a reference to "FAST", which is a cache-friendly,
SIMD-friendly tree implementation.

That sounds... non-trivial.

It's like reading a post that says "Oh yeah, trucks can beat trains. We made a
truck that goes 400 miles an hour, for example, and its tested performance
beats the stuff that Google is doing". This just raises the question: how did
you make a 400mph truck?

Similarly, tell me more about this SIMD-based Cuckoo Hash. That sounds
awesome. I can see why Google can beat a "normal" Cuckoo Hash, but one that
has been SIMD-tuned sounds outrageous.

~~~
mateiz
You can see the table here, it's not a lot of code:
[https://github.com/stanford-futuredata/index-baselines/blob/...](https://github.com/stanford-futuredata/index-baselines/blob/master/hashing.cpp).
Each bucket just has 8 keys, and you want to test whether one of them is equal
to the key you're searching for.
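
Roughly, the per-bucket probe looks like this. This is a hedged sketch of the
idea rather than the exact code in hashing.cpp; the scalar fallback computes
the same match mask, so it runs without AVX2 too:

```cpp
#include <cassert>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// One bucket: 8 x 32-bit keys, exactly one 256-bit AVX2 register wide.
struct Bucket { uint32_t keys[8]; };

// Returns a bitmask with bit i set iff keys[i] == key. With AVX2 this is a
// single load + 8-way compare + movemask.
static unsigned bucket_match(const Bucket& b, uint32_t key) {
#ifdef __AVX2__
    __m256i k  = _mm256_set1_epi32((int)key);                // broadcast the key
    __m256i v  = _mm256_loadu_si256((const __m256i*)b.keys); // load all 8 slots
    __m256i eq = _mm256_cmpeq_epi32(v, k);                   // 8 compares at once
    return (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
#else
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i)                              // same mask, scalar
        mask |= (unsigned)(b.keys[i] == key) << i;
    return mask;
#endif
}
```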

~~~
dragontamer
That's pretty intelligent. There's an 8-way SIMD comparison to match up to the
256-bit AVX2 register.

So in the case of cuckoo hashing, it's not one item per bucket and two buckets
per hash (as in the classical "textbook" Cuckoo hash), but instead 8 items per
bucket (compared very quickly with an 8-way 256-bit AVX2 comparison), giving
16 effective locations (2 buckets, 8 locations per bucket) for each key.

I was wondering how you were getting 99% occupancy and other such numbers. I'm
not exactly up to date on the latest cuckoo techniques, apparently. But even
then, this simple implementation of yours is way deeper and more useful than
the one discussed on, say, Wikipedia.

---------

A few notes for those who aren't up-to-date with the latest architectures:

1. Intel machines can perform two 256-bit loads per clock cycle, but ONLY
using AVX, AVX2, or AVX512 instructions. Two loads per clock can be sustained
through the L1 and L2 caches, only slowing down at the L3 cache. Latency
characteristics of course differ between the L1 and L2 levels.

2. Most AVX2 instructions have superb numbers, executing in a single clock or
even superscalar per clock. Skylake supports two 256-bit comparisons per clock
cycle, simultaneously with the two loads: ports 0 and 1 can each do a 256-bit
comparison while ports 2 and 3 do the loads, and Skylake still has ports 4, 5,
and 6 open to perform other work.

So effectively, the code checks ~16 bucket locations of a cuckoo hash in
roughly the same time as the total latency of a DDR4 RAM access. In fact,
their implementation's ~36ns is damn close to the total memory latency of a
Skylake system as a whole. The implementation is likely
memory-controller-bottlenecked and can't get much faster.

[http://www.corsair.com/~/media/corsair/blog/2015-09/ddr3_vs_...](http://www.corsair.com/~/media/corsair/blog/2015-09/ddr3_vs_ddr4_generational/ddr4-latency.jpg?la=en-us)

Note that this Corsair chart is from 2015, and DDR4 RAM Latencies as well as
memory-controllers have gotten better.

Good job with the implementation! That's an outstanding level of speed you've
achieved. I stand by what I said earlier: you've basically made a 400MPH truck
and barely spent any sentences on it, lol. That's the really interesting part
of your post IMO.

-----------------

I think I better understand the conclusion of the blogpost as well:

> Does all this mean that learned indexes are a bad idea? Not at all: the
> paper makes a great observation that, when cycles are cheap relative to
> memory accesses, compute-intensive function approximations can be beneficial
> for lookups, and ML models may be better at approximating some functions
> than existing data structures.

The Google machine-learning paper is more an indicator of how slow RAM is than
anything else. RAM is so slow that spending a ton of cycles doing machine
learning may (in some cases) be more efficient than accessing memory and being
wrong.

The cuckoo hash has two major benefits. First, it's basically checking 16
locations at once (AVX2 instructions, plus speculatively loading from two
memory locations at the same time).

Second, cuckoo hashes provably probe at most two locations, so two memory
accesses (which the processor can issue "speculatively") is the worst that can
happen. Indeed, the latency numbers reported here are close to the practical
limits of the x86 processor! A machine-learning algorithm can't do much better
than that.
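
To make that two-access guarantee concrete, here's a toy sketch of a
bucketized two-choice lookup. The hash constants and the zero-means-empty
convention are my own illustration, not from the posted code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bucketized cuckoo table: every key lives in one of exactly two buckets
// chosen by two independent hashes, so a lookup touches at most two cache
// lines no matter how full the table is. Keys must be nonzero here, since 0
// marks an empty slot.
struct CuckooTable {
    static const int SLOTS = 8;                    // 8 keys per bucket
    struct Bucket { uint64_t keys[SLOTS]; };
    std::vector<Bucket> buckets;

    explicit CuckooTable(std::size_t n) : buckets(n, Bucket{}) {}

    // Illustrative multiplicative hashes, not the ones from the posted code.
    std::size_t h1(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ULL) % buckets.size(); }
    std::size_t h2(uint64_t k) const { return (k * 0xC2B2AE3D27D4EB4FULL) % buckets.size(); }

    static bool in_bucket(const Bucket& b, uint64_t k) {
        for (int i = 0; i < SLOTS; ++i)            // the 8-way SIMD compare in the real code
            if (b.keys[i] == k) return true;
        return false;
    }

    bool contains(uint64_t k) const {              // worst case: two bucket probes
        return in_bucket(buckets[h1(k)], k) || in_bucket(buckets[h2(k)], k);
    }

    // Toy insert: fill a free slot in either candidate bucket. Real cuckoo
    // insertion "kicks" a resident key to its alternate bucket on overflow,
    // which is what makes ~99% load reachable; that part is omitted here.
    bool insert(uint64_t k) {
        for (std::size_t h : {h1(k), h2(k)})
            for (int i = 0; i < SLOTS; ++i)
                if (buckets[h].keys[i] == 0) { buckets[h].keys[i] = k; return true; }
        return false;
    }
};
```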

~~~
mateiz
Yup, this is a great explanation. Cuckoo hashes actually do much better in
terms of load if you use buckets with multiple items, as we did here. The
classical one with one item per location can get "stuck" with unresolvable
cycles between elements at 50% load, but when you have these buckets, there
are many more ways to resolve collisions, which lets you greatly increase the
load. 99% is extreme but definitely doable. Writing a fuller post on this is
definitely a good idea -- we may do it in the future.

------
mcguire
I missed the original research; what guarantees are made by the learned
indices? Correctness?

~~~
eggie5
Here's some info about the original research they are referencing:
[http://www.eggie5.com/127-paper-review-the-case-for-learned-...](http://www.eggie5.com/127-paper-review-the-case-for-learned-index-structures)

