
Use perfect hashing, instead of binary search, for keyword lookup - boshomi
https://www.postgresql.org/message-id/flat/E1ghOVt-0007os-2V%40gemulon.postgresql.org
======
nly
Some guy called Ilan Schnell wrote a Python code generator which uses the very
same algorithm, and comes with language templates for Python, C and C++ (and
is easy to expand to others), and a dot file generator so you can visualize
the graph:

[http://ilan.schnell-web.net/prog/perfect-hash/](http://ilan.schnell-web.net/prog/perfect-hash/)

He also has a superb illustrated explanation of the algorithm, a unique
property of which is that you get to choose the hash values:

[http://ilan.schnell-web.net/prog/perfect-hash/algo.html](http://ilan.schnell-web.net/prog/perfect-hash/algo.html)

I've been using it for years.

~~~
onalark
Ilan Schnell is not "some guy". He's the original primary author of the
Anaconda distribution. One of the main reasons that so many data scientists
use Python.

~~~
fao_
Well, he's just some guy, you know.

~~~
eesmith
Gag Halfrunt! How's the brain-care biz doing these days?
([https://en.wikipedia.org/wiki/Zaphod_Beeblebrox](https://en.wikipedia.org/wiki/Zaphod_Beeblebrox))

------
amluto
Binary search is a pretty bad algorithm. Sure, it runs in O(log n) time and
it's more or less the best you can do to solve the general problem of finding
something in a sorted list, but that does _not_ mean it's a good way to create
an immutable list of things that can be searched using comparisons.

In particular, binary search has awful cache locality. Until you get very
close to your target, you are essentially hitting a random cache line (and
maybe even a random _page_ for large lists) for every comparison. This is bad.
A B-tree-like structure that is never modified is very easy to make, and it
will be much faster.

This came up with the ORC unwinder in Linux. We got something like a 4x
speedup by changing from binary search to something a bit more clever to find
unwind entries in the table.

~~~
Terr_
> In particular, binary search has awful cache locality.

That makes me think of an idea which I'm certain can't be original, but I
can't think of the jargon for what it's called.

Suppose you have a fixed-length sorted array, and its binary search can be
visualized as a tree. Traverse that tree in level-order and pop it into a
helper-array, arranged much like a heap: the midpoint comes first, and the
left and right children proceed from there at offsets that you can predict.

So the original array [A,B,C,D,E,F,G] has a data-structure for searching of
[(D,3), (B,1), (F,5), (A,0), (C,2), (E,4), (G,6)]

While the high-indexed items will get spread out, the most common and
sequential activity occurs all together near the lower indexes, minimizing
large jumps. Since each entry has the original index stored, you can always
switch back to the initial array when a different strategy (e.g. linear scan)
becomes preferable.
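
A minimal C sketch of that layout, under stated assumptions (illustrative
names, a toy integer array, and only the values stored rather than the
value/original-index pairs described above): build the level-order helper
array from the sorted input, then search it with implicit-heap indexing:

    #include <stdio.h>

    #define N 7

    /* Fill the level-order ("Eytzinger") array: slot k of the implicit tree
     * (1-based; children at 2k and 2k+1) receives the next element of the
     * sorted input when the tree is visited in-order. */
    static void build(const int *sorted, int *eyt, int *pos, int k)
    {
        if (k > N)
            return;
        build(sorted, eyt, pos, 2 * k);
        eyt[k] = sorted[(*pos)++];
        build(sorted, eyt, pos, 2 * k + 1);
    }

    /* Search by descending the implicit tree; the first few probes all touch
     * the low-indexed, cache-resident part of the array. */
    static int search(const int *eyt, int key)
    {
        int k = 1;
        while (k <= N)
        {
            if (eyt[k] == key)
                return k;               /* slot in the helper array */
            k = 2 * k + (eyt[k] < key); /* left child, or right if key is larger */
        }
        return -1;                      /* not found */
    }

    int main(void)
    {
        int sorted[N] = {10, 20, 30, 40, 50, 60, 70};
        int eyt[N + 1];                 /* slot 0 unused */
        int pos = 0;

        build(sorted, eyt, &pos, 1);
        printf("60 found at helper slot %d\n", search(eyt, 60));
        return 0;
    }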

~~~
macdice
Take a look at "Array Layouts for Comparison-Based Searching". TL;DR: there
are schemes like this called "Eytzinger" and "van Emde Boas".

[https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf](https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf)

The topic of arranging keys like this within index pages (or maybe elsewhere)
has come up on the pgsql-hackers mailing list before:

[https://www.postgresql.org/message-id/flat/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru](https://www.postgresql.org/message-id/flat/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru)

------
symisc_devel
SQLite already implemented such a mechanism[0], which determines whether or
not a given identifier is really an SQL keyword. The result is a single static
hash table (a C array of chars) with O(1) lookup. This is the fastest known
implementation for keyword lookup.

[0]:
[https://sqlite.org/src/artifact/1f7f2ac1d9f262c0](https://sqlite.org/src/artifact/1f7f2ac1d9f262c0)

~~~
anarazel
I'm not clear as to why that'd be faster than what we/PG has done. On a quick
skim that allows for conflicts in the hashtable, which our new implementation
doesn't? Ours is also a C array (albeit of offsets into one blob of strings
rather than the strings directly; the higher data density achievable due to
that was worth it).
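
A rough sketch of that offsets-into-a-blob layout (hypothetical names; not
the actual PostgreSQL structures): the keyword strings live in one
NUL-separated blob, and the table stores small offsets into it rather than
pointers:

    /* All keywords concatenated into one blob, each NUL-terminated. */
    static const char kw_strings[] =
        "as\0"
        "from\0"
        "insert\0"
        "select\0"
        "where\0";

    /* Per-keyword offsets into the blob: denser than an array of pointers,
     * and the whole table is position-independent. */
    static const unsigned short kw_offsets[] = { 0, 3, 8, 15, 22 };

    static const char *keyword_name(int idx)
    {
        return kw_strings + kw_offsets[idx];
    }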

------
koolba
So is the idea here to generate a perfect hash table for the known set of SQL
keywords in the parser rather than doing a binary search on the fly?

~~~
anarazel
Correct.

------
chubot
FYI Postgres appears to have over 400 keywords, shown here:

[https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...](https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/parser/kwlist.h;h=adeb834ce82422940cb89798ee18457c17c10853;)

The way I've seen most languages do it is something like this:

    
    
        switch (c[0]) {
          case 'c':
            if (strcmp(c, "case") == 0) return CASE;
            if (strcmp(c, "continue") == 0) return CONTINUE;
            ...
            break;
    
          case 'f':
            if (strcmp(c, "for") == 0) return FOR;
            if (strcmp(c, "friend") == 0) return FRIEND;
            ...
            break;
        }
    

I suspect that's good enough if you have a typical keyword list of say 20-100.

I just looked at Clang and it appears to build the list at runtime, using a
regular hash table. I think this is because it has to handle C, C++, and
Objective C, and they have different sets of keywords.

It seems like they could change it to use something like perfect hashing, and
if that's even a 1% win, it would be worth it for the extra build complexity,
since C and C++ compilation take up so many CPU cycles in the world.

See lib/Basic/IdentifierTable.cpp.

Although another problem is that the list of keywords depends on whether
you're compiling C89, C99, C++98, C++11, etc. But that is a finite list, so it
seems like it could still be done with code generation ahead of time, like
Postgres does.

In Oil [1] I use re2c [2] to recognize keywords, along with the rest of the
complex lexing process, and it works well. It builds a DFA expressed in C
code. The slower parts are elsewhere :)

EDIT: In theory, the DFA seems better than perfect hashing because you can
bail on the second character if you get "xzoo" and no keyword starts with
"xz". I think with perfect hashing you still have to look at every byte.

The switch statement above also avoids looking at every byte in certain cases.

[1] [http://www.oilshell.org/](http://www.oilshell.org/)

[2] [http://re2c.org/](http://re2c.org/)

~~~
caf
When parsing C, typedef names essentially introduce new keywords so the table
is constantly changing.

~~~
chubot
Yes this is a good point! That's probably why it's dynamic in Clang and not
done ahead of time.

FWIW I mention the C Lexer Hack here:

 _The Oil Lexer: Introduction and Recap_

[http://www.oilshell.org/blog/2017/12/15.html#toc_5](http://www.oilshell.org/blog/2017/12/15.html#toc_5)

~~~
chubot
Oops, now I agree with the people below: keywords and type names / identifiers
are different :-/

------
Leszek
We recently started using perfect hashing in V8's parser, also with measurable
performance improvement (even over what we had previously, a first character
jump table followed by linear search). It's always nice to apply that computer
science education to a real world problem.

~~~
miclill
Could you provide a link to the implementation, please?

~~~
Leszek
Just a wrapper around gperf:

[https://cs.chromium.org/chromium/src/v8/src/parsing/keywords-gen.h](https://cs.chromium.org/chromium/src/v8/src/parsing/keywords-gen.h)

[https://cs.chromium.org/chromium/src/v8/tools/gen-keywords-gen-h.p](https://cs.chromium.org/chromium/src/v8/tools/gen-keywords-gen-h.p)

------
sjroot
From the post:

"The time savings is indeed significant: preliminary testing suggests that the
total time for raw [SQL] parsing (flex + bison phases) drops by ~20%."

Very impressive work!

~~~
amelius
Impressive?

They replaced one textbook algorithm with another. It turns out to be faster,
but it could just as well be slower for certain inputs (which might be
common). Also, when they increase the number of keywords in the future, they
might have to reconsider the implementation.

~~~
molyss
I think it's impressive, but maybe not for the same reasons as the OP: that
means that keyword lookup accounted for at least 20% of total SQL parsing
time. That's a surprise to me!

~~~
anarazel
Note it's the "raw" parsing time. If you include the semantic parse-analysis
phase, i.e. looking up table names etc, it's a much smaller portion.

------
tachang
Reading the discussion thread is fantastic. The postgres community culture
definitely feels a lot different than some of the other open source projects
I've seen.

------
canadev
Can someone put this in lay terms for a casual postgresql user?

~~~
chrismorgan
Parsing SQL is made faster.

Imagine a language with five keywords, for simplicity.

    
    
      keywords = ["AS", "FROM", "INSERT", "SELECT", "WHERE"];
      tokens = [As, From, Insert, Select, Where];
    

Now you have a token that you need to interpret:

    
    
      let token_raw = "FROM";
      let index = keywords.binary_search(token_raw);
      let token = tokens[index];
    

As a binary search, that’s O(log N) string comparisons. In this case it may
try comparing with INSERT, then AS, then FROM before deciding the index is 1,
and with more keywords you have more comparisons.

With perfect hashing, you devise a single hash function that hashes all
possible values uniquely. Now we have something like this:

    
    
      tokens = {914576: As, 73932: From, 5791456: Insert, 21596: Select, 86548: Where};
    

… with all the numbers calculated by hashing the strings at compile time.
Then, instead of O(log N) string comparisons, you only need one hash and one
hash table lookup:

    
    
      let token_raw = "FROM";
      let token = tokens[hash(token_raw)];
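
A rough C sketch of the same idea (the hash function and table below are toy
values picked by hand for just these five keywords, not what PostgreSQL's
generator emits). Note the single final strcmp, which rejects non-keywords
that happen to land in an occupied slot (a point that comes up further down
the thread):

    #include <string.h>

    enum token { T_NONE, T_AS, T_FROM, T_INSERT, T_SELECT, T_WHERE };

    /* Toy "perfect" hash: (first character * length) mod 13 happens to be
     * collision-free for exactly this five-keyword set. */
    static int kw_hash(const char *s, size_t len)
    {
        return (int) (((size_t) (unsigned char) s[0] * len) % 13);
    }

    static const struct { const char *kw; enum token tok; } kw_table[13] = {
        {"AS", T_AS},                          /* slot 0 */
        {0, T_NONE}, {0, T_NONE}, {0, T_NONE},
        {"SELECT", T_SELECT},                  /* slot 4 */
        {0, T_NONE},
        {"WHERE", T_WHERE},                    /* slot 6 */
        {"FROM", T_FROM},                      /* slot 7 */
        {0, T_NONE},
        {"INSERT", T_INSERT},                  /* slot 9 */
        {0, T_NONE}, {0, T_NONE}, {0, T_NONE},
    };

    static enum token keyword_lookup(const char *s)
    {
        int h = kw_hash(s, strlen(s));

        /* One hash, one slot, one strcmp; the strcmp rejects non-keywords
         * that happen to hash into an occupied slot. */
        if (kw_table[h].kw != NULL && strcmp(kw_table[h].kw, s) == 0)
            return kw_table[h].tok;
        return T_NONE;
    }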

~~~
jules
Why is this necessary at all? The lexer is recognising tokens with a finite
state machine, so it already knows which keyword it's got.

~~~
gizmo
The lexer knows it has a token, but it still has to match the token identifier
to the corresponding keyword. Whether this happens at the lexing or parsing
stage doesn't matter much, it's still expensive when strcmp is called a bunch
of times for every keyword token.

~~~
jules
Lexers don't do strcmps, they use a finite state machine to determine tokens.
For instance if we have the tokens ALTER, AND, ANY, SELECT, FROM it will
compile into a state machine like this:

    
    
      switch(str[i++]){
        case 'A': 
          switch(str[i++]){
            case 'L': ...
            case 'N': 
              switch(str[i++]){
                case 'D': 
                  found token!
                case 'Y':
                  found token!
                default:
                  syntax error!
              }
           ...
          }
        case 'S': ...
        case 'F': ...
        ...
      }
    

Lexers already do this, and in each place where a token is found, it can know
which token it was based on the DFA state that it's in.

~~~
rurban
That's the advantage of the state machine, but with such a table the lengths
are already known at compile time, and therefore the final check will be
optimized to use a word-wise aligned memcmp, which beats the branchy lexer
code by miles.

~~~
jules
I don't understand what you mean, can you elaborate? Are you talking about
perfect hashing? Or are you talking about doing multiple strcmps? Or about
doing the state machine until you reach a unique alternative, and match the
remainder using strcmp?

What I was trying to say is that the lexer is _already_ doing a state machine,
and the keyword can be determined from the state of the state machine, rather
than re-determining it from the token substring itself.

------
innagadadavida
The optimal algorithm is the CHD algorithm; the function can use as little as
1.6 bits per key:
[http://cmph.sourceforge.net/papers/esa09.pdf](http://cmph.sourceforge.net/papers/esa09.pdf)
There are implementations in Python (search "python chd perfect hashing") and
in Go: [https://github.com/alecthomas/mph](https://github.com/alecthomas/mph),
in addition to the author's C code.

~~~
alnsn
Space-optimal isn’t necessarily the fastest.

Note that the PostgreSQL code uses a 2-graph instead of the more compact
3-graph version. The latter is noticeably more compact: a factor of 1.25 vs
2.1.

------
Tostino
I love seeing all the work and optimizations that get done on Postgres.

Very interesting.

------
jnwatson
Isn’t this just textbook perfect hashing? This is literally a primary use case
for it.

~~~
zzzcpan
Textbook here would be a goto-based DFA state machine (Ragel can generate
one).

~~~
PhilWright
If that would be faster, why not do that then?

~~~
wahern
Because they're using Flex & Bison, not Ragel. The latter is specialized for
and excels at defining and applying finite state machines. But while the
design of Ragel makes it relatively trivial to build ASTs during the parse,
you still have to do it manually. Flex & Bison automate much more of the
process of transforming an input stream to a tree. Dropping them and moving to
Ragel would be a serious refactor.

FWIW, Flex also builds DFAs. It's just that the design of Ragel is so elegant
that Ragel DFA machines are more flexible, easier to compose (both in terms of
combining abstract machines and in terms of language-level integration), and
yet still incomparably faster. While a straight DFA match should in principle
be at least as efficient as a perfect hash[1], this is really only realized
with something like Ragel or Intel Hyperscan. As awesome as RE2 is, the
performance isn't in the same league as those two; ditto for JIT'd PCREs and
re2c -- neither is comparable to Ragel or Hyperscan.

[1] At least until the input set grows into the thousands or millions of
items -- at some point the size of the generated DFA grows faster than and
overtakes the size of the generated perfect hash function.

~~~
goldenkey
Wouldn't a DFA be faster than perfect hashing here if only because non-
keywords can collide with keyword hash values, so a strcmp is always going to
be necessary regardless of the actual keyword differentiation?

I tried looking at the code to see if they were excluding non-keywords at some
other stage - because then this would seem redundant?

~~~
rurban
No, too branchy and byte-wise only. This PH is mostly word-wise, and the
keywords are longer, ~4-5.

~~~
goldenkey
Still, even the best strcmp is going to involve some byte checking unless the
string is exactly divisible into large words.

Wouldn't the efficiency between the two methods mainly depend on the relative
odds of non-keywords occurring?

------
boshomi
see also: Joerg Sonnenberger: Perfecting hashing in NetBSD (P3B, AsiaBSDCon
2013) [1]

[1] [https://www.youtube.com/watch?v=87pUZYERkNQ](https://www.youtube.com/watch?v=87pUZYERkNQ) (youtube)

~~~
robustpastoral
Thanks

------
woadwarrior01
That reminds me: some years ago, I’d created a pull request for Redis to use
perfect hashing with gperf to parse commands. IIRC, the author turned it down
because he was wary of adding the build-time dependency on gperf.

~~~
rurban
That's awkward, because the gperf dependency is only needed every 2 years or
so, when a new keyword is added. The generated C code needs to be added to
git, and only needs a regen if it changes. I do that all the time via gperf or
via some Perl-generated perfect hash tables, e.g. for static keywords, config
tables, Unicode tables, ... The memory win is always dramatic, typically like
20x better. I also use some standard gperf-fixup scripts for easier inclusion
into my code.

------
senderista
If your keyword list is small, you can just tweak a Pearson hash function
(i.e. random permutation) until you get the desired result:
[http://burtleburtle.net/bob/hash/perfect.html](http://burtleburtle.net/bob/hash/perfect.html)
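
A minimal C sketch of the idea (the seed and shuffle below are illustrative;
in practice you would keep regenerating the permutation offline until every
keyword in your small set hashes to a distinct value):

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t T[256];

    /* Fill T with a permutation of 0..255 via a fixed-seed Fisher-Yates
     * shuffle; offline, you would try seeds until every keyword in the set
     * hashes to a distinct value. */
    static void init_table(uint32_t seed)
    {
        for (int i = 0; i < 256; i++)
            T[i] = (uint8_t) i;
        for (int i = 255; i > 0; i--)
        {
            seed = seed * 1103515245u + 12345u;        /* simple LCG */
            int j = (int) (seed % (uint32_t) (i + 1));
            uint8_t tmp = T[i]; T[i] = T[j]; T[j] = tmp;
        }
    }

    /* Classic Pearson hash: one table lookup per input byte. */
    static uint8_t pearson_hash(const char *key)
    {
        uint8_t h = 0;
        for (size_t i = 0; key[i] != '\0'; i++)
            h = T[h ^ (uint8_t) key[i]];
        return h;
    }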

~~~
rurban
Yes, you can, but this approach is better. Pearson would beat CHM on very
small CPUs (8-16 bit), and it is not guaranteed to find a good permutation.
With CHM it's much easier.

The most important thing is to store the keywords for the false-positive
verification in one contiguous const char array.

------
w23j
Since there is a lot of discussion about alternative solutions to the keyword
lookup problem, I was wondering why tries haven't been mentioned yet.

Since the current implementation has to do a string compare after the hash,
wouldn't a trie be at least as fast? Am I missing something? Are there other
disadvantages?

~~~
Someone
For the trie, you still would have to do multiple character comparisons (on
level 1, on average about 13, I guess, given that I read there are 400
keywords and 26 characters in the alphabet).

Presumably, hashing the string is faster. I think that is partly because it
introduces fewer cache misses than walking a trie. Both this and a trie have
to walk the input string, but code using a trie jumps around the trie, while
this code only compares against the keyword, stored in contiguous bytes.

Another reason might be that trie-based code is worse for the branch
prediction logic.
~~~
w23j
Regarding your first point: one possible implementation of tries uses an array
of length "size of alphabet" to store the children of a node. So at each node
it's just an offset into an array, not multiple character comparisons.
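
For illustration, a hypothetical child-array trie node in C, assuming
uppercase A-Z keywords only:

    #include <stdlib.h>

    #define ALPHABET 26                      /* assume uppercase A-Z keywords */

    struct trie_node {
        int token;                           /* token id, or -1 if no keyword ends here */
        struct trie_node *child[ALPHABET];   /* indexed directly by character */
    };

    static int trie_lookup(const struct trie_node *root, const char *s)
    {
        const struct trie_node *n = root;

        for (; *s != '\0'; s++)
        {
            int c = *s - 'A';
            if (c < 0 || c >= ALPHABET || n->child[c] == NULL)
                return -1;                   /* not a keyword */
            n = n->child[c];                 /* one array index per level */
        }
        return n->token;
    }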

Of course your other points are valid. Thank you for your answer. So I gather
the differences are rather subtle. I was wondering if I was missing something
bigger.

~~~
Someone
Even assuming just a-z, that array will be mostly empty at lower levels, and
most nodes are at lower levels. That’s not good for avoiding cache misses.

There are tricks to counteract that (you don’t store pointers to lower levels,
for example, but offsets, because at typical sizes of a trie you can store
them in way fewer bits), and I haven’t timed any implementation for decades,
but it would surprise me if such an approach would improve performance on
lower levels.

You could use it at the top level (basically replacing the trie with separate
tries for each starting letter).

Also, the ‘search for a character in an array’ code can be sped up by using
vector instructions nowadays (I can’t find a link for the exact code, but
[http://0x80.pl/articles/simd-strfind.html](http://0x80.pl/articles/simd-strfind.html)
may give ideas). Whether you want to do that because of power usage concerns
is a completely different can of worms.

------
sheerun
Wouldn't decision trees be useful for constructing the most efficient perfect
hash function?

------
rurban
perl, nice!

~~~
ricardobeat
This looks scary:
[https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...](https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/tools/PerfectHash.pm;h=bd339e38a22db6975491652e3b638d50d12da301;hb=c64d0cd5ce24a344798534f1bc5827a9199b7a6e#l113)

Isn't there a better option than manually generating C code from a Perl
script?

~~~
alnsn
Not sure what scares you: the code generation or the choice of scripting
language?

If you want to avoid code generation, you can generate a .cdb (constant
database) file with the cdbw(3) C API and read it using the cdbr(3) C API.

cdbw(3): [http://netbsd.gw.com/cgi-bin/man-cgi?cdbw+3+NetBSD-current](http://netbsd.gw.com/cgi-bin/man-cgi?cdbw+3+NetBSD-current)

cdbr(3): [http://netbsd.gw.com/cgi-bin/man-cgi?cdbr+3+NetBSD-current](http://netbsd.gw.com/cgi-bin/man-cgi?cdbr+3+NetBSD-current)

Some examples of .cdb files on NetBSD:

/usr/share/misc/terminfo.cdb /var/db/services.cdb

If you don't like the particular choice of scripting language, you can adapt
my Lua script:

[https://gist.github.com/alnsn/68f599bc9358fcee122d6175392d77...](https://gist.github.com/alnsn/68f599bc9358fcee122d6175392d779f)

Building and peeling a graph should be identical, but the assign step is
different. I use the non-minimal, non-order-preserving BDZ scheme, while
Joerg's code is based on the original CHM scheme.

Given keys A, B, C and D, CHM always maps them to values 0, 1, 2 and 3. BDZ,
on the other hand, can generate values in the range [0, 2.1 * 4) and it
doesn't preserve order. For instance, one possible mapping in the BDZ scheme
is 7, 2, 5, 1.

Alex

~~~
rurban
Nope, cdb is worse than this generator. We've used cdb before. cdb is even
worse than gperf.

~~~
alnsn
It’s down to the quality of the implementation. I can give you some hints on
improving it if you’re interested.

A tiny JIT code generator inside cdb would definitely bring performance on par
with generated C code, but it’s probably overkill.

