Modern perfect hashing for strings (0x80.pl)
73 points by jandrewrogers on April 30, 2023 | hide | past | favorite | 14 comments


If the author sees this, there's a comment on your HTTP Verb parsing post that says:

"The major drawback of gperf is that it generates function "exists", while we need a "lookup"."

gperf allows for returning actual values. I use it in a whole lot of ways (enough that I've written code that imports data to then generate enums, execute gperf and import generated code, etc), including for generating lookups for HTTP Verbs.

Let me know if you want to know how that's done and I can give more details.


Sure, how do you do that?


Here's a sample gperf config file I just recently used/auto-generated (with the data labels modified):

  struct SomeData_Gperf_st {
    const char* strKey;
    SomeData_et enumVal;
  };
  %compare-lengths
  %compare-strncmp
  %enum
  %ignore-case
  %includes
  %readonly-tables
  %struct-type
  %define slot-name            strKey
  %define lookup-function-name SomeData_Find
  %%
  val 1, SOME_DATA__VAL_1
  val 2, SOME_DATA__VAL_2
The important option here is %struct-type. With it, gperf finds and parses the struct definition (given above in the configuration) and uses it to create the objects in the lookup table.

The lookup table is defined after the two percent signs, %%. You simply use comma-separated values, and the const char* to use as the lookup string is the field named by `%define slot-name`.

So, gperf will then generate a function, SomeData_Find(), which returns a pointer to a SomeData_Gperf_st object. Or NULL if it's not found.

There are a number of other options available, but what you're seeing above are the options I use most often.

And, finally, I execute a command like this to run it with that config:

  gperf SomeData.gperf.conf --output-file=SomeData_Find.c --multiple-iterations=100


Also, if you get stuck with it or want to configure it differently (eg: using a struct definition or values defined elsewhere), feel free to reach out to my username at gmail.

Just make sure to use a dot between my first and last name.


Nice shortcut if you know the input is correct, but seems like it might be prone to failure with carefully constructed input strings. If I can find an input token that produces the same hash as a normal keyword, a program using this might misbehave in "interesting" fashions.


I think all of these techniques check whether the input string is correct. For example, see here: https://github.com/WojciechMula/toys/blob/master/lookup-in-s...


If I were parsing the input, then parse time is the time to assign a token ID (I'd use an enum). Then you have a unique identifier and wouldn't need the fancy hash function. It's a nice function, and takes good advantage of the instruction set, but seems brittle or redundant.


When parsing, you probably use a hash table from tokens to token IDs, which is where you'd use the fancy hash function.


This is exactly what Postgres does in its SQL parser.

The key question here is whether your lexing code handles each keyword in a separate path or uses a single path for anything that looks like a keyword or identifier. Most handwritten lexing code does the latter, but if you are generating code with something like re2c or a parsing expression grammar, you may be doing the former.

In the one-path-per-keyword case, you just assign the right ID in the right code path - no need for a hash table or unique hash function. The downside is that this makes it more difficult to emit generic error messages like "blah is not a valid keyword" or to have context-sensitive keywords where you might fall back to interpreting a keyword as an identifier.


When I learned to use them, it was common advice for users of lexer generators to use a single-path and disambiguate later. If you use separate paths for each keyword, you get potentially very complex regular expressions that make a branchy mess and can't compete with [a-zA-Z_]+ and a hash table (even without perfect hashing).


How would you convert the string to the enum?


This looks cool; we'll try it out as an alternative to Zig's std.meta.stringToEnum.


Neat. But I suggest moving the summary to the top.

There is a reason summaries / tldr are at the top. They help your reader understand / digest all the details that follow.


It's better when you pad the const keys with zeros for 64-bit access. Otherwise you have to pray the compiler does it for you.




