In case the author sees this: there's a comment on your HTTP Verb parsing post that says:
"The major drawback of gperf is that it generates function "exists", while we need a "lookup"."
gperf allows for returning actual values. I use it in a whole lot of ways (enough that I've written code that imports data, generates enums from it, runs gperf, and imports the generated code, etc.), including for generating lookups for HTTP Verbs.
Let me know if you want to know how that's done and I can give more details.
The important option here is %struct-type. With it, gperf finds and parses the struct definition (declared above it in the configuration) and uses that struct to create the entries in the lookup table.
The lookup table entries are listed after the two percent signs (%%), one entry per line as comma-separated values. You then tell gperf which `const char*` slot holds the lookup string via `%define slot-name`.
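Put together, a gperf input file of the kind described above might look like this (a sketch, not the commenter's actual file — the struct, slot, and function names are illustrative, matching the `SomeData_Find()` example below; `%struct-type`, `%define slot-name`, and `%define lookup-function-name` are documented gperf declarations):

```
%{
#include <string.h>  /* the generated code calls str[n]cmp */
%}
struct SomeData_Gperf_st {
    const char *Verb;   /* the lookup string, named by slot-name */
    int Id;             /* any extra per-entry data you want back */
};
%struct-type
%define slot-name Verb
%define lookup-function-name SomeData_Find
%%
GET, 1
POST, 2
PUT, 3
DELETE, 4
%%
```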
So, gperf will then generate a function, SomeData_Find(), which returns a pointer to a SomeData_Gperf_st object, or NULL if the key isn't found.
There are a number of other options available, but what you're seeing above are the options I use most often.
And finally, I run gperf on that config with a command like this:
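Something along these lines (an illustrative invocation, not the commenter's exact command; the file names are made up, and `-t`/`--struct-type`, `-C`/`--readonly-tables`, and `--output-file` are documented gperf options):

```sh
gperf --struct-type --readonly-tables \
      --output-file=somedata_lookup.c somedata.gperf
```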
Also, if you get stuck with it or want to configure it differently (eg: using a struct definition or values defined elsewhere), feel free to reach out to my username at gmail.
Just make sure to use a dot between my first and last name.
Nice shortcut if you know the input is correct, but seems like it might be prone to failure with carefully constructed input strings. If I can find an input token that produces the same hash as a normal keyword, a program using this might misbehave in "interesting" fashions.
If I were parsing the input, then parse time is the time to assign a token ID (I'd use an enum). Then you have a unique identifier and don't need the fancy hash function. It's a nice function, and takes good advantage of the instruction set, but it seems brittle or redundant.
This is exactly what Postgres does in its SQL parser.
The key question here is whether your lexing code handles each keyword in a separate path or uses a single path for anything that looks like a keyword or identifier. Most handwritten lexing code does the latter, but if you are generating code with something like re2c or a parsing expression grammar, you may be doing the former.
In the one-path-per-keyword case, you just assign the right ID in the right code path - no need for a hash table or unique hash function. The downside is that this makes it more difficult to emit generic error messages like "blah is not a valid keyword" or to have context-sensitive keywords where you might fall back to interpreting a keyword as an identifier.
When I learned to use them, the common advice for users of lexer generators was to use a single path and disambiguate later. If you use a separate path for each keyword, you get potentially very complex regular expressions that make a branchy mess and can't compete with [a-zA-Z_]+ and a hash table (even without perfect hashing).