
When parsing, you probably use a hash table from tokens to token IDs, which is where you'd use the fancy hash function.
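A minimal sketch of that idea (hypothetical names and token IDs, not any real parser's tables): after the lexer matches an identifier-shaped token, one hash lookup classifies it as a keyword or a plain identifier.

```python
# Hypothetical keyword table mapping lexemes to token IDs.
KEYWORD_IDS = {
    "select": 1,
    "from": 2,
    "where": 3,
}
IDENT_ID = 100  # fallback token ID for non-keyword identifiers

def token_id(lexeme: str) -> int:
    # One hash lookup decides keyword vs. identifier.
    return KEYWORD_IDS.get(lexeme.lower(), IDENT_ID)
```

A perfect hash function would replace the general-purpose dict here, guaranteeing at most one probe per lookup.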


This is exactly what Postgres does in its SQL parser.

The key question here is whether your lexing code handles each keyword in a separate path or uses a single path for anything that looks like a keyword or identifier. Most handwritten lexing code does the latter, but if you are generating code with something like re2c or a parsing expression grammar, you may be doing the former.

In the one-path-per-keyword case, you just assign the right ID in the right code path - no need for a hash table or unique hash function. The downside is that this makes it more difficult to emit generic error messages like "blah is not a valid keyword" or to have context-sensitive keywords where you might fall back to interpreting a keyword as an identifier.
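The one-path-per-keyword case might look like this sketch (illustrative only, with made-up keywords and IDs): each keyword gets its own branch, and the token ID is assigned directly in that branch, with no table lookup.

```python
def match_keyword(src: str, pos: int):
    # Hypothetical one-path-per-keyword matcher: each branch both
    # recognizes a keyword and assigns its token ID in place.
    if src.startswith("select", pos):
        return 1, pos + len("select")
    if src.startswith("from", pos):
        return 2, pos + len("from")
    if src.startswith("where", pos):
        return 3, pos + len("where")
    return None  # not a keyword; the caller lexes an identifier instead
```

The `None` return illustrates the downside mentioned above: at this point you only know the input was not a keyword, so context-sensitive fallback to an identifier has to be handled elsewhere.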


When I learned to use them, it was common advice for users of lexer generators to use a single path and disambiguate later. If you use separate paths for each keyword, you get potentially very complex regular expressions that compile into a branchy mess and can't compete with [a-zA-Z_]+ plus a hash table (even without perfect hashing).
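The single-path-then-disambiguate approach can be sketched like this (a toy example, using a regex stand-in for the generated lexer and hypothetical token IDs): one pattern matches anything word-shaped, and a table lookup classifies it afterward.

```python
import re

WORD_RE = re.compile(r"[a-zA-Z_]+")
KEYWORDS = {"select": 1, "from": 2, "where": 3}  # hypothetical IDs
IDENT = 100

def lex_word(src: str, pos: int):
    # Single path: match anything identifier-shaped, then classify
    # with one table lookup (keyword ID, or IDENT as the fallback).
    m = WORD_RE.match(src, pos)
    if not m:
        return None
    word = m.group()
    return KEYWORDS.get(word.lower(), IDENT), word, m.end()
```

Because the keyword/identifier split happens after matching, falling back to treating a keyword as an identifier in certain contexts is a one-line change rather than a grammar rewrite.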



